News

UniProt release 2024_02

Published March 27, 2024

Headline

CMV infections: plants beaten at their own game

Plants have evolved very efficient processes to protect themselves from herbivore and pathogen attacks. The jasmonate signaling pathway is a nice example of such a process. Jasmonate is a lipid-based plant hormone, with a similar structure to animal prostaglandins. In the absence of jasmonate, JAZ (jasmonate ZIM domain) proteins bind to downstream transcription factors, hence silencing the expression of plant defense genes against herbivores. Following mechanical wounding or herbivory, jasmonate biosynthesis is rapidly activated. In the presence of jasmonate or its bioactive derivatives, JAZ proteins are degraded via the proteasome, freeing transcription factors for expression of genes needed for defense. Another indirect result of jasmonate signaling is the emission of volatile jasmonate-derived compounds. These can travel to nearby plants and elevate levels of transcripts related to wound response. This emission can further upregulate jasmonate biosynthesis and cell signaling, thereby inducing nearby plants to prime their defenses in case of herbivory.

Plant defense against viruses is also outstanding. The primary defense against, for instance, cucumber mosaic virus (CMV) is directed by antiviral RNA interference (RNAi) guided by the virus-derived small interfering RNAs. This strategy, known as post-transcriptional gene silencing, limits the accumulation of viral RNAs.

Obviously, viruses have developed counter-defense strategies. CMV encodes a viral suppressor of host RNAi, the 2b protein, which acts at two levels. It directly interacts with both the RNA and protein components of the plant RNA silencing machinery. By binding to double-stranded RNAs, it inhibits the production of viral siRNAs. By interacting with the argonaute protein, it prevents argonaute slicer activity. Once the host defense is neutralized, the virus can go on replicating, but that is not the end of the story. After replication, the virus needs to propagate to other plants, and to do so, it has to attract insect vectors, in the case of CMV, aphids.

At this stage, CMV again uses the 2b protein. This time, the 2b protein acts at the level of the jasmonate signaling pathway, inhibiting its activation. The 2b protein binds JAZ proteins and prevents their degradation, which would otherwise turn on the jasmonate signaling pathway. Through a yet unknown mechanism, the repression of the jasmonate signaling pathway leads to increased odor-dependent aphid attraction. Once the insects settle down on the infected plant, they suck sap and concomitantly pick up viruses to transmit them later on to other plants.

The astonishingly multifunctional CMV 2b protein has been manually annotated in the Swiss-Prot section of UniProtKB and is now publicly available.

UniProtKB news

Changes to the controlled vocabulary of human diseases

New diseases:

Modified diseases:

Deleted diseases:

  • Glioma 9
  • Unilateral palmoplantar verrucous nevus

Changes to the controlled vocabulary for PTMs

New term for the feature key 'Lipidation' ('LIPID' in the flat file):

  • GPI-anchor amidated isoleucine

UniProt release 2024_01

Published January 24, 2024

Headline

Vitamin K beyond coagulation

The involvement of vitamin K in blood clotting has been known for about a century now (see UniProt headline). Vitamin K is made by plants, and is found in highest amounts in green leafy vegetables, in which it is directly involved in photosynthesis. In animal cells, the vitamin undergoes the so-called canonical vitamin K cycle, in which it is first reduced into its active form by vitamin K epoxide reductase (VKOR). It then becomes the cosubstrate for vitamin K-dependent gamma-carboxylase (GGCX), which catalyzes gamma-carboxylation of glutamate residues in target proteins, including several blood factor proteins, leading to initiation of the blood coagulation cascade. During the reaction, vitamin K is oxidized, but it is restored back to active vitamin K by VKOR, being continuously recycled.

Anticoagulant drugs, like warfarin, inhibit VKOR, blocking vitamin K recycling and depleting active vitamin K stores. Prescription of warfarin is tricky, as overdosage can cause life-threatening bleeding. The standard clinical treatment for warfarin poisoning is the administration of a high dose of vitamin K, which is well-tolerated and does not exhibit any toxicity. During the treatment, blood clotting is enabled through the action of a warfarin-resistant reductase, whose identity was long elusive.

The solution of the mystery came from a study focused on a different topic, namely ferroptosis. Ferroptosis is an iron-dependent form of cell death, characterized by the accumulation of lipid-based reactive oxygen species that results in membrane damage. It is thought to be one of the most widespread and ancient forms of regulated cell death. Cells have evolved a number of highly efficient redox systems that counteract uncontrolled lipid peroxidation. One of them is mediated by the ferroptosis suppressor protein 1 (FSP1 or AIFM2) in the presence of ubiquinone (also known as coenzyme Q10, CoQ10): reduced CoQ10 traps lipid peroxyl radicals that mediate lipid peroxidation, whereas FSP1 catalyzes the regeneration of CoQ10 using NAD(P)H. Vitamin K shares structural similarities with CoQ10, hence testing FSP1's activity on vitamin K seemed a natural extension. The results are striking: reduced vitamin K produced by FSP1 actively prevents lipid peroxidation. The reaction is not sensitive to warfarin.

The discovery of a new function for vitamin K, as a guardian against ferroptosis, may also have an impact on clinical practice. As a consequence of vasoconstriction, thrombosis, or embolism, tissues can experience a restriction in blood supply, called ischemia, causing a shortage of oxygen that is needed for cellular metabolism. Paradoxically, restoration of blood flow to previously ischemic tissues can cause the exacerbation of cellular dysfunction and death through the induction of oxidative stress and ferroptosis, a phenomenon known as ischemia-reperfusion injury (IRI). In an IRI mouse model, Mishima et al. showed that vitamin K supplementation could prevent tissue-damaging inflammation in the liver or kidneys. Vitamin K could also prevent ferroptosis in neurons.

While waiting for future clinical trials, you can always visit our updated vertebrate FSP1/AIFM2 entries.

UniProtKB news

Cross-references to EMDB

Cross-references have been added to the EMDB database, the Electron Microscopy Data Bank.

EMDB is available at https://www.ebi.ac.uk/emdb.

The format of the explicit links is:

Resource abbreviation EMDB
Resource identifier Resource identifier

Example: Q7Z7F7

Show all entries having a cross-reference to EMDB.

Text format

Example: Q7Z7F7

DR   EMDB; EMDB-16898; -.
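
The text-format line above can be split into its parts with a few lines of Python. This is only a minimal sketch: the function name parse_dr_line is hypothetical, and it assumes the simple layout shown in the examples on this page (fields separated by "; ", a terminating period, and "-" for an empty optional field). Lines that carry an isoform identifier in square brackets after the period are not handled.

```python
def parse_dr_line(line):
    """Split a flat-file DR line into (database, identifier, extra_fields).

    Assumes the layout 'DR   <db>; <id>; <field>; ... .' as in the examples;
    this is an illustrative helper, not an official UniProt parser.
    """
    if not line.startswith("DR   "):
        raise ValueError("not a DR line: %r" % line)
    body = line[5:].strip()
    # Drop the period that terminates the line.
    if body.endswith("."):
        body = body[:-1]
    fields = [field.strip() for field in body.split("; ")]
    database, identifier, *extras = fields
    # A lone '-' marks an empty optional field.
    extras = [field for field in extras if field != "-"]
    return database, identifier, extras


print(parse_dr_line("DR   EMDB; EMDB-16898; -."))
```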

XML format

Example: Q7Z7F7

<dbReference type="EMDB" id="EMDB-16898"/>

RDF format

Example: Q7Z7F7

uniprot:Q7Z7F7
  rdfs:seeAlso <http://purl.uniprot.org/emdb/EMDB-16898> .
<http://purl.uniprot.org/emdb/EMDB-16898>
  rdf:type up:Resource ;
  up:database <http://purl.uniprot.org/database/EMDB> .

Cross-references to JaponicusDB

Cross-references have been added to the JaponicusDB database, the Schizosaccharomyces japonicus model organism database.

JaponicusDB is available at https://www.japonicusdb.org.

The format of the explicit links is:

Resource abbreviation JaponicusDB
Resource identifier Resource identifier
Optional information 1 Gene designation

Example: B6K366

Show all entries having a cross-reference to JaponicusDB.

Text format

Example: B6K366

DR   JaponicusDB; SJAG_03048; cdc2.

XML format

Example: B6K366

<dbReference type="JaponicusDB" id="SJAG_03048">
  <property type="gene designation" value="cdc2"/>
</dbReference>
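
As an illustration, the XML fragment above can be read with Python's standard xml.etree.ElementTree module. The function name read_db_reference is hypothetical, and the sketch works on the bare fragment exactly as shown here; complete entry files may declare an XML namespace, in which case the tag names would have to be qualified accordingly.

```python
import xml.etree.ElementTree as ET

FRAGMENT = """<dbReference type="JaponicusDB" id="SJAG_03048">
  <property type="gene designation" value="cdc2"/>
</dbReference>"""


def read_db_reference(xml_text):
    """Return the database, identifier and properties of one dbReference element."""
    element = ET.fromstring(xml_text)
    # Each <property> child carries one piece of optional information.
    properties = {
        prop.get("type"): prop.get("value")
        for prop in element.findall("property")
    }
    return {
        "database": element.get("type"),
        "id": element.get("id"),
        "properties": properties,
    }


print(read_db_reference(FRAGMENT))
```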

RDF format

Example: B6K366

uniprot:B6K366
  rdfs:seeAlso <http://purl.uniprot.org/japonicusdb/SJAG_03048> .
<http://purl.uniprot.org/japonicusdb/SJAG_03048>
  rdf:type up:Resource ;
  up:database <http://purl.uniprot.org/database/JaponicusDB> ;
  rdfs:comment "cdc2" .
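
The RDF examples on this page follow a common shape, which the sketch below reproduces for illustration. The function name xref_turtle is hypothetical, and the rule that the path segment is the lower-cased database name is inferred only from the examples shown here, so it may not hold for every resource. The comment text is inserted as-is, without any escaping.

```python
def xref_turtle(accession, database, identifier, comment=None):
    """Build the Turtle snippet for one cross-reference, following the examples."""
    # Assumption: the purl path segment is the database name in lower case.
    resource = "<http://purl.uniprot.org/%s/%s>" % (database.lower(), identifier)
    lines = [
        "uniprot:%s" % accession,
        "  rdfs:seeAlso %s ." % resource,
        resource,
        "  rdf:type up:Resource ;",
    ]
    database_line = "  up:database <http://purl.uniprot.org/database/%s>" % database
    if comment is None:
        lines.append(database_line + " .")
    else:
        # The optional information becomes an rdfs:comment literal.
        lines.append(database_line + " ;")
        lines.append('  rdfs:comment "%s" .' % comment)
    return "\n".join(lines)


print(xref_turtle("B6K366", "JaponicusDB", "SJAG_03048", comment="cdc2"))
```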

Changes to the controlled vocabulary of human diseases

New diseases:

Modified diseases:

Deleted disease:

  • Cap myopathy 2

Changes to the controlled vocabulary for PTMs

New terms for the feature key 'Modified residue' ('MOD_RES' in the flat file):

  • Aspartate methyl ester
  • N6-acetyl-N6-methyllysine

UniProt release 2023_05

Published November 8, 2023

Headline

Hush, little ribosome

When we think of translation, we often focus so intently on the protein we are interested in that we forget that, at the microscopic scale, the machinery involved in producing proteins is huge. Most cells contain millions of ribosomes, and egg cells contain orders of magnitude more. Translation is one of the most energy-costly processes. It has been estimated to require ~5 ATP per peptide bond or ~2,300 ATP per typical protein synthesized. That is why the whole process has to be tightly regulated not only at the level of single proteins, but also on a large scale. For instance, egg cells have to quietly await fertilization, to save energy and protect mRNAs and ribosomes, and to be ready for full-blown action in the first steps of embryogenesis. Translational repression has been extensively studied from the RNA standpoint. At the mRNA level, several mechanisms may account for translation dampening, often linked to 3'UTR sequences, including shortening the poly(A) tail, or interfering with the formation of the eIF4F complex. However, the contribution of ribosomes to the global translation repression process was not investigated until recently.

In zebrafish, as in other vertebrates, translation is repressed in eggs. It resumes about 3 hours after fertilization. At the ribosome level, this is reflected by the presence of almost exclusively monosomes before fertilization. The proportion of polysomes starts to increase from 3 to 6 hours after fertilization onwards. Four proteins have been shown to be associated with the "dormant" ribosomes (monosomes): death-associated protein-like 1 homolog (dap1b), intracellular hyaluronan-binding protein 4 (habp4), elongation factor 2b (eef2b) and eukaryotic translation initiation factor 5A-1 (eif5a), forming two modules, eef2b-habp4 and eif5a-dap1b, that bind distinct sites on the same ribosome.

In active ribosomes, eef2 mediates ribosomal translocation along the mRNA and transiently interacts with the tRNA-mRNA complex at the aminoacyl (A) site of the ribosome. In dormant ribosomes, habp4 binds to the mRNA-entry channel, sequesters eef2 at the A site and blocks the interaction sites of the tRNA-mRNA complex. eif5a normally promotes translation elongation and termination and binds between the exit (E) and peptidyl (P) sites of the ribosome. In dormant ribosomes, dap1b sits in the polypeptide exit tunnel (PET), where newly formed proteins normally emerge, and forms a kind of plug. Its C-terminus occupies the same position as the C-terminal residue of a nascent peptide chain and forms a hydrogen bond with the hypusine residue of eif5a, a residue that is essential for eif5a function.

Although habp4 binding to the mRNA-entry channel is incompatible with translation, the current hypothesis is that its main function is instead to stabilize ribosomes, and that it is dap1b that predominantly carries the translation repression function.

Can these observations be extended to other animals? Experiments carried out in Xenopus laevis gave similar results. Furthermore, in vitro translation assays in rabbit reticulocyte lysates showed that not only recombinant zebrafish dap1b, but also orthologous proteins in other organisms, including Xenopus, human (where the orthologous gene is called SERBP1), and even C. elegans, all repress translation.

If, contrary to ribosomes, you did not fall asleep while reading this headline, you can consult the UniProtKB/Swiss-Prot entries for habp4, dap1b/SERBP1, as well as eef2b/EEF2 and eif5a in our new release.

UniProtKB news

Cross-references to Pumba

Cross-references have been added to Pumba, a database of electrophoretic reference migration patterns.

Pumba is available at https://pumba.dcsr.unil.ch.

The format of the explicit links is:

Resource abbreviation Pumba
Resource identifier UniProtKB accession number

Example: P11274

Show all entries having a cross-reference to Pumba.

Text format

Example: P11274

DR   Pumba; P11274; -.

XML format

Example: P11274

<dbReference type="Pumba" id="P11274"/>

RDF format

Example: P11274

uniprot:P11274
  rdfs:seeAlso <http://purl.uniprot.org/pumba/P11274> .
<http://purl.uniprot.org/pumba/P11274>
  rdf:type up:Resource ;
  up:database <http://purl.uniprot.org/database/Pumba> .

Changes to the controlled vocabulary of human diseases

New diseases:

Changes to the controlled vocabulary for PTMs

New terms for the feature key 'Modified residue' ('MOD_RES' in the flat file):

  • L-cysteine coenzyme A disulfide
  • Sulfocysteine

Changes in subcellular location controlled vocabulary

New subcellular locations:

UniProt API news

Additional return fields for cross-references in TSV format

The format of our existing UniProtKB return fields for cross-references in a result table consists of the primary identifier of the external resource and, optionally, the UniProt isoform identifier in square brackets. When an entry has several cross-references to the same external resource, these cross-references are separated by a semi-colon.

We have now introduced additional UniProtKB return fields for all cross-references that have additional identifiers or other information. The format of an individual cross-reference is identical to the format used in the UniProtKB text format (with multiple fields being separated by semi-colons). In order to continue to separate multiple cross-references to the same external resource with a semi-colon, each individual cross-reference is enclosed in double quotes.

Example: A0JNW5 Ensembl cross-references

Existing return field xref_ensembl:

ENST00000279907.12 [A0JNW5-1];ENST00000356828.7 [A0JNW5-2];

New return field xref_ensembl_full:

"ENST00000279907.12; ENSP00000279907.7; ENSG00000111647.13. [A0JNW5-1]";"ENST00000356828.7; ENSP00000349285.3; ENSG00000111647.13. [A0JNW5-2]";

Example: A0JNW5 InterPro cross-references

Existing return field xref_interpro:

IPR026728;IPR026854;

New return field xref_interpro_full:

"IPR026728; BLTP3A/B.";"IPR026854; VPS13-like_N."

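The quoting rule described above makes the new fields straightforward to split. The following is a minimal sketch, assuming each cross-reference is enclosed in double quotes, ends with a period, and may be followed by an isoform identifier in square brackets, as in the two examples. The function name parse_full_xref_field and the returned dictionary layout are illustrative choices, not part of the UniProt API.

```python
import re

# One quoted cross-reference, e.g. "IPR026728; BLTP3A/B."
QUOTED = re.compile(r'"([^"]*)"')
# Body, terminating period, then an optional isoform in square brackets.
XREF = re.compile(r'^(.*)\.(?:\s*\[([^\]]+)\])?$')


def parse_full_xref_field(value):
    """Split a *_full return field into its individual cross-references."""
    records = []
    for item in QUOTED.findall(value):
        match = XREF.match(item)
        if match is None:
            raise ValueError("unexpected cross-reference layout: %r" % item)
        # Inside the quotes, fields are separated by semi-colons.
        fields = [field.strip() for field in match.group(1).split(";")]
        records.append({"fields": fields, "isoform": match.group(2)})
    return records


example = '"IPR026728; BLTP3A/B.";"IPR026854; VPS13-like_N."'
for record in parse_full_xref_field(example):
    print(record)
```

Extracting the quoted items first keeps the semi-colons inside a cross-reference from being mistaken for the separators between cross-references.
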
UniProt release 2023_04

Published September 13, 2023

Headline

Some like it hot

Peppers of the Capsicum genus in the Solanaceae family are native to the Americas and were among the first crops domesticated about 10,000 B.C.E. Interestingly, five different species were independently domesticated in separate regions of the Americas, suggesting a strong interest from humans, possibly because peppers have been used very early in culinary practices and in medicine. Nowadays, peppers are among the most important vegetable and spice crops in the world. Some 3,000 different pepper types are grown worldwide, that differ not only in the color of their fruits or their shape, but more importantly by their heat profile.

Most solanaceous plants protect themselves from herbivores by producing high levels of toxic alkaloids in their leaves. Peppers have evolved another, somewhat more palatable mechanism, one which also aids their propagation; they produce capsaicinoids in their fruit. While not toxic, capsaicinoids bind and activate transient receptor potential vanilloid 1 (or TRPV1) in the mouth, eliciting a heat sensation ferocious enough to discourage mammalian herbivores. Birds, which do not express TRPV1, do not feel the heat; they are attracted to the small red fruit on wild plants and eat them, thereby disseminating their seeds.

The first step of the biosynthetic pathway of capsaicinoids is the synthesis of vanillin from phenylalanine via the phenylpropanoid pathway. Vanillin can be converted to vanillylamine by the pAMT enzyme, also called VAMT. Vanillylamine then undergoes condensation with a branched fatty acid, catalyzed by PUN1. Any variations in the chemical structure of the capsaicinoids, mainly linked to the structure of the acyl moieties attached to the benzene ring of vanillylamine, change our perception not only of the heat intensity, but more generally the heat profile with its myriad of sensations (irritating, mellow-warming, sharp, developing quickly, receding rapidly, lasting hours, etc.).

However, vanillin can also be converted to vanillyl alcohol. The enzyme catalyzing this reaction has long been elusive. It has been recently identified by Sano et al. as cinnamyl alcohol dehydrogenase 1 (or CAD1), an enzyme well-known to plant biologists, because it is essential for the synthesis of lignin, a major cell wall component. Vanillyl alcohol is also a substrate for PUN1. In this case, the condensation with fatty acids produces capsinoids. These are structurally similar to capsaicinoids, except that they have an ester bond instead of an amide bond between the aromatic ring and the branched fatty acid. This small difference makes them mild. The ratio between capsaicinoids and capsinoids determines capsicum pungency. Low-pungent peppers contain a higher content of capsinoids with small amounts of capsaicinoids. Hence the conversion of vanillin into either vanillylamine or vanillyl alcohol is instrumental to determine spiciness.

Whether you like it hot or not, you may enjoy looking at our brand new UniProtKB/Swiss-Prot entries for capsicum enzymes involved in the biosynthetic pathway of capsaicinoids/capsinoids, pAMT, CAD1 and PUN1, which are publicly available as of this release.

UniProtKB news

Change of the cross-references to TAIR

We have modified our cross-references to the TAIR database, and now use the TAIR locus name as the resource identifier, and present the primary gene symbol in an additional field.

Text format

Example: Q8GW48

Previous format:

DR   TAIR; locus:1005716315; AT4G15802.

New format:

DR   TAIR; AT4G15802; HSBP.
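
Code that has to cope with files from both before and after this change could normalize the two layouts along the following lines. This is a sketch under the assumption that the previous format can be recognized by the "locus:" prefix of its identifier, as in the example; the function name tair_locus_name is hypothetical.

```python
def tair_locus_name(dr_line):
    """Return the TAIR locus name from a DR line in either format."""
    body = dr_line[len("DR"):].strip().rstrip(".")
    database, identifier, extra = [field.strip() for field in body.split(";")]
    if database != "TAIR":
        raise ValueError("not a TAIR cross-reference: %r" % dr_line)
    if identifier.startswith("locus:"):
        # Previous format: the locus name is in the additional field.
        return extra
    # New format: the locus name is the resource identifier itself.
    return identifier


print(tair_locus_name("DR   TAIR; locus:1005716315; AT4G15802."))
print(tair_locus_name("DR   TAIR; AT4G15802; HSBP."))
```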

XML format

Example: Q8GW48

Previous format:

<dbReference type="TAIR" id="locus:1005716315">
  <property type="gene designation" value="AT4G15802"/>
</dbReference>

New format:

<dbReference type="TAIR" id="AT4G15802">
  <property type="gene designation" value="HSBP"/>
</dbReference>

RDF format

Example: Q8GW48

Previous format:

uniprot:Q8GW48
  rdfs:seeAlso <http://purl.uniprot.org/tair/locus:1005716315> .
<http://purl.uniprot.org/tair/locus:1005716315>
  rdf:type up:Resource ;
  up:database <http://purl.uniprot.org/database/TAIR> ;
  rdfs:comment "AT4G15802" .

New format:

uniprot:Q8GW48
  rdfs:seeAlso <http://purl.uniprot.org/tair/AT4G15802> .
<http://purl.uniprot.org/tair/AT4G15802>
  rdf:type up:Resource ;
  up:database <http://purl.uniprot.org/database/TAIR> ;
  rdfs:comment "HSBP" .

Change of the cross-references to TIGRFAMs: replaced by NCBIfam

We have updated our cross-references to reflect the transfer of the TIGRFAMs database to the National Center for Biotechnology Information (NCBI), which now holds the Creative Commons license to this data and is responsible for maintaining and distributing this intellectual property. Following their transfer to NCBI, TIGRFAMs models are now part of a larger HMM collection, NCBIfam, a collection of HMMs built by NCBI curators. All UniProtKB entries cross-referenced to NCBIfam can be found here.

Change of evidence code for the ProtNLM method

The April 2023 release of the ECO ontology introduced several new terms, including ECO:0008006 - "deep learning method evidence used in automatic assertion". This ECO term accurately describes the ProtNLM method, and we have therefore replaced the formerly used ECO:0000256 term by ECO:0008006 for predictions generated by the ProtNLM method.

Changes to the controlled vocabulary of human diseases

New diseases:

Modified diseases:

Deleted diseases:

  • Arrhythmogenic right ventricular dysplasia, familial, 2
  • Asperger syndrome, X-linked, 1
  • Asperger syndrome, X-linked, 2
  • Atrioventricular septal defect 3
  • Cap myopathy 1
  • Myopathy, actin, congenital, with excess of thin myofilaments
  • Scapuloperoneal myopathy MYH7-related
  • Spheroid body myopathy

Changes to the controlled vocabulary for PTMs

New terms for the feature key 'Modified residue' ('MOD_RES' in the flat file):

  • N-methylaspartate
  • N-methylhistidine

UniProt website news

Following the SARS-CoV-2 outbreak in 2020, UniProt published all SARS-CoV-2, SARS and COVID-19 related human protein entries in a pre-release COVID-19 website portal and FTP directory for rapid dissemination of the knowledge about this virus and the COVID-19 disease. With this release, both services have been decommissioned.

UniProt release 2023_03

Published June 28, 2023

Headline

The fair price of an (ant) lunch

Ants live everywhere, on all continents except Antarctica, and they represent the largest number of individual organisms on our planet. Their abundance and nutritional value make them an interesting source of food for many different birds, spiders, mammals, lizards, toads, etc. Some animals feed opportunistically on solitary workers, but others attack colonies to feed on the brood (eggs, larvae, and pupae) and hence could represent a threat for ant species survival. This is the case for giant (15-30 mm body length) red bull ants (Myrmecia gulosa), whose colonies seem to be particularly appreciated by the Australian short-beaked echidna (Tachyglossus aculeatus), an ant-eating monotreme.

Like most ants, giant red bull ants are venomous and can cause extremely painful reactions that last for days. Their venom contains amphipathic peptides, called aculeatoxins, that are capable of activating mammalian sensory neurons, and also a unique cysteine-rich peptide called MIITX2-Mg1a (or Mg1a), which is unrelated to the aculeatoxins and whose function was unknown until recently. Mg1a is the major component of M. gulosa venom. It is predicted to adopt an epidermal growth factor (EGF)-like fold and sequence analysis revealed high sequence similarity to EGF-like peptides. Surprisingly the highest similarity scores were not obtained for insect EGF-like hormones, but for marsupial ones, suggesting that the natural target of Mg1a is not the endogenous EGF receptor (EGFR) but a vertebrate one, possibly that of a mammalian predator. When tested in human cells, Mg1a appeared to be a potent ligand of EGFR with an activity comparable to that of human EGF. Mg1a presence in the venom of M. gulosa may reflect its evolution as a defensive weapon, but then what effect could it have on a vertebrate predator? Shallow intraplantar injection of Mg1a in mice did not cause spontaneous nocifensive behavior, but 2 to 4 hours after injection mice developed hypersensitivity, showing a decreased threshold in paw-withdrawal in response to both mechanical and thermal stimuli. This effect lasted several days. The advantages for the ants of such a delayed effect are not obvious. Possibly the long-lasting hypersensitivity may reduce the duration of the attack, discourage future attacks, or enhance the algesic actions of subsequent exposure to other pain-causing venom peptides.

This is not the first example of biological mimicry, where one organism gains a selective advantage by mimicking an aspect of another organism. Cone snails mimic fish insulin in their venom gland. When hunting, they release venom insulin, causing hypoglycemia and consequently knockdown of their prey, facilitating their capture. However, there is more to this study than just a very elegant discovery of convergent evolution. It also shows the importance of EGFR signaling in mammalian pain, an aspect of EGF function that previously had not received much attention. EGF is mostly known for its role in wound healing. Following an injury, high concentrations of EGF are released at the wound site where they serve a critical role in wound healing, but they may also sensitize sensory neurons innervating the site of injury. The resulting hypersensitivity may ensure its protection during the healing process. In a very different setting, the involvement of EGF in pain has also been observed. EGFR inhibitors, such as erlotinib, are used in anti-cancer therapy to slow tumor growth. In this context, clinical trials reported that a significant proportion of patients experienced pain relief following treatment. We are still far from the production of new generation pain killers, but by unveiling a new signaling pathway in pain, this study paves the way for future drug developments.

As of this release, the newly annotated Mg1a toxin is publicly available in UniProtKB/Swiss-Prot.

UniProtKB news

Changes to the controlled vocabulary of human diseases

New diseases:

Modified diseases:

Changes to the controlled vocabulary for PTMs

New terms for the feature key 'Glycosylation' ('CARBOHYD' in the flat file):

  • O-alpha-linked (D-glycero-D-manno-heptose) serine
  • O-alpha-linked (glycero-D-manno-heptose) serine
  • O-alpha-linked (glycero-D-manno-heptose) threonine

New terms for the feature key 'Modified residue' ('MOD_RES' in the flat file):

  • N-methylvaline
  • N-methylthreonine

Changes in subcellular location controlled vocabulary

Modified subcellular locations:

UniProt release 2023_02

Published May 3, 2023

Headline

Levering the DNA

During the process of genetic recombination, two double-stranded DNA (dsDNA) molecules are separated into four strands, enabling exchange of segments of genetic information. These amazing four-stranded branched DNA structures are called Holliday junctions (HJs). HJs are also formed during DNA double-strand break repair and replication fork rescue, and their resolution is essential for all cellular life forms. The mobile HJ DNA heteroduplexes undergo ATP-dependent branch migration; think of them as a cruciform structure that moves along DNA. The bacterial RuvABC proteins, named after their property to confer cellular resistance to UV light, have been known for years to be instrumental in one of the pathways involved in the resolution of HJs. RuvC is a Holliday junction resolvase, an endonuclease, that acts at specific sequences. At these sites, RuvC makes single-stranded nicks across the cruciform at symmetrical sites, allowing resolution of the 4-branch DNA into 2 double-stranded DNA molecules. To find the appropriate cleavage site, the HJ undergoes branch migration, and RuvAB is essential for the migration.

Recent work by the group of Thomas Marlovits has shed new light on the mechanism by which branch migration occurs. They used time-resolved cryo-electron microscopy to solve distinct intermediates in the assembly and processing of the bacterial HJ. A pair of RuvA homotetramers sandwich and slightly flatten the center of the double-stranded cruciform HJ DNA, with the (potential) HJ resolution site in the middle of the complex. RuvB assembles into an asymmetric hexameric ring with a central pore surrounding DNA. Two RuvB homohexamers assemble on opposite strands of dsDNA as it exits the RuvA-HJ core. The C-terminal domain of 2 RuvA subunits binds 2 adjacent RuvB subunits. RuvB is a ring-shaped AAA+ ATPase motor and coordinated movements among the 4 DNA-disengaged subunits (called the converter) stimulate ATP hydrolysis, nucleotide exchange and ultimately branch migration by these subunits. The converter subunits are trapped on the DNA by their contact with RuvA, thus the energy released during ATP hydrolysis is converted into a lever motion, where RuvA anchors RuvB which then lifts and pulls 2 nucleotides of DNA through the center of the homohexamer. At the same time the complex rotates with the DNA, 60° at a time. When one ATP cycle is over the RuvA subunits disengage from RuvB and bind to the next subunits of the RuvB hexamer; the cycle recommences. Both RuvB homohexamers pull DNA, allowing a continual search for the RuvC recognition sequence. How RuvC scans the DNA is unknown, although it is assumed it displaces one of the RuvA tetramers to resolve the HJ DNA.

The data suggest that the majority of ring-shaped AAA+ motors may function as molecular levers, converting protein conformational changes into lift activity of their central pores, which subsequently facilitates substrate (in this case DNA) translocation. Elucidation of this mechanism will allow other motors to be re-examined and may lead to the design of compounds that target specific states of the motor.

RuvA, RuvB and RuvC have been updated and are available in UniProtKB/Swiss-Prot.

UniProtKB news

Changes in prokaryotic taxonomy

In release 2023_02, prokaryotic taxonomy will be changed in accordance with alterations that have occurred in the NCBI taxonomy database. These changes are due to inclusion of the rank phylum in the International Code of Nomenclature of Prokaryotes (ICNP), and are based on a recommendation made by Oren and Garrity, 2021. Examples of the changes include Firmicutes to Bacillota, Crenarchaeota to Thermoproteota, and Proteobacteria to Pseudomonadota. A full list of the phylum name changes is here.

Changes to the controlled vocabulary of human diseases

New diseases:

Modified diseases:

Deleted disease:

  • Cortical dysplasia, complex, with other brain malformations 8

UniProt release 2023_01

Published March 1, 2023

Headline

Killing me softly

In order to rapidly get rid of infectious microorganisms, our innate immune system relies, among other things, on specific intracellular receptors which recognize patterns that are frequently found in pathogens. One of these pattern recognition receptors (PRRs), NLRC4, recognizes many different bacterial pathogens, including Salmonella typhimurium, by sensing bacterial flagellin or structural components of the bacterial type III secretion system that are injected or leaked into the host cell. NLRC4 activation leads to the assembly of inflammasomes, large cytosolic multiprotein complexes, which results in the activation of caspase 1 (CASP1). CASP1 initiates inflammatory responses, including activation of the cytokines IL1B and IL18, and cleavage of gasdermin D (GSDMD). The latter forms pores in the plasma membrane, leading to inflammatory cell death, called pyroptosis, within minutes to hours. At the end of this cascade, the cell is dead. End of story? Well, not really. The activation of inflammasomes does not always lead to pyroptosis, and in certain settings, cells with active inflammasomes and gasdermin pores demonstrate a hypersecretory cytokine phenotype without undergoing cell death. How is this possible?

Yet another aspect of the process was unclear. CASP1 also activates 'apoptotic' caspases, such as CASP3 and CASP7, with CASP3 being the primary apoptotic executioner and CASP7 being considered as an inefficient CASP3 back-up. However, CASP7 pro-apoptotic activity is very weak, if it has any. What then is CASP7's function?

Our understanding of the process made a leap forward thanks to a recent article by Nozaki et al. Using intestinal epithelial cell organoid cultures and transgenic mice, the authors studied S. typhimurium infection. As expected in this setting, NLRC4 activation resulted in CASP1 activation. CASP1 in turn activated GSDMD and CASP7. As expected, GSDMD formed pores at the plasma membrane. The new observation is that CASP7 activated a lysosomal enzyme, sphingomyelin phosphodiesterase, also called ASM or SMPD1, which converts sphingomyelin into ceramide. Sphingomyelin is a major constituent of the plasma membrane in animal cells, allowing the membrane to remain largely flat. The formation of ceramide at the cell surface causes the membrane to naturally invaginate, inducing spontaneous clathrin-independent endocytosis that internalizes GSDMD pores, ensuring local repair of the plasma membrane. But how can cytosolic CASP7 activate a lysosomal protein? The current hypothesis is that GSDMD pores allow a massive influx of calcium, which causes lysosomal exocytosis and delivers pro-SMPD1 to the cell surface. CASP7 could pass from the cytosol through the pores to the extracellular space, where the proteins would meet at the best location for rapid membrane repair.

In the intestinal epithelium, the removal of dying cells proceeds through a unique process called cell extrusion, in order to preserve the integrity of barrier function. During extrusion, the cell fated to die emits the lipid sphingosine-1-phosphate, which causes contraction of an actomyosin ring in the neighboring cells. This contraction acts to squeeze the cell out apically while drawing together neighboring cells and preventing any gaps in the epithelial barrier. In this context, SMPD1-mediated repair of the plasma membrane would delay the rapid pore-driven cell lysis to allow the slower extrusion process to engage.

Interestingly, these findings can be expanded to other cell types and infections. For instance, SMPD1 also mediates the repair of pores made by natural killer cell perforin in hepatocytes infected by Chromobacterium violaceum.

The updated UniProtKB/Swiss-Prot entries for CASP7, SMPD1, and GSDMD are now publicly available.

UniProtKB news

Cross-references to GlyCosmos

Cross-references have been added to GlyCosmos, a portal integrating glycosciences with life sciences.

GlyCosmos is available at https://glycosmos.org/.

The format of the explicit links is:

Resource abbreviation GlyCosmos
Resource identifier UniProtKB accession number
Optional information 1 Glycosylation details

Example: P24387

Show all entries having a cross-reference to GlyCosmos.

Text format

Example: P24387

DR   GlyCosmos; P24387; 1 site, 6 glycans.
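
For readers who process the text format programmatically, the following is a minimal sketch of how the DR line above could be split into its fields. The function name parse_glycosmos_dr is hypothetical, and the assumption that the optional information always reads "N site(s), M glycan(s)" is based only on the single example shown here.

```python
import re

def parse_glycosmos_dr(line):
    """Parse a 'DR   GlyCosmos; ...' line into a dict, or return None."""
    if not line.startswith("DR   GlyCosmos;"):
        return None
    # Drop the 'DR   ' line code and the trailing period, then split the fields.
    fields = [f.strip() for f in line[5:].rstrip().rstrip(".").split(";")]
    resource, accession, details = fields[0], fields[1], fields[2]
    # Assumption: the optional information looks like "1 site, 6 glycans".
    m = re.match(r"(\d+) sites?, (\d+) glycans?", details)
    sites, glycans = (int(m.group(1)), int(m.group(2))) if m else (None, None)
    return {
        "resource": resource,
        "accession": accession,
        "details": details,
        "sites": sites,
        "glycans": glycans,
    }

print(parse_glycosmos_dr("DR   GlyCosmos; P24387; 1 site, 6 glycans."))
```

The raw details string is kept alongside the parsed counts, so nothing is lost if the wording differs in other entries.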

XML format

Example: P24387

<dbReference type="GlyCosmos" id="P24387">
  <property type="glycosylation" value="1 site, 6 glycans"/>
</dbReference>

RDF format

Example: P24387

uniprot:P24387
  rdfs:seeAlso <http://purl.uniprot.org/glycosmos/P24387> .
<http://purl.uniprot.org/glycosmos/P24387>
  rdf:type up:Resource ;
  up:database <http://purl.uniprot.org/database/GlyCosmos> ;
  rdfs:comment "1 site, 6 glycans" .

Change to the cross-references to Gene3D

In 2013, the Gene3D database stopped providing names for its signatures. To be consistent with the other family and domain databases, Gene3D cross-references are again presented with names for their signatures whenever possible.

Example: Q12933

Text format

Example: Q12933

Previous format:

DR   Gene3D; 1.20.5.170; -; 1.
DR   Gene3D; 3.30.40.10; -; 3.

New format:

DR   Gene3D; 1.20.5.170; -; 1.
DR   Gene3D; 3.30.40.10; Zinc/RING finger domain, C3HC4 (zinc finger); 3.
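
A parser that reads these lines has to accept both a dash and a real name in the third field. The sketch below shows one way this might be handled; the function name parse_gene3d_dr is hypothetical, and the code assumes that fields are separated by "; " and that the last field is the match count, as in the examples above.

```python
def parse_gene3d_dr(line):
    """Parse a 'DR   Gene3D; ...' line; the signature name may be '-'."""
    if not line.startswith("DR   Gene3D;"):
        return None
    # Remove the line code and the final period, then split on '; '.
    parts = line[5:].rstrip().rstrip(".").split("; ")
    # Everything between the identifier and the match count is the name.
    name = "; ".join(parts[2:-1])
    return {
        "id": parts[1],
        "name": None if name == "-" else name,
        "matches": int(parts[-1]),
    }

old_style = parse_gene3d_dr("DR   Gene3D; 1.20.5.170; -; 1.")
new_style = parse_gene3d_dr(
    "DR   Gene3D; 3.30.40.10; Zinc/RING finger domain, C3HC4 (zinc finger); 3."
)
print(old_style["name"], "|", new_style["name"])
```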

XML format

Example: Q12933

Previous format:

<dbReference type="Gene3D" id="1.20.5.170">
  <property type="match status" value="1"/>
</dbReference>
<dbReference type="Gene3D" id="3.30.40.10">
  <property type="match status" value="3"/>
</dbReference>

New format:

<dbReference type="Gene3D" id="1.20.5.170">
  <property type="match status" value="1"/>
</dbReference>
<dbReference type="Gene3D" id="3.30.40.10">
  <property type="entry name" value="Zinc/RING finger domain, C3HC4 (zinc finger)"/>
  <property type="match status" value="3"/>
</dbReference>
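
As an illustration of how a consumer might cope with the optional "entry name" property, here is a small sketch using Python's standard xml.etree.ElementTree module. The wrapping entry element and the function name gene3d_signatures are assumptions made for this example, and the fragment carries no XML namespace, whereas full UniProtKB XML files declare one.

```python
import xml.etree.ElementTree as ET

# The two dbReference elements from the new format, wrapped in a root
# element (an assumption) so that the fragment is well-formed.
FRAGMENT = """<entry>
<dbReference type="Gene3D" id="1.20.5.170">
  <property type="match status" value="1"/>
</dbReference>
<dbReference type="Gene3D" id="3.30.40.10">
  <property type="entry name" value="Zinc/RING finger domain, C3HC4 (zinc finger)"/>
  <property type="match status" value="3"/>
</dbReference>
</entry>"""

def gene3d_signatures(entry):
    """Map each Gene3D identifier to (name or None, match count)."""
    result = {}
    for ref in entry.findall("dbReference[@type='Gene3D']"):
        props = {p.get("type"): p.get("value") for p in ref.findall("property")}
        # 'entry name' is absent when Gene3D provides no name.
        result[ref.get("id")] = (props.get("entry name"), int(props["match status"]))
    return result

signatures = gene3d_signatures(ET.fromstring(FRAGMENT))
print(signatures)
```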

RDF format

Example: Q12933

Previous format:

uniprot:Q12933
 rdfs:seeAlso <http://purl.uniprot.org/gene3d/1.20.5.170> ,
              <http://purl.uniprot.org/gene3d/3.30.40.10> .
<http://purl.uniprot.org/gene3d/1.20.5.170> rdf:type up:Resource ;
  up:database <http://purl.uniprot.org/database/Gene3D> ;
  up:signatureSequenceMatch <http://purl.uniprot.org/isoforms/Q12933-1#Gene3D_1.20.5.170_match_1> .
<http://purl.uniprot.org/gene3d/3.30.40.10>  rdf:type up:Resource ;
  up:database <http://purl.uniprot.org/database/Gene3D> ;
  up:signatureSequenceMatch <http://purl.uniprot.org/isoforms/Q12933-1#Gene3D_3.30.40.10_match_1> ,
    <http://purl.uniprot.org/isoforms/Q12933-1#Gene3D_3.30.40.10_match_2> ,
    <http://purl.uniprot.org/isoforms/Q12933-1#Gene3D_3.30.40.10_match_3> .

New format:

uniprot:Q12933
 rdfs:seeAlso <http://purl.uniprot.org/gene3d/1.20.5.170> ,
              <http://purl.uniprot.org/gene3d/3.30.40.10> .
<http://purl.uniprot.org/gene3d/1.20.5.170> rdf:type up:Resource ;
  up:database <http://purl.uniprot.org/database/Gene3D> ;
  up:signatureSequenceMatch <http://purl.uniprot.org/isoforms/Q12933-1#Gene3D_1.20.5.170_match_1> .
<http://purl.uniprot.org/gene3d/3.30.40.10>  rdf:type up:Resource ;
  up:database <http://purl.uniprot.org/database/Gene3D> ;
  rdfs:comment "Zinc/RING finger domain, C3HC4 (zinc finger)" ;
  up:signatureSequenceMatch <http://purl.uniprot.org/isoforms/Q12933-1#Gene3D_3.30.40.10_match_1> ,
    <http://purl.uniprot.org/isoforms/Q12933-1#Gene3D_3.30.40.10_match_2> ,
    <http://purl.uniprot.org/isoforms/Q12933-1#Gene3D_3.30.40.10_match_3> .

Changes to the controlled vocabulary of human diseases

New diseases:

Modified diseases:

Deleted diseases:

  • Prolactin-secreting pituitary adenoma
  • Retinitis pigmentosa autosomal recessive

Changes to the controlled vocabulary for PTMs

New term for the feature key 'Lipidation' ('LIPID' in the flat file):

  • N6-decanoyllysine

UniParc news

Change of the UniParc files distribution

The UniParc database is distributed on the UniProt FTP site in FASTA and XML format as one gzip-compressed file for each format. The size of these files has grown over the years to more than 100 and 200 gigabytes, respectively, which makes them difficult to download. We therefore now split these files into sets of 128 smaller files each. The partitioned files are available in the following subdirectories:

https://ftp.uniprot.org/pub/databases/uniprot/current_release/uniparc/xml/all all sequences in XML format
https://ftp.uniprot.org/pub/databases/uniprot/current_release/uniparc/fasta/active all active sequences in FASTA format

For release 2023_01, we are providing each format as one file and partitioned into 128 smaller files. From release 2023_02 onwards, only the partitioned files will be provided.

UniProt release 2022_05

Published December 14, 2022

Headline

Plant adaptation to global warming through epigenetic memory

While we humans are organizing large conferences to find ways to limit global warming, plants elaborate sophisticated epigenetic strategies to adapt to it. Plant exposure to heat activates various genes and repetitive elements, that are normally silenced via transcriptional gene silencing, and releases post-transcriptional gene silencing by inhibiting siRNA biogenesis. Heat responses are usually transient and reset quickly to normal conditions, although there is some somatic stress memory with a duration of several days. However, certain plant responses can also be transmitted to the next unstressed generation.

In 2014, Migicovsky et al. reported that Arabidopsis thaliana plants stressed by extreme heat (50°C) show accelerated flowering and that this phenotype is also observed in the immediate unstressed progeny, where it is maternally transmitted. Recently certain molecular aspects of this phenomenon have been unraveled. By comparing gene expression in heat-stressed A. thaliana and in its unstressed second generation progeny, which both exhibit the same early flowering phenotype, Liu et al. observed that the SGS3 gene was down-regulated in both generations. SGS3 is essential for the biogenesis of trans-acting siRNAs. The resulting decrease in siRNAs leads to the up-regulation of HTT5, a protein required for early flowering. SGS3 down-regulation was mediated by the E3 ubiquitin-protein ligase SGIP1, which is up-regulated following heat stress and targets SGS3 for proteasomal degradation. Liu et al. explored further upstream to find the regulators of this cascade of events and found two genes whose expression was induced by heat: the heat shock transcription factor HSFA2 and the H3K27me3 demethylase REF6, which HSFA2 directly activates. REF6 in turn derepresses the HSFA2 gene, so that together they establish a heritable feedback loop. REF6 and HSFA2 activate SGIP1 and HTT5 transcription in concert. In conclusion, H3K27me3 demethylation and HSFA2 activation might be key events in the pathway of heat sensing in plants. The actual primary heat sensors are still elusive and the transmission of thermomemory remains a mystery. It should be noted, however, that this is not the first report of epigenetic memory transmission. In C. elegans, exposure to bisphenol A causes the derepression of an epigenomically silenced transgene in the germline for 5 generations. This effect was associated with a reduction of the repressive marks H3K9me3 and H3K27me3.

Accelerating reproductive development has a cost for A. thaliana. HTT5 is not only involved in early flowering, but it also attenuates disease resistance. In this context, evolution has chosen growth over defense.

The updated UniProtKB/Swiss-Prot entries for A. thaliana REF6, HSFA2, as well as their downstream targets SGIP1, SGS3 and HTT5 are available in the knowledgebase.

UniProtKB news

Cross-references to AGR

Cross-references have been added to the Alliance of Genome Resources (AGR).

AGR is available at https://alliancegenome.org/.

The format of the explicit links is:

Resource abbreviation AGR
Resource identifier Resource identifier

Example: Q20646

Show all entries having a cross-reference to AGR.

Text format

Example: Q20646

DR   AGR; WB:WBGene00000467; -.

XML format

Example: Q20646

<dbReference type="AGR" id="WB:WBGene00000467"/>

RDF format

Example: Q20646

uniprot:Q20646
  rdfs:seeAlso <http://purl.uniprot.org/agr/WB:WBGene00000467> .
<http://purl.uniprot.org/agr/WB:WBGene00000467>
  rdf:type up:Resource ;
  up:database <http://purl.uniprot.org/database/AGR> .
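
The RDF examples in these release notes (AGR here, and GlyCosmos and Gene3D in other releases) appear to follow a common URI pattern. The sketch below reproduces that pattern; it is inferred from those examples only, the function name xref_uris is hypothetical, and other databases may well use different path segments or require escaping of identifiers.

```python
PURL = "http://purl.uniprot.org"

def xref_uris(database, identifier):
    """Build the resource and database URIs for a cross-reference.

    Assumption: the resource path is the lower-cased database abbreviation,
    as seen in the AGR, GlyCosmos and Gene3D examples.
    """
    return {
        "resource": f"{PURL}/{database.lower()}/{identifier}",
        "database": f"{PURL}/database/{database}",
    }

agr = xref_uris("AGR", "WB:WBGene00000467")
print(agr["resource"])
print(agr["database"])
```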

Changes to the controlled vocabulary of human diseases

New diseases:

Modified diseases:

UniProt release 2022_04

Published October 12, 2022

Headline

The (phenolic) Rings of Power

When walking in the countryside, you may have noticed very small insects with very delicate white wings, looking a little like tiny butterflies, flitting around when a leaf or a stem is disturbed. In spite of their harmless appearance, whiteflies, including Bemisia tabaci, are a pest that not only causes damage by feeding on plant sap, but is also a vector for a multitude of plant viruses and causes major crop loss around the world. Whiteflies are able to attack over 600 plant species. Luckily many plants have developed efficient anti-herbivore chemicals for their defense. These include phenolic glycosides, which strongly affect growth, development, and behavior of insect herbivores. Phenolic glycosides are also toxic to the plants that produce them, but plants detoxify them by attaching a malonyl group to phenolic glucosides, a reaction catalyzed by malonyltransferases. Malonylated glycosides can then be safely stored in plant vacuoles. Incidentally this process is also used by plants for xenobiotic detoxification, and can lead to herbicide resistance, for instance.

In tomato leaves, some 290 different phenolic glycosides have been identified, several of which have been shown to be toxic to B. tabaci only when ingested at higher doses than those found in plant phloem. Where does this insect resistance to physiological levels of phenolic glycosides come from? Thorough examination of B. tabaci genes led to the discovery of BtPMaT1, a gene encoding an insect glucoside malonyltransferase. BtPMaT1 is widely expressed in B. tabaci at all stages of life, with highest levels in the adult gut, a perfect location for food detoxification. Feeding B. tabaci on transgenic tomatoes expressing BtPMaT1 gene fragments that efficiently silence BtPMaT1 leads to increased whitefly mortality and decreased malonylated flavonoid glycosides in honeydew, convincingly demonstrating that BtPMaT1 catalyzes malonylation of phenolic glycosides in vivo.

The B. tabaci BtPMaT1 gene has no homologue in other arthropods, not even in the related greenhouse whitefly, Trialeurodes vaporariorum, but it shows significant similarity to plant enzymes. Phylogenetic analysis shows that BtPMaT1 clusters within a group of plant BAHD acyltransferases. All evidence points to one conclusion: BtPMaT1 has been horizontally acquired from plants. The horizontal transfer must have taken place after the divergence from Trialeurodes (some 86 million years ago). Horizontal transfer is a well-known process in prokaryotes, but we are just beginning to appreciate its importance in the adaptive evolution of eukaryotes.

Happily, you don't have to wait 80 million years to find the newly created UniProtKB/Swiss-Prot BtPMaT1 entry which is now available in our knowledgebase.

UniProtKB news

Google protein name predictions

UniProt has collaborated with the groups of Max Bileschi and Lucy Colwell at Google Research to predict names for UniProtKB/TrEMBL proteins. The UniProt 2021_02 release data were used to train a model called ProtNLM based on the T5X framework. The model uses a shared vocabulary that encodes both protein sequences and their text descriptions (T5 methodology, Raffel et al. 2020). Free-text UniProt protein name(s) are produced as output. Expert biocurators manually evaluated a subset of model-predicted protein names chosen at random and informed model-building with stratified confidence scores. An automated verification tool also checked whether a predicted name occurs as a substring of the full UniProt entry for any protein belonging to the same UniRef50 2022_01 cluster.
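
The substring check described above can be pictured with a short sketch. This is not the tool used for ProtNLM; the function name, the case-insensitive comparison and the toy entry texts are all assumptions made for illustration.

```python
def name_is_supported(predicted_name, cluster_entry_texts):
    """Return True if the predicted name occurs as a substring of the text
    of at least one entry from the same cluster.

    Assumption: the comparison ignores case; the real tool may differ.
    """
    needle = predicted_name.casefold()
    return any(needle in text.casefold() for text in cluster_entry_texts)

# Toy entry texts standing in for the full entries of one UniRef50 cluster.
toy_cluster = [
    "DE   RecName: Full=ABC transporter permease;",
    "DE   SubName: Full=Uncharacterized protein;",
]
print(name_is_supported("ABC transporter permease", toy_cluster))
print(name_is_supported("Histidine kinase", toy_cluster))
```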

Starting from release 2022_04, these name predictions will be used as the recommended name for all TrEMBL entries whose name would otherwise be "Uncharacterized protein" (49,292,040 entries in this release). The source of these protein names is indicated in their evidence tags and can be used to retrieve the corresponding entries with this query.

Protein embeddings

Protein embeddings are a way to encode functional and structural properties of a protein, mostly from its sequence only, in a machine-friendly format (vector representation). Generating such embeddings is computationally expensive, but once computed they can be leveraged for different tasks, such as sequence similarity search, sequence clustering, and sequence classification.
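
To make the idea of embedding-based similarity search concrete, here is a minimal sketch that ranks proteins by cosine similarity between vectors. The three-dimensional toy vectors and the identifiers PROT_A to PROT_C are invented for illustration; real embeddings have many more dimensions, and this example makes no assumption about the format of the downloadable files.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def nearest(query, embeddings, k=3):
    """Return the k identifiers whose vectors are most similar to the query."""
    scored = [(cosine(query, vec), acc) for acc, vec in embeddings.items()]
    return [acc for _, acc in sorted(scored, reverse=True)[:k]]

# Hypothetical per-protein vectors.
toy_embeddings = {
    "PROT_A": [1.0, 0.0, 0.0],
    "PROT_B": [0.9, 0.1, 0.0],
    "PROT_C": [0.0, 1.0, 0.0],
}
print(nearest([1.0, 0.0, 0.0], toy_embeddings, k=2))
```

A linear scan like this is adequate for small sets; for proteome-scale searches an approximate nearest-neighbour index would usually be preferred.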

Starting from release 2022_04, UniProt is providing raw embeddings (per-protein and per-residue using the ProtT5 model) for UniProtKB/Swiss-Prot and the reference proteomes of the model organisms listed below:

  • Homo sapiens
  • Mus musculus
  • Rattus norvegicus
  • Arabidopsis thaliana
  • Caenorhabditis elegans
  • Escherichia coli
  • SARS-CoV-2

You can download the embeddings from our Downloads page.

Changes to the controlled vocabulary of human diseases

New diseases:

Modified diseases:

Changes to the controlled vocabulary for PTMs

New term for the feature key 'Modified residue' ('MOD_RES' in the flat file):

  • ADP-ribosylthreonine

UniProt release 2022_03

Published August 3, 2022

Headline

Not just for proteins: new targets for ADP-ribosylation

ADP-ribosylation is a modification which transfers ADP-ribose from NAD(+) onto substrates. It is involved in a number of crucial functions, ranging from stress responses elicited, for example, by DNA damage and viral infection, to intra- and extracellular signaling, chromatin and transcriptional regulation, protein biosynthesis, and cell death. ADP-ribosylation of proteins has been extensively studied, and numerous enzymes catalyzing this reaction have been identified in eukaryotes. Examples include the poly [ADP-ribose] polymerase PARP1, which is activated in response to DNA damage, and the poly [ADP-ribose] polymerase tankyrases, such as TNKS, which are involved in processes such as the Wnt signalling pathway, telomere length regulation and vesicle trafficking. A number of bacterial toxins are protein ADP-ribosyltransferases, such as diphtheria toxin and Pseudomonas aeruginosa exotoxin A, which ADP-ribosylate eukaryotic elongation factor 2, arresting protein synthesis, or SpvB from Salmonella dublin, which ADP-ribosylates host actin, preventing its polymerization and causing actin depolymerization, cytoskeleton destruction and cytotoxicity.

But ADP-ribosylation is not just for proteins. It has been known since at least 1999 that DNA can be modified. For over 20 years, this process was investigated by the group of Keiji Wakabayashi, who was working on pierisin, a cytotoxin found in late-stage larvae and pupae of the cabbage white butterfly Pieris rapae. Pierisin ADP-ribosylates double-stranded DNA (dsDNA) on the N2 group of deoxyguanosine. ADP-ribosylation of DNA is detected as DNA damage by the cell machinery and leads to apoptosis. This process is thought to protect the butterfly embryo against parasitic wasps. Later on it was shown that mammalian PARP1, the well-characterized protein ADP-ribosyltransferase mentioned above, can also modify DNA strand break termini.

Recently DarT, another DNA-targeted ADP-ribosyltransferase, has been discovered in Mycobacterium tuberculosis. DarT belongs to a novel toxin-antitoxin (TA) system. It recognizes and modifies a specific thymidine in single-stranded DNA (ssDNA) at the origin of chromosomal replication, which is perceived as severe DNA damage, leading to growth arrest and eventually cell death. This effect is reversed by the DarG antitoxin, which specifically removes ADP-ribosylated DNA adducts and functions as a noncanonical DNA repair factor. What is the benefit of the DarT-DarG TA system for M. tuberculosis? It is known that carefully controlled, slow and nonreplicating growth states are key for M. tuberculosis, resulting in persistent, potentially life-long infection and antibiotic tolerance. M. tuberculosis has over 80 TA modules, which are thought to contribute to persistence. DarTG is the most recent addition to this arsenal.

Finally, ADP-ribosylation can also occur on single nucleosides/nucleotides. The soil bacterium Streptomyces coelicolor secretes a guanine-specific ADP-ribosyltransferase, called ScARP. ScARP ADP-ribosylates guanosine and deoxyguanosine, as well as their phosphorylated derivatives GMP, dGMP, GDP, GTP, dGTP, and cyclic GMP. Extracellular ScARP probably finds its guanine substrates in surrounding cells. Many bacteria use guanosine derivatives in response to environmental changes, to regulate their motility, for instance. ScARP may dysregulate these signals by direct ADP-ribosylation of guanosine derivatives, or by synthesizing ADP-ribosylguanosine as a mimicking or competitively inhibiting molecule. The presence of ADP-ribosylguanosine in the nucleotide pool might also induce genomic instability and gene mutation in competitor bacteria living near S. coelicolor. Chemical/biological warfare at a microscopic level!

The non-protein targeted ADP-ribosyltransferases mentioned above have all been recently updated and are available in this release of UniProtKB. While annotating these enzymes, we created many new Rhea reactions, which also required the integration of new molecules at ChEBI. The interconnectedness of databases grows quickly these days!

UniProtKB news

Annotation of biologically relevant ligands in UniProtKB using ChEBI

UniProtKB provides descriptions of the nature and binding sites of biologically relevant ligands that are essential for protein function, such as activators, inhibitors, cofactors, and substrates. We have replaced the existing textual descriptions of these ligands with their equivalents from the ChEBI (Chemical Entities of Biological Interest) ontology to provide high quality computationally tractable annotation of biologically relevant ligands and their binding sites in proteins in UniProtKB. This enhanced dataset provides improved support for efforts to study and predict functionally relevant interactions between proteins and small molecule ligands.

The impact of this change on the UniProtKB data model is described in the following subsections.

Deprecation of 'Calcium binding', 'Metal binding' and 'Nucleotide binding' annotation types

Historically, UniProtKB has described a few classes of ligand binding sites with dedicated annotation types to make it easier to query them. This has been the case for 'Calcium binding', 'Metal binding' and 'Nucleotide binding'. With the switch to the ChEBI ontology for ligand classification this is no longer necessary, and we have deprecated these annotation types and converted the existing data to 'Binding site' annotations.

Structuring of 'Binding site' annotations

We have structured 'Binding site' annotations in a way that allows us to standardize the description of a ligand, and optionally the bound part of the ligand, with the ChEBI ontology. The vast majority of ligands that are curated in UniProtKB are small molecules that can be represented by a ChEBI entity. A minority of curated ligands are macromolecules. These are not within the scope of ChEBI, but ChEBI contains a limited set of high-level terms (such as DNA or RNA) that we use to curate such ligands. For macromolecules and chelating structures, we sometimes (if known and of interest) also describe the bound part of the ligand with a separate ChEBI entity.

In the subsections that describe the new representation of 'Binding site' annotations in different UniProtKB formats we use the following placeholders for annotation values:

Ligand:

Placeholders for values that describe the molecule (small or macro), ion or chelating structure that is bound by the protein:

  • LigandName: The UniProt ChEBI name of the ligand, or the word "substrate". Mandatory.
  • LigandId: The ChEBI identifier of the ligand. Mandatory, except when the LigandName is "substrate".
  • LigandLabel: A label used to distinguish individual instances of a ligand when a protein binds multiple ligands of the same chemical nature. Optional.
  • LigandNote: A free text note that provides further details about the ligand. Optional.

Ligand part:

Placeholders for values that describe the specific part of the ligand that is bound by the protein (e.g. the iron atom in a heme, a specific type of amino-acid residue in a protein, the 3'-CCA end of a tRNA molecule):

  • LigandPartName: The UniProt ChEBI name of the ligand part. Optional.
  • LigandPartId: The ChEBI identifier of the ligand part. Optional.
  • LigandPartLabel: A label used to distinguish individual instances of a ligand part when a protein binds multiple parts of the same chemical nature that are part of the same ligand. Optional.
  • LigandPartNote: A free text note that provides further details about the ligand part. Optional.

Additional information:

Placeholders for values that provide additional information:

  • Note: Free text note about the binding residue(s). Optional.
  • Evidences: List of evidences that support the annotation. Optional.

Note also that we continue to use 'Binding site' annotations primarily to describe individual amino acid residues that bind a ligand. However, in order to standardize the existing ligand binding descriptions of the annotation types that have been deprecated, as well as those of some 'Region' annotations, 'Binding site' annotations may now also describe a range of amino acids when the exact ligand binding residues are unknown or where adjacent ligand binding residues had been grouped in the past.

Text format

We have made the following changes to the UniProtKB text format to standardize the description of a ligand, and optionally the bound part of the ligand, with the ChEBI ontology.

  • We have deprecated the feature types CA_BIND, METAL and NP_BIND.
  • We have introduced eight new qualifiers for the BINDING feature type to describe a ligand (/ligand, /ligand_id, /ligand_label, /ligand_note) and a ligand part (/ligand_part, /ligand_part_id, /ligand_part_label, /ligand_part_note).
  • The representation of the annotation Note and the Evidences remains unchanged.

FT   BINDING         x[..y]
FT                   /ligand="LigandName"
FT                   /ligand_id="ChEBI:LigandId"
FT                   /ligand_label="LigandLabel"
FT                   /ligand_note="LigandNote"
FT                   /ligand_part="LigandPartName"
FT                   /ligand_part_id="ChEBI:LigandPartId"
FT                   /ligand_part_label="LigandPartLabel"
FT                   /ligand_part_note="LigandPartNote"
FT                   /note="Note"
FT                   /evidence="Evidences"
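
The layout above could be read with a small line-oriented parser such as the sketch below. It is only an illustration under a few assumptions: the function name parse_binding_features is hypothetical, every qualifier value is taken to be a double-quoted string, and wrapped values (such as long evidence lists) are joined with a single space.

```python
import re

# A feature header line: 'FT', three spaces, feature type, then the location.
HEADER = re.compile(r"^FT   (\S+)\s+(\S+)\s*$")

def parse_binding_features(lines):
    """Return one dict per BINDING feature found in flat-file FT lines."""
    features = []
    current = None    # feature being filled, or None while skipping other types
    open_key = None   # qualifier whose quoted value continues on the next line
    for raw in lines:
        line = raw.rstrip()
        header = HEADER.match(line)
        if header:
            ftype, location = header.groups()
            current = {"location": location} if ftype == "BINDING" else None
            if current is not None:
                features.append(current)
            open_key = None
            continue
        if current is None or not line.startswith("FT"):
            continue
        text = line[2:].strip()
        if open_key is None and text.startswith("/") and "=" in text:
            key, value = text[1:].split("=", 1)
            current[key] = value[1:] if value.startswith('"') else value
            open_key = key
        elif open_key is not None:
            # Continuation of a wrapped value.
            current[open_key] += " " + text
        if open_key is not None and current[open_key].endswith('"'):
            current[open_key] = current[open_key][:-1]
            open_key = None
    return features

sample = [
    "FT   BINDING         79",
    'FT                   /ligand="heme c"',
    'FT                   /ligand_id="ChEBI:CHEBI:61717"',
    'FT                   /evidence="ECO:0000269|PubMed:21419779,',
    'FT                   ECO:0007744|PDB:3ML1"',
    "FT   METAL           89",
    'FT                   /note="Zinc 1; shared with dimeric partner"',
]
parsed = parse_binding_features(sample)
print(parsed)
```

Features of other types, including the deprecated METAL key in the sample, are skipped rather than converted.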

Example: Q9H5X1

Previous format:

FT   METAL           89
FT                   /note="Zinc 1; shared with dimeric partner"
FT   METAL           89
FT                   /note="Zinc 2; shared with dimeric partner"
FT   METAL           123
FT                   /note="Zinc 1; shared with dimeric partner"
FT   METAL           123
FT                   /note="Zinc 2; shared with dimeric partner"

New format:

FT   BINDING         89
FT                   /ligand="Zn(2+)"
FT                   /ligand_id="ChEBI:CHEBI:29105"
FT                   /ligand_label="1"
FT                   /ligand_note="ligand shared between dimeric partners"
FT   BINDING         89
FT                   /ligand="Zn(2+)"
FT                   /ligand_id="ChEBI:CHEBI:29105"
FT                   /ligand_label="2"
FT                   /ligand_note="ligand shared between dimeric partners"
FT   BINDING         123
FT                   /ligand="Zn(2+)"
FT                   /ligand_id="ChEBI:CHEBI:29105"
FT                   /ligand_label="1"
FT                   /ligand_note="ligand shared between dimeric partners"
FT   BINDING         123
FT                   /ligand="Zn(2+)"
FT                   /ligand_id="ChEBI:CHEBI:29105"
FT                   /ligand_label="2"
FT                   /ligand_note="ligand shared between dimeric partners"

Example: P39186

Previous format:

FT   METAL           79
FT                   /note="Iron (heme C 1 axial ligand); via tele nitrogen"
FT   METAL           97
FT                   /note="Iron (heme C 1 axial ligand); via tele nitrogen"
FT   METAL           114
FT                   /note="Iron (heme C 2 axial ligand); via tele nitrogen"
FT   METAL           137
FT                   /note="Iron (heme C 2 axial ligand); via tele nitrogen"
FT   BINDING         93
FT                   /note="Heme C 1; covalent"
FT                   /evidence="ECO:0000269|PubMed:21419779"
FT   BINDING         96
FT                   /note="Heme C 1; covalent"
FT                   /evidence="ECO:0000269|PubMed:21419779"
FT   BINDING         133
FT                   /note="Heme C 2; covalent"
FT                   /evidence="ECO:0000269|PubMed:21419779"
FT   BINDING         136
FT                   /note="Heme C 2; covalent"
FT                   /evidence="ECO:0000269|PubMed:21419779"

New format:

FT   BINDING         79
FT                   /ligand="heme c"
FT                   /ligand_id="ChEBI:CHEBI:61717"
FT                   /ligand_label="1"
FT                   /ligand_part="Fe"
FT                   /ligand_part_id="ChEBI:CHEBI:18248"
FT                   /note="axial binding residue"
FT                   /evidence="ECO:0000269|PubMed:21419779,
FT                   ECO:0007744|PDB:3ML1"
FT   BINDING         93
FT                   /ligand="heme c"
FT                   /ligand_id="ChEBI:CHEBI:61717"
FT                   /ligand_label="1"
FT                   /note="covalent"
FT                   /evidence="ECO:0000269|PubMed:21419779,
FT                   ECO:0007744|PDB:3ML1"
FT   BINDING         96
FT                   /ligand="heme c"
FT                   /ligand_id="ChEBI:CHEBI:61717"
FT                   /ligand_label="1"
FT                   /note="covalent"
FT                   /evidence="ECO:0000269|PubMed:21419779,
FT                   ECO:0007744|PDB:3ML1"
FT   BINDING         97
FT                   /ligand="heme c"
FT                   /ligand_id="ChEBI:CHEBI:61717"
FT                   /ligand_label="1"
FT                   /ligand_part="Fe"
FT                   /ligand_part_id="ChEBI:CHEBI:18248"
FT                   /note="axial binding residue"
FT                   /evidence="ECO:0000269|PubMed:21419779,
FT                   ECO:0007744|PDB:3ML1"
FT   BINDING         114
FT                   /ligand="heme c"
FT                   /ligand_id="ChEBI:CHEBI:61717"
FT                   /ligand_label="2"
FT                   /ligand_part="Fe"
FT                   /ligand_part_id="ChEBI:CHEBI:18248"
FT                   /note="axial binding residue"
FT                   /evidence="ECO:0000269|PubMed:21419779,
FT                   ECO:0007744|PDB:3ML1"
FT   BINDING         133
FT                   /ligand="heme c"
FT                   /ligand_id="ChEBI:CHEBI:61717"
FT                   /ligand_label="2"
FT                   /note="covalent"
FT                   /evidence="ECO:0000269|PubMed:21419779,
FT                   ECO:0007744|PDB:3ML1"
FT   BINDING         136
FT                   /ligand="heme c"
FT                   /ligand_id="ChEBI:CHEBI:61717"
FT                   /ligand_label="2"
FT                   /note="covalent"
FT                   /evidence="ECO:0000269|PubMed:21419779,
FT                   ECO:0007744|PDB:3ML1"
FT   BINDING         137
FT                   /ligand="heme c"
FT                   /ligand_id="ChEBI:CHEBI:61717"
FT                   /ligand_label="2"
FT                   /ligand_part="Fe"
FT                   /ligand_part_id="ChEBI:CHEBI:18248"
FT                   /note="axial binding residue"
FT                   /evidence="ECO:0000269|PubMed:21419779,
FT                   ECO:0007744|PDB:3ML1"

Example: Q9H6S0

Previous format:

FT   REGION          1294..1296
FT                   /note="N6-methyladenosine residue binding"
FT                   /evidence="ECO:0000250|UniProtKB:Q9Y5A9"
...
FT   BINDING         1310
FT                   /note="N6-methyladenosine residue"
FT                   /evidence="ECO:0000250|UniProtKB:Q96MU7"
FT   BINDING         1360
FT                   /note="N6-methyladenosine residue"
FT                   /evidence="ECO:0000250|UniProtKB:Q96MU7"

New format:

FT   BINDING         1294..1296
FT                   /ligand="RNA"
FT                   /ligand_id="ChEBI:CHEBI:33697"
FT                   /ligand_part="N(6)-methyladenosine 5'-phosphate residue"
FT                   /ligand_part_id="ChEBI:CHEBI:74449"
FT                   /evidence="ECO:0000250|UniProtKB:Q9Y5A9"
FT   BINDING         1310
FT                   /ligand="RNA"
FT                   /ligand_id="ChEBI:CHEBI:33697"
FT                   /ligand_part="N(6)-methyladenosine 5'-phosphate residue"
FT                   /ligand_part_id="ChEBI:CHEBI:74449"
FT                   /evidence="ECO:0000250|UniProtKB:Q96MU7"
FT   BINDING         1360
FT                   /ligand="RNA"
FT                   /ligand_id="ChEBI:CHEBI:33697"
FT                   /ligand_part="N(6)-methyladenosine 5'-phosphate residue"
FT                   /ligand_part_id="ChEBI:CHEBI:74449"
FT                   /evidence="ECO:0000250|UniProtKB:Q96MU7"

XML format

We have made the following changes to the UniProtKB XSD to standardize the description of a ligand, and optionally the bound part of the ligand, with the ChEBI ontology.

  • We have deprecated the feature types calcium-binding region, metal ion-binding site and nucleotide phosphate-binding region.
  • We have introduced two new elements, ligand and ligandPart, and corresponding types, ligandType and ligandPartType. The two types have the same structure that consists of the following four elements:
    • name for the value LigandName / LigandPartName
    • dbReference for a cross-reference to a ChEBI record (LigandId / LigandPartId)
    • label for the value LigandLabel / LigandPartLabel
    • note for the value LigandNote / LigandPartNote
  • The representation of the annotation Note and the Evidences remains unchanged.
<feature type="binding site" description="Note" evidence="Evidences">
      <location>
        ...
      </location>
      <ligand>
        <name>LigandName</name>
        <dbReference type="ChEBI" id="LigandId"/>
        <label>LigandLabel</label>
        <note>LigandNote</note>
      </ligand>
      <ligandPart>
        <name>LigandPartName</name>
        <dbReference type="ChEBI" id="LigandPartId"/>
        <label>LigandPartLabel</label>
        <note>LigandPartNote</note>
      </ligandPart>
    </feature>
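
The element layout above can be read with any XML parser. The following is a minimal Python sketch using only the standard library. The fragment is taken from the P39186 example in this section, the helper name describe_bound_entity is hypothetical, and the sketch assumes an un-namespaced fragment (full entry files declare a default XML namespace, which would have to be added to the element paths).

```python
import xml.etree.ElementTree as ET

# One feature in the new format (from the P39186 example).
FRAGMENT = """
<feature type="binding site" description="axial binding residue" evidence="2 5">
  <location>
    <position position="79"/>
  </location>
  <ligand>
    <name>heme c</name>
    <dbReference type="ChEBI" id="CHEBI:61717"/>
    <label>1</label>
  </ligand>
  <ligandPart>
    <name>Fe</name>
    <dbReference type="ChEBI" id="CHEBI:18248"/>
  </ligandPart>
</feature>
"""


def describe_bound_entity(element):
    """Collect name, ChEBI id, label and note of a ligand or ligandPart element."""
    if element is None:
        return None
    db_ref = element.find("dbReference")
    return {
        "name": element.findtext("name"),
        "chebi": db_ref.get("id") if db_ref is not None else None,
        "label": element.findtext("label"),  # None when the element is absent
        "note": element.findtext("note"),    # None when the element is absent
    }


feature = ET.fromstring(FRAGMENT)
ligand = describe_bound_entity(feature.find("ligand"))
ligand_part = describe_bound_entity(feature.find("ligandPart"))
```

Because dbReference, label and note are optional in the schema, the sketch returns None for any of them that is missing rather than failing.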

The XSD changes are shown below; the deprecated enumeration values appear as comments:

<!-- Feature definition begins -->
    <xs:complexType name="featureType">
    ...
        <xs:sequence>
        ...
            <xs:element name="location" type="locationType">
                <xs:annotation>
                    <xs:documentation>Describes the sequence coordinates of the annotation.</xs:documentation>
                </xs:annotation>
            </xs:element>
            <xs:element name="ligand" type="ligandType" minOccurs="0">
                <xs:annotation>
                    <xs:documentation>Describes the chemical entity that is bound in annotations that describe binding sites.</xs:documentation>
                </xs:annotation>
            </xs:element>
            <xs:element name="ligandPart" type="ligandPartType" minOccurs="0">
                <xs:annotation>
                    <xs:documentation>Describes the specific part of a molecule that is bound in annotations that describe binding sites.</xs:documentation>
                </xs:annotation>
            </xs:element>
        </xs:sequence>
        <xs:attribute name="type" use="required">
        ...
            <xs:simpleType>
                <xs:restriction base="xs:string">
                    <xs:enumeration value="active site"/>
                    <xs:enumeration value="binding site"/>
                    <!-- <xs:enumeration value="calcium-binding region"/> -->
                    ...
                    <!-- <xs:enumeration value="metal ion-binding site"/> -->
                    ...
                    <!-- <xs:enumeration value="nucleotide phosphate-binding region"/> -->
                    ...
                </xs:restriction>
            </xs:simpleType>
        </xs:attribute>
    ...
    </xs:complexType>
    ...
    <xs:complexType name="ligandType">
        <xs:annotation>
            <xs:documentation>Describes a ligand.</xs:documentation>
        </xs:annotation>
        <xs:sequence>
            <xs:element name="name" type="xs:string"/>
            <xs:element name="dbReference" type="dbReferenceType" minOccurs="0"/>
            <xs:element name="label" type="xs:string" minOccurs="0"/>
            <xs:element name="note" type="xs:string" minOccurs="0"/>
        </xs:sequence>
    </xs:complexType>

    <xs:complexType name="ligandPartType">
        <xs:annotation>
            <xs:documentation>Describes a ligand part.</xs:documentation>
        </xs:annotation>
        <xs:sequence>
            <xs:element name="name" type="xs:string"/>
            <xs:element name="dbReference" type="dbReferenceType" minOccurs="0"/>
            <xs:element name="label" type="xs:string" minOccurs="0"/>
            <xs:element name="note" type="xs:string" minOccurs="0"/>
        </xs:sequence>
    </xs:complexType>

Example: Q9H5X1

Previous format:

<feature type="metal ion-binding site" description="Zinc 1; shared with dimeric partner">
      <location>
        <position position="89"/>
      </location>
    </feature>
    <feature type="metal ion-binding site" description="Zinc 2; shared with dimeric partner">
      <location>
        <position position="89"/>
      </location>
    </feature>
    <feature type="metal ion-binding site" description="Zinc 1; shared with dimeric partner">
      <location>
        <position position="123"/>
      </location>
    </feature>
    <feature type="metal ion-binding site" description="Zinc 2; shared with dimeric partner">
      <location>
        <position position="123"/>
      </location>
    </feature>

New format:

<feature type="binding site">
      <location>
        <position position="89"/>
      </location>
      <ligand>
        <name>Zn(2+)</name>
        <dbReference type="ChEBI" id="CHEBI:29105"/>
        <label>1</label>
        <note>ligand shared between dimeric partners</note>
      </ligand>
    </feature>
    <feature type="binding site">
      <location>
        <position position="89"/>
      </location>
      <ligand>
        <name>Zn(2+)</name>
        <dbReference type="ChEBI" id="CHEBI:29105"/>
        <label>2</label>
        <note>ligand shared between dimeric partners</note>
      </ligand>
    </feature>
    <feature type="binding site">
      <location>
        <position position="123"/>
      </location>
      <ligand>
        <name>Zn(2+)</name>
        <dbReference type="ChEBI" id="CHEBI:29105"/>
        <label>1</label>
        <note>ligand shared between dimeric partners</note>
      </ligand>
    </feature>
    <feature type="binding site">
      <location>
        <position position="123"/>
      </location>
      <ligand>
        <name>Zn(2+)</name>
        <dbReference type="ChEBI" id="CHEBI:29105"/>
        <label>2</label>
        <note>ligand shared between dimeric partners</note>
      </ligand>
    </feature>

Example: P39186

Previous format:

<feature type="metal ion-binding site" description="Iron (heme C 1 axial ligand); via tele nitrogen">
      <location>
        <position position="79"/>
      </location>
    </feature>
    <feature type="metal ion-binding site" description="Iron (heme C 1 axial ligand); via tele nitrogen">
      <location>
        <position position="97"/>
      </location>
    </feature>
    <feature type="metal ion-binding site" description="Iron (heme C 2 axial ligand); via tele nitrogen">
      <location>
        <position position="114"/>
      </location>
    </feature>
    <feature type="metal ion-binding site" description="Iron (heme C 2 axial ligand); via tele nitrogen">
      <location>
        <position position="137"/>
      </location>
    </feature>
    <feature type="binding site" description="Heme C 1; covalent" evidence="2">
      <location>
        <position position="93"/>
      </location>
    </feature>
    <feature type="binding site" description="Heme C 1; covalent" evidence="2">
      <location>
        <position position="96"/>
      </location>
    </feature>
    <feature type="binding site" description="Heme C 2; covalent" evidence="2">
      <location>
        <position position="133"/>
      </location>
    </feature>
    <feature type="binding site" description="Heme C 2; covalent" evidence="2">
      <location>
        <position position="136"/>
      </location>
    </feature>

New format:

<feature type="binding site" description="axial binding residue" evidence="2 5">
      <location>
        <position position="79"/>
      </location>
      <ligand>
        <name>heme c</name>
        <dbReference type="ChEBI" id="CHEBI:61717"/>
        <label>1</label>
      </ligand>
      <ligandPart>
        <name>Fe</name>
        <dbReference type="ChEBI" id="CHEBI:18248"/>
      </ligandPart>
    </feature>
    <feature type="binding site" description="covalent" evidence="2 5">
      <location>
        <position position="93"/>
      </location>
      <ligand>
        <name>heme c</name>
        <dbReference type="ChEBI" id="CHEBI:61717"/>
        <label>1</label>
      </ligand>
    </feature>
    <feature type="binding site" description="covalent" evidence="2 5">
      <location>
        <position position="96"/>
      </location>
      <ligand>
        <name>heme c</name>
        <dbReference type="ChEBI" id="CHEBI:61717"/>
        <label>1</label>
      </ligand>
    </feature>
    <feature type="binding site" description="axial binding residue" evidence="2 5">
      <location>
        <position position="97"/>
      </location>
      <ligand>
        <name>heme c</name>
        <dbReference type="ChEBI" id="CHEBI:61717"/>
        <label>1</label>
      </ligand>
      <ligandPart>
        <name>Fe</name>
        <dbReference type="ChEBI" id="CHEBI:18248"/>
      </ligandPart>
    </feature>
    <feature type="binding site" description="axial binding residue" evidence="2 5">
      <location>
        <position position="114"/>
      </location>
      <ligand>
        <name>heme c</name>
        <dbReference type="ChEBI" id="CHEBI:61717"/>
        <label>2</label>
      </ligand>
      <ligandPart>
        <name>Fe</name>
        <dbReference type="ChEBI" id="CHEBI:18248"/>
      </ligandPart>
    </feature>
    <feature type="binding site" description="covalent" evidence="2 5">
      <location>
        <position position="133"/>
      </location>
      <ligand>
        <name>heme c</name>
        <dbReference type="ChEBI" id="CHEBI:61717"/>
        <label>2</label>
      </ligand>
    </feature>
    <feature type="binding site" description="covalent" evidence="2 5">
      <location>
        <position position="136"/>
      </location>
      <ligand>
        <name>heme c</name>
        <dbReference type="ChEBI" id="CHEBI:61717"/>
        <label>2</label>
      </ligand>
    </feature>
    <feature type="binding site" description="axial binding residue" evidence="2 5">
      <location>
        <position position="137"/>
      </location>
      <ligand>
        <name>heme c</name>
        <dbReference type="ChEBI" id="CHEBI:61717"/>
        <label>2</label>
      </ligand>
      <ligandPart>
        <name>Fe</name>
        <dbReference type="ChEBI" id="CHEBI:18248"/>
      </ligandPart>
    </feature>
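
With the ligand label now stored in its own element, the residues that bind the same heme group can be collected without parsing free text. The following Python sketch illustrates this on three of the features above. It assumes an un-namespaced fragment wrapped in an entry element and handles single positions only; the function name residues_by_ligand is hypothetical.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Three of the P39186 features, wrapped in an <entry> element for parsing.
ENTRY = """
<entry>
  <feature type="binding site" description="axial binding residue" evidence="2 5">
    <location><position position="79"/></location>
    <ligand>
      <name>heme c</name>
      <dbReference type="ChEBI" id="CHEBI:61717"/>
      <label>1</label>
    </ligand>
    <ligandPart>
      <name>Fe</name>
      <dbReference type="ChEBI" id="CHEBI:18248"/>
    </ligandPart>
  </feature>
  <feature type="binding site" description="covalent" evidence="2 5">
    <location><position position="93"/></location>
    <ligand>
      <name>heme c</name>
      <dbReference type="ChEBI" id="CHEBI:61717"/>
      <label>1</label>
    </ligand>
  </feature>
  <feature type="binding site" description="covalent" evidence="2 5">
    <location><position position="133"/></location>
    <ligand>
      <name>heme c</name>
      <dbReference type="ChEBI" id="CHEBI:61717"/>
      <label>2</label>
    </ligand>
  </feature>
</entry>
"""


def residues_by_ligand(entry):
    """Map (ligand name, ligand label) to a list of (position, description)."""
    groups = defaultdict(list)
    for feature in entry.iter("feature"):
        if feature.get("type") != "binding site":
            continue
        ligand = feature.find("ligand")
        position = feature.find("location/position")
        if ligand is None or position is None:
            continue  # no ligand, or a begin/end range (not handled in this sketch)
        key = (ligand.findtext("name"), ligand.findtext("label"))
        groups[key].append((int(position.get("position")), feature.get("description")))
    return dict(groups)


groups = residues_by_ligand(ET.fromstring(ENTRY))
```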

Example: Q9H6S0

Previous format:

<feature type="region of interest" description="N6-methyladenosine residue binding" evidence="3">
      <location>
        <begin position="1294"/>
        <end position="1296"/>
      </location>
    </feature>
    ...
    <feature type="binding site" description="N6-methyladenosine residue" evidence="2">
      <location>
        <position position="1310"/>
      </location>
    </feature>
    <feature type="binding site" description="N6-methyladenosine residue" evidence="2">
      <location>
        <position position="1360"/>
      </location>
    </feature>

New format:

<feature type="binding site" evidence="3">
      <location>
        <begin position="1294"/>
        <end position="1296"/>
      </location>
      <ligand>
        <name>RNA</name>
        <dbReference type="ChEBI" id="CHEBI:33697"/>
      </ligand>
      <ligandPart>
        <name>N(6)-methyladenosine 5'-phosphate residue</name>
        <dbReference type="ChEBI" id="CHEBI:74449"/>
      </ligandPart>
    </feature>
    <feature type="binding site" evidence="2">
      <location>
        <position position="1310"/>
      </location>
      <ligand>
        <name>RNA</name>
        <dbReference type="ChEBI" id="CHEBI:33697"/>
      </ligand>
      <ligandPart>
        <name>N(6)-methyladenosine 5'-phosphate residue</name>
        <dbReference type="ChEBI" id="CHEBI:74449"/>
      </ligandPart>
    </feature>
    <feature type="binding site" evidence="2">
      <location>
        <position position="1360"/>
      </location>
      <ligand>
        <name>RNA</name>
        <dbReference type="ChEBI" id="CHEBI:33697"/>
      </ligand>
      <ligandPart>
        <name>N(6)-methyladenosine 5'-phosphate residue</name>
        <dbReference type="ChEBI" id="CHEBI:74449"/>
      </ligandPart>
    </feature>

RDF format

We have made the following changes to the UniProt RDF schema ontology to standardize the description of a ligand, and optionally the bound part of the ligand, with the ChEBI ontology.

  • We have deprecated the classes Calcium_Binding_Annotation, Metal_Binding_Annotation and NP_Binding_Annotation.
  • We have introduced two new properties, ligand and ligandPart, whose rdfs:domain is the Binding_Site_Annotation class. The nature of a ligand or ligand part is described as one of:
    • an rdfs:subClassOf of the corresponding ChEBI class (when a ChEBI identifier is available)
    • a class identified by an entry URI fragment whose identifier is generated from the ligand (part) name (when no ChEBI identifier is available)
  • A partOf statement is used to link a ligand part class to the ligand class that contains the part. As a consequence, the partOf property will lose its rdfs:domain and rdfs:range.
  • The label and note of a ligand (LigandLabel, LigandNote) or ligand part (LigandPartLabel, LigandPartNote) are described at the corresponding class level with rdfs:label and rdfs:comment statements, respectively.
  • The representation of the annotation Note and the Evidences remains unchanged.

In the examples below, the parts of the annotation whose structure has not changed (details of sequence range, evidences) are omitted.

Example: Q9H5X1

Previous format:

<Q9H5X1> rdf:type up:Protein ;
...
  up:annotation ...
    <Q9H5X1#SIP205C589B1B82C19D> ,
    <Q9H5X1#SIP0FDA08578DA645F1> ,
    <Q9H5X1#SIPF931DC9BFE3EE1EA> ,
    <Q9H5X1#SIP9FE49AAED1489740> ,
...
<Q9H5X1#SIP205C589B1B82C19D>
  rdf:type up:Metal_Binding_Annotation ;
  rdfs:comment "Zinc 1; shared with dimeric partner" ;
  up:range range:22862455408963886tt89tt89 .
...
<Q9H5X1#SIP0FDA08578DA645F1>
  rdf:type up:Metal_Binding_Annotation ;
  rdfs:comment "Zinc 2; shared with dimeric partner" ;
  up:range range:22862455408963886tt89tt89 .
<Q9H5X1#SIPF931DC9BFE3EE1EA>
  rdf:type up:Metal_Binding_Annotation ;
  rdfs:comment "Zinc 1; shared with dimeric partner" ;
  up:range range:22862455408963886tt123tt123 .
...
<Q9H5X1#SIP9FE49AAED1489740>
  rdf:type up:Metal_Binding_Annotation ;
  rdfs:comment "Zinc 2; shared with dimeric partner" ;
  up:range range:22862455408963886tt123tt123 .

New format:

<Q9H5X1> rdf:type up:Protein ;
...
  up:annotation ...
    <Q9H5X1#SIPEF929039F510ECEF> ,
    <Q9H5X1#SIP920EDE1AF7D0C789> ,
    <Q9H5X1#SIP72E42A3FB2BD4B9A> ,
    <Q9H5X1#SIP34635D3021CE14FA> ,
...
<Q9H5X1#SIPEF929039F510ECEF>
  rdf:type up:Binding_Site_Annotation ;
  up:ligand <Q9H5X1#SIPA9D42F70EC978DA2> ;
  up:range range:22862455408963886tt89tt89 .
...
<Q9H5X1#SIP920EDE1AF7D0C789>
  rdf:type up:Binding_Site_Annotation ;
  up:ligand <Q9H5X1#SIP2113E97BF945072A> ;
  up:range range:22862455408963886tt89tt89 .
<Q9H5X1#SIP72E42A3FB2BD4B9A>
  rdf:type up:Binding_Site_Annotation ;
  up:ligand <Q9H5X1#SIPA9D42F70EC978DA2> ;
  up:range range:22862455408963886tt123tt123 .
...
<Q9H5X1#SIP34635D3021CE14FA>
  rdf:type up:Binding_Site_Annotation ;
  up:ligand <Q9H5X1#SIP2113E97BF945072A> ;
  up:range range:22862455408963886tt123tt123 .
...
<Q9H5X1#SIPA9D42F70EC978DA2>
  rdfs:subClassOf <http://purl.obolibrary.org/obo/CHEBI_29105> ;
  rdfs:label "1" ;
  rdfs:comment "ligand shared between dimeric partners" .
<Q9H5X1#SIP2113E97BF945072A>
  rdfs:subClassOf <http://purl.obolibrary.org/obo/CHEBI_29105> ;
  rdfs:label "2" ;
  rdfs:comment "ligand shared between dimeric partners" .

Example: P39186

Previous format:

<P39186> rdf:type up:Protein ;
...
  up:annotation ...
    <P39186#SIP2FA351DB0B3C2F16> ,
    <P39186#SIP68C53D2ADFC188FB> ,
    <P39186#SIP293BA29AFBE6705D> ,
    <P39186#SIP04086D3B0200EFDB> ,
    <P39186#SIP421DBD60EE668E9A> ,
    <P39186#SIPBE78F7D7D1DFDB96> ,
    <P39186#SIPF42DD8F9E97D4C04> ,
    <P39186#SIP67D0B11099583DF4> ,
...
<P39186#SIP2FA351DB0B3C2F16>
  rdf:type up:Metal_Binding_Annotation ;
  rdfs:comment "Iron (heme C 1 axial ligand); via tele nitrogen" ;
  up:range range:22574318868772398tt79tt79 .
...
<P39186#SIP68C53D2ADFC188FB>
  rdf:type up:Metal_Binding_Annotation ;
  rdfs:comment "Iron (heme C 1 axial ligand); via tele nitrogen" ;
  up:range range:22574318868772398tt97tt97 .
...
<P39186#SIP293BA29AFBE6705D>
  rdf:type up:Metal_Binding_Annotation ;
  rdfs:comment "Iron (heme C 2 axial ligand); via tele nitrogen" ;
  up:range range:22574318868772398tt114tt114 .
...
<P39186#SIP04086D3B0200EFDB>
  rdf:type up:Metal_Binding_Annotation ;
  rdfs:comment "Iron (heme C 2 axial ligand); via tele nitrogen" ;
  up:range range:22574318868772398tt137tt137 .
...
<P39186#SIP421DBD60EE668E9A>
  rdf:type up:Binding_Site_Annotation ;
  rdfs:comment "Heme C 1; covalent" ;
  up:range range:22574318868772398tt93tt93 .
...
<P39186#SIPBE78F7D7D1DFDB96>
  rdf:type up:Binding_Site_Annotation ;
  rdfs:comment "Heme C 1; covalent" ;
  up:range range:22574318868772398tt96tt96 .
...
<P39186#SIPF42DD8F9E97D4C04>
  rdf:type up:Binding_Site_Annotation ;
  rdfs:comment "Heme C 2; covalent" ;
  up:range range:22574318868772398tt133tt133 .
...
<P39186#SIP67D0B11099583DF4>
  rdf:type up:Binding_Site_Annotation ;
  rdfs:comment "Heme C 2; covalent" ;
  up:range range:22574318868772398tt136tt136 .

New format:

<P39186> rdf:type up:Protein ;
...
  up:annotation ...
    <P39186#SIP6B05083539025B03> ,
    <P39186#SIP26EC788A934C8F47> ,
    <P39186#SIPE6A28D6EA4C9FD5D> ,
    <P39186#SIPFBD274DC3C6C2D4D> ,
    <P39186#SIP35F10B423DCD5ABF> ,
    <P39186#SIP2B84A319ABAB6D10> ,
    <P39186#SIP3DF7879BD48FFB15> ,
    <P39186#SIP4B946529653D1444> ,
...
<P39186#SIP6B05083539025B03>
  rdf:type up:Binding_Site_Annotation ;
  rdfs:comment "axial binding residue" ;
  up:ligand <P39186#SIP03157F908F4B7F18> ;
  up:ligandPart <P39186#SIP85C37483D3E0B42A> ;
  up:range range:22574318868772398tt79tt79 .
...
<P39186#SIP26EC788A934C8F47>
  rdf:type up:Binding_Site_Annotation ;
  rdfs:comment "covalent" ;
  up:ligand <P39186#SIP03157F908F4B7F18> ;
  up:range range:22574318868772398tt93tt93 .
...
<P39186#SIPE6A28D6EA4C9FD5D>
  rdf:type up:Binding_Site_Annotation ;
  rdfs:comment "covalent" ;
  up:ligand <P39186#SIP03157F908F4B7F18> ;
  up:range range:22574318868772398tt96tt96 .
...
<P39186#SIPFBD274DC3C6C2D4D>
  rdf:type up:Binding_Site_Annotation ;
  rdfs:comment "axial binding residue" ;
  up:ligand <P39186#SIP03157F908F4B7F18> ;
  up:ligandPart <P39186#SIP85C37483D3E0B42A> ;
  up:range range:22574318868772398tt97tt97 .
...
<P39186#SIP35F10B423DCD5ABF>
  rdf:type up:Binding_Site_Annotation ;
  rdfs:comment "axial binding residue" ;
  up:ligand <P39186#SIP997F95171519AEFE> ;
  up:ligandPart <P39186#SIP428C6123B1B272B9> ;
  up:range range:22574318868772398tt114tt114 .
...
<P39186#SIP2B84A319ABAB6D10>
  rdf:type up:Binding_Site_Annotation ;
  rdfs:comment "covalent" ;
  up:ligand <P39186#SIP997F95171519AEFE> ;
  up:range range:22574318868772398tt133tt133 .
...
<P39186#SIP3DF7879BD48FFB15>
  rdf:type up:Binding_Site_Annotation ;
  rdfs:comment "covalent" ;
  up:ligand <P39186#SIP997F95171519AEFE> ;
  up:range range:22574318868772398tt136tt136 .
...
<P39186#SIP4B946529653D1444>
  rdf:type up:Binding_Site_Annotation ;
  rdfs:comment "axial binding residue" ;
  up:ligand <P39186#SIP997F95171519AEFE> ;
  up:ligandPart <P39186#SIP428C6123B1B272B9> ;
  up:range range:22574318868772398tt137tt137 .
...
<P39186#SIP03157F908F4B7F18>
  rdfs:subClassOf <http://purl.obolibrary.org/obo/CHEBI_61717> ;
  rdfs:label "1" .
<P39186#SIP85C37483D3E0B42A>
  rdfs:subClassOf <http://purl.obolibrary.org/obo/CHEBI_18248> ;
  up:partOf <P39186#SIP03157F908F4B7F18> .
<P39186#SIP997F95171519AEFE>
  rdfs:subClassOf <http://purl.obolibrary.org/obo/CHEBI_61717> ;
  rdfs:label "2" .
<P39186#SIP428C6123B1B272B9>
  rdfs:subClassOf <http://purl.obolibrary.org/obo/CHEBI_18248> ;
  up:partOf <P39186#SIP997F95171519AEFE> .
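
To make the new graph structure concrete, the following Python sketch follows the statements from one annotation to its ligand and ligand part. It is an illustration only: it assumes the triples have already been parsed into (subject, predicate, object) tuples, it uses prefixed names as plain strings instead of a real RDF library, and the helper names are hypothetical.

```python
CHEBI = "http://purl.obolibrary.org/obo/"

# A subset of the P39186 statements, as (subject, predicate, object) tuples.
TRIPLES = [
    ("P39186#SIP6B05083539025B03", "rdf:type", "up:Binding_Site_Annotation"),
    ("P39186#SIP6B05083539025B03", "up:ligand", "P39186#SIP03157F908F4B7F18"),
    ("P39186#SIP6B05083539025B03", "up:ligandPart", "P39186#SIP85C37483D3E0B42A"),
    ("P39186#SIP03157F908F4B7F18", "rdfs:subClassOf", CHEBI + "CHEBI_61717"),
    ("P39186#SIP03157F908F4B7F18", "rdfs:label", "1"),
    ("P39186#SIP85C37483D3E0B42A", "rdfs:subClassOf", CHEBI + "CHEBI_18248"),
    ("P39186#SIP85C37483D3E0B42A", "up:partOf", "P39186#SIP03157F908F4B7F18"),
]


def objects(triples, subject, predicate):
    """Return all objects of the triples matching subject and predicate."""
    return [o for s, p, o in triples if s == subject and p == predicate]


def first(values):
    """Return the first value of a list, or None for an empty list."""
    return values[0] if values else None


def resolve_binding_site(triples, annotation):
    """Follow up:ligand and up:ligandPart from an annotation to the ChEBI classes."""
    ligand = first(objects(triples, annotation, "up:ligand"))
    part = first(objects(triples, annotation, "up:ligandPart"))
    return {
        "ligand_chebi": first(objects(triples, ligand, "rdfs:subClassOf")),
        "ligand_label": first(objects(triples, ligand, "rdfs:label")),
        "part_chebi": first(objects(triples, part, "rdfs:subClassOf")),
        # True when the ligand part class is linked to the ligand class via partOf.
        "part_of_ligand": ligand in objects(triples, part, "up:partOf"),
    }


site = resolve_binding_site(TRIPLES, "P39186#SIP6B05083539025B03")
```

The same traversal applies to annotations without a ligand part, in which case the part-related values are simply None and False.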

Example: Q9H6S0

Previous format:

<Q9H6S0> rdf:type up:Protein ;
...
  up:annotation ...
    <Q9H6S0#SIP1A82C4FF56746BB8> ,
...
    <Q9H6S0#SIPBD815A286DC38CC5> ,
    <Q9H6S0#SIPC5465EB9289C0B15> ,
...
<Q9H6S0#SIP1A82C4FF56746BB8>
  rdf:type up:Region_Annotation ;
  rdfs:comment "N6-methyladenosine residue binding" ;
  up:range range:22862455425413166tt1294tt1296 .
...
<Q9H6S0#SIPBD815A286DC38CC5>
  rdf:type up:Binding_Site_Annotation ;
  rdfs:comment "N6-methyladenosine residue" ;
  up:range range:22862455425413166tt1310tt1310 .
...
<Q9H6S0#SIPC5465EB9289C0B15>
  rdf:type up:Binding_Site_Annotation ;
  rdfs:comment "N6-methyladenosine residue" ;
  up:range range:22862455425413166tt1360tt1360 .

New format:

<Q9H6S0> rdf:type up:Protein ;
...
  up:annotation ...
    <Q9H6S0#SIP422D37614A6D5721> ,
    <Q9H6S0#SIP181E7990DC5C18D8> ,
    <Q9H6S0#SIP21050F3BB73E81A2> ,
...
<Q9H6S0#SIP422D37614A6D5721>
  rdf:type up:Binding_Site_Annotation ;
  up:ligand <Q9H6S0#SIPC276DB5E970331C9> ;
  up:ligandPart <Q9H6S0#SIP93E43149D18FF63B> ;
  up:range range:22862455425413166tt1294tt1296 .
...
<Q9H6S0#SIP181E7990DC5C18D8>
  rdf:type up:Binding_Site_Annotation ;
  up:ligand <Q9H6S0#SIPC276DB5E970331C9> ;
  up:ligandPart <Q9H6S0#SIP93E43149D18FF63B> ;
  up:range range:22862455425413166tt1310tt1310 .
...
<Q9H6S0#SIP21050F3BB73E81A2>
  rdf:type up:Binding_Site_Annotation ;
  up:ligand <Q9H6S0#SIPC276DB5E970331C9> ;
  up:ligandPart <Q9H6S0#SIP93E43149D18FF63B> ;
  up:range range:22862455425413166tt1360tt1360 .
...
<Q9H6S0#SIPC276DB5E970331C9>
  rdfs:subClassOf <http://purl.obolibrary.org/obo/CHEBI_33697> .
<Q9H6S0#SIP93E43149D18FF63B>
  rdfs:subClassOf <http://purl.obolibrary.org/obo/CHEBI_74449> ;
  up:partOf <Q9H6S0#SIPC276DB5E970331C9> .

Changes to the controlled vocabulary of human diseases

New diseases:

Modified diseases:

Deleted diseases:

  • Brachydactyly-mental retardation syndrome
  • Neuropathy, hereditary, with or without age-related macular degeneration
  • Opitz GBBB syndrome 2

Changes to the controlled vocabulary for PTMs

New term for the feature key 'Cross-link' ('CROSSLNK' in the flat file):

  • Dityrosine (Tyr-Tyr) (interchain with Y-...)

New term for the feature key 'Modified residue' ('MOD_RES' in the flat file):

  • ADP-ribosylhistidine

Changes to keywords

Modified keyword:

UniProt release 2022_02

Published May 25, 2022

Headline

Prenylation for antiviral activity

We are not all equal in the face of viral infections. Take SARS-CoV-2 and COVID-19: during the pandemic, the severity of the disease ranged from no symptoms at all to death, which befell at least 15 million individuals. Many factors can contribute to these differences, from environmental ones, such as deficiencies in diet, exercise, or sleep, to genetic ones, such as those contributing to the efficiency of the type I interferon response. Among the latter, 2'-5'-oligoadenylate synthase 1 (OAS1) has been the focus of several recent studies.

OAS1 is a sensor of viral double-stranded RNA (dsRNA). Binding to dsRNA induces a conformational change in OAS1 that activates it. Activated OAS1 catalyzes the polymerization of adenosine triphosphate (ATP) into a second messenger, 2'-5'-oligoadenylate (2'-5'A), which then binds monomeric ribonuclease L (RNase L), leading to its dimerization and subsequent activation. RNase L degrades cellular and viral RNA and blocks viral replication. OAS1 is expressed in the respiratory tract, a location perfectly suited to protect against SARS-CoV-2 infections, and binds two conserved stem loops in the SARS-CoV-2 5'-untranslated region (UTR) with great specificity.

But there is more to it. The OAS1 gene encodes at least four different proteins generated by alternative splicing. The predominant isoforms, called p46 and p42, have a common catalytic activity, but differ in the inclusion of the terminal exon, where a prenylation site is encoded. Unlike p42, p46 is geranylgeranylated, and this makes a great deal of difference. In cultured cells, only p46, not p42, can inhibit SARS-CoV-2, unless a prenylation site is engineered into p42, suggesting that targeting to membranes is crucial for OAS1 effectiveness. Preliminary clinical investigations also show that COVID-19 patients who do not express p46 are statistically more likely to develop severe COVID-19, i.e., to be admitted to the intensive care unit.

The explanation may lie in the fact that positive-strand RNA viruses, such as flaviviruses, picornaviruses, and coronaviruses, including SARS-CoV-2, replicate their RNA within modified host organelles of the endomembrane system. Membrane-bound OAS1 colocalizes with these organelles, where it can ideally detect viral RNAs and initiate their destruction. In support of this hypothesis, OAS1 effectiveness was demonstrated not only against SARS-CoV-2, but also against other viruses that form replicative organelles, while there was no inhibition of other respiratory RNA viruses, such as human parainfluenza virus type 3 and human respiratory syncytial virus, which do not use this type of vesicle.

The alternative splicing event that leads to p46 production and an efficient antiviral response hangs by a thread, or rather by a single nucleotide variation within the last OAS1 intron, A>G (rs10774671). The presence of G at this location creates a splice acceptor site, allowing the inclusion of the last exon, and hence of the crucial prenylation site, while in the presence of A, the last exon is skipped. Unfortunately, the A allele is the major one in the overall human population, with a frequency of 66%, although in some ethnic groups the G allele is more prevalent: in people of African ancestry, for instance, its frequency is about 58%. Fortunately, prenylated OAS1 is likely only one of multiple interferon-stimulated gene products that mediate the inhibition of SARS-CoV-2.

As of this release, the updated OAS1 entry is publicly available in UniProtKB/Swiss-Prot.

UniProtKB news

Cross-references to AlphaFoldDB

Cross-references have been added to AlphaFoldDB, a database of protein structure predictions for the human proteome and other key proteins of interest.

AlphaFoldDB is available at https://alphafold.ebi.ac.uk/.

The format of the explicit links is:

Resource abbreviation AlphaFoldDB
Resource identifier UniProtKB accession number

Example: P41182

Show all entries having a cross-reference to AlphaFoldDB.

Text format

Example: P41182

DR   AlphaFoldDB; P41182; -.

XML format

Example: P41182

<dbReference type="AlphaFoldDB" id="P41182"/>

RDF format

Example: P41182

uniprot:P41182
  rdfs:seeAlso <http://purl.uniprot.org/alphafolddb/P41182> .
<http://purl.uniprot.org/alphafolddb/P41182>
  rdf:type up:Resource ;
  up:database <http://purl.uniprot.org/database/AlphaFoldDB> .

Version numbers for identifiers in Ensembl cross-references

In UniProtKB cross-references to Ensembl, we are now using version numbers for Ensembl identifiers wherever possible.

Text format

Example: A6NEM1

Previous format:

DR   Ensembl; ENST00000618348; ENSP00000481078; ENSG00000197978.
DR   Ensembl; ENST00000620574; ENSP00000479589; ENSG00000274320.

New format:

DR   Ensembl; ENST00000618348.2; ENSP00000481078.1; ENSG00000197978.10.
DR   Ensembl; ENST00000620574.2; ENSP00000479589.1; ENSG00000274320.2.
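
Tools that compare identifiers across releases may need to separate the stable Ensembl identifier from its version. The following Python sketch shows one way to do this for the DR lines above. It assumes that a line has exactly the layout shown in this example (any text after the final period is not handled), and the function names are hypothetical.

```python
def split_versioned_id(identifier):
    """Split 'ENST00000618348.2' into ('ENST00000618348', 2).

    The version is None when the identifier carries no numeric version.
    """
    stable_id, dot, version = identifier.partition(".")
    if dot and version.isdigit():
        return stable_id, int(version)
    return identifier, None


def parse_ensembl_dr(line):
    """Parse a 'DR   Ensembl; ...' line into (stable id, version) pairs.

    The pairs are returned in line order: transcript, protein, gene.
    """
    prefix = "DR   Ensembl;"
    if not line.startswith(prefix):
        raise ValueError("not an Ensembl DR line")
    # Drop the terminating period, then split the semicolon-separated fields.
    body = line[len(prefix):].rstrip().rstrip(".")
    return [split_versioned_id(field.strip()) for field in body.split(";")]


new_format = parse_ensembl_dr(
    "DR   Ensembl; ENST00000618348.2; ENSP00000481078.1; ENSG00000197978.10."
)
previous_format = parse_ensembl_dr(
    "DR   Ensembl; ENST00000618348; ENSP00000481078; ENSG00000197978."
)
```

Both formats yield the same stable identifiers, so lines from before and after the change can be matched on the first element of each pair.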

XML format

Example: A6NEM1

Previous format:

<dbReference type="Ensembl" id="ENST00000618348">
  <property type="protein sequence ID" value="ENSP00000481078"/>
  <property type="gene ID" value="ENSG00000197978"/>
</dbReference>
<dbReference type="Ensembl" id="ENST00000620574">
  <property type="protein sequence ID" value="ENSP00000479589"/>
  <property type="gene ID" value="ENSG00000274320"/>
</dbReference>

New format:

<dbReference type="Ensembl" id="ENST00000618348.2">
  <property type="protein sequence ID" value="ENSP00000481078.1"/>
  <property type="gene ID" value="ENSG00000197978.10"/>
</dbReference>
<dbReference type="Ensembl" id="ENST00000620574.2">
  <property type="protein sequence ID" value="ENSP00000479589.1"/>
  <property type="gene ID" value="ENSG00000274320.2"/>
</dbReference>

RDF format

Example: A6NEM1

Previous format:

uniprot:A6NEM1
  rdfs:seeAlso <http://rdf.ebi.ac.uk/resource/ensembl.transcript/ENST00000618348> ,
               <http://rdf.ebi.ac.uk/resource/ensembl.transcript/ENST00000620574> .
<http://rdf.ebi.ac.uk/resource/ensembl.transcript/ENST00000618348>
  rdf:type up:Transcript_Resource ;
  up:database <http://purl.uniprot.org/database/Ensembl> ;
  up:translatedTo <http://rdf.ebi.ac.uk/resource/ensembl.protein/ENSP00000481078> ;
  up:transcribedFrom <http://rdf.ebi.ac.uk/resource/ensembl/ENSG00000197978> .
<http://rdf.ebi.ac.uk/resource/ensembl.transcript/ENST00000620574>
  rdf:type up:Transcript_Resource ;
  up:database <http://purl.uniprot.org/database/Ensembl> ;
  up:translatedTo <http://rdf.ebi.ac.uk/resource/ensembl.protein/ENSP00000479589> ;
  up:transcribedFrom <http://rdf.ebi.ac.uk/resource/ensembl/ENSG00000274320> .

New format:

uniprot:A6NEM1
  rdfs:seeAlso <http://rdf.ebi.ac.uk/resource/ensembl.transcript/ENST00000618348.2> ,
               <http://rdf.ebi.ac.uk/resource/ensembl.transcript/ENST00000620574.2> .
<http://rdf.ebi.ac.uk/resource/ensembl.transcript/ENST00000618348.2>
  rdf:type up:Transcript_Resource ;
  up:database <http://purl.uniprot.org/database/Ensembl> ;
  up:translatedTo <http://rdf.ebi.ac.uk/resource/ensembl.protein/ENSP00000481078.1> ;
  up:transcribedFrom <http://rdf.ebi.ac.uk/resource/ensembl/ENSG00000197978.10> .
<http://rdf.ebi.ac.uk/resource/ensembl.transcript/ENST00000620574.2>
  rdf:type up:Transcript_Resource ;
  up:database <http://purl.uniprot.org/database/Ensembl> ;
  up:translatedTo <http://rdf.ebi.ac.uk/resource/ensembl.protein/ENSP00000479589.1> ;
  up:transcribedFrom <http://rdf.ebi.ac.uk/resource/ensembl/ENSG00000274320.2> .

Changes to the controlled vocabulary of human diseases

New diseases:

Modified diseases:

Changes to the controlled vocabulary for PTMs

New term for the feature key 'Cross-link' ('CROSSLNK' in the flat file):

  • 2-(cystein-S-ylcarbonyl)-3-methyl-4-(glutam-5-yloxy)methylindole (Glu-Cys)

New term for the feature key 'Lipidation' ('LIPID' in the flat file):

  • N6-stearoyl lysine

Changes in subcellular location controlled vocabulary

New subcellular location:

Changes to keywords

New keyword:

UniParc news

Version numbers for identifiers in Ensembl cross-references

In UniParc cross-references to Ensembl, we are now using version numbers for Ensembl identifiers wherever possible.

XML format

Example: UPI000442D01A

Previous format:

<dbReference type="Ensembl" id="ENSP00000481078" version_i="1" active="Y" created="2014-07-21" last="2021-10-18">
  <property type="NCBI_taxonomy_id" value="9606"/>
  <property type="gene_name" value="GOLGA6L9"/>
  <property type="proteome_id" value="UP000005640"/>
  <property type="component" value="Chromosome 15"/>
</dbReference>

New format:

<dbReference type="Ensembl" id="ENSP00000481078" version_i="1" active="Y" version="1" created="2014-07-21" last="2021-10-18">
  <property type="NCBI_taxonomy_id" value="9606"/>
  <property type="gene_name" value="GOLGA6L9"/>
  <property type="proteome_id" value="UP000005640"/>
  <property type="component" value="Chromosome 15"/>
</dbReference>
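
In the UniParc XML, the Ensembl version is carried in a separate version attribute instead of being appended to the identifier. The following Python sketch rebuilds a versioned identifier from the two attributes. Joining them with a period is an assumption based on the UniProtKB and RDF examples in these notes, the version_i attribute is deliberately left untouched, and the function name is hypothetical.

```python
import xml.etree.ElementTree as ET

# Shortened copies of the cross-reference above, without and with the version attribute.
PREVIOUS_FORMAT = """
<dbReference type="Ensembl" id="ENSP00000481078" version_i="1" active="Y" created="2014-07-21" last="2021-10-18">
  <property type="NCBI_taxonomy_id" value="9606"/>
  <property type="gene_name" value="GOLGA6L9"/>
</dbReference>
"""

NEW_FORMAT = """
<dbReference type="Ensembl" id="ENSP00000481078" version_i="1" active="Y" version="1" created="2014-07-21" last="2021-10-18">
  <property type="NCBI_taxonomy_id" value="9606"/>
  <property type="gene_name" value="GOLGA6L9"/>
</dbReference>
"""


def versioned_identifier(db_reference):
    """Return 'id.version' when a version attribute is present, else the bare id."""
    identifier = db_reference.get("id")
    version = db_reference.get("version")
    if version is None:
        return identifier
    return f"{identifier}.{version}"


previous_id = versioned_identifier(ET.fromstring(PREVIOUS_FORMAT))
new_id = versioned_identifier(ET.fromstring(NEW_FORMAT))
```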

RDF format

Example: UPI000442D01A

Previous format:

uniparc:UPI000442D01A
  rdf:type up:Sequence ;
  up:sequenceFor uniprot:A6NEM1 , isoform:A6NEM1-1 , ensembl:ENSP00000481078 .

ensembl:ENSP00000481078
  up:database <http://purl.uniprot.org/database/Ensembl> ;
  up:version 1 .

New format:

uniparc:UPI000442D01A
  rdf:type up:Sequence ;
  up:sequenceFor uniprot:A6NEM1 , isoform:A6NEM1-1 , ensembl:ENSP00000481078.1 .

ensembl:ENSP00000481078.1
  up:database <http://purl.uniprot.org/database/Ensembl> ;
  up:version 1 ;
  up:reportedVersion 1 .

UniProt release 2022_01

Published February 23, 2022

Headline

A phospholipase for clear vision

The eye is a fascinating and complex organ, made of three main tissue types, the cornea, lens and retina. The eye lens is a transparent and biconvex structure which allows the passage of light, focusing it on the retina. To achieve their specific refractive properties, lens cells differentiate to contain an extraordinarily high concentration of crystallin proteins. They also have to get rid of organelles, such as endoplasmic reticulum, lysosomes and highly diffracting nuclei and mitochondria, to ensure clear vision. This process is under tight spatiotemporal control, proceeding from the center of the lens to the periphery to form an organelle-free zone. Failure of this process leads to a cloudy lens, in other words a cataract.

Some components in the eye lens differentiation process have been identified. The transcription factor HSF4 has been previously shown to be instrumental, although its precise action is unclear. It has also been shown that the degradation of nuclear DNA requires lysosomal DNASE2B. But until recently the process of organelle clearance remained quite elusive. Macroautophagy seemed to be a sensible hypothesis, as it is involved in organelle clearance during red blood cell differentiation for instance. However, this hypothesis has been discarded, as knockout mice lacking Atg5 or Pik3c3, two genes required for autophagy, exhibit normal organelle degradation during lens development.

Recently, Morishita et al. achieved a real breakthrough with a very efficient, though cumbersome, approach. Using zebrafish as an animal model, they knocked out, one by one, all the genes that are highly expressed in the lens. In the plaat1 knockout, the disappearance of mitochondria and the endoplasmic reticulum in the lens was almost completely suppressed. The knockout of the orthologous gene in mouse, called Plaat3, gave a similar result: there was almost no degradation of mitochondria, endoplasmic reticulum or lysosomes in the developing lens. PLAAT proteins are phospholipases that exhibit both phospholipase A1 (PLA1) and A2 (PLA2) activities, and hence they catalyze the calcium-independent release of fatty acids from the sn-1 or sn-2 position of glycerophospholipids. In the context of the lens, the lipase activity was required for rupture of organelles. During lens differentiation, PLAAT cytosolic enzymes are recruited to the organelles to be degraded after a mysterious membrane-damaging event. The cause of the membrane damage that initiates PLAAT recruitment and the cascade of organelle membrane degradation is unknown, but it is dependent upon HSF4 expression. One hypothesis is that high levels of expression of cytosolic proteins induced by HSF4, such as crystallins, may cause the membrane damage. What is the fate of the proteins released during organelle degradation? Are they degraded through the ubiquitin-proteasome pathway? What happens to the free fatty acids that are generated by PLAAT activity? These questions and many others remain open, but some (refracted) light has been shed on an until recently obscure process.

UniProtKB/Swiss-Prot entries for plaat1, PLAAT3 and HSF4 have been updated with new information.

UniProtKB news

Cross-references to MANE-Select

Cross-references have been added to the MANE-Select transcript resource. Matched Annotation from NCBI and EMBL-EBI (MANE) is a collaboration between the NCBI and the EMBL-EBI that aims to provide a minimal set of matching RefSeq and Ensembl transcripts of human protein-coding genes.

MANE Select aims at providing one high-quality representative transcript per protein-coding gene that is well-supported by experimental data and represents the biology of the gene.

MANE-Select is available at https://www.ensembl.org/info/genome/genebuild/mane.html and https://www.ncbi.nlm.nih.gov/refseq/MANE/.

The format of the explicit links is:

Resource abbreviation    MANE-Select
Resource identifier      Ensembl transcript sequence ID
Optional information 1   Ensembl protein sequence ID
Optional information 2   RefSeq nucleotide sequence ID
Optional information 3   RefSeq protein sequence ID

Example: Q9Y4I1

Show all entries having a cross-reference to MANE-Select.

Cross-references to MANE-Select may be isoform-specific. The general format of isoform-specific cross-references was described in release 2014_03.

Text format

Example: Q9Y4I1

DR   MANE-Select; ENST00000399233.7; ENSP00000382179.4; NM_001382347.1; NP_001369276.1. [Q9Y4I1-3]
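
As an illustration, a line of this kind could be split into its fields as sketched below. This is a minimal sketch based only on the single example line above; the names `ManeSelectXref` and `parse_mane_select_dr` are hypothetical and not part of any UniProt tool, and other line layouts are not handled.

```python
from typing import NamedTuple, Optional


class ManeSelectXref(NamedTuple):
    """Fields of one MANE-Select DR line (hypothetical container)."""
    ensembl_transcript: str
    ensembl_protein: str
    refseq_nucleotide: str
    refseq_protein: str
    isoform: Optional[str]  # None when no isoform is given in brackets


def parse_mane_select_dr(line: str) -> Optional[ManeSelectXref]:
    """Parse a 'DR   MANE-Select; ...' line; return None for other lines."""
    if not line.startswith("DR "):
        return None
    body = line[2:].strip()

    # An isoform-specific cross-reference ends with '[ISOFORM-ID]'.
    isoform = None
    if body.endswith("]") and "[" in body:
        body, _, bracketed = body.rpartition("[")
        isoform = bracketed[:-1]
        body = body.strip()

    # Drop the terminating period, then split on the field separator.
    fields = [field.strip() for field in body.rstrip(".").split(";")]
    if len(fields) != 5 or fields[0] != "MANE-Select":
        return None
    return ManeSelectXref(*fields[1:], isoform)


example = ("DR   MANE-Select; ENST00000399233.7; ENSP00000382179.4; "
           "NM_001382347.1; NP_001369276.1. [Q9Y4I1-3]")
xref = parse_mane_select_dr(example)
print(xref)
```

The version suffixes (such as `.7`) are kept as part of each identifier, since the example shows them as part of the cross-reference.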

XML format

Example: Q9Y4I1

<dbReference type="MANE-Select" id="ENST00000399233.7">
  <property type="protein sequence ID" value="ENSP00000382179.4"/>
  <property type="RefSeq nucleotide sequence ID" value="NM_001382347.1"/>
  <property type="RefSeq protein sequence ID" value="NP_001369276.1"/>
  <molecule id="Q9Y4I1-3"/>
</dbReference>
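
The element above can be read with a standard XML parser. The following is a minimal sketch using Python's `xml.etree.ElementTree` on the snippet exactly as shown; the function name `read_mane_select` is hypothetical. Complete UniProt XML files declare an XML namespace, so tag names would need to be namespace-qualified when working with full entries.

```python
import xml.etree.ElementTree as ET

# The example element from this section, without any enclosing entry.
SNIPPET = """<dbReference type="MANE-Select" id="ENST00000399233.7">
  <property type="protein sequence ID" value="ENSP00000382179.4"/>
  <property type="RefSeq nucleotide sequence ID" value="NM_001382347.1"/>
  <property type="RefSeq protein sequence ID" value="NP_001369276.1"/>
  <molecule id="Q9Y4I1-3"/>
</dbReference>"""


def read_mane_select(element):
    """Return the identifier, properties and isoform of a MANE-Select dbReference."""
    if element.tag != "dbReference" or element.get("type") != "MANE-Select":
        raise ValueError("not a MANE-Select dbReference element")
    properties = {p.get("type"): p.get("value") for p in element.findall("property")}
    molecule = element.find("molecule")  # present only for isoform-specific links
    return {
        "id": element.get("id"),
        "properties": properties,
        "isoform": molecule.get("id") if molecule is not None else None,
    }


xref = read_mane_select(ET.fromstring(SNIPPET))
print(xref["id"], xref["isoform"])
```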

RDF format

Example: Q9Y4I1

uniprot:Q9Y4I1
  rdfs:seeAlso <http://purl.uniprot.org/mane-select/ENST00000399233.7> .

<http://purl.uniprot.org/mane-select/ENST00000399233.7>
  rdf:type up:Transcript_Resource ;
  up:database <http://purl.uniprot.org/database/MANE-Select> ;
  owl:sameAs <http://rdf.ebi.ac.uk/resource/ensembl.transcript/ENST00000399233.7> ;
  owl:sameAs <http://purl.uniprot.org/refseq/NM_001382347.1> ;
  up:translatedTo <http://purl.uniprot.org/mane-select/ENST00000399233.7#translation> ;
  rdfs:seeAlso isoform:Q9Y4I1-3 .

<http://purl.uniprot.org/mane-select/ENST00000399233.7#translation>
  owl:sameAs <http://rdf.ebi.ac.uk/resource/ensembl.protein/ENSP00000382179.4> ;
  owl:sameAs <http://purl.uniprot.org/refseq/NP_001369276.1> .

Removal of the cross-references to GeneDB

Cross-references to GeneDB have been removed.

Change of ECO code associated with the GOA evidence 'Inferred from Electronic Annotation' (IEA)

Following a decision of the Gene Ontology (GO), we now associate, in Gene Ontology annotations, the evidence type 'Inferred from Electronic Annotation' (IEA) with the Evidence Ontology (ECO) term

'computational evidence used in automatic assertion', identified with ECO:0007669

replacing the previously used

'evidence used in automatic assertion', identified with ECO:0000501.

This affects the RDF formats of UniProtKB/GOA data.

Changes to the controlled vocabulary of human diseases

New diseases:

Modified diseases:

Deleted diseases

  • Glomerulocystic kidney disease with hyperuricemia and isosthenuria
  • Medullary cystic kidney disease 2
  • Waardenburg syndrome 2, with ocular albinism, autosomal recessive
  • SC phocomelia syndrome

Changes to the controlled vocabulary for PTMs

New term for the feature key 'Cross-link' ('CROSSLNK' in the flat file):

  • Cyclopeptide (Ile-Lys)

New terms for the feature key 'Lipidation' ('LIPID' in the flat file):

  • O-hexanoyl serine
  • 3'-farnesyl-2',N2-cyclotryptophan

New terms for the feature key 'Modified residue' ('MOD_RES' in the flat file):

  • ADP-riboxanated arginine
  • 3'-bromotyrosine

RDF news

Change of the RDF representation of cross-references to RefSeq

We have modified the RDF description of our cross-references to protein resources in the RefSeq database to provide explicit links to their corresponding RefSeq nucleotide sequences.

Previously, our cross-references showed the identifier of a RefSeq nucleotide sequence through an rdfs:comment property with a string literal value. We have replaced this by a translatedFrom property that links the RefSeq protein to a resource that corresponds to its coding sequence (CDS) and, similarly to the description of EMBL-CDS resources, this CDS links through a locatedOn property to the transcript or genomic sequence that contains it.

Note that the translatedFrom property is not the inverse of the translatedTo property that is used, for instance, in the description of Ensembl cross-references. The latter is used to link mature transcripts (whose sequence includes untranslated regions in addition to the CDS) to proteins.

Example: Q00266

Previous format:

uniprot:Q00266
  rdfs:seeAlso <http://purl.uniprot.org/refseq/NP_000420.1> .

<http://purl.uniprot.org/refseq/NP_000420.1>
  rdf:type up:Resource ;
  up:database <http://purl.uniprot.org/database/RefSeq> ;
  rdfs:comment "NM_000429.2" .

New format:

uniprot:Q00266
  rdfs:seeAlso <http://purl.uniprot.org/refseq/NP_000420.1> .

<http://purl.uniprot.org/refseq/NP_000420.1>
  rdf:type up:Resource ;
  up:database <http://purl.uniprot.org/database/RefSeq> ;
  up:translatedFrom <http://purl.uniprot.org/refseq/NM_000429.2#NP_000420.1_CDS> .

<http://purl.uniprot.org/refseq/NM_000429.2#NP_000420.1_CDS>
  up:locatedOn <http://purl.uniprot.org/refseq/NM_000429.2> .
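
For users who need to migrate data from the previous format, the new statements can be derived from the protein identifier and the nucleotide identifier that used to be stored in rdfs:comment. The sketch below assumes that the CDS IRI always follows the pattern visible in this one example (nucleotide IRI, then `#`, the protein identifier and `_CDS`); the function names are hypothetical.

```python
REFSEQ_BASE = "http://purl.uniprot.org/refseq/"


def refseq_cds_iri(protein_id: str, nucleotide_id: str) -> str:
    """Build the CDS IRI, assuming the pattern seen in the Q00266 example."""
    return f"{REFSEQ_BASE}{nucleotide_id}#{protein_id}_CDS"


def new_format_statements(protein_id: str, nucleotide_id: str) -> str:
    """Return the translatedFrom / locatedOn statements as Turtle text."""
    protein = f"<{REFSEQ_BASE}{protein_id}>"
    cds = f"<{refseq_cds_iri(protein_id, nucleotide_id)}>"
    nucleotide = f"<{REFSEQ_BASE}{nucleotide_id}>"
    return (
        f"{protein}\n"
        f"  up:translatedFrom {cds} .\n"
        "\n"
        f"{cds}\n"
        f"  up:locatedOn {nucleotide} .\n"
    )


print(new_format_statements("NP_000420.1", "NM_000429.2"))
```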

Change of the RDF representation of cross-references of tissues

To make the RDF representation of tissues consistent with that of the other controlled vocabularies we provide, the tissue cross-references now use the rdfs:comment predicate instead of the previously used rdfs:label predicate.

Example: TS-0430

Previous format:

<http://purl.uniprot.org/tissues/430>
  rdfs:seeAlso <http://purl.uniprot.org/po/0000262> .
<http://purl.uniprot.org/po/0000262>
  rdfs:label "trichoblast" ;
  up:database <http://purl.uniprot.org/database/PO> .

New format:

<http://purl.uniprot.org/tissues/430>
   rdfs:seeAlso <http://purl.uniprot.org/po/0000262> .
<http://purl.uniprot.org/po/0000262>
   rdfs:comment "trichoblast" ;
   up:database <http://purl.uniprot.org/database/PO> .

UniProt release 2021_04

Published November 17, 2021

Headline

ZTGC: bacteriophages reinvent the DNA alphabet

In 1977, Kirnos et al. isolated a DNA virus acting on blue-green algae, called cyanophage S-2L, from water samples taken on the outskirts of Leningrad (known today as Saint Petersburg). They reported a weird DNA composition for this virus. Instead of the usual adenine (6-aminopurine or A), the bacteriophage contained 2,6-diaminopurine (DAP), also known as 2-aminoadenine, abbreviated Z in recent publications. The presence of the additional amino group in adenine changes several DNA features. The most striking difference is that DAP forms 3 hydrogen bonds when paired with thymine, instead of the usual 2 bonds between adenine and thymine, hence forming a stronger, more stable pair. The presence of DAP confers selective advantages to bacteriophages, such as resistance to most restriction enzymes. Consequently, these viruses can evade the host bacterial defense system based on foreign DNA degradation.

In view of the complete substitution of adenine by DAP, Kirnos et al. proposed that DAP was incorporated into S-2L phage DNA as a ready-made nucleotide, rather than formed at the polynucleotide level as a result of amination of adenine residues within DNA. However, it took almost half a century to unravel the pathway of DAP synthesis and its incorporation into DNA. This pathway intermingles viral and bacterial enzymes in tight collaboration. The first step of DAP synthesis involves the condensation of aspartate with deoxyguanylate to form N6-succino-2-amino-2'-deoxyadenylate (dSMP). The reaction is catalyzed by phage PurZ. dSMP is then converted into 2-amino-2'-deoxyadenosine monophosphate (dZMP) by an enzyme encoded by purB. dZMP is further phosphorylated by guanylate kinase (gmk) and by nucleoside diphosphate kinase (ndk) to produce 2-amino-2'-deoxyadenosine-5'-triphosphate (dZTP). All three enzymes, encoded by the purB, gmk and ndk genes, are provided by the bacterial host. dZTP is incorporated into phage DNA by viral DpoZ polymerase, which shares similarity with the Klenow fragment of Escherichia coli DNA polymerase I. The selective incorporation of dZTP rather than dATP into phage DNA is still a matter of debate. Pezo et al. showed that DpoZ has a strong preference for dZTP over dATP. However, Czernecki and Zhou proposed another mechanism of dATP exclusion, in which another phage enzyme, a dATPase called DatZ, would break up dATP, while preserving dZTP. The removal of dATP (and of its precursor dADP) from the nucleotide pool of the host would prevent the incorporation of adenine into the phage genome.

If cyanophage S-2L looked like a strange outlier 50 years ago, nowadays its amazing DNA composition appears to be more widespread than anticipated. PurZ has been identified in over a hundred bacteriophages that infect a variety of microorganisms, including cyanobacteria, proteobacteria and actinobacteria, and the presence of Z-genomes has been experimentally proven in several of them. DAP usage to convey genetic information may be ancient. Phylogenetic trees of PurZ and DpoZ indicate that DAP may have been used among siphoviruses since the evolutionary divergence of actinobacteria, cyanobacteria, and proteobacteria some 3.5 billion years ago. Moreover DAP, along with other nucleobases, has been found in meteorites, suggesting it could have been available before the advent of life for constructing the first genetic molecules.

As of this release, representative entries for the proteins involved in viral Z-genome manufacturing have been manually annotated and are publicly available. These entries include viral PurZ, DNA polymerase DpoZ and dATPase DatZ and bacterial PurB, Gmk and Ndk. Viral PurZ proteins can also be identified with a newly created HAMAP family profile.

It should be noted that DAP was first abbreviated 'D', but more recently the abbreviation 'Z' took over and DNA containing this nucleotide was called 'Z-DNA'. This latter nomenclature may be misleading. The term 'Z-DNA' classically refers to a specific left-handed double helical structure in which the helix winds to the left in a zigzag pattern, instead of to the right, like the more common B-DNA form. This type of structure does not contain 2,6-diaminopurine.

UniProtKB news

Changes to the controlled vocabulary of human diseases

New diseases:

Modified diseases:

Changes to the controlled vocabulary for PTMs

New term for the feature key 'Cross-link' ('CROSSLNK' in the flat file):

  • 4-(1-hydroxyethyl)-7-isoleucino-2-(threonin-O3-ylcarbonyl)-7,8-dihydroquinolin-8-ol (Ile-Thr)

New term for the feature key 'Lipidation' ('LIPID' in the flat file):

  • Phosphatidylserine amidated glycine

New terms for the feature key 'Modified residue' ('MOD_RES' in the flat file):

  • 5-glutamyl glycine
  • ADP-alpha-D-ribosylarginine

UniProt release 2021_03

Published June 2, 2021

Headline

The importance of being disordered

Intrinsically disordered regions are protein regions that lack a fixed or ordered three-dimensional structure, typically in the absence of interaction partners. They are thought to often drive liquid-liquid phase separation within cells. This process leads to the formation of biomolecular condensates devoid of a surrounding lipid membrane, which are responsible for extensively compartmentalizing eukaryotic cells. Liquid-liquid phase separation plays crucial roles in fundamental cellular processes, such as the stress response, among others. Indeed eukaryotic cells react to a wide range of stresses by assembling proteins and mRNAs into massive ribonucleoprotein stress granules (SGs) that regulate mRNA translation and degradation. Interestingly mRNAs modified through the methylation of adenosine (m6A) are enriched in SGs. The m6A modification is very common, with around 25% of mRNAs containing at least one m6A. It is recognized by a protein domain, called YTH, which is conserved from yeast to humans. In humans and mice, there are 3 such proteins, YTHDF1, YTHDF2 and YTHDF3. All 3 paralogs contain a C-terminal YTH domain and disordered regions are predicted in their N-terminal moieties. These proteins colocalize with stress granules.

Could YTHDF proteins and their modified RNA partners be active players in the process of stress granule formation? In vitro, RNAs containing at least 4 copies of m6A motifs, but not those with a single motif, dramatically enhance phase separation of YTHDF proteins. When transfected into cells, multivalent m6A RNA oligos promote localization of YTHDF proteins to stress granules. Moreover, depletion of YTHDF proteins inhibits SG formation and recruitment of mRNAs to SGs. Both the N-terminal intrinsically disordered region and the C-terminal m6A-binding YTH domain of YTHDF proteins are required for SG formation. It has been proposed that polymethylated mRNAs may act as a multivalent scaffold for the binding of YTHDF proteins, juxtaposing their low-complexity domains and thereby leading to phase separation.

As of this release, we have introduced predictions of long, intrinsically disordered regions into the UniProtKB annotation pipeline, using the MobiDB-lite method. YTHDF entries have been updated and are now publicly available.

UniProt knowledge graph news

Using UniProt in a public cloud is now easier than ever, as UniProt is now available as part of the AWS open data program. AWS has published a tutorial (https://aws.amazon.com/blogs/industries/exploring-the-uniprot-protein-knowledgebase-with-aws-open-data-and-amazon-neptune/) on how you can use the public UniProt data in your own private graph database using AWS Neptune. This shows how the I (for Interoperability) in FAIR UniProt works with commercial cloud-based tools. Of course the UniProt Knowledge Graph remains freely available at sparql.uniprot.org.

UniProtKB news

Introduction of MobiDB-lite predictions for intrinsically disordered regions

Starting with this release of UniProt, we import predictions of intrinsically disordered regions and regions of compositional bias, generated with the MobiDB-lite method, in UniProtKB/Swiss-Prot. These computationally generated annotations are represented as Region and Compositional bias annotations tagged with the evidence code ECO:0000256 (sequence model evidence used in automatic assertion) and source SAM:MobiDB-lite. Entries with disordered regions can be retrieved using these queries:

annotation:(type:region disordered) AND reviewed:yes
annotation:(type:compbias) AND reviewed:yes

In cases where there is experimental evidence for intrinsically disordered regions, curators can modify the predicted results, or add new expert-curated regions, using the appropriate experimental evidence attribution.

Predictions from the MobiDB-lite method are also available for protein sequences in UniProtKB/TrEMBL - tagged with the same evidence code and source as in reviewed entries. They can be obtained by using these queries:

annotation:(type:region disordered) AND reviewed:no
annotation:(type:compbias) AND reviewed:no
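
When these queries are issued from a script rather than typed into the website search box, they need to be URL-encoded. The sketch below only composes and encodes the query strings listed above; it deliberately does not name an endpoint, and the parameter name `query` as well as the function names are assumptions made for illustration.

```python
from urllib.parse import urlencode


def disorder_queries(reviewed: bool) -> list:
    """Return the two MobiDB-lite related queries for reviewed or unreviewed entries."""
    status = "yes" if reviewed else "no"
    return [
        f"annotation:(type:region disordered) AND reviewed:{status}",
        f"annotation:(type:compbias) AND reviewed:{status}",
    ]


def as_query_parameter(query: str) -> str:
    """URL-encode a query string as a 'query=...' parameter (assumed parameter name)."""
    return urlencode({"query": query})


for q in disorder_queries(reviewed=True):
    print(q, "->", as_query_parameter(q))
```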

Changes to the controlled vocabulary of human diseases

New diseases:

Modified diseases:

Deleted disease

  • Amyotrophic lateral sclerosis 17

Changes to the controlled vocabulary for PTMs

New term for the feature key 'Glycosylation' ('CARBOHYD' in the flat file):

  • N-alpha-linked (Rha) arginine

Modified term for the feature key 'Glycosylation' ('CARBOHYD' in the flat file):

  • N-linked (GlcNAc) arginine -> N-beta-linked (GlcNAc) arginine

Changes in subcellular location controlled vocabulary

New subcellular locations:

Changes to keywords

New keyword:

UniProt release 2021_02

Published April 7, 2021

Headline

With a little help from my friend

One of the early responses to eukaryotic DNA damage is ADP-ribosylation of histones and other proteins in the vicinity of the lesion. This post-translational modification, which is important for the decompaction of chromatin and the recruitment of repair factors, can be catalyzed by two related enzymes: PARP1 or PARP2. In the absence of binding partners, the PARPs catalyze ADP-ribosylation on aspartate or glutamate residues. However, in the context of DNA damage, ADP-ribosylation predominantly occurs on serine residues. The switch in amino acid specificity requires the presence of a protein cofactor, HPF1. Its mechanism was unraveled recently through the study of the X-ray crystal structure of the PARP2-HPF1 complex.

Under resting, or low stress conditions, the interaction between PARP1 or PARP2 and HPF1 is limited by an inhibitory domain in PARP. In response to high and acute levels of DNA damage, PARP binds DNA and undergoes conformational changes, the inhibitory domain is unfolded, and the interaction with HPF1 is stabilized. Within the complex, PARP2 and HPF1 form a new joint active site with an HPF1 glutamate residue positioned at the very core of the enzyme. Mutagenesis of this glutamate residue to alanine does not affect HPF1 binding to PARP2, but does impair serine ADP-ribosylation. This glutamate within the active site is thought to allow the deprotonation of the serine residue, making it a favorable target for nucleophilic attack. This step is dispensable when the substrate is an aspartate or glutamate residue that is deprotonated at neutral pH. In addition to its active role in catalysis per se, HPF1 may also participate in substrate recognition. Indeed, a putative peptide-binding canyon may form at the interface of HPF1 and PARP2. This site seems perfectly suited for binding lysine-serine motifs that are highly enriched among serine-ADP-ribosylation substrates in vivo.

This discovery goes beyond the excitement of unraveling a new mechanism in a crucial process in DNA repair. It may also have an impact in a clinical setting. Small-molecule PARP inhibitors have been approved for the treatment of BRCA-negative breast, ovarian and fallopian tube cancers, and it has been previously shown that human cells lacking HPF1 exhibit sensitivity to DNA damaging agents and PARP inhibition. Therefore it may be of interest to re-evaluate the potency and selectivity of existing PARP inhibitors in the presence of HPF1-PARP1/2 complexes and to develop new drugs that would specifically interfere with HPF1-mediated PARP1/2 activity, but not with PARP1/2 enzymatic activity on non-serine residues.

As of this release, HPF1, PARP1 and PARP2 entries have been updated and are publicly available.

UniProt website news

Visualization of subcellular location annotation using SwissBioPics

As of this release, we are using the SwissBioPics library of interactive biological images for the visualization of subcellular location data to enhance the representation of UniProt and Gene Ontology (GO) subcellular location annotations.

SwissBioPics covers cell types from all kingdoms of life - ranging from muscle, neuronal and epithelial cells of animals, to the rods, cocci, clubs, spirals and other more exotic forms of bacteria and archaea. A reusable web component and an API allow website developers to visualize subcellular location data (in the form of GO cellular component or UniProt subcellular location identifiers) on these images. The code and technical documentation are available at npmjs.com.

UniProtKB news

Change of evidence codes for combinatorial evidence

When UniProt adopted the Evidence Code Ontology (ECO) in 2014, we chose to use the concepts ECO:0000244 in manual assertions and ECO:0000213 in automatic assertions, respectively, for information inferred from a combination of experimental and computational evidence. These two ECO concepts have in fact a broader meaning that includes combinations of any type of evidence, and meanwhile the ECO has been extended with concepts that exactly reflect our usage. We have therefore replaced ECO:0000244 by ECO:0007744 and ECO:0000213 by ECO:0007829.
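
Scripts that filter on the old codes need to be updated accordingly. A minimal sketch of such an update is shown below; it covers only the two replacements named above, and the sample evidence string is a made-up illustration rather than text taken from a real entry.

```python
import re

# Old code -> new code, as announced in this release.
ECO_REPLACEMENTS = {
    "ECO:0000244": "ECO:0007744",  # combinatorial evidence, manual assertion
    "ECO:0000213": "ECO:0007829",  # combinatorial evidence, automatic assertion
}

_ECO_ID = re.compile(r"ECO:\d{7}")


def update_eco_codes(text: str) -> str:
    """Replace the two superseded ECO identifiers and leave all others untouched."""
    return _ECO_ID.sub(lambda m: ECO_REPLACEMENTS.get(m.group(0), m.group(0)), text)


# Hypothetical evidence tags, for illustration only.
sample = "{ECO:0000244|PubMed:12345678} {ECO:0000269|PubMed:12345678}"
print(update_eco_codes(sample))
```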

Changes to the controlled vocabulary of human diseases

New diseases:

Modified diseases:

Changes to the controlled vocabulary for PTMs

New terms for the feature key 'Modified residue' ('MOD_RES' in the flat file):

  • O-di-AMP-tyrosine
  • O-tri-AMP-tyrosine
  • 5-glutamyl dopamine
  • 5-glutamyl noradrenaline
  • 5-glutamyl serotonin

Modified term for the feature key 'Modified residue' ('MOD_RES' in the flat file):

  • L-isoglutamyl histamine -> 5-glutamyl histamine

Changes in subcellular location controlled vocabulary

New subcellular locations:

Modified subcellular location:

UniProt release 2021_01

Published February 10, 2021

Headline

(Almost) all about that CBASS

Bacteria and archaea, like all other living organisms, must defend themselves against viral (bacteriophage or phage) attack. It is estimated that there are 10^31 bacteriophages on the planet, found in all habitats: sea water, animal gastrointestinal tracts, oceanic basement... A number of anti-phage defense mechanisms are known: restriction nucleases, an Argonaute-like system, CRISPR, and more are being discovered as we explore less well-known archaea and bacteria. A new phage defense mechanism has been recently discovered and is being characterized, the CBASS system (Cyclic oligonucleotide-Based Anti-phage Signalling System).

At the heart of each CBASS system is a cyclic nucleotide synthase. These belong to the cGAS/DncV-like nucleotidyltransferase family (abbreviated CD-NTase), and they make cyclic di-, tri-, and possibly even tetranucleotides. The cyclic nucleotides activate an effector protein, encoded adjacently to the CD-NTase. The effectors have a variety of activities: nucleases that degrade all DNA, including viral DNA (NucC, Cap4), an NAD+ hydrolase that presumably depletes cellular NAD+ (Cap12), phospholipases that are capable of degrading the cell membrane (CapV, CapE), and transmembrane proteins that may form pores (Cap13) have all been observed, although not all have been characterized yet.

Transforming CBASS into bacteria without a natural CBASS system protects the population against phages. The effectors lead to cell death before the phages have completed an infection cycle, thus protecting adjacent uninfected cells from infection. There are at least 4 types of CBASS systems, classified on the basis of the presence or absence of ancillary genes encoded in the same locus. Type I systems have no extra genes. In type II systems, the ancillary proteins Cap2 and Cap3 are not necessary for protection against all phages, but they enlarge the range of phages against which the system is active. In type III systems, the ancillary proteins Cap6 and Cap7 control the activity of the system; Cap7 is required to activate the CD-NTase, while Cap6 prevents the Cap7-CD-NTase association. A subtype has a second Cap7-like gene called Cap8 which may act as a scaffold for complex assembly. Type IV systems are rare and so far uncharacterized. They occur mostly in Firmicutes and archaea.

At the request of Dr. Philip Kranzusch, the lead author on many of these papers, characterized CBASS proteins have been annotated. We'd like to thank Dr. Kranzusch, members of his lab, and Dr. R. Sorek for answering annotation questions. We also thank you for your interest in our knowledgebase and remind you that we are always looking for your input on our entries.

UniProtKB news

Change of the cross-references to EuPathDB: renamed to VEuPathDB

We have updated our cross-references to reflect the name change of EuPathDB into VEuPathDB, following the inclusion of genomes of invertebrate vectors of human pathogens.

Removal of the cross-references to VectorBase

Direct cross-references to VectorBase have been removed, as VectorBase is now part of the VEuPathDB resources.

Removal of the cross-references to UniCarbKB

Cross-references to UniCarbKB have been removed.

Changes to the controlled vocabulary of human diseases

New diseases:

Modified diseases:

Changes in subcellular location controlled vocabulary

New subcellular locations:

The humsavar.txt file lists all missense variants annotated in UniProtKB/Swiss-Prot human entries. It provides a variant classification which is intended for research purposes only, not for clinical and diagnostic use. Variants were previously classified into the categories Disease, Polymorphism and Unclassified. We have renamed these categories to follow the terminology recommended by the American College of Medical Genetics and Genomics/Association for Molecular Pathology (ACMG/AMP) (Richards et al. PubMed:25741868):

Previous category  New category  Description
_________________  ____________  ___________________________________
Disease            LP/P          likely pathogenic or pathogenic
Polymorphism       LB/B          likely benign or benign
Unclassified       US            uncertain significance

Previous format:

Main      Swiss-Prot             AA             Type of
gene name AC         FTId        change         variant       dbSNP          Disease name
_________ __________ ___________ ______________ _____________ ______________ _____________________
...
CC2D2A    Q9P2K1     VAR_075698  p.Trp1182Arg   Disease       rs386833755    Joubert syndrome 9 (JBTS9) [MIM:612285]
CC2D2A    Q9P2K1     VAR_076881  p.Ser117Arg    Unclassified  rs186264635    Joubert syndrome 9 (JBTS9) [MIM:612285]
CC2D2A    Q9P2K1     VAR_076882  p.Lys507Glu    Polymorphism  rs144439937    Joubert syndrome 9 (JBTS9) [MIM:612285]
...

New format:

Main      Swiss-Prot             AA             Variant
gene name AC         FTId        change         category dbSNP          Disease name
_________ __________ ___________ ______________ ________ ______________ _____________________
...
CC2D2A    Q9P2K1     VAR_075698  p.Trp1182Arg   LP/P     rs386833755    Joubert syndrome 9 (JBTS9) [MIM:612285]
CC2D2A    Q9P2K1     VAR_076881  p.Ser117Arg    US       rs186264635    Joubert syndrome 9 (JBTS9) [MIM:612285]
CC2D2A    Q9P2K1     VAR_076882  p.Lys507Glu    LB/B     rs144439937    Joubert syndrome 9 (JBTS9) [MIM:612285]
...
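
Parsers written for the previous format mainly need to accept the new category values. The sketch below reads data rows of the layout shown above by splitting on whitespace, which works for the example rows because only the last column (disease name) contains spaces; this is an assumption about the file, and the names `HumsavarRow`, `parse_humsavar_row` and `to_new_category` are hypothetical.

```python
from typing import NamedTuple, Optional

# Previous category -> new category, from the table above.
CATEGORY_RENAMES = {"Disease": "LP/P", "Polymorphism": "LB/B", "Unclassified": "US"}


class HumsavarRow(NamedTuple):
    gene: str
    accession: str
    ftid: str
    aa_change: str
    category: str
    dbsnp: str
    disease: str


def parse_humsavar_row(line: str) -> Optional[HumsavarRow]:
    """Parse one data row; return None for headers, rulers and '...' lines."""
    parts = line.split(None, 6)  # at most 7 fields; the last one keeps its spaces
    if len(parts) < 6 or not parts[2].startswith("VAR_"):
        return None
    disease = parts[6].strip() if len(parts) == 7 else ""
    return HumsavarRow(*parts[:6], disease)


def to_new_category(row: HumsavarRow) -> HumsavarRow:
    """Map a previous-format category to its new name; new names pass through."""
    return row._replace(category=CATEGORY_RENAMES.get(row.category, row.category))


old_line = ("CC2D2A    Q9P2K1     VAR_075698  p.Trp1182Arg   Disease       "
            "rs386833755    Joubert syndrome 9 (JBTS9) [MIM:612285]")
print(to_new_category(parse_humsavar_row(old_line)))
```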

Also in line with the ACMG/AMP guidelines, we have at the same time deprecated the keyword "Polymorphism" and renamed the keyword "Disease mutation" to "Disease variant". This was done because the terms 'mutation' and 'polymorphism', which have been widely used, often lead to confusion due to incorrect assumptions of pathogenic and benign effects, respectively.

Entries with variant annotations can be retrieved on the UniProt website with the query annotation:(type:variant).

UniProt FTP and website news

Changes to the FTP repository for reference proteomes

We currently distribute the UniProt reference proteomes on our FTP site in four taxonomic division folders (Archaea, Bacteria, Eukaryota and Viruses) and provide, for each proteome, its sequences in FASTA format and mappings from UniProt identifiers and gene names to those found in other databases. Starting from this release, we also publish the full protein records for a proteome in the UniProtKB text and XML format, and we have at the same time introduced a subfolder for each proteome that groups all its files in order to reduce the number of files in the taxonomic division folders.

Example: UP000005640

Previous FTP folder and files:

https://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/reference_proteomes/Eukaryota/

  • UP000005640_9606.fasta.gz
  • UP000005640_9606_additional.fasta.gz
  • UP000005640_9606_DNA.fasta.gz
  • UP000005640_9606_DNA.miss.gz
  • UP000005640_9606.idmapping.gz
  • UP000005640_9606.gene2acc.gz

New FTP folder and files:

https://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/reference_proteomes/Eukaryota/UP000005640/

  • UP000005640_9606.dat.gz (UniProtKB text format)
  • UP000005640_9606.xml.gz (UniProtKB XML format)
  • UP000005640_9606.fasta.gz
  • UP000005640_9606_additional.fasta.gz
  • UP000005640_9606_DNA.fasta.gz
  • UP000005640_9606_DNA.miss.gz
  • UP000005640_9606.idmapping.gz
  • UP000005640_9606.gene2acc.gz
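
Download scripts that relied on the previous flat layout need to add the proteome subfolder to their paths. The sketch below builds the new locations from the naming pattern seen in this example (proteome identifier, underscore, taxonomy identifier); the function name is hypothetical, and it is an assumption that every proteome offers every one of these files.

```python
BASE_URL = ("https://ftp.uniprot.org/pub/databases/uniprot/current_release/"
            "knowledgebase/reference_proteomes")

# File name suffixes taken from the UP000005640 listing above.
FILE_SUFFIXES = [
    ".dat.gz",               # UniProtKB text format
    ".xml.gz",               # UniProtKB XML format
    ".fasta.gz",
    "_additional.fasta.gz",
    "_DNA.fasta.gz",
    "_DNA.miss.gz",
    ".idmapping.gz",
    ".gene2acc.gz",
]


def proteome_file_urls(division: str, proteome_id: str, taxon_id: int) -> list:
    """Return the file URLs of one reference proteome in the new folder layout."""
    folder = f"{BASE_URL}/{division}/{proteome_id}"
    return [f"{folder}/{proteome_id}_{taxon_id}{suffix}" for suffix in FILE_SUFFIXES]


for url in proteome_file_urls("Eukaryota", "UP000005640", 9606):
    print(url)
```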

Browser support change: End of support for Microsoft Internet Explorer

We have phased out support for Microsoft's Internet Explorer (IE 11), since our user base for this web browser is continuously decreasing and has significantly fallen below 1.5%. Please switch to a more recent and supported browser in order to enjoy the full functionality of the UniProt website.

Even if you are not using IE 11, it may be a good idea to look at the list of the browsers we support and consider switching to the most recent version of your browser for the best user experience and security.

UniProt release 2020_06

Published December 2, 2020

Headline

Venoms, gold mines for new antiprotozoal drugs

Neglected diseases are typically tropical infections, which are common in low-income populations in developing regions of Africa, Asia, and the Americas and affect more than one billion people. Among them, Chagas disease, also known as American trypanosomiasis, affects an estimated 6 to 7 million people worldwide. This disease is caused by the parasitic protozoan Trypanosoma cruzi. The disease is typically transmitted to humans and other mammals by the bite of infected triatomine insects, also known as kissing bugs or vampire bugs. Once inside the host, the protozoan invades cells near the site of inoculation, where it differentiates into an intracellular parasitic form called the amastigote. Amastigotes multiply, differentiate into trypomastigotes which burst the host cell, and are released into the circulation, from where they can invade a variety of tissues and repeat the same infectious cycle in new sites. In the early stage of infection, symptoms are mild, if any, and may include fever, swollen lymph nodes, headaches, or swelling at the site of the bite. After a few weeks, untreated individuals enter the chronic phase of disease and most do not show any further symptoms. However, in the long term (10-30 years after the initial illness), chronic infection may lead to cardiomyopathies and digestive tract pathologies, and up to 10% of people experience nerve damage.

There are only two drugs currently used to treat Chagas disease, benznidazole and nifurtimox. Unfortunately they are rarely beneficial during the chronic phase of the disease and can cause severe adverse effects. Moreover, resistance to these drugs is emerging in various parasitic strains. In this context, the observation that crude venom from a Brazilian ant called Dinoponera quadriceps had antichagasic activity and low toxicity in vitro offers hope for the development of alternative therapies for Chagas disease. A venom component bearing this activity was identified as a 23-amino acid long peptide, called M-poneratoxin-Dq3a (M-PONTX-Dq3a). This peptide induced trypanosome necrosis and acted on the three parasitic forms: epimastigote (found in the gut of the vector insect), infectious trypomastigote and intracellular amastigote, suggesting that M-PONTX-Dq3a could be beneficial during the chronic phase of the infection. These effects were observed at concentrations low enough to avoid any toxicity for mammalian host cells, in contrast to benznidazole.

Ants are not the only organisms that can give us a hand in the fight against T. cruzi. Snakes, such as Bothrops atrox and Crotalus durissus terrificus, and wasps, such as Polybia paulista, have also been shown to produce antichagasic toxins. All venom toxins characterized so far were active against the three T. cruzi forms at concentrations that did not harm host cells. Their modes of action could differ, but all exhibited a high selectivity index (a ratio that measures the window between cytotoxicity and antimicrobial activity), which clearly qualifies them for further study in the development of new drugs.

Animal toxins with antichagasic properties have been manually annotated in UniProtKB/Swiss-Prot and are publicly available as of this release.

UniProtKB news

Removal of the cross-references to KO

Cross-references to KO (KEGG Orthology) have been removed.

Changes to the controlled vocabulary of human diseases

New diseases:

Modified diseases:

UniProt release 2020_05

Published October 7, 2020

Headline

PCK1 vacillating between gluconeogenesis and lipogenesis

Lipid metabolism is a tightly regulated process, which relies on sterol regulatory element-binding protein (SREBP) transcription factors for the activation of genes involved in the synthesis of fatty acids, triglycerides, and cholesterol. In resting conditions, when the levels of cholesterol are sufficient, SREBPs are present as precursors in the endoplasmic reticulum (ER) membrane in complex with a protein called SCAP. SCAP also interacts with a member of the ER resident INSIG protein family. The interaction with INSIG ensures that the SCAP-SREBP complex is retained in the ER. When cholesterol levels drop, INSIG and SCAP no longer bind and the SCAP-SREBP complex is transported to the Golgi apparatus, where SREBP is cleaved and its cytosolic N-terminal transcription factor domain is freed to translocate to the nucleus and activate transcription.

There are cells, however, that do not bother about regulation and just desperately need lipogenesis to proliferate. These are cancer cells. How do they synthesize enough lipids at normal sterol levels, in conditions in which lipogenesis would normally be inhibited? The answer came from a study in hepatocellular carcinoma (HCC) cell lines. In these cells, activation with IGF1, a stimulus critical for HCC development, leads to a cascade of phosphorylation, which starts with AKT1. Activated AKT1 in turn phosphorylates PCK1, which is then translocated from the cytosol to the ER and phosphorylates INSIG. This impairs INSIG binding to the SCAP-SREBP complex, resulting in the release of the complex from the ER to the Golgi apparatus, with subsequent SREBP activation.

The discovery of PCK1 involvement in this process comes as a surprise. PCK1 is not known to be involved in lipogenesis regulation, nor to have a kinase activity. It is a rate-limiting enzyme of gluconeogenesis, a pathway that generates glucose from certain non-carbohydrate carbon substrates. In this pathway, it converts oxaloacetate into phosphoenolpyruvate. PCK1 phosphorylation by AKT1 uncovers its cryptic kinase activity and redirects it to a completely different process. Cancer cells favor glycolysis to provide energy and metabolic intermediates, and suppress gluconeogenesis. For these cells, PCK1 diversion is necessary for cell growth, as expression of a PCK1 nonphosphorylatable mutant inhibits cell proliferation, and is costless, as inhibition of gluconeogenesis is not deleterious. Although the results have been produced in HCC cells, this process may also play a role in other cancer types. Indeed, AKT1 activation has been shown in melanoma, glioblastoma and non-small cell lung carcinoma cells.

PCK1-induced activation of SREBP1 may also occur in normal hepatocytes. In mice that were refed with glucose after 24 hours of fasting, phosphorylation of AKT1, PCK1, and INSIG proteins, as well as cleavage of SREBP1, were markedly enhanced in normal liver. These observations suggest that in vivo blood glucose levels may regulate the PCK1-mediated phosphorylation of INSIG proteins and the activation of SREBP. This might point toward a potential mechanism underlying overnutrition-promoted nonalcoholic fatty liver diseases.

As of this release, proteins involved in this new regulatory pathway have been updated and are available in UniProtKB/Swiss-Prot.

UniProtKB news

Cross-references to CPTC

Cross-references have been added to CPTC, the CPTAC Antibody Portal. This portal serves as a National Cancer Institute (NCI) community resource that provides access to a large number of standardized renewable affinity reagents (to cancer-associated targets) and accompanying characterization data.

CPTC is available at https://proteomics.cancer.gov/antibody-portal.

The format of the explicit links is:

Resource abbreviation: CPTC
Resource identifier: UniProtKB accession number
Optional information 1: Number of antibodies

Example: P31751

Show all entries having a cross-reference to CPTC.

Text format

Example: P31751

DR   CPTC; P31751; 6 antibodies.
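
The DR line layout used here (database name, identifier, optional information, closing period) can be read with a few lines of code. The following is a minimal sketch in Python rather than an official UniProt tool. The names `CrossReference` and `parse_dr_line` are hypothetical, and the sketch assumes that fields are separated by "; " and that a dash means "no optional information", as in the examples in these release notes.

```python
from typing import NamedTuple, Optional


class CrossReference(NamedTuple):
    database: str
    identifier: str
    info: Optional[str]  # None when the flat file shows "-"


def parse_dr_line(line: str) -> CrossReference:
    """Split a flat-file DR line into database, identifier and optional info."""
    if not line.startswith("DR   "):
        raise ValueError(f"not a DR line: {line!r}")
    # Drop the line code and the terminating period, then split on "; ".
    body = line[5:].rstrip()
    if body.endswith("."):
        body = body[:-1]
    fields = body.split("; ")
    if len(fields) < 2:
        raise ValueError(f"DR line has too few fields: {line!r}")
    database, identifier = fields[0], fields[1]
    # Assumption: everything after the identifier is optional information.
    info = "; ".join(fields[2:]) or None
    if info == "-":
        info = None
    return CrossReference(database, identifier, info)


if __name__ == "__main__":
    print(parse_dr_line("DR   CPTC; P31751; 6 antibodies."))
    print(parse_dr_line("DR   BMRB; A1D240; -."))
```

Returning `None` for the dash lets calling code treat "no optional field" the same way across resources that do and do not supply one.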

XML format

Example: P31751

<dbReference type="CPTC" id="P31751">
  <property type="antibodies" value="6 antibodies"/>
</dbReference>

RDF format

Example: P31751

uniprot:P31751
  rdfs:seeAlso <http://purl.uniprot.org/cptc/P31751> .
<http://purl.uniprot.org/cptc/P31751>
  rdf:type up:Resource ;
  up:database <http://purl.uniprot.org/database/CPTC> ;
  rdfs:comment "6 antibodies" .

Cross-references to BMRB

Cross-references have been added to the BMRB database, the Biological Magnetic Resonance Data Bank. BMRB collects, annotates, archives, and disseminates spectral and quantitative data derived from NMR spectroscopic investigations of biological macromolecules and metabolites.

BMRB is available at https://bmrb.io/.

The format of the explicit links is:

Resource abbreviation: BMRB
Resource identifier: UniProtKB accession number

Example: A1D240

Show all entries having a cross-reference to BMRB.

Text format

Example: A1D240

DR   BMRB; A1D240; -.

XML format

Example: A1D240

<dbReference type="BMRB" id="A1D240"/>

RDF format

Example: A1D240

uniprot:A1D240
  rdfs:seeAlso <http://purl.uniprot.org/bmrb/A1D240> .
<http://purl.uniprot.org/bmrb/A1D240>
  rdf:type up:Resource ;
  up:database <http://purl.uniprot.org/database/BMRB> .

Cross-references to PCDDB

Cross-references have been added to the PCDDB database, the Protein Circular Dichroism Data Bank. PCDDB is a public repository that archives and freely distributes circular dichroism (CD) and synchrotron radiation CD spectral data and their associated experimental metadata.

PCDDB is available at https://pcddb.cryst.bbk.ac.uk/.

The format of the explicit links is:

Resource abbreviation: PCDDB
Resource identifier: UniProtKB accession number

Example: Q70Q12

Show all entries having a cross-reference to PCDDB.

Text format

Example: Q70Q12

DR   PCDDB; Q70Q12; -.

XML format

Example: Q70Q12

<dbReference type="PCDDB" id="Q70Q12"/>

RDF format

Example: Q70Q12

uniprot:Q70Q12
  rdfs:seeAlso <http://purl.uniprot.org/pcddb/Q70Q12> .
<http://purl.uniprot.org/pcddb/Q70Q12>
  rdf:type up:Resource ;
  up:database <http://purl.uniprot.org/database/PCDDB> .

Cross-references to SASBDB

Cross-references have been added to the SASBDB database, the Small Angle Scattering Biological Data Bank. SASBDB is a fully searchable curated repository of freely accessible and downloadable experimental scattering data, which are deposited together with the relevant experimental conditions, sample details, derived models and their fits to the data.

SASBDB is available at https://www.sasbdb.org/.

The format of the explicit links is:

Resource abbreviation: SASBDB
Resource identifier: UniProtKB accession number

Example: Q15326

Show all entries having a cross-reference to SASBDB.

Text format

Example: Q15326

DR   SASBDB; Q15326; -.

XML format

Example: Q15326

<dbReference type="SASBDB" id="Q15326"/>

RDF format

Example: Q15326

uniprot:Q15326
  rdfs:seeAlso <http://purl.uniprot.org/sasbdb/Q15326> .
<http://purl.uniprot.org/sasbdb/Q15326>
  rdf:type up:Resource ;
  up:database <http://purl.uniprot.org/database/SASBDB> .

Changes to the controlled vocabulary of human diseases

New diseases:

Modified diseases:

Deleted diseases

  • Arthrogryposis, distal, 8
  • Ectodermal dysplasia, anhidrotic, with immunodeficiency, osteopetrosis and lymphedema
  • Immunodeficiency, NEMO-related, without anhidrotic ectodermal dysplasia
  • Recurrent isolated invasive pneumococcal disease 1
  • Recurrent isolated invasive pneumococcal disease 2

Changes to the controlled vocabulary for PTMs

New terms for the feature key 'Glycosylation' ('CARBOHYD' in the flat file):

  • O-alpha-linked (GlcNAc) threonine
  • O-linked (Glc) threonine

Changes to keywords

New keyword:

UniProt release 2020_04

Published August 12, 2020

Headline

Inflammation: the Good, the Bad and the Ugly

In a world filled with (microbial) foes, inflammation is our best friend, helping us fight against invading microorganisms, promoting wound healing and tissue regeneration. When it gets out of hand, however, it becomes our worst enemy. For instance, inflammation is required for mice to fight influenza A virus (IAV) infections, but animals challenged with high doses of virus experience elevated levels of inflammation and destruction of airway epithelia, and eventually die, unless they are deficient in MLKL protein, an endogenous protein that induces cell death in response to orthomyxovirus infection.

The mechanism leading to this fatal outcome has been recently elucidated. When IAV infects cells and replicates, newly formed IAV RNA duplexes can adopt the Z-conformation. These very peculiar RNAs are recognized by ZBP1 protein as a pathogen-associated molecular pattern (PAMP) and initiate a cascade of reactions. Activated ZBP1 recruits RIPK3, which in turn can induce one of two forms of programmed cell death, apoptosis and pro-inflammatory necroptosis. In the presence of RIPK1, apoptosis is favored. In the absence of RIPK1, RIPK3 phosphorylates MLKL, which then induces necroptosis. Programmed cell death is an effective mechanism of IAV clearance that not only eliminates infected cells to limit virus spread but also serves to catalyze adaptive immune responses. When cell death is unrestrained, or when the mode of cell death is primarily necrotic, then injury and severe illness ensue, despite virus clearance.

The dark side of inflammation also includes chronic inflammatory diseases and, although progress has been made in this field, the causes that initiate pathogenic inflammatory responses still remain elusive. Could ZBP1 be a player in this deadly game? In other words, can ZBP1 be activated in the absence of any viral infections? An answer to these questions came from a recent publication by Jiao et al. The authors showed that endogenous retroelements (EREs) produce double-stranded RNAs (dsRNAs) adopting a Z-conformation. These dsRNAs are recognized by ZBP1 and activate it. EREs derive either from ancient retroviral infections or from active retrotransposons and constitute over half of the human and mouse genomes. Most dsRNAs identified by Jiao et al. originate from B2 and Alu short interspersed nuclear elements, followed by long interspersed nuclear elements (specifically, L1 elements) and long terminal repeat elements. Hence, these old mates of ours may still be able to mimic viral infections. Indeed, in RIPK1 knockout mice, deregulation of ERE expression may trigger ZBP1-dependent skin inflammation. This mechanism may be relevant for the pathogenesis of inflammatory pathologies in humans, particularly in patients with variations in proteins that inhibit the ZBP1-MLKL-induced necroptotic pathway, such as RIPK1.

As of this release, ZBP1, RIPK3, and MLKL have been updated and are available in UniProtKB/Swiss-Prot.

UniProtKB news

Cross-references to GlyGen

Cross-references have been added to the GlyGen Computational and Informatics Resource for Glycoscience. GlyGen retrieves information from multiple international data sources, integrates and harmonizes this data and makes it searchable.

GlyGen is available at https://www.glygen.org.

The format of the explicit links is:

Resource abbreviation: GlyGen
Resource identifier: UniProtKB accession number
Optional information 1: Glycosylation details

Example: P07766

Show all entries having a cross-reference to GlyGen.

Text format

Example: P14210

DR   GlyGen; P14210; 5 sites, 8 N-linked glycans (4 sites), 1 O-linked glycan (1 site).
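
The optional glycosylation field is free text, but the examples in these release notes follow a regular pattern that can be broken down into counts. The sketch below is one possible way to do this in Python. The function name `parse_glycosylation` and both regular expressions are hypothetical and inferred only from the sample strings shown in this document, so other entries may contain wording they do not cover.

```python
import re
from typing import Dict, Optional, Tuple

# Matches e.g. "8 N-linked glycans (4 sites)" or "1 O-linked glycan (1 site)".
# "linked" may be capitalised, since both spellings occur in the samples.
_GLYCAN_RE = re.compile(r"(\d+) ([A-Z])-[Ll]inked glycans? \((\d+) sites?\)")
# Matches an optional leading total such as "5 sites".
_TOTAL_RE = re.compile(r"^(\d+) sites?\b")


def parse_glycosylation(
    details: str,
) -> Tuple[Optional[int], Dict[str, Tuple[int, int]]]:
    """Return (total sites or None, {linkage: (glycan count, site count)})."""
    total_match = _TOTAL_RE.match(details)
    total_sites = int(total_match.group(1)) if total_match else None
    per_linkage = {
        linkage: (int(glycans), int(sites))
        for glycans, linkage, sites in _GLYCAN_RE.findall(details)
    }
    return total_sites, per_linkage


if __name__ == "__main__":
    text = "5 sites, 8 N-linked glycans (4 sites), 1 O-linked glycan (1 site)"
    print(parse_glycosylation(text))
```

Because the field is descriptive text rather than a structured value, code that depends on it should tolerate strings that yield no matches.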

XML format

Example: P14210

<dbReference type="GlyGen" id="P14210">
  <property type="glycosylation" value="5 sites, 8 N-linked glycans (4 sites), 1 O-linked glycan (1 site)"/>
</dbReference>

RDF format

Example: P14210

uniprot:P14210
  rdfs:seeAlso <http://purl.uniprot.org/glygen/P14210> .
<http://purl.uniprot.org/glygen/P14210>
  rdf:type up:Resource ;
  up:database <http://purl.uniprot.org/database/GlyGen> ;
  rdfs:comment "5 sites, 8 N-linked glycans (4 sites), 1 O-linked glycan (1 site)" .

Cross-references to PathwayCommons

Cross-references have been added to PathwayCommons, a resource that aims to collect and disseminate biological pathway and interaction data. Data is collected from partner databases and is represented in the BioPAX standard. By representing data in BioPAX, PathwayCommons is able to provide a detailed representation of a variety of biological concepts, including biochemical reactions; gene regulatory networks; genetic interactions; transport and catalysis events; and physical interactions involving proteins, DNA, RNA, small molecules and complexes.

PathwayCommons is available at http://www.pathwaycommons.org.

The format of the explicit links is:

Resource abbreviation: PathwayCommons
Resource identifier: UniProtKB accession number

Example: P01042

Show all entries having a cross-reference to PathwayCommons.

Text format

Example: P01042

DR   PathwayCommons; P01042; -.

XML format

Example: P01042

<dbReference type="PathwayCommons" id="P01042"/>

RDF format

Example: P01042

uniprot:P01042
  rdfs:seeAlso <http://purl.uniprot.org/pathwaycommons/P01042> .
<http://purl.uniprot.org/pathwaycommons/P01042>
  rdf:type up:Resource ;
  up:database <http://purl.uniprot.org/database/PathwayCommons> .

Change to the cross-references to GlyConnect

We have introduced an additional field in the cross-references to the GlyConnect protein glycosylation database, a platform integrating sources of information to help characterize the molecular components of protein glycosylation. This allows us to provide some additional information, e.g. about the nature of the glycans and the number of glycosylation sites.

The format of the explicit links is:

Resource abbreviation: GlyConnect
Resource identifier: Resource identifier
Optional information 1: Glycosylation details

Text format

Example: P12763

Previous format:

DR   GlyConnect; 22; -.

New format:

DR   GlyConnect; 22; 49 N-Linked glycans (3 sites), 13 O-Linked glycans (6 sites).

XML format

Example: P12763

Previous format:

<dbReference type="GlyConnect" id="22"/>

New format:

<dbReference type="GlyConnect" id="22">
  <property type="glycosylation" value="49 N-Linked glycans (3 sites), 13 O-Linked glycans (6 sites)"/>
</dbReference>

This change does not affect the XSD, but may nevertheless require code changes.
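
As an illustration of the kind of code change this may involve, the sketch below reads a `dbReference` element in either the previous or the new form using Python's standard `xml.etree.ElementTree` module. The function name `glycosylation_details` is hypothetical. The fragments are parsed without a namespace, as they are printed in these release notes; a complete UniProtKB XML file is expected to declare a default namespace, in which case the tag lookup would need to be namespace-qualified.

```python
import xml.etree.ElementTree as ET
from typing import Optional

OLD_FORMAT = '<dbReference type="GlyConnect" id="22"/>'
NEW_FORMAT = (
    '<dbReference type="GlyConnect" id="22">'
    '<property type="glycosylation" '
    'value="49 N-Linked glycans (3 sites), 13 O-Linked glycans (6 sites)"/>'
    "</dbReference>"
)


def glycosylation_details(db_reference_xml: str) -> Optional[str]:
    """Return the 'glycosylation' property value, or None if it is absent."""
    element = ET.fromstring(db_reference_xml)
    for prop in element.findall("property"):
        if prop.get("type") == "glycosylation":
            return prop.get("value")
    return None


if __name__ == "__main__":
    print(glycosylation_details(OLD_FORMAT))
    print(glycosylation_details(NEW_FORMAT))
```

Treating the property as optional keeps the same code path working for cross-references that have not gained the additional field.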

RDF format

Example: P12763

Previous format:

uniprot:P12763
  rdfs:seeAlso <http://purl.uniprot.org/glyconnect/22> .
<http://purl.uniprot.org/glyconnect/22>
  rdf:type <http://purl.uniprot.org/core/Resource> ;
  <http://purl.uniprot.org/core/database> <http://purl.uniprot.org/database/GlyConnect> .

New format:

uniprot:P12763
  rdfs:seeAlso <http://purl.uniprot.org/glyconnect/22> .
<http://purl.uniprot.org/glyconnect/22>
  rdf:type up:Resource ;
  up:database <http://purl.uniprot.org/database/GlyConnect> ;
  rdfs:comment "49 N-Linked glycans (3 sites), 13 O-Linked glycans (6 sites)" .

New automatic annotation system ARBA (Association-Rule-Based Annotator)

In the UniProt Automatic Annotation pipeline, which enhances unreviewed (UniProtKB/TrEMBL) records by enriching them with automatic classification and annotation, the SAAS (Statistical Automatic Annotation System) component has been superseded by the more powerful ARBA (Association-Rule-Based Annotator). ARBA is a multiclass learning system trained on expertly annotated entries in UniProtKB/Swiss-Prot. ARBA uses rule mining techniques to generate concise annotation models with the highest representativeness and coverage for annotation, based on the properties of InterPro group membership and taxonomy. ARBA employs a data exclusion set that censors data not suitable for computational annotation (such as specific biophysical or chemical properties) and generates human-readable rules for each release. ARBA rules can be browsed and searched via the website.

Changes to the controlled vocabulary of human diseases

New diseases:

Changes to the controlled vocabulary for PTMs

New term for the feature key 'Cross-link' ('CROSSLNK' in the flat file):

  • Isoaspartyl glycine isopeptide (Gly-Asp)

New terms for the feature key 'Modified residue' ('MOD_RES' in the flat file):

  • 2-oxo-5,5-dimethylhexanoate
  • 3-methylisoleucine
  • 3-methylvaline
  • 3-methyl-D-valine
  • N4-methyl-D-asparagine
  • 3-hydroxyvaline (Thr)
  • (3S)-3-methylglutamine
  • 3-hydroxy-D-valine
  • (3R)-N4-methyl-3-hydroxy-D-asparagine
  • 3,3-dimethylmethionine
  • Diphosphoserine
  • Diphosphothreonine

Modified terms for the feature key 'Modified residue' ('MOD_RES' in the flat file):

  • Lactic acid -> D-lactate
  • 3-hydroxyvaline -> 3-hydroxyvaline (Val)

UniProt release 2020_03

Published June 17, 2020

Headline

Mitochondrial call for help

Life is a continuous chain of issues that have to be addressed for survival and our cells know all about that. Hypoxia, amino acid deprivation, glucose deprivation, viral infection, endoplasmic reticulum stress are but a few examples. That's why eukaryotic cells have developed an elaborate signaling pathway, called the integrated stress response (ISR), which is activated in the cytosol in response to a range of physiological changes and pathological conditions. Stress stimuli that activate ISR all converge on the phosphorylation of the alpha subunit of eukaryotic translation initiation factor 2 (EIF2A). Depending upon the stress stimulus, the reaction is catalyzed by one of the four following kinases: GCN2/EIF2AK4, PERK/EIF2AK3, PKR/EIF2AK2 and HRI/EIF2AK1. EIF2A phosphorylation leads to attenuation in 5' cap-dependent protein synthesis, while promoting the translation of selected mRNAs that harbor a short upstream open reading frame in their 5'-untranslated region, including those for transcription factors ATF4, ATF5 or DDIT3 (also known as CHOP). ISR is primarily a survival program, but exposure to severe stress can also lead to apoptosis.

Mitochondrial stress also strongly induces the expression of ATF4 and DDIT3, and hence triggers ISR, but the pathway signaling mitochondrial stress to the cytosol was elusive until the publication of two articles in March of this year. DDIT3 induction requires at least two mitochondrial proteins: OMA1 and DELE1, and the cytosolic kinase HRI. OMA1 is a metalloprotease located in the mitochondrial inner membrane. It is activated by mitochondrial dysfunction, possibly via membrane depolarization. OMA1 cleaves DELE1 in the intermembrane space and the N-terminally truncated DELE1 fragment enters the cytosol where it interacts with and activates HRI and hence stress response. The consequences of ISR during mitochondrial dysfunction are not yet fully understood and DELE1 may not be the only mitochondrial ISR activator, but one missing link between mitochondrial stress and ISR has been clearly established.

As of this release, the OMA1 protease and DELE1 have been updated and are available in UniProtKB/Swiss-Prot. To annotate DELE1, we took advantage of the new format announced last December, which allows us to describe product-specific features, be it that of an alternative splicing isoform, for example, or of a peptide resulting from proteolytic cleavage. EIF2AK1 activation is a function restricted to the DELE1 cleavage product, and this uniqueness is clearly reported in the 'FUNCTION' subsection dedicated to DELE1 short form.

UniProtKB news

Cross-references to IDEAL

Cross-references have been added to the IDEAL database, a database of Intrinsically Disordered proteins.

IDEAL is available at http://idp1.force.cs.is.nagoya-u.ac.jp/IDEAL/.

The format of the explicit links is:

Resource abbreviation: IDEAL
Resource identifier: Resource identifier

Example: O15162

Show all entries having a cross-reference to IDEAL.

Text format

Example: O15162

DR   IDEAL; IID00006; -.

XML format

Example: O15162

<dbReference type="IDEAL" id="IID00006"/>

RDF format

Example: O15162

uniprot:O15162
  rdfs:seeAlso <http://identifiers.org/ideal/IID00006> .
<http://identifiers.org/ideal/IID00006>
  rdf:type up:Resource ;
  up:database <http://purl.uniprot.org/database/IDEAL> .

Cross-references to BioGRID-ORCS

Cross-references have been added to the BioGRID-ORCS database, a database of CRISPR phenotype screens.

BioGRID-ORCS is available at https://orcs.thebiogrid.org.

The format of the explicit links is:

Resource abbreviation: BioGRID-ORCS
Resource identifier: Resource identifier
Optional information 1: Number of hits

Example: Q96A29

Show all entries having a cross-reference to BioGRID-ORCS.

Text format

Example: Q96A29

DR   BioGRID-ORCS; 55343; 19 Hits in 787 CRISPR Screens.

XML format

Example: Q96A29

<dbReference type="BioGRID-ORCS" id="55343">
  <property type="hits" value="19 Hits in 787 CRISPR Screens"/>
</dbReference>

RDF format

Example: Q96A29

uniprot:Q96A29
  rdfs:seeAlso <http://purl.uniprot.org/biogrid-orcs/55343> .
<http://purl.uniprot.org/biogrid-orcs/55343>
  rdf:type up:Resource ;
  up:database <http://purl.uniprot.org/database/BioGRID-ORCS> ;
  rdfs:comment "19 Hits in 787 CRISPR Screens" .

Change to the cross-references to ABCD

We have introduced an additional field in the cross-references to the ABCD (AntiBodies Chemically Defined) database, a manually curated depository of sequenced antibodies. This allows us to specify the number of sequenced antibodies available for a given protein in UniProtKB.

Text format

Example: O75084

Previous format:

DR   ABCD; O75084; -.

New format:

DR   ABCD; O75084; 10 sequenced antibodies.

XML format

Example: O75084

Previous format:

<dbReference type="ABCD" id="O75084"/>

New format:

<dbReference type="ABCD" id="O75084">
  <property type="antibodies" value="10 sequenced antibodies"/>
</dbReference>

This change does not affect the XSD, but may nevertheless require code changes.

RDF format

Example: O75084

Previous format:

uniprot:O75084
  rdfs:seeAlso <http://purl.uniprot.org/abcd/O75084> .
<http://purl.uniprot.org/abcd/O75084>
  rdf:type up:Resource ;
  up:database <http://purl.uniprot.org/database/ABCD> .

New format:

uniprot:O75084
  rdfs:seeAlso <http://purl.uniprot.org/abcd/O75084> .
<http://purl.uniprot.org/abcd/O75084>
  rdf:type up:Resource ;
  up:database <http://purl.uniprot.org/database/ABCD> ;
  rdfs:comment "10 sequenced antibodies" .

Change of the cross-references to MycoCLAP

The MycoCLAP resource has changed its name to CLAE and we have updated our cross-references to reflect this name change.

Cross-references to BioGRID

The BioGrid database was renamed BioGRID. We changed the database name in the relevant cross-references (DR lines in the flat file) accordingly.

Example:

DR   BioGRID; 198188; 12.

Cross-references to Unimod in the ptmlist.txt document file

The ptmlist.txt document, which is available by FTP and on the website, describes post-translational modifications (PTMs) annotated in the UniProt knowledgebase. This release sees the addition of optional cross-references from ptmlist.txt to Unimod, an open access database of protein modifications for use in mass spectrometry applications. Unimod provides molecular-level details of PTMs (both natural and non-natural), including molecular formula, target residues, monoisotopic and average mass shifts and literature references, with community-driven curation.

Example:

ID   (3R)-3-hydroxyasparagine
AC   PTM-0369
FT   MOD_RES
..
KW   Hydroxylation.
..
DR   Unimod; 35.

This new mapping to Unimod will facilitate the integration of data on PTMs identified by mass spectrometry based proteomics.
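
To show how the new DR lines might be consumed, here is a minimal Python sketch that groups ptmlist.txt lines into entries and extracts the Unimod identifiers. The function names `iter_ptm_entries` and `unimod_ids` are hypothetical. The sketch assumes that each entry is closed by a `//` line and that every data line starts with a two-letter line code followed by three spaces; neither detail is spelled out in the example above, so both should be checked against the actual file. The `..` lines in the example stand for omitted content and are skipped.

```python
from typing import Dict, Iterable, Iterator, List

SAMPLE = """\
ID   (3R)-3-hydroxyasparagine
AC   PTM-0369
FT   MOD_RES
..
KW   Hydroxylation.
..
DR   Unimod; 35.
"""


def iter_ptm_entries(lines: Iterable[str]) -> Iterator[Dict[str, List[str]]]:
    """Yield one dict per entry, mapping line codes (ID, AC, DR, ...) to values."""
    entry: Dict[str, List[str]] = {}
    for raw in lines:
        line = raw.rstrip("\n")
        if line.startswith("//"):  # assumed entry terminator
            if entry:
                yield entry
            entry = {}
            continue
        code, value = line[:2], line[5:].strip()
        # Keep only lines that look like "XX   value"; this skips "..".
        if len(line) > 5 and code.isalpha() and code.isupper():
            entry.setdefault(code, []).append(value)
    if entry:
        yield entry


def unimod_ids(entry: Dict[str, List[str]]) -> List[str]:
    """Return the Unimod identifiers found in an entry's DR lines."""
    ids = []
    for value in entry.get("DR", []):
        database, _, identifier = value.rstrip(".").partition("; ")
        if database == "Unimod":
            ids.append(identifier)
    return ids


if __name__ == "__main__":
    for ptm in iter_ptm_entries(SAMPLE.splitlines()):
        print(ptm["AC"], unimod_ids(ptm))
```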

We have currently mapped 236 of the most common PTMs in UniProtKB to Unimod and will continue to add new cross-references to Unimod in forthcoming releases. This mapping of PTMs to Unimod is part of our ongoing work on the standardization of knowledge of PTMs in UniProtKB by providing cross-references to a resource which is widely used in proteomics bioinformatics. We welcome your feedback on these current and future developments.

Changes to the controlled vocabulary of human diseases

New diseases:

Modified diseases:

Changes to the controlled vocabulary for PTMs

New term for the feature key 'Lipidation' ('LIPID' in the flat file):

  • 3'-prenyl-2',N2-cyclotryptophan

UniProt release 2020_02

Published April 22, 2020

Headline

Genome integrity maintenance by HMCES

Apurinic or apyrimidinic sites, also known as abasic or AP sites, are one of the most common DNA lesions. They occur at a frequency of about 15,000 per day in human cells. In double-stranded DNA, the majority of AP sites are removed by base excision repair. After removal of the lesion, the undamaged strand is used as a template for repair synthesis. AP sites also form in single-stranded DNA (ssDNA), but until recently there was no known mechanism involved in their repair in this context. A major breakthrough in the field was reported last year in Cell.

Mohni et al. were interested in HMCES. The full name of HMCES is 'stem cell-specific 5-hydroxymethylcytosine-binding protein'. It was originally thought to be a regulator of 5-hydroxymethylcytosine. However, it had also been identified in the replisome, a large protein machine that carries out DNA replication. HMCES is conserved in almost all organisms, even in those that do not utilize methylcytosine for epigenetic control. Taken together, these observations suggested that HMCES could bear another crucial function, possibly in replication. Surprisingly, HMCES knockout in cells did not affect DNA replication or cell division, but rather exacerbated cell sensitivity toward several DNA-damaging agents. Knockout cells accumulated DNA damage and exhibited increased genetic instability. Different DNA-damaging agents were tested and the only common kind of lesion they induced was the formation of AP sites.

HMCES appears to act as the initiating step of a replication-coupled repair mechanism for abasic sites in ssDNA. In eukaryotic cells, HMCES interacts with proliferating cell nuclear antigen (PCNA), an essential factor for replication, and travels with replication forks. When it senses AP sites in ssDNA, it covalently crosslinks to ssDNA AP sites generating a DNA-protein intermediate. The nature of this crosslink has been identified by crystallographic studies as a stable thiazolidine DNA-protein linkage formed between the N-terminal cysteine and the aldehyde form of the AP deoxyribose. The crosslink is so stable that its resolution requires HMCES degradation via the proteasome. This sequence of events may appear counterintuitive. It is almost as if HMCES takes a bad situation and makes it worse. However, this crosslink effectively shields the lesion from endonucleases and error-prone trans-lesion bypass (TLS) polymerases, such as REV1 and REV3L, and prevents mutagenesis they might engender. The DNA repair mechanism acting downstream of HMCES is not known.

As of this release, human HMCES, as well as YedK, an Escherichia coli homolog, have been updated and are available in UniProtKB/Swiss-Prot. The exact structure of the chemical crosslink was submitted to ChEBI where more details are provided.

UniProtKB news

Change of annotation topic 'Interaction'

The annotation topic 'Interaction' provides information about binary protein-protein interactions. This data is curated in the IntAct database and a quality-filtered subset is imported into UniProtKB at each release.

In the context of improving the functional annotation of different gene products in UniProtKB/Swiss-Prot, we have started to import more detailed data from IntAct. Our previous representation of a binary protein-protein interaction provided details only for the protein that was described in another entry. This left ambiguity in UniProtKB/Swiss-Prot entries that describe more than one protein (isoforms and/or products of proteolytic cleavage). To address this, we now describe both interacting proteins by unique UniProtKB identifiers.

This change affects the three main UniProtKB distribution formats (text, XML, RDF). The details are described for each format in a separate section below. The following placeholders are used in the format descriptions:

  • <Interactant> represents a UniProtKB protein.
    • <Accession> is a UniProtKB accession number.
    • <IsoId> is a UniProtKB isoform ID.
    • <ProductId> is a UniProtKB product ID.
    • <Gene> is either the gene name, ordered locus name or ORF name of the gene that encodes the UniProtKB protein (see Gene names).
  • <Experiments> is the number of experiments in IntAct that support an interaction.
  • <IntActId> is an IntAct protein ID.

Note: The format descriptions make use of POSIX ERE syntax.

Text format

Previous format:

CC   -!- INTERACTION:
CC       <Interactant>( \(xeno\))?; NbExp=<Experiments>; IntAct=<IntActId>, <IntActId>;
CC       <Interactant>( \(xeno\))?; NbExp=<Experiments>; IntAct=<IntActId>, <IntActId>;
CC       ...

The <Interactant> was described in the following way:

Self|(<Accession>|<IsoId>):(<Gene>|-)

Where Self represents a self-interaction and a dash is shown for proteins with an undefined <Gene>. xeno is an optional flag that indicates that the interacting proteins are derived from different species. This may be due to the experimental set-up or may reflect a pathogen-host interaction.

New format:

CC   -!- INTERACTION:
CC       <Interactant>; <Interactant>;( Xeno;)? NbExp=<Experiments>; IntAct=<IntActId>, <IntActId>;
CC       <Interactant>; <Interactant>;( Xeno;)? NbExp=<Experiments>; IntAct=<IntActId>, <IntActId>;
CC       ...

Where

  • the first <Interactant> is represented by:
    (<Accession>|<IsoId>|<ProductId>)
    
  • the second <Interactant> is represented by:
    (<Accession>|<IsoId>|<ProductId> [<Accession>])(: <Gene>)?
    

Example: P11309

Binary interactions with different isoforms that are described in P11309.

Previous format:

CC   -!- INTERACTION:
CC       Q9BZS1-1:FOXP3; NbExp=3; IntAct=EBI-1018629, EBI-9695448;
CC       Q9UNQ0:ABCG2; NbExp=5; IntAct=EBI-1018633, EBI-1569435;

New format:

CC   -!- INTERACTION:
CC       P11309-1; Q9BZS1-1: FOXP3; NbExp=3; IntAct=EBI-1018629, EBI-9695448;
CC       P11309-2; Q9UNQ0: ABCG2; NbExp=5; IntAct=EBI-1018633, EBI-1569435;

Example: P27958 and Q9NPY3

Binary interaction with a product of proteolytic cleavage. Interactions involving products of proteolytic cleavage were previously not imported from IntAct, therefore only the new data/format is shown.

New data and format of P27958:

CC   -!- INTERACTION:
CC       PRO_0000037566; Q9NPY3: CD93; Xeno; NbExp=2; IntAct=EBI-6377335, EBI-1755002;

New data and format of Q9NPY3:

CC   -!- INTERACTION:
CC       Q9NPY3; PRO_0000037566 [P27958]; Xeno; NbExp=2; IntAct=EBI-1755002, EBI-6377335;
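
The new line layout can be captured with a single regular expression. The following Python sketch is one possible reading of the format description above, not a reference implementation. The names `Interaction` and `parse_interaction_line` are hypothetical, and the pattern assumes that IntAct IDs have the form `EBI-<digits>`, as in the examples, and that the interaction is written on one line.

```python
import re
from typing import NamedTuple, Optional, Tuple


class Interaction(NamedTuple):
    first: str                       # accession, isoform ID or product ID
    second: str                      # accession, isoform ID or product ID
    second_accession: Optional[str]  # entry accession given in [brackets]
    gene: Optional[str]
    xeno: bool
    experiments: int
    intact_ids: Tuple[str, str]


_INTERACTION_RE = re.compile(
    r"^CC\s+"
    r"(?P<first>[^;\s]+); "
    r"(?P<second>[^;\s:\[]+)"
    r"(?: \[(?P<parent>[^\]]+)\])?"   # optional " [<Accession>]"
    r"(?:: (?P<gene>[^;]+))?;"        # optional ": <Gene>"
    r"(?P<xeno> Xeno;)?"
    r" NbExp=(?P<experiments>\d+);"
    r" IntAct=(?P<intact_first>EBI-\d+), (?P<intact_second>EBI-\d+);$"
)


def parse_interaction_line(line: str) -> Interaction:
    """Parse one new-format INTERACTION data line into its components."""
    match = _INTERACTION_RE.match(line.rstrip())
    if match is None:
        raise ValueError(f"not a new-format interaction line: {line!r}")
    return Interaction(
        first=match["first"],
        second=match["second"],
        second_accession=match["parent"],
        gene=match["gene"],
        xeno=match["xeno"] is not None,
        experiments=int(match["experiments"]),
        intact_ids=(match["intact_first"], match["intact_second"]),
    )


if __name__ == "__main__":
    print(parse_interaction_line(
        "CC       Q9NPY3; PRO_0000037566 [P27958]; Xeno; NbExp=2; "
        "IntAct=EBI-1755002, EBI-6377335;"
    ))
```

The header line `CC   -!- INTERACTION:` does not match the pattern and raises `ValueError`, so callers would typically skip it before parsing the data lines.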

XML format

The UniProtKB XSD represents a binary interaction with:

  • two interactant elements of interactantType
  • a boolean organismsDiffer element that indicates that the interacting proteins are derived from different species. This may be due to the experimental set-up or may reflect a pathogen-host interaction.
  • an experiments element that gives the number of experiments in IntAct that support an interaction.

The interactantType uses an interactantGroup to represent a sequence of:

  • an id element
  • an optional label element

We have added an optional dbReference element to the interactantGroup to allow us to represent the UniProtKB <Accession> for a <ProductId>:

<xs:group name="interactantGroup">
        <xs:sequence>
            <xs:element name="id" type="xs:string"/>
            <xs:element name="label" type="xs:string" minOccurs="0"/>
            <xs:element name="dbReference" type="dbReferenceType" minOccurs="0"/>
        </xs:sequence>
    </xs:group>

Previous format:

<comment type="interaction">
  <interactant intactId="<IntActId>"/>
  <interactant intactId="<IntActId>">
    <id><Accession>|<IsoId></id>
    <label><Gene></label>
  </interactant>
  <organismsDiffer>true|false</organismsDiffer>
  <experiments><Experiments></experiments>
</comment>

New format:

<comment type="interaction">
  <interactant intactId="<IntActId>">
    <id><Accession>|<IsoId>|<ProductId></id>
  </interactant>
  <interactant intactId="<IntActId>">
    <id><Accession>|<IsoId>|<ProductId></id>
    <label><Gene></label>
    <!-- If <id> is a <ProductId>: -->
    <dbReference type="UniProtKB" id="<Accession>"/>
  </interactant>
  <organismsDiffer>true|false</organismsDiffer>
  <experiments><Experiments></experiments>
</comment>

Example: P11309

Binary interactions with different isoforms that are described in P11309.

Previous format:

<comment type="interaction">
  <interactant intactId="EBI-1018629"/>
  <interactant intactId="EBI-9695448">
    <id>Q9BZS1-1</id>
    <label>FOXP3</label>
  </interactant>
  <organismsDiffer>false</organismsDiffer>
  <experiments>3</experiments>
</comment>
<comment type="interaction">
  <interactant intactId="EBI-1018633"/>
  <interactant intactId="EBI-1569435">
    <id>Q9UNQ0</id>
    <label>ABCG2</label>
  </interactant>
  <organismsDiffer>false</organismsDiffer>
  <experiments>5</experiments>
</comment>

New format:

<comment type="interaction">
  <interactant intactId="EBI-1018629">
    <id>P11309-1</id>
  </interactant>
  <interactant intactId="EBI-9695448">
    <id>Q9BZS1-1</id>
    <label>FOXP3</label>
  </interactant>
  <organismsDiffer>false</organismsDiffer>
  <experiments>3</experiments>
</comment>
<comment type="interaction">
  <interactant intactId="EBI-1018633">
    <id>P11309-2</id>
  </interactant>
  <interactant intactId="EBI-1569435">
    <id>Q9UNQ0</id>
    <label>ABCG2</label>
  </interactant>
  <organismsDiffer>false</organismsDiffer>
  <experiments>5</experiments>
</comment>

Example: P27958 and Q9NPY3

Binary interaction with a product of proteolytic cleavage. Interactions involving products of proteolytic cleavage had previously not been imported from IntAct, therefore only the new data/format is shown.

New data and format of P27958:

<comment type="interaction">
  <interactant intactId="EBI-6377335">
    <id>PRO_0000037566</id>
  </interactant>
  <interactant intactId="EBI-1755002">
    <id>Q9NPY3</id>
    <label>CD93</label>
  </interactant>
  <organismsDiffer>true</organismsDiffer>
  <experiments>2</experiments>
</comment>

New data and format of Q9NPY3:

<comment type="interaction">
  <interactant intactId="EBI-1755002">
    <id>Q9NPY3</id>
  </interactant>
  <interactant intactId="EBI-6377335">
    <id>PRO_0000037566</id>
    <dbReference type="UniProtKB" id="P27958"/>
  </interactant>
  <organismsDiffer>true</organismsDiffer>
  <experiments>2</experiments>
</comment>
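As an illustration of how the optional dbReference element might be consumed, the following Python sketch reads the Q9NPY3 example with the standard library. It parses the fragment exactly as shown here, without an XML namespace; complete UniProtKB XML files declare a default namespace, so element names would need to be qualified there. The function name and dictionary keys are assumptions made for this sketch.

```python
import xml.etree.ElementTree as ET

SAMPLE = """<comment type="interaction">
  <interactant intactId="EBI-1755002">
    <id>Q9NPY3</id>
  </interactant>
  <interactant intactId="EBI-6377335">
    <id>PRO_0000037566</id>
    <dbReference type="UniProtKB" id="P27958"/>
  </interactant>
  <organismsDiffer>true</organismsDiffer>
  <experiments>2</experiments>
</comment>"""


def read_interaction(comment):
    """Collect both interactants and the summary fields of one interaction comment."""

    def read_interactant(element):
        reference = element.find("dbReference")
        return {
            "intact_id": element.get("intactId"),
            "id": element.findtext("id"),
            "label": element.findtext("label"),  # None when no gene name is given
            "accession": reference.get("id") if reference is not None else None,
        }

    first, second = comment.findall("interactant")
    return {
        "first": read_interactant(first),
        "second": read_interactant(second),
        "organisms_differ": comment.findtext("organismsDiffer") == "true",
        "experiments": int(comment.findtext("experiments")),
    }


interaction = read_interaction(ET.fromstring(SAMPLE))
print(interaction["second"]["id"], interaction["second"]["accession"])
```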

RDF format

The UniProt RDF schema ontology represents a binary interaction with an interaction property whose rdfs:range is the Interaction class. This class is the domain of the following properties that describe the interaction:

  • xeno is a boolean that indicates that the interacting proteins are derived from different species. This may be due to the experimental set-up or may reflect a pathogen-host interaction.
  • experiments gives the number of experiments in IntAct that support an interaction.

A Participant is identified by its unique IntAct identifier. It also refers to the corresponding UniProtKB protein which is represented as described in the news article about the functional annotation of different gene products in UniProtKB/Swiss-Prot. An optional rdfs:label property may provide the gene name, ordered locus name or ORF name of the gene that encodes the UniProtKB protein.

The RDF schema ontology required no changes to represent the more detailed data that we now import from IntAct. Due to the symmetry of binary interactions, the UniProt SPARQL server already provided access to the full details about both interacting proteins. We have however taken this opportunity to normalize the URI of a binary interaction so that the two UniProtKB entries that describe the interacting proteins refer to the interaction with the same URI:

Previous format:

<<Accession>#interaction-<IntActId>-<IntActId>> .

New format:

<http://purl.uniprot.org/intact/<IntActId>-<IntActId>> .

Example: P11309 and Q8N9N5

Previous format:

P11309:

<P11309#interaction-696621-744695>

Q8N9N5:

<Q8N9N5#interaction-744695-696621>

New format:

P11309 and Q8N9N5:

<http://purl.uniprot.org/intact/EBI-696621-EBI-744695>

Cross-references to Antibodypedia

Cross-references have been added to Antibodypedia, a portal providing access to publicly available research antibodies towards human protein targets from many different providers.

Antibodypedia is available at https://www.antibodypedia.com/.

The format of the explicit links is:

Resource abbreviation: Antibodypedia
Resource identifier: Resource identifier
Optional information 1: Number of antibodies

Example: P04626

Show all entries having a cross-reference to Antibodypedia.

Text format

Example: P04626

DR   Antibodypedia; 740; 5394 antibodies.
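A DR line such as the one above can be split into its fields with a few string operations. This Python sketch assumes the simple layout shown in the example (fields separated by "; " and a final period); the function name is hypothetical, and DR lines with additional trailing information are not handled.

```python
def parse_dr_line(line):
    """Split a DR line into (database, identifier, optional_fields)."""
    if not line.startswith("DR   "):
        raise ValueError("not a DR line: " + line)
    body = line[5:].rstrip()
    if body.endswith("."):
        body = body[:-1]  # drop the terminating period
    fields = body.split("; ")
    database, identifier = fields[0], fields[1]
    # A dash stands for an empty optional field.
    optional = [field for field in fields[2:] if field != "-"]
    return database, identifier, optional


database, identifier, optional = parse_dr_line(
    "DR   Antibodypedia; 740; 5394 antibodies."
)
antibody_count = int(optional[0].split()[0])
print(database, identifier, antibody_count)
```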

XML format

Example: P04626

<dbReference type="Antibodypedia" id="740">
   <property type="antibodies" value="5394 antibodies"/>
</dbReference>

RDF format

Example: P04626

uniprot:P04626
  rdfs:seeAlso <http://purl.uniprot.org/antibodypedia/740> .

<http://purl.uniprot.org/antibodypedia/740>
  rdf:type up:Resource ;
  up:database <http://purl.uniprot.org/database/Antibodypedia> ;
  rdfs:comment "5394 antibodies" .

Cross-references to MetOSite

Cross-references have been added to MetOSite, a database of methionine sulfoxide sites. Each collected site has been classified according to the effect of its sulfoxidation on the biological properties of the modified protein. Thus, MetOSite documents cases where the sulfoxidation of methionine leads to gain or loss of activity, increased or decreased protein-protein interaction susceptibility, and to changes in protein stability or in subcellular location.

MetOSite is available at https://metosite.uma.es/.

The format of the explicit links is:

Resource abbreviation: MetOSite
Resource identifier: UniProtKB accession number

Example: P10987

Show all entries having a cross-reference to MetOSite.

Text format

Example: P10987

DR   MetOSite; P10987; -.

XML format

Example: P10987

<dbReference type="MetOSite" id="P10987"/>

RDF format

Example: P10987

uniprot:P10987
  rdfs:seeAlso <http://purl.uniprot.org/metosite/P10987> .
<http://purl.uniprot.org/metosite/P10987>
  rdf:type up:Resource ;
  up:database <http://purl.uniprot.org/database/MetOSite> .

Cross-references to PHI-base

Cross-references have been added to PHI-base, a database providing expertly curated molecular and biological information on genes proven to affect the outcome of pathogen-host interactions.

PHI-base is available at http://www.phi-base.org/.

The format of the explicit links is:

Resource abbreviation: PHI-base
Resource identifier: Resource identifier

Example: Q00310

Show all entries having a cross-reference to PHI-base.

Text format

Example: Q00310

DR   PHI-base; PHI:104; -.

XML format

Example: Q00310

<dbReference type="PHI-base" id="PHI:104"/>

RDF format

Example: Q00310

uniprot:Q00310
  rdfs:seeAlso <http://purl.uniprot.org/phi-base/PHI:104> .
<http://purl.uniprot.org/phi-base/PHI:104>
  rdf:type up:Resource ;
  up:database <http://purl.uniprot.org/database/PHI-base> .

Change to the cross-references to Human Protein Atlas (HPA)

We have changed the way we present the Human Protein Atlas database cross-references. Links between UniProtKB entries and HPA used to be established by HPA antibody identifier, but are now based on Ensembl Gene identifiers.

We have also introduced an additional field in these cross-references to indicate the level of RNA tissue specificity. The RNA specificity category is based on mRNA expression levels in the analyzed samples. The categories include: 'Tissue enriched', 'Group enriched', 'Tissue enhanced', 'Low tissue specificity' and 'Not detected'. For more details on these categories, see the Classification of transcriptomics data by Human Protein Atlas.

Text format

Example: Q9NSG2

Previous format:

DR   HPA; HPA023778; -.
DR   HPA; HPA024451; -.

New format:

DR   HPA; ENSG00000000460; Tissue enhanced (lymphoid).

XML format

Example: Q9NSG2

Previous format:

<dbReference type="HPA" id="HPA023778"/>
<dbReference type="HPA" id="HPA024451"/>

New format:

<dbReference type="HPA" id="ENSG00000000460">
  <property type="expression patterns" value="Tissue enhanced (lymphoid)"/>
</dbReference>

This change does not affect the XSD, but may nevertheless require code changes.
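The kind of code change that may be needed can be sketched as follows. This Python example reads an HPA dbReference element in either form; telling the two apart by the ENSG prefix is an assumption drawn from the example identifiers above, and the function name and result keys are hypothetical. The fragments are parsed without an XML namespace, unlike complete UniProtKB XML files.

```python
import xml.etree.ElementTree as ET


def read_hpa_reference(element):
    """Describe an HPA cross-reference in the previous or the new form."""
    hpa_id = element.get("id")
    specificity = None
    for prop in element.findall("property"):
        if prop.get("type") == "expression patterns":
            specificity = prop.get("value")
    # Assumption: Ensembl gene identifiers start with "ENSG".
    kind = "ensembl_gene" if hpa_id.startswith("ENSG") else "antibody"
    return {"id": hpa_id, "kind": kind, "rna_specificity": specificity}


previous = ET.fromstring('<dbReference type="HPA" id="HPA023778"/>')
current = ET.fromstring(
    '<dbReference type="HPA" id="ENSG00000000460">'
    '<property type="expression patterns" value="Tissue enhanced (lymphoid)"/>'
    "</dbReference>"
)
print(read_hpa_reference(previous))
print(read_hpa_reference(current))
```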

RDF format

Example: Q9NSG2

Previous format:

uniprot:Q9NSG2
  rdfs:seeAlso <http://purl.uniprot.org/hpa/HPA023778> ,
               <http://purl.uniprot.org/hpa/HPA024451> .
<http://purl.uniprot.org/hpa/HPA023778>
  rdf:type up:Resource ;
  up:database <http://purl.uniprot.org/database/HPA> .
<http://purl.uniprot.org/hpa/HPA024451>
  rdf:type up:Resource ;
  up:database <http://purl.uniprot.org/database/HPA> .

New format:

uniprot:Q9NSG2
  rdfs:seeAlso <http://www.proteinatlas.org/ENSG00000000460> .
<http://www.proteinatlas.org/ENSG00000000460>
  rdf:type up:Resource ;
  up:database <http://purl.uniprot.org/database/HPA> ;
  rdfs:comment "Tissue enhanced (lymphoid)" .

Changes to the controlled vocabulary of human diseases

New diseases:

Modified diseases:

Changes to the controlled vocabulary for PTMs

New terms for the feature key 'Modified residue' ('MOD_RES' in the flat file):

  • Thiazolidine linkage to a ring-opened DNA abasic site
  • Deoxyhypusine

RDF news

Change of URIs for the Human Protein Atlas (HPA) database

For historic reasons, UniProt had to generate URIs to cross-reference databases that did not have an RDF representation. Our policy is to replace these by the URIs generated by the cross-referenced database once it starts to distribute an RDF representation of its data.

The URIs for the Human Protein Atlas database have therefore been updated from:

http://purl.uniprot.org/hpa/<ID>

to:

http://www.proteinatlas.org/<ID>

If required for backward compatibility, you will be able to use the following query to add the old URIs:

PREFIX owl:<http://www.w3.org/2002/07/owl#>
PREFIX rdfs:<http://www.w3.org/2000/01/rdf-schema#>
PREFIX up:<http://purl.uniprot.org/core/>
INSERT
{
   ?protein rdfs:seeAlso ?old .
   ?old owl:sameAs ?new .
   ?old up:database <http://purl.uniprot.org/database/HPA> .
}
WHERE
{
   ?protein rdfs:seeAlso ?new .
   ?new up:database <http://purl.uniprot.org/database/HPA> .
   BIND(iri(concat('http://purl.uniprot.org/hpa/', substr(str(?new),29))) AS ?old)
}

The dereferencing of existing http://purl.uniprot.org/hpa/<ID> URIs will be maintained.
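Outside a SPARQL store, the same mapping is a plain prefix swap. The Python sketch below mirrors the substr(str(?new),29) step of the query: the new prefix is 28 characters long, so the identifier starts at 1-based position 29. The constant and function names are assumptions made for this illustration.

```python
OLD_PREFIX = "http://purl.uniprot.org/hpa/"
NEW_PREFIX = "http://www.proteinatlas.org/"


def old_hpa_uri(new_uri):
    """Rebuild the previous purl.uniprot.org URI from a proteinatlas.org URI."""
    if not new_uri.startswith(NEW_PREFIX):
        raise ValueError("not an HPA URI in the new form: " + new_uri)
    # len(NEW_PREFIX) is 28, which matches the 1-based offset 29 in the query.
    return OLD_PREFIX + new_uri[len(NEW_PREFIX):]


print(old_hpa_uri("http://www.proteinatlas.org/ENSG00000000460"))
```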

Standardized MD5 checksums in UniProt RDF

The UniProt databases UniProtKB, UniRef and UniParc have historically provided a CRC-64 checksum for the amino acid sequences. In the UniParc RDF representation we had already introduced an MD5 checksum, and we have now replaced it with a SPARQL 1.1 compliant MD5 representation (lowercase string), which we use across all databases. This makes it possible to use the MD5 function defined in SPARQL 1.1 to check that the sequence string is not corrupted, without needing the lowercase (LCASE) function and a cast to string, as was formerly the case:

PREFIX rdf:<http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX up:<http://purl.uniprot.org/core/>
SELECT ?computedMD5 ((?uniprotMD5 = ?computedMD5) AS ?md5SumsMatch)
WHERE
{
  ?protein a up:Protein ;
    up:sequence ?sequence .
  ?sequence rdf:value ?value ;
    up:md5Checksum ?uniprotMD5 .
  BIND(MD5(?value) AS ?computedMD5)
}
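The same check can be made on a downloaded sequence outside SPARQL. This Python sketch assumes that the sequence is available as a plain string of one-letter amino acid codes; the function names are hypothetical. Python's hexdigest() returns lowercase hexadecimal digits, which is the form described above.

```python
import hashlib


def sequence_md5(sequence):
    """Return the lowercase hexadecimal MD5 digest of a sequence string."""
    return hashlib.md5(sequence.encode("ascii")).hexdigest()


def checksum_matches(sequence, stored_md5):
    """Compare a computed digest with a stored lowercase MD5 string."""
    return sequence_md5(sequence) == stored_md5


# "abc" is only a well-known MD5 test input, not a real protein sequence.
print(sequence_md5("abc"))
```

Because the comparison is an exact string match, an uppercase digest would not match, which is why a uniform lowercase representation across all databases simplifies such checks.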

UniProt release 2020_01

Published February 26, 2020

Headline

Coronavirus SARS-CoV-2 in UniProtKB

At the end of 2019, a novel coronavirus (nCoV) of animal origin started infecting humans, initiating a severe outbreak in China. nCoV infection can result in severe and even fatal respiratory diseases, such as acute respiratory distress syndrome. The virus is highly contagious and transmission occurs via airborne droplets and contact. On January 30th, 2019-nCoV was designated a global health emergency by the WHO. On February 11th, the WHO called the disease caused by the virus COVID-19, and the virus itself was named Severe Acute Respiratory Syndrome-related coronavirus 2 or SARS-CoV-2 by the International Committee on Taxonomy of Viruses (ICTV).

SARS-CoV-2 belongs to the large family of Coronaviridae, genus Betacoronavirus. This genus comprises mainly vertebrate respiratory viruses, including HCoV-OC43, which is responsible for 10% of common colds, and SARS, which caused an epidemic in 2003, resulting in over 8,000 infected individuals in 26 countries. The novel coronavirus genome has been sequenced. Its close similarity to SARS suggests it has emerged from the same reservoir, namely bats.

With a size of 30 kb, coronaviruses have the largest RNA genomes known to date. The genome encodes a polyprotein 1a that can be elongated by ribosomal frameshifting to produce polyprotein 1ab. The short and elongated polyproteins contain 11 and 15 chains, respectively, and are dedicated to viral RNA transcription and replication, while controlling the host antiviral defense. A strategy used by the virus to escape host cell innate immunity is to induce the formation of a specialized intracellular compartment from the endoplasmic reticulum, called endoplasmic spherules, which protects viral dsRNA replication intermediates. Later on, subgenomic mRNAs are translated to produce virion structural proteins and yet another set of immune modulatory factors. Virions are assembled at the ER-Golgi intermediate compartment (or ERGIC) and exported out of the cell. The freshly exported virion is not yet infectious. Its surface is covered by spikes, giving the impression of a crown (corona in Latin, hence its name), but spike proteins have to be cleaved in order to become functional and to confer infectivity on the virion. The activating proteolytic cleavages occur in the extracellular space.

It is at the level of spike proteins that SARS-CoV-2 diverges from SARS, differing in both amino acid sequence and glycosylation. The SARS-CoV-2 spike protein cleavage site comprises several arginines, making it an excellent substrate for many host proteases. This feature is predicted to enhance virus tropism and virulence. SARS-CoV-2 interacts with the same host receptor as SARS, ACE2, which presumably explains why both viruses infect lungs, as well as the small intestine and kidney. The functions of several other SARS-CoV-2 proteins are still unclear and need further investigation. Among them is the SARS-CoV-2 NS8 protein, which shares sequence similarity with some bat-hosted coronavirus NS8 proteins, but is entirely different from SARS NS8a or NS8b. Thus, in spite of many similarities to SARS and other coronaviruses, SARS-CoV-2 displays unique molecular features that lead to unpredictable behavior during infection.

SARS-CoV-2 protein sequences from the current public health emergency have been annotated in UniProtKB and made available as a pre-release dataset on the UniProt FTP site. These entries will be available in the usual file formats as part of release 2020_02.

UniProt release news

Change of release cycle

Starting with release 2020_01 of February 26th, UniProt releases are published every 8 weeks. Release 2020_02 is scheduled for April 22nd, 2020.

See also: How frequently is UniProt released? What is the synchronization delay with other databases?

UniProtKB news

Changes to the controlled vocabulary of human diseases

New diseases:

Deleted disease:

  • Popov-Chang syndrome

Changes to the controlled vocabulary for PTMs

New term for the feature key 'Modified residue' ('MOD_RES' in the flat file):

  • N6-lactoyllysine

Modified term for the feature key 'Modified residue' ('MOD_RES' in the flat file):

  • 6-(S-cysteinyl)-8alpha-(pros-histidyl)-FAD (Cys-His) -> 6-(S-cysteinyl)-8alpha-(pros-histidyl)-FAD (His-Cys)

UniProt release 2019_11

Published December 18, 2019

Headline

Thicker than water

We know about blood types and their incompatibility; transfusing someone who is O- with AB+ blood can be lethal. The ABO alleles present on chromosome 9 decide our blood type. The A and B antigens are a set of red blood cell surface carbohydrates ending in α-1,3-linked N-acetylgalactosamine and α-1,3-linked galactose respectively, while type O blood has neither of these cell surface sugars. Sequence variations in the ABO gene determine if the encoded protein has α-1,3-N-acetylgalactosaminyltransferase activity and makes type A blood, or if it has α-1,3-galactosyltransferase activity and makes type B blood. When both alleles are present, we make type AB blood. Deletion of a single G nucleotide in the ABO gene leads to a truncated inactive product and type O blood, which has the non-modified H antigen.

To improve the usability of blood, people have tried for years to find a way to enzymatically convert A or B blood to type O; it seems an obvious way to increase the supply of universal donor blood (which would still require Rhesus matching). While such enzymes have been found, they are not yet ideal, as they either work only at high concentration or have very specific buffer requirements that are not met by blood.

By screening human fecal metagenomic libraries, Rahfeld et al. have isolated a pair of enzymes from the obligate gut anaerobe Flavonifractor plautii that efficiently converts the A to H antigen (type O). The first enzyme (A type blood N-acetyl-alpha-D-galactosamine deacetylase, ADAC) deacetylates all A antigen subtypes tested (and there are many), while the second enzyme (A type blood alpha-D-galactosamine galactosaminidase, AGAL) removes the residual galactosamine moiety. This reaction can occur on red blood cells and in blood, as opposed to a buffer system, and at low enzyme concentration, and thus shows promise for uses in blood production. Further testing is underway, and we still need a way to remove the B antigen, but this could well help increase the flexibility of our blood supply. It still won’t solve the world shortage of blood; only more donors can do that.

As of this release, ADAC and AGAL have been annotated and are available in UniProtKB/Swiss-Prot.

UniProtKB news

Change of FT and CC sections in UniProtKB text format

We have changed the format of the FT and CC sections of the UniProtKB text files. The changes to the FT section are likely to affect all parsers, and software will have to be adapted accordingly. The changes to the CC section are smaller, but may also require code adaptations depending on the CC annotation types that you parse.

The motivation for this change is described in the section "Functional annotation of different gene products in UniProtKB/Swiss-Prot" below, where you can also find the technical details and examples under the heading Text format.

Change of line length in UniProtKB text format

Historically, the lines of the UniProtKB text format have been wrapped at 75 characters for technical reasons (terminal screen size and data processing capabilities). When these technical restrictions vanished, we introduced exceptions for data like URLs, protein names and cross-references where line wrapping does not improve readability. These lines can be up to 255 characters long, but most lines are still wrapped at 75 characters for readability. We have now increased the maximum number of characters for wrapped lines to 80 in the context of the format change of the FT section of the UniProtKB text format for the functional annotation of different gene products in UniProtKB/Swiss-Prot described below.

Functional annotation of different gene products in UniProtKB/Swiss-Prot

To reduce database redundancy, the UniProtKB/Swiss-Prot policy is to describe, whenever possible, all protein products that are encoded by one gene in a given species in a single entry. This includes isoforms generated by alternative promoter usage, alternative splicing, alternative initiation and ribosomal frameshifting. We assign a name and a unique identifier to each isoform and choose one of them to be the canonical sequence that is shown in the UniProtKB text and XML format (the RDF format shows all sequences). All positional annotations in the entry referred to this canonical sequence until this release. Some gene products are precursors that are processed by proteolytic cleavage to generate the biologically active product(s). These products are described by their location on the sequence, a name and a unique identifier.

When isoforms, or products of proteolytic cleavage, are known to differ in their function or other characteristics, we generally describe this in the text of the respective annotations. To make this information also accessible to software applications, we adapted the UniProtKB text format to describe the product to which an annotation applies in a computer-processable way. The schemas of the XML and RDF format already supported this and required no changes. The following sections describe the changes for the text format and how the data is represented in the XML and RDF format.

Text format

Isoforms are described in ALTERNATIVE PRODUCTS annotations in the CC section. The products of proteolytic cleavage are described in PEPTIDE and CHAIN annotations in the FT section. All three annotation types provide a name (<ProductName>) and a unique ID (<ProductId>) for the product that they describe:

  • ALTERNATIVE PRODUCTS annotations show the name of an isoform in the Name field and its ID in the IsoId field.
    CC   -!- ALTERNATIVE PRODUCTS:
    ...
    CC       Name=<ProductName>;
    CC         IsoId=<ProductId>; Sequence=Displayed;
    
  • PEPTIDE and CHAIN annotations showed the name of a proteolytic cleavage product in the <Description> field and its ID in the FTId field in the previous text format:
    FT   CHAIN       <B>    <E>       <ProductName>.
    FT                                /FTId=<ProductId>.
    
    In the new text format, which is described in more detail in the FT section, they are shown in the /note= and /id= qualifiers, respectively:
    FT   CHAIN           <B>..<E>
    FT                   /note="<ProductName>"
    FT                   /id="<ProductId>"
    

Example: O60443

CC   -!- ALTERNATIVE PRODUCTS:
CC       Event=Alternative splicing; Named isoforms=3;
CC       Name=1; Synonyms=Long;
CC         IsoId=O60443-1; Sequence=Displayed;
CC       Name=2; Synonyms=Short;
CC         IsoId=O60443-2; Sequence=VSP_004190;
CC         Note=No experimental confirmation available.;
CC       Name=3;
CC         IsoId=O60443-3; Sequence=VSP_044276;
...
FT   CHAIN         1    496       Gasdermin-E.
FT                                /FTId=PRO_0000148178.
FT   CHAIN         1    270       Gasdermin-E, N-terminal.
FT                                {ECO:0000269|PubMed:27281216,
FT                                ECO:0000305|PubMed:28459430}.
FT                                /FTId=PRO_0000442786.
FT   CHAIN       271    496       Gasdermin-E, C-terminal.
FT                                {ECO:0000305|PubMed:28459430}.
FT                                /FTId=PRO_0000442787.
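
For the new qualifier-based FT layout described above, a parser might collect each feature together with its /note= and /id= qualifiers. The Python sketch below is a simplification under stated assumptions: the sample lines are a hand conversion of the first CHAIN of the O60443 example into the new layout, only plain <B>..<E> ranges and single-line qualifiers are handled, and the function and pattern names are hypothetical.

```python
import re

FEATURE_START = re.compile(r"^FT   (\S+)\s+(\d+)\.\.(\d+)$")
QUALIFIER = re.compile(r'^FT\s+/(\w+)="(.*)"$')


def parse_features(lines):
    """Group FT lines into one dictionary per feature."""
    features = []
    for line in lines:
        line = line.rstrip()
        start = FEATURE_START.match(line)
        if start:
            features.append({
                "key": start.group(1),
                "begin": int(start.group(2)),
                "end": int(start.group(3)),
            })
            continue
        qualifier = QUALIFIER.match(line)
        if qualifier and features:
            # Attach the qualifier to the most recently started feature.
            features[-1][qualifier.group(1)] = qualifier.group(2)
    return features


# Hand-converted sample, assumed from the format description above.
sample = [
    "FT   CHAIN           1..496",
    'FT                   /note="Gasdermin-E"',
    'FT                   /id="PRO_0000148178"',
]
print(parse_features(sample))
```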

CC section

The annotation types in the CC section describe a product by its name (isoform names are prefixed with the term "Isoform"). In the format descriptions below this name is represented by <ProductName>. Different products are described in separate annotations (see FUNCTION and BIOPHYSICOCHEMICAL PROPERTIES examples).

All annotation types of the CC section start with:

CC   -!- <TYPE>:

Where <TYPE> is a value from the controlled vocabulary of annotation types.

In some annotation types the content of the annotation used to directly follow the <TYPE>, and lines were wrapped at 75 chars:

CC   -!- <TYPE>: <Content>

In the new format a <ProductName> may be added between the <TYPE> and the <Content> and lines are wrapped at 80 chars (see Change of line length in UniProtKB text format):

CC   -!- <TYPE>: [<ProductName>]: <Content>

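The <ProductName> is surrounded by square brackets and separated by a colon from the <Content> to make it possible to parse it with a regular expression like this one (Perl-compatible syntax rather than strict POSIX ERE, since it uses a non-capturing group and a lazy quantifier):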

^CC   -!- ([^:]+):(?: \[(.+?)\]:)? (.+)

Where $1=<TYPE>, $2=<ProductName>, $3=<Content>.
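The pattern can be used unchanged with Python's re module, which supports the non-capturing group and the lazy quantifier it contains. The sketch below applies it to the first line of an annotation only; joining wrapped continuation lines is left out, and the function name is an assumption.

```python
import re

CC_LINE = re.compile(r"^CC   -!- ([^:]+):(?: \[(.+?)\]:)? (.+)")


def split_cc_line(line):
    """Return (type, product_name, content) for the first line of a CC annotation."""
    match = CC_LINE.match(line)
    if match is None:
        return None
    # group(2) is None when the annotation carries no product name.
    return match.group(1), match.group(2), match.group(3)


print(split_cc_line(
    "CC   -!- SUBCELLULAR LOCATION: [Isoform 3]: Secreted."
))
print(split_cc_line(
    "CC   -!- SUBCELLULAR LOCATION: Cell membrane; Lipid-anchor, GPI-anchor. Golgi"
))
```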

In annotation types where the content is structured as a list of different fields that are formatted according to custom rules for better readability, the annotation content starts on a new line:

CC   -!- <TYPE>:
CC       <Content>

In the new format a <ProductName> may be added after the <TYPE> and this line is not wrapped (i.e. it may in rare cases exceed 80 chars).

CC   -!- <TYPE>: [<ProductName>]:
CC       <Content>

The format of the <Content> remains unchanged.
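For annotation types whose content starts on a new line, the header can be matched with a variant of the pattern shown earlier that ends right after the optional product name. This Python sketch is an assumption built from the format description, not an official pattern; it uses lines of a COFACTOR annotation in the new format as sample input, and the names are hypothetical.

```python
import re

CC_HEADER = re.compile(r"^CC   -!- ([^:]+):(?: \[(.+?)\]:)?$")


def read_structured_annotation(lines):
    """Return (type, product_name, content_lines) for a structured CC annotation."""
    header = CC_HEADER.match(lines[0].rstrip())
    if header is None:
        raise ValueError("not a structured CC annotation header")
    # Strip the "CC" line code and the indentation from the content lines.
    content = [line[2:].strip() for line in lines[1:]]
    return header.group(1), header.group(2), content


annotation = read_structured_annotation([
    "CC   -!- COFACTOR: [Serine protease NS3]:",
    "CC       Name=Zn(2+); Xref=ChEBI:CHEBI:29105;",
    "CC         Evidence={ECO:0000269|PubMed:9060645};",
])
print(annotation)
```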

A <ProductName> cannot be added to ALTERNATIVE PRODUCTS and INTERACTION annotations. The INTERACTION format will be adapted in a different way to describe binary interactions that involve isoforms and/or products of proteolytic cleavage (see Change of annotation topic 'Interaction').

Please note that the previous text format of SUBCELLULAR LOCATION, COFACTOR and MASS SPECTROMETRY annotations already allowed a product name/ID to be specified, but we have adapted it to be consistent with all other annotation types.

Representative examples for different annotation types are shown here:

FUNCTION

Example: Q96F85

CC   -!- ALTERNATIVE PRODUCTS:
CC       Event=Alternative splicing; Named isoforms=2;
CC       Name=1; Synonyms=CRIP1a;
CC         IsoId=Q96F85-1; Sequence=Displayed;
CC       Name=2; Synonyms=CRIP1b;
CC         IsoId=Q96F85-2; Sequence=VSP_035598;

Previous format:

CC   -!- FUNCTION: Isoform 1 suppresses cannabinoid receptor CNR1-mediated
CC       tonic inhibition of voltage-gated calcium channels. Isoform 2 does
CC       not have this effect. {ECO:0000269|PubMed:17895407}.

New format:

CC   -!- FUNCTION: [Isoform 1]: Suppresses cannabinoid receptor CNR1-mediated
CC       tonic inhibition of voltage-gated calcium channels.
CC       {ECO:0000269|PubMed:17895407}.
CC   -!- FUNCTION: [Isoform 2]: Does not suppress cannabinoid receptor CNR1-
CC       mediated tonic inhibition of voltage-gated calcium channels.
CC       {ECO:0000269|PubMed:17895407}.

DISEASE

Example: P35555

FT   CHAIN      2732   2871       Asprosin. {ECO:0000305|PubMed:27087445,
FT                                ECO:0000305|PubMed:9817919}.
FT                                /FTId=PRO_0000436882.

Previous format:

CC   -!- DISEASE: Marfan lipodystrophy syndrome (MFLS) [MIM:616914]: A
CC       syndrome characterized by congenital ...
CC       Note=The disease is caused by mutations affecting the gene
CC       represented in this entry. Asprosin: Mutations specifically affect
CC       Asprosin, a hormone peptide present at the C-terminus of
CC       Fibrillin-1 chain, which is cleaved from Fibrillin-1 following
CC       secretion (PubMed:27087445). {ECO:0000269|PubMed:27087445}.

New format:

CC   -!- DISEASE: [Asprosin]: Marfan lipodystrophy syndrome (MFLS) [MIM:616914]:
CC       A syndrome characterized by congenital ...
CC       Note=The disease is caused by mutations affecting the gene represented
CC       in this entry. {ECO:0000269|PubMed:27087445}.

SUBCELLULAR LOCATION

Please note that the previous text format of SUBCELLULAR LOCATION annotations already allowed a product to be described by its name in the optional first field. To be consistent with all other annotation types, we have added square brackets around the product name.

Example: Q13421

CC   -!- ALTERNATIVE PRODUCTS:
...
CC       Name=3; Synonyms=SMRP;
CC         IsoId=Q13421-2; Sequence=VSP_021059, VSP_021060;
...
FT   CHAIN        37    286       Megakaryocyte-potentiating factor.
FT                                /FTId=PRO_0000253560.

Previous format:

CC   -!- SUBCELLULAR LOCATION: Cell membrane; Lipid-anchor, GPI-anchor.
CC       Golgi apparatus.
CC   -!- SUBCELLULAR LOCATION: Megakaryocyte-potentiating factor: Secreted.
CC   -!- SUBCELLULAR LOCATION: Isoform 3: Secreted.

New format:

CC   -!- SUBCELLULAR LOCATION: Cell membrane; Lipid-anchor, GPI-anchor. Golgi
CC       apparatus.
CC   -!- SUBCELLULAR LOCATION: [Megakaryocyte-potentiating factor]: Secreted.
CC   -!- SUBCELLULAR LOCATION: [Isoform 3]: Secreted.

MASS SPECTROMETRY

Please note that the previous text format of MASS SPECTROMETRY annotations already allowed a product to be described (by its sequence range and an optional isoform ID) in the Range field. To be consistent with all other annotation types, we have replaced the Range field with a <ProductName> field.

Example: P09493

CC   -!- ALTERNATIVE PRODUCTS:
...
CC       Name=3; Synonyms=Fibroblast, TM3;
CC         IsoId=P09493-3; Sequence=VSP_006577, VSP_006579;

Previous format:

CC   -!- MASS SPECTROMETRY: Mass=32875.93; Method=MALDI; Range=1-284
CC       (P09493-3); Evidence={ECO:0000269|PubMed:11840567};

New format:

CC   -!- MASS SPECTROMETRY: [Isoform 3]: Mass=32875.93; Method=MALDI;
CC       Evidence={ECO:0000269|PubMed:11840567};
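
A parser for the new MASS SPECTROMETRY layout might first join the wrapped lines and then split the content into its fields. The Python sketch below assumes that fields are separated by semicolons and that no field value contains one, which holds for this example but may not hold for annotations that carry a free-text Note; the function name is hypothetical.

```python
import re

MASS_SPEC = re.compile(r"^-!- MASS SPECTROMETRY:(?: \[(.+?)\]:)? (.+)$")


def parse_mass_spectrometry(lines):
    """Return (product_name, fields) for one MASS SPECTROMETRY annotation."""
    # Drop the "CC" line code and rejoin the wrapped lines.
    text = " ".join(line[2:].strip() for line in lines)
    match = MASS_SPEC.match(text)
    if match is None:
        raise ValueError("not a MASS SPECTROMETRY annotation")
    fields = {}
    for part in match.group(2).split(";"):
        part = part.strip()
        if part:
            key, _, value = part.partition("=")
            fields[key] = value
    return match.group(1), fields


product, fields = parse_mass_spectrometry([
    "CC   -!- MASS SPECTROMETRY: [Isoform 3]: Mass=32875.93; Method=MALDI;",
    "CC       Evidence={ECO:0000269|PubMed:11840567};",
])
print(product, fields)
```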

RNA EDITING

Example: Q9P225

CC   -!- ALTERNATIVE PRODUCTS:
...
CC       Name=3;
CC         IsoId=Q9P225-3; Sequence=VSP_031913, VSP_031914, VSP_031915;

Previous format:

CC   -!- RNA EDITING: Modified_positions=Not_applicable; Note=Exon 13
CC       included in isoform 3 is extensively edited in brain.
CC       {ECO:0000269|PubMed:20835228};

New format:

CC   -!- RNA EDITING: [Isoform 3]: Modified_positions=Not_applicable; Note=Exon
CC       13 is extensively edited in brain. {ECO:0000269|PubMed:20835228};

WEB RESOURCE

Example: P50570

CC   -!- ALTERNATIVE PRODUCTS:
...
CC       Name=1;
CC         IsoId=P50570-1; Sequence=Displayed;

Previous format:

CC   -!- WEB RESOURCE: Name=The UMD-DNM2-isoform 1 mutations database;
CC       URL="http://www.umd.be/DNM2/";

New format:

CC   -!- WEB RESOURCE: [Isoform 1]: Name=The UMD-DNM2-isoform 1 mutations
CC       database;
CC       URL="http://www.umd.be/DNM2/";

CATALYTIC ACTIVITY

Example: Q2YHF0

FT   CHAIN      1475   2092       Serine protease NS3.
FT                                {ECO:0000250|UniProtKB:P29990}.
FT                                /FTId=PRO_0000268140.
...
FT   CHAIN      2488   3387       RNA-directed RNA polymerase NS5.
FT                                {ECO:0000250|UniProtKB:P29990}.
FT                                /FTId=PRO_0000268144.

Previous format:

CC   -!- CATALYTIC ACTIVITY:
CC       Reaction=Selective hydrolysis of -Xaa-Xaa-|-Yaa- bonds in which
CC         each of the Xaa can be either Arg or Lys and Yaa can be either
CC         Ser or Ala.; EC=3.4.21.91;
CC   -!- CATALYTIC ACTIVITY:
CC       Reaction=a ribonucleoside 5'-triphosphate + RNA(n) = diphosphate +
CC         RNA(n+1); Xref=Rhea:RHEA:21248, Rhea:RHEA-COMP:11128, Rhea:RHEA-
CC         COMP:11129, ChEBI:CHEBI:33019, ChEBI:CHEBI:61557,
CC         ChEBI:CHEBI:83400; EC=2.7.7.48; Evidence={ECO:0000255|PROSITE-
CC         ProRule:PRU00539};

New format:

CC   -!- CATALYTIC ACTIVITY: [Serine protease NS3]:
CC       Reaction=Selective hydrolysis of -Xaa-Xaa-|-Yaa- bonds in which each of
CC         the Xaa can be either Arg or Lys and Yaa can be either Ser or Ala.;
CC         EC=3.4.21.91;
CC   -!- CATALYTIC ACTIVITY: [RNA-directed RNA polymerase NS5]:
CC       Reaction=a ribonucleoside 5'-triphosphate + RNA(n) = diphosphate +
CC         RNA(n+1); Xref=Rhea:RHEA:21248, Rhea:RHEA-COMP:11128, Rhea:RHEA-
CC         COMP:11129, ChEBI:CHEBI:33019, ChEBI:CHEBI:61557, ChEBI:CHEBI:83400;
CC         EC=2.7.7.48; Evidence={ECO:0000255|PROSITE-ProRule:PRU00539};

COFACTOR

Please note that the previous text format of COFACTOR annotations already allowed a product to be described by its name in the optional first field. To be consistent with all other annotation types, we have added square brackets around the product name.

Example: P26662

FT   CHAIN      1027   1657       Serine protease NS3. {ECO:0000255}.
FT                                /FTId=PRO_0000037644.
FT   CHAIN      1658   1711       Non-structural protein 4A. {ECO:0000255}.
FT                                /FTId=PRO_0000037645.

Previous format:

CC   -!- COFACTOR: Serine protease NS3:
CC       Name=Zn(2+); Xref=ChEBI:CHEBI:29105;
CC         Evidence={ECO:0000269|PubMed:9060645};
CC       Note=Binds 1 zinc ion. {ECO:0000269|PubMed:9060645};
CC   -!- COFACTOR: Non-structural protein 5A:
CC       Name=Zn(2+); Xref=ChEBI:CHEBI:29105; Evidence={ECO:0000250};
CC       Note=Binds 1 zinc ion in the NS5A N-terminal domain.
CC       {ECO:0000250};

New format:

CC   -!- COFACTOR: [Serine protease NS3]:
CC       Name=Zn(2+); Xref=ChEBI:CHEBI:29105;
CC         Evidence={ECO:0000269|PubMed:9060645};
CC       Note=Binds 1 zinc ion. {ECO:0000269|PubMed:9060645};
CC   -!- COFACTOR: [Non-structural protein 5A]:
CC       Name=Zn(2+); Xref=ChEBI:CHEBI:29105; Evidence={ECO:0000250};
CC       Note=Binds 1 zinc ion in the NS5A N-terminal domain. {ECO:0000250};
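
The bracketed product name shown in the examples above can be picked up with a small pattern on the first line of each CC annotation. The following is a minimal sketch, not an official parser: the function name `parse_cc_header` and the assumption that topic names consist only of upper-case letters and spaces are ours.

```python
import re

# First line of a CC annotation, e.g.
#   "CC   -!- COFACTOR: [Serine protease NS3]:"
# The bracketed product name is optional (assumption: topics are
# written in upper-case letters and spaces only).
CC_HEADER = re.compile(
    r"^CC   -!- (?P<topic>[A-Z][A-Z ]*?):"
    r"(?: \[(?P<molecule>[^\]]+)\]:)?"
    r"(?P<rest>.*)$"
)


def parse_cc_header(line):
    """Return (topic, molecule, rest) for a CC topic line, or None.

    molecule is None when the annotation has no bracketed product name.
    """
    match = CC_HEADER.match(line)
    if match is None:
        return None
    return match.group("topic"), match.group("molecule"), match.group("rest").strip()
```

Annotations without a product name, such as a plain `CC   -!- CATALYTIC ACTIVITY:` line, yield `None` for the molecule, so the same function can be applied to every CC topic line.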

BIOPHYSICOCHEMICAL PROPERTIES

Example: Q9ULC5

CC   -!- ALTERNATIVE PRODUCTS:
...
CC       Name=1; Synonyms=ACSL5b, ACSL5-fl;
CC         IsoId=Q9ULC5-1; Sequence=Displayed;
...
CC       Name=3; Synonyms=ACSL5delta20;
CC         IsoId=Q9ULC5-4; Sequence=VSP_038233;

Previous format:

CC   -!- BIOPHYSICOCHEMICAL PROPERTIES:
CC       Kinetic parameters:
CC         KM=0.11 uM for palmitic acid (isoform 1 at pH 7.5)
CC         {ECO:0000269|PubMed:17681178};
CC         KM=0.38 uM for palmitic acid (isoform 1 at pH 9.5)
CC         {ECO:0000269|PubMed:17681178};
CC         KM=0.04 uM for palmitic acid (isoform 3 at pH 7.5)
CC         {ECO:0000269|PubMed:17681178};
CC         KM=0.15 uM for palmitic acid (isoform 3 at pH 8.5)
CC         {ECO:0000269|PubMed:17681178};
CC       pH dependence:
CC         Optimum pH is 9.5 (isoform 1), 7.5-8.5 (isoform 3).
CC         {ECO:0000269|PubMed:17681178};

New format:

CC   -!- BIOPHYSICOCHEMICAL PROPERTIES: [Isoform 1]:
CC       Kinetic parameters:
CC         KM=0.11 uM for palmitic acid (at pH 7.5)
CC         {ECO:0000269|PubMed:17681178};
CC         KM=0.38 uM for palmitic acid (at pH 9.5)
CC         {ECO:0000269|PubMed:17681178};
CC       pH dependence:
CC         Optimum pH is 9.5. {ECO:0000269|PubMed:17681178};
CC   -!- BIOPHYSICOCHEMICAL PROPERTIES: [Isoform 3]:
CC       Kinetic parameters:
CC         KM=0.04 uM for palmitic acid (at pH 7.5)
CC         {ECO:0000269|PubMed:17681178};
CC         KM=0.15 uM for palmitic acid (at pH 8.5)
CC         {ECO:0000269|PubMed:17681178};
CC       pH dependence:
CC         Optimum pH is 7.5-8.5. {ECO:0000269|PubMed:17681178};

SEQUENCE CAUTION

Example: Q9NQS3

Previous format:

CC   -!- ALTERNATIVE PRODUCTS:
...
CC       Name=1;
CC         IsoId=Q9NQS3-1; Sequence=Displayed;
...
CC       Name=3;
CC         IsoId=Q9NQS3-3; Sequence=VSP_046893, VSP_046894;
CC         Note=Ref.2 (BAC11404) sequence differs from that shown due to
CC         erroneous termination (Truncated C-terminus). {ECO:0000305};
...
CC   -!- SEQUENCE CAUTION:
CC       Sequence=AAH17572.1; Type=Erroneous initiation; Note=Truncated N-terminus.; Evidence={ECO:0000305};

New format:

CC   -!- ALTERNATIVE PRODUCTS:
...
CC       Name=1;
CC         IsoId=Q9NQS3-1; Sequence=Displayed;
...
CC       Name=3;
CC         IsoId=Q9NQS3-3; Sequence=VSP_046893, VSP_046894;
...
CC   -!- SEQUENCE CAUTION: [Isoform 1]:
CC       Sequence=AAH17572.1; Type=Erroneous initiation; Note=Truncated N-terminus.; Evidence={ECO:0000305};
CC   -!- SEQUENCE CAUTION: [Isoform 3]:
CC       Sequence=BAC11404.1; Type=Erroneous termination; Note=Truncated C-terminus.; Evidence={ECO:0000305};

FT section

Note: The format descriptions make use of POSIX ERE syntax.

All positional annotations in the FT section previously referred to the canonical sequence that is shown in the UniProtKB entry. This was the text format of these annotation types:

FT   <TYPE>      <B>    <E>       (<Description>.)?( {<Evidences>}.)?
(FT                                /FTId=<Id>.)?

Where

  • <TYPE> is a value from the controlled vocabulary of positional annotation types.
  • <B> and <E> are amino acid positions on the canonical sequence. For most annotation types, they are the begin and end position of a sequence range, but they have other semantics for some types (e.g. CROSSLNK and DISULFID).
  • <Description> may provide information in addition to that conveyed by the <TYPE> and the location <B> and <E>. This field is mandatory for some annotation types and optional for others.
  • <Evidences> are optional and added between curly braces.
  • <Id> is a unique annotation identifier that is mandatory for some annotation types, including CHAIN and PEPTIDE where it corresponds to the <ProductId>.

We have modified this format in order to describe amino acid positions on isoform sequences. The new format is inspired by the INSDC's feature table format to enable code reuse:

FT   <TYPE>          <Location>
(FT                   /<Qualifier>(="<Value>")?)*

Where

  • <TYPE> is a value from the controlled vocabulary of positional annotation types.
  • <Location> is a sequence location on the canonical or an isoform sequence. For now, we use only a subset of the INSDC Location types: a <Location> must be either a single <Position> or a range of <Position>s, optionally preceded by an isoform ID. The < and > symbols may be used with begin and end positions to indicate that the begin or end point lies beyond the specified amino acid position. Please note that we had to extend the INSDC Location format with the ? symbol to represent all existing UniProtKB locations. This symbol may precede a <Position> to indicate that the exact position is uncertain, or it may replace the <Position> when the position is unknown.
    (<IsoformId>:)?((<|?)?<Position>|?)(..((>|?)?<Position>|?))?
    
  • /<Qualifier> may provide information in addition to that conveyed by the <TYPE> and <Location>. While we will follow the format of the INSDC Qualifiers, we will introduce our own <Qualifier> types where necessary. For this format change, we will represent the existing data with 3 qualifiers:
    • /note= will show the content of the current <Description> field.
    • /evidence= will show the content of the current <Evidences> field.
    • /id= will show the content of the current /FTId= field.
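
The <Location> grammar given above translates almost directly into a regular expression. The sketch below is one possible reading of it in Python; the shape assumed for <IsoformId> (an accession, a hyphen and a number, as in P12821-3) and the function name `parse_location` are our assumptions, not part of the format description.

```python
import re

# (<IsoformId>:)?((<|?)?<Position>|?)(..((>|?)?<Position>|?))?
LOCATION = re.compile(
    r"^(?:(?P<isoform>[A-Z0-9]+-\d+):)?"   # optional isoform ID (assumed shape)
    r"(?P<begin>[<?]?\d+|\?)"              # begin: "<12", "?12", "12" or "?"
    r"(?:\.\.(?P<end>[>?]?\d+|\?))?$"      # optional end: ">12", "?12", "12" or "?"
)


def parse_location(text):
    """Split a location into (isoform_id, begin, end).

    isoform_id is None for the canonical sequence. For a single
    position, begin and end are the same string. The "<", ">" and "?"
    markers are kept as part of the returned strings.
    """
    match = LOCATION.match(text)
    if match is None:
        raise ValueError(f"not a recognised location: {text!r}")
    begin = match.group("begin")
    end = match.group("end") or begin
    return match.group("isoform"), begin, end
```

Keeping the markers in the returned strings leaves it to the caller to decide how uncertain or open-ended positions should be treated.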

In a future format change, we may introduce more <Location> and <Qualifier> types to structure the description of positional annotations further.

Lines are wrapped at 80 chars (see section Change of line length in UniProtKB text format above).

Example: P84077

This example illustrates the format change with a selection of representative positional annotation types that refer to the canonical sequence.

Previous format:

FT   INIT_MET      1      1       Removed. {ECO:0000244|PubMed:19413330,
FT                                ECO:0000244|PubMed:22223895,
FT                                ECO:0000269|PubMed:25255805,
FT                                ECO:0000269|PubMed:25807930}.
FT   CHAIN         2    181       ADP-ribosylation factor 1.
FT                                /FTId=PRO_0000207378.
...
FT   NP_BIND     126    129       GTP. {ECO:0000244|PDB:1HUR,
FT                                ECO:0000244|PDB:1RE0,
FT                                ECO:0000244|PDB:1U81,
FT                                ECO:0000244|PDB:3O47, ECO:0000305}.
...
FT   VARIANT      35     35       Y -> H (in PVNH8; decreased interaction
FT                                with GGA3; dbSNP:rs879036238).
FT                                {ECO:0000269|PubMed:28868155}.
FT                                /FTId=VAR_081272.
...
FT   HELIX         6      9       {ECO:0000244|PDB:1HUR}.

New format:

FT   INIT_MET        1
FT                   /note="Removed"
FT                   /evidence="ECO:0000244|PubMed:19413330,
FT                   ECO:0000244|PubMed:22223895, ECO:0000269|PubMed:25255805,
FT                   ECO:0000269|PubMed:25807930"
FT   CHAIN           2..181
FT                   /note="ADP-ribosylation factor 1"
FT                   /id="PRO_0000207378"
...
FT   NP_BIND         126..129
FT                   /note="GTP"
FT                   /evidence="ECO:0000244|PDB:1HUR, ECO:0000244|PDB:1RE0,
FT                   ECO:0000244|PDB:1U81, ECO:0000244|PDB:3O47, ECO:0000305"
...
FT   VARIANT         35
FT                   /note="Y -> H (in PVNH8; decreased interaction with GGA3;
FT                   dbSNP:rs879036238)"
FT                   /evidence="ECO:0000269|PubMed:28868155"
FT                   /id="VAR_081272"
...
FT   HELIX           6..9
FT                   /evidence="ECO:0000244|PDB:1HUR"
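
For readers adapting their parsers, the new layout can be read line by line: a line with a feature type starts a new annotation, and the indented lines that follow carry its qualifiers, possibly wrapped. The following is a minimal sketch under several assumptions of ours: wrapped qualifier values are rejoined with a single space, a value ends at a trailing double quote, and the name `parse_ft` is hypothetical.

```python
import re

QUALIFIER_START = re.compile(r'^/(\w+)(?:="(.*))?$')


def parse_ft(lines):
    """Parse new-format FT lines into a list of feature dictionaries."""
    features = []
    open_name = None  # qualifier whose quoted value is not yet closed
    for line in lines:
        if not line.startswith("FT   "):
            continue  # skip other line types and "..." elisions
        body = line[5:]
        if not body.strip():
            continue
        if not body[0].isspace():
            # "<TYPE>   <Location>" starts a new feature
            ftype, location = body.split()
            features.append({"type": ftype, "location": location, "qualifiers": {}})
            open_name = None
            continue
        if not features:
            raise ValueError("qualifier line before any feature line")
        text = body.strip()
        qualifiers = features[-1]["qualifiers"]
        if open_name is None:
            match = QUALIFIER_START.match(text)
            if match is None:
                raise ValueError(f"unexpected FT line: {line!r}")
            name, value = match.group(1), match.group(2)
            if value is None:
                qualifiers[name] = True  # qualifier without a value
                continue
            qualifiers[name] = value
        else:
            # continuation of a wrapped value (assumption: join with a space)
            name = open_name
            qualifiers[name] += " " + text
        if qualifiers[name].endswith('"'):
            qualifiers[name] = qualifiers[name][:-1]
            open_name = None
        else:
            open_name = name
    return features
```

Applied to the INIT_MET lines above, this yields one feature with location `1`, a `note` of `Removed` and the four evidence tags joined into a single `evidence` string.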

Example: P0C551

This example illustrates the use of the < and ? symbols in UniProtKB locations.

Previous format:

FT   SIGNAL       <1      ?       {ECO:0000250}.
FT   PROPEP        ?     17       {ECO:0000250}.
FT                                /FTId=PRO_0000293097.
FT   CHAIN        18    142       Acidic phospholipase A2 KBf-grIB.
FT                                /FTId=PRO_0000293098.

New format:

FT   SIGNAL          <1..?
FT                   /evidence="ECO:0000250"
FT   PROPEP          ?..17
FT                   /evidence="ECO:0000250"
FT                   /id="PRO_0000293097"
FT   CHAIN           18..142
FT                   /note="Acidic phospholipase A2 KBf-grIB"
FT                   /id="PRO_0000293098"

Example: P12821

This example illustrates how positional annotations for isoforms are represented.

Previous format:

CC   -!- ALTERNATIVE PRODUCTS:
CC       ...
CC       Name=Testis-specific; Synonyms=ACE-T;
CC         IsoId=P12821-3, P22966-1;
CC         Sequence=VSP_035120, VSP_035121;
CC         Note=Variant in position: 32:S->P (in dbSNP:rs4317). Variant in
CC         position: 49:S->G (in dbSNP:rs4318).;
...
FT   VARIANT     154    154       A -> T (in dbSNP:rs13306087).
FT                                /FTId=VAR_029139.

New format:

CC   -!- ALTERNATIVE PRODUCTS:
CC       ...
CC       Name=Testis-specific; Synonyms=ACE-T;
CC         IsoId=P12821-3, P22966-1;
CC         Sequence=VSP_035120, VSP_035121;
...
FT   VARIANT         154
FT                   /note="A -> T (in dbSNP:rs13306087)"
FT                   /id="VAR_029139"
...
FT   VARIANT         P12821-3:32
FT                   /note="S -> P (in dbSNP:rs4317)"
FT                   /id="VAR_x"
FT   VARIANT         P12821-3:49
FT                   /note="S -> G (in dbSNP:rs4318)"
FT                   /id="VAR_y"

XML format

The UniProtKB XSD already allowed the product to which an annotation applies to be described, so no changes were required.

Isoforms are described in "alternative products" annotations. The products of proteolytic cleavage are described in "peptide" and "chain" annotations. All three annotation types provide a name (<ProductName>) and/or a unique ID (<ProductId>) for the product that they describe:

  • "alternative products" annotations describe each isoform by an isoform element of isoformType. The isoformType describes the product IDs and names with sequences of id and name elements (where the first element in each sequence is the main product ID/name).
    <comment type="alternative products">
      ...
      <isoform>
        <id><ProductId></id>
        <id><OldProductId></id>
        <name><ProductName></name>
        <name><AlternativeProductName></name>
        ...
      </isoform>
      ...
    </comment>
    
  • "peptide" and "chain" annotations show the name and ID of a proteolytic cleavage product in the description and id attributes of the featureType.
    <feature type="chain" description="<ProductName>" id="<ProductId>">
    ...
    </feature>
    
commentType

The commentType has two ways to indicate that the annotation applies to a specific product:

  • An optional molecule element of moleculeType describes a product by its name and/or unique ID. It is currently only used for "subcellular location" and "cofactor" annotations (see examples below). In the future it may be used for all annotations that are represented by commentType.
  • An optional sequence of location elements of locationType describes the sequence coordinates of an annotation. The locationType has an optional sequence attribute that is only set (to an isoform ID) when the coordinates are not for the canonical sequence. Sequence coordinates may currently be given for "rna editing", "sequence caution" and "mass spectrometry" annotations. In the future, sequence caution and mass spectrometry annotations will no longer describe sequence coordinates.

subcellular location

Example: Q13421

<comment type="alternative products">
  ...
  <isoform>
    <id>Q13421-2</id>
    <name>3</name>
    <name>SMRP</name>
    <sequence type="described" ref="VSP_021059 VSP_021060"/>
    ...
  </isoform>
  ...
</comment>
...
<feature type="chain" description="Megakaryocyte-potentiating factor"
                      id="PRO_0000253560">
  <location>
    <begin position="37"/>
    <end position="286"/>
  </location>
</feature>
<comment type="subcellular location">
  <subcellularLocation>
    <location>Cell membrane</location>
    <topology>Lipid-anchor</topology>
    <topology>GPI-anchor</topology>
  </subcellularLocation>
  <subcellularLocation>
    <location>Golgi apparatus</location>
  </subcellularLocation>
</comment>
<comment type="subcellular location">
  <molecule>Megakaryocyte-potentiating factor</molecule>
  <subcellularLocation>
    <location>Secreted</location>
  </subcellularLocation>
</comment>
<comment type="subcellular location">
  <molecule>Isoform 3</molecule>
  <subcellularLocation>
    <location>Secreted</location>
  </subcellularLocation>
</comment>
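
As an illustration of how the molecule element might be consumed, the sketch below groups subcellular locations by the product they apply to, using Python's standard `xml.etree.ElementTree`. It assumes the comment elements are wrapped in a single root element and carry no XML namespace; in real UniProtKB XML files the tag names would need the namespace declared by the file. The function name is hypothetical.

```python
import xml.etree.ElementTree as ET


def locations_by_molecule(xml_text):
    """Map each product name to its list of subcellular locations.

    Comments without a <molecule> child are stored under the key None.
    """
    root = ET.fromstring(xml_text)
    result = {}
    for comment in root.iter("comment"):
        if comment.get("type") != "subcellular location":
            continue
        molecule = comment.findtext("molecule")  # None when the element is absent
        names = [loc.text for loc in comment.findall("subcellularLocation/location")]
        result.setdefault(molecule, []).extend(names)
    return result
```

For the Q13421 example this would give the two membrane-associated locations under `None`, and `Secreted` for both "Megakaryocyte-potentiating factor" and "Isoform 3".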

cofactor

Example: P26662

<feature type="chain" description="Serine protease NS3"
                      id="PRO_0000037644" evidence="4">
  <location>
    <begin position="1027"/>
    <end position="1657"/>
  </location>
</feature>
<feature type="chain" description="Non-structural protein 4A"
                      id="PRO_0000037645" evidence="4">
  <location>
    <begin position="1658"/>
    <end position="1711"/>
  </location>
</feature>
<comment type="cofactor">
  <molecule>Serine protease NS3</molecule>
  <cofactor evidence="14">
    <name>Zn(2+)</name>
    <dbReference type="ChEBI" id="CHEBI:29105"/>
  </cofactor>
  <text evidence="14">Binds 1 zinc ion.</text>
</comment>
<comment type="cofactor">
  <molecule>Non-structural protein 5A</molecule>
  <cofactor evidence="3">
    <name>Zn(2+)</name>
    <dbReference type="ChEBI" id="CHEBI:29105"/>
  </cofactor>
  <text evidence="3">Binds 1 zinc ion in the NS5A N-terminal domain.</text>
</comment>

featureType

The featureType has a mandatory location element of locationType to describe the sequence coordinates of an annotation.

Example: P84077

<feature type="initiator methionine" description="Removed" evidence="6 7 21 22">
  <location>
    <position position="1"/>
  </location>
</feature>
<feature type="chain" description="ADP-ribosylation factor 1" id="PRO_0000207378">
  <location>
    <begin position="2"/>
    <end position="181"/>
  </location>
</feature>
...
<feature type="nucleotide phosphate-binding region" description="GTP" evidence="1 2 3 4 25">
  <location>
    <begin position="126"/>
    <end position="129"/>
  </location>
</feature>
...
<feature type="sequence variant" description="In PVNH8; decreased interaction with GGA3; dbSNP:rs879036238." id="VAR_081272" evidence="23">
  <original>Y</original>
  <variation>H</variation>
  <location>
    <position position="35"/>
  </location>
</feature>
...
<feature type="helix" evidence="1">
  <location>
    <begin position="6"/>
    <end position="9"/>
  </location>
</feature>

The locationType has an optional sequence attribute that is only set (to an isoform ID) when the coordinates are not for the canonical sequence.

Example: P12821

Previous representation:

<comment type="alternative products">
  ...
  <isoform>
    <id>P12821-3</id>
    <id>P22966-1</id>
    <name>Testis-specific</name>
    <name>ACE-T</name>
    <sequence type="described" ref="VSP_035120 VSP_035121"/>
    <text>Variant in position: 32:S->P (in dbSNP:rs4317). Variant in position: 49:S->G (in dbSNP:rs4318).</text>
  </isoform>
  ...
</comment>
...
<feature type="sequence variant" description="In dbSNP:rs13306087." id="VAR_029139">
  <original>A</original>
  <variation>T</variation>
  <location>
    <position position="154"/>
  </location>
</feature>

New representation:

<comment type="alternative products">
  ...
  <isoform>
    <id>P12821-3</id>
    <id>P22966-1</id>
    <name>Testis-specific</name>
    <name>ACE-T</name>
    <sequence type="described" ref="VSP_035120 VSP_035121"/>
  </isoform>
  ...
</comment>
...
<feature type="sequence variant" description="In dbSNP:rs13306087." id="VAR_029139">
  <original>A</original>
  <variation>T</variation>
  <location>
    <position position="154"/>
  </location>
</feature>
...
<feature type="sequence variant" description="In dbSNP:rs4317." id="VAR_x">
  <original>S</original>
  <variation>P</variation>
  <location sequence="P12821-3">
    <position position="32"/>
  </location>
</feature>
<feature type="sequence variant" description="In dbSNP:rs4318." id="VAR_y">
  <original>S</original>
  <variation>G</variation>
  <location sequence="P12821-3">
    <position position="49"/>
  </location>
</feature>
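
Code that reads sequence variants may need to check the sequence attribute of the location element to tell canonical from isoform coordinates. A minimal sketch with `xml.etree.ElementTree` follows; as before, it assumes un-namespaced elements under one root element, the function name is ours, and features located by a begin/end range are simply skipped.

```python
import xml.etree.ElementTree as ET


def variant_positions(xml_text):
    """Return (isoform_id, position, original, variation) per variant.

    isoform_id is None when the location has no sequence attribute,
    i.e. when the coordinates refer to the canonical sequence.
    """
    root = ET.fromstring(xml_text)
    variants = []
    for feature in root.iter("feature"):
        if feature.get("type") != "sequence variant":
            continue
        location = feature.find("location")
        position = location.find("position")
        if position is None:
            continue  # range locations are not handled in this sketch
        variants.append((
            location.get("sequence"),
            int(position.get("position")),
            feature.findtext("original"),
            feature.findtext("variation"),
        ))
    return variants
```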

RDF format

The UniProt RDF schema ontology already allowed the product to which an annotation applies to be described, so no changes were required for this purpose.

The RDF format has a single hierarchy of Annotation classes with various intermediary classes. The subclass Sequence_Annotation groups all classes that refer to a location on a protein sequence. This location is represented with FALDO and always indicates the FALDO reference sequence for the location (the RDF format makes no special case for a canonical sequence). Annotations that do not refer to a specific location on a protein sequence, but that apply to a given product, describe the sequence of this product with a sequence property. The object of this property may be a Sequence or a Chain_Annotation / Peptide_Annotation that describes a sequence that is the product of proteolytic processing.

Please note that the change of mass spectrometry annotations required an adaptation of the hierarchy of Annotation classes: The Mass_Spectrometry_Annotation class is no longer an rdfs:subClassOf of the Sequence_Annotation class, but a direct rdfs:subClassOf of the Annotation class.

Example: Q13421

@prefix up: <http://purl.uniprot.org/core/> .
@prefix uniprot: <http://purl.uniprot.org/uniprot/> .
@prefix isoform: <http://purl.uniprot.org/isoforms/> .
@prefix annotation: <http://purl.uniprot.org/annotation/> .
@prefix faldo: <http://biohackathon.org/resource/faldo#> .

uniprot:Q13421
  up:annotation
    annotation:PRO_0000253560 ,
    <Q13421#SIPADAC7D651EFC09CC> ,
    <Q13421#SIP307BEB951103B073> ,
    <Q13421#SIPB6746E472B99B031> ,
    ...
  up:sequence
    isoform:Q13421-1 ,
    isoform:Q13421-3 ,
    isoform:Q13421-2 ,
    isoform:Q13421-4 .

annotation:PRO_0000253560
  rdf:type up:Chain_Annotation ;
  rdfs:comment "Megakaryocyte-potentiating factor" ;
  up:range range:22853569102360878tt37tt286 .
range:22853569102360878tt37tt286
  rdf:type faldo:Region ;
  faldo:begin position:22853569102360878tt37 ;
  faldo:end position:22853569102360878tt286 .
position:22853569102360878tt37
  rdf:type faldo:Position , faldo:ExactPosition ;
  faldo:position 37 ;
  faldo:reference isoform:Q13421-1 .
position:22853569102360878tt286
  rdf:type faldo:Position , faldo:ExactPosition ;
  faldo:position 286 ;
  faldo:reference isoform:Q13421-1 .

<Q13421#SIPADAC7D651EFC09CC>
  rdf:type up:Subcellular_Location_Annotation ;
  up:locatedIn <Q13421#SIP04927440DF8EB941> ,
               <Q13421#SIP727DF431EB6C89EC> .

<Q13421#SIP307BEB951103B073>
  rdf:type up:Subcellular_Location_Annotation ;
  up:locatedIn <Q13421#SIPD59D33F5047A94FD> ;
  up:sequence annotation:PRO_0000253560 .

<Q13421#SIPB6746E472B99B031>
  rdf:type up:Subcellular_Location_Annotation ;
  up:locatedIn <Q13421#SIPD59D33F5047A94FD> ;
  up:sequence isoform:Q13421-2 .

isoform:Q13421-1
  rdf:type up:Simple_Sequence ;
  up:modified "2006-10-17"^^xsd:date ;
  up:version 2 ;
  up:precursor true ;
  up:mass 68986 ;
  up:crc64Checksum "FA17E3609B6CC9CA"^^xsd:token ;
  up:name "1" ;
  rdf:value "MALPTARPLLGSCGTPALGSLLFLLFSL ... LLASTLA" .
isoform:Q13421-3
  rdf:type up:Modified_Sequence ;
  up:name "2" ;
  up:basedOn isoform:Q13421-1 ;
  up:modification annotation:VSP_021059 ;
  rdf:value "MALPTARPLLGSCGTPALGSLLFLLFSL ... LLASTLA" .
isoform:Q13421-2
  rdf:type up:Modified_Sequence ;
  up:name "3" , "SMRP" ;
  up:basedOn isoform:Q13421-1 ;
  up:modification annotation:VSP_021059 , annotation:VSP_021060 ;
  rdf:value "MALPTARPLLGSCGTPALGSLLFLLFSL ... LRAPLPC" .
isoform:Q13421-4
  rdf:type up:Modified_Sequence ;
  up:name "4" ;
  up:basedOn isoform:Q13421-1 ;
  up:modification annotation:VSP_021058 , annotation:VSP_021059 ;
  rdf:value "MALPTARPLLGSCGTPALGSLLFLLFSL ... LLASTLA" .

Cross-references to RNAct

Cross-references have been added to RNAct, a database of protein-RNA interaction predictions for model organisms with supporting experimental data.

RNAct is available at https://rnact.crg.eu.

The format of the explicit links is:

Resource abbreviation RNAct
Resource identifier UniProtKB accession number
Optional information 1 Molecule type

Example: Q9Y2I1

Show all entries having a cross-reference to RNAct.

Text format

Example: Q9Y2I1

DR   RNAct; Q9Y2I1; protein.
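
A DR line of this shape can be split into the resource abbreviation, the identifier and any optional information fields. The sketch below is an illustration only: the function name `parse_dr` is ours, and isoform-specific cross-references are not handled.

```python
def parse_dr(line):
    """Split a DR line into (database, identifier, optional_fields)."""
    if not line.startswith("DR   "):
        raise ValueError(f"not a DR line: {line!r}")
    body = line[5:].rstrip()
    if body.endswith("."):
        body = body[:-1]  # drop the terminating full stop
    database, identifier, *optional = [field.strip() for field in body.split(";")]
    return database, identifier, optional
```

For the line above, the result is the database `RNAct`, the identifier `Q9Y2I1` and a single optional field, `protein`.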

XML format

Example: Q9Y2I1

<dbReference type="RNAct" id="Q9Y2I1">
   <property type="molecule type" value="protein"/>
</dbReference>

RDF format

Example: Q9Y2I1

uniprot:Q9Y2I1
  rdfs:seeAlso <http://purl.uniprot.org/rnact/Q9Y2I1> .
<http://purl.uniprot.org/rnact/Q9Y2I1>
  rdf:type up:Resource ;
  up:database <http://purl.uniprot.org/database/RNAct> ;
  rdfs:comment "protein" .

Change of the cross-references to Pharos

We have introduced an additional field in the cross-references to the Pharos database to indicate the development status of a target. Targets are categorized into four development/druggability levels (TDLs), ranging from Tclin for approved drugs with known mechanisms of action, to Tdark for targets about which virtually nothing is known.

Text format

Example: P33151

DR   Pharos; P33151; Tbio.

XML format

Example: P33151

<dbReference type="Pharos" id="P33151">
  <property type="development level" value="Tbio"/>
</dbReference>

This change does not affect the XSD, but may nevertheless require code changes.
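
One way to keep client code robust against added property elements such as this one is to read all properties of a dbReference generically rather than by position. A minimal sketch, assuming an un-namespaced dbReference element passed as a string; the function name is hypothetical:

```python
import xml.etree.ElementTree as ET


def db_reference_properties(xml_text):
    """Return (type, id, properties) for one <dbReference> element.

    properties maps each property type to its value, so newly
    introduced property types appear as extra dictionary keys.
    """
    element = ET.fromstring(xml_text)
    properties = {prop.get("type"): prop.get("value")
                  for prop in element.findall("property")}
    return element.get("type"), element.get("id"), properties
```

Code that looks up `properties.get("development level")` would then continue to work on older files, where the key is simply absent.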

RDF format

Example: P33151

uniprot:P33151
  rdfs:seeAlso <http://purl.uniprot.org/pharos/P33151> .
<http://purl.uniprot.org/pharos/P33151>
  rdf:type up:Resource ;
  up:database <http://purl.uniprot.org/database/Pharos> ;
  rdfs:comment "Tbio" .

Removal of the cross-references to PMAP-CutDB

Cross-references to PMAP-CutDB have been removed.

Changes to the controlled vocabulary of human diseases

New diseases:

Changes to the controlled vocabulary for PTMs

New term for the feature key 'Cross-link' ('CROSSLNK' in the flat file):

  • 6-(S-cysteinyl)-8alpha-(pros-histidyl)-FAD (Cys-His)

Changes to keywords

Deleted keyword:

  • Complete proteome

Proteomes changes

The UniProt Proteomes portal offers protein sequence sets obtained from the translation of sequenced genomes. Published genomes from NCBI Genome used to be brought into UniProt if they satisfied the following criteria:

  • The genome is annotated and a set of coding sequences is available.
  • The number of predicted coding sequences falls within a statistically significant range of published proteomes from neighbouring species.

We have changed these criteria to publish all proteomes that can be derived from NCBI genomes that are not considered to be low quality assemblies. We now use a subset of the RefSeq reasons to exclude a genome assembly to determine which proteomes to bring into UniProtKB and we give the reason(s) why a proteome is excluded from UniProtKB. We also provide two metrics to help users to assess the quality of a proteome:

  • A score obtained with the BUSCO software.
  • A score based on the number of coding sequences expected based on neighbouring species, "Complete Proteome Detector (CPD)".

The "Complete proteome" keyword was removed from all UniProtKB entries. Individual proteomes can be retrieved from the UniProt website by their unique proteome identifier, e.g. UP000005640.

UniProt release 2019_10

Published November 13, 2019

Headline

A scorpion venom toxin may help unravel the mystery of chronic pain

The old saying goes 'an ounce of prevention is worth a pound of cure', and indeed, our body has developed various strategies to alert us to potential dangers. One contributor to this strategy is TRPA1, also called the 'wasabi receptor'. TRPA1, a member of the transient receptor potential (TRP) family, is a plasma membrane cation channel expressed by primary afferent sensory neurons. It is activated by chemically reactive electrophiles present in a range of environmental irritants and endogenous inflammatory agents. Cigarette smoke, for example, is rich in reactive electrophiles that can trigger TRPA1 in the cells that line the airways, inducing coughing and sustained airway inflammation. Some plants, such as mustard, wasabi or onions, have evolved compounds that activate TRPA1, possibly to ward off animals that might otherwise eat them. In this context, TRPA1 activation is responsible for the sinus-jolting sting of wasabi and the flood of tears associated with chopping onions.

Plants are not the only organisms that produce TRPA1-activating compounds. Black rock scorpions do too, as has been reported in a recent publication by Lin King et al. This comes as a surprise. Most animal toxins identified so far target voltage-gated ion channels, and the few known to act on TRP channels all activate the capsaicin receptor, TRPV1. The newly discovered black rock scorpion toxin has been called Wasabi receptor toxin or WaTx. In its mature form, it is a 19 amino acid-long peptide, which has the amazing ability to penetrate cells by passive diffusion. This property is not unique to WaTx; other proteins, such as HIV Tat or Drosophila penetratin, also share it, but WaTx does not have any sequence similarity to them.

Once in the cell, WaTx binds TRPA1 at the same site as plant and environmental irritants, but the similarity ends there. Reactive electrophiles covalently bind TRPA1 and produce a large increase in the probability of channel opening characterized by brief transitions between open and closed states. This results in the influx of sodium and calcium ions. The influx of Ca(2+), in turn, causes the exocytosis of dense-core vesicles, the release of calcitonin-gene-related peptide (CGRP) and substance P, and ultimately induces neurogenic inflammation. WaTx non-covalent binding to TRPA1 stabilizes the open state of the channel and prolongs open time. Consequently, it induces neuronal depolarization and subsequent hypersensitivities, which are characteristic of chronic pain. In addition, it decreases the relative Ca(2+)-permeability of the channel. The Ca(2+) influx is not sufficient to trigger CGRP release and does not cause any inflammation. These observations show a striking convergent evolution between plants and animals in terms of binding site, resulting, however, in a very different modulation of cation channel activity and a distinct outcome in terms of inflammation.

TRPA1 is expressed in virtually every animal, from worms to humans, but WaTx only activates mammalian orthologs. Why so? It is difficult to say. Black rock scorpions feed on insects like cockroaches and beetles, as well as other small invertebrates such as millipedes, centipedes, spiders and rarely earthworms, but never mammals. Therefore, WaTx may have a deterrent role aimed specifically at mammalian predators.

One thing is certain: with WaTx, scorpions provide us with a powerful tool to study the central neural pathways contributing to chronic pain and to investigate the link between chronic pain and inflammation. TRPA1 is emerging as a potential target for new classes of non-opioid analgesics to treat chronic pain.

As of this release, WaTx has been annotated and is painlessly available in UniProtKB/Swiss-Prot.

UniProtKB news

Removal of the cross-references to EcoGene

Cross-references to EcoGene have been removed.

Change of the cross-references to DisProt

Cross-references to DisProt may now be isoform-specific. The general format of isoform-specific cross-references was described in release 2014_03.

Example: Q9NQC3

Changes to the controlled vocabulary of human diseases

New diseases:

Changes in subcellular location controlled vocabulary

New subcellular locations:

UniProt release 2019_09

Published October 16, 2019

Headline

Biological weapons in the struggle for life

The apoplast, the extracellular space between plant cells, is a battleground between plants and attacking microorganisms. The battle often starts with the secretion of carbohydrate-degrading enzymes by the invader and a counter-attack by the plant with dedicated enzyme inhibitors. This happens for instance when Phytophthora sojae invades soybean. During very early infection stages (20 minutes to 2 hours), the oomycete expresses and secretes high levels of a xyloglucanase enzyme, called XEG1. XEG1 degrades the plant cell wall polymers and not only permits pathogen invasion, but also provides it with nutrients. XEG1 expression declines rapidly from 3 hours onwards.

Like other plants, soybean monitors the apoplastic environment through pattern recognition receptors and recognizes XEG1 as a PAMP (pathogen-associated molecular pattern). This recognition triggers defense responses, which include the secretion of GIP1, a xyloglucanase inhibitor. Efficient XEG1 inhibition by GIP1 increases soybean resistance towards P. sojae. This could be the end of the story, but P. sojae has another trick up its sleeve.

When Ma et al. studied GIP1 binding to XEG1 paralogs, they retrieved only one protein, called XLP1, in addition to XEG1. Like XEG1, XLP1 is targeted to the apoplast. It also shows a similar expression time course to XEG1, i.e. high levels during very early infection stages, followed by a rapid decline, suggesting a role in virulence. Unlike XEG1, however, XLP1 does not show any xyloglucanase activity. It is 52 residues shorter and is missing one of the residues critical for xyloglucanase activity. XLP1 contributes to P. sojae infectivity, but only in the presence of active XEG1. Thus XLP1 acts as a decoy to disrupt plant defenses. It interacts with GIP1 with a five-fold higher affinity than that of XEG1 and hence neutralizes the GIP1 inhibitor. In this setting, XEG1 can pursue plant cell wall digestion without any hindrance.

The 3 belligerent proteins have been annotated in UniProtKB/Swiss-Prot and are publicly available.

UniProtKB news

Change of annotation topic 'Sequence caution'

The annotation topic Sequence caution reports differences between the protein sequence shown in a UniProtKB entry and other available protein sequences derived from the same gene. It indicates the likely cause for the differences, and when that cause is a frameshift or erroneous termination, the amino acid sequence position(s) of these errors were listed when possible. Since it is nowadays easy to align two protein sequences for comparison, we no longer curate error positions and removed the field where this information was stored.

Text format

We removed the optional Positions field.

Example: P14332

Previous format:

CC   -!- SEQUENCE CAUTION: Sequence=CAA34633.1; Type=Frameshift; Positions=226, 249; Evidence={ECO:0000305};

New format:

CC   -!- SEQUENCE CAUTION: Sequence=CAA34633.1; Type=Frameshift; Evidence={ECO:0000305};
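A parser that has to read files from both before and after this release can treat the Positions field as optional. The Python sketch below is a minimal illustration: it assumes the comment has already been joined into a single line (real flat files wrap long CC lines), and the function name parse_sequence_caution is made up for this example.

```python
def parse_sequence_caution(line: str) -> dict:
    """Parse one 'CC   -!- SEQUENCE CAUTION:' line into a dict of its fields."""
    marker = "SEQUENCE CAUTION:"
    start = line.find(marker)
    if start == -1:
        raise ValueError("not a SEQUENCE CAUTION line")
    fields = {}
    # Fields are 'Key=Value' pairs separated by semicolons.
    for part in line[start + len(marker):].split(";"):
        part = part.strip()
        if not part:
            continue
        key, _, value = part.partition("=")
        fields[key.strip()] = value.strip()
    # 'Positions' is only present in files from before this release.
    # Values are kept as strings; an absent field becomes an empty list.
    positions = fields.pop("Positions", "")
    fields["Positions"] = [p.strip() for p in positions.split(",") if p.strip()]
    return fields
```

With this approach, both the previous and the new line of the P14332 example yield the same keys, and only the content of the Positions list differs.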

XML format

This change did not affect the UniProtKB XSD.

Example: P14332

Previous format:

<comment type="sequence caution" evidence="3">
  <conflict type="frameshift">
    <sequence resource="EMBL-CDS" id="CAA34633" version="1"/>
  </conflict>
  <location>
    <position position="226"/>
  </location>
  <location>
    <position position="249"/>
  </location>
</comment>

New format:

<comment type="sequence caution" evidence="3">
  <conflict type="frameshift">
    <sequence resource="EMBL-CDS" id="CAA34633" version="1"/>
  </conflict>
</comment>
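Code that reads the XML representation can handle both variants by treating the location elements as optional. The following Python sketch parses the fragment exactly as shown above. It assumes a stand-alone fragment without an XML namespace; complete UniProtKB XML files declare a default namespace, so element names would need to be qualified there. The function name read_sequence_caution is hypothetical.

```python
import xml.etree.ElementTree as ET


def read_sequence_caution(xml_fragment: str) -> dict:
    """Extract the conflict details from a 'sequence caution' comment element."""
    comment = ET.fromstring(xml_fragment)
    conflict = comment.find("conflict")
    sequence = conflict.find("sequence")
    return {
        "type": conflict.get("type"),
        "resource": sequence.get("resource"),
        "id": sequence.get("id"),
        "version": sequence.get("version"),
        # Empty for files in the new format, which no longer carry locations.
        "positions": [
            int(p.get("position")) for p in comment.findall("location/position")
        ],
    }
```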

RDF format

This change required an adaptation of the hierarchy of Annotation classes in the UniProt RDF schema ontology: The Sequence_Caution_Annotation class is no longer an rdfs:subClassOf of the Sequence_Annotation class, but a direct rdfs:subClassOf of the Annotation class.

Example: P14332

Previous format:

uniprot:P14332
  up:annotation <P14332#SIP7159608509D280BB> .

<P14332#SIPBBAF3CC29FCD3715>
  rdf:type up:Frameshift_Annotation ;
  up:conflictingSequence <P14332#SIPD8B2EDFEB46FA203> ;
  up:range range:22572098403906094tt226tt226 ,
           range:22572098403906094tt249tt249 .
range:22572098403906094tt226tt226
  rdf:type faldo:Region ;
  faldo:begin position:22572098403906094tt226 ;
  faldo:end position:22572098403906094tt226 .
position:22572098403906094tt226
  rdf:type faldo:Position , faldo:ExactPosition ;
  faldo:position 226 ;
  faldo:reference isoform:P14332-1 .
range:22572098403906094tt249tt249
  rdf:type faldo:Region ;
  faldo:begin position:22572098403906094tt249 ;
  faldo:end position:22572098403906094tt249 .
position:22572098403906094tt249
  rdf:type faldo:Position , faldo:ExactPosition ;
  faldo:position 249 ;
  faldo:reference isoform:P14332-1 .

New format:

uniprot:P14332
  up:annotation <P14332#SIP7159608509D280BB> .

<P14332#SIPBBAF3CC29FCD3715>
  rdf:type up:Frameshift_Annotation ;
  up:conflictingSequence <P14332#SIPD8B2EDFEB46FA203> .

Cross-references to PlantReactome

Cross-references have been added to PlantReactome, a curated resource of core pathways and reactions in plant biology.

PlantReactome is available at https://plantreactome.gramene.org

The format of the explicit links is:

Resource abbreviation PlantReactome
Resource identifier Resource identifier
Optional information Pathway name

Example: P0C128

Show all entries having a cross-reference to PlantReactome.

Text format

Example: P0C128

DR   PlantReactome; R-OSA-5608118; Auxin signalling.
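As an illustration of how such a line breaks down into resource abbreviation, resource identifier and optional information, here is a minimal Python sketch. It splits naively on semicolons, which is an assumption that holds for the examples in these release notes but may not hold for every cross-referenced resource; the function name parse_dr_line is made up for this example.

```python
def parse_dr_line(line: str) -> tuple[str, str, list[str]]:
    """Split a flat-file DR line into (database, identifier, optional fields)."""
    if line[:2] != "DR":
        raise ValueError("not a DR line")
    body = line[2:].strip()
    # Drop the full stop that terminates the line.
    if body.endswith("."):
        body = body[:-1]
    parts = [p.strip() for p in body.split(";")]
    database, identifier, *extras = parts
    # A lone '-' marks an empty optional field.
    extras = [e for e in extras if e != "-"]
    return database, identifier, extras
```

For the PlantReactome example above, the optional field carries the pathway name; for resources without optional information, the returned list is empty.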

XML format

Example: P0C128

<dbReference type="PlantReactome" id="R-OSA-5608118">
  <property type="pathway name" value="Auxin signalling"/>
</dbReference>
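The same information can be read from the XML form with a standard XML parser. This Python sketch works on the stand-alone fragment shown above (no namespace, which is an assumption that does not hold for complete UniProtKB XML files); read_db_reference is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET


def read_db_reference(xml_fragment: str) -> tuple[str, str, dict]:
    """Return (database, identifier, properties) for one dbReference element."""
    ref = ET.fromstring(xml_fragment)
    # Optional information is stored as child 'property' elements.
    properties = {p.get("type"): p.get("value") for p in ref.findall("property")}
    return ref.get("type"), ref.get("id"), properties
```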

RDF format

Example: P0C128

uniprot:P0C128
  rdfs:seeAlso <http://purl.uniprot.org/plantreactome/R-OSA-5608118> .
<http://purl.uniprot.org/plantreactome/R-OSA-5608118>
  rdf:type up:Resource ;
  up:database <http://purl.uniprot.org/database/PlantReactome> ;
  rdfs:comment "Auxin signalling" .
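To make the shape of these triples explicit, the Python sketch below rebuilds the Turtle snippet from its parts. The URI pattern (lower-cased resource abbreviation followed by the identifier) is inferred from the examples in these release notes and is an assumption, not a documented rule; the function name xref_to_turtle is hypothetical, and no escaping of quotes in the comment is attempted.

```python
def xref_to_turtle(accession, database, identifier, comment=None):
    """Render a cross-reference as Turtle, following the examples in this document."""
    # Assumed pattern: http://purl.uniprot.org/<lower-cased database>/<identifier>
    resource = f"<http://purl.uniprot.org/{database.lower()}/{identifier}>"
    lines = [
        f"uniprot:{accession}",
        f"  rdfs:seeAlso {resource} .",
        resource,
        "  rdf:type up:Resource ;",
    ]
    db_line = f"  up:database <http://purl.uniprot.org/database/{database}>"
    if comment is None:
        lines.append(db_line + " .")
    else:
        # Optional information (here the pathway name) becomes an rdfs:comment.
        lines.append(db_line + " ;")
        lines.append(f'  rdfs:comment "{comment}" .')
    return "\n".join(lines)
```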

Change of the cross-references to Reactome

Cross-references to Reactome may now be isoform-specific. The general format of isoform-specific cross-references was described in release 2014_03.

Example: P00167

Changes to the controlled vocabulary of human diseases

New diseases:

Modified diseases:

Deleted diseases

  • Mental retardation, X-linked 17

Changes to the controlled vocabulary for PTMs

New terms for the feature key 'Cross-link' ('CROSSLNK' in the flat file):

  • N5-[4-(S-L-cysteinyl)-5-methyl-1H-imidazol-2-yl]-L-ornithine (Arg-Cys) (interchain with C-...)
  • N5-[4-(S-L-cysteinyl)-5-methyl-1H-imidazol-2-yl]-L-ornithine (Cys-Arg) (interchain with R-...)

New term for the feature key 'Glycosylation' ('CARBOHYD' in the flat file):

  • N-linked (Glc) (glycation) arginine

New terms for the feature key 'Modified residue' ('MOD_RES' in the flat file):

  • S-(2-succinyl)cysteine
  • N6-carbamoyllysine
  • S-(2,3-dicarboxypropyl)cysteine
  • S-cGMP-cysteine

Changes in subcellular location controlled vocabulary

New subcellular locations:

UniProt release 2019_08

Published September 18, 2019

Headline

Magnetic personalities

Magnetotactic bacteria sense and align to the Earth's magnetic field, swimming north in the Northern Hemisphere and south in the Southern in the presence of oxygen. This amazing ability to sense the Earth's magnetic field is provided by small organelles, called magnetosomes, formed by iron nanocrystals of either magnetite (Fe3O4) or greigite (Fe3S4) surrounded by a phospholipid bilayer. Generally, magnetosomes form chains that align along the long axis of the cell using a dedicated actin-like cytoskeletal structure.

[Image. Source: Frank Mickoleit, CC BY-SA 3.0]

Magnetosome formation is a complex process, which includes invagination of the cell inner membrane to form vesicles, iron ion uptake, crystal biomineralization and magnetosome chain assembly. It involves a large number of proteins, encoded by genes clustered in an approximately 100 kb magnetosome island. Among all proteins involved in magnetosome formation, one of the most important is MamB, a probable iron transporter with a role in both vesicle formation and biomineralization. MamB is stabilized by heterodimerization with MamM. Studies in genetically tractable Magnetospirillum magneticum (strain AMB-1) and Magnetospirillum gryphiswaldense (strain MSR) have pinpointed the function of many more proteins. For instance, MamA forms a scaffold to which other proteins attach on the organelle's exterior. MamI aids in magnetite nucleation, while MamH is another probable iron transporter. MamN may control the pH of the magnetosome lumen. Four redox-active multi-heme proteins (MamP, MamT, MamX and MamE) are probably involved in correct iron oxidation; the last of these, MamE, is also a protease necessary for magnetosome protein maturation. Some proteins positively regulate crystal size (including MamC, MamD, MamG and MamF), while others negatively regulate it (Mms36 and Mms48). Finally, MamK is an actin-like protein involved in organelle positioning, along with MamJ.

The interest in magnetosomes goes far beyond the understanding of these fascinating bacteria. Magnetosomes may be instrumental for the improvement of magnetic nanoparticle biotechnologies. Purified bacterial magnetosomes represent magnetic nanoparticles with exceptionally well-defined characteristics, owing to the precise control that is exerted during all stages of biogenesis, and several unprecedented properties, such as high crystallinity, strong magnetization, and a uniform distribution of shape and size that cannot be replicated by synthesis using abiotic processes. In the biomedical field, promising results suggest that magnetosomes could be used in medical imaging, targeted drug delivery and tumor hyperthermia. In the context of wastewater treatment, it has been shown that heavy metal ions can be adsorbed onto magnetosome-producing microorganisms and then removed by magnetic separation. In addition, the study of magnetosomes may help us learn more about the origin of life and the evolution of membrane-bound eukaryotic organelles.

As of this release nearly 70 magnetosome proteins have been annotated and can be retrieved using the term magnetosome.

UniProtKB news

Cross-references to DrugCentral

Cross-references have been added to DrugCentral, an online drug information resource providing information on active ingredients, chemical entities, pharmaceutical products, drug mode of action, indications and pharmacologic action.

DrugCentral is available at http://drugcentral.org.

The format of the explicit links is:

Resource abbreviation DrugCentral
Resource identifier UniProtKB accession number

Example: P35372

Show all entries having a cross-reference to DrugCentral.

Text format

Example: P35372

DR   DrugCentral; P35372; -.

XML format

Example: P35372

<dbReference type="DrugCentral" id="P35372"/>

RDF format

Example: P35372

uniprot:P35372
  rdfs:seeAlso <http://purl.uniprot.org/drugcentral/P35372> .
<http://purl.uniprot.org/drugcentral/P35372>
  rdf:type up:Resource ;
  up:database <http://purl.uniprot.org/database/DrugCentral> .

Cross-references to Pharos

Cross-references have been added to Pharos, a user interface to the knowledge-base for the Druggable Genome (DG), whose goal is to illuminate the uncharacterized and/or poorly annotated portion of the DG, focusing on three of the most commonly drug-targeted protein families: G-protein-coupled receptors (GPCRs), ion channels (ICs) and kinases.

Pharos is available at https://pharos.nih.gov.

The format of the explicit links is:

Resource abbreviation Pharos
Resource identifier UniProtKB accession number

Example: Q7Z3E2

Show all entries having a cross-reference to Pharos.

Text format

Example: Q7Z3E2

DR   Pharos; Q7Z3E2; -.

XML format

Example: Q7Z3E2

<dbReference type="Pharos" id="Q7Z3E2"/>

RDF format

Example: Q7Z3E2

uniprot:Q7Z3E2
  rdfs:seeAlso <http://purl.uniprot.org/pharos/Q7Z3E2> .
<http://purl.uniprot.org/pharos/Q7Z3E2>
  rdf:type up:Resource ;
  up:database <http://purl.uniprot.org/database/Pharos> .

Cross-references to MassIVE

Cross-references have been added to MassIVE, a community resource developed by the NIH-funded Center for Computational Mass Spectrometry to promote the global, free exchange of mass spectrometry data and to provide a reusable, community-scale aggregation of peptide and protein observations.

MassIVE is available at https://massive.ucsd.edu/.

The format of the explicit links is:

Resource abbreviation MassIVE
Resource identifier UniProtKB accession number

Example: Q8IY92

Show all entries having a cross-reference to MassIVE.

Text format

Example: Q8IY92

DR   MassIVE; Q8IY92; -.

XML format

Example: Q8IY92

<dbReference type="MassIVE" id="Q8IY92"/>

RDF format

Example: Q8IY92

uniprot:Q8IY92
  rdfs:seeAlso <http://purl.uniprot.org/massive/Q8IY92> .
<http://purl.uniprot.org/massive/Q8IY92>
  rdf:type up:Resource ;
  up:database <http://purl.uniprot.org/database/MassIVE> .

Changes to the controlled vocabulary of human diseases

New diseases:

Changes in subcellular location controlled vocabulary

New subcellular locations:

Modified subcellular location:

UniRef news

Change of UniRef clustering method from CD-HIT to MMseqs2

We have switched the clustering program for UniRef90 and UniRef50 from CD-HIT to MMseqs2 (Steinegger M. and Soeding J., Nat. Commun. 9 (2018)).

The clustering algorithm remains "Greedy Incremental Clustering" with the same parameters (thanks to the MMseqs2 authors for making this available). UniRef100 was not affected.
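For readers unfamiliar with greedy incremental clustering, the idea can be sketched in a few lines of Python: sequences are processed from longest to shortest, and each one either joins the first existing cluster whose representative it resembles closely enough, or founds a new cluster. This is a toy illustration only. The similarity measure (difflib's ratio) is a stand-in for a real alignment-based sequence identity, the threshold is arbitrary, and none of it reflects how CD-HIT or MMseqs2 are actually implemented.

```python
from difflib import SequenceMatcher


def greedy_incremental_clustering(sequences, threshold):
    """Toy greedy incremental clustering; returns {representative: members}."""
    clusters = {}
    # Longest sequences are considered first, so they become representatives.
    for seq in sorted(sequences, key=len, reverse=True):
        for representative in clusters:
            # Stand-in similarity; real tools compute sequence identity
            # from alignments.
            similarity = SequenceMatcher(
                None, representative, seq, autojunk=False
            ).ratio()
            if similarity >= threshold:
                clusters[representative].append(seq)
                break
        else:
            # No representative was similar enough: start a new cluster.
            clusters[seq] = [seq]
    return clusters
```

The result depends on the processing order, which is why the approach is called greedy: a sequence is assigned to the first matching representative and never reassigned.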

UniProt XML news

Removal of whitespace characters in the XML amino acid sequence representations

The <sequence> elements of the UniProtKB, UniParc and UniRef XML representations formatted the amino acid sequence for historic reasons with spaces and newlines. These whitespace characters had to be removed before parsing with native XML tools. To avoid this complication we have removed all whitespace characters in the <sequence> elements, so that they contain only IUPAC amino acid codes.
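Code that must read XML files from both before and after this change can normalize the element text itself. The Python sketch below assumes a stand-alone sequence element without attributes or namespace, and the example residues used with it are invented; sequence_residues is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET


def sequence_residues(sequence_element_xml: str) -> str:
    """Return the amino acid sequence with all whitespace removed."""
    element = ET.fromstring(sequence_element_xml)
    # str.split() with no argument splits on any run of spaces or newlines,
    # so joining the pieces removes every whitespace character.
    return "".join((element.text or "").split())
```

For files in the new format the function returns the element text unchanged, so it can be applied unconditionally.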

UniProt release 2019_07

Published July 31, 2019

Headline

The enemy of my enemy is my friend

When prey encounters predator, escape is often the best form of defense. Unfortunately this is not an option for plants, who instead have evolved a range of chemical defenses to deter herbivores, such as insects and other arthropods. These chemical defenses include compounds that are toxic to herbivores (direct defense) as well as compounds that attract herbivore predators (indirect defense).

Indirect defense has been extensively studied in maize. Upon foliar damage by lepidopteran larvae, maize releases a complex volatile terpenoid mixture, which attracts parasitic wasps, like Cotesia marginiventris or Cotesia sesamiae. These wasps deposit eggs in the lepidopteran larvae, which leads to an understandable loss of appetite by the lepidopterans, and eventually their death, when the wasp finally emerges from its host. These volatile terpenoids can also 'prime' neighboring plants, causing them to increase the transcription of defense-related genes and respond faster and more vigorously to subsequent herbivore attacks.

The volatile terpenoid mixture produced by maize under lepidopteran attack includes one sesquiterpene of special interest, (E)-beta-caryophyllene (https://www.ebi.ac.uk/chebi/searchId.do?chebiId=CHEBI:10357). This same sesquiterpene is also produced by roots upon damage by root-feeding pests, such as western corn rootworm (Diabrotica virgifera virgifera). Here it attracts entomopathogenic nematodes, which also live parasitically inside the infected larvae. (E)-beta-caryophyllene is produced by the (E)-beta-caryophyllene synthase encoded by TPS23, which catalyzes the cyclization of farnesyl diphosphate to (E)-beta-caryophyllene. As expected for a gene involved in plant defense, TPS23 gene expression is tightly regulated. Herbivore-induced leaf damage causes increased expression in leaves, but not in roots, while attack by root-feeding pests increases expression in roots, but not in the shoots. TPS23 transcript levels correlate with the production of (E)-beta-caryophyllene and high-level expressing maize lines are more resistant to herbivores than low-level expressing ones.

(E)-beta-caryophyllene is just one of a host of chemical deterrents that plants deploy against herbivores and other pests. The UniProt plant annotation program aims to capture knowledge of the relevant biochemical pathways using Rhea and ChEBI, as well as many other aspects of plant biology. Find out more about maize TPS23 in the latest updated version of UniProtKB/Swiss-Prot.

UniProtKB news

Cross-references to ChEBI in the ptmlist.txt document file

The ptmlist.txt document, which is available by FTP and on the website, describes post-translational modifications (PTMs) annotated in the UniProt knowledgebase. This release sees the addition of optional cross-references from ptmlist.txt to ChEBI (Chemical Entities of Biological Interest), a freely available dictionary of molecular entities focused on small chemical compounds and derivatives, including modified residues.

Example:

ID   (3R)-3-hydroxyarginine
AC   PTM-0476
FT   MOD_RES
..
KW   Hydroxylation.
DR   ChEBI; CHEBI:78294.
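A small Python sketch shows how the new DR line could be picked up when reading an entry such as the one above. It assumes a single entry passed as text, with a two-letter line code followed by three spaces, as in the example; the function names are made up for this illustration.

```python
def parse_ptmlist_entry(text: str) -> dict:
    """Group the lines of one ptmlist.txt entry by their two-letter line code."""
    entry = {}
    for line in text.splitlines():
        code, _, value = line.partition("   ")
        if not value:
            # Skips blank lines and the '..' elision used in the example.
            continue
        entry.setdefault(code, []).append(value.strip())
    return entry


def chebi_ids(entry: dict) -> list:
    """Return the ChEBI identifiers found in the optional DR lines."""
    ids = []
    for value in entry.get("DR", []):
        database, _, identifier = value.partition(";")
        if database.strip() == "ChEBI":
            ids.append(identifier.strip().rstrip("."))
    return ids
```

Because the cross-reference is optional, chebi_ids returns an empty list for PTMs that have not yet been mapped.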

This new mapping to ChEBI will facilitate the integration of data on PTMs with knowledge of enzymatic reactions described in UniProt using the Rhea knowledgebase of biochemical reactions (itself built on ChEBI). The following query allows users to find enzymes in UniProtKB that are capable of creating the modified (3R)-3-hydroxyarginine residue (PTM-0476, CHEBI:78294):

annotation:(type:"catalytic activity" chebi:"(3R)-3-hydroxy-L-arginine residue [78294]")

We have currently mapped over 120 of the most common PTMs in UniProtKB to ChEBI and will continue to add new cross-references to ChEBI in forthcoming releases. This mapping of PTMs to ChEBI is part of our ongoing work on the standardization of knowledge of small molecule chemistry in UniProtKB that now covers enzyme cofactors and reactions as well as PTMs, and that will eventually extend to all small molecule protein interactions. We welcome your feedback on these current and future developments.

Retirement of UniProt decoy databases

Based on usage statistics, we decided to retire the UniProt decoy databases from our FTP site. If you wish to generate decoy databases from UniProt FASTA databases, you can use this software.

Please contact us if you have questions about this change.
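For orientation, one common way of building decoys is to reverse each protein sequence and mark the header with a prefix. The Python sketch below illustrates that general idea only; it is not necessarily what the software mentioned above does, and the DECOY_ prefix and the function name reverse_decoys are assumptions made for this example.

```python
def reverse_decoys(fasta_text: str, prefix: str = "DECOY_") -> str:
    """Return a FASTA string in which every sequence is reversed."""
    records = []
    header, chunks = None, []
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif line.strip():
            chunks.append(line.strip())
    if header is not None:
        records.append((header, "".join(chunks)))
    # Each decoy is written on a single sequence line.
    return "".join(f">{prefix}{h}\n{s[::-1]}\n" for h, s in records)
```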

Cross-references to NIAGADS

Cross-references have been added to the NIAGADS Alzheimer's GenomicsDB. NIAGADS is a searchable annotation resource that provides access to publicly available NIAGADS summary statistics datasets for Alzheimer's Disease (AD) and related neuropathologies.

NIAGADS is available at https://www.niagads.org/genomics/.

The format of the explicit links is:

Resource abbreviation NIAGADS
Resource identifier Resource identifier

Example: E9PDY4

Show all entries having a cross-reference to NIAGADS.

Text format

Example: E9PDY4

DR   NIAGADS; ENSG00000203710; -.

XML format

Example: E9PDY4

<dbReference type="NIAGADS" id="ENSG00000203710"/>

RDF format

Example: E9PDY4

uniprot:E9PDY4
  rdfs:seeAlso <http://purl.uniprot.org/niagads/ENSG00000203710> .
<http://purl.uniprot.org/niagads/ENSG00000203710>
  rdf:type up:Resource ;
  up:database <http://purl.uniprot.org/database/NIAGADS> .

Cross-references to CPTAC

Cross-references have been added to the CPTAC Assay Portal. CPTAC serves as a centralized public repository of "fit-for-purpose" multiplexed quantitative mass spectrometry-based proteomic targeted assays.

CPTAC is available at https://assays.cancer.gov/.

The format of the explicit links is:

Resource abbreviation CPTAC
Resource identifier Resource identifier

Example: P04083

Show all entries having a cross-reference to CPTAC.

Text format

Example: P04083

DR   CPTAC; CPTAC-311; -.

XML format

Example: P04083

<dbReference type="CPTAC" id="CPTAC-311"/>

RDF format

Example: P04083

uniprot:P04083
  rdfs:seeAlso <http://purl.uniprot.org/cptac/CPTAC-311> .
<http://purl.uniprot.org/cptac/CPTAC-311>
  rdf:type up:Resource ;
  up:database <http://purl.uniprot.org/database/CPTAC> .

Removal of the cross-references to H-InvDB

Cross-references to H-InvDB have been removed.

Changes to the controlled vocabulary of human diseases

New diseases:

Modified diseases:

UniProt release 2019_06

Published July 3, 2019

Headline

The three-peptide itch

Just like pain, itch is a signal provided to our brain that something is wrong, or potentially dangerous. Unlike pain, where our reflex is to withdraw from danger, an itch leads to scratching, in order to remove the irritating, potentially toxic agent, be it an insect or a chemical. Although itching in response to environmental cues is crucial for survival, chronic itch can be debilitating and severely impact the well-being of affected persons. It is known that itch is relayed from the skin, via the dorsal root ganglion neurons, to the second order neurons in the spinal cord that project to the brain. Several Mas-related G-protein coupled receptors (MRGPRs) have been identified as the primary targets of itch signals, and many of them, including MRGPRX1 (also called MrgprC11 in rodents) and Mrgpra3, are activated by the anti-malarial drug chloroquine.

However, the molecular and neural mechanisms of itch are not well elucidated, which is why the discovery of new tools to identify itch receptors and develop new drugs is extremely valuable. In this context, conotoxins have proven to be a gold mine. Conotoxins are produced by cone snails as part of an envenomation survival strategy for feeding and defense. They are short peptides (usually 10 to 30 amino acid residues), typically with one or more disulfide bonds. Many of them modulate the activity of ion channels and receptors with very high affinity and specificity. They can be highly selective between closely related receptor subtypes, therefore they could meet specific therapeutic needs with a reduced likelihood of side effects due to off-target drug effects.

In mice, the injection of Conus textile venom, but not that of C. geographus, induced a scratching reflex, which was accompanied by the activation of 89% of sensory neurons that were also sensitive to chloroquine. Two peptides, CNF-Tx1 and CNF-Tx2, were isolated from the C. textile venom gland and tested in vitro for their activity on MRGPRX1. In parallel, the same activity was measured for two additional peptides, CNF-Sr1 and CNF-Sr2, previously identified in C. spurius, and one, CNF-Vc1, from C. victoriae. Three of these peptides were able to activate MRGPRX1: CNF-Tx2 activated the human, but not the mouse ortholog, CNF-Sr1 activated the mouse, but not the human ortholog, and CNF-Vc1 activated both. CNF-Tx2 and CNF-Vc1 were then tested in a humanized mouse transgenic line, which has a knockout of the entire endogenous MRGPR cluster and expresses human MRGPRX1 in primary sensory neurons. In this setting, both peptides elicited a scratching reflex, further confirming that CNF-Tx2 and CNF-Vc1 act via the itch receptor MRGPRX1. Compared with the well-established MRGPRX1 agonist chloroquine, CNF-Tx2 and CNF-Vc1 were 600 times and 200 times more potent, respectively.

The five conopeptides have been updated in UniProtKB/Swiss-Prot and are publicly available.

UniProt website news

We have changed our BLAST default dataset from UniProtKB to "Reference proteomes + Swiss-Prot". You can still select UniProtKB or other options under "Target database" in the BLAST submission form.

UniProtKB news

Cross-references to ABCD

Cross-references have been added to the ABCD (AntiBodies Chemically Defined) Database, a manually curated depository of sequenced antibodies.

ABCD is available at https://web.expasy.org/abcd.

The format of the explicit links is:

Resource abbreviation ABCD
Resource identifier UniProtKB accession number

Example: P07766

Show all entries having a cross-reference to ABCD.

Text format

Example: P07766

DR   ABCD; P07766; -.

XML format

Example: P07766

<dbReference type="ABCD" id="P07766"/>

RDF format

Example: P07766

uniprot:P07766
  rdfs:seeAlso <http://purl.uniprot.org/abcd/P07766> .
<http://purl.uniprot.org/abcd/P07766>
  rdf:type up:Resource ;
  up:database <http://purl.uniprot.org/database/ABCD> .

Removal of the cross-references to ProDom

Cross-references to ProDom have been removed.

Changes to the controlled vocabulary of human diseases

New diseases:

Modified diseases:

Deleted diseases

  • Cardiomyopathy, familial hypertrophic 19
  • Myopathy, centronuclear, 3
  • Parkes Weber syndrome

Changes in subcellular location controlled vocabulary

New subcellular locations:

Changes to keywords

New keyword:

UniProt release 2019_05

Published June 5, 2019

Headline

Love's Labour (nearly) Lost

Actin is a globular multi-functional protein that forms microfilaments. It is probably one of the most abundant proteins in our cells, and in almost all eukaryotic cells. Actin plays crucial roles in many processes essential to life, such as cell migration and division, and muscle contraction. Actin undergoes several post-translational modifications thought to control its cellular functions. Among them is histidine methylation, a rare modification in vertebrates that affects only a few proteins, but which was reported to occur in actin over a half-century ago. Two recent publications revealed the identity of the enzyme catalyzing this methylation, namely a SET domain-containing protein called SETD3. This is not only the first actin methylase, but also the first histidine methylase to be identified in vertebrates.

SETD3 had previously been reported to be a histone methyltransferase that modifies essentially lysine residues. This observation was consistent with the well-established role of SET domain-containing proteins in histone methylation on lysine residues. Consistent, yes, but erroneous! SETD3 actually methylates only actin and only at histidine-73 (His-73). Structural studies showed that the catalytic pocket so perfectly fits the actin peptide, with an extensive network of interactions, that accommodation of divergent sequences may be quite inefficient. From a functional point of view, His-73 methylation modestly accelerates the assembly of actin filaments and somewhat reduces the nucleotide-exchange rate on actin monomers. SETD3 knockout mice are viable and overall healthy, in spite of several moderate phenotypes, including some skeletal muscle myopathy, abnormal electrocardiogram and mildly decreased lean mass. So what? One could think that SETD3-catalyzed histidine methylation of actin is not so important after all, if not for the observation that litter sizes of homozygous knockout females are significantly smaller than litters from wild-type or heterozygous females. This is due to incomplete delivery, with fetuses remaining in utero. The mutant females experience uterus contraction problems that are not improved by oxytocin administration. In vitro, SETD3-depleted human myometrial cells also show impeded contractions in response to oxytocin and endothelin-1/EDN1, supporting a role for His-73 methylation in uterine smooth muscle cell contraction during parturition.

The UniProtKB/Swiss-Prot SETD3 entries have been updated and are publicly available as of this release.

UniProtKB news

Changes to the controlled vocabulary of human diseases

New diseases:

Modified diseases:

Deleted diseases

  • Brain small vessel disease with or without ocular anomalies
  • Macrocephaly, macrosomia, facial dysmorphism syndrome

UniProt release 2019_04

Published May 8, 2019

Headline

A pox on your messenger

Millions of years of coexistence between viruses and their hosts have imposed a selection pressure on hosts to develop the most efficient defense systems possible, while viruses evolve more and more complex strategies to escape them. New findings from James B. Eaglesham and co-workers describe how viruses block second messenger production to evade innate immunity.

The innate immune system is the first line of host defense; it relies on pattern recognition receptors (PRRs) to detect pathogen-specific molecules. The enzyme 2',3'-cyclic GMP-AMP (cGAMP) synthase, cGAS, is a PRR that recognizes cytosolic double-stranded DNA, a marker for viral infection, and catalyzes the formation of cGAMP. cGAMP in turn activates STING (also known as TMEM173) and type I interferon and NF-kappa-B responses, leading to a potent anti-viral state. Proteins of the poxin family cleave cytosolic 2',3'-cGAMP into linear Gp[2'-5']Ap[3'], thereby preventing the activation of STING and the resulting induction of the host immune response. Not surprisingly, poxins are required for efficient viral replication in vivo.

Poxins are found in poxviruses, a large family of DNA viruses infecting mammalian cells, and baculoviruses, which target insect cells. The active site is conserved between poxviruses, including vaccinia virus poxin, and baculoviruses, an amazing observation considering the evolutionary distance between these viral families. Even more surprising is the presence of an enzymatically active poxin homolog in the silk moth Bombyx mori and other moth and butterfly genomes! The current hypothesis for the spread of poxins between insects and insect and mammalian viruses is that poxviruses and baculoviruses can share overlapping host tropisms and readily acquire genes through homologous recombination.

UniProtKB/Swiss-Prot poxin entries have been updated and should not escape your attention.

UniProtKB news

Removal of the cross-references to HOVERGEN

Cross-references to HOVERGEN have been removed.

Removal of the cross-references to ProteinModelPortal

Cross-references to ProteinModelPortal have been removed.

In an attempt to improve access to modelling data, we now add a link to the SWISS-MODEL-Workspace for all entries that are not already cross-referenced to the SWISS-MODEL Repository (SMR). This new link will allow users to start a new homology modelling project.

Removal of the cross-references to UniGene

Cross-references to UniGene have been removed.

Changes to the controlled vocabulary of human diseases

New diseases:

Changes to keywords

New keywords:

Deleted keywords:

  • Immunoglobulin C region
  • Immunoglobulin V region

UniProt release 2019_03

Published April 10, 2019

Headline

A drug arsenal from lupins

Digging into traditional medicines to find new drugs is a proven and fruitful strategy. Think of forskolin, a very effective activator of adenylate cyclase, used daily in numerous laboratories. This agent is produced by the Ayurvedic herb Plectranthus barbatus, which used to be recommended, among other uses, to treat cardiovascular disorders. Or the anticancer drug paclitaxel (taxol), isolated from Taxus brevifolia, the Pacific yew. Or artemisinin, the most efficient treatment against malaria, which derives from Artemisia annua, also called sweet wormwood, a herb employed in Chinese traditional medicine. Plant metabolites and direct derivatives thereof constitute more than a third of currently approved pharmaceuticals. Lupin seeds also belong to the traditional pharmacopoeia on all continents where lupin has been cultivated. The great Persian physician Avicenna recommended lupin seed flour, mixed with fenugreek and zedoary, to treat diabetes, as he noticed that this mixture considerably decreased sugar excretion in patients. A thousand years after his observation, the lupin protein mediating this effect, gamma-conglutin, has been identified.

In the lupin seed, most conglutins are storage proteins, which are hydrolyzed during germination and nourish the early stages of seedling growth. By contrast, gamma-conglutin is resistant to proteolysis. In this context, its physiological role in the seed is puzzling, but we have a little more insight into its effect on mammalian cells and organisms. Magni et al. reported that hyperglycemic rats experienced a substantial normalization of blood glucose levels after oral administration of white lupin (Lupinus albus) gamma-conglutin. The decrease in blood sugar level was comparable to that obtained with metformin, a well-established medication for the treatment of type 2 diabetes. This observation was later confirmed and extended to small groups of human volunteers.

After ingestion, gamma-conglutin is not digested in the gastrointestinal tract and the intact protein may be translocated across the intestinal barrier through transcytosis. Once in the blood, it may act at several levels. It seems to bind insulin and may potentiate its activity. When myocytes are incubated with gamma-conglutin, they activate signaling pathways similar to those of insulin, including the activation of the insulin receptor substrate 1 (IRS1), AKT1, and EIF4EBP1/PHAS1. Gamma-conglutin peptides produced in vitro can also inhibit dipeptidyl peptidase-4 (DPP4), an enzyme which degrades incretins, a group of metabolic hormones that stimulate a decrease in blood glucose levels. Gamma-conglutin also enhances the cell surface expression of glucose transporters, including SLC2A4 (GLUT4), and inhibits gluconeogenesis in hepatocytes.

More investigations are needed before gamma-conglutin becomes a drug for type 2 diabetes, but in view of the dynamics of the diabetes epidemic, it seems that nature may be giving us a hand in new drug development.

It's time now for you to enjoy a lupin bean snack and consult our newly annotated UniProtKB/Swiss-Prot lupin gamma-conglutin entries.

UniProt website news

Search for small molecules via InChIKey

We have recently enhanced enzyme annotation in UniProtKB using Rhea, a comprehensive expert-curated knowledgebase of biochemical reactions that uses the ChEBI (Chemical Entities of Biological Interest) ontology to describe reaction participants, their chemical structures, and chemical transformations. We also use ChEBI to annotate enzyme cofactors in UniProtKB.

You can now search UniProtKB for small molecule reaction participants and cofactors using the InChIKey, a standard hashed representation of the IUPAC International Chemical Identifier (InChI) that provides a unique and compact representation of chemical structure data. The UniProt website supports flexible chemical structure searches with the complete InChIKey, as well as with the connectivity and stereochemistry layers, or the connectivity layer alone. You can search our "Catalytic activity" or "Cofactor" annotations, or both combined, by using the new "Small molecule" advanced search field:

[Images: screenshots of the "Small molecule" advanced search field]

This new InChIKey-based search will help unlock the power of chemical structure data in UniProtKB, particularly when combined with our existing search tools and options for biological data. It complements the chemical ontology search, which allows users to search UniProtKB for chemical classes of biological interest like lipids, amino acids, sugars and specializations thereof, using identifiers from the ChEBI ontology of small molecules.
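As a rough illustration of the three search granularities described above, the sketch below splits a full InChIKey into its hyphen-separated blocks. It assumes that the first block (14 letters) encodes connectivity and that the first two blocks together correspond to the "connectivity and stereochemistry" form. The function name and the returned labels are hypothetical, and the sketch does not show the website's query syntax.

```python
import re

# Standard InChIKey layout: 14 letters, hyphen, 10 letters, hyphen, 1 letter.
INCHIKEY_RE = re.compile(r"^([A-Z]{14})-([A-Z]{10})-([A-Z])$")


def inchikey_search_forms(inchikey: str) -> dict:
    """Return the three key forms (hypothetical labels) for one full InChIKey."""
    match = INCHIKEY_RE.match(inchikey.strip().upper())
    if match is None:
        raise ValueError(f"not a well-formed InChIKey: {inchikey!r}")
    connectivity, second_block, protonation = match.groups()
    return {
        # Complete key: exact structure match.
        "complete": f"{connectivity}-{second_block}-{protonation}",
        # First two blocks: assumed here to cover connectivity + stereochemistry.
        "connectivity_and_stereo": f"{connectivity}-{second_block}",
        # First block alone: connectivity only.
        "connectivity_only": connectivity,
    }


# InChIKey of water, used purely as an example input.
forms = inchikey_search_forms("XLYOFNOQVPJJNP-UHFFFAOYSA-N")
```

Searching with a shorter form should match more structures, since molecules that differ only in the omitted layers share the same leading blocks.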

See How can I search UniProt for chemical or reaction data?

UniProtKB news

Changes to the controlled vocabulary of human diseases

New diseases:

Modified disease:

Changes to the controlled vocabulary for PTMs

New term for the feature key 'Modified residue' ('MOD_RES' in the flat file):

  • (Z)-2,3-didehydroaspartate

UniProt release 2019_02

Published February 13, 2019

Headline

Let's twist again with Myo1D

At first glance, we look bilaterally symmetrical. Our left side appears pretty much the mirror image of the right one. For our internal organs, it's a completely different story. For instance, our heart is on the left of our body, while the liver lies to the right. Macroscopic left-right patterning is only one aspect of an organism's asymmetry. Actually all known life forms show asymmetric properties in chemical structures, as well as in macroscopic anatomy, development and behavior. However, not much is known about the nature of the link between molecular-level and macroscopic asymmetry.

Studies in Drosophila led to the discovery of a crucial role for an unconventional myosin, called Myo31DF or Myo1D, in left-right asymmetry. Myo1D inactivation in the fly can reverse handedness of the gut and testes. In a recent publication, Lebreton et al. have extended these observations, showing that ectopic expression of Myo1D in 'naive' tissues, i.e. devoid of left-right asymmetry, such as epidermis and trachea, was sufficient to drive laterality. In the larval epidermis, Myo1D expression induced dextral twisting of the whole larval body, which could rotate up to 180°, resulting in abnormal crawling behavior. In the trachea, pronounced right-handed twisting, with a spiraling ribbon shape with multiple turns, was observed instead of the smooth and linear conformation of the wild-type tissue. This asymmetry was also seen at the cellular level. In control conditions, epidermal cells were perpendicular to the anterior-posterior axis. In contrast, cells ectopically expressing Myo1D showed elongation and a clear shift in membrane orientation toward one side. Myo1D functions as an actin-based motor protein with ATPase activity and this activity was required for the establishment of left-right asymmetry. In vitro Myo1D caused actin filaments to move in anticlockwise circular motion, suggesting that the multiscale property of Myo1D emerges from its molecular interaction with F-actin.

Does this conclusion apply to vertebrates? The answer is not straightforward. Experiments in some vertebrates point to a role for MYO1D in left-right patterning. In Xenopus, MYO1D morpholino knockdown affected organ placement in over 50% of the morphant tadpoles. In zebrafish, MYO1D plays a role in the formation of Kupffer's vesicle, an organ that functions as a left-right organizer during embryogenesis. However, in rat, MYO1D knockout did not lead to visceral situs inversus and caused no obvious motor defects, indicating that, at least in certain mammals, MYO1D is not involved in left-right body asymmetry.

As of this release, UniProtKB/Swiss-Prot MYO1D entries have been updated and are publicly available.

UniProtKB news

Removal of the cross-references to CleanEx

Cross-references to CleanEx have been removed.

Changes to the controlled vocabulary of human diseases

New diseases:

Modified diseases:

Changes to the controlled vocabulary for PTMs

New term for the feature key 'Modified residue' ('MOD_RES' in the flat file):

  • ADP-ribosyldiphthamide

RDF news

Change of URIs for Orphanet

For historic reasons, UniProt had to generate URIs to cross-reference databases that did not have an RDF representation. Our policy is to replace these by the URIs generated by the cross-referenced database once it starts to distribute an RDF representation of its data.

The URIs for the Orphanet database have therefore been updated from:

http://purl.uniprot.org/orphanet/<ID>
to:
http://www.orpha.net/ORDO/Orphanet_<ID>
If required for backward compatibility, you can use the following query to add the old URIs:
PREFIX owl:<http://www.w3.org/2002/07/owl#>
PREFIX rdfs:<http://www.w3.org/2000/01/rdf-schema#>
PREFIX up:<http://purl.uniprot.org/core/>
INSERT
{
   ?protein rdfs:seeAlso ?old .
   ?old owl:sameAs ?new .
   ?old up:database <http://purl.uniprot.org/database/orphanet> .
}
WHERE
{
   ?protein rdfs:seeAlso ?new .
   ?new up:database <http://purl.uniprot.org/database/Orphanet> .
   BIND(iri(concat('http://purl.uniprot.org/orphanet/', substr(str(?new),36))) AS ?old)
}

The dereferencing of existing http://purl.uniprot.org/orphanet/<ID> URIs will be maintained.
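For client code that stores Orphanet URIs outside a triple store, the same mapping can be sketched as plain string handling. This is only an illustration of the two URI patterns given above. The function names are hypothetical, and the ID used in the usage line is an arbitrary example.

```python
OLD_PREFIX = "http://purl.uniprot.org/orphanet/"
NEW_PREFIX = "http://www.orpha.net/ORDO/Orphanet_"


def orphanet_new_to_old(new_uri: str) -> str:
    """Map an ORDO-style URI back to the legacy purl.uniprot.org form."""
    if not new_uri.startswith(NEW_PREFIX):
        raise ValueError(f"not an ORDO Orphanet URI: {new_uri!r}")
    # Strip the prefix by its length rather than a hard-coded offset.
    return OLD_PREFIX + new_uri[len(NEW_PREFIX):]


def orphanet_old_to_new(old_uri: str) -> str:
    """Map a legacy purl.uniprot.org URI to the ORDO-style form."""
    if not old_uri.startswith(OLD_PREFIX):
        raise ValueError(f"not a legacy Orphanet URI: {old_uri!r}")
    return NEW_PREFIX + old_uri[len(OLD_PREFIX):]


old = orphanet_new_to_old("http://www.orpha.net/ORDO/Orphanet_558")
```

Deriving the cut position from the prefix length avoids an off-by-some error if the prefix ever changes.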

UniProt release 2019_01

Published January 16, 2019

Headline

Engaging and disengaging: CRISPR rings

CRISPR-Cas systems are an RNA-guided adaptive immune response that bacteria and archaea use to defend against invasive genetic elements of bacterial (plasmid) or viral origin. Pieces of foreign DNA incorporated into CRISPR arrays provide a "memory" of having encountered the invader. These arrays are transcribed and processed, and the resulting CRISPR RNA (crRNA) is used by the interference complex to recognize the invader if it is re-encountered. Once recognized, foreign nucleic acids are quickly degraded, providing immunity. There are different types of CRISPR-Cas systems, mainly characterized by the presence or absence of certain Cas proteins. For example, the Cas3, Cas9, and Cas10 proteins are hallmarks of the CRISPR/Cas types I, II and III, respectively. The best known system is the type II Cas9-encoding system, which has been coopted by scientists for genome editing. The most intriguing one is the type III system, which has additional, novel control mechanisms not found in the other systems.

The type III interference complex is composed of crRNA, Cas10 and proteins Csm2, Csm3, Csm4 and Csm5. Once the target RNA has bound to the Csm interference complex, it is cleaved by the complex, which acts as a sequence-specific endoribonuclease (RNase). There is an additional component to this system: Csm6. Under basal conditions, Csm6 is an inactive RNase and is not part of the Csm complex; however, its presence is required for full CRISPR-Cas immunity, where it non-specifically degrades invader-derived RNA transcripts. How then is Csm6 RNase activity turned on and, once activated, how is it turned off, considering that an uncontrolled RNase activity could be detrimental to the cell? The answer to these questions has been revealed in recent publications. Homodimeric Csm6 is activated by cyclic oligoadenylates (cOA), ring-shaped second messengers synthesized by the C-terminal GGDEF (also called Palm) domain of Cas10. Binding of cOA to the Csm6 dimer interface pocket formed by its CARF (CRISPR-associated Rossmann fold) domains allosterically regulates its RNase activity. The type of cyclic oligoadenylate produced is species-specific. Streptococcus thermophilus and Enterococcus italicus make cyclic hexaadenylate (cA6), while Csm6 of Thermus thermophilus is stimulated by cyclic tetraadenylate (cA4), suggesting Cas10 in this organism synthesizes cA4. As the target RNA associated with the CRISPR complex is degraded, the cOA synthase activity of Cas10 shuts off, halting second messenger synthesis. Additionally, 2 proteins with ring-specific nuclease activity able to degrade cOA have been recently isolated from Saccharolobus solfataricus (formerly called Sulfolobus solfataricus), which would turn down Csm6 activity and prevent uncontrolled degradation of cellular RNA.

As of this release, several Cas10 proteins and the ring nucleases of S. solfataricus have been annotated and can be retrieved.

UniProtKB news

Cross-references to jPOST

Cross-references have been added to jPOST, a proteomics database containing re-analysis results with unified criteria for raw data from several ProteomeXchange (PX) repositories.

jPOST is available at https://globe.jpostdb.org/.

The format of the explicit links is:

Resource abbreviation: jPOST
Resource identifier: UniProtKB accession number

Example: Q8IY92

Show all entries having a cross-reference to jPOST.

Text format

Example: Q8IY92

DR   jPOST; Q8IY92; -.
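A minimal sketch of how such a DR line might be read programmatically is given below. It relies only on the layout visible in the example (resource abbreviation, then semicolon-separated fields, then a final period). The function name and the returned structure are hypothetical.

```python
import re

# 'DR', three spaces, resource abbreviation, ';', remaining fields, final '.'
DR_LINE_RE = re.compile(r"^DR\s{3}(?P<db>[^;]+);\s*(?P<fields>.+)\.$")


def parse_dr_line(line: str) -> tuple:
    """Split a flat-file DR line into (resource, [field, ...])."""
    match = DR_LINE_RE.match(line.rstrip())
    if match is None:
        raise ValueError(f"not a DR line: {line!r}")
    fields = [field.strip() for field in match.group("fields").split(";")]
    return match.group("db"), fields


resource, fields = parse_dr_line("DR   jPOST; Q8IY92; -.")
```

For the jPOST line, the first field is the UniProtKB accession number and the second is the "-" placeholder shown in the example.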

XML format

Example: Q8IY92

<dbReference type="jPOST" id="Q8IY92"/>

RDF format

Example: Q8IY92

uniprot:Q8IY92
  rdfs:seeAlso <http://purl.uniprot.org/jpost/Q8IY92> .
<http://purl.uniprot.org/jpost/Q8IY92>
  rdf:type up:Resource ;
  up:database <http://purl.uniprot.org/database/jPOST> .

Changes to the controlled vocabulary of human diseases

New diseases:

Modified diseases:

Deleted diseases

  • Limb-girdle muscular dystrophy 1A
  • Limb-girdle muscular dystrophy 1B
  • Limb-girdle muscular dystrophy 1C
  • Limb-girdle muscular dystrophy 2R

Changes to the controlled vocabulary for PTMs

New terms for the feature key 'Modified residue' ('MOD_RES' in the flat file):

  • ADP-ribosylglycine
  • ADP-ribosyltyrosine

Changes in subcellular location controlled vocabulary

New subcellular locations:

UniProt release 2018_11

Published December 5, 2018

Headline

Enhanced enzyme annotation in UniProtKB using Rhea

This release marks a major advance in the way UniProt describes enzyme function, with the introduction of Rhea as a vocabulary to annotate and represent enzyme-catalysed reactions in UniProtKB.

Rhea is a comprehensive expert-curated knowledgebase of biochemical reactions that uses the ChEBI (Chemical Entities of Biological Interest) ontology to describe reaction participants, their chemical structures, and chemical transformations. Rhea provides stable unique identifiers for reactions and standard computationally tractable descriptors for chemical transformations.

The enhanced enzyme annotations created using Rhea will form the basis of new search and identifier mapping services in UniProtKB that combine knowledge of small molecules and proteins. They will help UniProt users to more easily integrate and analyse metabolomic data, annotate metabolic networks and models, or mine reaction data to study enzyme evolution and predict new pathways for drug production or bioremediation.

Recent publications provide additional information on Rhea reactions and examples of services that integrate Rhea with biological knowledge from UniProtKB; we hope these will inspire you to dig deeper into the wealth of enzyme data in UniProtKB.

For further technical details about this change see below.

UniProtKB news

Standardization of 'Catalytic activity' annotations

A 'Catalytic activity' annotation describes a catalytic activity of an enzyme, i.e. a chemical reaction that the enzyme catalyzes. Up to now, UniProt has followed the recommendations of the Nomenclature Committee of the International Union of Biochemistry and Molecular Biology (NC-IUBMB) for the description of enzymatic activities, except for reactions that are described in the scientific literature, but that are not (yet) covered by the NC-IUBMB. The focus of the NC-IUBMB is the nomenclature and classification of enzymes by the reactions they catalyze. For this purpose the NC-IUBMB typically describes an exemplary reaction for each class of enzymes, with the understanding that individual members of the class may use alternative reactants. The NC-IUBMB use their own names for the reactants. To allow UniProt to curate reactions at the level of specific enzymes instead of enzyme classes, and to use standardized names for reactants, we now use chemical reaction descriptions from the Rhea database whenever possible. Rhea uses the ChEBI (Chemical Entities of Biological Interest) ontology to describe reaction participants that are small molecules as well as the reactive groups of large molecules (such as amino acid residues within proteins). These large molecules are identified by a RHEA-COMP identifier. For catalytic activities that can only be described in the form of free text, we continue to follow the NC-IUBMB descriptions. We have also started to curate the physiological direction of a reaction, i.e. the direction of the net flow of reactants in vivo, where evidence for it is available.

Because Enzyme Commission (EC) numbers primarily serve nomenclature, they have historically been added as cross-references to the Protein names subsection of UniProtKB entries. To link the EC numbers to the reactions on which they are based, we now also add them to 'Catalytic activity' annotations.

'Catalytic activity' annotations are found in UniProtKB entries, as well as in UniRule and SAAS annotation rules.

Below is a description of how this change affects the different file formats in which UniProt entries are distributed.

Text format

Note: Regex symbols indicate whether a pattern (as delimited by parentheses) is optional (?) or may occur 1 or more times (+).

Reaction description from Rhea:

CC   -!- CATALYTIC ACTIVITY:
 CC       Reaction=<RheaText>; Xref=<RheaXref>(, <ReactantXref>)+;
 CC        ( EC=<EcNumber>;)?( Evidence={<Evidences>};)?
(CC       PhysiologicalDirection=left-to-right; Xref=<RheaXref>; Evidence={<Evidences>};)?
(CC       PhysiologicalDirection=right-to-left; Xref=<RheaXref>; Evidence={<Evidences>};)?

Where:

  • <RheaText>: Textual representation of an undirectional Rhea reaction.
  • <RheaXref>: Cross-reference to a Rhea reaction (Rhea:n).
  • <ReactantXref>: Cross-reference to a reactant from ChEBI (CHEBI:n) or Rhea (RHEA-COMP:n).
  • <EcNumber>: EC number of the corresponding enzyme class, when available.
  • <Evidences>: List of evidences, when available.

Example: O36015

Previous format (based on NC-IUBMB):

CC   -!- CATALYTIC ACTIVITY: S-adenosyl-L-methionine +
CC       cytidine(32)/guanosine(34) in tRNA = S-adenosyl-L-homocysteine +
CC       2'-O-methylcytidine(32)/2'-O-methylguanosine(34) in tRNA.
CC       {ECO:0000255|HAMAP-Rule:MF_03162}.

New format (based on Rhea):

CC   -!- CATALYTIC ACTIVITY:
CC       Reaction=cytidine(32)/guanosine(34) in tRNA + 2 S-adenosyl-L-
CC         methionine = 2'-O-methylcytidine(32)/2'-O-methylguanosine(34) in
CC         tRNA + 2 H(+) + 2 S-adenosyl-L-homocysteine;
CC         Xref=Rhea:RHEA:42396, Rhea:RHEA-COMP:10246, Rhea:RHEA-
CC         COMP:10247, ChEBI:CHEBI:15378, ChEBI:CHEBI:57856,
CC         ChEBI:CHEBI:59789, ChEBI:CHEBI:74269, ChEBI:CHEBI:74445,
CC         ChEBI:CHEBI:74495, ChEBI:CHEBI:82748; EC=2.1.1.205;
CC         Evidence={ECO:0000255|HAMAP-Rule:MF_03162};

Example: A0A0S3QTD0

Previous format (based on NC-IUBMB):

CC   -!- CATALYTIC ACTIVITY: Acetyl-CoA + H(2)O + oxaloacetate = citrate +
CC       CoA. {ECO:0000269|PubMed:29420286}.

New format (based on Rhea):

CC   -!- CATALYTIC ACTIVITY:
CC       Reaction=acetyl-CoA + H2O + oxaloacetate = citrate + CoA + H(+);
CC         Xref=Rhea:RHEA:16845, ChEBI:CHEBI:15377, ChEBI:CHEBI:15378,
CC         ChEBI:CHEBI:16452, ChEBI:CHEBI:16947, ChEBI:CHEBI:57287,
CC         ChEBI:CHEBI:57288; EC=2.3.3.16;
CC         Evidence={ECO:0000269|PubMed:29420286};
CC       PhysiologicalDirection=left-to-right; Xref=Rhea:RHEA:16846;
CC         Evidence={ECO:0000269|PubMed:29420286};
CC       PhysiologicalDirection=right-to-left; Xref=Rhea:RHEA:16847;
CC         Evidence={ECO:0000269|PubMed:29420286};
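The sketch below shows one way the new text format could be read. It is not an official parser. It assumes that continuation lines can be re-joined with a single space, except after a line-final hyphen, which is a heuristic based on the wrapped examples above. It also assumes that each field ends with a semicolon followed by the next known key. The function name and the returned dictionary layout are hypothetical.

```python
import re

KEYS = "Reaction|Xref|EC|Evidence|PhysiologicalDirection"
# A field is 'Key=value;' followed by another known key or the end of the block.
FIELD_RE = re.compile(rf"({KEYS})=(.*?);(?=\s+(?:{KEYS})=|\s*$)")

EXAMPLE = [
    "CC   -!- CATALYTIC ACTIVITY:",
    "CC       Reaction=acetyl-CoA + H2O + oxaloacetate = citrate + CoA + H(+);",
    "CC         Xref=Rhea:RHEA:16845, ChEBI:CHEBI:15377, ChEBI:CHEBI:15378,",
    "CC         ChEBI:CHEBI:16452, ChEBI:CHEBI:16947, ChEBI:CHEBI:57287,",
    "CC         ChEBI:CHEBI:57288; EC=2.3.3.16;",
    "CC         Evidence={ECO:0000269|PubMed:29420286};",
    "CC       PhysiologicalDirection=left-to-right; Xref=Rhea:RHEA:16846;",
    "CC         Evidence={ECO:0000269|PubMed:29420286};",
    "CC       PhysiologicalDirection=right-to-left; Xref=Rhea:RHEA:16847;",
    "CC         Evidence={ECO:0000269|PubMed:29420286};",
]


def parse_catalytic_activity(cc_lines):
    """Parse one CATALYTIC ACTIVITY comment given as a list of CC lines."""
    joined = ""
    for line in cc_lines[1:]:  # skip the 'CC   -!- CATALYTIC ACTIVITY:' header
        chunk = re.sub(r"^CC\s+", "", line.strip())
        # Heuristic: a line ending in '-' was wrapped inside a token.
        joined += chunk if (not joined or joined.endswith("-")) else " " + chunk

    result = {"directions": []}
    current = result  # fields attach to the reaction until a direction starts
    for key, value in FIELD_RE.findall(joined):
        if key == "Reaction":
            result["reaction"] = value
        elif key == "PhysiologicalDirection":
            current = {"direction": value}
            result["directions"].append(current)
        elif key == "Xref":
            current["xrefs"] = value.split(", ")
        elif key == "EC":
            current["ec"] = value
        else:  # Evidence
            current["evidence"] = value.strip("{}")
    return result


parsed = parse_catalytic_activity(EXAMPLE)
```

The same function should also accept the NC-IUBMB form, where only Reaction, EC and optionally Evidence are present, because every key is treated as optional.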

Reaction description from NC-IUBMB:

CC   -!- CATALYTIC ACTIVITY:
CC       Reaction=<IUBMBText>; EC=<EcNumber>;( Evidence={<Evidences>};)?

Where:

  • <IUBMBText>: An NC-IUBMB reaction description.
  • <EcNumber>: EC number of the corresponding enzyme class.
  • <Evidences>: List of evidences, when available.

Example: P17050

Previous format (based on NC-IUBMB):

CC   -!- CATALYTIC ACTIVITY: Cleavage of non-reducing alpha-(1->3)-N-
CC       acetylgalactosamine residues from human blood group A and AB mucin
CC       glycoproteins, Forssman hapten and blood group A lacto series
CC       glycolipids. {ECO:0000269|PubMed:19683538}.

New format (based on NC-IUBMB):

CC   -!- CATALYTIC ACTIVITY:
CC       Reaction=Cleavage of non-reducing alpha-(1->3)-N-
CC         acetylgalactosamine residues from human blood group A and AB
CC         mucin glycoproteins, Forssman hapten and blood group A lacto
CC         series glycolipids.; EC=3.2.1.49;
CC         Evidence={ECO:0000269|PubMed:19683538};

XML format

We have extended the UniProt XSD with new elements and types, as shown below (the new parts are the 'reaction' and 'physiologicalReaction' elements and their types):

<xs:complexType name="commentType">
        ...
        <xs:sequence>
            <xs:element name="molecule" type="moleculeType" minOccurs="0"/>
            <xs:choice minOccurs="0">
                ...
                <xs:sequence>
                    <xs:annotation>
                        <xs:documentation>Used in 'catalytic activity' annotations.</xs:documentation>
                    </xs:annotation>
                    <xs:element name="reaction" type="reactionType"/>
                    <xs:element name="physiologicalReaction" type="physiologicalReactionType" minOccurs="0" maxOccurs="2"/>
                </xs:sequence>
                ...
            </xs:choice>
            ...
        </xs:sequence>
        ...
    </xs:complexType>
    ...
    <xs:complexType name="reactionType">
        <xs:annotation>
            <xs:documentation>Describes a chemical reaction.</xs:documentation>
        </xs:annotation>
        <xs:sequence>
            <xs:element name="text" type="xs:string"/>
            <xs:element name="dbReference" type="dbReferenceType" minOccurs="1" maxOccurs="unbounded"/>
        </xs:sequence>
        <xs:attribute name="evidence" type="intListType" use="optional"/>
    </xs:complexType>

    <xs:complexType name="physiologicalReactionType">
        <xs:annotation>
            <xs:documentation>Describes a physiological reaction.</xs:documentation>
        </xs:annotation>
        <xs:sequence>
            <xs:element name="dbReference" type="dbReferenceType"/>
        </xs:sequence>
        <xs:attribute name="direction" use="required">
            <xs:simpleType>
                <xs:restriction base="xs:string">
                    <xs:enumeration value="left-to-right"/>
                    <xs:enumeration value="right-to-left"/>
                </xs:restriction>
            </xs:simpleType>
        </xs:attribute>
        <xs:attribute name="evidence" type="intListType" use="optional"/>
    </xs:complexType>

Reaction description from Rhea:

Example: O36015

Previous format (based on NC-IUBMB):

<comment type="catalytic activity">
  <text evidence="1">S-adenosyl-L-methionine + cytidine(32)/guanosine(34) in tRNA = S-adenosyl-L-homocysteine + 2'-O-methylcytidine(32)/2'-O-methylguanosine(34) in tRNA.</text>
</comment>

New format (based on Rhea):

<comment type="catalytic activity">
  <reaction evidence="1">
    <text>cytidine(32)/guanosine(34) in tRNA + 2 S-adenosyl-L-methionine = 2'-O-methylcytidine(32)/2'-O-methylguanosine(34) in tRNA + 2 H(+) + 2 S-adenosyl-L-homocysteine</text>
    <dbReference type="Rhea" id="RHEA:42396"/>
    <dbReference type="Rhea" id="RHEA-COMP:10246"/>
    <dbReference type="Rhea" id="RHEA-COMP:10247"/>
    <dbReference type="ChEBI" id="CHEBI:15378"/>
    <dbReference type="ChEBI" id="CHEBI:57856"/>
    <dbReference type="ChEBI" id="CHEBI:59789"/>
    <dbReference type="ChEBI" id="CHEBI:74269"/>
    <dbReference type="ChEBI" id="CHEBI:74445"/>
    <dbReference type="ChEBI" id="CHEBI:74495"/>
    <dbReference type="ChEBI" id="CHEBI:82748"/>
    <dbReference type="EC" id="2.1.1.205"/>
  </reaction>
</comment>

Example: A0A0S3QTD0

Previous format (based on NC-IUBMB):

<comment type="catalytic activity">
  <text evidence="2">Acetyl-CoA + H(2)O + oxaloacetate = citrate + CoA.</text>
</comment>

New format (based on Rhea):

<comment type="catalytic activity">
  <reaction evidence="2">
    <text>acetyl-CoA + H2O + oxaloacetate = citrate + CoA + H(+)</text>
    <dbReference type="Rhea" id="RHEA:16845"/>
    <dbReference type="ChEBI" id="CHEBI:15377"/>
    <dbReference type="ChEBI" id="CHEBI:15378"/>
    <dbReference type="ChEBI" id="CHEBI:16452"/>
    <dbReference type="ChEBI" id="CHEBI:16947"/>
    <dbReference type="ChEBI" id="CHEBI:57287"/>
    <dbReference type="ChEBI" id="CHEBI:57288"/>
    <dbReference type="EC" id="2.3.3.16"/>
  </reaction>
  <physiologicalReaction direction="left-to-right" evidence="2">
    <dbReference type="Rhea" id="RHEA:16846"/>
  </physiologicalReaction>
  <physiologicalReaction direction="right-to-left" evidence="2">
    <dbReference type="Rhea" id="RHEA:16847"/>
  </physiologicalReaction>
</comment>
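As an illustration, the new XML structure can be read with Python's standard ElementTree module. The sketch below parses a shortened copy of the example above as a stand-alone fragment. In full UniProt XML files the elements sit in a default namespace, so the element names would need to be namespace-qualified there. The function name and the returned layout are hypothetical.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the A0A0S3QTD0 example, without a namespace declaration.
SNIPPET = """<comment type="catalytic activity">
  <reaction evidence="2">
    <text>acetyl-CoA + H2O + oxaloacetate = citrate + CoA + H(+)</text>
    <dbReference type="Rhea" id="RHEA:16845"/>
    <dbReference type="ChEBI" id="CHEBI:15377"/>
    <dbReference type="ChEBI" id="CHEBI:15378"/>
    <dbReference type="EC" id="2.3.3.16"/>
  </reaction>
  <physiologicalReaction direction="left-to-right" evidence="2">
    <dbReference type="Rhea" id="RHEA:16846"/>
  </physiologicalReaction>
  <physiologicalReaction direction="right-to-left" evidence="2">
    <dbReference type="Rhea" id="RHEA:16847"/>
  </physiologicalReaction>
</comment>"""


def summarize_catalytic_activity(comment: ET.Element) -> dict:
    """Collect reaction text, cross-references and physiological directions."""
    reaction = comment.find("reaction")
    refs = {}
    for ref in reaction.findall("dbReference"):
        # Group cross-reference ids by database type (Rhea, ChEBI, EC).
        refs.setdefault(ref.get("type"), []).append(ref.get("id"))
    directions = {
        phys.get("direction"): phys.find("dbReference").get("id")
        for phys in comment.findall("physiologicalReaction")
    }
    return {"text": reaction.findtext("text"), "refs": refs, "directions": directions}


summary = summarize_catalytic_activity(ET.fromstring(SNIPPET))
```

Grouping the dbReference elements by type keeps the EC number separate from the Rhea and ChEBI identifiers, which all share the same element name in the new format.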

Reaction description from NC-IUBMB:

Example: P17050

Previous format (based on NC-IUBMB):

<comment type="catalytic activity">
  <text evidence="6">Cleavage of non-reducing alpha-(1->3)-N-acetylgalactosamine residues from human blood group A and AB mucin glycoproteins, Forssman hapten and blood group A lacto series glycolipids.</text>
</comment>

New format (based on NC-IUBMB):

<comment type="catalytic activity">
  <reaction evidence="6">
    <text>Cleavage of non-reducing alpha-(1->3)-N-acetylgalactosamine residues from human blood group A and AB mucin glycoproteins, Forssman hapten and blood group A lacto series glycolipids.</text>
    <dbReference type="EC" id="3.2.1.49"/>
  </reaction>
</comment>

RDF format

Note: Evidence-related statements are omitted since their format does not change. In the previous format, evidence was attributed via reification of the rdfs:comment statement. In the new format, the up:catalyticActivity and up:catalyzedPhysiologicalReaction statements are reified.

Reaction description from Rhea:

Example: O36015

Previous format (based on NC-IUBMB):

uniprot:O36015
  up:annotation <O36015#SIP5A4ED6FF66BBF481> .

<O36015#SIP5A4ED6FF66BBF481>
  rdf:type up:Catalytic_Activity_Annotation ;
  rdfs:comment "S-adenosyl-L-methionine + cytidine(32)/guanosine(34) in tRNA = S-adenosyl-L-homocysteine + 2'-O-methylcytidine(32)/2'-O-methylguanosine(34) in tRNA." .

New format (based on Rhea):

uniprot:O36015
  up:annotation <O36015#SIP962CEE3C69B2533E> .

<O36015#SIP962CEE3C69B2533E>
  rdf:type up:Catalytic_Activity_Annotation ;
  up:catalyticActivity <O36015#SIP6D2D3E976AAD17F0> .

<O36015#SIP6D2D3E976AAD17F0>
  rdf:type up:Catalytic_Activity ;
  up:catalyzedReaction <http://rdf.rhea-db.org/42396> ;
  up:enzymeClass enzyme:2.1.1.205 .

Example: A0A0S3QTD0

Previous format (based on NC-IUBMB):

uniprot:A0A0S3QTD0
  up:annotation <A0A0S3QTD0#SIPF04A1EC4C8EBCB08> .

<A0A0S3QTD0#SIPF04A1EC4C8EBCB08>
  rdf:type up:Catalytic_Activity_Annotation ;
  rdfs:comment "Acetyl-CoA + H(2)O + oxaloacetate = citrate + CoA." .

New format (based on Rhea):

uniprot:A0A0S3QTD0
  up:annotation <A0A0S3QTD0#SIP8171B3125ADE4E9D> .

<A0A0S3QTD0#SIP8171B3125ADE4E9D>
  rdf:type up:Catalytic_Activity_Annotation ;
  up:catalyticActivity <A0A0S3QTD0#SIP1A91565011EC50F6> ;
  up:catalyzedPhysiologicalReaction <http://rdf.rhea-db.org/16846> ,
                                    <http://rdf.rhea-db.org/16847> .

<A0A0S3QTD0#SIP1A91565011EC50F6>
  rdf:type up:Catalytic_Activity ;
  up:catalyzedReaction <http://rdf.rhea-db.org/16845> ;
  up:enzymeClass enzyme:2.3.3.16 .
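To show how the new RDF structure might be traversed, the sketch below builds a SPARQL query string that follows the path from a protein through up:annotation, up:catalyticActivity and up:catalyzedReaction to a given Rhea reaction. It uses only the classes and predicates that appear in the examples above. The function name is hypothetical, and the query has not been run against any endpoint here.

```python
def proteins_for_rhea_reaction(rhea_id: int) -> str:
    """Build a SPARQL query for proteins annotated with one Rhea reaction."""
    rhea_id = int(rhea_id)  # reject anything that is not a plain number
    return f"""PREFIX up: <http://purl.uniprot.org/core/>
SELECT ?protein ?ec
WHERE {{
  ?protein up:annotation ?annotation .
  ?annotation a up:Catalytic_Activity_Annotation ;
              up:catalyticActivity ?activity .
  ?activity up:catalyzedReaction <http://rdf.rhea-db.org/{rhea_id}> .
  OPTIONAL {{ ?activity up:enzymeClass ?ec . }}
}}"""


query = proteins_for_rhea_reaction(16845)
```

The EC number is wrapped in OPTIONAL so that an activity without an up:enzymeClass statement would still be returned.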

Reaction description from NC-IUBMB:

Example: P17050

Previous format (based on NC-IUBMB):

uniprot:P17050
  up:annotation <P17050#SIP0FD272930B1683DE> .

<P17050#SIP0FD272930B1683DE>
  rdf:type up:Catalytic_Activity_Annotation ;
  rdfs:comment "Cleavage of non-reducing alpha-(1->3)-N-acetylgalactosamine residues from human blood group A and AB mucin glycoproteins, Forssman hapten and blood group A lacto series glycolipids." .
  

New format (based on NC-IUBMB):

uniprot:P17050
  up:annotation <P17050#SIP0FD272930B1683DE> .

<P17050#SIP0FD272930B1683DE>
  rdf:type up:Catalytic_Activity_Annotation ;
  up:catalyticActivity <P17050#SIP0FD272930B1683DF> .

<P17050#SIP0FD272930B1683DF>
  rdf:type up:Catalytic_Activity ;
  skos:closeMatch enzyme:3.2.1.49#SIP0FD272930B1683DG ;
  up:enzymeClass enzyme:3.2.1.49 .

We have changed the RDF representation of ENZYME records in order to refer from UniProt 'Catalytic activity' annotations to individual enzymatic activities. The range of the activity predicate has been changed to the type Catalytic_Activity.

Example: 1.11.1.21

Previous format:

enzyme:1.11.1.21
  rdf:type up:Enzyme ;
  skos:prefLabel "Catalase peroxidase" ;
  up:activity "Donor + H(2)O(2) = oxidized donor + 2 H(2)O." ;
  up:activity "2 H(2)O(2) = O(2) + 2 H(2)O." ;
  ...

New format:

enzyme:1.11.1.21
  rdf:type up:Enzyme ;
  skos:prefLabel "Catalase peroxidase" ;
  up:activity <1.11.1.21#SIP017EC216DF0EDC2A> ;
  up:activity <1.11.1.21#SIP018ED427AB1BAS3X> ;
  ...

<1.11.1.21#SIP017EC216DF0EDC2A>
  rdf:type up:Catalytic_Activity ;
  rdfs:label "Donor + H(2)O(2) = oxidized donor + 2 H(2)O." .

<1.11.1.21#SIP018ED427AB1BAS3X>
  rdf:type up:Catalytic_Activity ;
  rdfs:label "2 H(2)O(2) = O(2) + 2 H(2)O." .

Changes to the controlled vocabulary of human diseases

New diseases:

Modified diseases:

Deleted diseases

  • Deafness, autosomal recessive, 105

Changes to the controlled vocabulary for PTMs

New term for the feature key 'Modified residue' ('MOD_RES' in the flat file):

  • Murein peptidoglycan amidated serine

Changes in subcellular location controlled vocabulary

New subcellular location:

UniProt release 2018_10

Published November 7, 2018

Headline

You're not coming in!

Sexual reproduction is a great process to diversify the genetic pool and to accelerate evolution. However, it imposes tight constraints for success. First, sperm must meet egg: an unfertilized egg cannot develop. In addition, exactly one sperm cell has to meet one egg, since polyspermy is not viable. And to ensure the survival of distinct species, the process has to be strictly species-specific. This requirement is particularly challenging in organisms in which fertilization occurs externally, as is the case for fish.

Looking for factors required for fertilization in vertebrates, Herberg et al. identified a small protein highly expressed in zebrafish (Danio rerio) oocytes. They called the protein Bouncer. Bouncer is located at the cell surface where it is attached to the membrane through a glycosylphosphatidylinositol (GPI) anchor, following cleavage of the C-terminal propeptide.

Bouncer function was investigated in knockout zebrafish. At first glance, the mutant animals did not show any overt phenotype. They were produced at the expected Mendelian rates and developed normally. When fertility was tested, there was no difference between knockout and wild-type males, but knockout females were almost completely sterile. Delivery of sperm into Bouncer-deficient eggs by intracytoplasmic sperm injection restored embryonic development, suggesting that Bouncer was involved in sperm entry during fertilization. Bouncer was indeed shown to promote sperm-egg binding. Could Bouncer play a role in species recognition during fertilization? To test this hypothesis, zebrafish Bouncer knockout eggs expressing the medaka fish (Oryzias latipes) Bouncer ortholog were generated. Medaka sperm cannot normally fertilize zebrafish eggs. The two species split apart some 200 million years ago, much earlier than we did from mice, and their Bouncer proteins share only 40% sequence identity. Amazingly, the transgenic knockout eggs could be fertilized by medaka, but not zebrafish, sperm. Fertility rates of individual transgenic medaka Bouncer females were found to correlate with expression levels of medaka Bouncer mRNA in eggs. In conclusion, the small 80-amino acid-long Bouncer protein plays a crucial role in species-specific fertilization. The rescue was not complete, however: the fertility rate was low, suggesting that other factors likely contribute to species-specific sperm-egg interaction.

Bouncer homologs exist in other vertebrate species. Its closest relative in mammals is the SPACA4 gene. Bouncer/SPACA4 germline-restricted expression was confirmed in all vertebrates tested. However, Bouncer ovary-specific expression was observed only in externally fertilizing animals, such as fish or amphibians; surprisingly, internally fertilizing vertebrates, such as reptiles and mammals, show testis-specific expression. The reason for this difference is not clear and the function of mammalian SPACA4 is not yet known.

As of this release, zebrafish and medaka Bouncer proteins have been annotated and integrated into UniProtKB/Swiss-Prot.

Changes to the controlled vocabulary of human diseases

New diseases:

Modified disease:

Changes to keywords

New keyword:

UniProt release 2018_09

Published October 10, 2018

Headline

Tubulin code: a long sought-after player identified

In eukaryotes, the cytoskeleton helps cells maintain their shape and internal organization, and provides mechanical support that enables cells to carry out essential functions, like division and movement. It is made of filamentous proteins, microtubules being the largest type of cytoskeletal filament. Microtubules are dynamically assembled from alpha-tubulin and beta-tubulin heterodimers, creating specific structures adapted to the cell's needs, structures that can be as different from each other as a cilium can be from a mitotic spindle. How is this variety achieved using the same highly conserved building blocks? Part of the answer lies in the so-called 'tubulin code', which involves not only the differential expression of alpha- and beta-tubulin genes (tubulin isotypes), but also a plethora of post-translational modifications (PTMs). Tubulins have a globular core and a more variable C-terminal tail that is exposed at the microtubule surface, where many PTMs occur. One of the first PTMs to be reported, back in the 1970s, was C-terminal reversible detyrosination, which occurs on most alpha-, but not beta-, tubulins. The enzyme catalyzing the addition of tyrosine, tubulin-tyrosine ligase or TTL, was identified not long after, but the carboxypeptidase responsible for tubulin detyrosination remained elusive until recently.

Aillaud et al. tackled the problem by developing an irreversible inhibitor of tubulin carboxypeptidase activity, followed by mass spectrometry analysis of the inhibitor targets. Nieuwenhuis et al. performed gene-trapping mutagenesis in a haploid human cell line aimed at regulators of tubulin detyrosination. Both groups identified vasohibin-1 (VASH1) and 2 (VASH2) as the major alpha-tubulin-specific carboxypeptidases. Vasohibins were formerly predicted to have a protease fold, but their enzymatic activity had not been investigated. Actually, both enzymes show low carboxypeptidase activity when assayed on their own. Full activity requires the formation of a complex with another protein, called small vasohibin-binding protein, or SVBP. This may explain why previous attempts to identify tubulin carboxypeptidase have failed. SVBP-VASH complexes act preferentially on polymerized tubulins. When microtubules disassemble, TTL adds back a tyrosine residue at the C-terminus and the tubulin detyrosination/tyrosination cycle is closed.

The physiological importance of detyrosination remains to be investigated. SVBP or vasohibin knockdown in mouse hippocampal neurons results in delayed axonal differentiation. In embryos, it affects neuronal migration during brain cortex differentiation. However, mice lacking VASH1 or VASH2 do not exhibit a dramatic phenotype. It should also be noted that vasohibin depletion in cells could not completely abolish activity, suggesting the existence of yet another enzyme.

The VASH1 and VASH2 protein entries have been updated and are now available in UniProtKB/Swiss-Prot.

Changes to the controlled vocabulary of human diseases

New diseases:

Modified diseases:

Deleted diseases

  • Alport syndrome, with macrothrombocytopenia
  • Bannayan-Riley-Ruvalcaba syndrome
  • Cowden syndrome 2
  • Cowden syndrome 3
  • Epstein syndrome
  • Fechtner syndrome
  • Macrothrombocytopenia and progressive sensorineural deafness
  • Sebastian syndrome

Changes to the controlled vocabulary for PTMs

New terms for the feature key 'Modified residue' ('MOD_RES' in the flat file):

  • (2S)-4-hydroxyleucine
  • (3S)-3-hydroxylysine
  • (4S)-4,5-dihydroxyleucine
  • 2-hydroxyproline
  • 3',4',5'-trihydroxyphenylalanine
  • 4-hydroxylysine

Modified term for the feature key 'Modified residue' ('MOD_RES' in the flat file):

  • Hydroxylated arginine -> Hydroxyarginine

UniProt website news

Deprecation of legacy REST URLs /batch and /mapping - please replace with /uploadlists

Programmatic access to our "Retrieve/IDmapping" service should be addressed to the URL path /uploadlists as shown in the code examples in the respective service help pages ID mapping and Batch retrieval.

If you have existing code for batch retrieval, you also need to specify that you are mapping to and from UniProtKB, i.e.

'from' => 'ACC+ID',
'to' => 'ACC',

(See the Perl code example in Batch retrieval.)

The obsolete URL paths /batch and /mapping have been deprecated and are no longer supported as of release 2018_09.
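
As a minimal sketch of the parameters described above, the following Python snippet builds the form-encoded body for a batch retrieval request addressed to /uploadlists. Only the 'from' and 'to' values come from the text above; the 'format' and 'query' parameter names, their values and the function name are assumptions made for illustration, and no request is actually sent.

```python
from urllib.parse import urlencode, parse_qs


def build_uploadlists_params(accessions):
    """Return a URL-encoded form body for a batch retrieval request."""
    params = {
        "from": "ACC+ID",  # map from UniProtKB accessions or entry names
        "to": "ACC",       # ... to UniProtKB accessions
        "format": "txt",   # assumed parameter name and value
        "query": " ".join(accessions),  # assumed parameter name
    }
    return urlencode(params)


body = build_uploadlists_params(["P02730", "P10361"])
print(body)
# Decoding the body again gives back the original values.
print(parse_qs(body)["from"])
```

Note that the '+' in 'ACC+ID' has to be percent-encoded in the request body, which urlencode takes care of.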

UniProt release 2018_08

Published September 12, 2018

Headline

Human brain development: slow and steady wins the race

As mammals, we share most of our physiological processes with other animals and these similarities allow the wide use of model organisms for medical research purposes. Yet there is something special about us: abstract thought, creativity, art, culture, something linked to our big brains. The increase in size and complexity of our cerebral cortex happened recently on the evolutionary time scale, i.e. after the Homo lineage split apart from that of other related primates. In order to try to understand this distinctive feature of ours, several new human-specific genes involved in corticogenesis have been identified. They are produced by segmental duplications, but their functional impact on brain development remains mysterious. However, among these genes are 3 nearly identical NOTCH2 paralogs, called NOTCH2NLA, B and C, for which functional clues have been recently obtained. The evolutionary history of NOTCH2NL genes is peculiar. NOTCH2 partial duplication occurred prior to the last common ancestor of human, chimpanzee, and gorilla (some 14 million years ago), leading to the creation of a truncated inactive copy, called NOTCH2NL (standing for Notch homolog 2 N-terminal-like). In the hominin lineage, some 3 to 4 million years ago, the NOTCH2 doppelgänger was repaired by gene conversion and duplicated, creating 3 new human-specific active genes, NOTCH2NLA, B and C. This timeframe corresponds to the early stages of the expansion of the human neocortex.

NOTCH2NLA, B and C are expressed in radial glia neural stem cells during cortical development. These cells undergo multiple cycles of regenerative, mostly asymmetric, cell divisions, leading to the generation of diverse types of neurons while maintaining a pool of progenitors. NOTCH2NL gene expression activates the NOTCH signaling pathway, down-regulates neuronal differentiation genes, and delays the differentiation of neuronal progenitors, increasing their number, all of which ultimately results in an increase in neurons. In this context, slow development produces a huge benefit.

The chromosome 1q21.1 region hosting NOTCH2NL genes is associated with chromosome 1q21.1 deletion/duplication syndromes, where duplications are associated with macrocephaly and autism, and deletions with microcephaly and schizophrenia. Eleven patients were analyzed: those with microcephaly had NOTCH2NLA and/or NOTCH2NLB deletions, while the macrocephaly cases were consistent with NOTCH2NLA and/or NOTCH2NLB duplications. If confirmed, these results are consistent with a crucial role for NOTCH2NL genes in human neocortex development. Thus, the emergence of human-specific NOTCH2NL genes may have contributed to the rapid evolution of the larger human neocortex, at the expense of susceptibility to recurrent neurodevelopmental disorders.

Using our big brains, we have annotated all 3 NOTCH2NL gene products in UniProtKB/Swiss-Prot and they are publicly available as of this release.

UniProtKB news

Change of the annotation topic 'Enzyme regulation' to 'Activity regulation'

In UniProtKB entries, the topic 'Enzyme regulation' was used to display information about factors that regulate the activity of enzymes, but also of transporters and microbial transcription factors. To clarify the situation, we have renamed this topic to 'Activity regulation'.

Text format

Example: P02730

Previous format:

CC   -!- ENZYME REGULATION: Phenyl isothiocyanate inhibits anion transport
CC       in vitro.

New format:

CC   -!- ACTIVITY REGULATION: Phenyl isothiocyanate inhibits anion transport
CC       in vitro.
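
Code that reads the text format may need to accept both topic names during a transition. The following is a minimal sketch, assuming that comment blocks start with 'CC   -!- TOPIC:' and that continuation lines start with 'CC' followed by spaces, as in the example above; the function and constant names are hypothetical.

```python
import re

# The old topic name is treated as an alias of the new one.
TOPIC_ALIASES = {"ENZYME REGULATION": "ACTIVITY REGULATION"}

# First line of a comment block: "CC   -!- TOPIC: text".
CC_START = re.compile(r"CC   -!- ([A-Z][A-Z ]*?): (.*)")


def parse_cc_topics(flat_text):
    """Collect comment (CC) blocks as (topic, text) pairs."""
    comments = []
    for line in flat_text.splitlines():
        if not line.startswith("CC   "):
            continue
        start = CC_START.match(line)
        if start:
            topic = TOPIC_ALIASES.get(start.group(1), start.group(1))
            comments.append([topic, start.group(2).strip()])
        elif comments:
            # Continuation line: append it to the current comment.
            comments[-1][1] += " " + line[5:].strip()
    return [tuple(comment) for comment in comments]


sample = (
    "CC   -!- ENZYME REGULATION: Phenyl isothiocyanate inhibits anion transport\n"
    "CC       in vitro.\n"
)
print(parse_cc_topics(sample))
```

With this normalization, entries in the previous and in the new format yield the same topic name.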

XML format

Example: P02730

Previous format:

<comment type="enzyme regulation">
  <text>Phenyl isothiocyanate inhibits anion transport in vitro.</text>
</comment>

New format:

<comment type="activity regulation">
  <text>Phenyl isothiocyanate inhibits anion transport in vitro.</text>
</comment>
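
For XML consumers, a similar tolerance can be sketched with Python's standard xml.etree.ElementTree module. The example assumes a namespace-free fragment wrapped in a hypothetical <entry> element; real UniProtKB XML files declare a namespace, which would have to be added to the tag names.

```python
import xml.etree.ElementTree as ET

# Accept the previous and the new value of the 'type' attribute.
REGULATION_TYPES = {"enzyme regulation", "activity regulation"}


def regulation_comments(xml_fragment):
    """Return the text of all regulation comments in the fragment."""
    root = ET.fromstring(xml_fragment)
    texts = []
    for comment in root.iter("comment"):
        if comment.get("type") in REGULATION_TYPES:
            text = comment.findtext("text")
            if text:
                texts.append(text)
    return texts


fragment = """
<entry>
  <comment type="activity regulation">
    <text>Phenyl isothiocyanate inhibits anion transport in vitro.</text>
  </comment>
</entry>
"""
print(regulation_comments(fragment))
```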

RDF format

Example: P02730

Previous format:

uniprot:P02730
  up:annotation <P02730#SIPC58AB4FDB0DD7DCA> .

<P02730#SIPC58AB4FDB0DD7DCA>
  rdf:type up:Enzyme_Regulation_Annotation ;
  rdfs:comment "Phenyl isothiocyanate inhibits anion transport in vitro." .

New format:

uniprot:P02730
  up:annotation <P02730#SIPC58AB4FDB0DD7DCA> .

<P02730#SIPC58AB4FDB0DD7DCA>
  rdf:type up:Activity_Regulation_Annotation ;
  rdfs:comment "Phenyl isothiocyanate inhibits anion transport in vitro." .

Change to the cross-references to Bgee

We have introduced an additional field in the cross-references to the Bgee database to indicate the expression pattern of the gene.

Text format

Example: P10361

DR   Bgee; ENSRNOG00000010756; Expressed in 10 organ(s), highest expression level in spleen.
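
The new field can be picked apart with a regular expression. This is only a sketch: it assumes that the expression pattern always follows the wording of the single example above ('Expressed in N organ(s), highest expression level in X.'), which this announcement does not guarantee; the function name and the returned keys are hypothetical.

```python
import re

BGEE_LINE = re.compile(
    r"^DR   Bgee; (?P<gene>[^;]+); "
    r"Expressed in (?P<count>\d+) organ\(s\), "
    r"highest expression level in (?P<organ>.+)\.$"
)


def parse_bgee(line):
    """Return gene id, organ count and top organ, or None if no match."""
    match = BGEE_LINE.match(line.rstrip())
    if match is None:
        return None
    return {
        "gene": match.group("gene"),
        "organ_count": int(match.group("count")),
        "top_organ": match.group("organ"),
    }


line = ("DR   Bgee; ENSRNOG00000010756; Expressed in 10 organ(s), "
        "highest expression level in spleen.")
print(parse_bgee(line))
```

Returning None for lines that do not match lets the caller fall back to treating the field as free text.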

XML format

Example: P10361

<dbReference type="Bgee" id="ENSRNOG00000010756">
  <property type="expression patterns" value="Expressed in 10 organ(s), highest expression level in spleen"/>
</dbReference>

This change does not affect the XSD, but may nevertheless require code changes.
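
One kind of code change this may call for is reading the new property without failing when it is absent. A minimal sketch, assuming a namespace-free fragment wrapped in a hypothetical <entry> element (real UniProtKB XML declares a namespace):

```python
import xml.etree.ElementTree as ET


def bgee_expression_patterns(xml_fragment):
    """Map each Bgee identifier to its expression pattern, or None."""
    root = ET.fromstring(xml_fragment)
    patterns = {}
    for ref in root.iter("dbReference"):
        if ref.get("type") != "Bgee":
            continue
        value = None
        for prop in ref.findall("property"):
            if prop.get("type") == "expression patterns":
                value = prop.get("value")
        patterns[ref.get("id")] = value
    return patterns


fragment = """
<entry>
  <dbReference type="Bgee" id="ENSRNOG00000010756">
    <property type="expression patterns" value="Expressed in 10 organ(s), highest expression level in spleen"/>
  </dbReference>
</entry>
"""
print(bgee_expression_patterns(fragment))
```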

RDF format

Example: P10361

uniprot:P10361
  rdfs:seeAlso <http://purl.uniprot.org/bgee/ENSRNOG00000010756> .
<http://purl.uniprot.org/bgee/ENSRNOG00000010756>
  rdf:type up:Resource ;
  up:database <http://purl.uniprot.org/database/Bgee> ;
  rdfs:comment "Expressed in 10 organ(s), highest expression level in spleen" .

Changes to the controlled vocabulary of human diseases

New diseases:

Deleted diseases

  • Ehlers-Danlos syndrome 7B

UniProt website news

New advanced search interface

We have revamped the advanced search interface to make it easier for you to browse the different search fields and options within the dropdown menus. Most importantly, there is now a search box right at the top when you open the blue dropdown menu that allows you to type a concept name (e.g. "structure") and receive some autocompleted suggestions from which you can then select the most suitable one:

image

Automatic gene-centric isoform mapping for eukaryotic reference proteome entries

Some proteomes have been (manually and algorithmically) selected as reference proteomes. They cover well-studied model organisms and other organisms of interest for biomedical research and phylogeny. In this context, we provide data sets for reference proteomes where only one form of a protein, usually the best annotated version in UniProtKB, is present. The relationships identified when generating these data sets are now also used when displaying individual entries on the UniProt website:

A single gene can code for multiple proteins through biological events such as alternative splicing, initiation and promoter usage. While the UniProtKB/Swiss-Prot expert curation process includes the identification and review of different forms of a protein and their description in a single UniProtKB/Swiss-Prot entry, its focus is the functional annotation of proteins. For this reason, not all potential isoforms of a protein that are available in UniProtKB/TrEMBL can be reviewed and merged into a single entry. This results in a larger number of UniProtKB entries than genes for many of the eukaryotic reference proteomes. In order to identify potential isoforms that have not (yet) been reviewed by a biocurator, we have established an automatic gene-centric mapping between entries from eukaryotic reference proteomes that are likely to belong to the same gene. This mapping is based on gene identifiers from Ensembl, EnsemblGenomes and model organism databases and, in cases where none of these are available, on gene names assigned by the original sequencing projects.
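
The grouping idea can be illustrated with a short sketch. It is not the actual UniProt pipeline, whose details are not given here: the record layout ('accession', 'gene_id', 'gene_name'), the function name and all identifiers below are hypothetical. Entries are grouped by a gene identifier when one is available and by gene name otherwise.

```python
from collections import defaultdict


def group_by_gene(entries):
    """Group entry accessions that are likely to belong to the same gene."""
    groups = defaultdict(list)
    for entry in entries:
        if entry.get("gene_id"):
            # Preferred key: a gene identifier from a genome resource.
            key = ("id", entry["gene_id"])
        elif entry.get("gene_name"):
            # Fallback: the gene name assigned by the sequencing project.
            key = ("name", entry["gene_name"])
        else:
            # Nothing to map on: the entry stays on its own.
            key = ("accession", entry["accession"])
        groups[key].append(entry["accession"])
    return dict(groups)


entries = [
    {"accession": "ACC_1", "gene_id": "GENE_X"},
    {"accession": "ACC_2", "gene_id": "GENE_X"},
    {"accession": "ACC_3", "gene_name": "abc1"},
    {"accession": "ACC_4"},
]
print(group_by_gene(entries))
```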

Example: Q15286

UniProt release 2018_07

Published July 18, 2018

Headline

Ubiquitin ligation: new insight into mechanistic diversity

Protein ubiquitination is a reversible post-translational modification that is crucial for many physiological processes, from cell survival and differentiation to innate and adaptive immunity. It can affect protein functions at many levels, marking them for degradation, as well as regulating their cellular location, activity and interactions. Most frequently, ubiquitin is linked to the amine group of a lysine side chain via an isopeptide bond, but a growing number of non-canonical linkages have been reported in recent years that involve the N-terminal amine group, thiol groups of cysteine side chains, and also serine and threonine hydroxyl groups.

A cascade of enzymatic reactions catalyzes the process of protein ubiquitination. The first step consists of ATP-dependent ubiquitin activation by E1 enzymes. Activated ubiquitin is transferred onto E2-conjugating enzymes, producing a covalently linked intermediate (E2-Ub). The transfer of ubiquitin onto the target protein is mediated by E3 protein ligases, which ensure the specificity of the reaction. The whole process grows in complexity with each step. The human genome is thought to encode only 2 E1 enzymes, some 40 E2s and over 600 E3 ligases. E3 ligases can be grouped into 3 classes based on their domain structure and mode of action. E3s of the 'really interesting new gene' (RING) family recruit E2-Ub via their RING domain and then mediate direct transfer of ubiquitin to substrates. By contrast, HECT E3 ligases undergo a catalytic cysteine-dependent transthiolation reaction with E2-Ub, forming a covalent E3-Ub intermediate. Finally, RING-between-RING (RBR) E3 ligases have a canonical RING domain linked to an ancillary domain. This ancillary domain contains a catalytic cysteine that enables a hybrid RING-HECT mechanism.

In order to identify new E3 enzymes of the HECT or RBR classes, Pao et al. established an activity-based assay, in which a biotinylated probe exhibiting the properties of a HECT/RBR substrate acts as a 'suicide' substrate and covalently traps target E3s. The assay worked as expected, identifying most known HECT/RBR E3s, but much to their surprise, the authors also isolated 33 RING E3s that lacked HECT or RBR ancillary domains. One of these, MYCBP2, an E3 ligase involved in axon guidance and synapse formation in the developing nervous system, was found to mediate ubiquitination of serines and threonines, but not of lysines, with a strong preference for threonine. The enzymatic mechanism was also found to be novel: MYCBP2 relays ubiquitin to the target threonine via thioester intermediates involving 2 essential cysteines, a mechanism termed the 'RING-Cys-relay' (RCR).

Although non-canonical ubiquitination has already been observed, this is the first report of the identification of an enzyme catalyzing this reaction and along with it, a novel E3 mechanism has been unraveled. The annotation in MYCBP2 entries has been updated with this new knowledge and is publicly available as of this release.

UniProtKB news

Cross-references to UniLectin

Cross-references have been added to the UniLectin database, a database of carbohydrate-binding proteins.

UniLectin is available at https://unilectin.eu.

The format of the explicit links is:

Resource abbreviation UniLectin
Resource identifier UniProtKB accession number

Example: P84801

Show all entries having a cross-reference to UniLectin.

Text format

Example: P84801

DR   UniLectin; P84801; -.

XML format

Example: P84801

<dbReference type="UniLectin" id="P84801"/>

RDF format

Example: P84801

uniprot:P84801
  rdfs:seeAlso <http://purl.uniprot.org/unilectin/P84801> .
<http://purl.uniprot.org/unilectin/P84801>
  rdf:type up:Resource ;
  up:database <http://purl.uniprot.org/database/UniLectin> .

Changes to the controlled vocabulary of human diseases

New diseases:

Changes in subcellular location controlled vocabulary

New subcellular locations:

UniProt news

Change of UniProt license

We have changed the license that applies to all copyrightable parts of our databases from the Creative Commons Attribution-NoDerivs (CC BY-ND 3.0) to the Creative Commons Attribution (CC BY 4.0) License. This change will make it easier for others to reuse UniProt data in their own works. The updated license information is available on the UniProt website and FTP site. As with the previous license, users must give appropriate credit for use of UniProt data. The change in license means that our users can remix, transform, and build upon UniProt for any purpose, including commercially, without seeking permission from us. However, when doing so, users must provide a link to the license and indicate if changes were made.

UniProt release 2018_06

Published June 20, 2018

Headline

Neuronal express mRNA delivery service

In mammals, the activity-regulated cytoskeleton-associated protein ARC is a key regulator of synaptic plasticity, being involved in many aspects of synapse formation, maturation, and plasticity, as well as in learning and memory. ARC expression is known to be induced by synaptic activity and its mRNA accumulates at sites of local synaptic activity where it is locally translated.

ARC originates from the Ty3/Gypsy retrotransposon family and it has retained some retroviral features. Its protein architecture is remarkably similar to that of the capsid domain of human immunodeficiency virus (HIV) GAG protein. GAG proteins are essential for viral infection. They can self-assemble to form capsids and encapsulate genomic RNA via direct sequence-specific interactions. At first glance, these properties do not seem crucial for eukaryotic proteins, but two recent studies reveal a quite unexpected means of neuronal communication that is reminiscent of viral infection.

The intriguing observation was that ARC protein and mRNA are not only present at synapses, but also enriched in extracellular vesicles (EVs) released by neurons. These EVs are endocytosed by target cells, where ARC mRNA is postsynaptically translated, as has been described both at the Drosophila neuromuscular junction (NMJ), between motor neurons and muscles, and in rat hippocampal neurons. How is this achieved? Presynaptic ARC proteins bind the 3'-UTR of ARC mRNA, oligomerize and form capsid-like structures, in which the mRNA is packaged. These eukaryotic 'capsids' are then released by neurons in EVs and they mediate ARC mRNA transfer into postsynaptic target cells. In flies, ARC knockdown in motor neurons results in a decrease in ARC mRNA and protein in muscles, and leads to impaired expansion of the NMJ, synaptic bouton maturation, and activity-dependent synaptic bouton formation. This phenotype is not rescued by the expression of an ARC construct in muscle alone, nor if the neuronal ARC mRNA construct is missing its 3'-UTR. Overall, these data suggest that it is not just the presence of ARC in presynaptic terminals, but the actual transfer to the postsynaptic region that is required for ARC function.

This exciting piece of information has been transferred to UniProtKB/Swiss-Prot rat and Drosophila ARC entries by means of the classical pathway of expert curation and, as of this release, the updated records are publicly available.

International protein nomenclature guidelines

The European Bioinformatics Institute (EMBL-EBI), the National Center for Biotechnology Information (NCBI), the Protein Information Resource (PIR) and the Swiss Institute of Bioinformatics (SIB) have worked together to produce a shared set of protein naming guidelines. These guidelines are intended for use by anyone who wants to name a protein and aim to promote consistent nomenclature, which is indispensable for communication, literature searching and data retrieval. They replace the previous UniProt protein naming guidelines and are available on the UniProt website as part of this release.

UniProtKB news

Cross-references to ComplexPortal

Cross-references have been added to ComplexPortal, a manually curated resource of macromolecular complexes.

ComplexPortal is available at https://www.ebi.ac.uk/complexportal/.

The format of the explicit links is:

Resource abbreviation ComplexPortal
Resource identifier Resource identifier
Optional information 1 Complex name

Example: Q8IY92

Show all entries having a cross-reference to ComplexPortal.

Cross-references to ComplexPortal may be isoform-specific. The general format of isoform-specific cross-references was described in release 2014_03.

Text format

Example: Q8IY92

DR   ComplexPortal; CPX-484; SLX4-TERF2 complex.

XML format

Example: Q8IY92

<dbReference type="ComplexPortal" id="CPX-484">
  <property type="entry name" value="SLX4-TERF2 complex"/>
</dbReference>

RDF format

Example: Q8IY92

uniprot:Q8IY92
  rdfs:seeAlso <http://purl.uniprot.org/complexportal/CPX-484> .
<http://purl.uniprot.org/complexportal/CPX-484>
  rdf:type up:Resource ;
  up:database <http://purl.uniprot.org/database/ComplexPortal> ;
  rdfs:comment "SLX4-TERF2 complex" .

Cross-references to ProteomicsDB

Cross-references have been added to ProteomicsDB, a human proteome resource.

ProteomicsDB is available at https://www.proteomicsdb.org/.

The format of the explicit links is:

Resource abbreviation ProteomicsDB
Resource identifier Resource identifier

Example: P41182

Show all entries having a cross-reference to ProteomicsDB.

Cross-references to ProteomicsDB may be isoform-specific. The general format of isoform-specific cross-references was described in release 2014_03.

Text format

Example: P41182

DR   ProteomicsDB; 55413; -.
DR   ProteomicsDB; 55414; -. [P41182-2]
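
The optional isoform suffix shown in the second line can be handled when parsing DR lines. The following sketch assumes the layout of the examples above, i.e. fields separated by '; ', a terminating period, and an optional isoform identifier in square brackets; the function name and the returned keys are hypothetical.

```python
import re

DR_LINE = re.compile(
    r"^DR   (?P<db>[^;]+); (?P<fields>.*?)\."
    r"(?: \[(?P<isoform>[^\]]+)\])?$"
)


def parse_dr(line):
    """Split a DR line into database, fields and optional isoform."""
    match = DR_LINE.match(line.rstrip())
    if match is None:
        raise ValueError(f"not a DR line: {line!r}")
    return {
        "database": match.group("db"),
        "fields": match.group("fields").split("; "),
        # None means the cross-reference is not isoform-specific.
        "isoform": match.group("isoform"),
    }


for dr in ("DR   ProteomicsDB; 55413; -.",
           "DR   ProteomicsDB; 55414; -. [P41182-2]"):
    print(parse_dr(dr))
```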

XML format

Example: P41182

<dbReference type="ProteomicsDB" id="55413"/>
<dbReference type="ProteomicsDB" id="55414">
   <molecule id="P41182-2"/>
</dbReference>

RDF format

Example: P41182

uniprot:P41182
  rdfs:seeAlso <http://purl.uniprot.org/proteomicsdb/55413> ;
  rdfs:seeAlso <http://purl.uniprot.org/proteomicsdb/55414> .
<http://purl.uniprot.org/proteomicsdb/55413> rdf:type up:Resource ;
  up:database <http://purl.uniprot.org/database/ProteomicsDB> .
<http://purl.uniprot.org/proteomicsdb/55414> rdf:type up:Resource ;
  up:database <http://purl.uniprot.org/database/ProteomicsDB> ;
  rdfs:seeAlso isoform:P41182-2 .

Cross-references to MoonDB

Cross-references have been added to MoonDB, a database of extreme multifunctional and moonlighting proteins.

MoonDB is available at http://moondb.hb.univ-amu.fr.

The format of the explicit links is:

Resource abbreviation MoonDB
Resource identifier UniProtKB accession number
Optional information 1 Entry type ("Curated" or "Predicted")

Example: Q13492

Show all entries having a cross-reference to MoonDB.

Text format

Example: Q13492

DR   MoonDB; Q13492; Curated.

XML format

Example: Q13492

<dbReference type="MoonDB" id="Q13492">
  <property type="type" value="Curated"/>
</dbReference>

RDF format

Example: Q13492

uniprot:Q13492
  rdfs:seeAlso <http://purl.uniprot.org/moondb/Q13492> .
<http://purl.uniprot.org/moondb/Q13492>
  rdf:type up:Resource ;
  up:database <http://purl.uniprot.org/database/MoonDB> ;
  rdfs:comment "Curated" .

Changes to the controlled vocabulary of human diseases

New diseases:

UniProt website news

To improve security and privacy, we have moved our web pages and services from HTTP to HTTPS.

The HTTP protocol does not provide encryption - anyone who can see web traffic between a client (e.g. a web browser) and a server can intercept potentially sensitive information and/or inject malware into users' browsers or operating systems. HTTPS solves this problem by encrypting web traffic between a client and a server in both directions, so that observers cannot intercept or tamper with the client's requests or the server's responses. It also provides authentication, ensuring that the client is communicating with the intended server given by the hostname, and not some impostor.

Timeline

We supported separate HTTP and HTTPS services until release 2018_06 (June 20, 2018). As of this date, HTTP traffic is automatically redirected to HTTPS. We intend to maintain these redirects indefinitely, but it is to your advantage to update your applications to use HTTPS as soon as possible, both for performance and security reasons.

Interactive users

If you access our pages only through a Web browser (like Chrome, Firefox, Safari, Internet Explorer, Opera, etc.), the only change after the switchover date is that a green lock icon should appear inside the URL box of your browser, and the web addresses of the pages you visit will start with https://. We recommend that you update your bookmarks and links accordingly.

Programmatic users

Applications that access web servers using http:// URLs instead of https:// URLs may fail after a switch to HTTPS for the following reasons:

  • Your programming environment's HTTP facility does not automatically follow redirects from HTTP to HTTPS. Some libraries follow redirections from HTTP to HTTPS, others do not (e.g. Java's URLConnection).
  • Your application uses HTTP requests other than GET and HEAD. These requests (including especially POST and PUT) will fail with HTTP 403 Forbidden after the switchover date.
  • Your application accesses our servers through a proxy. Check with your proxy vendor about HTTPS support and how to add or update certificates.
  • Your programming environment does not support HTTPS.

After the switchover date, our servers:

  • respond with a server-side redirect (HTTP 301 Moved permanently) to the corresponding HTTPS URL for HTTP GET and HEAD requests
  • respond with HTTP 403 Forbidden and an error message to all HTTP requests other than GET and HEAD (including and especially HTTP POST).

URLs that start with http://purl.uniprot.org/, which are used as URIs in the UniProt RDF distribution and SPARQL service, are redirected to the corresponding HTTPS web page when used in a web context.
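
For programmatic users who store service URLs in their own code or configuration, the update can be sketched as a small rewrite helper. It assumes that http://purl.uniprot.org/ URLs should be left untouched wherever they serve as identifiers in RDF or SPARQL; the function name and the example service URL are hypothetical.

```python
from urllib.parse import urlsplit, urlunsplit

# Hosts whose http:// URLs are used as identifiers (URIs), not as web links.
IDENTIFIER_HOSTS = {"purl.uniprot.org"}


def upgrade_to_https(url):
    """Rewrite an http:// URL to https://, leaving identifier URIs alone."""
    parts = urlsplit(url)
    if parts.scheme != "http" or parts.hostname in IDENTIFIER_HOSTS:
        return url
    # Keep netloc, path, query and fragment; only the scheme changes.
    return urlunsplit(("https",) + tuple(parts[1:]))


print(upgrade_to_https("http://www.uniprot.org/uniprot/P02730.txt"))
print(upgrade_to_https("http://purl.uniprot.org/bgee/ENSRNOG00000010756"))
```

Rewriting the URLs up front avoids relying on redirects, which matters most for POST and PUT requests, since these are rejected over HTTP rather than redirected.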

UniProt release 2018_05

Published May 23, 2018

Headline

Selenium vs. Sulfur: and the winner is...

Selenium is a chemical element that, in trace amounts, is essential for cellular function in many, though not all, organisms from all kingdoms of life. Proteins incorporate selenium as selenocysteine (Sec), where selenium replaces the sulfur of cysteine, when a UGA stop codon is "recoded" by a Sec-tRNA and a selenocysteine insertion sequence (SECIS) within target mRNA. Sec is indispensable for mammalian life and deficiency in Sec-tRNA is embryonic-lethal (shortly after implantation) in mice, yet this process is complex, inefficient and energetically costly. Why then does Mother Nature continue to produce selenoproteins in spite of these drawbacks?

Recent work from Ingold et al. suggests that one reason may be the ability of selenium to protect cells from a specific form of oxidative stress leading to cell death. The authors focused on the phospholipid hydroperoxide glutathione peroxidase GPX4, an essential selenoprotein and the only one whose knockout phenotype mimics that of Sec-tRNA gene disruption. GPX4 catalyzes the reduction of toxic lipid hydroperoxides formed when ferrous iron is imported into cells in the presence of reactive oxygen species produced during aerobic metabolism. If left unchecked, lipid peroxides can spontaneously propagate, directly damaging membranes or generating other toxic products, leading to a specific form of cell death, called ferroptosis. Mice in which the active site of GPX4 (Sec-73) is replaced by cysteine (GPX4-Cys) develop normally, but experience fatal seizures 2-3 weeks after birth. This phenotype is due to the lack of parvalbumin-positive GABAergic interneurons, which are important regulators of cortical network excitability. Hence the presence of Sec is essential for specific developmental events, such as the maturation of a specific class of neurons. In adult mice, the conditional expression of the GPX4-Cys mutant did not show any peculiar phenotype.

Cys substitution greatly reduces GPX4 activity, although it does not abolish it. In the presence of increasing levels of H2O2, GPX4-Cys readily undergoes irreversible oxidation and the mutant GPX4-Cys cells become exquisitely sensitive to peroxide-induced ferroptosis. In conclusion, the critical advantage of selenolate- versus thiolate-based catalysis may lie in its resistance to overoxidation when cells increase their metabolic rates and mitochondrial H2O2 production.

Selenium was discovered in 1817, almost exactly 200 years ago, and it is quite exciting to celebrate this anniversary with a new discovery about its role in higher organisms. As of this release, the updated GPX4 entries are publicly available.

Changes to the controlled vocabulary of human diseases

New diseases:

Deleted diseases

  • Epidermolysis bullosa dystrophica, Hallopeau-Siemens type
  • Epidermolysis bullosa dystrophica, Pasini type

Changes to the controlled vocabulary for PTMs

New term for the feature key 'Modified residue' ('MOD_RES' in the flat file):

  • Pyruvic acid (Tyr)

UniRef news

GO annotation to UniRef90 and UniRef50 clusters (also in clusters with one member)

In release 2017_05, we announced the addition of Gene Ontology (GO) annotations for UniRef90 and UniRef50 clusters: In this first approach, GO terms were assigned to clusters with at least 2 members, and a GO term was added to a cluster when it was found in all UniProtKB members, or when it was a common ancestor of at least one GO term of each member.

As of this release, 2018_05, we are also adding GO annotations to UniRef90 and UniRef50 singleton clusters, i.e. clusters that have only one member. These clusters inherit the GO terms of their single member.
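
The two rules can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the production procedure: the ontology is reduced to a toy dictionary mapping each term to the set of its ancestors, a term is treated as its own ancestor, and no attempt is made to prune redundant ancestor terms from the result. The function name and the term labels are hypothetical.

```python
def cluster_go_terms(member_terms, ancestors):
    """Derive cluster-level terms from the terms of the cluster members.

    member_terms: list of sets of term ids, one set per member.
    ancestors: dict mapping a term id to the set of its ancestor ids.
    """
    if not member_terms:
        return set()
    if len(member_terms) == 1:
        # Singleton cluster: inherit the terms of the single member.
        return set(member_terms[0])
    closures = []
    for terms in member_terms:
        # The member's own terms plus all of their ancestors.
        closure = set(terms)
        for term in terms:
            closure |= ancestors.get(term, set())
        closures.append(closure)
    # Keep terms shared by all members, directly or through an ancestor.
    return set.intersection(*closures)


toy_ancestors = {
    "T:kinase": {"T:catalytic"},
    "T:phosphatase": {"T:catalytic"},
}
members = [{"T:kinase", "T:binding"}, {"T:phosphatase", "T:binding"}]
print(sorted(cluster_go_terms(members, toy_ancestors)))
```

In this toy case, 'T:binding' is kept because every member carries it, and 'T:catalytic' is kept because it is a common ancestor of one term of each member.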

UniProt release 2018_04

Published April 25, 2018

Headline

The Matrix (enzymes) Reloaded

Collagen is the major protein that stitches together animal tissues, and is the most abundant protein in mammals, making up 25-35% of our body weight. It comprises three individual protein molecules, which coil together to form tropocollagen fibers, which in turn make microfibrils. Collagen is extremely stable and extremely ancient; collagen fragments have been sequenced from 80-million-year-old dinosaurs, such as Brachylophosaurus canadensis and Tyrannosaurus rex, and collagen is found in all extant metazoans. The breakdown of collagen is essential to permit tissue growth, and all animals have the ability to metabolize collagen in a very controlled way by cutting a single site. Infectious bacteria, such as gas gangrene-causing Clostridium perfringens and Hathewaya histolytica, on the other hand, digest collagen indiscriminately, using collagenases with both endopeptidase and tripeptidylcarboxypeptidase activities. This rampant activity causes massive tissue disruption, favoring bacterial colonization and virulence, and is obviously severely problematic in a clinical setting.

Despite their different approaches to collagen degradation (cautious versus gung-ho), mammalian and clostridial collagenases have similar enzymatic mechanisms and many inhibitors work on both types of collagenases, making them unsuitable for antibacterial therapy. Recent work by Schönauer et al. has found promising new molecules that inhibit only bacterial and not mammalian collagenases, pointing to a possible way to block bacterial collagenase action, in a wound setting for example. By not attacking the bacteria directly, these inhibitors should provide novel, non-selective ways to treat some of the damage inflicted by these bacteria, while minimizing potential resistance. While these inhibitors are undoubtedly very useful, there are also many applications in which potentially undesirable bacterial collagenase activities are actively exploited. The H. histolytica collagenases (ColG and ColH) are used to isolate pancreatic islet cells for transplantation, to remove retained placenta in cattle and horses, to debride wounds, ulcers and severe burns (SANTYL Ointment, Smith and Nephew, Inc.), and to treat human diseases caused by abnormal accumulation of collagen plaques such as Dupuytren's disease and Peyronie's disease (Xiaflex, Endo Pharmaceuticals, Inc.). Dupuytren's disease is an abnormal deposition of collagen in the hand that causes permanent contraction. In Peyronie's disease, collagen forms fibrous plaques in the penis, restricting erection. Collagenase injection relieves this accumulation, leading to an increased quality of life. The collagen-binding domain of collagenases, when attached to other proteins, promotes their retention at injection sites for as long as 10 days. Although this is far from the only example of a repurposed enzyme (think of Botox, another clostridial protein), it is fascinating how a protein class that can be so dangerous to life, when harnessed, can be so very helpful.

As of this release, 3 clostridial collagenases have been expertly updated in UniProtKB/Swiss-Prot.

Cross-references to GlyConnect

Cross-references have been added to the GlyConnect database and protein glycosylation platform.

GlyConnect is available at https://glyconnect.expasy.org.

The format of the explicit links is:

Resource abbreviation GlyConnect
Resource identifier Resource identifier

Example: P00742

Show all entries having a cross-reference to GlyConnect.

Cross-references to GlyConnect may be isoform-specific. The general format of isoform-specific cross-references was described in release 2014_03.

Text format

Example: P00742

DR   GlyConnect; 102; -.

XML format

Example: P00742

<dbReference type="GlyConnect" id="102"/>

RDF format

Example: P00742

uniprot:P00742
  rdfs:seeAlso <http://purl.uniprot.org/glyconnect/102> .
<http://purl.uniprot.org/glyconnect/102>
  rdf:type up:Resource ;
  up:database <http://purl.uniprot.org/database/GlyConnect> .

Changes to the controlled vocabulary of human diseases

New diseases:

Modified diseases:

UniProt release 2018_03

Published March 28, 2018

Headline

Ama-(not a)-toxin: a cap on death

Amanita and Galerina mushrooms are responsible for a large number of food poisoning cases and deaths across the world. Like other poisonous mushrooms, Amanita and Galerina express a cocktail of toxic peptides, but the major lethal components are amatoxins. The typical symptoms of amatoxin poisoning are gastro-intestinal distress beginning 6 to 12 hours after ingestion, a remission phase lasting 12 to 24 hours, and progressive loss of liver function culminating in death 3 to 5 days later. One of the few effective treatments is liver transplantation.

Amatoxins are bicyclic octapeptides that act by binding non-competitively to RNA polymerase II and greatly slowing transcriptional elongation. Most mycotoxic cyclic peptides are synthesized by nonribosomal peptide synthetases. This is not the case for amatoxins (and related compounds), which are encoded by the genome and synthesized by ribosomes. The amatoxin genes encode 35 amino acid-long propeptides that are processed by a dual macrocyclase-peptidase, called POPB. They belong to an extended family called MSDIN (after the 5 N-terminal amino acids of the propeptide), a family that also includes phallotoxins, such as phalloidin and phallicidin. Although structurally related to amatoxins, phallotoxins are bicyclic heptapeptides and have a different mode of action: they stabilize F-actin. Luckily, phallotoxins are poorly absorbed through the gut, and therefore make only a small contribution to toxicity after mushroom ingestion.

While the amatoxins are undoubtedly extremely dangerous, some MSDIN cyclopeptides may actually be beneficial. One example is the antamanide protein of Amanita phalloides (the 'death cap' mushroom), which can act as a competitive antagonist and a natural antidote to the lethal toxins, if administered before, or simultaneously with, the poisons. In addition, antamanide may also protect cells from death by targeting cyclophilin D and inhibiting the mitochondrial permeability transition pore, a central effector of cell death induction. Another mushroom, A. exitialis, produces a structurally closely related cyclic nonapeptide, called amanexitide, that has been suggested to have a similar antidote activity. Unfortunately, the concentration of such natural antidotes tends to be much lower than that of the toxins they protect against, meaning that consumers of these deadly mushrooms don't feel the benefit, and we strongly recommend that readers refrain from their consumption.

Toxic MSDIN family members, as well as a number of natural antidotes, have been identified in several Amanita species, including A. bisporigera, A. phalloides, A. exitialis, A. fuligineoides, A. fuliginea, A. ocreata, A. pallidorosea and A. rimosa as well as in Galerina marginata. Expert curated entries describing their biology can be found in UniProtKB/Swiss-Prot, publicly available as of this release.

Cross-references to VGNC (Vertebrate Gene Nomenclature Database)

Cross-references have been added to the VGNC Vertebrate Gene Nomenclature Database.

VGNC is available at https://vertebrate.genenames.org/.

The format of the explicit links is:

Resource abbreviation: VGNC
Resource identifier: Resource identifier
Optional information 1: Gene designation

Example: P11613

Show all entries having a cross-reference to VGNC.

Text format

Example: P11613

DR   VGNC; VGNC:37509; ACKR3.
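
For readers who process the flat file programmatically, a DR line such as the one above can be split into the fields listed in the format description. The following is a minimal Python sketch, not an official parser: the function name `parse_dr_line` and the returned dictionary keys are hypothetical, and it assumes a simple line that ends with a single period (lines carrying extra trailing annotations are not handled).

```python
def parse_dr_line(line):
    """Split a UniProtKB flat-file DR line into its fields.

    Assumes the layout 'DR   Resource; Identifier; Optional1[; ...].'
    and treats '-' as an empty optional field.
    """
    if not line.startswith("DR   "):
        raise ValueError("not a DR line: %r" % line)
    body = line[5:].rstrip()
    if body.endswith("."):
        body = body[:-1]  # drop only the terminating period
    fields = [field.strip() for field in body.split(";")]
    if len(fields) < 2:
        raise ValueError("DR line needs a resource and an identifier")
    return {
        "resource": fields[0],
        "identifier": fields[1],
        # '-' marks an absent optional field, so it is filtered out
        "optional": [f for f in fields[2:] if f != "-"],
    }


if __name__ == "__main__":
    print(parse_dr_line("DR   VGNC; VGNC:37509; ACKR3."))
```

With the VGNC example, the optional list holds the gene designation; for resources without optional information, the list is empty.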

XML format

Example: P11613

<dbReference type="VGNC" id="VGNC:37509">
  <property type="gene designation" value="ACKR3"/>
</dbReference>

RDF format

Example: P11613

uniprot:P11613
  rdfs:seeAlso <http://purl.uniprot.org/vgnc/37509> .
<http://purl.uniprot.org/vgnc/37509>
  rdf:type up:Resource ;
  up:database <http://purl.uniprot.org/database/VGNC> ;
  rdfs:comment "ACKR3" .

Changes to the controlled vocabulary of human diseases

New diseases:

Modified disease:

Deleted diseases

  • Mental retardation, autosomal dominant 8
  • Mental retardation, X-linked, syndromic, Borck type

Changes to the controlled vocabulary for PTMs

New terms for the feature key 'Cross-link' ('CROSSLNK' in the flat file):

  • Cyclopeptide (Cys-Pro)
  • Cyclopeptide (Gly-Pro)
  • Cyclopeptide (His-Pro)
  • Cyclopeptide (Leu-Pro)
  • Cyclopeptide (Met-Pro)
  • Cyclopeptide (Phe-Pro)
  • Cyclopeptide (Ser-Pro)
  • Cyclopeptide (Trp-Pro)
  • Cyclopeptide (Tyr-Pro)
  • Cyclopeptide (Val-Pro)

Changes in subcellular location controlled vocabulary

New subcellular locations:

UniProt release 2018_02

Published February 28, 2018

Headline

Escaping friendly fire

During the first hours of an infection, our safety relies almost entirely on the innate immune system, and predominantly on neutrophils. The encounter between neutrophils and invading microbes leads to neutrophil activation and to the engulfment of pathogens into intracellular phagosomes, where exposure to high concentrations of reactive oxygen species (ROS) and antimicrobial peptides eventually kills them. Neutrophils defend us not only in life but also in death, when they release chromatin and granule proteins that together form extracellular fibers, called 'neutrophil extracellular traps' or NETs, which catch and prevent the spread of microorganisms. NETs are covered with antimicrobial compounds, such as cathelicidin peptides, as well as histones, which can also effectively neutralize intruders. This process is so efficient that extracellular DNases able to catalyze NET disruption serve as virulence factors in several pathogenic bacteria, such as in group A Streptococcus.

NETs are a double-edged sword and have to be regulated very tightly. Indeed, free extracellular DNA is a potent trigger of autoimmune responses, such as that encountered in systemic lupus erythematosus (SLE), which is characterized by circulating anti-DNA antibodies. NETs can also initiate vascular occlusion in a fibrin-independent manner. In other words, NETs are not an innocuous therapy in the medium to long term and the host has to get rid of them quickly. Timely removal of NET chromatin by DNases DNASE1 and DNASE1L3 has been shown to play a crucial role in the prevention of autoimmunity. However, it was not known until recently what mechanism was involved in NET clearance under inflammatory conditions. This issue was addressed by Jimenez-Alcazar and colleagues. They created knockout mice lacking both DNASE1 and DNASE1L3. Mutant animals were treated with granulocyte colony-stimulating factor (G-CSF) to induce chronic neutrophilia, a condition mimicking acute inflammation. While wild-type mice showed no sign of distress, all double knockout animals exhibited features of infection-induced thrombotic microangiopathies (TMAs) and died within 6 days. This phenotype could be reversed by the reintroduction of DNASE1 or DNASE1L3, but not by an anti-thrombotic treatment, further supporting the idea that NETs can clog vessels by themselves. TMAs are a well-known complication encountered by patients suffering from systemic bacterial infections. Analysis of lungs from patients with acute respiratory distress syndrome and/or sepsis revealed numerous NET-derived clots in their blood vessels. It is too early yet to propose DNase treatment for TMA patients, but at least it opens new therapeutic perspectives.

As of this release, murine DNASE1 and DNASE1L3 and their orthologs in other mammalian species have been updated and are now publicly available.

UniProtKB news

UniProtKB FASTA headers: Addition of NCBI taxonomy identifier

In order to avoid ambiguities and simplify parsing, we have added the NCBI taxonomy identifier to UniProtKB FASTA headers.

Previous format:

>db|UniqueIdentifier|EntryName ProteinName OS=OrganismName [GN=GeneName ]PE=ProteinExistence SV=SequenceVersion

New format:

>db|UniqueIdentifier|EntryName ProteinName OS=OrganismName OX=OrganismIdentifier [GN=GeneName ]PE=ProteinExistence SV=SequenceVersion

Where:

  • db is 'sp' for UniProtKB/Swiss-Prot and 'tr' for UniProtKB/TrEMBL.
  • UniqueIdentifier is the primary accession number of the UniProtKB entry.
  • EntryName is the entry name of the UniProtKB entry.
  • ProteinName is the recommended name of the UniProtKB entry as annotated in the RecName field. For UniProtKB/TrEMBL entries without a RecName field, the SubName field is used. In case of multiple SubNames, the first one is used. The 'precursor' attribute is excluded, 'Fragment' is included with the name if applicable.
  • OrganismName is the scientific name of the organism of the UniProtKB entry.
  • OrganismIdentifier is the unique identifier of the source organism, assigned by the NCBI.
  • GeneName is the first gene name of the UniProtKB entry. If there is no gene name, OrderedLocusName or ORFname, the GN field is not listed.
  • ProteinExistence is the numerical value describing the evidence for the existence of the protein.
  • SequenceVersion is the version number of the sequence.

Examples:

>sp|Q8I6R7|ACN2_ACAGO Acanthoscurrin-2 (Fragment) OS=Acanthoscurria gomesiana OX=115339 GN=acantho2 PE=1 SV=1
>sp|P27748|ACOX_CUPNH Acetoin catabolism protein X OS=Cupriavidus necator (strain ATCC 17699 / H16 / DSM 428 / Stanier 337) OX=381666 GN=acoX PE=4 SV=2
>sp|P04224|HA22_MOUSE H-2 class II histocompatibility antigen, E-K alpha chain OS=Mus musculus OX=10090 PE=1 SV=1

>tr|Q3SA23|Q3SA23_9HIV1 Protein Nef (Fragment) OS=Human immunodeficiency virus 1 OX=11676 GN=nef PE=3 SV=1
>tr|Q8N2H2|Q8N2H2_HUMAN cDNA FLJ90785 fis, clone THYRO1001457, moderately similar to H.sapiens protein kinase C mu OS=Homo sapiens OX=9606 PE=2 SV=1

The same modification has been applied to FASTA headers of alternative isoforms in UniProtKB/Swiss-Prot, where the new format is:

>sp|IsoID|EntryName Isoform IsoformName of ProteinName OS=OrganismName OX=OrganismIdentifier[ GN=GeneName]

Example:

>sp|Q4R572-2|1433B_MACFA Isoform Short of 14-3-3 protein beta/alpha OS=Macaca fascicularis OX=9541 GN=YWHAB
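
To illustrate how the new OX field can simplify parsing, here is a minimal Python sketch that reads both the canonical and the isoform header formats described above. The regular expression, the function name `parse_fasta_header` and the dictionary keys are illustrative assumptions rather than part of any UniProt tool; the sketch also assumes that gene names contain no whitespace and that protein names never contain the string ' OS='.

```python
import re

# Canonical headers end with PE= and SV=; isoform headers omit both.
HEADER_RE = re.compile(
    r"^>(?P<db>sp|tr)\|(?P<accession>[^|]+)\|(?P<entry_name>\S+)\s+"
    r"(?P<protein_name>.+?)\s+OS=(?P<organism_name>.+?)\s+OX=(?P<organism_id>\d+)"
    r"(?:\s+GN=(?P<gene_name>\S+))?"
    r"(?:\s+PE=(?P<protein_existence>\d+)\s+SV=(?P<sequence_version>\d+))?\s*$"
)


def parse_fasta_header(header):
    """Return the fields of a UniProtKB FASTA header as a dictionary."""
    match = HEADER_RE.match(header)
    if match is None:
        raise ValueError("unrecognized UniProtKB FASTA header: %r" % header)
    fields = match.groupdict()
    # Numeric fields are converted; absent optional fields stay None.
    for key in ("organism_id", "protein_existence", "sequence_version"):
        if fields[key] is not None:
            fields[key] = int(fields[key])
    return fields


if __name__ == "__main__":
    example = (">sp|P04224|HA22_MOUSE H-2 class II histocompatibility antigen, "
               "E-K alpha chain OS=Mus musculus OX=10090 PE=1 SV=1")
    print(parse_fasta_header(example))
```

Because OX is a plain integer placed directly after the organism name, the end of the free-text OS field can be located unambiguously, even when the organism name itself contains spaces, slashes or parentheses.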

Cross-references to CarbonylDB

Cross-references have been added to the CarbonylDB database, a resource of protein carbonylation sites.

CarbonylDB is available at http://digbio.missouri.edu/CarbonylDB/.

The format of the explicit links is:

Resource abbreviation: CarbonylDB
Resource identifier: UniProtKB accession number

Example: P02768

Show all entries having a cross-reference to CarbonylDB.

Text format

Example: P02768

DR   CarbonylDB; P02768; -.

XML format

Example: P02768

<dbReference type="CarbonylDB" id="P02768"/>
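
As an illustration, a dbReference element like the one above can be read with Python's standard library. This is a minimal sketch under stated assumptions: the function name `parse_db_reference` is hypothetical, and the fragment is parsed on its own without an XML namespace (elements inside a complete UniProtKB XML file are namespaced, which this sketch does not handle).

```python
import xml.etree.ElementTree as ET


def parse_db_reference(xml_fragment):
    """Extract type, id and optional properties from one dbReference element."""
    element = ET.fromstring(xml_fragment)
    if element.tag != "dbReference":
        raise ValueError("expected a dbReference element, got %r" % element.tag)
    return {
        "type": element.get("type"),
        "id": element.get("id"),
        # Optional information is carried by <property type=... value=.../> children
        "properties": {
            prop.get("type"): prop.get("value")
            for prop in element.findall("property")
        },
    }


if __name__ == "__main__":
    print(parse_db_reference('<dbReference type="CarbonylDB" id="P02768"/>'))
```

For cross-references without optional information, such as CarbonylDB, the properties dictionary is simply empty.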

RDF format

Example: P02768

uniprot:P02768
  rdfs:seeAlso <http://purl.uniprot.org/carbonyldb/P02768> .
<http://purl.uniprot.org/carbonyldb/P02768>
  rdf:type up:Resource ;
  up:database <http://purl.uniprot.org/database/CarbonylDB> .

Changes to the controlled vocabulary of human diseases

New diseases:

Modified diseases:

Changes to the controlled vocabulary for PTMs

New term for the feature key 'Modified residue' ('MOD_RES' in the flat file):

  • O-AMP-serine

RDF news

Change of URIs for OrthoDB

For historic reasons, UniProt had to generate URIs to cross-reference databases that did not have an RDF representation. Our policy is to replace these by the URIs generated by the cross-referenced database once it starts to distribute an RDF representation of its data.

The URIs for the OrthoDB database have therefore been updated from:

http://purl.uniprot.org/orthodb/<ID>

to:

http://purl.orthodb.org/odbgroup/<ID>

If required for backward compatibility, you can use the following query to add the old URIs:

PREFIX owl:<http://www.w3.org/2002/07/owl#>
PREFIX rdfs:<http://www.w3.org/2000/01/rdf-schema#>
PREFIX up:<http://purl.uniprot.org/core/>
INSERT
{
   ?protein rdfs:seeAlso ?old .
   ?old owl:sameAs ?new .
   ?old up:database <http://purl.uniprot.org/database/OrthoDB> .
}
WHERE
{
   ?protein rdfs:seeAlso ?new .
   ?new up:database <http://purl.uniprot.org/database/OrthoDB> .
   # Keep only the <ID> that follows the new OrthoDB prefix and append it to the old prefix
   BIND(iri(concat('http://purl.uniprot.org/orthodb/', strafter(str(?new), 'http://purl.orthodb.org/odbgroup/'))) AS ?old)
}

The dereferencing of existing http://purl.uniprot.org/orthodb/<ID> URIs will be maintained.
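
For code that handles these URIs outside a SPARQL endpoint, the same mapping can be sketched in Python. The function name `old_orthodb_uri` is hypothetical and `EXAMPLE_ID` is a placeholder, not a real OrthoDB group identifier; the two prefixes are the ones quoted above.

```python
NEW_PREFIX = "http://purl.orthodb.org/odbgroup/"
OLD_PREFIX = "http://purl.uniprot.org/orthodb/"


def old_orthodb_uri(new_uri):
    """Map a new-style OrthoDB URI to the URI UniProt generated before this change."""
    if not new_uri.startswith(NEW_PREFIX):
        raise ValueError("not a new-style OrthoDB URI: %r" % new_uri)
    # Everything after the new prefix is the <ID> part
    return OLD_PREFIX + new_uri[len(NEW_PREFIX):]


if __name__ == "__main__":
    # EXAMPLE_ID is a placeholder identifier used only for illustration
    print(old_orthodb_uri(NEW_PREFIX + "EXAMPLE_ID"))
```

Matching on the full prefix rather than on a fixed character offset keeps the mapping correct even if a URI with a different prefix is passed by mistake, since such input is rejected.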

UniProt release 2018_01

Published January 31, 2018

Headline

Zika virus: from petty crime to banditry

A Zika virus (ZIKV) outbreak in Brazil in 2015 drew the world's attention to this microbe (see UniProt headline). The situation was so severe that in February 2016 it was declared to be a 'public health emergency of international concern' by the World Health Organization (WHO), indicating that it constituted a public health risk to other States through the international spread of the disease and potentially required a coordinated international response.

ZIKV has been known for over 70 years since its first isolation from a febrile rhesus macaque in the Ugandan Zika forest. The clinical symptoms caused by ZIKV infection in humans were mild at that time, consisting of a self-limiting flu-like febrile illness that resolved within days and occurred in an estimated 20% of infected individuals. The picture of the recent epidemic was however dramatically different. ZIKV infection was associated with severe symptoms, including multi-organ failure. The most alarming feature was its ability to cause microcephaly, congenital malformations, and fetal demise when pregnant women were infected.

When did the metamorphosis from an almost innocuous agent to a congenital pathogen with global impact occur? In the decades following its discovery, sporadic human ZIKV infections were reported in a few countries in Africa, and then the virus started spreading, first to Southeast Asia, to Micronesia in 2007, to French Polynesia in 2013-2014, and soon after to South and Central America. Comparison of ZIKV neurovirulence between 'ancestral' (African/Southeast Asian) and 'contemporary' (Polynesian/South American) strains was done by intracerebral injections of the virus in neonatal mice. All 3 contemporary strains led to 100% mortality, with typical neurological manifestations. By contrast, the 'ancestral' strain killed less than 17% of the animals. Moreover, in a mouse embryonic microcephaly model, infection with a 'contemporary' ZIKV strain resulted in brains exhibiting a substantial degree of microcephaly, whereas the 'ancestral' strain caused less severe symptoms. Both viruses targeted neural progenitor cells, but the 'contemporary' strain showed significantly enhanced replication in the brain compared with the 'ancestral' one. Obviously something had changed between the 'ancestral' ZIKV and its 'contemporary' version, something that boosted ZIKV neurovirulence, but what?

This question was addressed by Yuan et al. Sequence alignments between 'ancestral' (INSDC accession number AY632535) and 'contemporary' (KJ776791) strains show many differences at the amino acid level. To find out which changes account for increased neurovirulence, several 'contemporary' strain-specific substitutions were introduced in the 'ancestral' strain and tested in neonatal mice. One of them, the substitution of a serine residue by an asparagine at position 139 (in the precursor polyprotein), S139N, greatly increased the neurovirulence of the ancestral strain. The mutant virus also showed enhanced replication in neural progenitor cells and caused more extensive cell death compared with the original 'ancestral' virus. Conversely, when this residue was mutated back to serine in the 'contemporary' strain, mortality caused by the 'contemporary' virus in neonatal mice was significantly decreased. The ZIKV S139N substitution probably emerged in May 2013, a few months before the outbreak in French Polynesia, and was then stably maintained in the epidemic strain during its subsequent spread to the Americas. Its emergence correlates with reports of microcephaly and other severe neurological abnormalities.

After maturation of the genome polyprotein, position 139 is found in viral protein prM, which, in flaviviruses, closely associates with the envelope protein E and is believed to prevent premature fusion of immature virions inside infected cells. However, the mechanism through which the S139N substitution increases neurovirulence is not yet known.

At the beginning of 2016, UniProtKB/Swiss-Prot released the annotated sequence of a ZIKV genome polyprotein, corresponding to the East African 'ancestral' strain. In order to meet the needs of the scientific community, we have now released that of a 'contemporary' strain, isolated from a French Polynesian sample.

Changes to the controlled vocabulary for PTMs

New terms for the feature key 'Modified residue' ('MOD_RES' in the flat file):

  • S-carbamoylcysteine
  • S-cyanocysteine

UniProt release 2017_12

Published December 20, 2017

Headline

Swiss-Prot in the sky with psilocybin: the biosynthesis pathway of a psychedelic drug unveiled

Psychedelic mushrooms, also called 'magic mushrooms', have been used by humans since prehistoric times and can be found depicted in Stone Age rock art in Europe and Africa. Some cultures have used them for religious rites and ceremonies, especially in pre-Columbian Mesoamerica. Aztecs and Mazatecs referred to them as genius mushrooms, divinatory mushrooms, and wondrous mushrooms. A Psilocybe species was known to the Aztecs as 'teōnanācatl', literally 'the divine mushroom'.

The effects of many psychedelic mushrooms come from the pro-drug psilocybin. When psilocybin is ingested, this natural compound is rapidly metabolized to yield psilocin. The latter acts as a serotonergic psychedelic substance. Its effects include euphoria, altered thinking, visual hallucinations, altered sense of time and spiritual experiences. Some consider the drug an entheogen and a tool to supplement practices for transcendence. Psilocybin is considered to have low toxicity and harm potential, although some very rare cases of lethality have been reported. In most countries, psilocybin and psilocin are listed as schedule I drugs, i.e. compounds that have a high potential for abuse and are not recognized for medical use.

Nevertheless, over the last 30 years, the potential medical and psychological therapeutic benefits of psilocybin have been investigated. Clinical studies revealed a positive trend in the treatment of existential anxiety in advanced-stage cancer patients and of nicotine addiction. Studies on the clinical use of psilocybin against depression are ongoing.

The structures of both psilocybin and psilocin were determined in 1959 by Hofmann et al., but the basis of their biosynthesis has remained obscure for almost 60 years. The locus for the biosynthesis of psilocybin, called psi, has been recently identified in 2 out of over 100 species of psilocybin mushrooms, namely Psilocybe cubensis and Psilocybe cyanescens.

The psi locus encodes 4 psilocybin biosynthesis enzymes, including a new type of fungal L-tryptophan decarboxylase (psiD), a kinase (psiK), a methyltransferase (psiM), and a cytochrome P450 monooxygenase (psiH). All 4 have been characterized and are sufficient to produce psilocybin from the amino acid L-tryptophan. The first step of the psilocybin biosynthetic pathway is the decarboxylation of L-tryptophan to tryptamine by psiD. The cytochrome P450 monooxygenase psiH then converts tryptamine to 4-hydroxytryptamine. The kinase psiK catalyzes the 4-O-phosphorylation step by converting 4-hydroxytryptamine into norbaeocystin. The methyltransferase psiM eventually catalyzes iterative methyl transfer to the amino group of norbaeocystin to yield psilocybin via a monomethylated intermediate, called baeocystin. The psi locus also contains 2 major facilitator-type transporters (psiT1 and psiT2), as well as a cluster-specific transcriptional regulator (psiR).

As of this release, expertly annotated Psilocybe cubensis psi locus proteins psiD, psiH, psiK, psiM, psiR, psiT1, and psiT2 are publicly and legally available in UniProtKB/Swiss-Prot.

Changes to the controlled vocabulary of human diseases

New diseases:

Modified diseases:

Deleted disease

  • Weissenbacher-Zweymueller syndrome

Changes in subcellular location controlled vocabulary

New subcellular locations:

Changes to keywords

New keyword:

UniProt release 2017_11

Published November 22, 2017

Headline

Sex determination in insects: 50 ways to achieve sex-specific splicing

The primary signals triggering sex determination in insects are amazingly diverse among various species and sometimes even between strains of the same species. These various signals converge on a single downstream conserved transformer gene (tra) which undergoes sex-specific splicing. In developing females, splicing results in the production of an active tra protein. Tra in turn regulates sex-specific splicing of another highly conserved gene of this signaling cascade, namely doublesex (dsx), which ultimately decides the sexual fate of the embryo. In males, tra splicing includes an exon containing several in-frame stop codons. This yields a truncated, inactive isoform that is unable to affect dsx splicing, so that a male-specific dsx isoform is produced.

The primary signals can be environmental or genetic. In some species, temperature, population density or nutritional status can trigger the sexual fate of the embryo. In the most studied organism, Drosophila (fruit fly), the number of X chromosomes in the embryo is crucial: 2 X chromosomes lead to female development, 1 X results in males. Counting X chromosomes is a mechanism common to drosophilids, but rarely observed outside this group. In other species, such as wasps, ants and bees, sexual fate depends upon the fertilization process: unfertilized eggs (haploid) give rise to males and fertilized diploid eggs to females. Yet other insects rely on dominant Mendelian cues, which can be either male-determining (usually referred to as M-factor) as in many dipterans, or female-determining (F-factor) as in butterflies. Due to their bewildering diversity, these cues are difficult to pinpoint. Nevertheless, recent years have seen a few major breakthroughs in the identification of M-factors.

In 2015, Hall et al. identified the M-factor Nix in the yellow fever mosquito Aedes aegypti. Nix is expressed very early in male embryonic development. Knockout of Nix results in the production of the dsx female isoform and feminization, while ectopic expression of Nix in females leads to the formation of nearly complete male genitalia. Nix appears to be confined to a subset of mosquitoes: only the Asian tiger mosquito (Aedes albopictus) has an orthologous gene, while other genera, such as Anopheles or Culex, lack one.

The M-factor of Anopheles gambiae, identified in 2016, is encoded by the Yob gene and consists of a short, 56 amino acid protein. It is not homologous to Nix. Yob is activated at the beginning of zygotic transcription and expressed throughout a male's life. It controls male-specific splicing of dsx and several lines of evidence suggest that it is also involved in dosage compensation in this species in which females are XX and males XY. Indeed, the ectopic delivery of Yob mRNA is lethal to genetically female embryos, but has no discernible effect on the sexual development of genetic males. Its silencing in nonsexed embryos yields highly significant male deficiency in surviving mosquitoes.

Last, but not least, the third M-factor to be reported was that of the housefly. It was called Mdmd standing for Musca domestica male determiner. It encodes a 1,174 amino acid-long protein that is expressed very early in the zygote and maintained throughout male development until adulthood. In the absence of Mdmd, males turn into females capable of sexual reproduction. Here again, diversity is not an empty word: Mdmd is not conserved in all houseflies. It is absent in at least one strain for which the M-factor has been mapped onto a different chromosome. Mdmd does not share any similarity with Nix or Yob, but it has a paralog, namely the pre-mRNA-splicing factor Cwc22. Cwc22 is a spliceosome-associated protein that is indispensable for the assembly of the exon junction complex (EJC). Interestingly, it has been shown that changes in expression levels of EJC components also affect the splice site selection of alternatively spliced genes. The homology between Mdmd and Cwc22 brings us one step closer to alternative splicing and the mechanism of sex-specific tra production.

Multiple copies of the Mdmd gene have been found on chromosomes Y, II, III, or V. All 4 encoded proteins have been annotated and, along with the A. aegypti Nix and A. gambiae Yob products, they are now publicly available in UniProtKB/Swiss-Prot.

Changes to the controlled vocabulary of human diseases

New diseases:

Modified diseases:

Changes to the controlled vocabulary for PTMs

New term for the feature key 'Cross-link' ('CROSSLNK' in the flat file):

  • Glycyl cysteine thioester (Gly-Cys) (interchain with C-...)

New term for the feature key 'Modified residue' ('MOD_RES' in the flat file):

  • 2,3-didehydroalanine (Tyr)

UniProt release 2017_10

Published October 25, 2017

Headline

Of smell and social life

Ants are arguably the greatest success story in the history of terrestrial metazoa. On average, ants constitute 15-20% of the terrestrial animal biomass. All the ~12,000 known ant species are eusocial, i.e. they have a sophisticated collective behavior characterized by a division of labor that creates groups, sometimes called castes, specialized in tasks such as reproduction, brood care and foraging. Individuals of one caste usually lose the ability to perform at least one behavior characteristic of individuals in another caste. It is thought that ant sociality depends on their sense of smell. Indeed, while the Drosophila melanogaster genome contains about 69 odorant receptor (Or) genes, ant genomes have undergone a dramatic expansion with close to 350 Ors, representing one of the largest yet known repertoires of Ors among insects. Other chemosensory receptor genes, such as gustatory receptors and ionotropic glutamate receptors, have not undergone a similar expansion. Ors are expressed by Or neurons, which project axons to the antennal lobe, a region analogous to the olfactory bulb in vertebrates. The antennal lobe consists of numerous globule-shaped neuropils known as glomeruli, where initial synaptic integration occurs before olfactory information is sent to the central brain. Here again a drastic amplification occurred. Close to 450 glomeruli have been identified in the Camponotus floridanus ant versus only 42 in Drosophila. These observations are consistent with a crucial role for odorant perception in the complex chemical communication in ants, but so far there has been no genetic confirmation of this hypothesis.

In insects, Ors dimerize with a highly conserved 7-transmembrane protein called Orco (Odorant Receptor COreceptor) and form ligand-gated ion channels that activate Or neurons upon odorant binding. Orco knockout in fruit flies, locusts, mosquitoes, and moths impairs responses to odorants. An Orco knockout in ants would allow testing of the hypothesis that the expanded Or repertoire is required for chemical communication. However, social insects are especially hard to modify genetically: the eggs of ants are very sensitive and difficult to raise without workers, and the life cycle is complicated and drawn out, making it difficult to obtain large quantities of genetically modified offspring in a reasonable time frame.

In spite of these difficulties, 2 teams managed to knock out Orco using CRISPR/Cas9 technology, providing the scientific community with the first genetically modified ants. This achievement was made possible through tenacity and a smart choice of the ant species. Yan et al. worked on Harpegnathos saltator ants. This species shows a remarkable reproductive plasticity: in the absence of a queen or when a worker is completely isolated, non-reproductive workers can become reproductive pseudoqueens (or gamergates). It is thought that this transition is induced by the lack of queen pheromones which normally would repress it. When isolated, unmated gamergates lay unfertilized eggs that develop into haploid males. Taking advantage of the gamergate transition, Yan et al. generated hemizygous mutant males. The transgenic males were identified by forewing genotyping. They did not exhibit any overt phenotype and were fully fertile. They could be crossed to receptive females to produce heterozygous and homozygous mutant females. Identification of transgenic females was more complicated. Females have no wings and could be genotyped only after being sacrificed. All experiments were therefore done blindly. This denotes a rare enthusiasm for science that merits being emphasized!

Trible et al. chose Ooceraea biroi, a very distantly related species, as it diverged some 100 million years ago from H. saltator. Unlike most other ant species, O. biroi reproduces via parthenogenesis, so stable germ-line modifications can be obtained from the clonal progeny of injected individuals without laboratory crosses.

Both groups observed consistent phenotypes. The response to general odorants was reduced. Mutant insects wandered out of the social group and were unable to forage successfully. They did not produce progeny because they laid very few eggs and did not care for their brood. They appeared to be largely unable to communicate with conspecifics. Unexpectedly, they exhibited a dramatic decrease in the size of the antennal lobes, as well as in the number of glomeruli. The remaining glomeruli tended to be bigger than in wild-type ants. The reason for this neuro-anatomical phenotype is unclear at this stage. However, these results confirm the central role of olfaction in eusocial behavior.

As of this release, freshly annotated Harpegnathos saltator and Ooceraea biroi Orco entries are available in UniProtKB/Swiss-Prot.

Changes to the controlled vocabulary of human diseases

New diseases:

Modified disease:

Changes to the controlled vocabulary for PTMs

New terms for the feature key 'Cross-link' ('CROSSLNK' in the flat file):

  • Cyclopeptide (Gly-Arg)
  • Cyclopeptide (Ser-Lys)

New term for the feature key 'Lipidation' ('LIPID' in the flat file):

  • S-palmitoleoyl cysteine

New terms for the feature key 'Modified residue' ('MOD_RES' in the flat file):

  • ADP-ribosyl glutamic acid
  • N6-(2-hydroxyisobutyryl)lysine
  • N6-butyryllysine
  • N6-poly(beta-hydroxybutyryl)lysine
  • N6-propionyllysine
  • O3-poly(beta-hydroxybutyryl)serine
  • S-poly(beta-hydroxybutyryl)cysteine

Modified term for the feature key 'Modified residue' ('MOD_RES' in the flat file):

  • N6-(beta-hydroxybutyrate)lysine -> N6-(beta-hydroxybutyryl)lysine

Modified term for the feature key 'Lipidation' ('LIPID' in the flat file):

  • O-palmitoleyl serine -> O-palmitoleoyl serine

Changes to keywords

New keyword:

Modified keyword:

UniProt release 2017_09

Published September 27, 2017

Headline

Protein translation goes round in circles

Covalently closed circular RNA molecules (circRNAs) were observed over 40 years ago in viruses. Later on, they were discovered in non-infected eukaryotes. In 1993, Capel et al. reported the existence of unusual circular Sry transcripts in mouse testis where they represented the most abundant transcript. These peculiar RNA species have generally been considered to be of low abundance, likely representing errors in splicing. Recent studies have shown however that they may actually be quite numerous and produced by thousands of genes. In addition, they are evolutionarily conserved. CircRNAs are generated by the spliceosome via backsplicing, a process in which the 3'-end of an exon is covalently linked to the 5'-end of an upstream exon. As a result, they lack typical mRNA terminal structures, such as 5' cap and polyA tail. This feature leads to exonuclease resistance, allowing circRNAs to escape from normal RNA turnover processes.

The physiological functions of circRNAs have not yet been extensively explored. Some have been shown to act as microRNA sponges. They can also function as platforms for protein interaction. For instance, circ-FOXO3 represses cell cycle progression by binding to the cell cycle proteins CDK2 and CDKN1A (p21), resulting in the formation of a ternary complex. Circ-MBL/MBNL1 binds to the RNA-binding MBNL1 protein and regulates gene expression by competing with the linear splicing of its parent pre-mRNA.

At this point, you may wonder why UniProtKB, a protein resource, is interested in circRNAs. Most circRNAs originate from protein-coding genes and contain complete exons. In theory they could be translated, but there has been no direct evidence for in vivo translation of endogenous transcripts, and they were classified as non-coding RNAs.

A major breakthrough came from a study done in human and mouse muscles published last April. Muscles not only produce thousands of circular splicing events, but the expression of circRNAs is also differentially regulated during myoblast differentiation. Among them, circ-ZNF609, a transcript that originates from the circularization of the first coding exon of the ZNF609 gene, is down-regulated during myogenesis. Circ-ZNF609 contains the initiation codon of the linear ZNF609 transcript, a putative 753-nucleotide open reading frame and a STOP codon created 3 nucleotides after the splice junction by the circularization event with the upstream ZNF609 5'-UTR. In human myoblasts, the knockdown of circ-ZNF609, but not that of its linear transcript, reduces cell proliferation by about 80%, suggesting a specific role in the regulation of myoblast proliferation. Circ-ZNF609 transcripts are located in the cytoplasm where they are associated with heavy polysomes. They are translated in a cap-independent manner, though less efficiently than their linear counterparts, and produce a new 250-amino acid long ZNF609 isoform, both in human and mouse cells. The translation is driven by an internal ribosomal entry site (IRES) located within the 5'-UTR. In vivo translation of at least some circRNAs was confirmed in Drosophila in the same issue of Molecular Cell.

In June 1963, Sidney Brenner wrote to Max Perutz: 'It is now widely realized that nearly all the 'classical' problems of molecular biology have either been solved or will be solved in the next decade.' One could think that the process in which genetic information is transcribed and processed into functional RNAs would be such a 'classical' problem, but it seems that there are still plenty of discoveries to be made in this field, for our greatest pleasure.

Human and mouse ZNF609 UniProtKB/Swiss-Prot entries have been updated and the new isoforms encoded by circ-ZNF609 integrated, with the help of Dr. Legnini whom we want to sincerely thank. The revised entries are publicly available as of this release.

Cross-references to CORUM

Cross-references have been added to the CORUM database, a resource of manually annotated protein complexes from mammalian organisms.

CORUM is available at http://mips.helmholtz-muenchen.de/corum/

The format of the explicit links is:

Resource abbreviation CORUM
Resource identifier UniProtKB accession number

Example: P41182

Show all entries having a cross-reference to CORUM.

Text format

Example: P41182

DR   CORUM; P41182; -.
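
The DR line above can be read programmatically. The following is a minimal sketch in Python, assuming only the layout visible in the example (fields separated by semicolons, a terminating period, and '-' for an empty optional field); the names `CrossReference` and `parse_dr_line` are hypothetical helpers introduced for illustration, not part of any UniProt tool.

```python
from typing import NamedTuple, Optional


class CrossReference(NamedTuple):
    database: str      # resource abbreviation, e.g. "CORUM"
    identifier: str    # resource identifier
    properties: tuple  # remaining fields, with "-" placeholders dropped


def parse_dr_line(line: str) -> Optional[CrossReference]:
    """Parse one DR line; return None for any other line type."""
    if not line.startswith("DR "):
        return None
    body = line[2:].strip()
    # Remove only the final period that terminates the line.
    if body.endswith("."):
        body = body[:-1]
    fields = [field.strip() for field in body.split(";")]
    if len(fields) < 2:
        return None
    database, identifier, *rest = fields
    # Assumption: "-" marks an empty optional field.
    properties = tuple(field for field in rest if field != "-")
    return CrossReference(database, identifier, properties)


xref = parse_dr_line("DR   CORUM; P41182; -.")
print(xref.database, xref.identifier)
```

Only the trailing period is stripped, so identifiers that themselves contain periods would be left intact.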

XML format

Example: P41182

<dbReference type="CORUM" id="P41182"/>

RDF format

Example: P41182

uniprot:P41182
  rdfs:seeAlso <http://purl.uniprot.org/corum/P41182> .
<http://purl.uniprot.org/corum/P41182>
  rdf:type up:Resource ;
  up:database <http://purl.uniprot.org/database/CORUM> .

Changes to the controlled vocabulary of human diseases

New diseases:

Deleted diseases

  • Adrenocortical insufficiency, without ovarian defect

Changes in subcellular location controlled vocabulary

New subcellular location:

UniProt release 2017_08

Published August 30, 2017

Headline

Curation of human immunoglobulin genes: a fruitful collaboration between UniProtKB/Swiss-Prot and IMGT®

The existence of an agent in the blood that could neutralize diphtheria toxin was reported as early as 1890. Over a century after this major discovery, much is known about immunoglobulins (IG) or antibodies. They are large heterodimeric proteins made up of 2 heavy (H) chains and 2 light (L) kappa or lambda chains, held together by disulfide bonds to form a 'Y' shaped molecule. Each chain comprises one variable (V) domain at the N-terminal end and one or several (for L and H, respectively) constant (C) domains. The antigen binding site is formed by the V domain of one H chain, together with that of its associated L chain. Thus, each immunoglobulin has 2 antigen binding sites with remarkable affinity for a particular antigen. Each variable domain is encoded by a variable (V) gene, a diversity (D) gene (only for H) and a joining (J) gene which are assembled by a process called V-(D)-J rearrangement and can then be subjected to somatic hypermutations which, after exposure to antigen and selection, allow affinity maturation for a particular antigen. The resulting rearranged V-(D)-J genes are further spliced to C genes. The C region determines the effector properties and the mechanism used to destroy the antigen, such as activation of complement or binding to Fc receptors. An immunoglobulin is encoded by 7 genes (IGHV, IGHD, IGHJ, IGHC for the H chain and IGKV, IGKJ, IGKC for a kappa or IGLV, IGLJ or IGLC for a lambda L chain). The human genome contains 176 functional immunoglobulin genes clustered in 3 loci, IGH on chromosome 14 (50 V, 23 D, 6 J and 9 C), IGK on chromosome 2 (40 V, 5 J and 1 C) and IGL on chromosome 22 (32 V, 5 J and 5 C).
During the development of B cells, the mechanisms of diversity involved in the immunoglobulin synthesis (combinatorial V-(D)-J diversity, junctional diversity and somatic hypermutations) lead to the huge potential antibody repertoire of each individual, estimated to comprise 10^12 different immunoglobulins, the limiting factor being only the number of B cells that an organism is genetically programmed to produce.

In 2008, we announced the first draft of the complete human proteome in UniProtKB/Swiss-Prot, and have been continuing to update this resource ever since. Recent work performed in collaboration with the IMGT® team has included a thorough review and update of the immunoglobulin genes, for which we now present a representative set of full-length germline immunoglobulin protein sequences. 15 entries showing the sequence of all C gene products and 122 representing all V gene products are now publicly available. These entries can be retrieved with the keyword 'Immunoglobulin C region' and 'Immunoglobulin V region', respectively. D and J gene products are extremely small, with an average of 5 amino acids for D genes and 15-30 for J. In other words, they are too short to be informative on their own. Therefore we have decided to curate a single peptide representative of D gene products and 3 of J gene products, one for H chains and 2 for L chains kappa and lambda. As for other human proteins, the sequences shown match the translation of the reference genome (Genome Reference Consortium GRCh38/hg38). The nomenclature used is the official one from IMGT/GENE-DB, approved by HGNC and endorsed by NCBI Gene and the IUIS-Nomenclature SubCommittee. Cross-references were implemented in the 141 UniProtKB/Swiss-Prot immunoglobulin entries, providing direct access to the dedicated IMGT® resource and its comprehensive sequence repertoire, which currently describes 927 alleles from 462 functional and non-functional genes together with a wealth of additional information concerning immunoglobulins. Reciprocal links to UniProtKB from IMGT® ensure easy navigation between both resources.

We also provide several examples of full-length rearranged immunoglobulins. Among the 10^12 predicted sequences, we have selected some of those that have been entirely sequenced at the amino acid level. However, the representation of the full repertoire is beyond the scope of our knowledgebase and UniProtKB users interested in these complex molecules are advised to visit IMGT®.

We would like to take this opportunity to thank Marie-Paule Lefranc, Sofia Kossida and the IMGT® team for this fruitful collaboration, which is beneficial not only for both resources, but hopefully also for the scientific community as a whole.

Cross-references to ELM

Cross-references have been added to the Eukaryotic Linear Motif (ELM) resource for functional sites in proteins.

ELM is available at http://elm.eu.org.

The format of the explicit links is:

Resource abbreviation ELM
Resource identifier UniProtKB accession number

Example: P12931

Show all entries having a cross-reference to ELM.

Text format

Example: P12931

DR   ELM; P12931; -.

XML format

Example: P12931

<dbReference type="ELM" id="P12931"/>
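
As an illustration of how such an element might be consumed, here is a minimal Python sketch that collects the identifiers of all `dbReference` elements of a given type. The wrapper `<entry>` element, the absence of an XML namespace in the sample, and the function name `cross_reference_ids` are assumptions made to keep the example self-contained; real UniProtKB XML files declare a namespace that would have to be taken into account.

```python
import xml.etree.ElementTree as ET

# Assumption: a stripped-down fragment without namespace declarations.
SAMPLE = """<entry>
  <dbReference type="ELM" id="P12931"/>
</entry>"""


def cross_reference_ids(xml_text: str, database: str) -> list:
    """Return the id attribute of every dbReference of the given type."""
    root = ET.fromstring(xml_text)
    return [
        element.get("id")
        for element in root.iter("dbReference")
        if element.get("type") == database
    ]


print(cross_reference_ids(SAMPLE, "ELM"))
```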

RDF format

Example: P12931

uniprot:P12931
  rdfs:seeAlso <http://purl.uniprot.org/elm/P12931> .
<http://purl.uniprot.org/elm/P12931>
  rdf:type up:Resource ;
  up:database <http://purl.uniprot.org/database/ELM> .

Changes to the controlled vocabulary of human diseases

New diseases:

Modified diseases:

Deleted diseases

  • Mental retardation, X-linked, syndromic, 10

Changes in subcellular location controlled vocabulary

New subcellular locations:

UniParc news

UniParc XSD change for InterPro annotations

To reduce the sequence redundancy in UniProtKB, we apply a procedure to identify highly redundant proteomes within selected species groups and exclude them from UniProtKB. Their sequences are still available for download from the UniParc sequence archive, which stores protein sequences that are 100% identical and the same length in a single record, with cross-references to the source database where the protein exists. UniParc also includes basic annotation data (taxonomy, gene and protein names, proteome identifier and component) to allow users interested in redundant proteomes to retrieve meaningful data sets. We have now further enhanced UniParc with InterPro annotations and, for this purpose, extended the UniParc XSD with the new elements and types shown below:

<xs:element name="entry">
  <xs:complexType>
    <xs:sequence>
      ...
      <xs:element name="signatureSequenceMatch" type="seqFeatureType" minOccurs="0" maxOccurs="unbounded"/>
      ...
    </xs:sequence>
    ...
  </xs:complexType>
</xs:element>
...
<xs:complexType name="seqFeatureType">
  <xs:sequence>
    <xs:element name="ipr" type="seqFeatureGroupType" minOccurs="0" maxOccurs="1"/>
    <xs:element name="lcn" type="locationType" minOccurs="1" maxOccurs="unbounded"/>
  </xs:sequence>
  <xs:attribute name="database" type="xs:string" use="required"/>
  <xs:attribute name="id" type="xs:string" use="required"/>
</xs:complexType>

<xs:complexType name="seqFeatureGroupType">
  <xs:attribute name="name" type="xs:string"/>
  <xs:attribute name="id" type="xs:string" use="required"/>
</xs:complexType>

<xs:complexType name="locationType">
  <xs:attribute name="start" type="xs:int" use="required"/>
  <xs:attribute name="end" type="xs:int" use="required"/>
</xs:complexType>
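
To show what data shaped by these new types might look like to a client, the sketch below reads `signatureSequenceMatch` elements with Python's standard library. The sample entry is invented for illustration: the database names, signature and InterPro identifiers, and coordinates are placeholders, and the sample omits any XML namespace, which a real UniParc file would declare. The function name `signature_matches` is likewise hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical entry: all identifiers and coordinates are placeholders.
SAMPLE_ENTRY = """<entry>
  <signatureSequenceMatch database="Pfam" id="PF00000">
    <ipr name="Hypothetical domain" id="IPR000000"/>
    <lcn start="12" end="87"/>
    <lcn start="120" end="195"/>
  </signatureSequenceMatch>
  <signatureSequenceMatch database="PROSITE" id="PS00000">
    <lcn start="40" end="52"/>
  </signatureSequenceMatch>
</entry>"""


def signature_matches(xml_text: str) -> list:
    """Collect signature matches with their optional InterPro group and locations."""
    root = ET.fromstring(xml_text)
    matches = []
    for match in root.iter("signatureSequenceMatch"):
        ipr = match.find("ipr")  # optional in the XSD (minOccurs="0")
        matches.append({
            "database": match.get("database"),
            "id": match.get("id"),
            "interpro_id": ipr.get("id") if ipr is not None else None,
            # One or more <lcn> elements, each with required start and end.
            "locations": [
                (int(lcn.get("start")), int(lcn.get("end")))
                for lcn in match.findall("lcn")
            ],
        })
    return matches


for match in signature_matches(SAMPLE_ENTRY):
    print(match["database"], match["id"], match["interpro_id"], match["locations"])
```

The explicit `is not None` check matters because the `ipr` element is optional and has no children.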

UniProt release 2017_07

Published July 5, 2017

Headline

A pseudogene turns into an active DNA methyltransferase dedicated to male fertility

It is well established that in mammals, the DNA methylation machinery is composed of 3 DNA methyltransferase (DNMT) enzymes, DNMT1, DNMT3A, and DNMT3B, and one catalytically inactive cofactor, DNMT3L. Some 46 million years ago, in the last common ancestor of the muroid rodents, the DNMT3B gene was duplicated, giving rise to Gm14490. The genes share about 70% identity, but Gm14490 underwent pseudogenization, and there is no evidence for its transcription. Germline-specific knockouts of DNMT3A or DNMT3B demonstrate the crucial role of these genes in methylation of most imprinted loci in germ cells (and somatic tissues), but some transposon loci, such as minor satellite DNA and intracisternal A particle (IAP) repeats, are only minimally affected, an observation which can be attributed to the functional redundancy of the 2 genes. This is what was thought and published, until recently.

Retrotransposon silencing is of paramount importance, especially in the male germline. Indeed, in the absence of silencing, retrotransposon reactivation leads inexorably to meiotic failure, azoospermia, and sterility marked by small testis size, a phenotype called hypogonadism. It is therefore essential to understand which actors are involved in this process. Barau et al. tackled the issue by generating mutant mice through N-ethyl-N-nitrosourea (ENU) mutagenesis and screening hypogonadal male mice for ectopic retrotransposon activity, followed by whole genome sequencing to identify the culprits. This approach led to the discovery of an ENU-independent mutation, which was identified as a de novo IAP insertion located in an unexpected locus, the last intron of the Gm14490 pseudogene. Serendipity definitely is a scientist's best friend!

This was only the beginning of the surprises. Contrary to what had been previously reported, the Gm14490 gene proved to be expressed, but exclusively in male germ cells. This restriction could explain the absence of corresponding ESTs in databases and the erroneous former assumption that it was untranscribed. During embryonic development, its expression peaks at the time of de novo DNA methylation (between 16.5 and 18.5 dpc) in prospermatogonia. Moreover, Gm14490 appeared to be catalytically active when transfected in ES cells. A new genuine DNA methylase was born and renamed DNMT3C!

In the absence of DNMT3C, either by knockout or by IAP insertion, retrotransposons, and more specifically some types of long interspersed nuclear elements (LINEs) and some endogenous retroviruses (ERV), are reactivated. Interestingly, this reactivation is particularly strong for evolutionarily 'young' subfamilies, indicating DNMT3C's unique selectivity. The existence of a 5th DNA methylase selectively targeted at young retrotransposons, acting only in the context of fetal spermatogenesis, may be of particular relevance in Muroidea, including mice and rats. This lineage is particularly enriched in young transposons with about 25% that have integrated into the genome in the last 25 million years with currently thousands of active copies. In comparison, in the primate ancestor, massive integration occurred long before (80 million years ago for elements such as LINEs) and these transposons have since become extinct.

In view of these results, DNMT3C has been deleted from our pseudogene list, annotated and integrated into UniProtKB/Swiss-Prot, where it is available to you. The knowledgebase contains some other sequences derived from putative pseudogenes (see headline of November 2009). Like all other UniProtKB/Swiss-Prot entries, they are continuously reviewed. Some of them are deleted from UniProtKB, when data pointing at an inactive gene are overwhelming, but they can always be retrieved from UniParc. Other entries are progressively 'upgraded', when new data become available, to bona fide proteins as was the case for DNMT3C.

Changes to the controlled vocabulary of human diseases

New diseases:

Changes to keywords

New keyword:

UniProt release 2017_06

Published June 7, 2017

Headline

Sexual reproduction: good ideas shared with viruses

Sexual reproduction is a brilliant eukaryotic invention that allows the reassortment of alleles through recombination. The first step is the formation of haploid male and female gametes that unite to form a new individual. Most gametes unite by membrane fusion, a process mediated by specialized proteins, called fusogens. The study of these proteins is difficult, since they are often scarce. The few identified so far are clade-specific, such as bindin in echinoderms or izumo in mammals, suggesting that each clade has evolved its own fusion strategy. This is at least what was thought until the discovery of hapless-2 (HAP2[1]), also called generative cell specific-1 (GCS1).

Hapless-2 is a single-span transmembrane protein located at the gamete cell surface, typically at mating structures. It is essential for gamete fusion in the green alga Chlamydomonas reinhardtii, but also in other plants, including Arabidopsis thaliana and Lilium longiflorum, and in protozoans, such as Plasmodium berghei or Tetrahymena thermophila. A thorough eukaryotic genome examination reveals the existence of this gene in many major eukaryotic taxa, from slime molds to the honey bee. It is however not present in fungi, nor in most animals, including humans. The wide evolutionary distribution of hapless-2 suggests it was present in the last eukaryotic common ancestor and lost in some clades later on. Disruption of hapless-2 blocks gamete fusion, but not adhesion to gametes of the opposite mating type (or sex), suggesting that gamete adhesion relies on proteins that are species-specific, but that fusion itself is mediated by an ancestral common gene product.

Earlier this year, the 3D-structure of Chlamydomonas reinhardtii hapless-2 was unraveled. The secondary and tertiary structures of the ectodomain are almost identical to viral class II proteins, such as the envelope protein E of flaviviruses, with which hapless-2 shares very low identity at the amino acid level, and which are also involved in membrane fusion. Fédry et al. hypothesize that these fusion proteins most certainly derived from a common ancestor, whose gene has likely been transferred via horizontal exchange.

Like the flavivirus class II proteins, the hapless-2 ectodomain trimerizes concomitantly with insertion into the membrane of the partner gamete. The trigger for trimerization of hapless-2 is not yet known, although acidification, which drives trimerization of flavivirus class II proteins in late endosomes, is not required.

Information gained from the 3D structure of hapless-2 may help in the development of transmission-blocking vaccines (TBVs), a new strategy to fight malaria (and other protozoan diseases). Successful transmission of Plasmodium from humans to mosquitoes relies on hapless-2-dependent fusion of the parasite gametes and fertilization, which occurs rapidly after ingestion by the mosquito. If TBVs could be designed to induce anti-hapless-2 antibodies in human hosts, these would be ingested by Anopheles mosquitoes along with blood Plasmodium gametocytes. The initial gamete fusion step could be prevented and the deadly cycle of transmission blocked. This approach has already been tested in model animals and, although the preliminary results look promising, they are not yet sufficient for clinical development. The identification of new peptides, that are both functionally crucial and immunogenic, may prove very helpful in the design of efficient anti-malaria TBVs.

As of this release, hapless-2 UniProtKB/Swiss-Prot entries have been created and are publicly available.

[1] The acronym HAP2 is somewhat unfortunate, since this protein has nothing to do with the yeast HAP2 transcription factor. These are the mysterious ways of nomenclature, which sometimes may be quite confusing...

UniProtKB news

Modification of cross-references to PATRIC

We have modified our cross-references to the PATRIC database in order to reflect the new PATRIC primary identifier scheme. The earlier identifier scheme used simple numeric ids, e.g.

32117610

which were replaced by more informative primary identifiers such as

fig|1427269.3.peg.1028

Text format

Example: Q9ZNI1

Previous format:

DR   PATRIC; 19579917; VBIStaAur99865_1117.

New format:

DR   PATRIC; fig|93061.5.peg.1117; -.

XML format

Example: Q9ZNI1

Previous format:

<dbReference type="PATRIC" id="19579917">
  <property type="gene designation" value="VBIStaAur99865_1117"/>
</dbReference>

New format:

<dbReference type="PATRIC" id="fig|93061.5.peg.1117"/>

RDF format

Example: Q9ZNI1

Previous format:

uniprot:Q9ZNI1
  rdfs:seeAlso <http://purl.uniprot.org/patric/19579917> .
<http://purl.uniprot.org/patric/19579917>
  rdf:type up:Resource ;
  up:database <http://purl.uniprot.org/database/PATRIC> ;
  rdfs:comment "VBIStaAur99865_1117" .

New format:

uniprot:Q9ZNI1
  rdfs:seeAlso <http://purl.uniprot.org/patric/fig%7C93061.5.peg.1117> .
<http://purl.uniprot.org/patric/fig%7C93061.5.peg.1117>
  rdf:type up:Resource ;
  up:database <http://purl.uniprot.org/database/PATRIC> .
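
Note that the '|' character of the new identifiers appears percent-encoded ('%7C') in the RDF resource URLs. A minimal Python sketch of the conversion in both directions follows; the helper names are hypothetical, and the test for a legacy identifier relies only on the statement above that the earlier scheme used simple numeric ids.

```python
from urllib.parse import quote, unquote

PURL_PREFIX = "http://purl.uniprot.org/patric/"


def is_legacy_numeric_id(identifier: str) -> bool:
    """True for identifiers of the earlier, purely numeric scheme."""
    return identifier.isdigit()


def patric_purl(identifier: str) -> str:
    """Build the resource URL, percent-encoding reserved characters such as '|'."""
    return PURL_PREFIX + quote(identifier, safe="")


def identifier_from_purl(url: str) -> str:
    """Recover the PATRIC identifier from a resource URL."""
    if not url.startswith(PURL_PREFIX):
        raise ValueError(f"not a PATRIC resource URL: {url}")
    return unquote(url[len(PURL_PREFIX):])


print(patric_purl("fig|93061.5.peg.1117"))
print(identifier_from_purl("http://purl.uniprot.org/patric/fig%7C93061.5.peg.1117"))
```

Code that stored the old numeric identifiers in integer-typed fields would need to switch to strings to accommodate the new scheme.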

New file linking deleted entries to their subsequently reinstated versions

Since release 2015_04, we have been applying at each release a procedure to identify highly redundant proteomes within selected species groups using a combination of manual and automatic methods. This procedure prevents the creation of UniProtKB/TrEMBL entries from these redundant proteomes, but also means that a huge number of previously existing entries had to be deleted from UniProtKB when the procedure was put in place.

It may happen that proteomes that were identified as redundant are later reinstated as non-redundant, e.g. a proteome for a strain used as a model by a significant community or with proteins that have been crystallized. In the past, it has also happened on rare occasions that entries were deleted but later reinstated for other reasons. In such cases, the UniProtKB entries are created anew, with new accession numbers.

To help users to link deleted to subsequently reinstated entries, we are introducing a file that maps old to new accession numbers via their protein_ids. This file is available (in compressed format) by FTP at

ftp://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/complete/docs/reinstated_map.txt.gz

This mapping will also be used to make queries for obsolete identifiers on the UniProt website more meaningful.

Changes to the controlled vocabulary of human diseases

New diseases:

Modified diseases:

Changes to the controlled vocabulary for PTMs

New term for the feature key 'Cross-link' ('CROSSLNK' in the flat file):

  • Cyclopeptide (Glu-Asn)

New term for the feature key 'Modified residue' ('MOD_RES' in the flat file):

  • S-methylmethionine

Deleted term

  • N-acetylated lysine

Changes in subcellular location controlled vocabulary

New subcellular location:

Changes to keywords

New keyword:

UniProt release 2017_05

Published May 10, 2017

Headline

A certain taste for light

In most organisms, light perception is essential for survival. It not only mediates image-forming vision, but also performs other functions, such as phototaxis and circadian rhythm. Light-sensing function is carried out by photoreceptors, of which only 2 types are known in metazoans: opsins and cryptochromes. They are typically composed of two moieties: a protein and a prosthetic chromophore, the latter being responsible for light absorption. Consequently, photoreceptor denaturation, which targets the protein moiety, does not abolish light absorption, although it shifts absorbance peaks to different wavelengths. Photoreceptor activation by light induces a signaling pathway, called phototransduction, which involves the activation of a G-protein, the modulation of cGMP levels and ultimately a change in the permeability of cyclic nucleotide-gated channels.

It had long been thought that Caenorhabditis elegans, an eyeless, soil-dwelling nematode, could not sense light. This assumption turned out to be erroneous. Not only does C. elegans sense light, but it vigorously escapes from it. This behaviour is elicited only in response to blue or shorter wavelengths of light, with maximal responsiveness to UV light. This mechanism may have evolved to protect the animal against prolonged direct sunlight exposure that paralyzes and eventually kills it. Indeed, worms appear to spend much of their time above ground, living on small surface-dwelling animals or their carcasses and may therefore be frequently exposed to direct sunlight. From the very beginning of the discovery of phototransduction in C. elegans, it was obvious that the lite-1 gene was involved in this process, as its heterologous expression in muscle cells was sufficient to confer light responsiveness on these cells that were normally unresponsive. Lite-1 was also shown to act upstream of G proteins, but its exact function remained unclear. Is it a bona fide photoreceptor? Or is it just sensing light-produced chemicals? Like opsins, which are the most common photoreceptor proteins in metazoan photoreceptor cells, lite-1 contains a 7-transmembrane domain. However, it does not share any sequence similarity with opsins and its topology is opposite to that of conventional 7-transmembrane receptors, with its N-terminus located intracellularly and its C-terminus extracellularly. In fact, lite-1 belongs to the insect gustatory receptor family of chemoreceptors, rather than the opsin family. To clarify its role, Gong et al. purified lite-1 and showed that it directly absorbs photons with an efficiency 10 to 100 times that of all known photoreceptors, capturing both UVA and UVB light. Interestingly, absorption of UVA and UVB light can be separated. For instance, mutations at residues Ala-332 and Ser-226 disrupt UVA absorption, but do not affect UVB absorption.
In addition, prolonged light illumination, which bleaches conventional photoreceptors, abolishes lite-1 absorption of UVA, but does not affect that of UVB, which appears to be more stable and relatively resistant to photobleaching.

Another remarkable lite-1 feature is that it loses all photoabsorption abilities upon denaturation, suggesting that this activity strictly depends on its conformation and not upon the presence of a chromophore. Mutational analysis pointed at 2 tryptophan residues (Trp-77 and Trp-328) that are required for the absorption of both UVA and UVB light. In order to confirm the importance of these residues, Gong et al. introduced 'Trp-77' by mutagenesis at the equivalent position in a structurally related gustatory receptor, called gur-3, which contains 'Trp-328', but is not photosensitive. Amazingly, mutated gur-3 absorbs UVB light with an efficiency of about 30% of that of lite-1. All these observations indicate that lite-1 is a bona fide photoreceptor of a novel type.

The C. elegans lite-1 entry has been updated and is publicly available as of this release.

UniProtKB news

Extension of controlled vocabulary for PTM to glycosylation sites

Our controlled vocabulary for post-translational modification, so far used to standardize the annotation of modified residues, lipidation sites and protein cross-links, has been extended to include terms for glycosylation sites.

Change of the nomenclature for glycosylation sites

We have introduced a change to the nomenclature for glycosylation sites.

We previously described the occurrence of the attachment of a glycan (mono- or polysaccharide) to an amino-acid residue with the following elements:

  • The type of linkage (C-, N-, O- or S-linked) to the protein
  • The abbreviation of the reducing terminal sugar (shown between parentheses): If three dots '...' follow the abbreviation, this indicates an extension of the carbohydrate chain. Conversely the absence of dots means that a monosaccharide is linked.

To this we have added:

  • The name of the glycosylated amino acid

The new nomenclature is thus composed of three elements:

<linkage type> (<reducing carbohydrate>) <amino acid name>.

The valid values have been added to our controlled vocabulary for post-translational modifications and applied to all Glycosylation annotations.

Example: Q9HCN3

Previous nomenclature:

FT   CARBOHYD    144    144       N-linked (GlcNAc...).

New nomenclature:

FT   CARBOHYD    144    144       N-linked (GlcNAc...) asparagine.

Note that this information about the type of glycosylation can be complemented by

  • the name of the modified protein form,
  • information on whether the modification is carried out by a host protein,
  • the frequency of the modification or the relationship with another feature ('partial', 'alternate', 'transient'),
  • evidence attribution

as documented for modified residues.
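
For users who process these annotations programmatically, the three-element nomenclature can be split into its parts with a regular expression. The Python sketch below is an illustration under stated assumptions: the names `GlycoTerm`, `parse_glyco_term` and `parse_ft_carbohyd` are hypothetical, the pattern is derived only from the format and examples given in this news item, and it also tolerates the optional second parenthesised qualifier (for instance 'high mannose' or 'glycation') that some controlled vocabulary terms carry. Additional information appended to the description, such as evidence attribution, is not handled.

```python
import re
from typing import NamedTuple, Optional

TERM_PATTERN = re.compile(
    r"^(?P<linkage>[CNOS])-linked "
    r"\((?P<sugar>[^().]+)(?P<extended>\.\.\.)?\)"
    r"(?: \((?P<qualifier>[^()]+)\))?"
    r" (?P<residue>[a-z]+)\.?$"
)


class GlycoTerm(NamedTuple):
    linkage: str              # "C", "N", "O" or "S"
    sugar: str                # abbreviation of the reducing terminal sugar
    extended: bool            # True when "..." signals a longer carbohydrate chain
    qualifier: Optional[str]  # optional second parenthesised element
    residue: str              # name of the glycosylated amino acid


def parse_glyco_term(description: str) -> GlycoTerm:
    """Split a description such as 'N-linked (GlcNAc...) asparagine.' into its elements."""
    match = TERM_PATTERN.match(description.strip())
    if match is None:
        raise ValueError(f"unrecognised glycosylation term: {description!r}")
    return GlycoTerm(
        linkage=match.group("linkage"),
        sugar=match.group("sugar"),
        extended=match.group("extended") is not None,
        qualifier=match.group("qualifier"),
        residue=match.group("residue"),
    )


def parse_ft_carbohyd(line: str):
    """Parse an 'FT   CARBOHYD' line into (start, end, GlycoTerm)."""
    parts = line.split(None, 4)
    if len(parts) != 5 or parts[0] != "FT" or parts[1] != "CARBOHYD":
        raise ValueError(f"not a CARBOHYD feature line: {line!r}")
    return int(parts[2]), int(parts[3]), parse_glyco_term(parts[4])


print(parse_ft_carbohyd("FT   CARBOHYD    144    144       N-linked (GlcNAc...) asparagine."))
```

A line in the previous nomenclature, which lacks the amino acid name, is rejected with a `ValueError`, which can be a convenient way to detect files that predate this change.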

Changes to the controlled vocabulary for PTMs

New terms for the feature key 'Glycosylation' ('CARBOHYD' in the flat file):

  • C-linked (Man) hydroxytryptophan
  • C-linked (Man) tryptophan
  • N-linked (DATDGlc) asparagine
  • N-linked (GalNAc) asparagine
  • N-linked (GalNAc...) asparagine
  • N-linked (GalNAc...) (glycosaminoglycan) asparagine
  • N-linked (Glc) arginine
  • N-linked (Glc) asparagine
  • N-linked (Glc) (glycation) histidine
  • N-linked (Glc) (glycation) isoleucine
  • N-linked (Glc) (glycation) lysine
  • N-linked (Glc) (glycation) valine
  • N-linked (Glc...) arginine
  • N-linked (Glc...) asparagine
  • N-linked (GlcNAc) arginine
  • N-linked (GlcNAc) asparagine
  • N-linked (GlcNAc...) arginine
  • N-linked (GlcNAc...) asparagine
  • N-linked (GlcNAc...) (complex) arginine
  • N-linked (GlcNAc...) (complex) asparagine
  • N-linked (GlcNAc...) (high mannose) arginine
  • N-linked (GlcNAc...) (high mannose) asparagine
  • N-linked (GlcNAc...) (hybrid) arginine
  • N-linked (GlcNAc...) (hybrid) asparagine
  • N-linked (GlcNAc...) (keratan sulfate) arginine
  • N-linked (GlcNAc...) (keratan sulfate) asparagine
  • N-linked (GlcNAc...) (paucimannose) arginine
  • N-linked (GlcNAc...) (paucimannose) asparagine
  • N-linked (GlcNAc...) (polylactosaminoglycan) arginine
  • N-linked (GlcNAc...) (polylactosaminoglycan) asparagine
  • N-linked (Hex) arginine
  • N-linked (Hex) asparagine
  • N-linked (Hex) tryptophan
  • N-linked (Hex...) arginine
  • N-linked (Hex...) asparagine
  • N-linked (HexNAc) arginine
  • N-linked (HexNAc) asparagine
  • N-linked (HexNAc...) arginine
  • N-linked (HexNAc...) asparagine
  • N-linked (Lac) (glycation) lysine
  • N-linked (Man) tryptophan
  • O-linked (Ara) hydroxyproline
  • O-linked (Ara...) hydroxyproline
  • O-linked (DADDGlc) serine
  • O-linked (DATDGlc) serine
  • O-linked (GATDGlc) serine
  • O-linked (Fuc) serine
  • O-linked (Fuc) threonine
  • O-linked (Fuc...) serine
  • O-linked (Fuc...) threonine
  • O-linked (FucNAc) serine
  • O-linked (FucNAc...) serine
  • O-linked (Gal) hydroxylysine
  • O-linked (Gal) hydroxyproline
  • O-linked (Gal) serine
  • O-linked (Gal) threonine
  • O-linked (Gal...) hydroxylysine
  • O-linked (Gal...) hydroxyproline
  • O-linked (Gal...) serine
  • O-linked (Gal...) threonine
  • O-linked (GalNAc) serine
  • O-linked (GalNAc...) serine
  • O-linked (GalNAc...) (keratan sulfate) serine
  • O-linked (GalNAc) threonine
  • O-linked (GalNAc...) threonine
  • O-linked (GalNAc...) (keratan sulfate) threonine
  • O-linked (GalNAc) tyrosine
  • O-linked (GalNAc...) tyrosine
  • O-linked (Glc) hydroxylysine
  • O-linked (Glc) serine
  • O-linked (Glc...) serine
  • O-linked (Glc) tyrosine
  • O-linked (Glc...) tyrosine
  • O-linked (GlcA) serine
  • O-linked (GlcNAc) hydroxyproline
  • O-linked (GlcNAc...) hydroxyproline
  • O-linked (GlcNAc) serine
  • O-linked (GlcNAc...) serine
  • O-linked (GlcNAc) threonine
  • O-linked (GlcNAc...) threonine
  • O-linked (GlcNAc) tyrosine
  • O-linked (GlcNAc...) tyrosine
  • O-linked (GlcNAc1P) serine
  • O-linked (GlcNAc6P) serine
  • O-linked (Man) serine
  • O-linked (Man...) serine
  • O-linked (Man...) (keratan sulfate) serine
  • O-linked (Man) threonine
  • O-linked (Man...) threonine
  • O-linked (Man...) (keratan sulfate) threonine
  • O-linked (Man1P) serine
  • O-linked (Man1P...) serine
  • O-linked (Man6P) threonine
  • O-linked (Man6P...) threonine
  • O-linked (Xyl) serine
  • O-linked (Xyl...) serine
  • O-linked (Xyl...) (chondroitin sulfate) serine
  • O-linked (Xyl...) (dermatan sulfate) serine
  • O-linked (Xyl...) (heparan sulfate) serine
  • O-linked (Xyl...) (glycosaminoglycan) serine
  • O-linked (Xyl...) (keratan sulfate) threonine
  • O-linked (Xyl...) (glycosaminoglycan) threonine
  • O-linked (Hex) hydroxylysine
  • O-linked (Hex...) hydroxylysine
  • O-linked (Hex) hydroxyproline
  • O-linked (Hex...) hydroxyproline
  • O-linked (Hex) serine
  • O-linked (Hex...) serine
  • O-linked (Hex) threonine
  • O-linked (Hex...) threonine
  • O-linked (Hex) tyrosine
  • O-linked (Hex...) tyrosine
  • O-linked (HexNAc) hydroxyproline
  • O-linked (HexNAc...) hydroxyproline
  • O-linked (HexNAc) serine
  • O-linked (HexNAc...) serine
  • O-linked (HexNAc) threonine
  • O-linked (HexNAc...) threonine
  • O-linked (HexNAc) tyrosine
  • O-linked (HexNAc...) tyrosine
  • S-linked (Gal) cysteine
  • S-linked (Gal...) cysteine
  • S-linked (Glc) cysteine
  • S-linked (Glc...) cysteine
  • S-linked (GlcNAc) cysteine
  • S-linked (GlcNAc...) cysteine
  • S-linked (Hex) cysteine
  • S-linked (Hex...) cysteine
  • S-linked (HexNAc) cysteine
  • S-linked (HexNAc...) cysteine

New terms for the feature key 'Modified residue' ('MOD_RES' in the flat file):

  • Cysteine sulfonic acid (-SO3H)

Changes to the controlled vocabulary of human diseases

New diseases:

Modified diseases:

Deleted diseases

  • Cardiomyopathy, dilated 1T
  • Sarcoidosis early-onset

UniRef news

Addition of GO annotation to UniRef90 and UniRef50 clusters

We have started to compute Gene Ontology (GO) annotations for UniRef90 and UniRef50 clusters: A GO term is assigned to a cluster when it is found in all UniProtKB members that are annotated with this term, or when it is a common ancestor of at least one GO term of each such member.
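
The rule above can be sketched in a few lines of Python. This is an illustration only, not the production procedure: the ontology is a toy parent map with made-up term names, the function names are hypothetical, and the sketch assumes that 'each such member' means each member that carries at least one GO annotation, so that unannotated members are ignored.

```python
def with_ancestors(terms, parents):
    """Return the given terms together with all of their ancestors."""
    seen = set()
    stack = list(terms)
    while stack:
        term = stack.pop()
        if term not in seen:
            seen.add(term)
            stack.extend(parents.get(term, ()))
    return seen


def cluster_terms(member_annotations, parents):
    """Terms that are present in, or an ancestor of a term of, every annotated member."""
    annotated = [terms for terms in member_annotations.values() if terms]
    if not annotated:
        return set()
    closures = [with_ancestors(terms, parents) for terms in annotated]
    return set.intersection(*closures)


# Toy ontology with made-up term names: child -> list of parents.
TOY_PARENTS = {
    "T:leaf_a": ["T:mid"],
    "T:leaf_b": ["T:mid"],
    "T:mid": ["T:root"],
}

# Made-up cluster: two annotated members and one without GO annotation.
TOY_MEMBERS = {
    "member1": {"T:leaf_a"},
    "member2": {"T:leaf_b"},
    "member3": set(),
}

print(sorted(cluster_terms(TOY_MEMBERS, TOY_PARENTS)))
```

In this toy case neither leaf term is shared, but their common ancestors are, so the cluster receives 'T:mid' and 'T:root'. Whether less specific ancestors are retained when a more specific common term exists is not stated here, so the sketch keeps them all.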

The UniRef XML format now represents the GO annotations with property elements. We have introduced three new types: "GO Molecular Function", "GO Biological Process", "GO Cellular Component". The values of these property elements are GO identifiers.

Example:

<entry id="UniRef50_B0KJL7" updated="2017-03-15">
  <name>Cluster: Animal haem peroxidase</name>
  ...
  <property type="GO Molecular Function" value="GO:0004601"/>
  <property type="GO Biological Process" value="GO:0006979"/>
  ...

This change does not affect the XSD, but may nevertheless require code changes.
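
As an example of the kind of code change that may be involved, the Python sketch below collects the GO identifiers from the three new property types. The sample is the fragment shown above with the elided parts left out and the entry closed; it omits the XML namespace that a real UniRef file would declare, and the function name `go_annotations` is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

GO_PROPERTY_TYPES = (
    "GO Molecular Function",
    "GO Biological Process",
    "GO Cellular Component",
)

# Assumption: reduced, namespace-free version of the example entry.
SAMPLE = """<entry id="UniRef50_B0KJL7" updated="2017-03-15">
  <name>Cluster: Animal haem peroxidase</name>
  <property type="GO Molecular Function" value="GO:0004601"/>
  <property type="GO Biological Process" value="GO:0006979"/>
</entry>"""


def go_annotations(entry):
    """Group the GO identifiers of a cluster entry by property type."""
    result = {property_type: [] for property_type in GO_PROPERTY_TYPES}
    # Only direct children of the entry are inspected; properties of other
    # types are skipped.
    for prop in entry.findall("property"):
        property_type = prop.get("type")
        if property_type in result:
            result[property_type].append(prop.get("value"))
    return result


print(go_annotations(ET.fromstring(SAMPLE)))
```

Parsers that assumed a fixed set of property types, or that failed on unknown ones, are the most likely to need adjusting.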

UniProt release 2017_04

Published April 12, 2017

Headline

Death (by insulin) in paradise

Have you ever been lucky enough to see cone snails in their natural habitat? Their shells are beautiful and you may be tempted to pick them up to admire them. Try to resist: cone snails hate that! These venomous animals can fire their harpoons and inject toxins under your skin. In some cases, these injections can be fatal. Cone snails produce 100-200 distinct venom peptides, and most of the characterized ones target their prey's nervous system, including specific receptors, ion channels and transporters.

Cone snails predominantly live in warm seas and feed on fish, worms or molluscs. Fish-hunting cone snails can be classified into 2 categories depending upon their hunting strategy. There are 'hook-and-line hunters', who use a venomous harpoon, which is shot into the fish. There are 'net hunters', who protrude a sort of stretchy mouth, aim it at a fish, and eventually engulf it. Cone snails move very slowly and this whole process takes some time, so why does the fish not simply swim away? It has been proposed that cone snails release a subset of narcotizing or relaxing toxins, called the 'nirvana cabal', into the water, causing fish to become disoriented and to stop moving.

The analysis of the Conus geographus venom gland transcriptome led to the amazing discovery of 3 transcripts (Con-Ins G1, Con-Ins G2 and Con-Ins G3), expressed at high levels and sharing very high homology with vertebrate insulin. The N-terminal half of Con-Ins G1 is almost identical to that of the fish hormone. It is known that the addition of human insulin to water causes hypoglycemia in fish, which severely affects their swimming behavior, insulin being absorbed via the gills. The effect can be reversed by placing fish in a 2% glucose bath. A similar effect was observed with synthetic Con-Ins G1, suggesting that it is indeed a component of the 'nirvana cabal'.

Venom insulins are widely used by cone snails. All mollusc eaters produce venom insulins, as do many, though not all, worm hunters. Among fish hunters, all net hunters produce venom insulins, while hook-and-line hunters do not. Venom insulins found in fish-hunting cone snails closely resemble fish insulins, whereas those identified in snail hunters share sequence and structural similarities with mollusc insulins. Interestingly, while cone snail insulin, produced in nerve rings to control their own glucose homeostasis, is highly conserved across all tested species, venom insulins diverge rapidly, suggesting adaptation to their specific prey.

Cone snail venom insulins are the smallest known insulins found in nature. They lack A- and B-chain C-terminal residues that, in vertebrates, are crucial for hormone storage and activity. In human pancreatic beta-cells, insulin is stored as a hexamer (a trimer of dimers), but it is the monomer that bears the hormonal activity. Hexamer-to-monomer conversion can cause a delay in insulin action that can lead to a delay in blood glucose control following insulin injection in diabetic patients. Attempts to shorten the C-terminus of human insulin B chain in order to abolish self-association have resulted in near-complete loss of activity. By contrast, Con-Ins G1 is monomeric, bypassing the hexamer conversion step, but it also potently binds to the human insulin receptor. It is not yet entirely clear how Con-Ins G1 achieves that. Like most conotoxins, C. geographus insulins are extensively post-translationally modified. In the absence of modifications, insulin receptor activation is reduced by approximately 8-fold. The study of the Con-Ins G1 crystal structure shows how Con-Ins G1 can compensate for the lack of C-terminal key residues, paving the way for the design of fast-acting therapeutic insulins.

The use of insulins in venoms has not been reported in any animals other than cone snails. However, the Gila monster, a venomous lizard living in the southwestern United States and northwestern Mexico, also targets the glucose homeostasis of its prey. It produces a peptide, called exendin-4, which mimics the incretin hormone glucagon-like peptide 1 (GLP-1), and acts as a potent stimulator of glucose-dependent insulin release. Exendin-4 has been developed as a commercial drug, under the name 'Exenatide', for the treatment of type 2 diabetes.

As of this release, the Con-Ins G1 entry is publicly available in the safe conotoxin-free environment of your computer.

Changes to the controlled vocabulary of human diseases

New diseases:

Modified diseases:

Deleted diseases

  • Ceroid lipofuscinosis, neuronal, 12

UniProt release 2017_03

Published March 15, 2017

Headline

Viral Short Message Service: peptide texting guides the outcome of infection

Communication is not simply a dispensable tool invented by Homo sapiens to do business and to have an enjoyable social life. Long before the advent of cell phones, most living organisms, from animals and plants to bacteria, were communicating with each other in order to ensure species survival. The recent discovery of a peptide-based communication system in some bacterial viruses extends this observation far beyond our wildest imaginings.

Some bacterial viruses, called temperate bacteriophages, have the ability to infect their host through a lytic (productive infection) or a lysogenic (latent) cycle. The lytic cycle leads to the lysis of the host bacterial cell and release of progeny virions. In the lysogenic cycle, on the other hand, the bacteriophage genome becomes integrated into the host genome as a prophage without any virion production. The decision between lysis and lysogeny is probabilistic in nature, but usually depends on the number of co-infecting viruses and the bacterial nutritional state. When uninfected bacteria are abundant and healthy, the lytic pathway is preferred. In later stages of infection, when the number of uninfected bacteria is reduced, progeny phages are at risk of no longer having a new host to infect. At this point, lysogeny is favoured. Although the molecular mechanism underlying the phage lytic or lysogenic decision is still largely unknown, even in well-studied bacteriophages like Lambda or Mu, a substantial leap forward was made earlier this year.

Erez et al. were investigating whether phage-infected bacteria may produce molecules to alert other bacterial cells of their infection, when they made an amazing discovery. A screening of the culture medium of Bacillus subtilis infected by Phi3T bacteriophages led to the identification not of a bacterial, but of a... viral hexapeptide! This peptide was called AimP. The bacteriophage also encodes a cytoplasmic receptor for AimP, called AimR. In the absence of AimP, the AimR receptor behaves as a DNA-binding homodimer which activates the transcription of a third phage component of the system, AimX. AimX is a regulatory non-coding RNA which favors lysis, either by inhibiting lysogeny or by promoting lysis, in an as yet undefined manner. In the presence of AimP, the AimR receptor becomes a peptide-bound, transcriptionally inactive monomer. As a result, the expression of AimX drops and lysogeny is promoted.

The current experimental data suggest the following model. AimP is synthesized in infected bacteria as a pre-pro-peptide. Its N-terminal signal sequence is recognized by the host secretion system and cleaved off upon secretion. Once released into the extracellular milieu, the inactive pro-peptide is further processed by bacterial extracellular proteases to yield the mature, active, 6-amino-acid-long AimP peptide, which is internalized by surrounding bacteria through the oligopeptide permease transporter (OPP). AimP accumulates in the bacterial cytoplasm. When a phage infects an 'AimP-rich' bacterium, the expressed AimR receptor binds AimP and cannot activate the expression of AimX, leading to preferential lysogeny. In other words, a phage can "sense" the level of global infection in the environment and adapt to preserve chances for viable reproduction.

This viral mode of 3-membered communication has been called 'arbitrium' (after the Latin word meaning 'decision'). It may not be restricted to Phi3T bacteriophages. Indeed, Erez et al. found 112 instances of AimR homologues in Bacillus phages and, in all cases, aimR homologues were found upstream of aimP candidate genes.

As of this release, Bacillus phage Phi3T AimP and AimR entries have been updated and are publicly available.

Changes to the controlled vocabulary of human diseases

New diseases:

Changes to keywords

New keywords:

Modified keyword:

UniProt release 2017_02

Published February 15, 2017

Headline

Freshwater fish see red

Vision relies on specialized neurons found in the retina, called photoreceptor cells. Vertebrate photoreceptor cells contain visual pigments consisting of a G-protein-coupled receptor, called opsin, and a covalently bound chromophore derived from vitamin A, most commonly 11-cis retinal (a derivative of vitamin A1). Light-induced isomerization of 11-cis retinal to all-trans triggers a conformational change leading to G-protein activation, release of all-trans retinal and activation of the phototransduction cascade.

Typical rod photopigments have a maximum light absorbance of around 500 nm. However, at the end of the 19th century, Köttgen and Abelsdorff observed that the rod pigments in certain freshwater fish were "red-shifted" by 20-30 nm towards longer wavelengths compared with those of marine fish and terrestrial animals. This difference is due to a change in chromophore. Instead of 11-cis retinal, freshwater vertebrates use 11-cis 3,4-didehydroretinal, a derivative of vitamin A2, whose only difference with vitamin A1 is an additional conjugated double bond within its beta-ionone ring. What is the evolutionary advantage of this modification? Fresh water, in lakes or streams, is often murky. As a result, the light environment is shifted to the red and infrared end of the spectrum. Switching light absorbance seems to be the appropriate response to optimize vision in this specific aquatic milieu.

The chromophore switch is not only specific for certain species, it can also be regulated during life. For example, many amphibians use 11-cis 3,4-didehydroretinal during the tadpole stage, which they spend in ponds. Upon metamorphosis, they switch to 11-cis retinal, which provides clear vision to the terrestrial adult they have become. Conversely, salmon live happily with 11-cis retinal in the open ocean. During spawning migration, however, 11-cis retinal is progressively replaced by 11-cis 3,4-didehydroretinal, possibly through the action of thyroid hormones. In zebrafish also, the switch to vitamin A2-based chromophores can be induced by thyroid hormone treatment. Maybe the most striking example of differential usage of visual chromophores is provided by the American bullfrog. This voracious predator spends a large part of its life floating or swimming at the surface of the water, looking for aquatic, as well as aerial prey, with its eyes just above the waterline. Its dorsal retina, steered towards water, contains 11-cis 3,4-didehydroretinal, while its ventral retina uses 11-cis retinal.

While much of this knowledge on vitamins A1 and A2 was acquired long ago, the identity of the dehydrogenase catalyzing the switch between both forms remained elusive until December 2015, when Enright et al. published the identification of the enzyme. The authors compared the expression profile of zebrafish retinal pigment epithelium (RPE) of thyroid hormone-treated versus control animals. The most highly up-regulated transcript was that encoding cyp27c1, a cytochrome P450 family member. cyp27c1 was also strongly expressed in dorsal, but not ventral bullfrog RPE, correlating with the distribution of vitamin A2. In vitro, purified cyp27c1 was able to very efficiently catalyze the conversion of vitamin A1 to vitamin A2. In vivo, cyp27c1 knockout zebrafish survive to adulthood without overt developmental abnormalities. However, upon treatment with thyroid hormone, the mutant fish eyes fail to produce any vitamin A2 and their photoreceptors do not undergo a red-shift in sensitivity. Thus, the expression of a single enzyme, cyp27c1, mediates the dynamic spectral tuning of the entire visual system by controlling the balance of vitamin A1 and A2 in the eye.

Obviously, humans are not adapted for aquatic vision. However, they do produce vitamin A2, as has been documented in keratinocytes, and they express CYP27C1 in liver, kidney and pancreas. The human enzyme catalyzes the same reaction as fish and amphibian orthologs, but the physiological relevance of this observation is not clear at present.

Zebrafish and bullfrog CYP27C1 entries have been annotated in UniProtKB/Swiss-Prot. The preliminary sequence of American bullfrog CYP27C1 was kindly provided by Professor Corbo and Dr. Enright and we would like to thank them sincerely. The human ortholog has been updated. All 3 entries are publicly available as of this release.

Cross-references to Araport

Cross-references have been added to the Arabidopsis Information Portal Araport, an open-access online community resource for Arabidopsis research.

Araport is available at https://www.araport.org/.

The format of the explicit links is:

Resource abbreviation Araport
Resource identifier AGI locus code

Example: Q43125

Show all entries having a cross-reference to Araport.

Text format

Example: Q43125

DR   Araport; AT4G08920; -.

XML format

Example: Q43125

<dbReference type="Araport" id="AT4G08920"/>

RDF format

Example: Q43125

uniprot:Q43125
  rdfs:seeAlso <http://purl.uniprot.org/araport/AT4G08920> .
<http://purl.uniprot.org/araport/AT4G08920>
  rdf:type up:Resource ;
  up:database <http://purl.uniprot.org/database/Araport> .

Cross-references to IMGT/GENE-DB

Cross-references have been added to IMGT/GENE-DB, the genome database of the international Immunogenetics information system (IMGT) for genes encoding immunoglobulins and T-cell receptors.

IMGT/GENE-DB is available at http://www.imgt.org/genedb/.

The format of the explicit links is:

Resource abbreviation IMGT/GENE-DB in entry view, IMGT_GENE-DB in source formats
Resource identifier Gene name

Example: P01871

Show all entries having a cross-reference to IMGT/GENE-DB.

Text format

Example: P01871

DR   IMGT_GENE-DB; IGHM; -.

XML format

Example: P01871

<dbReference type="IMGT_GENE-DB" id="IGHM"/>

RDF format

Example: P01871

uniprot:P01871
  rdfs:seeAlso <http://purl.uniprot.org/imgt_gene-db/IGHM> .
<http://purl.uniprot.org/imgt_gene-db/IGHM>
  rdf:type up:Resource ;
  up:database <http://purl.uniprot.org/database/IMGT_GENE-DB> .
 

Change of the cross-references to TAIR

We have modified our cross-references to the TAIR database, and now use the TAIR accession number as the primary resource identifier, while continuing to show the TAIR locus name in an additional field.

Text format

Example: Q9ZVI3

Previous format:

DR   TAIR; AT2G38610; -.

New format:

DR   TAIR; locus:2064097; AT2G38610.
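Flat-file parsers that took the second field of a TAIR line as the locus name will need adjusting. The following Python sketch, with hypothetical function names, shows one way to read both the previous and the new line. It splits naively on semicolons, which suits the examples shown here but is an assumption and may not hold for every cross-referenced database.

```python
def parse_dr_line(line):
    """Split a flat-file DR line into (database, primary identifier, extra fields)."""
    if not line.startswith("DR   "):
        raise ValueError("not a DR line: %r" % line)
    body = line[5:].rstrip()
    if body.endswith("."):
        body = body[:-1]  # drop the terminating full stop
    fields = [field.strip() for field in body.split(";")]
    database, identifier, *extra = fields
    # A lone "-" marks an empty optional field.
    extra = [field for field in extra if field != "-"]
    return database, identifier, extra


def tair_locus_name(line):
    """Return the TAIR locus name from a DR line in either format."""
    database, identifier, extra = parse_dr_line(line)
    if database != "TAIR":
        return None
    if extra:
        return extra[0]  # new format: locus name in the additional field
    return identifier    # previous format: locus name was the identifier


print(tair_locus_name("DR   TAIR; AT2G38610; -."))
print(tair_locus_name("DR   TAIR; locus:2064097; AT2G38610."))
```

Both calls print AT2G38610.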

XML format

Example: Q9ZVI3

Previous format:

<dbReference type="TAIR" id="AT2G38610"/>

New format:

<dbReference type="TAIR" id="locus:2064097">
  <property type="gene designation" value="AT2G38610"/>
</dbReference>

This change does not affect the XSD, but may nevertheless require code changes.
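For XML consumers, the kind of code change involved might look like the sketch below. It reads the locus name from the "gene designation" property when present and falls back to the id attribute otherwise. The function name `tair_ids` is hypothetical, and the fragments are parsed without the namespace that full UniProtKB XML files declare, which is an assumption made to keep the example short.

```python
import xml.etree.ElementTree as ET

OLD = '<dbReference type="TAIR" id="AT2G38610"/>'
NEW = """<dbReference type="TAIR" id="locus:2064097">
  <property type="gene designation" value="AT2G38610"/>
</dbReference>"""


def tair_ids(db_reference):
    """Return (TAIR accession, locus name); the accession is None in the previous format."""
    identifier = db_reference.get("id")
    for prop in db_reference.findall("property"):
        if prop.get("type") == "gene designation":
            # New format: id holds the accession, the property holds the locus name.
            return identifier, prop.get("value")
    # Previous format: id holds the locus name and no accession is given.
    return None, identifier


print(tair_ids(ET.fromstring(OLD)))
print(tair_ids(ET.fromstring(NEW)))
```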

RDF format

Example: Q9ZVI3

Previous format:

uniprot:Q9ZVI3
  rdfs:seeAlso <http://purl.uniprot.org/tair/AT2G38610> .
<http://purl.uniprot.org/tair/AT2G38610>
  rdf:type up:Resource ;
  up:database <http://purl.uniprot.org/database/TAIR> .

New format:

uniprot:Q9ZVI3
  rdfs:seeAlso <http://purl.uniprot.org/tair/locus:2064097> .
<http://purl.uniprot.org/tair/locus:2064097>
  rdf:type up:Resource ;
  up:database <http://purl.uniprot.org/database/TAIR> ;
  rdfs:comment "AT2G38610" .

Removal of sequence similarity annotations for domains

Sequence similarity annotations were mainly used to describe two types of information:

  1. A family to which the protein belongs, worded as:
    Belongs to FamilyName.
  2. A structural domain that the protein contains, worded as:
    Contains NumberOfOccurence DomainName.

The domains that a protein contains are also annotated in 'Domain', 'Zinc finger', 'Repeat', 'Calcium binding' or 'DNA binding' annotations, which describe a domain's name and sequence coordinates. The 'Sequence similarity' annotations of type 2, however, described only a domain's name and number of occurrences. We have therefore removed these less detailed annotations.

Changes to the controlled vocabulary of human diseases

New diseases:

Modified diseases:

Deleted diseases

  • Thyroxine-binding globulin deficiency

Changes to the controlled vocabulary for PTMs

New term for the feature key 'Modified residue' ('MOD_RES' in the flat file):

  • N,N,N-trimethylmethionine

UniProt release 2017_01

Published January 18, 2017

Headline

Sheep in wolves' clothing: human variant reannotation in UniProtKB/Swiss-Prot with ExAC

Annotation of sequence variants has always been an important part of the curation of human proteins in UniProtKB/Swiss-Prot. As of this release, about 76,500 variants are annotated in the knowledgebase. 99% of them are single amino acid polymorphisms (SAPs), the rest are small indels. 38% of the SAPs are associated with a genetic disorder. This high percentage of rare SAPs reflects our strategy to prioritize the annotation of disease-causing and/or functionally characterized variants reported in peer-reviewed scientific literature. Most are annotated as involved in diseases (as disease-causing agents, susceptibility factors or disease modifiers), but for some, the role in the phenotype is not clear, although they have been found in patients and not (yet?) in healthy individuals. These variants are called Variants of Unknown Significance (VUS). In the 'good old days', we were quite confident and we associated SAPs with diseases provided some criteria were met, such as cosegregation of the mutation with the phenotype, and absence of the mutation in a reasonably high number of healthy controls. At that time, 100 control individuals, ethnically matched if possible, seemed acceptable. Those days are gone. Nowadays, these simple criteria have been replaced by a real roadmap, based on guidelines developed by Richards et al. The stumbling block remains the frequency of a given variant in the population in view of the occurrence of the disease. In other words, if a variant is not found in healthy individuals, is it because it is pathogenic, or simply because it has not been looked for hard enough? In this context, the high-quality sequence of almost 61,000 exomes provided by the Exome Aggregation Consortium (ExAC) is a major achievement.

ExAC aims to aggregate and harmonize exome sequencing data from a wide variety of large-scale sequencing projects, and to make summary data available for the wider scientific community. The sequence of 60,706 exomes from unrelated individuals is currently available on the ExAC website. Surprisingly, each ExAC exome donor harbored on average 54 mutations reported to be disease-causing in HGMD or ClinVar. The pathogenicity of most of them (41) could be ruled out due to high allele frequency. Take for instance the gene CLN8. Mutations in this gene have been shown to cause neuronal ceroid lipofuscinosis-8 (CLN8), an autosomal recessive neurodegenerative disorder with an onset age of 2 to 7 years. In view of the clinical synopsis, no 'healthy' adult homozygous for any disease-causing mutation is expected. ExAC observed 93 individuals homozygous for the p.Pro229Ala variant, which had formerly been reported to be pathogenic. An analogous result was obtained for the variant p.Met1444Ile in GLI2. This mutation was reported to cause holoprosencephaly-9 (HPE9), an autosomal dominant disorder characterized by a wide phenotypic spectrum of brain developmental defects. Although HPE9 has variable expressivity and incomplete penetrance, the presence of this mutation in 20 homozygous individuals analyzed by ExAC led to its reclassification as a benign polymorphism.

The ExAC publication has had a fruitful impact on our annotation. First, 38 variants (in 36 gene entries) reported in UniProtKB/Swiss-Prot and thought to be pathogenic have been reclassified as either benign polymorphisms or VUS. Second, the ExAC database has become an invaluable tool for curators, helping them to tag human variants with the appropriate status 'Disease' (disease-associated), 'Polymorphism' (innocuous) or 'Unclassified' (i.e. VUS). Third, we are learning to be more and more cautious when annotating new variants. The result is an increased number of VUS in UniProtKB/Swiss-Prot (currently representing about 20% of the total number of variants identified in patients). Old variants will be progressively confirmed or reclassified as new knowledge becomes available.

As of this release, the variants updated thanks to ExAC data are available in UniProtKB/Swiss-Prot.

The UniProt team wishes you a Happy New Year!

Cross-references to SFLD

Cross-references have been added to the Structure Function Linkage Database (SFLD), a resource that links evolutionarily related sequences and structures from mechanistically diverse superfamilies of enzymes to their chemical reactions.

SFLD is available at http://sfld.rbvi.ucsf.edu/django/.

The format of the explicit links is:

Resource abbreviation SFLD
Resource identifier SFLD identifier
Optional information 1 SFLD model name
Optional information 2 Number of hits

Example: P00877

Show all entries having a cross-reference to SFLD.

Text format

Example: P00877

DR   SFLD; SFLDS00014; RuBisCO; 1.

XML format

Example: P00877

<dbReference type="SFLD" id="SFLDS00014">
  <property type="entry name" value="RuBisCO"/>
  <property type="match status" value="1"/>
</dbReference>

RDF format

Example: P00877

uniprot:P00877
  rdfs:seeAlso <http://purl.uniprot.org/sfld/SFLDS00014> .
<http://purl.uniprot.org/sfld/SFLDS00014>
  rdf:type up:Resource ;
  up:database <http://purl.uniprot.org/database/SFLD> ;
  rdfs:comment "RuBisCO" ;
  up:signatureSequenceMatch <http://purl.uniprot.org/isoforms/P00877-1#SFLD_SFLDS00014_match_1> .

Changes to the controlled vocabulary of human diseases

New diseases:

Changes to the controlled vocabulary for PTMs

New terms for the feature key 'Modified residue' ('MOD_RES' in the flat file):

  • O-UMP-histidine
  • O-UMP-serine
  • O-UMP-threonine

Changes to keywords

Deleted keyword:

  • Cyclosporin

UniRef news

Change of the UniRef FASTA header

We have added the NCBI taxonomy identifier of the common taxon of a UniRef cluster to the UniRef FASTA header, which now has the format:

>UniqueIdentifier ClusterName n=Members Tax=TaxonName TaxID=TaxonIdentifier RepID=RepresentativeMember

Where:

  • UniqueIdentifier is the primary accession number of the UniRef cluster.
  • ClusterName is the name of the UniRef cluster.
  • Members is the number of UniRef cluster members.
  • TaxonName is the scientific name of the lowest common taxon shared by all UniRef cluster members.
  • TaxonIdentifier is the NCBI taxonomy identifier of the lowest common taxon shared by all UniRef cluster members.
  • RepresentativeMember is the entry name of the representative member of the UniRef cluster.

Example:

>UniRef50_Q9K794 Putative AgrB-like protein n=2 Tax=Bacillus TaxID=1386 RepID=AGRB_BACHD
MLERLALTLAHQVKALNAEETESVEVLTFGFTIILHYLFTLLLVLAVGLLHGEIWLFLQI
ALSFTFMRVLTGGAHLDHSIGCTLLSVLFITAISWVPFANNYAWILYGISGGLLIWKYAP
YYEAHQVVHTEHWERRKKRIAYILIVLFIILAMLMSTQGLVLGVLLQGVLLTPIGLKVTR
QLNRFILKGGETNEENS

This addresses the issue that scientific taxon names can be ambiguous. Example: "Bacillus" refers to both a genus of bacteria and a genus of insects.
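Scripts that parse the UniRef FASTA header may need to account for the new TaxID field. A possible Python sketch, using a regular expression written for the format described above, is given here; the class and function names are hypothetical, and the pattern assumes that the fields always appear in this order, separated by single spaces.

```python
import re
from typing import NamedTuple


class UniRefHeader(NamedTuple):
    cluster_id: str
    cluster_name: str
    members: int
    taxon_name: str
    taxon_id: int
    representative: str


# Cluster and taxon names may contain spaces, so they are matched lazily
# up to the next "key=" field.
HEADER_RE = re.compile(
    r"^>(?P<id>\S+) (?P<name>.+?) n=(?P<n>\d+) Tax=(?P<tax>.+?) "
    r"TaxID=(?P<taxid>\d+) RepID=(?P<rep>\S+)$"
)


def parse_uniref_header(line):
    """Parse a UniRef FASTA header line that includes the TaxID field."""
    match = HEADER_RE.match(line.strip())
    if match is None:
        raise ValueError("unexpected UniRef FASTA header: %r" % line)
    return UniRefHeader(
        match["id"], match["name"], int(match["n"]),
        match["tax"], int(match["taxid"]), match["rep"],
    )


header = parse_uniref_header(
    ">UniRef50_Q9K794 Putative AgrB-like protein n=2 Tax=Bacillus "
    "TaxID=1386 RepID=AGRB_BACHD"
)
print(header.taxon_name, header.taxon_id)
```

Keying downstream lookups on `taxon_id` rather than `taxon_name` avoids the ambiguity described above.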

UniProt release 2016_11

Published November 30, 2016

Headline

From mouth to gut, a new mechanism for fimbria assembly

Fighting the oral microbiome is a daily task. Ineffective oral hygiene leads not only to dental caries, but also to inflammatory gum diseases, such as gingivitis. In some cases, gingivitis can worsen and turn into periodontitis, which involves the chronic destruction of connective tissues, including that of the alveolar bone around the teeth, and consequently loosening and subsequent loss of teeth. We are not all equally affected by periodontal diseases. There are marked differences in disease progression rate and severity, reflecting personal susceptibility, diversity in virulence among the microorganism species (and subspecies) and environmental conditions. Despite these variables, Porphyromonas gingivalis is now recognized as a major contributor to periodontitis. This Gram-negative black-pigmented anaerobic rod resides in subgingival biofilms and harbors an arsenal of virulence factors, among which are fimbriae (also called pili). Described for the first time in the early 1950s, fimbriae are non-flagellar appendages, formed by the assembly of proteins called pilins at the bacterial surface. They are often involved in the initial adhesion of the bacteria to host tissues during colonization, and also in biofilm formation, cell motility (twitching motility), and transport of proteins and DNA across cell membranes. There are major (long) and minor (short) fimbriae, both containing a structural, stalk-forming subunit (FimA for the major fimbriae, Mfa1 for the minor fimbriae) and 3 accessory subunits (FimC, FimD and FimE for the major fimbriae; Mfa3, Mfa4 and Mfa5 for the minor fimbriae) thought to form the fimbria tip. The last subunit is FimB (major fimbriae) or Mfa2 (minor fimbriae), which anchors the pilus to the outer membrane.

A very thorough study published last April, combining X-ray structure, biochemical and mutational analyses, sheds new light on the fimbria assembly mechanism in several bacteria from the Bacteroidia class, including P. gingivalis. The assembly occurs from tip to base. A tip pilin monomer is incorporated first, followed by stalk-forming structural pilin subunits and finally an anchor pilin at the base. Tip and structural pilins are synthesized in the cytoplasm as lipoprotein precursors, and exported into the periplasm using the Sec pathway. In the periplasm, they are folded and become lipidated at the N-terminus. The modified pilins are then exported across the outer membrane. During this process, they undergo a cleavage that releases the lipid moiety and several amino acids from the N-terminus, creating a groove. At this stage, mature structural pilins adopt an extended "open" conformation, allowing the assembly of the fimbriae where a C-terminal extension binds to the N-terminal groove of the previous subunit, a little like interlocking Lego bricks. The tip pilins exhibit a similar N-terminal groove to accommodate the C-terminal extension from structural pilin, but their C-terminus remains buried. Anchor pilins do not undergo cleavage and remain tethered to the outer membrane. As with structural pilin subunits, their C-terminus is involved in their incorporation into fimbriae.

Although fimbria assembly has been studied in numerous phylogenetically distinct bacteria, until this recent publication, very little was known about pilin structure and assembly in human-associated Bacteroidales members. The reported mechanism was hitherto unseen, but it could be widespread. Indeed, FimA proteins represent a large and diverse superfamily, which is highly represented in the gut microbiome, suggesting that they may confer adaptive advantages in bacterial colonization of this environment.

Close to 30 entries have been updated in UniProtKB/Swiss-Prot to include these new findings. The entries can be consulted just as well before or after brushing your teeth!

UniProtKB news

Changes to the controlled vocabulary of human diseases

New diseases:

RDF news

Change of URIs for Ensembl and Ensembl Genomes

For historical reasons, UniProt had to generate URIs to cross-reference databases that did not have an RDF representation. Our policy is to replace these by the URIs generated by the cross-referenced database once it starts to distribute an RDF representation of its data.

We have therefore updated the URIs for the Ensembl and Ensembl Genomes databases from

http://purl.uniprot.org/ensembl/<identifier>
http://purl.uniprot.org/ensemblbacteria/<identifier>
http://purl.uniprot.org/ensemblfungi/<identifier>
http://purl.uniprot.org/ensemblmetazoa/<identifier>
http://purl.uniprot.org/ensemblplants/<identifier>
http://purl.uniprot.org/ensemblprotists/<identifier>

to

  • http://rdf.ebi.ac.uk/resource/ensembl/<identifier>
    for genes
  • http://rdf.ebi.ac.uk/resource/ensembl.transcript/<identifier>
    for transcripts
  • http://rdf.ebi.ac.uk/resource/ensembl.protein/<identifier>
    for proteins
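Stored URIs in the previous form could be rewritten along the lines of the Python sketch below. The function name `migrate_ensembl_uri` is hypothetical. Because the previous URIs do not say whether an identifier denotes a gene, a transcript or a protein, the caller has to supply that as the `kind` argument; inferring it from the identifier is deliberately not attempted here, since identifier schemes differ between Ensembl and Ensembl Genomes.

```python
import re

# New URI prefixes, as listed above.
NEW_PREFIXES = {
    "gene": "http://rdf.ebi.ac.uk/resource/ensembl/",
    "transcript": "http://rdf.ebi.ac.uk/resource/ensembl.transcript/",
    "protein": "http://rdf.ebi.ac.uk/resource/ensembl.protein/",
}

# Matches any of the six previous prefixes.
OLD_PREFIX_RE = re.compile(
    r"^http://purl\.uniprot\.org/"
    r"ensembl(?:bacteria|fungi|metazoa|plants|protists)?/"
)


def migrate_ensembl_uri(old_uri, kind):
    """Rewrite a previous-style URI; kind is 'gene', 'transcript' or 'protein'."""
    match = OLD_PREFIX_RE.match(old_uri)
    if match is None:
        raise ValueError("not a previous-style Ensembl URI: %r" % old_uri)
    if kind not in NEW_PREFIXES:
        raise ValueError("unknown kind: %r" % kind)
    identifier = old_uri[match.end():]
    return NEW_PREFIXES[kind] + identifier


print(migrate_ensembl_uri("http://purl.uniprot.org/ensembl/ENSG00000157764", "gene"))
```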

UniProt release 2016_10

Published November 2, 2016

Headline

N-acyl amino acids: a new treatment for obesity?

Mitochondria play a fundamental role in energy production. After glycolysis, glucose products are imported into the mitochondrial matrix, where they go through the citric acid cycle. The electrons produced in this process are transported from one protein complex to the next in the mitochondrial inner membrane. The final electron acceptor is molecular oxygen, which is ultimately reduced to water. During electron transport, the participating protein complexes pump protons out of the matrix space into the intermembrane space and thus create a concentration gradient. This gradient is used by ATP synthase to power the phosphorylation of ADP into ATP. However, not all energy liberated from the oxidation of dietary substrates is converted into ATP. Protons can leak back to the matrix through the inner membrane independently of ATP synthase and the energy accumulated is dissipated as heat. Several proteins are known to be involved in this process, called "uncoupled respiration". One of them, UCP1, has been most extensively studied in the context of thermogenesis mediated by brown and beige adipose tissues.

Adaptive thermogenesis does not rely exclusively upon UCP1. Adipose tissues secrete many bioactive proteins, some of which potentially play a role in the regulation of energy expenditure. Recently, Long et al. identified a protein secreted by brown and beige fat cells, PM20D1. This protein is co-expressed with UCP1 in adipocytes. When injected with PM20D1 viral expression vectors and placed on a high-fat diet for a period of 47 to 54 days, mice exhibited a blunted weight gain, due to a massive reduction in fat mass compared with control animals. There was no difference in food intake, nor in movement between treated and untreated animals, suggesting the activation of a thermogenic gene program in the classical brown fat (BAT), subcutaneous inguinal white fat (iWAT), or both. Interestingly, UCP1 levels were unchanged in these experiments.

In vitro, PM20D1 appeared to be a bidirectional N-acyl amino acid synthase and hydrolase, the synthase activity being lower than the hydrolase activity. In vivo, plasma levels of N-oleyl-phenylalanine (C18:1-Phe) were indeed elevated in mice injected with PM20D1 expression vector. But what is the effect of N-lipidated amino acids on cells? When treated with N-acyl amino acids, primary BAT adipocytes and differentiated iWAT cells showed increased oxygen consumption in a UCP1-independent manner, indicating respiratory uncoupling activity of these compounds. The N-acyl amino acids tested (N-arachidonyl-glycine (C20:4-Gly), C20:4-Phe, and C18:1-Phe) acted directly on mitochondria, possibly by interaction with mitochondrial transporter proteins, such as SLC25A4 and SLC25A5. Of note, SLC25A4 and SLC25A5 exhibit ADP/ATP antiport activity, but are also thought to translocate protons across the inner membrane. Finally, treatment of obese mice with C18:1-Leu induced weight loss through the reduction of fat mass and improved glucose tolerance.

In the 1930s, the mitochondrial uncoupler 2,4-dinitrophenol (DNP) was used in diet pills to stimulate metabolism and promote weight loss, and it can still be purchased on the internet for this purpose. Though quite efficient in terms of weight loss, this drug has severe side effects. It can cause an excessive rise in body temperature due to the heat produced during uncoupling. DNP overdose causes fatal hyperthermia, with body temperature rising to as high as 44°C shortly before death. Will N-acyl-amino acids become a new, this time innocuous, treatment of choice for obesity? It's difficult to anticipate. Chronic treatment of mice with C18:1-Phe or C20:4-Gly not only increases energy expenditure, with no effects on movement, but also reduces food intake, which obviously also contributes to weight loss. However, several N-acyl-amino acids have other biological functions, besides respiratory uncoupling, and hence may have other (undesirable?) effects. Nevertheless, the study of Long et al. sheds light on new endogenous mitochondrial uncouplers and new thermogenic mechanisms that are undoubtedly worth further investigation.

As of this release, PM20D1 entries have been updated and are publicly available.

UniProtKB news

Cross-references to DisGeNET

Cross-references have been added to DisGeNET, a discovery platform for the dynamical exploration of human diseases and their genes.

DisGeNET is available at http://www.disgenet.org.

The format of the explicit links is:

Resource abbreviation DisGeNET
Resource identifier Gene identifier (corresponding to GeneID gene identifier)

Example: P02649

Show all entries having a cross-reference to DisGeNET.

Text format

Example: P02649

DR   DisGeNET; 348; -.

XML format

Example: P02649

<dbReference type="DisGeNET" id="348"/>

RDF format

Example: P02649

uniprot:P02649
  rdfs:seeAlso <http://identifiers.org/ncbigene/348> .
<http://identifiers.org/ncbigene/348>
  rdf:type up:Resource ;
  up:database <http://purl.uniprot.org/database/DisGeNET> .
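As an illustration of the text format shown above, the following Python sketch splits a DR line into its fields. It is only a minimal example: the function name parse_dr_line and the returned dictionary keys are assumptions made for this sketch and are not part of any UniProt tool. It assumes fields are separated by semicolons, that the line ends with a single period, and that "-" marks an empty optional field.

```python
def parse_dr_line(line):
    """Split a UniProtKB flat-file DR line into its fields (illustrative only)."""
    if not line.startswith("DR   "):
        raise ValueError("not a DR line")
    body = line[5:].strip()
    # Drop only the final terminating period; identifiers may contain dots.
    if body.endswith("."):
        body = body[:-1]
    fields = [field.strip() for field in body.split(";")]
    return {
        "database": fields[0],
        "identifier": fields[1],
        # "-" stands for an empty optional field and is dropped here.
        "extra": [field for field in fields[2:] if field != "-"],
    }


record = parse_dr_line("DR   DisGeNET; 348; -.")
print(record["database"], record["identifier"], record["extra"])
```

The same function can be applied to other DR lines of the same shape, for example the OpenTargets line in the next section.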

Cross-references to OpenTargets

Cross-references have been added to OpenTargets. This Target Validation platform brings together information on the relationships between potential drug targets and diseases. The core concept is to identify evidence of an association between a target and disease from various data types.

OpenTargets is available at https://www.targetvalidation.org/.

The format of the explicit links is:

Resource abbreviation OpenTargets
Resource identifier Gene identifier (corresponding to Ensembl gene identifier)

Example: P15056

Show all entries having a cross-reference to OpenTargets.

Text format

Example: P15056

DR   OpenTargets; ENSG00000157764; -.

XML format

Example: P15056

<dbReference type="OpenTargets" id="ENSG00000157764"/>

RDF format

Example: P15056

uniprot:P15056
  rdfs:seeAlso <http://purl.uniprot.org/opentargets/ENSG00000157764> .
<http://purl.uniprot.org/opentargets/ENSG00000157764>
  rdf:type up:Resource ;
  up:database <http://purl.uniprot.org/database/OpenTargets> .

Change of the cross-references to PhosphoSite

The PhosphoSite resource has changed its name to PhosphoSitePlus and we have updated our cross-references to reflect this name change.

Change of the cross-references to SMR

We have modified our cross-references to the SWISS-MODEL Repository (SMR) database. These cross-references used to indicate the sequence ranges of the UniProt canonical sequence that can be modelled with high confidence. This information is no longer available in our cross-references, but you can get the most up-to-date data in SMR, which is now updated weekly for several model organisms, or by triggering the update of a specific entry in SMR yourself.

Text format

Example: Q00362

Previous format:

DR   SMR; Q00362; 4-376, 492-523.

New format:

DR   SMR; Q00362; -.

XML format

Example: Q00362

Previous format:

<dbReference type="SMR" id="Q00362">
  <property type="residue range" value="4-376, 492-523"/>
</dbReference>

New format:

<dbReference type="SMR" id="Q00362"/>

RDF format

Example: Q00362

Previous format:

uniprot:Q00362
  rdfs:seeAlso <http://purl.uniprot.org/smr/Q00362> .
<http://purl.uniprot.org/smr/Q00362>
  rdf:type up:Resource ;
  up:database <http://purl.uniprot.org/database/SMR> ;
  rdfs:comment "4-376, 492-523" .

New format:

uniprot:Q00362
  rdfs:seeAlso <http://purl.uniprot.org/smr/Q00362> .
<http://purl.uniprot.org/smr/Q00362>
  rdf:type up:Resource ;
  up:database <http://purl.uniprot.org/database/SMR> .
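Code that reads the XML format may need to accept SMR cross-references with and without the residue range property. The Python sketch below shows one way to do this. The function name smr_residue_ranges is an assumption made for this example, and the snippets are parsed without an XML namespace, whereas complete UniProt XML files declare a default namespace that real code would have to take into account.

```python
import xml.etree.ElementTree as ET


def smr_residue_ranges(xml_snippet):
    """Return the residue ranges of an SMR dbReference, or [] if none are given."""
    element = ET.fromstring(xml_snippet)
    ranges = []
    for prop in element.findall("property"):
        if prop.get("type") == "residue range":
            # The previous format listed ranges as a comma-separated string.
            ranges.extend(part.strip() for part in prop.get("value", "").split(","))
    return ranges


previous_format = (
    '<dbReference type="SMR" id="Q00362">'
    '<property type="residue range" value="4-376, 492-523"/>'
    "</dbReference>"
)
new_format = '<dbReference type="SMR" id="Q00362"/>'

print(smr_residue_ranges(previous_format))
print(smr_residue_ranges(new_format))
```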

Change of RDF representation of the cross-references to PDB

We have modified the representation of our cross-references to PDB. These cross-references indicate the sequence ranges of the UniProt canonical sequence that are covered by a PDB structure when this data is available. This piece of information was provided via a reification of the cross-reference statement and each range was represented with a chain property that had a string literal value. We have introduced a new chainSequenceMapping property to simplify this description.

Example: P00750

Previous format:

uniprot:P00750
  rdfs:seeAlso <http://rdf.wwpdb.org/pdb/1A5H> .

<http://rdf.wwpdb.org/pdb/1A5H>
  rdf:type up:Structure_Resource ;
  up:database <http://purl.uniprot.org/database/PDB> ;
  up:method up:X-Ray_Crystallography ;
  up:resolution "2.90"^^xsd:float .

<#_5030303735300036>
  rdf:type rdf:Statement ;
  rdf:type up:Structure_Mapping_Statement ;
  rdf:subject uniprot:P00750 ;
  rdf:predicate rdfs:seeAlso ;
  rdf:object <http://rdf.wwpdb.org/pdb/1A5H> ;
  up:chain "A/B=311-562" ,
           "C/D=298-304" .

New format:

uniprot:P00750
  rdfs:seeAlso <http://rdf.wwpdb.org/pdb/1A5H> .

<http://rdf.wwpdb.org/pdb/1A5H>
  rdf:type up:Structure_Resource ;
  up:database <http://purl.uniprot.org/database/PDB> ;
  up:method up:X-Ray_Crystallography ;
  up:resolution "2.90"^^xsd:float ;
  up:chainSequenceMapping isoform:P00750-1#PDB_1A5H_tt311tt562 ,
                          isoform:P00750-1#PDB_1A5H_tt298tt304 .

isoform:P00750-1#PDB_1A5H_tt311tt562
  up:chain "A/B=311-562" .

isoform:P00750-1#PDB_1A5H_tt298tt304
  up:chain "C/D=298-304" .
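The chain property keeps its string literal value in both formats. As a rough sketch, such a literal can be split into chain identifiers and a sequence range as follows. The function name parse_chain and the returned keys are assumptions made for this example, and the code assumes the simple "chains=start-end" shape seen in the examples above.

```python
def parse_chain(value):
    """Parse a chain literal such as 'A/B=311-562' (illustrative only)."""
    chains, _, span = value.partition("=")
    start, _, end = span.partition("-")
    return {
        "chains": chains.split("/"),  # chain identifiers sharing this range
        "start": int(start),          # first covered position of the sequence
        "end": int(end),              # last covered position of the sequence
    }


for literal in ("A/B=311-562", "C/D=298-304"):
    print(parse_chain(literal))
```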

Changes to the controlled vocabulary of human diseases

New diseases:

Modified diseases:

Changes to the controlled vocabulary for PTMs

New terms for the feature key 'Modified residue' ('MOD_RES' in the flat file):

  • Hydroxylated arginine
  • N6-(beta-hydroxybutyrate)lysine

UniProt website news

Web browser support update

UniProt strives to support all major web browsers up to the oldest version that is supported by the browser developers. Since Microsoft stopped the support for Internet Explorer versions older than 11 in January 2016, we have dropped the support for these versions from UniProt release 2016_10.

We recommend using one of the following major web browsers for the UniProt website:

  • Internet Explorer 11+
  • Firefox 45+
  • Chrome (latest update)
  • Safari 9+

Please note that for older versions of these browsers certain features of the website may not be available (you can check here which browser version you are using).

UniProt release 2016_09

Published October 5, 2016

Headline

Ki-67: the great leap from simple marker to functional actor

A marker is 'something (such as a sign or an object) that shows the location, the presence or the existence of something'. Used daily in laboratories worldwide, from basic research to clinics, markers are a scientist/practitioner's best friend and the community continuously seeks new markers, notably for improving diagnosis and prognosis in medicine. Take for instance Ki-67. This protein, encoded by the MKI67 gene, is present during all active phases of the cell cycle, G1, S, G2, and mitosis, but is absent from resting G0 cells. During interphase, it is predominantly present in the cortex and dense fibrillar components of the nucleolus. During mitosis, it relocates to the periphery of the condensed chromosomes. It is a widely used marker for cell proliferation, very valuable in cancer diagnosis and prognosis. In this case, the term "widely" seems an understatement. A search in the NCBI PubMed database retrieves over 22'200 publications, but hardly any deal with its actual function. Indeed, while Ki-67 association with cellular proliferation is well established, its precise role in this process was unknown until recently. It was quite tempting to suggest that it is 'required for maintaining cell proliferation', as it was cautiously stated in the human UniProtKB/Swiss-Prot entry. However, a marker is just a marker and drawing any functional conclusion from expression levels may be hazardous.

At the very beginning of mitosis, chromosomes are compacted into thick fibers. After nuclear envelope breakdown (NEBD), chromosomes separate from one another in the cytoplasm, attach to the mitotic spindle and align along the center of the cell during metaphase. The spindle pulls a set of chromosomes to each pole of the dividing cell. How do chromosomes maintain their structural individuality during this process? As the molecules responsible for chromosome compaction are by themselves unable to distinguish different chromosomes, what are the factors that prevent chromosome coalescence?

Earlier this year, Cuylen et al. tackled this issue. Using automated live-cell imaging, the authors analyzed the effect of removing different proteins from cells. Out of almost 1,300 candidate genes, the knockdown of only one caused the sought-after chromosome clustering phenotype: MKI67. The internal structure of mitotic chromosomes appeared unaffected by Ki-67 depletion, but soon after NEBD, chromosomes merged into a single mass of chromatin, whose access to spindle microtubules was impaired.

Ki-67 is a large, about 3'000 amino acid long, protein that localizes at the chromosome surface from prophase until telophase, as mentioned above. Cuylen et al. show that the protein's adsorption at the chromosome surface is mediated by its C-terminal region. The elongated N-terminal portion orients perpendicular to the chromosomes, a little like bristles on a brush. Ki-67 size and overall electric charge may form a repulsive shield, preventing coalescence. The range of Ki-67-mediated chromosome repulsion seems to depend on molecular density. When Ki-67 was overexpressed, mitotic chromosomes were spaced further apart.

Hence natural proteins seem to be able to act as surfactants in intracellular compartmentalization. It would be interesting to investigate whether it is also the case for membrane-less organelles, such as nucleoli, with which Ki-67 was also shown to be associated.

As of this release, the human Ki-67 entry has been updated in UniProtKB/Swiss-Prot and is publicly available.

UniProtKB news

Change of RDF representation of the cross-references to family and domain databases

We have modified the representation of our cross-references to family and domain databases. These cross-references indicate the number of matches of the family or domain signature to the UniProt canonical sequence, and this piece of information was provided via a reification of the cross-reference statement. We have introduced a new Signature_Resource class with a signatureSequenceMatch property to describe each match as a resource and thereby simplify this description.

Example: A0AVT1

Previous format:

uniprot:A0AVT1
  rdfs:seeAlso <http://purl.uniprot.org/pfam/PF00899> .

<http://purl.uniprot.org/pfam/PF00899>
  rdf:type up:Resource ;
  up:database <http://purl.uniprot.org/database/Pfam> ;
  rdfs:comment "ThiF" .

<#_4130415654310021>
  rdf:type rdf:Statement ;
  rdf:type up:Domain_Assignment_Statement ;
  rdf:subject uniprot:A0AVT1 ;
  rdf:predicate rdfs:seeAlso ;
  rdf:object <http://purl.uniprot.org/pfam/PF00899> ;
  up:hits 2 .

New format:

uniprot:A0AVT1
  rdfs:seeAlso <http://purl.uniprot.org/pfam/PF00899> .

<http://purl.uniprot.org/pfam/PF00899>
  rdf:type up:Signature_Resource ;
  up:database <http://purl.uniprot.org/database/Pfam> ;
  rdfs:comment "ThiF" ;
  up:signatureSequenceMatch isoform:A0AVT1-1#Pfam_PF00899_match_1 ,
                            isoform:A0AVT1-1#Pfam_PF00899_match_2 .

Change of RDF representation of the cross-references to EMBL

We have modified the representation of our cross-references to nucleotide CoDing Sequences (CDS) from the INSDC. When a CDS differs substantially from a reviewed UniProtKB/Swiss-Prot sequence, the UniProt curators indicate the nature of the difference in the corresponding cross-reference. This piece of information was provided via a reification of the cross-reference statement. We have introduced a new sequenceDiscrepancy property to simplify this description.

Example: P30154

Previous format:

uniprot:P30154
  rdfs:seeAlso <http://purl.uniprot.org/embl-cds/BAG59103.1> .

<http://purl.uniprot.org/embl-cds/BAG59103.1>
  rdf:type up:Nucleotide_Resource ;
  up:database <http://purl.uniprot.org/database/EMBL> ;
  up:locatedOn <http://purl.uniprot.org/embl/AK296455> .

<#_503330313534001A>
  rdf:type rdf:Statement ;
  rdf:type up:Nucleotide_Mapping_Statement ;
  rdf:subject uniprot:P30154 ;
  rdf:predicate rdfs:seeAlso ;
  rdf:object <http://purl.uniprot.org/embl-cds/BAG59103.1> ;
  rdfs:comment "Frameshift." .

New format:

uniprot:P30154
  rdfs:seeAlso <http://purl.uniprot.org/embl-cds/BAG59103.1> .

<http://purl.uniprot.org/embl-cds/BAG59103.1>
  rdf:type up:Nucleotide_Resource ;
  up:database <http://purl.uniprot.org/database/EMBL> ;
  up:locatedOn <http://purl.uniprot.org/embl/AK296455> ;
  up:sequenceDiscrepancy uniprot:P30154#EMBL_BAG59103.1 .

uniprot:P30154#EMBL_BAG59103.1
  rdfs:comment "Frameshift." .

Changes to the controlled vocabulary of human diseases

New diseases:

Modified diseases:

UniProt release 2016_08

Published September 7, 2016

Headline

Butterfly fashion: all they need is cortex

Butterfly and moth wing patterns fulfill various functions, such as mate attraction, thermal regulation, and protection by concealment, mimicry or warning. Patterns are produced by a dust-like layer of tiny colored scales that cover an otherwise transparent membrane. Scales can be pigmented with melanins resulting in black and brown colors. Blue, red and iridescence are usually created by the microstructure of the scales, resulting in the scattering of light. Each scale is produced by a single cell on the wing surface.

Wing pattern and color can change in order to adapt to environmental changes. The classical example of such a phenomenon is provided by Biston betularia. This moth used to camouflage itself against lichen-covered tree trunks. Its peppered white wings make it almost invisible on this background. With the advent of the industrial revolution in the 19th century in Britain, trunks turned soot black and so did Biston betularia. The new melanic morph was described for the first time in Manchester in 1848 and called carbonaria. It spread all over England and its frequency was over 90% in the 1950s. Several years after the Clean Air Act, in the early 1970s, its frequency started to drop again and nowadays the maximum is estimated at less than 50%, and in most places below 10%.

The mutation that gave rise to Biston betularia industrial melanism has just been identified. It is the insertion of a large, tandemly repeated, transposable element into the first intron of the cort gene, which results in increased gene expression. The transposition event is thought to have occurred around 1819, which is consistent with the historical record. Surprisingly, the cort gene does not encode a transcription factor that would be involved in the expression of pigmentation genes. Its only known function has been reported in Drosophila, where the cort-encoded protein cortex is a cell-cycle regulator, required for the completion of meiosis in oocytes. In Heliconius numata tarapotensis and Heliconius melpomene rosina, 2 butterfly species, cortex is expressed in final instar larval hindwing discs, in regions fated to become black in the adult wing. Although cortex function in the regulation of pigmentation patterning is yet unknown, the current hypothesis is that it may regulate scale cell development.

In other latitudes, butterflies escape from predators not by concealment, but by warning that they are unpalatable with bright and distinctive wing colors. Within a given area, experienced birds have been "educated" to avoid certain patterns. This pattern recognition varies with geographical location. As a result, in a given area, a number of butterfly species, edible or not, mimic each other and have the same color pattern, even though they may be only distantly related, while Lepidoptera of the same species found in other locations may exhibit very different patterns. A recent study focused on different Heliconius species living in South America. The result was quite striking. In these species too, the cort gene appeared to be a major regulator of color and pattern. This result suggests that the recruitment of cortex to wing patterning may have occurred before the major diversification of the Lepidoptera. This gene has repeatedly been targeted by natural selection to generate both cryptic, as in Biston betularia, and aposematic, as in Heliconius genus, patterns.

As of this release, UniProtKB/Swiss-Prot Biston betularia, Heliconius melpomene and Heliconius erato cortex entries have been updated with this new knowledge and are publicly available.

UniProtKB news

Cross-references to Conserved Domains Database

Cross-references have been added to the Conserved Domains Database (CDD), a protein annotation resource that consists of a collection of well-annotated multiple sequence alignment models for ancient domains and full-length proteins.

CDD is available at https://www.ncbi.nlm.nih.gov/cdd.

The format of the explicit links is:

Resource abbreviation CDD
Resource identifier CDD identifier
Optional information 1 CDD model name
Optional information 2 Number of hits

Example: Q196W5

Show all entries having a cross-reference to CDD.

Text format

Example: Q196W5

DR   CDD; cd04278; ZnMc_MMP; 1.

XML format

Example: Q196W5

<dbReference type="CDD" id="cd04278">
  <property type="entry name" value="ZnMc_MMP"/>
  <property type="match status" value="1"/>
</dbReference>

RDF format

Example: Q196W5

uniprot:Q196W5
  rdfs:seeAlso <http://purl.uniprot.org/cdd/cd04278> .
<http://purl.uniprot.org/cdd/cd04278>
  rdf:type up:Resource ;
  up:database <http://purl.uniprot.org/database/CDD> ;
  rdfs:comment "ZnMc_MMP" .

Change of the cross-references to VectorBase

We have modified our cross-references to the VectorBase database. We now use the VectorBase Transcript identifier as the primary resource identifier, while showing the VectorBase Protein and Gene identifiers in additional fields.

VectorBase is available at http://vectorbase.org.

The new format of the explicit links is:

Resource abbreviation VectorBase
Resource identifier Transcript identifier
Optional information 1 Protein identifier
Optional information 2 Gene identifier

Example: A7UVJ5

Show all entries having a cross-reference to VectorBase.

Text format

Example: A7UVJ5

Previous format:

DR   VectorBase; AGAP001789; Anopheles gambiae.

New format:

DR   VectorBase; AGAP001789-RA; AGAP001789-PA; AGAP001789.

XML format

Example: A7UVJ5

Previous format:

<dbReference type="VectorBase" id="AGAP001789">
  <property type="organism name" value="Anopheles gambiae"/>
</dbReference>

New format:

<dbReference type="VectorBase" id="AGAP001789-RA">
  <property type="protein sequence ID" value="AGAP001789-PA"/>
  <property type="gene ID" value="AGAP001789"/>
</dbReference>

This change does not affect the XSD, but may nevertheless require code changes.
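As an example of the kind of code change that may be needed, the Python sketch below retrieves the gene identifier from a VectorBase cross-reference in either format. The function name vectorbase_gene_id is an assumption made for this example, and the snippets are parsed without an XML namespace, whereas complete UniProt XML files declare a default namespace.

```python
import xml.etree.ElementTree as ET


def vectorbase_gene_id(xml_snippet):
    """Return the gene identifier from a VectorBase dbReference in either format."""
    element = ET.fromstring(xml_snippet)
    for prop in element.findall("property"):
        # New format: the gene identifier is given in a "gene ID" property.
        if prop.get("type") == "gene ID":
            return prop.get("value")
    # Previous format: the id attribute itself was the gene identifier.
    return element.get("id")


previous_format = (
    '<dbReference type="VectorBase" id="AGAP001789">'
    '<property type="organism name" value="Anopheles gambiae"/>'
    "</dbReference>"
)
new_format = (
    '<dbReference type="VectorBase" id="AGAP001789-RA">'
    '<property type="protein sequence ID" value="AGAP001789-PA"/>'
    '<property type="gene ID" value="AGAP001789"/>'
    "</dbReference>"
)

print(vectorbase_gene_id(previous_format))
print(vectorbase_gene_id(new_format))
```

Code that used the id attribute as a gene identifier would silently pick up the transcript identifier in the new format, which is why a check on the property type is used here.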

RDF format

Example: A7UVJ5

Previous format:

uniprot:A7UVJ5
  rdfs:seeAlso <http://purl.uniprot.org/vectorbase/AGAP001789> .
<http://purl.uniprot.org/vectorbase/AGAP001789>
  rdf:type up:Resource ;
  up:database <http://purl.uniprot.org/database/VectorBase> ;
  rdfs:comment "Anopheles gambiae" .

New format:

uniprot:A7UVJ5
  rdfs:seeAlso <http://purl.uniprot.org/vectorbase/AGAP001789-RA> .
<http://purl.uniprot.org/vectorbase/AGAP001789-RA>
  rdf:type up:Transcript_Resource ;
  up:database <http://purl.uniprot.org/database/VectorBase> ;
  up:translatedTo <http://purl.uniprot.org/vectorbase/AGAP001789-PA> ;
  up:transcribedFrom <http://purl.uniprot.org/vectorbase/AGAP001789> .

Change of the cross-references to WormBase

Cross-references to WormBase may now be isoform-specific. The general format of isoform-specific cross-references was described in release 2014_03.

Example: P00167

Changes to the controlled vocabulary of human diseases

New diseases:

UniProt website news

Peptide search tool

We have introduced a new tool called Peptide search that is available from a link in the header of the UniProt website. You can enter one or several peptide sequences (for example from a proteomics experiment) into the search field and the tool quickly finds all UniProtKB sequences that exactly match one of your query sequences. Searches can be restricted to a taxonomic subset of UniProtKB to decrease the search time. The tool returns a results page showing the matched UniProtKB entries in a design consistent with the UniProtKB text search results page, including filters on the left, results on the right and an option to customise the results table through the 'Columns' button.
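The idea of exact peptide matching can be sketched in a few lines of Python. This is not how the Peptide search tool is implemented; it is a naive illustration in which the function name peptide_search, the accessions and the sequences are all made up for this example.

```python
def peptide_search(peptides, sequences):
    """Return, per accession, the query peptides that occur exactly in the sequence."""
    hits = {}
    for accession, sequence in sequences.items():
        matched = [peptide for peptide in peptides if peptide in sequence]
        if matched:
            hits[accession] = matched
    return hits


# Toy data for illustration only; these are not UniProtKB entries.
toy_sequences = {
    "TOY1": "MKTAYIAKQR",
    "TOY2": "GSHMLEDPAK",
}

print(peptide_search(["AYIAK", "LEDP", "WWW"], toy_sequences))
```

Restricting the dictionary of sequences to a taxonomic subset before searching corresponds to the option of limiting the search to decrease the search time.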

Publications view added to UniProtKB entries

UniProt Knowledgebase (UniProtKB) protein entries now have a dedicated view of publications relevant for a protein. UniProtKB contains more than 350,000 unique publications, with over 210,000 of these fully curated in UniProtKB/Swiss-Prot and the remainder imported in UniProtKB/TrEMBL. This set is complemented by more than 640,000 additional publications that have been computationally mapped from other resources to UniProtKB entries. The publications annotated in UniProtKB have previously been displayed in the main 'Entry' view and a link provided access to a separate page that listed the computationally mapped publications. We have now combined all publications into a new 'Publications' view that can be accessed from a link under the 'Display' heading on the left hand side of a UniProtKB page. In this view you can filter the publications list by source and categories that are based on the type of data a publication contains about the protein (such as function, interaction, sequence, etc.) or the number of proteins it describes ('small scale' vs 'large scale'), see for example P10276.

UniProt release 2016_07

Published July 6, 2016

Headline

(Bacterial) immigration under control

Essentially all our mucosal surfaces are covered by microorganisms, not only bacteria, but also archaea, fungi, protozoans and viruses. Most of them reside within the gastrointestinal tract. Normal gut flora is largely responsible for overall health of the host and it does not trigger any inflammatory response... as long as it remains where it belongs. In order to maintain a subtle, though strict segregation, the colonic epithelium is covered by mucus. The latter is organized in 2 layers. The inner layer adheres firmly to the epithelial cells. It is dense and does not allow bacterial penetration, thus keeping the epithelial cell surface free from bacteria. The outer layer is the habitat of the commensal flora. The inner mucus layer is converted into the outer layer by proteolytic activities provided by the host and also probably by commensal bacterial proteases and glycosidases.

Colonic quietness is maintained not only by the physical barrier of the mucus: the immune system also plays a crucial role, among others through the secretion of IgA into the gut lumen. These dimeric immunoglobulins bind flagellin, a highly conserved protein component of the bacterial flagellum that is expressed by many different commensal species. This interaction limits the association of flagellated bacteria with the intestinal mucosa. The mechanism leading to IgA production by B cells in this context is not yet fully uncovered, but it is known that flagellin is sensed by at least 3 different innate immune receptors, including TLR5, which plays an instrumental role in this process.

In this peaceful, though cautious cohabitation, another host protein actor has been recently identified, LYPD8. In the absence of LYPD8, bacteria penetrate the inner mucus layer despite normal production of mucin, the main building block of mucus, and move further into the crypts of the large intestine, causing severe inflammation. LYPD8 is a membrane protein, attached to the plasma membrane through a glycophosphatidylinositol (GPI) anchor. It is selectively expressed in epithelial cells at the uppermost layer of the large intestinal gland and can be released into the gut lumen by the action of specific phospholipases. Once in the extracellular milieu, it binds to flagellated bacteria, including Proteus mirabilis. Contrary to TLR5, this interaction seems to be specific to flagella, a higher order structure comprised of polymerized flagellins, not to monomeric flagellins. This binding severely impairs bacterial swarming activity, thereby regulating gut homeostasis.

Until these recent observations, nothing was known about LYPD8. It had only been identified through large scale cDNA and genome sequencing. The sole annotations provided in UniProtKB were based on protein domain predictions, including that of the GPI anchor (UPAR/Ly6 domain) and of the signal peptide. As of this release, LYPD8 entries have been updated with this new functional information and are publicly available.

Changes to the controlled vocabulary of human diseases

New diseases:

Modified diseases:

Deleted diseases

  • Ciliary dyskinesia, primary, 31
  • Jensen syndrome
  • Mental retardation, autosomal dominant 12
  • Thiopurine S-methyltransferase deficiency

UniProt release 2016_06

Published June 8, 2016

Headline

Strength through unity

Reversible phosphorylation of proteins is a fundamental regulatory mechanism for many processes across a wide range of taxa. It has been extensively studied in the context of intracellular events in the nucleus and in the cytoplasm. Less is known about extracellular phosphorylation, but a family of secretory pathway kinases has been identified within the Golgi apparatus and in the extracellular milieu in recent years. Among them, FAM20C has been shown to phosphorylate many secreted proteins involved in biomineralization, including enamel matrix proteins, such as AMBN, AMELX, AMTN and ENAM. The importance of extracellular phosphorylation in bone physiology is further supported by the observation that mutations in FAM20C are associated with Raine syndrome, an autosomal recessive osteosclerotic bone dysplasia with a neonatal lethal outcome.

FAM20A, FAM20C's closest paralog, exhibits all characteristics of a kinase, except for one residue, a conserved glutamic acid residue which is replaced by a glutamine, causing a loss of enzyme activity. This is not a characteristic unique to FAM20A. About 10% of the proteins classified as protein kinases lack some of the key features required for activity. They are called "pseudokinases". In spite of its lack of activity, mutations in FAM20A also produce a defect in biomineralization, namely amelogenesis imperfecta 1G.

This apparent paradox was solved by Cui et al. last year. They showed that in the absence of FAM20A, FAM20C activity dramatically drops. Moreover, FAM20A mutants associated with amelogenesis imperfecta 1G fail to activate FAM20C. The proteins have to form a complex for full FAM20C activity.

Kinases are synthesized as inactive proteins. Classically, their activation is achieved through the phosphorylation of a domain called the "activation loop" which induces a conformational change. FAM20C does not have an activation loop that could be phosphorylated. Yet another kind of activation, called "allosteric activation", has already been reported for kinase-pseudokinase pairs. In this model, it is the pseudokinase binding that induces the shape change of the bona fide kinase into its active conformation. Although the exact mechanism of FAM20C activation is still unclear, experimental results suggest that it may join the growing list of kinases regulated by dimerization-induced allostery.

FAM20A and FAM20C are quite old enzymes, evolutionarily related to kinases found in bacteria and slime molds. The fact that they do not use activation loop phosphorylation suggests that the allosteric mode of kinase activation may be very ancient, predating the evolution of the activation loop. The presence of many conserved pseudokinases in the genomes of higher organisms suggests that allosteric activation may still be an efficient regulatory mechanism.

As of this release, FAM20A and FAM20C have been updated and are publicly available.

UniProtKB news

Removal of the cross-references to NextBio

Cross-references to NextBio have been removed.

Changes to the controlled vocabulary of human diseases

New diseases:

Deleted diseases

  • Epilepsy, progressive myoclonic 5

RDF news

Change of URIs for neXtProt

For historic reasons, UniProt had to generate URIs to cross-reference databases that did not have an RDF representation. Our policy is to replace these by the URIs generated by the cross-referenced database once it starts to distribute an RDF representation of its data.

The URIs for the neXtProt database have therefore been updated from:

http://purl.uniprot.org/nextprot/<ID>

to:

http://nextprot.org/rdf/entry/<ID>

If required for backward compatibility, you can use the following query to add the old URIs:

PREFIX owl:<http://www.w3.org/2002/07/owl#>
PREFIX rdfs:<http://www.w3.org/2000/01/rdf-schema#>
PREFIX up:<http://purl.uniprot.org/core/>
INSERT
{
   ?protein rdfs:seeAlso ?old .
   ?old owl:sameAs ?new .
   ?old up:database <http://purl.uniprot.org/database/neXtProt> .
}
WHERE
{
   ?protein rdfs:seeAlso ?new .
   ?new up:database <http://purl.uniprot.org/database/neXtProt> .
   BIND(iri(concat('http://purl.uniprot.org/nextprot/', substr(str(?new),31))) AS ?old)
}

The dereferencing of existing http://purl.uniprot.org/nextprot/<ID> URIs will be maintained.
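For code that handles these URIs outside a SPARQL endpoint, the same mapping can be sketched in Python. The function name old_nextprot_uri and the identifier EXAMPLE_ID are placeholders invented for this example. The new prefix is 30 characters long, which is why the query above takes the substring starting at position 31.

```python
NEW_PREFIX = "http://nextprot.org/rdf/entry/"
OLD_PREFIX = "http://purl.uniprot.org/nextprot/"


def old_nextprot_uri(new_uri):
    """Map a new neXtProt URI back to the old purl.uniprot.org form."""
    if not new_uri.startswith(NEW_PREFIX):
        raise ValueError("not a neXtProt entry URI: " + new_uri)
    # Keep only the identifier that follows the 30-character new prefix.
    return OLD_PREFIX + new_uri[len(NEW_PREFIX):]


# EXAMPLE_ID is a placeholder, not a real neXtProt identifier.
print(old_nextprot_uri(NEW_PREFIX + "EXAMPLE_ID"))
```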

UniProt release 2016_05

Published May 11, 2016

Headline

Slow/White and the 6 DWORFs

Striated muscle function relies on a cycle of contraction and relaxation. Upon electrical stimulation of the myocyte plasma membrane, Ca(2+) is released from the sarcoplasmic reticulum (SR) into the cytosol. The released calcium activates movement of the molecular motor myosin along actin filaments and contraction occurs. Cytosolic Ca(2+) is then pumped back into the SR, through the action of SERCA proteins, allowing actomyosin relaxation. The SERCA proteins are SR-resident transmembrane ATPases, that couple the hydrolysis of ATP with Ca(2+) translocation.

Recent studies have highlighted a role for a network of (very) small ORFs (smORFs) in SERCA regulation. The first members of this exclusive but growing club were phospholamban (PLN, 52 amino acids) and sarcolipin (SLN, 31 amino acids), which were both isolated by classical biochemical approaches decades ago. Both bind SERCA and reduce the rate of calcium movement in heart and slow skeletal muscle fibers. More recently the SERCA inhibitory micropeptide myoregulin (MRLN, 46 amino acids), was identified in fast muscle fibers by Anderson et al. These authors started by screening for skeletal muscle-specific RNAs and discovered MRLN in an apparent long non-coding RNA (lncRNA). Encouraged by this discovery, Olson lab members continued to look for smORFs in other muscle-specific lncRNAs and found DWORF (34 amino acids), encoded by 2 exons of a 795 bp-long transcript; very difficult to predict using current software. In mouse myocytes, DWORF expression stimulates Ca(2+) uptake in the SR, not by direct activation of SERCA, but rather by relieving MRLN-, PLN- and SLN-mediated inhibition. DWORF expression may be particularly beneficial for recovery from periods of prolonged contraction.

SERCA regulation by micropeptides encoded in supposed lncRNAs is not a vertebrate-specific phenomenon. In Drosophila melanogaster, a single muscle-specific transcript encodes 2 smORFs related to sarcolipin, sarcolamban A and B (SCLA, 28 amino acids, and SCLB, 29 amino acids). Computer simulations predicted that both peptides fit the groove of SERCA, and this has been experimentally verified. While mutant flies deficient in sarcolamban showed no behavioral or morphological muscle phenotype, they do exhibit significantly more arrhythmic cardiac contractions than wild-type flies.

The idea that smORFs may be overlooked in the current genome annotation is not new, and these recent advances in muscle physiology underscore the likelihood that many transcripts annotated as noncoding RNAs may actually encode peptides with important biological functions. These smORFs could represent fast-evolving key regulators of larger molecular complexes. They also highlight the need for expert biocuration to make these data available in databases, as they cannot currently be automatically predicted, retrieved or annotated.

The 6 dworfs have been curated and integrated into UniProtKB/Swiss-Prot and we continue to survey the literature for other hidden micropeptide treasures (motivated solely by biological interest and not by our desire to find a seventh member for the purposes of this headline).

UniProtKB news

Cross-references to SIGNOR

Cross-references have been added to SIGNOR, the Signaling Network Open Resource, a resource that organizes and stores, in a structured format, signaling information published in the scientific literature. The core of this project is a large collection of manually-annotated causal relationships between proteins that participate in signal transduction.

SIGNOR is available at http://signor.uniroma2.it/.

The format of the explicit links is:

Resource abbreviation SIGNOR
Resource identifier UniProtKB accession number.

Example: P00533

Show all entries having a cross-reference to SIGNOR.

Text format

Example: P00533

DR   SIGNOR; P00533; -.
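
As an illustration only, a flat-file DR line such as the one above can be split into its database name, identifier and optional fields with a few lines of Python. This is a minimal sketch, not an official parser: the names `CrossRef` and `parse_dr_line` are hypothetical, and the pattern assumes only the line shapes shown in these release notes (fields separated by `;`, a `-` placeholder for empty fields, a final `.`, and an optional isoform in square brackets).

```python
import re
from typing import NamedTuple, Optional


class CrossRef(NamedTuple):
    """One cross-reference taken from a flat-file DR line (hypothetical type)."""
    database: str
    identifier: str
    properties: tuple
    isoform: Optional[str]


# "DR", whitespace, the fields, the closing ".", then an optional "[isoform]".
_DR_LINE = re.compile(r"^DR\s+(?P<body>.+?)\.(?:\s+\[(?P<isoform>[^\]]+)\])?\s*$")


def parse_dr_line(line: str) -> CrossRef:
    """Split a DR line into database, identifier, extra fields and isoform."""
    match = _DR_LINE.match(line)
    if match is None:
        raise ValueError(f"not a DR line: {line!r}")
    fields = [field.strip() for field in match.group("body").split(";")]
    if len(fields) < 2:
        raise ValueError(f"DR line has no identifier: {line!r}")
    database, identifier, *rest = fields
    # "-" marks an empty optional field, so it is dropped.
    properties = tuple(value for value in rest if value != "-")
    return CrossRef(database, identifier, properties, match.group("isoform"))


print(parse_dr_line("DR   SIGNOR; P00533; -."))
```

Keeping the optional fields as a tuple avoids hard-coding how many of them a given resource uses.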

XML format

Example: P00533

<dbReference type="SIGNOR" id="P00533"/>

RDF format

Example: P00533

uniprot:P00533
  rdfs:seeAlso <http://purl.uniprot.org/signor/P00533> .
<http://purl.uniprot.org/signor/P00533>
  rdf:type up:Resource ;
  up:database <http://purl.uniprot.org/database/SIGNOR> .
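
The simple RDF examples in these notes share one shape: the resource IRI is `http://purl.uniprot.org/`, followed by the lower-cased resource abbreviation and the identifier. The sketch below reproduces that shape for illustration. The function name `seealso_turtle` is hypothetical, and the lower-casing rule is an assumption inferred from the examples shown here rather than a documented guarantee; cross-references with additional typing (such as transcript resources) are not covered.

```python
def seealso_turtle(accession: str, database: str, identifier: str) -> str:
    """Render a plain cross-reference in the Turtle layout used in these notes."""
    # Assumption: the IRI path segment is the lower-cased resource abbreviation.
    resource = f"http://purl.uniprot.org/{database.lower()}/{identifier}"
    return (
        f"uniprot:{accession}\n"
        f"  rdfs:seeAlso <{resource}> .\n"
        f"<{resource}>\n"
        f"  rdf:type up:Resource ;\n"
        f"  up:database <http://purl.uniprot.org/database/{database}> ."
    )


print(seealso_turtle("P00533", "SIGNOR", "P00533"))
```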

Changes to the controlled vocabulary of human diseases

New diseases:

Modified diseases:

UniProt website news

Change of UniProt website job identifiers

To enable a more flexible and scalable infrastructure, we have extended the length of the UniProt website's job identifiers.

Example:

M201604052M3YWGETHB
has become:
M2016040537D007A56D816107CE5B52C10342DB3700000452

We will continue to store job results for 7 days.
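
Client code that stores or validates job identifiers should avoid assuming a fixed length. The following sketch accepts both identifiers shown above. It assumes, based only on those two examples, that an identifier is one capital letter followed by an eight-digit date (YYYYMMDD) and an alphanumeric suffix of any length; that layout and the helper names are assumptions, not a documented format.

```python
import re
from datetime import date, timedelta

# Assumed layout: kind letter, YYYYMMDD, then a suffix of unspecified length.
_JOB_ID = re.compile(r"^(?P<kind>[A-Z])(?P<date>\d{8})(?P<suffix>[0-9A-Z]+)$")

# Retention period stated in the announcement.
RETENTION = timedelta(days=7)


def job_submission_date(job_id: str) -> date:
    """Read the date that appears to be embedded in a job identifier."""
    match = _JOB_ID.match(job_id)
    if match is None:
        raise ValueError(f"unrecognised job identifier: {job_id!r}")
    digits = match.group("date")
    return date(int(digits[:4]), int(digits[4:6]), int(digits[6:8]))


def job_expiry_date(job_id: str) -> date:
    """Estimate the last day on which results may still be stored."""
    return job_submission_date(job_id) + RETENTION


for example in ("M201604052M3YWGETHB",
                "M2016040537D007A56D816107CE5B52C10342DB3700000452"):
    print(len(example), job_submission_date(example), job_expiry_date(example))
```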

UniProt release 2016_04

Published April 13, 2016

Headline

Small changes, big effects

Our brain has the ability to reorganize itself by forming new neural connections throughout life. This plasticity allows neurons to adjust their activities in response to new situations, to changes in their environment, and to compensate for injury and disease. Plasticity is not only due to the creation/destruction of neuronal connections, but also to the modulation of synaptic strength depending upon its activity, a process called 'short-term synaptic plasticity' (STP). There are 2 types of STP, with opposite effects, known as 'depression' and 'facilitation'. When neurons receive excitatory input, they generate strong electrical impulses (called spikes) which cause a release of neurotransmitters at the synaptic connections with other neurons. The neurotransmitters stimulate receptors on the postsynaptic neuron and trigger downstream electrical impulses. Action potential activity leads to the depletion of neurotransmitters consumed during the synaptic signaling process at the axon terminal of a presynaptic neuron, causing 'depression'. It also induces an influx of calcium into the axon terminal. The calcium accumulation increases neurotransmitter release by the next presynaptic spike, facilitating synaptic transmission and temporarily potentiating the synapse ('facilitation').

Facilitation is important for the proper function of mammalian brains. It may form the basis of short-term working memory. In the hippocampus, it has been proposed to play a role in the acquisition of spatial information. In the auditory pathway, it allows the maintenance of linear transmission of rate-coded sound intensity.

Although synaptic facilitation was observed more than 70 years ago, the underlying mechanism is not yet fully elucidated. However, a major breakthrough was recently achieved and published in January in Nature. In their article, Jackman et al. identified a synaptotagmin-7 (SYT7) requirement for facilitation to occur in most central synapses. SYT7 is a calcium- and phospholipid-binding protein involved in the exocytosis of many secretory and synaptic vesicles. In SYT7-knockout mice, facilitation was eliminated at all synapses (except for mossy fiber synapses), although calcium influx was not affected by the mutation.

To rule out an indirect effect of SYT7 knockout, the authors tried to rescue facilitation through viral expression of SYT7 in hippocampal CA3 pyramidal cells. To do so, they used an adeno-associated virus that drove bicistronic expression of both channelrhodopsin-2 and SYT7. Channelrhodopsins are unicellular green algae proteins that serve as sensory photoreceptors. When expressed in the experimental setting established by Jackman et al., they enabled light to control electrical excitability only in the fibers expressing SYT7. The result was clear-cut: facilitation was restored. The identification of a protein required for synaptic facilitation may pave the way for future investigations on the functional role of this process.

As of this release, SYT7 proteins have been updated in UniProtKB/Swiss-Prot and are publicly available.

UniProtKB news

Changes to the controlled vocabulary of human diseases

New diseases:

Modified diseases:

Deleted diseases

  • Inclusion body myopathy 2

UniProt service news

New UniProt JAPI

We have developed a new version of the UniProt JAPI. The legacy UniProt JAPI will be retired as of Wednesday, April 13th 2016. If you have any questions or concerns, please feel free to contact us at help@uniprot.org.

UniProt RDF news

Change of the UniProt RDF files distribution

The UniProt RDF distribution has been available on the UniProt FTP site since 2008 with data split into one file per dataset. Over time the size of the largest files has grown to over 80 Gigabytes. These large files are difficult to download and they also limit the maximum rate at which the data can be loaded into many RDF stores. We have therefore split the files of the three biggest datasets into sets of smaller files:

  • The UniProtKB dataset is split based on taxonomy and whether entries are active or not. The resulting files contain at most 1 million active or 10 million obsolete entries.
  • The UniRef dataset is split into files that contain at most 1 million entries.
  • The UniParc dataset is split into files of approximately 1 Gigabyte in size.

We also reduced the data redundancy between the datasets to further decrease the total data volume:

  • The UniProtKB dataset has always been fully normalized with respect to the taxonomy dataset and it is now also normalized with respect to the keywords, GO and citations datasets. The total number of unique triples across these datasets remains the same, but it means that if you have so far only loaded the UniProtKB and taxonomy RDF files into your RDF store, you must now also load the keywords.rdf.xz, go.owl.xz and citations.rdf.xz files in order to have the same data.
  • The UniRef dataset has been normalized with respect to the UniProtKB and UniParc datasets. It now only describes the UniRef cluster memberships. The sequence and entry information of UniProtKB and UniParc member entries is no longer repeated in the UniRef RDF files.
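
Because the UniProtKB files now depend on the keywords, GO and citations files, a loading script may want to check that these companions were downloaded before it starts. The sketch below does only that check. The three file names come from the text above; the function name and the directory layout (all files in one local folder) are assumptions made for illustration.

```python
from pathlib import Path

# Companion files named in the announcement above.
REQUIRED_COMPANIONS = ("keywords.rdf.xz", "go.owl.xz", "citations.rdf.xz")


def missing_companions(download_dir) -> list:
    """Return the companion files that are absent from a local download folder."""
    present = {path.name for path in Path(download_dir).iterdir() if path.is_file()}
    return [name for name in REQUIRED_COMPANIONS if name not in present]
```

A loader could call `missing_companions("rdf")` and stop with a clear message if the returned list is not empty.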

UniProt release 2016_03

Published March 16, 2016

Headline

From the Zika forest to the Amazon, news from a viral wanderer

In 2015, a large outbreak in Brazil put the Zika virus in the spotlight. Most people who become infected with Zika virus do not become sick and for those who do, the illness is generally mild. However, in some cases, complications can be quite severe. In addition, microcephaly has been reported in some babies born to mothers infected with Zika virus during pregnancy, pointing to the virus as an emerging human pathogen.

Although the Zika virus owes its worldwide infamy to its spread to the Western hemisphere, it had been circulating in Africa long before. It was first discovered in Uganda in 1947, in rhesus monkeys living in the Zika Forest (after which it was named), and subsequently in humans in 1952. It is an RNA virus of the flavivirus genus, which also includes dengue, yellow fever and West Nile viruses. Like its relatives, it is transmitted by Aedes mosquitoes, originally in endemic regions of central Africa. Taking advantage of modern means of transportation, it started spreading, first to Micronesia in 2007, then to French Polynesia in 2013, and to Brazil and Central America in 2014.

As it has long been considered insignificant, the Zika virus has not been extensively studied and most of our current knowledge has been inferred from other viruses of the same genus. Zika virus entry into target cells can be triggered by binding to AXL and TYRO3. Interestingly, these proteins are also involved in Ebola virus and Lassa virus entry into human cells. Attachment to the host receptors is followed by internalization by a process called 'apoptotic mimicry' whereby the virus manages to be recognized by the target cell as an apoptotic body. After fusion of the virus membrane with the host endosomal membrane, the RNA genome is released into the cytoplasm. Flaviviruses are remarkable in that their genome encodes a single polyprotein that inserts into the endoplasmic reticulum (ER) membrane, forming a complex pattern. This polyprotein is subsequently cleaved into 13 molecules by viral and host peptidases. The non-structural proteins form membrane spherules, presumably to protect the double-stranded RNA intermediate of viral replication. The genomic viral RNA is replicated and translated, leading to the creation of new Zika virions in the ER. The virions bud by hijacking the host endosomal sorting complex required for transport (ESCRT) system. They are transported to the Golgi apparatus, where further maturation occurs. Eventually, fusion-competent virions are released by exocytosis.

As of this release, a Zika virus reference proteome has been manually curated in UniProtKB, where it can be safely visited.

A page dedicated to Zika has also been created in ViralZone to offer a global view of how this particular virus functions and provides access to other databases.

Cross-references to EPD

Cross-references have been added to EPD, the Encyclopedia of Proteome Dynamics, a resource that contains data from multiple, large-scale proteomics experiments aimed at characterising proteome dynamics in both human cells and model organisms.

EPD is available at https://www.peptracker.com/epd/analytics/.

The format of the explicit links is:

Resource abbreviation EPD
Resource identifier UniProtKB accession number.

Example: P00451

Show all entries having a cross-reference to EPD.

Text format

Example: P00451

DR   EPD; P00451; -.

XML format

Example: P00451

<dbReference type="EPD" id="P00451"/>

RDF format

Example: P00451

uniprot:P00451
  rdfs:seeAlso <http://purl.uniprot.org/epd/P00451> .
<http://purl.uniprot.org/epd/P00451>
  rdf:type up:Resource ;
  up:database <http://purl.uniprot.org/database/EPD> .

Cross-references to TopDownProteomics

Cross-references have been added to TopDownProteomics, a resource from the Consortium for Top Down Proteomics that hosts top down proteomics data presenting validated proteoforms to the scientific community.

TopDownProteomics is available at http://repository.topdownproteomics.org/.

The format of the explicit links is:

Resource abbreviation TopDownProteomics.
Resource identifier UniProtKB accession number.

Example: P10599

Show all entries having a cross-reference to TopDownProteomics.

Cross-references to TopDownProteomics may be isoform-specific. The general format of isoform-specific cross-references was described in release 2014_03.

Text format

Example: P10599

DR   TopDownProteomics; P10599-1; -. [P10599-1]
DR   TopDownProteomics; P10599-2; -. [P10599-2]

XML format

Example: P10599

<dbReference type="TopDownProteomics" id="P10599-1">
  <molecule id="P10599-1"/>
</dbReference>
<dbReference type="TopDownProteomics" id="P10599-2">
  <molecule id="P10599-2"/>
</dbReference>
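
For illustration, the isoform-specific XML above can be read with Python's standard `xml.etree.ElementTree` module. This is a sketch under stated assumptions: the `<entry>` wrapper and the function name `isoform_xrefs` are hypothetical, and the snippet carries no XML namespace, whereas full UniProtKB XML files declare one, so tag names would need to be namespace-qualified there.

```python
import xml.etree.ElementTree as ET

# The two cross-references from the example, wrapped in a hypothetical <entry>.
SNIPPET = """
<entry>
  <dbReference type="TopDownProteomics" id="P10599-1">
    <molecule id="P10599-1"/>
  </dbReference>
  <dbReference type="TopDownProteomics" id="P10599-2">
    <molecule id="P10599-2"/>
  </dbReference>
</entry>
"""


def isoform_xrefs(entry: ET.Element, database: str) -> list:
    """List (identifier, isoform) pairs; isoform is None when not specified."""
    pairs = []
    for ref in entry.iter("dbReference"):
        if ref.get("type") != database:
            continue
        molecule = ref.find("molecule")
        isoform = molecule.get("id") if molecule is not None else None
        pairs.append((ref.get("id"), isoform))
    return pairs


print(isoform_xrefs(ET.fromstring(SNIPPET), "TopDownProteomics"))
```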

RDF format

Example: P10599

uniprot:P10599
  rdfs:seeAlso <http://purl.uniprot.org/topdownproteomics/P10599-1> ,
    <http://purl.uniprot.org/topdownproteomics/P10599-2> .

<http://purl.uniprot.org/topdownproteomics/P10599-1>
  rdf:type up:Resource ;
  up:database <http://purl.uniprot.org/database/TopDownProteomics> .
<#_5030303735300040>
  rdf:type rdf:Statement ;
  rdf:subject <P10599> ;
  rdf:predicate rdfs:seeAlso ;
  rdf:object <http://purl.uniprot.org/topdownproteomics/P10599-1> ;
  up:sequence isoform:P10599-1 .
<http://purl.uniprot.org/topdownproteomics/P10599-2>
  rdf:type up:Resource ;
  up:database <http://purl.uniprot.org/database/TopDownProteomics> .
<#_5030303735300040>
  rdf:type rdf:Statement ;
  rdf:subject <P10599> ;
  rdf:predicate rdfs:seeAlso ;
  rdf:object <http://purl.uniprot.org/topdownproteomics/P10599-2> ;
  up:sequence isoform:P10599-2 .

Changes to the controlled vocabulary of human diseases

New diseases:

UniProt release 2016_02

Published February 17, 2016

Headline

Another one (antibiotic) bites the dust

Polymyxin E (also known as colistin) and other polymyxin antibiotics are among our last-resort drugs against multi-drug resistant Gram-negative bacteria such as Klebsiella pneumoniae, Pseudomonas aeruginosa and Acinetobacter.

The initial target of polymyxin antibiotics is the lipopolysaccharide layer (LPS) of the Gram-negative bacterial outer membrane. LPS has two 2-keto-3-deoxyoctonoic acid units bound to lipid A, which itself consists of 2 glucosamine units with attached fatty acyl chains and a phosphate group on each sugar. Lipid A acts as a hydrophobic anchor, in which the tight packing of the fatty acyl chains helps to stabilize the overall outer membrane structure. The positively charged L-2,4-diaminobutyric acid residues of polymyxins interact with the negatively charged phosphate groups on lipid A. The amphipathic antibiotics are thought to form pores that permeabilize the outer membrane. The polymyxins would then insert into and disrupt the inner membrane, leading to further pore formation. There is also some evidence that polymyxins have other intracellular targets.

As the initial contact of polymyxin antibiotics is with lipid A, resistance often occurs via its modification, frequently masking its negative charge. Before August 2015 a number of chromosomal resistance loci were known, but no resistance had been identified on a more easily transferred plasmid. During a routine surveillance of commensal Escherichia coli for antibiotic resistance, scientists in China identified mcr1, a plasmid-encoded gene which encodes a protein of the phosphoethanolamine transferase family. The gene confers both colistin and polymyxin B resistance by modifying lipid A, and probably originated in Paenibacillus. This would seem logical as Paenibacillus is the natural source of polymyxin antibiotics.

The gene was first identified on a pig farm in Shanghai in July 2013. Retrospective screening of E. coli plasmids isolated in China showed an alarming rise in its presence in pork, from 6% in 2011 to 22% in 2014. The gene has also been detected in chicken meat in China, rising from 5% in 2011 to 28% in 2014. Screening of hospital inpatients in 2014 found mcr1-containing plasmids in both E. coli (1.4%) and K. pneumoniae (0.7%). The gene was also detected in E. coli genomes from Malaysia. An in vivo test in mice showed that the gene was indeed able to confer colistin resistance. The original plasmid can transfer to other E. coli cells via conjugation, but can be introduced into K. pneumoniae or P. aeruginosa only via transformation; it is stable in the absence of selective pressure.

Since the online publication of the paper identifying mcr1 on November 15, 2015, numerous papers have appeared reporting retrospective screening for the gene. So far its earliest isolation is from a French calf in 2005, in which a worrying co-localization with a wide-spectrum beta-lactamase resistance gene was also reported. The gene has been found in human fecal samples dating from 2012 on, in Europe, Africa, South America and Asia. It was found in E. coli isolated from pigs in Germany in 2010, from Belgian calves in 2011-2012, in European food samples from June 2011 on, and from animal feces in Asia. The gene is not always isolated from the same plasmid background, and mcr1 is often associated with mobile genetic elements, probably aiding its dispersal.

In short, the gene has been slowly spreading around the world since before we were even aware of its existence. Colistin has been used in agriculture since the 1950s and is widely used in China, which is probably contributing to its steady dissemination. There are increasingly urgent calls for its agricultural use to be reevaluated before resistance spreads even further.

As of this release, Mcr-1 has been annotated and is available in UniProtKB/Swiss-Prot.

Cross-references to SwissPalm

Cross-references have been added to SwissPalm, a manually curated resource to study protein S-palmitoylation. It encompasses S-palmitoylated protein hits from more than 50 species and provides curated information and filters that increase the confidence in true positive hits. SwissPalm integrates predictions of S-palmitoylated cysteine scores, orthologs and isoform multiple alignments.

SwissPalm is available at http://swisspalm.epfl.ch/.

The format of the explicit links is:

Resource abbreviation SwissPalm
Resource identifier UniProtKB accession number.

Example: Q13530

Show all entries having a cross-reference to SwissPalm.

Text format

Example: Q13530

DR   SwissPalm; Q13530; -.

XML format

Example: Q13530

<dbReference type="SwissPalm" id="Q13530"/>

RDF format

Example: Q13530

uniprot:Q13530
  rdfs:seeAlso <http://purl.uniprot.org/swisspalm/Q13530> .
<http://purl.uniprot.org/swisspalm/Q13530>
  rdf:type up:Resource ;
  up:database <http://purl.uniprot.org/database/SwissPalm> .

Change of the cross-references to Gramene

We have modified our cross-references to the Gramene database.

The new format of the explicit links is:

Resource abbreviation Gramene
Resource identifier Transcript identifier
Optional information 1 Protein identifier
Optional information 2 Gene identifier

Cross-references to Gramene may be isoform-specific. The general format of isoform-specific cross-references was described in release 2014_03.

The Gramene database has also been moved from the category "Organism-specific databases" to the category "Genome annotation databases".

Example: Q10DK7

Show all entries having a cross-reference to Gramene.

Text format

Example: Q10DK7

Previous format:

DR   Gramene; Q10DK7; -.

New format:

DR   Gramene; OS03T0727600-01; OS03T0727600-01; OS03G0727600.

XML format

Example: Q10DK7

Previous format:

<dbReference type="Gramene" id="Q10DK7"/>

New format:

<dbReference type="Gramene" id="OS03T0727600-01">
  <property type="protein sequence ID" value="OS03T0727600-01"/>
  <property type="gene ID" value="OS03G0727600"/>
</dbReference>

This change does not affect the XSD, but may nevertheless require code changes.
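
One kind of code change this may require: a reader that treated the `id` attribute as a UniProtKB accession now receives a transcript identifier, with the protein and gene identifiers in `property` children. The sketch below handles both layouts with Python's standard library. The function name `gramene_ids` and the returned dictionary keys are hypothetical; the property type strings are copied from the example above.

```python
import xml.etree.ElementTree as ET


def gramene_ids(db_reference: ET.Element) -> dict:
    """Collect the identifiers carried by a Gramene dbReference element."""
    properties = {
        prop.get("type"): prop.get("value")
        for prop in db_reference.findall("property")
    }
    return {
        "id": db_reference.get("id"),
        # Both are absent (None) in the previous format.
        "protein": properties.get("protein sequence ID"),
        "gene": properties.get("gene ID"),
    }


previous = ET.fromstring('<dbReference type="Gramene" id="Q10DK7"/>')
current = ET.fromstring(
    '<dbReference type="Gramene" id="OS03T0727600-01">'
    '<property type="protein sequence ID" value="OS03T0727600-01"/>'
    '<property type="gene ID" value="OS03G0727600"/>'
    '</dbReference>'
)
print(gramene_ids(previous))
print(gramene_ids(current))
```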

RDF format

Example: Q10DK7

Previous format:

uniprot:Q10DK7
  rdfs:seeAlso <http://purl.uniprot.org/gramene/Q10DK7> .
<http://purl.uniprot.org/gramene/Q10DK7>
  rdf:type up:Resource ;
  up:database <http://purl.uniprot.org/database/Gramene> .

New format:

uniprot:Q10DK7
  rdfs:seeAlso <http://purl.uniprot.org/gramene/OS03T0727600-01> .
<http://purl.uniprot.org/gramene/OS03T0727600-01>
  rdf:type up:Transcript_Resource ;
  up:database <http://purl.uniprot.org/database/Gramene> ;
  up:translatedTo <http://purl.uniprot.org/gramene/OS03T0727600-01> ;
  up:transcribedFrom <http://purl.uniprot.org/gramene/OS03G0727600> .

Removal of the cross-references to GeneFarm

Cross-references to GeneFarm have been removed.

Removal of the cross-references to GenoList

Cross-references to GenoList have been removed.

Changes to the controlled vocabulary of human diseases

New diseases:

Modified diseases:

Deleted diseases

  • Periventricular nodular heterotopia 4
  • Transposition of the great arteries dextro-looped 2

UniProt website news

UniProt feature viewer added to UniProtKB entries

UniProt provides sequence annotations, a.k.a. protein features, to describe regions or sites of biological interest. Secondary structure regions, domains, post-translational modifications and binding sites, among others, play a critical role in understanding what the protein does. With the growth in biological data, integration and visualization become increasingly important for exposing different data aspects that might otherwise be hidden, unclear or difficult to grasp.

Hence we are introducing the UniProt feature viewer, a BioJS component bringing together protein sequence features in one compact view. Similar to genome viewers, the viewer uses tracks to display different protein features providing an intuitive picture of co-localized elements. Each track can be expanded to reveal a more in-depth view of the underlying data. The variant track offers a novel visualization and presents UniProt curated natural variants along with imported variants from large-scale studies (such as 1000 Genomes and COSMIC).

The UniProt feature viewer is available for every UniProtKB protein entry through the 'Feature viewer' link under the 'Display' heading on the left hand side.

If you would like to include the feature viewer in your own website or resource, you can find instructions in our technical documentation.

UniProt release 2016_01

Published January 20, 2016

Headline

cGAMP, a welcome stowaway

We are often amazed by the strategies deployed by viruses to trick our defences, but our immune system does not lag behind and it can also fool viral invaders. The detection of viruses by the innate immune system relies on the detection of intracellular DNA by pattern recognition receptors, including cyclic guanosine monophosphate (GMP) adenosine monophosphate (AMP) synthase (cGAS, also called MB21D1). In response to cytosolic DNA, this enzyme synthesizes 2'3'-cyclic GMP-AMP (cGAMP), which then binds to STING (also called TMEM173), an endoplasmic reticulum transmembrane protein, leading to the activation of the type I interferon (IFN) response, thereby inducing an antiviral state.

Last year, Gentili et al. made a puzzling observation. To study cGAS function, they transduced human monocyte-derived dendritic cells with a cGAS-expressing lentivirus. As expected, the cells were strongly activated, but the stimulatory property of the cGAS-encoding lentivirus did not correlate with the transduction efficiency. This led to the hypothesis that it was not cGAS itself that was responsible for the activation of the infected cells, but some other stimulatory signal, which was transferred by the viral vector. Indeed, when dendritic cells were challenged with virus-like particles (VLPs) that did not themselves encode cGAS, but were produced in the presence of cGAS, the cells were stimulated. This effect was abolished when VLPs were produced in the presence of a catalytically inactive cGAS mutant. Concomitantly, Bridgeman et al. found that the incubation of macrophages, epithelial cells or lung fibroblasts with lentiviral particles collected from cells overexpressing cGAS led to the STING-dependent up-regulation of type I interferons and interferon-stimulated genes. All this evidence pointed to cGAMP as the stimulatory signal and indeed both groups identified the dinucleotide in the viral particles, by mass spectrometry, not only in their experimental system, but also in more physiological settings, using a herpes virus (MCMV) and a poxvirus (Modified Vaccinia Ankara virus). It is not yet clear whether the incorporation of cGAMP into virus particles is a selective host-directed process or simply a consequence of random fluid-phase uptake of cytosolic material into viral particles.

cGAMP has previously been shown to diffuse through gap junctions, thereby alerting non-infected neighboring cells to pathogen threat. The discovery by Gentili et al. and Bridgeman et al. suggests that cells located far from the initial infection site may also benefit from cGAMP transfer and initiate rapid antiviral responses bypassing the need for cGAS activation.

Although the downstream fate of the dinucleotide does not directly depend on cGAS enzyme activity, this piece of information has been introduced into cGAS entries as of this release.

Cross-references to CollecTF

Cross-references have been added to the CollecTF database of bacterial transcription factor binding sites. CollecTF stores data on experimentally-validated TFBS and places special emphasis on providing a transparent curation process that captures the experimental support for sites as reported by authors in peer-reviewed publications.

CollecTF is available at http://www.collectf.org.

The format of the explicit links is:

Resource abbreviation CollecTF
Resource identifier CollecTF identifier

Example: A0KST7

Show all entries having a cross-reference to CollecTF.

Text format

Example: A0KST7

DR   CollecTF; EXPREG_00000150; -.

XML format

Example: A0KST7

<dbReference type="CollecTF" id="EXPREG_00000150"/>

RDF format

Example: A0KST7

uniprot:A0KST7
  rdfs:seeAlso <http://purl.uniprot.org/collectf/EXPREG_00000150> .
<http://purl.uniprot.org/collectf/EXPREG_00000150>
  rdf:type up:Resource ;
  up:database <http://purl.uniprot.org/database/CollecTF> .

Cross-references to GeneDB

Cross-references have been added to GeneDB, the pathogen genome database from the Sanger Institute. GeneDB provides access to the latest sequence data and annotation/curation for the whole range of organisms sequenced by the Sanger Pathogen group.

GeneDB is available at http://www.genedb.org.

The format of the explicit links is:

Resource abbreviation GeneDB
Resource identifier GeneDB identifier

Example: Q8WPT5

Show all entries having a cross-reference to GeneDB.

Text format

Example: Q8WPT5

DR   GeneDB; H25N7.01:pep; -.

XML format

Example: Q8WPT5

<dbReference type="GeneDB" id="H25N7.01:pep"/>

RDF format

Example: Q8WPT5

uniprot:Q8WPT5
  rdfs:seeAlso <http://purl.uniprot.org/genedb/H25N7.01:pep> .
<http://purl.uniprot.org/genedb/H25N7.01:pep>
  rdf:type up:Resource ;
  up:database <http://purl.uniprot.org/database/GeneDB> .

Cross-references to iPTMnet

Cross-references have been added to iPTMnet, an integrated resource for PTMs in a systems biology context. iPTMnet connects multiple disparate bioinformatics tools and systems (text mining, data mining, analysis and visualization tools, as well as databases and ontologies) into an integrated resource to address the knowledge gaps in exploring and discovering PTM networks. The iPTMnet database currently contains phosphorylation information.

iPTMnet is available at http://pir.georgetown.edu/iPTMnet.

The format of the explicit links is:

Resource abbreviation iPTMnet
Resource identifier UniProtKB accession number.

Example: Q15796

Show all entries having a cross-reference to iPTMnet.

Text format

Example: Q15796

DR   iPTMnet; Q15796; -.

XML format

Example: Q15796

<dbReference type="iPTMnet" id="Q15796"/>

RDF format

Example: Q15796

uniprot:Q15796
  rdfs:seeAlso <http://purl.uniprot.org/iptmnet/Q15796> .
<http://purl.uniprot.org/iptmnet/Q15796>
  rdf:type up:Resource ;
  up:database <http://purl.uniprot.org/database/iPTMnet> .

Changes to the controlled vocabulary of human diseases

New diseases:

Modified diseases:

Changes to the controlled vocabulary for PTMs

New term for the feature key 'Modified residue' ('MOD_RES' in the flat file):

  • ADP-ribosyl aspartic acid

UniProt release 2015_12

Published December 9, 2015

Headline

Host proteins SERINC3 and SERINC5 decrease HIV-1 infectivity

It has long been known that the HIV-1 nef ("negative regulatory factor") protein increases the infectivity of the HIV-1 virion (PMID:7981973). This mysterious protein is only found in primate lentiviruses. Its function is to manipulate the host's cellular machinery and thus to allow infection, survival or replication of the virus. The abundant research performed on this topic has uncovered many phenotypes associated with nef, mainly involving the restriction of host protein expression at the cell membrane. However, these various functions have not provided a clear understanding of the virion infectivity phenotype, although they have revealed how HIV-1 avoids the host's immune response.

Two recent papers in Nature have shown that nef actually prevents the incorporation of host SERINC3 and SERINC5 proteins into the HIV-1 virion. These proteins dramatically decrease virion infectivity when they are part of its membrane. These studies improve our understanding of nef function in virion infectivity. The means used by nef to achieve this function are still unknown, but are related to its capacity to prevent specific host proteins from reaching the plasma membrane. The functions of human SERINC3 and SERINC5 are still not well understood, but further study of these proteins should clarify their antiviral action.

As of this release, HIV-1 nef and human proteins SERINC3 and SERINC5 have been updated and are publicly available.

UniProtKB news

Displaying human UniProtKB sequence annotations in genome browser tracks

Genome browser tracks allow users to align sequence annotations to the reference genome data and genome annotations. Both UCSC and Ensembl genome browsers have custom tracks for displaying external annotations in their browsers. UniProt would like to announce the beta release of new genome tracks which allow the alignment of protein sequence annotations in our resource to a reference genome. These UniProt genome tracks include genomic locations of protein sequences and annotations such as active sites, metal binding sites, post-translational modifications, variants and domains with supporting literature evidence where available. Each species represented by the genome annotation tracks resource will have protein sequences and annotations defined by the BED and bigBed formats.
The beta release is available in the new dedicated 'genome_annotation_tracks' directory on the UniProt FTP site. It currently provides tracks for human; additional species will be released in the future. UniProt would welcome your feedback on this new resource.
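
As a rough illustration of how a BED track can be consumed, the sketch below parses the first six standard BED columns (chrom, chromStart, chromEnd, name, score, strand). BED start coordinates are 0-based and end coordinates are exclusive, so a feature's length is simply end minus start. The class and function names are hypothetical, the example line is invented, and any further columns the UniProt tracks may carry are ignored here.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class BedFeature:
    """The first six standard BED columns; only the first three are required."""
    chrom: str
    start: int  # 0-based, inclusive
    end: int    # exclusive
    name: Optional[str] = None
    score: Optional[int] = None
    strand: Optional[str] = None

    @property
    def length(self) -> int:
        # Half-open interval, so no +1 is needed.
        return self.end - self.start


def parse_bed_line(line: str) -> BedFeature:
    """Parse one tab-separated BED line, ignoring columns beyond the sixth."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 3:
        raise ValueError(f"BED line needs at least 3 columns: {line!r}")
    return BedFeature(
        chrom=fields[0],
        start=int(fields[1]),
        end=int(fields[2]),
        name=fields[3] if len(fields) > 3 else None,
        score=int(fields[4]) if len(fields) > 4 else None,
        strand=fields[5] if len(fields) > 5 else None,
    )


# Invented example line, not taken from a real track file.
feature = parse_bed_line("chr1\t100\t250\texample_feature\t0\t+")
print(feature, feature.length)
```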

Cross-references to SwissLipids

Cross-references have been added to SwissLipids, a comprehensive reference database that links mass spectrometry-based lipid identifications to curated knowledge of lipid structures, metabolic reactions, enzymes and interacting proteins.

SwissLipids is available at http://www.swisslipids.org.

The format of the explicit links is:

Resource abbreviation SwissLipids
Resource identifier SwissLipids identifier

Cross-references to SwissLipids may be isoform-specific (e.g. Q08477). The general format of isoform-specific cross-references was described in release 2014_03.

Example: P52824

Show all entries having a cross-reference to SwissLipids.

Text format

Example: P52824

DR   SwissLipids; SLP:000000740; -.

XML format

Example: P52824

<dbReference type="SwissLipids" id="SLP:000000740"/>

RDF format

Example: P52824

uniprot:P52824
  rdfs:seeAlso <http://purl.uniprot.org/swisslipids/SLP:000000740> .
<http://purl.uniprot.org/swisslipids/SLP:000000740>
  rdf:type up:Resource ;
  up:database <http://purl.uniprot.org/database/SwissLipids> .

Cross-references to MalaCards

Cross-references have been added to MalaCards, an integrated database of human maladies and their annotations, modeled on the architecture and richness of the popular GeneCards database of human genes.

The MalaCards disease and disorders database is organized into "disease cards", each integrating prioritized information, and listing numerous known aliases for each disease, along with a variety of annotations, as well as inter-disease connections.

MalaCards is available at http://www.malacards.org.

The format of the explicit links is:

Resource abbreviation MalaCards
Resource identifier Gene symbol

Example: P26439

Show all entries having a cross-reference to MalaCards.

Text format

Example: P26439

DR   MalaCards; HSD3B2; -.

XML format

Example: P26439

<dbReference type="MalaCards" id="HSD3B2"/>

RDF format

Example: P26439

uniprot:P26439
  rdfs:seeAlso <http://purl.uniprot.org/malacards/HSD3B2> .
<http://purl.uniprot.org/malacards/HSD3B2>
  rdf:type up:Resource ;
  up:database <http://purl.uniprot.org/database/MalaCards> .

Change of UniProtKB annotation cardinality constraints

Each UniProtKB entry may contain a variable number of different annotation topics. Most topics can be present more than once in a given entry (e.g. when a precursor protein is cleaved into chains/peptides with different functions, each one is described in a separate Function annotation). But some topics had been limited to occur no more than once per entry. We have lifted this restriction to allow for more flexibility and granularity in our annotations.

Changes to the controlled vocabulary of human diseases

New diseases:

Modified diseases:

Deleted diseases:

  • Fanconi anemia complementation group M
  • Paget disease of bone
  • Spinocerebellar ataxia, autosomal recessive, 5

Changes to the controlled vocabulary for PTMs

New term for the feature key 'Modified residue' ('MOD_RES' in the flat file):

  • PolyADP-ribosyl aspartic acid

New terms for the feature key 'Cross-link' ('CROSSLNK' in the flat file):

  • 2-(S-cysteinyl)-methionine (Cys-Met)
  • Cyclopeptide (Cys-Ile)

UniProt service news

Retirement of UniProt BioMart

Based on user surveys and service evaluations, we decided to retire our UniProt BioMart service. For those who relied on the UniProt BioMart for tasks such as ID mapping, bulk retrieval of entries, or programmatic access to entry annotations, we have alternative services that will satisfy your needs. Please visit our YouTube channel and help pages for tutorials and more information about these services:

Please contact us if you have questions about this change.

Retirement of UniProt Distributed Annotation System (DAS)

The Distributed Annotation System (DAS) defines a communication protocol used to exchange annotations on genomic or protein sequences. It was first released in 2001, and UniProt started to provide its data following the DAS protocol in July 2004. DAS has fulfilled a valuable role in integrating distributed and varied data, particularly for display in genome browsers and other applications that feature data visualisation. Unfortunately, the level of usage of DAS in 2015 can no longer justify support and maintenance, and we have therefore retired the UniProt DAS server.

Documentation on programmatic access to UniProt data can be found on the UniProt website.

Please contact us if you have questions about this change.

UniProt release 2015_11

Published November 11, 2015

Headline

The sense of a motion

No need to be a great scientist to understand that when a hawk is circling in the sky looking for food, small rodents should run and hide. This requires not the mere recognition of a static image, or of a global movement, but most importantly the ability to sense an asynchrony between a moving object (the hawk) and its background (the slow-moving clouds above it).

In vertebrates, visual motion sensing takes place in the retina and more specifically in a subset of retinal ganglion cells (RGCs). RGCs are located near the inner surface of the retina, where they receive visual information from photoreceptors via intermediate neurons, bipolar cells and amacrine cells. They extract salient features and send them deeper into the brain for further processing. The final picture is produced by the integration of many signals, each carried by a distinct population of RGCs. It is currently estimated that approximately 70 types of interneurons form specific synapses on roughly 30 types of RGCs. The discovery of the function of each RGC type and of their connections with specific interneurons is like trying to find the proverbial needle in a haystack.

Three years ago, Zhang et al. tackled this issue using a transgenic mouse line, called TYW3. In these mice, strong regulatory elements from the Thy1 gene drive the expression of yellow fluorescent protein (YFP). In the retina, YFP fluorescence could be detected in only a small subset of RGCs. The brightest cell population (W3-RGCs) was chosen for further characterization. Interestingly, these cells remained silent under most common visual inputs, including locomotion in a natural environment obtained with videos from a camera mounted on the head of a freely moving rat. The only condition that elicited reliable responses from W3-RGCs was the movement of small spots that differed from that of the background; the cells did not respond when the two movements coincided.

The canonical pathway for delivering visual input to RGCs involves direct connections between bipolar cells and RGCs. In other words, RGCs typically are two synapses away from a photoreceptor, which ensures the fastest transmission of the signal. Surprisingly, W3-RGCs receive strong and selective input from unusual excitatory amacrine cell type interneurons, called VG3-ACs. With the introduction of the VG3-AC partner to the circuit, W3-RGCs appear to be three synapses away from a photoreceptor, slowing visual information delivery to the cells. A possible explanation is that W3-RGCs compare motion in the center and surround of the receptive field, firing only when the two are asynchronous. For the comparison to be temporally precise, input from the surround must arrive at the cell rapidly and/or input from the center must be delayed.

The crucial connection between W3-RGCs and VG3-ACs is ensured by homophilic interactions between Sdk2 proteins expressed at the cell surface of both cell types. Sdk2 is a cell adhesion protein whose expression is detected in the embryonic retina shortly before birth and persists into adulthood, spanning the periods of lamina formation and synaptogenesis. Sdk2 knockout caused no alterations in retinal structure, but the strength of synaptic connections between VG3-ACs and W3-RGCs dropped about 20-fold.

For your eyes only, the Sdk2 entries have been updated and are publicly available as of this release.

UniProtKB news

Change of the cross-references to eggNOG

We have introduced an additional field in the cross-references to the eggNOG database to indicate the taxonomic scope of an orthologous group.

Text format

Example: U3JAG9

DR   eggNOG; ENOG410IEUN; Eukaryota.
DR   eggNOG; ENOG410YVPU; LUCA.

XML format

Example: U3JAG9

<dbReference type="eggNOG" id="ENOG410IEUN">
  <property type="taxonomic scope" value="Eukaryota"/>
</dbReference>
<dbReference type="eggNOG" id="ENOG410YVPU">
  <property type="taxonomic scope" value="LUCA"/>
</dbReference>

This change did not affect the XSD, but may nevertheless require code changes.
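
As an illustration of the kind of code change this may involve, the sketch below reads the new 'taxonomic scope' property from the XML shown above. It is a hedged example rather than a reference implementation: the function name eggnog_scopes and the wrapping <entry> element are assumptions made so that the fragment parses on its own, and a complete entry file may use namespace-qualified element names, which this sketch ignores.

```python
import xml.etree.ElementTree as ET

# The two cross-references from the example, wrapped in a hypothetical
# <entry> root so that the fragment is a single well-formed document.
SAMPLE = """
<entry>
  <dbReference type="eggNOG" id="ENOG410IEUN">
    <property type="taxonomic scope" value="Eukaryota"/>
  </dbReference>
  <dbReference type="eggNOG" id="ENOG410YVPU">
    <property type="taxonomic scope" value="LUCA"/>
  </dbReference>
</entry>
"""


def eggnog_scopes(xml_text):
    """Map each eggNOG group id to its taxonomic scope (None if absent)."""
    scopes = {}
    root = ET.fromstring(xml_text.strip())
    for ref in root.iter("dbReference"):
        if ref.get("type") != "eggNOG":
            continue
        scope = None
        for prop in ref.findall("property"):
            if prop.get("type") == "taxonomic scope":
                scope = prop.get("value")
        scopes[ref.get("id")] = scope
    return scopes


print(eggnog_scopes(SAMPLE))
```

Code that previously assumed eggNOG cross-references had no child elements would need a similar lookup to pick up the new field.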

RDF format

Example: U3JAG9

uniprot:U3JAG9
  rdfs:seeAlso <http://purl.uniprot.org/eggnog/ENOG410IEUN> ,
               <http://purl.uniprot.org/eggnog/ENOG410YVPU> .
<http://purl.uniprot.org/eggnog/ENOG410IEUN>
  rdf:type up:Resource ;
  up:database <http://purl.uniprot.org/database/eggNOG> ;
  rdfs:comment "Eukaryota" .
<http://purl.uniprot.org/eggnog/ENOG410YVPU>
  rdf:type up:Resource ;
  up:database <http://purl.uniprot.org/database/eggNOG> ;
  rdfs:comment "LUCA" .

Changes to the controlled vocabulary of human diseases

New diseases:

Changes to keywords

New keyword:

Changes in subcellular location controlled vocabulary

New subcellular locations:

UniProt release 2015_10

Published October 14, 2015

Headline

The smell of the sea in UniProtKB

Memories left by a walk on the seashore bring into play all our senses, of which smell is not the least. This characteristic 'smell of the sea' is carried by a little molecule, dimethylsulfide (DMS), which is an enzymatic cleavage product of dimethylsulfoniopropionate (DMSP).

DMSP is one of the most abundant organic molecules in the world, with a billion tons made and turned over every year. It is produced by marine macroalgae, as well as by single-cell phytoplankton species, such as diatoms, dinoflagellates and haptophytes, and occurs at high concentrations in their cytoplasm. The physiological function of DMSP is not yet fully established. It is thought to function as an osmolyte. It has also been proposed to serve as a cryoprotectant in polar algae. DMSP enzymatic cleavage products, DMS and acrylate, are quite effective at scavenging free radicals and other reactive oxygen species. Hence they may serve as an antioxidant system.

In healthy growing phytoplankton, DMSP freely diffuses in the cytoplasm, and only minute quantities are released. This amount is sufficient to attract zooplankton which start feeding on algae. Organisms grazed upon or infected by viruses as well as stressed or senescent cells release greater amounts of DMSP, which is taken up by bacterioplankton, metabolized into DMS and used as a source of carbon and sulfur. DMS is not only used by seawater microorganisms, it is also volatile and a small fraction of it is released into the atmosphere where it creates an olfactory landscape providing seabirds with orientation cues to potential food supplies. In the atmosphere, DMS is oxidized to sulfuric acid and becomes an important source of sulfate aerosols. These act as condensation nuclei, causing water molecules to coalesce and clouds to form. The cycle is closed when rain brings back the sulfur-containing particles into the ocean. Interestingly, phytoplankton appear to convert DMSP into DMS very rapidly when they are stressed by UV radiation. The local increase in volatile DMS increases cloud formation, hence decreasing direct sunlight exposure and relieving stress. Through this mechanism, plankton may shape local weather for their own benefit.

DMS release by seaweed was described in 1935 and DMSP was identified as its precursor almost 70 years ago, but the enzyme catalyzing the reaction remained elusive until last June. Using classical biochemical approaches, as well as genomic and proteomic analyses, Alcolombri et al. identified ALMA1 from the chloroplastic membrane fraction of the coccolithophore alga Emiliania huxleyi, an abundant bloom-forming marine phytoplankton. This enzyme is a redox-sensitive homotetramer that belongs to the aspartate/glutamate racemase superfamily and catalyzes DMSP cleavage into DMS and acrylate. Phylogenetic studies show the presence of numerous ALMA1 homologs in major, globally distributed phytoplankton taxa and in other marine organisms. This major discovery paves the way for future investigations on the physiological role of DMS and may allow quantification of the relative biogeochemical contribution of algae and bacteria to global DMS production.

If you want to take a deep, though virtual breath of sea smell, you can visit ALMA1 entries that are available to you as of this release.

UniProtKB news

Cross-references to WBParaSite

Cross-references have been added to WBParaSite, an open access resource providing access to the genome sequences, genome browsers, semi-automatic annotation and comparative genomics analysis of parasitic worms (helminths). WormBase ParaSite is closely integrated with and complementary to the main WormBase resource, the central focus of which is the model nematode Caenorhabditis elegans and its close relatives.

WBParaSite is available at http://parasite.wormbase.org.

The format of the explicit links is:

Resource abbreviation   WBParaSite
Resource identifier     Transcript identifier
Optional information 1  Protein identifier
Optional information 2  Gene identifier

Cross-references to WBParaSite may be isoform-specific. The general format of isoform-specific cross-references was described in release 2014_03.

Example: A8PGQ3

Show all entries having a cross-reference to WBParaSite.

Text format

Example: A8PGQ3

DR   WBParaSite; Bm6838; Bm6838; WBGene00227099.

XML format

Example: A8PGQ3

<dbReference type="WBParaSite" id="Bm6838">
  <property type="protein sequence ID" value="Bm6838"/>
  <property type="gene ID" value="WBGene00227099"/>
</dbReference>

RDF format

Example: A8PGQ3

uniprot:A8PGQ3
  rdfs:seeAlso <http://purl.uniprot.org/wbparasite/Bm6838> .
<http://purl.uniprot.org/wbparasite/Bm6838>
  rdf:type up:Transcript_Resource ;
  up:database <http://purl.uniprot.org/database/WBParaSite> ;
  up:translatedTo <http://purl.uniprot.org/wbparasite/Bm6838> ;
  up:transcribedFrom <http://purl.uniprot.org/wbparasite/WBGene00227099> .
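
To make the correspondence between the three formats explicit, the sketch below derives the statements of the RDF example from the three identifiers of the text format (transcript, protein, gene). It is an illustration under stated assumptions: the function name wbparasite_triples is hypothetical, the URI prefix is taken from the example above, and the output is a list of plain tuples rather than a serialized RDF graph.

```python
BASE = "http://purl.uniprot.org/wbparasite/"  # prefix as seen in the example
DATABASE = "<http://purl.uniprot.org/database/WBParaSite>"


def wbparasite_triples(accession, transcript_id, protein_id, gene_id):
    """Return (subject, predicate, object) tuples mirroring the RDF example."""
    transcript = "<%s%s>" % (BASE, transcript_id)
    return [
        ("uniprot:" + accession, "rdfs:seeAlso", transcript),
        (transcript, "rdf:type", "up:Transcript_Resource"),
        (transcript, "up:database", DATABASE),
        (transcript, "up:translatedTo", "<%s%s>" % (BASE, protein_id)),
        (transcript, "up:transcribedFrom", "<%s%s>" % (BASE, gene_id)),
    ]


for triple in wbparasite_triples("A8PGQ3", "Bm6838", "Bm6838", "WBGene00227099"):
    print(" ".join(triple))
```

The resource identifier of the DR line becomes the subject of the resource description, while the two optional fields become the objects of up:translatedTo and up:transcribedFrom.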

Removal of the cross-references to CYGD

Cross-references to CYGD have been removed.

Changes to the controlled vocabulary of human diseases

New diseases:

Modified diseases:

UniParc news

UniParc cross-reference types changes

UniParc and UniProtKB entries both contain cross-references to external databases. For consistency reasons we have adjusted the names of these databases in UniParc to the ones in UniProtKB. In particular we have changed the following types of cross-references in UniParc:

Old type           New type
ENSEMBL            Ensembl
FLYBASE            FlyBase
H_INV              H-InvDB
REFSEQ             RefSeq
TAIR_ARABIDOPSIS   TAIR
WORMBASE           WormBase
WormBase ParaSite  WBParaSite

Example:

Previous XML:

<dbReference type="WormBase ParaSite" id="A_03330" version_i="1" active="Y" created="2014-09-12" last="2015-07-09">
  <property type="NCBI_taxonomy_id" value="6185"/>
</dbReference>

New XML:

<dbReference type="WBParaSite" id="A_03330" version_i="1" active="Y" created="2014-09-12" last="2015-07-09">
  <property type="NCBI_taxonomy_id" value="6185"/>
</dbReference>
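
Client code that matches on the UniParc cross-reference type may need to translate old names into new ones. One possible approach is a simple lookup table, sketched below; the names TYPE_RENAMES and new_type are hypothetical, and the mapping contains only the renames listed in the table above.

```python
# Old UniParc cross-reference type -> new type, as listed in the table above.
TYPE_RENAMES = {
    "ENSEMBL": "Ensembl",
    "FLYBASE": "FlyBase",
    "H_INV": "H-InvDB",
    "REFSEQ": "RefSeq",
    "TAIR_ARABIDOPSIS": "TAIR",
    "WORMBASE": "WormBase",
    "WormBase ParaSite": "WBParaSite",
}


def new_type(old_type):
    """Return the renamed type, or the input unchanged if it was not renamed."""
    return TYPE_RENAMES.get(old_type, old_type)


print(new_type("WormBase ParaSite"))
print(new_type("EMBL"))
```

Types that are not in the table are passed through unchanged, so the same function can be applied to data from before and after the rename.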

UniProt release 2015_09

Published September 16, 2015

Headline

Life (and death) in 2D

While the cinema industry struggles to produce ever more realistic 3D, even 4D, films out of 2D images, scientists have achieved the exact opposite: in a collection of (3D) vertebrate embryos, they have identified a mutant that flattens in the course of development.

Vertebrates have a defined body shape in which correct tissue and organ shape and alignment are essential for function. Correct morphogenesis depends on force generation, force transmission through the tissue, and the response of tissues and extracellular matrix to force. In addition, embryos must be able to withstand environmental perturbations, such as gravity. Already in 1917, in his master work "On Growth and Form", Sir D'Arcy Wentworth Thompson postulated that "the forms as well as the actions of our bodies are entirely conditioned (save for certain exceptions in the case of aquatic animals) by the strength of gravity upon this globe". It is actually from an "aquatic animal", a fish, that the confirmation of this hypothesis came earlier this year. Screening of Japanese rice fish mutants identified an embryo that displayed pronounced body flattening around stage 25-28 (50-64 h post fertilization). Although general development was not delayed, the mutant exhibited delayed blastopore closure and progressive body collapse from mid-neurulation, surviving until just before hatching. This mutant was aptly named hirame, which means flatfish in Japanese. When embryos were grown in agarose, their collapse correlated with the direction of gravity, reflecting the mutant's inability to withstand external forces. The mutants also showed defective fibronectin fibril formation.

The hirame mutation lies within the Yap1 gene and creates a premature stop codon at position 164. Yap1 is a transcriptional co-activator that promotes proliferation and inhibits cell death during embryonic development. Porazinski and colleagues showed that Yap1 is also essential for actomyosin-mediated tissue tension.

The hypothesis with the strongest experimental support is that YAP1 acts on ARHGAP18 expression (and possibly that of other ARHGAP18-related genes), which in turn regulates cortical actomyosin network formation. Actomyosin contraction promotes fibronectin assembly, which could be a critical in vivo mechanism for the integration of mechanical signals, such as tension generated by actomyosin, with biochemical signals, such as integrin signaling, ensuring proper tissue shape and alignment and appropriate organ and body shape.

YAP1 knockdown in the human cell line hTERT-RPE1 caused a phenotype reminiscent of the fish embryo phenotype. When cultured in a 3D spheroid system, these retinal epithelial cells also exhibited collapse upon exposure to external forces, marked reduction of cortical F-actin bundles and lack of typical fibronectin fibril pattern. This suggests that YAP1 orthologs may play a similar role in all vertebrates, and possibly beyond.

As of this release, YAP1 protein entries have been updated and are publicly available.

UniProtKB news

Release of variation files for 27 new species

In collaboration with Ensembl and Ensembl Genomes, UniProt would like to announce the release of variation files for 27 species in addition to the human, mouse and zebrafish files currently available in the dedicated variants directory on the UniProt FTP sites. This release includes a further 13 vertebrate species, including agriculturally important species: cow, chicken, pig and sheep. These new variant catalogues also expand the diversity of species with variants for plant, fungi and protist species, including rice, bread wheat, barley and grape.

Changes to the controlled vocabulary of human diseases

New diseases:

Modified diseases:

UniProt release 2015_08

Published July 22, 2015

Headline

Pseudo-allergy, real progress

Do you sniffle and sneeze as trees start to bloom and the pollen gets airborne? Your mast cells are to blame. These cells reside at strategic anatomical positions, such as skin, gastrointestinal tract and lung, and provide us with a first line of defence against potential harm from our environment. Besides their beneficial functions, mast cells can also react to compounds that do not represent any threat to our health, such as pollen. This process begins with the interaction of an antigen with immunoglobulin E (IgE) bound to high affinity Fc epsilon receptors at the mast cell surface. It ends with the release of histamine and various inflammatory and immunomodulatory substances, which causes allergy. Most adverse reactions to peptidergic and small molecule therapeutic agents, collectively called basic secretagogues, also rely on mast cell stimulation, but do not correlate with IgE antibody titer. They proceed through a different, not yet fully understood, IgE-independent mechanism called pseudo-allergy, which eventually also leads to the release of granule-stored histamine. In humans, MRGPRX2 has been proposed, among others, to serve as a receptor for basic secretagogues, but until recently there was no direct proof of its involvement.

Earlier this year, McNeil et al. showed that "basic secretagogues activate mouse mast cells in vitro and in vivo through a single receptor, Mrgprb2, the ortholog of the human G-protein-coupled receptor MRGPRX2". The first achievement of this study was to prove the orthology of these 2 genes, which was not an easy task. In humans, MRGPRX2 is found in a cluster with 3 other MRGPRX family members. This cluster is dramatically expanded in mouse, with 22 potential protein-coding genes that show comparable sequence identity to MRGPRX2. To establish orthology, the authors used 2 criteria: expression pattern (expression in mast cells) and pharmacology (some 16 compounds were tested for mast cell activation). Then Mrgprb2 knockout mice were created. Gene targeting was performed using a zinc-finger-nuclease-based strategy, as a classical homologous recombination approach was impossible in this genomic locus due to too many repetitive sequences. The null animals showed no visible phenotype in normal conditions, but didn't produce any pseudo-allergic reaction in response to small-molecule therapeutic drugs. Secretagogue-induced histamine release, inflammation and airway contraction were abolished.

This elegant study does not deal simply with the identification of "just another receptor". It addresses an issue that may concern all of us at some point in our lives. Basic secretagogues are compounds that are frequently encountered either in natural fluids, such as the wasp venom toxin mastoparan, or in various drugs, such as cationic peptidergic drugs, antibiotics (fluoroquinolone family), neuromuscular blocking agents, etc. The latter are routinely used in surgery to reduce unwanted muscle movement and are responsible for nearly 60% of allergic reactions in a surgical setting. The majority of these compounds activate mast cells in an Mrgprb2-dependent manner. The animal model created by McNeil et al. could then be used for pre-clinical testing of new drugs in order to minimize pseudo-allergic risks. In addition, the identification of a motif common to several Mrgprb2 agonists may allow the prediction of side effects of clinically used compounds.

As of this release, primate MRGPRX2 and mouse Mrgprb2 entries have been updated and are publicly available.

UniProt service news

Programmatic access to UniProt with sparql.uniprot.org

We are happy to announce the public release of the UniProt SPARQL endpoint at sparql.uniprot.org, where you can also find links to the documentation of the UniProt RDF data model and an interactive query interface with sample queries to get you started.

For those unfamiliar with SPARQL, this is a W3C standardized query language for the Semantic Web. If you know SQL, it will look familiar to you and you can do similar types of queries with it. SPARQL also allows you to query and combine data from a variety of SPARQL endpoints, providing a valuable low-cost alternative to building your own data warehouse. You can combine UniProt data from sparql.uniprot.org with that from the SPARQL endpoints hosted by the EBI's RDF platform, the SIB's neXtProt SPARQL endpoint, etc.
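
As a rough illustration of programmatic access, the sketch below prepares an HTTP request for a small SELECT query using only the Python standard library. Several details are assumptions rather than documented facts from this announcement: the endpoint path https://sparql.uniprot.org/sparql, the use of a GET request with a 'query' parameter (the convention of the SPARQL protocol), and the example query itself, which relies on the up:Protein class of the UniProt core vocabulary. The request is built but deliberately not sent.

```python
import urllib.parse
import urllib.request

# Assumed endpoint location; check the service documentation before use.
ENDPOINT = "https://sparql.uniprot.org/sparql"

# Illustrative query: list a handful of protein resources.
QUERY = """
PREFIX up: <http://purl.uniprot.org/core/>
SELECT ?protein
WHERE { ?protein a up:Protein . }
LIMIT 5
"""


def build_request(query, endpoint=ENDPOINT):
    """Prepare (but do not send) a GET request for a SPARQL SELECT query."""
    url = endpoint + "?" + urllib.parse.urlencode({"query": query})
    # Ask for the standard JSON serialization of SPARQL results.
    headers = {"Accept": "application/sparql-results+json"}
    return urllib.request.Request(url, headers=headers)


request = build_request(QUERY)
print(request.full_url)
# To execute the query: urllib.request.urlopen(request).read()
```

Keeping request construction separate from execution makes it easy to inspect the encoded query, or to point the same code at another SPARQL endpoint.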

We look forward to feedback from the community to help us improve this service further.

UniProtKB news

Addition of human somatic protein altering variants from COSMIC

The Catalogue of Somatic Mutations in Cancer (COSMIC) is a database of manually curated somatic variants from peer-reviewed publications and genome-wide studies. UniProt, in collaboration with COSMIC, has integrated the protein-altering variants of COSMIC release v71 into the homo_sapiens_variation.txt.gz file. The COSMIC variants provide the standard information found in the homo_sapiens_variation.txt.gz file and, within the Phenotype/Disease field, additional information on the primary tissue(s) in which the variant was found.

Changes to the humdisease.txt file

We have added cross-references to MedGen to the humdisease.txt file. MedGen, the NCBI portal to information about human genetic disorders, aggregates multiple disease names, medical terms and information for the same disorder from various sources into a specific concept. Each MedGen concept has a Concept Unique Identifier (CUI) that allows computational access to global disease information. Together with disease nomenclature, this includes disease definitions, clinical findings, available clinical and research tests, molecular resources, professional guidelines, original and review literature, consumer resources, clinical trials, and Web links to other related resources. MedGen is a valuable resource to allow UniProtKB users to access an extensive range of biomedical data.

Changes to the controlled vocabulary of human diseases

New diseases:

Modified diseases:

Deleted diseases:

  • Blepharophimosis-ptosis-intellectual disability syndrome
  • Ehlers-Danlos syndrome 2

UniProt release 2015_07

Published June 24, 2015

Headline

Coding-non-coding RNAs: a game of hide-and-seek

It is well established that microRNAs (miRNAs) are small eukaryotic non-coding RNA molecules that repress the expression of their target genes. miRNAs are transcribed by RNA polymerase II as large primary transcripts (pri-miRNAs) that share the same characteristics as all other RNA polymerase II-transcribed RNAs, such as the presence of a 5'-cap and a 3'-poly(A) tail. pri-miRNAs are processed to smaller pre-miRNAs, which in turn are cleaved to produce mature miRNAs. In animals, this final maturation step occurs in the cytoplasm, while in plants it takes place in the nucleus. Cytosolic mature miRNAs guide the RNA-induced silencing complex (RISC) in repressing target genes through either cleavage or translational repression of their mRNAs.

A recent article published in Nature revealed that plant pri-miRNAs may not be as non-coding as previously assumed. Some do actually encode small regulatory peptides, called miPEPs, which enhance the accumulation of their corresponding mature miRNAs. This has been shown for Medicago truncatula pri-miR171b and Arabidopsis thaliana pri-miR165a which encode miPEP171b and miPEP165a, respectively. These two 20- and 18-amino acid-long peptides have been shown to be translated in vivo and to promote the transcription of their pri-miRNAs, resulting in the accumulation of mature miR171b and miR165a. This increase leads to the reduction of lateral root development in the case of miR171b and stimulation of main root growth for miR165a. The same effects were observed when synthetic peptides were applied to plants, suggesting that miPEPs might have agronomical applications.

Five other pri-miRNAs were experimentally shown to encode active miPEPs, suggesting that the presence of such small regulatory peptides may be widespread in plants. Computer analysis of the 5'-end of 50 pri-miRNAs in Arabidopsis thaliana revealed that all of them contained at least one ORF, which, if translated, could give rise to 3- to 59-amino acid-long peptides of unknown biological activity. No common signature was found among them, possibly due to the specificity of each putative miPEP for its own pri-miRNA.

Arabidopsis thaliana miPEP165a, miPEP160b, miPEP164a and miPEP319a and Medicago truncatula miPEP171b peptides have been manually annotated and are integrated into UniProtKB/Swiss-Prot as of this release. The sequences of the other 2 Medicago truncatula functionally characterized peptides, miPEP169d and miPEP171e, are unfortunately not available.

UniProtKB news

Cross-references to ESTHER

Cross-references have been added to ESTHER, a database of the Alpha/Beta-hydrolase fold superfamily of proteins.

ESTHER is available at http://bioweb.ensam.inra.fr/ESTHER/general?what=index.

The format of the explicit links is:

Resource abbreviation   ESTHER
Resource identifier     Gene locus.
Optional information 1  Family name.

Example: P0C064

Show all entries having a cross-reference to ESTHER.

Text format

Example: P0C064

DR   ESTHER; bacbr-grsb; Thioesterase.

XML format

Example: P0C064

<dbReference type="ESTHER" id="bacbr-grsb">
  <property type="family name" value="Thioesterase"/>
</dbReference>

Cross-references to Genevisible

Cross-references have been added to Genevisible, a search portal to normalized and curated expression data from GENEVESTIGATOR.

Genevisible is available at http://genevisible.com/search.

The format of the explicit links is:

Resource abbreviation   Genevisible
Resource identifier     Gene identifier.
Optional information 1  Organism code.

Example: P31946

Show all entries having a cross-reference to Genevisible.

Text format

Example: P31946

DR   Genevisible; P31946; HS.

XML format

Example: P31946

<dbReference type="Genevisible" id="P31946">
  <property type="organism ID" value="HS"/>
</dbReference>

Removal of the cross-references to Genevestigator

Cross-references to Genevestigator have been removed.

Change of the cross-references to PomBase

Cross-references to PomBase may now optionally indicate a gene designation in order to align them with the format of other model organism databases.

Text format

Example: Q9P3A7

DR   PomBase; SPAC1565.08; cdc48.

Example: O60058

DR   PomBase; SPBC56F2.07c; -.

XML format

Example: Q9P3A7

<dbReference type="PomBase" id="SPAC1565.08">
  <property type="gene designation" value="cdc48"/>
</dbReference>

Example: O60058

<dbReference type="PomBase" id="SPBC56F2.07c"/>

This change did not affect the XSD, but may nevertheless require code changes.
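
The relationship between the two formats can be sketched as follows: a gene designation in the text format becomes a 'gene designation' property in XML, while '-' means that no property element is present. This is a minimal illustration with a hypothetical function name (pombase_dr_to_element); it assumes the DR line layout of the examples above and builds plain elements without any XML namespace.

```python
import xml.etree.ElementTree as ET


def pombase_dr_to_element(line):
    """Convert a PomBase DR line into the equivalent <dbReference> element."""
    prefix = "DR   PomBase;"
    if not line.startswith(prefix):
        raise ValueError("not a PomBase DR line: %r" % line)
    body = line[len(prefix):].strip()
    if body.endswith("."):
        body = body[:-1]  # drop only the terminating period
    identifier, designation = [part.strip() for part in body.split(";")]
    element = ET.Element("dbReference", {"type": "PomBase", "id": identifier})
    if designation != "-":  # '-' means there is no gene designation
        ET.SubElement(
            element, "property",
            {"type": "gene designation", "value": designation},
        )
    return element


with_gene = pombase_dr_to_element("DR   PomBase; SPAC1565.08; cdc48.")
without_gene = pombase_dr_to_element("DR   PomBase; SPBC56F2.07c; -.")
print(ET.tostring(with_gene, encoding="unicode"))
print(ET.tostring(without_gene, encoding="unicode"))
```

Parsers should therefore treat the property as optional rather than expecting it on every PomBase cross-reference.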

Changes to the controlled vocabulary of human diseases

New diseases:

Modified diseases:

Deleted disease:

  • Hypogonadism LHB-related

Changes to keywords

New keywords:

UniProt release 2015_06

Published May 27, 2015

Headline

POLQ, a new target for cancer therapy?

DNA double-strand breaks (DSBs) are our worst cellular enemy, yet they do occur all the time, often accidentally, as a result of endogenous metabolic reactions and replication stress. They can also be induced by exogenous sources, like radiation or exposure of cells to DNA-damaging agents, or serve as intermediates in a number of programmed recombination events, during meiosis or assembly of immunoglobulins or T-cell receptors. Whatever their origin, DSBs are highly toxic to cells if not repaired, and if repaired incorrectly, they can cause deletions, translocations, and fusions in the DNA, which can have dramatic consequences.

The most frequently used mechanisms for DSB repair are homologous recombination (HR) and non-homologous end-joining (NHEJ), but alternative forms of end-joining exist, such as microhomology-mediated end-joining (MMEJ). HR is highly accurate and therefore important for preserving genome integrity. NHEJ results in small deletions of less than 10 bp. The most error-prone is MMEJ, which promotes inter- and intrachromosome rearrangements associated with relatively large DNA deletions (30-200 bp).

While NHEJ preferentially acts on 'blunt-ended' DNA breaks, HR is preceded by resection of DNA around the 5'-ends of the break. RAD51 proteins bind to the resulting 3' single-stranded overhangs and help them to recognize complementary (homologous) DNA in another intact DNA helix. The overhangs then invade the homologous double-strand and use it as a template for repair. MMEJ also starts with DNA resected ends, but in this case it is DNA polymerase theta (POLQ) that directly binds them and enables short (2-6 bp) homologous DNA sequences in overhangs to form base pairs. The homology can be either terminal, or internal, as far as 5 nucleotides away from the 3' terminus. Once homology has been found, each DNA strand is extended from the base-paired region using the opposing overhang as a template, and, in case of internal homology, the terminal unpaired regions are removed.

Normal cells tend to down-regulate POLQ. Cancer cells, which exhibit HR deficiency due to mutations in genes involved in HR repair, tend to up-regulate POLQ. This allows them to limit DNA damage and survive, although at the expense of genome integrity. In these cells, increased levels of POLQ will further inhibit HR, by binding to RAD51 proteins and preventing their accumulation at resected DNA ends.

Cytotoxic drugs used for cancer therapy promote DSBs in order to overwhelm DNA repair mechanisms and induce cell death. Could the use of POLQ inhibitors, alone or in combination with other DNA damaging drugs, improve the treatment of HR-deficient tumors? It's too early to tell, but preliminary results suggest that it is worth investigating. Indeed, knockdown of POLQ in HR-deficient cells reduces cell survival following treatment with cisplatin or mitomycin C, and human tumor cells expressing shRNA against both FANCD2 (HR knockdown) and POLQ (MMEJ knockdown) do not grow in mice.

At the beginning of this year, POLQ was in the spotlight thanks to 3 very interesting publications, which shed light on its role and mode of action. UniProtKB/Swiss-Prot POLQ entries have been updated accordingly and are publicly available as of this release.

UniProt release 2015_05

Published April 29, 2015

Headline

A never-ending race between evolution and genomic integrity

Primate evolution has been accompanied by several waves of retrotransposon insertions. Nowadays about 50% of our genome is composed of endogenous retroelements (EREs). Although many of them have lost their transposition ability, some remain quite active. For instance, among the 500,000 copies of long interspersed element-1 (LINE1 or L1) present in the human genome, about 100 are retrotransposition-competent, and over 40 of them are highly active. Other EREs, such as short interspersed nuclear elements (SINEs), including Alu repeats, and SINE-VNTR-Alu (SVA), a composite hominid-restricted ERE, also actively move in the genome. It is currently estimated that new, non-parental L1 integrations occur in nearly 1/100 births and roughly every 20th newborn baby has a new Alu retrotransposon somewhere in its DNA.

Obviously, having DNA jumping around our genome may be quite harmful, and our cells work hard to repress EREs. Transcriptional silencing is controlled by TRIM28 and KRAB domain-containing zinc finger proteins (KRAB-ZNFs). TRIM28 forms a repressive complex (KAP1 complex) by interacting with CHD3, a subunit of the nucleosome remodeling and deacetylation (NuRD) complex, and SETDB1, which specifically methylates histone H3 at 'Lys-9', inducing heterochromatinization. KRAB-ZNFs bind DNA and recruit the KAP1 complex to target sites.

KRAB-ZNF genes are one of the fastest growing gene families in primates, possibly to limit the activity of newly emerged ERE classes. This hypothesis has gained support in an elegant study recently published in Nature. In this article, Jacobs et al. used a heterologous cell system in which murine embryonic stem cells harbored a copy of human chromosome 11, which contains a number of EREs, including SVA and the L1 subfamily L1PA. In this cellular environment, the primate-specific EREs were derepressed. Individual overexpression of highly expressed human KRAB-ZNFs, confirmed by reporter gene assays, allowed the identification of genes involved in the repression of specific ERE (sub)families: ZNF91 and ZNF93, which acted on SVA and L1PA4, respectively. The authors then traced back the phylogenetic history of these genes in the primate lineage and analyzed the parallel evolution of their target EREs. They could show that a new wave of L1PA insertions in great ape genomes was made possible through the deletion of a 129-bp element in L1PA3, which destroyed the ZNF93-binding site. This could be interpreted as an ERE response to a series of structural changes in ZNF93 that occurred shortly before and improved host repression of L1PA activity.

In conclusion, the expansion of a new ERE drives the evolution of a host repressor which leads to a subsequent change in ERE to escape repression, and so on. It is a never-ending race of our genome with itself, which leads inexorably to greater and greater complexity.

As of this release, updated human ZNF91 and ZNF93 entries are available in UniProtKB/Swiss-Prot.

UniProtKB news

Removal of IPI species proteome data sets from FTP site

Since the closure of IPI in 2011, UniProt has provided proteome data sets for IPI species on its FTP site. In UniProt release 2015_03, we started to provide new data sets for reference proteomes, which also cover the IPI species, and we have now removed the old 'proteomes' FTP directory that contained only data for the IPI species.

UniProtKB XSD change for evidence attribution

We have made the following changes to the UniProtKB XSD to allow a more fine-grained attribution of evidence to the parts of comment annotations that contain "free-text" descriptions:

  • The cardinality of all existing text elements was changed from maxOccurs="1" to maxOccurs="unbounded".
  • The phDependence, redoxPotential and temperatureDependence child elements of the bpcCommentGroup now have a sequence of text child elements.
  • The note child element of the isoformType was replaced by a sequence of text child elements.

The XSD changes are shown below:

<xs:complexType name="commentType">
        ...
            <xs:element name="text" type="evidencedStringType" minOccurs="0" maxOccurs="unbounded"/>
        ...
    <xs:group name="bpcCommentGroup">
       ...
             <xs:element name="absorption" minOccurs="0">
                ...
                        <xs:element name="text" type="evidencedStringType" minOccurs="0" maxOccurs="unbounded"/>
                ...
            <xs:element name="kinetics" minOccurs="0">
                ...
                        <xs:element name="text" type="evidencedStringType" minOccurs="0" maxOccurs="unbounded"/>
                ...

            <!-- The following 3 elements will in future each have a sequence of <text> child elements:
            <xs:element name="phDependence" type="evidencedStringType" minOccurs="0"/>
            <xs:element name="redoxPotential" type="evidencedStringType" minOccurs="0"/>
            <xs:element name="temperatureDependence" type="evidencedStringType" minOccurs="0"/>
            -->
            <xs:element name="phDependence" minOccurs="0">
                <xs:complexType>
                    <xs:sequence>
                        <xs:element name="text" type="evidencedStringType" maxOccurs="unbounded"/>
                    </xs:sequence>
                </xs:complexType>
            </xs:element>
            <xs:element name="redoxPotential" minOccurs="0">
                <xs:complexType>
                    <xs:sequence>
                        <xs:element name="text" type="evidencedStringType" maxOccurs="unbounded"/>
                    </xs:sequence>
                </xs:complexType>
            </xs:element>
            <xs:element name="temperatureDependence" minOccurs="0">
                <xs:complexType>
                    <xs:sequence>
                        <xs:element name="text" type="evidencedStringType" maxOccurs="unbounded"/>
                    </xs:sequence>
                </xs:complexType>
            </xs:element>
        ...
    <xs:complexType name="isoformType">
        ...
            <!-- The <note> element will be replaced by a sequence of <text> elements:
            <xs:element name="note" minOccurs="0">
                <xs:complexType>
                    <xs:simpleContent>
                        <xs:extension base="xs:string">
                            <xs:attribute name="evidence" type="intListType" use="optional"/>
                        </xs:extension>
                    </xs:simpleContent>
                </xs:complexType>
            </xs:element>
            -->
            <xs:element name="text" type="evidencedStringType" minOccurs="0" maxOccurs="unbounded"/>
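
For code that reads UniProtKB XML, the practical consequence of the new cardinality is that a comment can no longer be assumed to hold a single text element. The following Python sketch shows one way to collect every text element of a comment together with its evidence keys. The sample fragment, its evidence values and the helper name comment_texts are invented for this illustration; only the element names and the http://uniprot.org/uniprot namespace are taken from the UniProtKB XML format.

```python
import xml.etree.ElementTree as ET

UP = "http://uniprot.org/uniprot"

# Hypothetical fragment: one comment carrying two separately evidenced texts.
SAMPLE = """<entry xmlns="http://uniprot.org/uniprot">
  <comment type="function">
    <text evidence="1 2">First statement.</text>
    <text evidence="3">Second statement.</text>
  </comment>
</entry>"""


def comment_texts(entry):
    """Yield (comment type, text, evidence keys) for every <text> element."""
    for comment in entry.findall("{%s}comment" % UP):
        # iter() also reaches <text> elements nested deeper in the comment,
        # e.g. below phDependence or kinetics.
        for text in comment.iter("{%s}text" % UP):
            keys = [int(k) for k in text.get("evidence", "").split()]
            yield comment.get("type"), text.text, keys


rows = list(comment_texts(ET.fromstring(SAMPLE)))
for row in rows:
    print(row)
```

Iterating over all matches rather than calling find() once keeps the reader working whether a comment has zero, one or many text elements.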

Cross-references to BioMuta

Cross-references have been added to BioMuta, a curated single-nucleotide variation and disease association database.

BioMuta is available at https://hive.biochemistry.gwu.edu/tools/biomuta/.

The format of the explicit links is:

Resource abbreviation BioMuta
Resource identifier Gene name.

Example: P02787

Show all entries having a cross-reference to BioMuta.

Text format

Example: P02787

DR   BioMuta; TF; -.

XML format

Example: P02787

<dbReference type="BioMuta" id="TF"/>
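
To illustrate how the text and XML representations of a cross-reference correspond, the Python sketch below splits a flat-file DR line into its fields and builds a matching dbReference element. The function names are invented for this example, and the parsing is deliberately naive (it assumes that no field contains the "; " separator), so it should be read as an illustration and not as a complete flat-file parser.

```python
import xml.etree.ElementTree as ET


def parse_dr_line(line):
    """Split 'DR   DB; ID; INFO.' into (database, identifier, extra fields)."""
    if not line.startswith("DR   "):
        raise ValueError("not a DR line: %r" % line)
    body = line[5:].rstrip()
    if body.endswith("."):
        body = body[:-1]
    fields = [f.strip() for f in body.split("; ")]
    database, identifier = fields[0], fields[1]
    # A lone '-' marks an empty optional field in the flat file.
    extras = [f for f in fields[2:] if f != "-"]
    return database, identifier, extras


def to_db_reference(database, identifier):
    """Build the minimal XML form of a cross-reference without properties."""
    return ET.Element("dbReference", {"type": database, "id": identifier})


db, ident, extras = parse_dr_line("DR   BioMuta; TF; -.")
xml_text = ET.tostring(to_db_reference(db, ident), encoding="unicode")
print(db, ident, extras)
print(xml_text)
```

The serialized element may differ from the example in insignificant whitespace, so comparisons are best made on the parsed attributes and not on the raw string.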

Changes to the controlled vocabulary of human diseases

New diseases:

Modified diseases:

Changes to the controlled vocabulary for PTMs

New term for the feature key 'Lipidation' ('LIPID' in the flat file):

  • O-palmitoleyl serine

UniProt release 2015_04

Published April 1, 2015

Headline

Of CAT tails and protein translation by-products

Correct translation of mRNA into functional proteins is an essential cellular process. Defects in translation not only deprive cells of proteins needed for almost any task, but also produce by-products that can negatively impact these tasks and be toxic. Therefore translational garbage has to be removed.

One source of errors is defective ribosomes that stop during translation and hence produce incomplete polypeptide chains. All organisms have evolved mechanisms to manage translation arrest. In eukaryotes, ribosome stalling induces dissociation of the small 40S subunit and recruitment of the 'ribosome quality control complex' (RQC) to the large 60S subunit. RQC mediates the ubiquitination and degradation of the incompletely synthesized polypeptide chains.

Over the past few years, the mode of action of RQC has begun to be elucidated. The molecular components of RQC include listerin, an E3 ubiquitin ligase encoded by RKR1 in yeast and LTN1 in mammals, the AAA adenosine triphosphatase CDC48/VCP/p97 and ubiquitin-binding cofactors, as well as 2 proteins of unknown function. Listerin mediates the ubiquitination of the stalled polypeptide and subsequent recruitment of CDC48/VCP/p97 to the complex. The ATPase may provide the mechanical force to allow extraction of the nascent chain and its delivery to the proteasome for degradation.

Three recent studies have addressed the function of one of the uncharacterized proteins of the complex, called RQC2 in yeast and NEMF in mammals. In mammals, NEMF/RQC2 is responsible for the selective recognition of stalled 60S subunits. It does so by making multiple simultaneous contacts with 60S and peptidyl-tRNA to sense nascent chain occupancy. NEMF/RQC2 is also important for the stable association of listerin with the complex. Work in yeast not only corroborates these findings, but also reveals another unexpected function for NEMF/RQC2. NEMF/RQC2 recruits alanine- and threonine-charged tRNAs to the ribosomal A site and directs the elongation of stalled nascent chains independently of mRNA or 40S subunits, leading to non-templated C-terminal Ala and Thr extensions, aptly named CAT tails. The exact function of CAT tails is still under investigation, but they seem to induce an HSF1-dependent heat shock response in yeast through a mechanism that is yet to be determined. The heat shock response may help cells to buffer against malformed proteins. Alternatively, the extension at the C-terminus may serve to test the functional integrity of large ribosomal subunits, so that the cell can detect and dispose of defective large subunits that induce stalling.

mRNA-independent polypeptide biosynthesis has already been described in microorganisms. Classical examples of such peptides are peptide antibiotics, including actinomycin, bacitracin, colistin, and polymyxin B. In addition, in Staphylococcus aureus, pentaglycines acting as cross-linkers in the cell wall peptidoglycan are synthesized in the absence of mRNA. Although still considered a very marginal event, the assembly of amino acids without an mRNA blueprint might be more widespread than previously anticipated.

As of this release, updated yeast RQC2 and mammalian NEMF entries are available in UniProtKB/Swiss-Prot.

UniProtKB news

Reducing redundancy in proteomes

The UniProt Knowledgebase (UniProtKB) has witnessed an exponential growth in the last few years with a two-fold increase in the number of entries in 2014. This follows the vastly increased submission of multiple genomes for the same or closely related organisms. This increase has been accompanied by a high level of redundancy in UniProtKB/TrEMBL and many sequences are over-represented in the database. This is especially true for bacterial species where different strains of the same species have been sequenced and submitted (e.g. 1,692 strains of Mycobacterium tuberculosis, corresponding to 5.97 million entries). To reduce this redundancy, we have developed a procedure to identify highly redundant proteomes within species groups using a combination of manual and automatic methods. We have applied this procedure to bacterial proteomes (which constituted 81% of UniProtKB/TrEMBL in release 2015_03) and sequences corresponding to redundant proteomes (47 million entries) have been removed from UniProtKB. These sequences are still available in the UniParc sequence archive dataset within UniProt. From now on, we will no longer create new UniProtKB/TrEMBL records for proteomes identified as redundant.

Protein sequences belonging to proteomes that are not identified as redundant remain in UniProtKB. All proteomes are searchable through the UniProt website's Proteomes pages. Sequences corresponding to redundant proteomes are available for download from UniParc and you will also be directed to alternate non-redundant proteome(s) available for the same species. The history (i.e. previous versions) of redundant UniProtKB records is still available.

Changes to the controlled vocabulary of human diseases

New diseases:

Modified diseases:

Deleted disease:

  • Acid phosphatase deficiency

Changes to keywords

Modified keyword:

UniMES news

Retirement of UniProt Metagenomic and Environmental Sequences (UniMES)

The UniProt Metagenomic and Environmental Sequences (UniMES) database was developed as a repository for metagenomic and environmental data. UniProt has retired UniMES as there is now a resource at the EBI that is dedicated to serving metagenomic researchers. Henceforth, we recommend using the EBI Metagenomics portal instead. In addition to providing a repository of metagenomics sequence data, EBI Metagenomics allows you to view functional and taxonomic analyses and to submit your own samples for analysis.

UniProt release 2015_03

Published March 4, 2015

Headline

Regulation of translation initiation through folding

Many physiopathological events, such as stress or nutrient deprivation, induce rapid changes in cellular protein levels. In these cases, cells preferentially use translational control of existing mRNAs over transcriptional control, since the latter generates a slower response. Translation can be divided into 4 steps, initiation, elongation, termination, and ribosome recycling, but most regulation occurs at the initiation level.

In eukaryotes, translation initiation involves recruitment of the 40S ribosome to mRNA by the eukaryotic initiation factor 4F (eIF4F) complex. This complex is composed of eIF4E, which binds to the mRNA 5' cap structure, eIF4A, an RNA helicase, and eIF4G, a scaffolding protein. Availability of eIF4E is rate-limiting in this process and it is an important target for control. Under stress or starvation conditions, when translation has to be rapidly repressed, eIF4E binding proteins (4E-BPs) interact with eIF4E, outcompeting eIF4G, hence preventing eIF4F assembly and cap-dependent translation initiation. Three 4E-BPs have been identified in mammals. 4E-BP2 (EIF4EBP2) is one of them. It is an intrinsically disordered protein (IDP) that contains several phosphorylation sites. In its unphosphorylated state, 4E-BP2 interacts with eIF4E via 2 domains: a YXXXXLΦ motif (residues 54 through 60) and a secondary dynamic motif (residues 78 through 82). The unphosphorylated (or minimally phosphorylated), eIF4E-binding form of EIF4EBP2 is unstable and targeted for degradation via the ubiquitin-proteasome pathway. By contrast, highly phosphorylated 4E-BP2 is very stable, but only weakly binds to eIF4E and hence can be outcompeted by eIF4G, allowing translation to occur.

How does phosphorylation regulate 4E-BP2 interaction with eIF4E and its stability? It has been recently shown that phosphorylation induces a widespread disorder-to-order transition occurring in 2 steps. First, phosphorylation at Thr-37 and Thr-46 by MTOR induces folding of residues Pro-18 to Arg-62 into a four-stranded β-domain that sequesters the helical YXXXXLΦ motif into a partially buried β-strand, blocking accessibility to eIF4E. The folding also protects Lys-57 from ubiquitination, preventing proteasomal degradation. This ordered structure is further stabilized by phosphorylation at Ser-65, Thr-70 and Ser-83. The fully phosphorylated protein has an affinity for eIF4E 4,000-fold lower than the unphosphorylated form. This observation implies that binding must be coupled to unfolding in order to free the YXXXXLΦ motif, and this is indeed what is observed experimentally. When the phosphorylated form binds eIF4E, it undergoes an order-to-disorder transition, as suggested by NMR spectra that are similar to those of the unphosphorylated form.

Although it has long been suspected that the function of IDPs may be controlled by post-translational modifications (PTMs), this is the first report experimentally showing how a PTM can fold an entire domain. These new data have been annotated into UniProtKB/Swiss-Prot and as of this release, the updated EIF4EBP2 entry is publicly available.

UniProtKB news

New proteomics mapping files

Mappings of UniProt Knowledgebase (UniProtKB) human sequences to identified human peptides from public mass spectrometry (MS) proteomics repositories can now be found in the new dedicated 'proteomics_mapping' directory on the UniProt FTP site together with a description of how the mappings were generated. The mappings are based on our analysis of the content of those MS proteomics repositories that openly share with us their data and quality metrics concerning peptide identifications.

Mass spectrometry provides direct experimental evidence for the existence of proteins and these new peptide mappings greatly increase the proportion of human sequences in UniProtKB whose existence is supported by experimental proteomics data. The human reference proteome currently contains 89383 sequences and our analysis provides mass spectrometry evidence for 68229 of those sequences.

In future UniProt releases, we expect to add data from more MS proteomics repositories and additional species. We very much welcome the feedback of the community on our efforts.

New FTP repository for reference proteomes

Based on a gene-centric perspective, the UniProt Knowledgebase (UniProtKB) has started to provide data sets for reference proteomes, which can be found in the new reference_proteomes directory.

As of release 2015_03, it encompasses 1933 species distributed in Eukaryota, Archaea and Bacteria. Viruses will be added in the next release.

Removal of the cross-references to PhosSite

Cross-references to PhosSite have been removed.

Removal of the cross-references to PptaseDB

Cross-references to PptaseDB have been removed.

Changes to the controlled vocabulary of human diseases

New diseases:

Modified disease:

Deleted diseases:

  • Amelogenesis imperfecta and gingival fibromatosis syndrome
  • Glycogen storage disease 14
  • Ichthyosis, autosomal recessive, with hypotrichosis
  • Loeys-Dietz syndrome 2A
  • Loeys-Dietz syndrome 2B
  • Leigh syndrome, X-linked
  • Mental retardation, X-linked 59

Changes to keywords

New keyword:

UniParc news

UniParc cross-references with proteome identifier and component

The UniParc XML format uses dbReference elements to represent cross-references to external database records that contain the same sequence as the UniParc record. Additional information about an external database record is provided with different types of property child elements. We have introduced two new types for cross-references to external database records from which UniProt proteomes are derived: The type "proteome_id" shows the identifier of the corresponding UniProt proteome and the type "component" the genomic component which encodes the protein. As a first step, we have added this information to bacterial ENA records.

Example:

<entry dataset="uniparc">
    <accession>UPI0000131B78</accession>
    <dbReference type="EMBL" id="AAK44239" version_i="1" active="Y" version="1" created="2003-03-12" last="2014-11-23">
        <property type="NCBI_GI" value="13879058"/>
        <property type="NCBI_taxonomy_id" value="83331"/>
        <property type="protein_name" value="serine/threonine protein kinase"/>
        <property type="gene_name" value="MT0017"/>
        <property type="proteome_id" value="UP000001020"/>
        <property type="component" value="Chromosome"/>
    </dbReference>
    <dbReference type="EMBL" id="ABQ71734" version_i="1" active="Y" version="1" created="2007-07-09" last="2014-11-23">
        <property type="NCBI_GI" value="148503925"/>
        <property type="NCBI_taxonomy_id" value="419947"/>
        <property type="protein_name" value="serine/threonine protein kinase"/>
        <property type="gene_name" value="pknB"/>
        <property type="proteome_id" value="UP000001988"/>
        <property type="component" value="Chromosome"/>
    </dbReference>
    ...
    <dbReference type="EMBL_CON" id="EFD75652" version_i="1" active="Y" version="2" created="2011-12-05" last="2014-11-23">
        <property type="NCBI_taxonomy_id" value="537209"/>
        <property type="protein_name" value="transmembrane serine/threonine-protein kinase B pknB"/>
        <property type="gene_name" value="TBIG_00439"/>
        <property type="proteome_id" value="UP000004676"/>
        <property type="component" value="Unassembled WGS sequence"/>
    </dbReference>
    ...
</entry>

This change did not affect the UniParc XSD, but may nevertheless require code changes.
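
One such code change might be to look properties up by their type instead of relying on their position or number. The Python sketch below does this for a trimmed copy of the example. The third dbReference, the helper name proteome_links and the omission of the XML namespace that real UniParc files declare are all assumptions made to keep the illustration short.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the example; the last dbReference is hypothetical and
# stands for a record that has no proteome information.
SAMPLE = """<entry dataset="uniparc">
  <accession>UPI0000131B78</accession>
  <dbReference type="EMBL" id="AAK44239" active="Y">
    <property type="NCBI_taxonomy_id" value="83331"/>
    <property type="proteome_id" value="UP000001020"/>
    <property type="component" value="Chromosome"/>
  </dbReference>
  <dbReference type="EMBL_CON" id="EFD75652" active="Y">
    <property type="proteome_id" value="UP000004676"/>
    <property type="component" value="Unassembled WGS sequence"/>
  </dbReference>
  <dbReference type="EMBL" id="HYPOTHETICAL1" active="Y">
    <property type="gene_name" value="example"/>
  </dbReference>
</entry>"""


def proteome_links(entry):
    """Map each cross-referenced record id to (proteome_id, component)."""
    links = {}
    for ref in entry.iter("dbReference"):
        # Index properties by type; if a type repeats, the last value wins.
        props = {p.get("type"): p.get("value") for p in ref.findall("property")}
        if "proteome_id" in props:
            links[ref.get("id")] = (props["proteome_id"], props.get("component"))
    return links


links = proteome_links(ET.fromstring(SAMPLE))
for record_id, link in links.items():
    print(record_id, link)
```

Records without a proteome_id property are simply skipped, so the same code handles cross-references that have not yet received the new information.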

UniProt RDF news

UniProt RDF files compressed with XZ instead of gzip

The UniProt RDF distribution has been available on the UniProt FTP site as gzip compressed RDF/XML files since 2008. We have now changed the compression algorithm from gzip to XZ, which has a number of features that make it a better choice for the UniProt RDF data:

  • It reduces the file size by approximately 23%, which improves FTP download time.
  • It can be decompressed in parallel, which can give faster decompression rates on current hardware with a minimum of 6-8 CPU cores.
  • It allows random access.
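
For readers who process the RDF files programmatically, the following Python sketch shows one way to stream an XZ-compressed RDF/XML file with the standard lzma module, without decompressing it to disk first. The tiny in-memory document and the helper name count_lines are stand-ins invented for the example; with a real download, the path of the .xz file would be passed to lzma.open instead.

```python
import io
import lzma

# Stand-in for a downloaded file: a tiny RDF/XML document compressed in memory.
RDF_XML = (
    '<?xml version="1.0"?>\n'
    '<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">\n'
    '</rdf:RDF>\n'
)
compressed = lzma.compress(RDF_XML.encode("utf-8"))  # default container is .xz


def count_lines(fileobj):
    """Stream-decompress an .xz file object and count its text lines."""
    with lzma.open(fileobj, mode="rt", encoding="utf-8") as handle:
        return sum(1 for _ in handle)


n_lines = count_lines(io.BytesIO(compressed))
print(n_lines)
```

Note that this module reads the stream sequentially in a single thread, so the parallel decompression and random access mentioned in the list require other tools.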

Replacement of UniProt RDF file go.rdf with go.owl

The go.rdf file in the UniProt RDF distribution on the UniProt FTP site has been replaced with a go.owl file. It contains a subset of the official go.owl distribution of the Gene Ontology consortium, taken as a snapshot that is in sync with the GO annotations in the UniProt Knowledgebase.

In practical terms this means:

UniProt release 2015_02

Published February 4, 2015

Headline

Mosquitoes prefer humans

Blood-feeding is extremely unusual in insects. Among the 1 to 10 million insect species, only some 10,000 feed on blood, and among these, only 100 target humans. Not only is this behavior rare in terms of species, but within one species, it may be gender-specific. However, this small proportion of insects has a dramatic impact on human health. Female mosquitoes are major vectors of human diseases, such as malaria, dengue, yellow fever and chikungunya. Mosquitoes' preference for humans is a matter of evolution. Aedes aegypti, the main vector of dengue and yellow fevers, actually exists as 2 subspecies, Aedes aegypti aegypti, feeding on human blood, and Aedes aegypti formosus, a generalist, zoophilic mosquito. It is currently thought that Aedes aegypti aegypti originated from a small population of forest-dwelling Aedes aegypti that became isolated in North Africa when a period of severe drought began in the Sahara approximately 4,000 years ago. The mosquito adapted to these harsh conditions, evolved a preference for breeding in artificial water storage containers and specialized in biting humans. This "domestic" form was reintroduced along the coast of East Africa following human movement and trade, and spread across much of the tropical and subtropical world. Today, along the coasts of Kenya, the 2 subspecies coexist, sometimes just a few hundred meters apart, domestic Aedes aegypti aegypti found in homes, laying eggs in water stored in containers indoors, and the forest Aedes aegypti formosus avoiding human settlements, laying eggs in tree holes outdoors.

What is the genetic basis underlying the mosquito's preference for humans? In order to answer this question, McBride et al. established 29 colonies of each Aedes aegypti subspecies. They observed that, contrary to their forest counterparts, domestic females showed a strong preference for human odor as compared to guinea pig, and were also more responsive in assays in which insects were directly exposed to live hosts, i.e. an anaesthetized guinea pig and a human arm (the owner of which should be congratulated for her commitment). Analysis of gene expression in antennae, the major olfactory organ, in both subspecies revealed almost 1,000 differentially expressed genes and among them, odorant receptors, a family of insect chemosensory receptors, were significantly overrepresented. Odorant receptor 4 (Or4) was of particular interest. It was upregulated in human-preferring mosquitoes, and also the 2nd most highly expressed odorant receptor in the antennae of domestic females. In addition, Or4 exhibited extensive variations that might affect its function. Or4 responds to sulcatone, a volatile odorant produced by a variety of animals and plants, but whose levels in humans are uniquely high. Seven major Or4 alleles have been identified. Alleles A, B, C, F, and G were highly sensitive to sulcatone, whereas D and E were much less sensitive. Interestingly, human-preferring colonies from various African, Asian and American countries were dominated by A-like alleles, whereas animal-preferring colonies were highly variable. This suggests that both Or4 expression levels and ligand-sensitivity play a role in human preference. Surprisingly, sulcatone has been described as a mosquito repellent at certain concentrations. McBride et al. hypothesized that it could be a repellent at high concentrations and an attractant at lower levels.

The important behavioral (r)evolution from the ancestral Aedes aegypti formosus to Aedes aegypti aegypti is unlikely to be due to a single gene, but at least Or4 is one genetic element clearly associated with these changes. The corresponding Or4 UniProtKB entry has been manually annotated and is publicly available as of this release.

UniProtKB news

Changes to the controlled vocabulary of human diseases

New diseases:

Modified diseases:

Deleted diseases:

  • Amelogenesis imperfecta and gingival fibromatosis syndrome
  • Glycogen storage disease 14
  • Ichthyosis, autosomal recessive, with hypotrichosis
  • Leigh syndrome, X-linked
  • Loeys-Dietz syndrome 2A
  • Loeys-Dietz syndrome 2B
  • Mental retardation, X-linked 59

Changes to keywords

New keyword:

Changes in subcellular location controlled vocabulary

New subcellular locations:

UniProt release 2015_01

Published January 7, 2015

Headline

Thalidomide, the pharmacological version of yin and yang

In the 1950s, the German company Chemie Gruenenthal brought a new drug to the market, thalidomide. It was primarily used as a sedative, but as it also had anti-emetic properties, it soon became popular to alleviate "morning sickness" in pregnant women. About 10,000 children were born to women taking thalidomide. They exhibited severe malformations, affecting limbs, ears, heart and other internal organs and only 50% survived. By the early sixties, the teratogenic effect of thalidomide had been established and its use discontinued. However, scientists' interest in this molecule never stopped. In 1965, thalidomide was shown to have immunomodulatory and anti-inflammatory properties in patients with erythema nodosum leprosum, an inflammatory complication of leprosy. More recently, thalidomide proved to be effective against several hematological cancers, including multiple myeloma, inhibiting cancer cell proliferation and modulating the immune system and the tumor microenvironment.

In 60 years, observations on thalidomide effects have accumulated, but its mode of action is still not fully elucidated. Nevertheless, some major steps have been accomplished to achieve this aim. A major breakthrough came in 2010 when thalidomide's primary target, a protein called cereblon (CRBN), was identified. CRBN is a component of a ubiquitin E3 complex, called CRL4. This complex is made of at least 4 proteins, CUL4, DDB1, RBX1 and CRBN. Each protein has its specific function. CUL4 provides a scaffold for assembly of RBX1 and DDB1, RBX1 is the docking site for the activated E2 protein, and DDB1 recruits substrate-specificity receptors, such as CRBN, that form the substrate-presenting side of the CRL4 complex. The recently published CRL4 3D structure revealed that the ligase arm of CUL4 is quite mobile, establishing a ubiquitination zone. As it is a promiscuous enzyme, any lysine crossing this zone may be a target.

How does thalidomide affect CRBN activity within the CRL4 complex? In the presence of thalidomide, 2 transcription factors, IKZF1 and IKZF3, are recognized by CRBN and targeted for destruction by the proteasome. Neither of these proteins is a substrate in the absence of the drug. Under normal conditions, IKZF1 and IKZF3 regulate B- and T-cell development. IKZF1 suppresses the expression of IL2 in T-cells and stimulates the expression of IRF4. This observation sheds light upon the immunomodulatory effects of thalidomide. What about endogenous CRBN substrates? Until recently, none were known. Last July, Fischer et al. published the results of their search for proteins whose ubiquitination by CRL4/CRBN was inhibited by thalidomide (or thalidomide derivatives) and identified MEIS2, a homeodomain-containing protein. MEIS2 has been implicated in some aspects of normal human development. In bats, differential MEIS2 expression has been observed during limb development. A failure in limb development is a very striking feature of "thalidomide babies". Hence MEIS2 may be a candidate for some aspects of thalidomide-induced teratogenicity.

Based on 3D structure analysis of the CRL4 complex, a model has been proposed in which thalidomide binds to CRBN at the canonical substrate-binding site. This interferes with the binding of endogenous CRBN substrates, impairs their ubiquitination and subsequent destruction, and results in their up-regulation. Conversely, the presence of thalidomide modifies the CRBN surface, creating a new binding site for neo-substrates, leading to their down-regulation.

As of this release, the updated versions of the CRBN, DDB1, CUL4B and RBX1 entries are available in UniProtKB/Swiss-Prot.

UniProtKB news

Cross-references to UniProt Proteomes

For several years now, UniProt has been providing 'proteome' sets of proteins thought to be expressed by organisms whose genomes have been completely sequenced. In the past, these sets were based on the taxonomy of the organisms, but as more and more genomes of the same organism are being sequenced, we have recently introduced unique proteome identifiers to distinguish individual proteomes. These proteomes can be queried and downloaded from the new Proteomes section of the UniProt website. UniProtKB entries that are part of a proteome now have a cross-reference to their proteome and, where known, we also indicate the name of the component that encodes the respective protein.

UniProt Proteomes are available at http://www.uniprot.org/proteomes/.

The format of the explicit links is:

Resource abbreviation Proteomes
Resource identifier Proteome identifier.
Optional information 1 Component name.

Example: P78363

Text format

Example: P78363

DR   Proteomes; UP000005640; Chromosome 1.

XML format

Example: P78363

<dbReference type="Proteomes" id="UP000005640">
  <property type="component" value="Chromosome 1"/>
</dbReference>

RDF format

In the RDF format, we have introduced a new property proteome to represent a proteomes resource. The component is indicated by a relative URI reference.

Example: P78363

uniprot:P78363
  up:proteome <http://purl.uniprot.org/proteomes/UP000005640#Chromosome%201> .
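
Because the component is carried in the fragment of the proteome URI and is percent-encoded, a consumer has to split and decode it to recover the same values as in the text and XML formats. The Python sketch below shows one way to do this with the standard urllib.parse module. The helper names are invented for the example, and the URI layout is assumed to follow the single example shown here.

```python
from urllib.parse import quote, unquote, urldefrag


def split_proteome_uri(uri):
    """Return (proteome identifier, component name or None) from the URI."""
    base, fragment = urldefrag(uri)
    proteome_id = base.rstrip("/").rsplit("/", 1)[-1]
    component = unquote(fragment) if fragment else None
    return proteome_id, component


def build_proteome_uri(proteome_id, component):
    """Inverse helper: percent-encode the component into the fragment."""
    return "http://purl.uniprot.org/proteomes/%s#%s" % (proteome_id, quote(component))


uri = "http://purl.uniprot.org/proteomes/UP000005640#Chromosome%201"
parsed = split_proteome_uri(uri)
print(parsed)
```

The round trip through build_proteome_uri is included only to check that decoding and encoding are consistent for this example.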

Cross-references to DEPOD

Cross-references have been added to DEPOD, the human DEPhOsphorylation Database.

DEPOD is available at http://www.koehn.embl.de/depod/.

The format of the explicit links is:

Resource abbreviation DEPOD
Resource identifier UniProtKB accession number.

Example: Q99502

Show all entries having a cross-reference to DEPOD.

Text format

Example: Q99502

DR   DEPOD; Q99502; -.

XML format

Example: Q99502

<dbReference type="DEPOD" id="Q99502"/>

Cross-references to MoonProt

Cross-references have been added to MoonProt, a manually curated database containing information about the known moonlighting proteins.

MoonProt is available at http://www.moonlightingproteins.org/.

The format of the explicit links is:

Resource abbreviation MoonProt
Resource identifier UniProtKB accession number.

Example: P31230

Show all entries having a cross-reference to MoonProt.

Text format

Example: P31230

DR   MoonProt; P31230; -.

XML format

Example: P31230

<dbReference type="MoonProt" id="P31230"/>

Changes to the controlled vocabulary of human diseases

New diseases:

UniProt release 2014_11

Published November 26, 2014

Headline

Higher and higher

It is in human nature to push back the frontiers of what is possible. Modern humans left Africa and conquered the world. During their exploration, they met other humans who had already colonized the most improbable places tens of thousands of years earlier, maybe themselves being driven by the same urge to discover new horizons. Among the most challenging dwelling places is the Tibetan plateau, with an average elevation exceeding 4,500 meters. At this altitude, the oxygen concentration is only 60% of that available at sea level. Nevertheless, the Tibetan plateau is thought to have been inhabited for some 25,000 years.

To maintain oxygen homeostasis at high altitude (over 2,500 meters), the body responds in various ways, including increasing ventilation over the short term and increasing red blood cell production over the long term (see review). Hypoxia-inducible factor (HIF) plays a key role in the regulation of gene transcription in this process. HIF is a dimer composed of a common subunit beta, called ARNT, and 1 of 3 alpha subunits, called HIF1A, EPAS1, or HIF3A. Under normoxic conditions, HIFs-alpha are hydroxylated by prolyl hydroxylases EGLN1 (also known as PHD2), EGLN2 or EGLN3. Hydroxylation allows interaction with an E3-ubiquitin ligase, named VHL, followed by proteasomal degradation. Under hypoxic conditions, hydroxylation is arrested and HIFs-alpha are stabilized. They dimerize with ARNT and initiate the hypoxia response transcriptional program, which includes the stimulation of erythropoiesis. Strikingly, Tibetans exhibit a blunted erythropoietic response and their hemoglobin concentration is maintained at values expected at sea-level.

In 2010, 3 independent publications identified genes or loci showing evidence of hypoxia adaptation in Tibetans. All 3 studies pointed to 2 genes, among many others, being significantly associated with the decreased hemoglobin phenotype. They are EPAS1 and EGLN1. Interestingly, Tibetans may have inherited EPAS1 SNPs from Denisova man, an archaic Homo species identified in the Altai mountains of Siberia. The Tibetan-specific EGLN1 variant is more recent, currently estimated to have appeared some 8,000 years ago. It contains 2 single amino acid polymorphisms: p.Asp4Glu and p.Cys127Ser. Some characterization of this double variant came in September this year. Lorenzo et al. showed that it exhibited a lower K(m) value for oxygen, suggesting that it promotes increased HIF-alpha hydroxylation and degradation under hypoxic conditions. It could hence abrogate hypoxia-induced and HIF-mediated augmentation of erythropoiesis. Song et al. reported that the double variant specifically interferes with binding to PTGES3 (also called HSP90 cochaperone p23), but not to other known EGLN1 ligands, including FKBP8 or HSP90AB. As PTGES3-binding may facilitate HIF-alpha hydroxylation, a perturbation in this interaction would actually decrease HIF-alpha hydroxylation, hence decreased degradation and consequently increased HIF activity. The central question about the functional consequences of the Tibetan EGLN1 variant remains open...

It is not yet clear how high-altitude populations adapted to their harsh environment, but we are at least beginning to grasp the amazing complexity of this phenomenon. The scientific community has mostly studied 3 populations: Tibetans, Andeans and Ethiopians settled on the Simien plateau. They all exhibit patterns of genetic adaptation largely distinct from one another, and the overlap is surprisingly low. The polymorphisms identified so far may not be straightforward loss- or gain-of-function variants; they may instead fine-tune complex interactions in which several proteins, possibly themselves carrying adaptive variations, are involved in a tissue-specific context.

As of this release, the UniProtKB/Swiss-Prot human EGLN1 entry has been updated with the new characterization data of the p.[Asp4Glu; Cys127Ser] polymorphism. On the new UniProt website, this information can be found in the 'Sequences' section, 'Polymorphism' and 'Natural variant' subsections.

UniProtKB news

New mouse and zebrafish variation files

We would like to announce the release of two additional species, mouse and zebrafish, to the set of variation files available in the dedicated variants directory on the UniProt FTP sites. Both files catalogue protein-altering Single Nucleotide Variants (SNVs or SNPs), stop-gained and stop-lost variants for UniProtKB/Swiss-Prot and UniProtKB/TrEMBL sequences of each species. These variants have been automatically mapped to UniProtKB sequences, including isoform sequences, through Ensembl. We very much welcome the feedback of the community on our efforts.

Structuring of 'cofactor' annotations

We have structured the previously free text cofactor annotations in UniProtKB and mapped individual cofactors to ChEBI identifiers. How this affects different UniProtKB distribution formats is described below.

Text format

CC   -!- COFACTOR:( <molecule>:)?
(CC       Name=<cofactor>; Xref=<database>:<identifier>;( Evidence={<evidence>};)?)*
(CC       Note=<free text>;)?

Note: Perl-style multipliers indicate whether a pattern (as delimited by parentheses) is optional (?) or may occur 0 or more times (*).

A cofactor annotation consists of:

  • An optional <molecule> value that indicates the isoform, chain or peptide to which this annotation applies.
  • Zero or more cofactors that are each described with:
    • A Name= field that shows the cofactor name.
    • An Xref= field that shows a cross-reference to the corresponding ChEBI record.
    • An optional Evidence= field that provides the evidence for the cofactor (see Evidence in the UniProtKB flat file format)
  • An optional Note= field that provides additional information.

Each cofactor description and the optional Note= field start on a new line. Lines are wrapped at a line length of 75 characters and indented to increase readability.

Examples:

  • Protein binds alternate/several cofactors
    CC   -!- COFACTOR:
    CC       Name=Mg(2+); Xref=ChEBI:CHEBI:18420;
    CC         Evidence={ECO:0000255|HAMAP-Rule:MF_00086};
    CC       Name=Co(2+); Xref=ChEBI:CHEBI:48828;
    CC         Evidence={ECO:0000255|HAMAP-Rule:MF_00086};
    CC       Note=Binds 2 divalent ions per subunit (magnesium or cobalt).
    CC       {ECO:0000255|HAMAP-Rule:MF_00086};
    CC   -!- COFACTOR:
    CC       Name=K(+); Xref=ChEBI:CHEBI:29103;
    CC         Evidence={ECO:0000255|HAMAP-Rule:MF_00086};
    CC       Note=Binds 1 potassium ion per subunit. {ECO:0000255|HAMAP-
    CC       Rule:MF_00086};
    
  • Isoforms
    CC   -!- COFACTOR: Isoform 1:
    CC       Name=Zn(2+); Xref=ChEBI:CHEBI:29105;
    CC         Evidence={ECO:0000269|PubMed:16683188};
    CC       Note=Isoform 1 binds 3 Zn(2+) ions. {ECO:0000269|PubMed:16683188};
    CC   -!- COFACTOR: Isoform 2:
    CC       Name=Zn(2+); Xref=ChEBI:CHEBI:29105;
    CC         Evidence={ECO:0000269|PubMed:16683188};
    CC       Note=Isoform 2 binds 2 Zn(2+) ions. {ECO:0000269|PubMed:16683188};
    
  • Chains
    CC   -!- COFACTOR: Serine protease NS3:
    CC       Name=Zn(2+); Xref=ChEBI:CHEBI:29105;
    CC         Evidence={ECO:0000269|PubMed:9060645};
    CC       Note=Binds 1 zinc ion. {ECO:0000269|PubMed:9060645};
    CC   -!- COFACTOR: Non-structural protein 5A:
    CC       Name=Zn(2+); Xref=ChEBI:CHEBI:29105; Evidence={ECO:0000250};
    CC       Note=Binds 1 zinc ion in the NS5A N-terminal domain.
    CC       {ECO:0000250};
    
  • Cofactor unknown
    CC   -!- COFACTOR:
    CC       Note=Does not require a metal cofactor.
    CC       {ECO:0000269|PubMed:24450804};
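
The structure described above can be read back with a small parser. The following is only a sketch under stated assumptions, not UniProt's own parser: it takes the CC lines of a single COFACTOR annotation, treats every line that does not start with Name= or Note= as a continuation of the previous item, and assumes that a line wrapped after a hyphen continues without a space. The names parse_cofactor_block and COFACTOR_RE are hypothetical. The Note= value is returned as is, including any trailing evidence.

```python
import re

# One cofactor item after unwrapping: Name=...; Xref=<db>:<id>; [Evidence={...};]
COFACTOR_RE = re.compile(
    r"Name=(?P<name>.+?); Xref=(?P<db>[^:;]+):(?P<id>[^;]+);"
    r"(?: Evidence=\{(?P<evidence>[^}]*)\};)?"
)


def _join(previous, addition):
    # Assumption: a line wrapped after a hyphen continues without a space.
    if previous.endswith("-"):
        return previous + addition
    return previous + " " + addition


def parse_cofactor_block(lines):
    """Parse the CC lines of one COFACTOR annotation into a dict."""
    body = [line.strip()[2:].strip() for line in lines]  # drop the 'CC' line code
    header = body[0]
    if not header.startswith("-!- COFACTOR:"):
        raise ValueError("not a COFACTOR annotation")
    molecule = header[len("-!- COFACTOR:"):].strip().rstrip(":") or None

    items = []  # one unwrapped string per Name= or Note= item
    for text in body[1:]:
        if text.startswith(("Name=", "Note=")) or not items:
            items.append(text)
        else:
            items[-1] = _join(items[-1], text)

    result = {"molecule": molecule, "cofactors": [], "note": None}
    for item in items:
        if item.startswith("Note="):
            result["note"] = item[len("Note="):]
        else:
            match = COFACTOR_RE.fullmatch(item)
            if match is None:
                raise ValueError("unrecognised cofactor line: " + item)
            result["cofactors"].append(match.groupdict())
    return result


potassium = parse_cofactor_block([
    "CC   -!- COFACTOR:",
    "CC       Name=K(+); Xref=ChEBI:CHEBI:29103;",
    "CC         Evidence={ECO:0000255|HAMAP-Rule:MF_00086};",
    "CC       Note=Binds 1 potassium ion per subunit. {ECO:0000255|HAMAP-",
    "CC       Rule:MF_00086};",
])
print(potassium["cofactors"][0]["id"])
```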
    

XML format

We modified the XSD type commentType and introduced a new XSD type cofactorType, as shown below. We also moved the declaration of the molecule element - already used in the comment type "subcellular location" - to a more generic context so that it can also be used by other comment types such as "cofactor".

<xs:complexType name="commentType">
        ...
        <xs:sequence>
            <xs:element name="molecule" type="moleculeType" minOccurs="0"/>
            <xs:choice minOccurs="0">
            ...
                <xs:sequence>
                    <xs:annotation>
                        <xs:documentation>Used in 'cofactor' annotations.</xs:documentation>
                    </xs:annotation>
                    <xs:element name="cofactor" type="cofactorType" maxOccurs="unbounded"/>
                </xs:sequence>

                <xs:sequence>
                    <xs:annotation>
                        <xs:documentation>Used in 'subcellular location' annotations.</xs:documentation>
                    </xs:annotation>
                    <!-- <xs:element name="molecule" type="moleculeType" minOccurs="0"/> -->
                    <xs:element name="subcellularLocation" type="subcellularLocationType" maxOccurs="unbounded"/>
                </xs:sequence>
                ...
            </xs:choice>
            ...
            <xs:element name="text" type="evidencedStringType" minOccurs="0">
                <xs:annotation>
                    <xs:documentation>Used to store non-structured types of annotations,
                    as well as optional free-text notes of structured types of annotations.</xs:documentation>
                </xs:annotation>
            </xs:element>
            ...
        </xs:sequence>
        ...
    </xs:complexType>
    ...
    <xs:complexType name="cofactorType">
        <xs:annotation>
            <xs:documentation>Describes a cofactor.</xs:documentation>
        </xs:annotation>
        <xs:sequence>
            <xs:element name="name" type="xs:string"/>
            <xs:element name="dbReference" type="dbReferenceType"/>
        </xs:sequence>
        <xs:attribute name="evidence" type="intListType" use="optional"/>
    </xs:complexType>

A cofactor annotation consists of a sequence of:

  • An optional molecule element that indicates the isoform, chain or peptide to which this annotation applies.
  • Zero or more cofactor elements that each describe an individual cofactor with the following child elements:
    • A name element shows the cofactor name.
    • A dbReference element represents a cross-reference to the corresponding ChEBI record.
  • An optional text element that provides additional information.

Examples:

  • Protein binds alternate/several cofactors
    <comment type="cofactor">
      <cofactor evidence="1">
        <name>Mg(2+)</name>
        <dbReference type="ChEBI" id="CHEBI:18420"/>
      </cofactor>
      <cofactor evidence="1">
        <name>Co(2+)</name>
        <dbReference type="ChEBI" id="CHEBI:48828"/>
      </cofactor>
      <text evidence="1">Binds 2 divalent ions per subunit (magnesium or cobalt).</text>
    </comment>
    <comment type="cofactor">
      <cofactor evidence="1">
        <name>K(+)</name>
        <dbReference type="ChEBI" id="CHEBI:29103"/>
      </cofactor>
      <text evidence="1">Binds 1 potassium ion per subunit.</text>
    </comment>
    ...
    <evidence key="1" type="ECO:0000255">
      <source>
        <dbReference type="HAMAP-Rule" id="MF_00086"/>
      </source>
    </evidence>
    
  • Isoforms
    <comment type="cofactor">
      <molecule>Isoform 1</molecule>
      <cofactor evidence="9">
        <name>Zn(2+)</name>
        <dbReference type="ChEBI" id="CHEBI:29105"/>
      </cofactor>
      <text evidence="9">Isoform 1 binds 3 Zn(2+) ions.</text>
    </comment>
    <comment type="cofactor">
      <molecule>Isoform 2</molecule>
      <cofactor evidence="9">
        <name>Zn(2+)</name>
        <dbReference type="ChEBI" id="CHEBI:29105"/>
      </cofactor>
      <text evidence="9">Isoform 2 binds 2 Zn(2+) ions.</text>
    </comment>
    ...
    <evidence key="9" type="ECO:0000269">
      <source>
        <dbReference type="PubMed" id="16683188"/>
      </source>
    </evidence>
    
  • Chains
    <comment type="cofactor">
      <molecule>Serine protease NS3</molecule>
      <cofactor evidence="13">
        <name>Zn(2+)</name>
        <dbReference type="ChEBI" id="CHEBI:29105"/>
      </cofactor>
      <text evidence="13">Binds 1 zinc ion.</text>
    </comment>
    <comment type="cofactor">
      <molecule>Non-structural protein 5A</molecule>
      <cofactor evidence="3">
        <name>Zn(2+)</name>
        <dbReference type="ChEBI" id="CHEBI:29105"/>
      </cofactor>
      <text evidence="3">Binds 1 zinc ion in the NS5A N-terminal domain.</text>
    </comment>
    ...
    <evidence key="3" type="ECO:0000250"/>
    ...
    <evidence key="13" type="ECO:0000269">
      <source>
        <dbReference type="PubMed" id="9060645"/>
      </source>
    </evidence>
    
  • Cofactor unknown
    <comment type="cofactor">
      <text evidence="1">Does not require a metal cofactor.</text>
    </comment>
    ...
    <evidence key="1" type="ECO:0000269">
      <source>
        <dbReference type="PubMed" id="24450804"/>
      </source>
    </evidence>
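
To illustrate how these elements fit together, here is a minimal sketch that reads cofactor comments with Python's standard xml.etree.ElementTree module and resolves the evidence keys to ECO codes. It is written under simplifying assumptions: the fragment is wrapped in a bare entry element without the XML namespace that real UniProtKB XML files declare, and the function name cofactor_comments is hypothetical.

```python
import xml.etree.ElementTree as ET

# Simplified fragment based on the 'Isoforms' example; no XML namespace is used here.
FRAGMENT = """
<entry>
  <comment type="cofactor">
    <molecule>Isoform 1</molecule>
    <cofactor evidence="9">
      <name>Zn(2+)</name>
      <dbReference type="ChEBI" id="CHEBI:29105"/>
    </cofactor>
    <text evidence="9">Isoform 1 binds 3 Zn(2+) ions.</text>
  </comment>
  <evidence key="9" type="ECO:0000269">
    <source>
      <dbReference type="PubMed" id="16683188"/>
    </source>
  </evidence>
</entry>
"""


def cofactor_comments(entry):
    """Return one dict per cofactor comment, with evidence keys resolved to ECO codes."""
    evidence_types = {ev.get("key"): ev.get("type") for ev in entry.findall("evidence")}
    results = []
    for comment in entry.findall("comment[@type='cofactor']"):
        cofactors = []
        for cof in comment.findall("cofactor"):
            # The evidence attribute is a whitespace-separated list of keys.
            keys = (cof.get("evidence") or "").split()
            cofactors.append({
                "name": cof.findtext("name"),
                "chebi": cof.find("dbReference").get("id"),
                "evidence": [evidence_types.get(key) for key in keys],
            })
        results.append({
            "molecule": comment.findtext("molecule"),  # None when absent
            "cofactors": cofactors,
            "text": comment.findtext("text"),          # None when absent
        })
    return results


parsed = cofactor_comments(ET.fromstring(FRAGMENT))
print(parsed[0]["molecule"], parsed[0]["cofactors"][0]["chebi"])
```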
    

RDF format

We introduced a new cofactor property to list individual cofactors as ChEBI resource descriptions. As for other types of annotations, an optional sequence property may describe the molecule to which the annotation applies and an optional rdfs:comment property may provide additional information.

Examples:

Note: Evidence tags are omitted from the examples to make it easier to read them. They are represented as for all other types of annotations by reification of the concerned statements.

  • Protein binds alternate/several cofactors
    uniprot:Q5M434
      up:annotation SHA:1, SHA:2 ;
      ...
    SHA:1
      rdf:type up:Cofactor_Annotation ;
      rdfs:comment "Binds 2 divalent ions per subunit (magnesium or cobalt)." ;
      up:cofactor <http://purl.obolibrary.org/obo/CHEBI_18420> ,
                  <http://purl.obolibrary.org/obo/CHEBI_48828> .
    SHA:2
      rdf:type up:Cofactor_Annotation ;
      rdfs:comment "Binds 1 potassium ion per subunit." ;
      up:cofactor <http://purl.obolibrary.org/obo/CHEBI_29103> .
    
  • Isoforms
    uniprot:O15304
      up:annotation SHA:1, SHA:2 ;
      ...
    SHA:1
      rdf:type up:Cofactor_Annotation ;
      rdfs:comment "Isoform 1 binds 3 Zn(2+) ions." ;
      up:cofactor <http://purl.obolibrary.org/obo/CHEBI_29105> ;
      up:sequence isoform:O15304-1 .
    SHA:2
      rdf:type up:Cofactor_Annotation ;
      rdfs:comment "Isoform 2 binds 2 Zn(2+) ions." ;
      up:cofactor <http://purl.obolibrary.org/obo/CHEBI_29105> ;
      up:sequence isoform:O15304-2 .
    
  • Chains
    uniprot:P26662
      up:annotation SHA:1, SHA:2 ;
      ...
    SHA:1
      rdf:type up:Cofactor_Annotation ;
      rdfs:comment "Binds 1 zinc ion." ;
      up:cofactor <http://purl.obolibrary.org/obo/CHEBI_29105> ;
      up:sequence annotation:PRO_0000037644 .
    SHA:2
      rdf:type up:Cofactor_Annotation ;
      rdfs:comment "Binds 1 zinc ion in the NS5A N-terminal domain." ;
      up:cofactor <http://purl.obolibrary.org/obo/CHEBI_29105> ;
      up:sequence annotation:PRO_0000037647 .
    
  • Cofactor unknown
    uniprot:A9CEQ7
      up:annotation SHA:1 ;
      ...
    SHA:1
      rdf:type up:Cofactor_Annotation ;
      rdfs:comment "Does not require a metal cofactor." .
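
The flat file and XML examples identify a cofactor as CHEBI:18420, whereas the RDF examples use the resource http://purl.obolibrary.org/obo/CHEBI_18420. A small sketch of the conversion between the two spellings, assuming the URI pattern seen in the examples above holds for every ChEBI identifier (both function names are hypothetical):

```python
CHEBI_PURL_PREFIX = "http://purl.obolibrary.org/obo/CHEBI_"


def chebi_id_to_purl(chebi_id):
    """Turn 'CHEBI:18420' into the URI form used in the RDF examples."""
    prefix, _, number = chebi_id.partition(":")
    if prefix != "CHEBI" or not number.isdigit():
        raise ValueError("not a ChEBI identifier: " + chebi_id)
    return CHEBI_PURL_PREFIX + number


def purl_to_chebi_id(purl):
    """Turn the RDF URI form back into 'CHEBI:<number>'."""
    if not purl.startswith(CHEBI_PURL_PREFIX):
        raise ValueError("not a ChEBI URI: " + purl)
    return "CHEBI:" + purl[len(CHEBI_PURL_PREFIX):]


print(chebi_id_to_purl("CHEBI:18420"))
```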
    

Changes to the controlled vocabulary of human diseases

New diseases:

Modified diseases:

UniProt release 2014_10

Published October 29, 2014

Headline

K for Koagulation

After several weeks of a cholesterol-free diet, chickens start bleeding. The phenotype cannot be reversed by the addition of purified cholesterol to their chow, suggesting that another compound could have been extracted along with cholesterol during food preparation. This observation made by Henrik Dam in 1929 led to the identification of a fat-soluble vitamin involved in coagulation, also known as vitamin K (K standing for Koagulationsvitamin, the original German name for this compound, since the initial observations were reported in a German journal). This discovery was awarded the Nobel prize in 1943, but vitamin K function and metabolism are still extensively studied.

In plants, vitamin K plays an essential role in photosynthesis, which is why it is particularly enriched in photosynthetic tissues, such as green leaves. In animals, vitamin K is essential for blood clotting and bone mineralization. It also prevents the calcification of arteries and other soft tissues. More recently, vitamin K has been shown to function as a mitochondrial electron carrier and to serve as a ligand for the nuclear receptor SXR, which controls the expression of genes involved in transport and metabolism of endo- and xenobiotics.

The most extensively studied vitamin K function is its role as a cosubstrate for vitamin K-dependent gamma-carboxylase (GGCX). This enzyme catalyzes gamma-carboxylation of glutamate residues in target proteins. The modification activates several blood factor proteins and leads to initiation of the blood coagulation cascade. Widely used anticoagulant drugs, called coumarins, take advantage of this property and act as vitamin K antagonists. For example, warfarin is thought to inhibit vitamin K epoxide reductase complex subunit 1 (VKORC1), blocking vitamin K recycling, hence depleting active vitamin K stores. Although life-saving, the use of warfarin is quite tricky, as inadequate dosage may have dramatic consequences, either embolism or thrombosis (underdosage), or potentially fatal hemorrhage (overdosage). Interindividual genetic variations greatly affect warfarin efficiency. Polymorphisms within VKORC1 and CYP2C9, a cytochrome P450 family member involved in coumarin inactivation, together account for approximately 30% of population dose variance. A genetic variant p.Val433Met in another P450 family member, CYP4F2, has also been reported to increase warfarin requirements. CYP4F2 has recently been shown to catalyze vitamin K omega-hydroxylation, a key step in vitamin K degradation. The p.Val433Met polymorphism produces a decrease of CYP4F2 protein in the liver. Lower CYP4F2 levels likely lead to an increase in hepatic vitamin K levels, hence more molecules that warfarin must antagonize, resulting in coumarin resistance in individuals bearing this polymorphism.

As of this release, an updated version of the UniProtKB/Swiss-Prot CYP4F2 entry is available. Proteins undergoing gamma-carboxylation can be retrieved using the keyword Gamma-carboxyglutamic acid.

UniProtKB news

Change of the cross-reference ArrayExpress to ExpressionAtlas

The Expression Atlas database provides information on baseline and differential gene expression patterns under different biological conditions. Experiments in Expression Atlas are selected from the ArrayExpress database of functional genomics experiments. Because UniProtKB entries cross-reference only this subset of experiments, we have changed the resource abbreviation for these cross-references from ArrayExpress to ExpressionAtlas. We have at the same time added a field to indicate the type of expression patterns for which information can be found in the ExpressionAtlas (see examples below).

Text format

Example: P15822

DR   ExpressionAtlas; P15822; baseline and differential.
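
A DR line of this shape can be split into its fields as sketched below. The sketch assumes that fields are separated by semicolons, that the line ends with a full stop, and that a lone "-" marks an empty field; parse_dr_line is a hypothetical helper name.

```python
def parse_dr_line(line):
    """Split a flat-file DR line into database, identifier and extra fields."""
    if not line.startswith("DR"):
        raise ValueError("not a DR line: " + line)
    content = line[2:].strip()
    if content.endswith("."):
        content = content[:-1]          # drop the terminating full stop
    fields = [field.strip() for field in content.split(";")]
    if len(fields) < 2:
        raise ValueError("DR line needs a database and an identifier: " + line)
    return {
        "database": fields[0],
        "identifier": fields[1],
        # Assumption: '-' is a placeholder for an empty field.
        "properties": [field for field in fields[2:] if field != "-"],
    }


print(parse_dr_line("DR   ExpressionAtlas; P15822; baseline and differential."))
```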

XML format

Example: P15822

<dbReference type="ExpressionAtlas" id="P15822">
  <property type="expression patterns" value="baseline and differential"/>
</dbReference>

RDF format

Example: P15822

uniprot:P15822
  rdfs:seeAlso <http://purl.uniprot.org/expressionatlas/P15822> .
<http://purl.uniprot.org/expressionatlas/P15822>
  rdf:type up:Resource ;
  up:database <http://purl.uniprot.org/database/ExpressionAtlas> ;
  rdfs:comment "baseline and differential" .

Changes to the controlled vocabulary of human diseases

New diseases:

Modified diseases:

Deleted diseases:

  • Amelogenesis imperfecta and gingival fibromatosis syndrome
  • Mental retardation, X-linked 59

Changes to the controlled vocabulary for PTMs

New terms for the feature key 'Modified residue' ('MOD_RES' in the flat file):

  • (4R)-5-hydroxyleucine
  • (4R)-5-oxoleucine

Deleted term:

  • 5-methoxythiazole-4-carboxylic acid (Val-Cys)

UniProt release 2014_09

Published October 1, 2014

Headline

Small is beautiful (and useful)

In large-scale studies, small proteins tend to be overlooked. They are difficult to predict using software tools and they often escape detection by mass spectrometry. When cDNA sequences are submitted, short coding sequences (CDS) are only rarely annotated and hence do not appear in any protein databases, including UniProtKB/TrEMBL or GenPept, and their nucleotide sequences can be tagged as 'non-coding RNAs'. In UniProtKB/Swiss-Prot, we are aware of the problem, but we are often reluctant to annotate uncharacterized small ORFs, for fear of introducing imaginary sequences into a database we wish to be as reliable as possible. That is why we are thrilled when new data become available that allow us to fill the gap.

This happened a few months ago, with the publication of 2 articles that brought the 'noncoding transcript' AK092578 under the spotlight. Pauli et al. were investigating inductive events during early embryogenesis in zebrafish. In order to find new signaling peptides, they sequenced RNAs extracted from embryos at different developmental stages and combined this approach with ribosome profiling to select for transcripts most likely to be translated. This led to the discovery of 399 novel coding genes. 28 of them contained a signal peptide, but no transmembrane domain, making them good candidates for signaling proteins. Pauli et al. focused their attention on one of them, apela, that they called toddler, encoded by AK092578, so far considered to be a noncoding transcript. A few weeks earlier, Chng et al. had already published the identification of the same protein, which they named elabela.

Apela is a highly conserved protein among vertebrates; this conservation is particularly striking in the 30 amino acid long mature peptide, the last 13 residues being nearly invariant in all vertebrate species studied. Apela is expressed in the zygote, with a peak during gastrulation, and becomes undetectable by 4 days post-fertilization. Its disruption leads to a dramatic phenotype, including small or absent hearts, posterior accumulation of blood cells, malformed pharyngeal endoderm, and abnormal left-right positioning and formation of the liver. Most mutant embryos eventually die between 5 and 7 days of development. Interestingly, this phenotype was reminiscent of that observed for apelin receptor (aplnr) deficiency.

The pathway leading to aplnr activation that could explain the observed mutant phenotype remained unsolved for several years. Indeed, aplnr disruption in zebrafish demonstrated that aplnr was required prior to the onset of gastrulation for proper cardiac morphogenesis, but its known ligand, apln, was not expressed until midgastrulation, too late to play a role in such a very early event. Along the same line, it had been reported that Aplnr mutant animals were not born in the expected Mendelian ratio, and many showed cardiovascular developmental defects, while Apln-deficient mice were viable, fertile, and showed normal development. Taken together, these observations suggested that Aplnr might have yet another ligand, expressed very early in embryonic development. The newly discovered apela protein seemed to fulfill the conditions and, using different strategies, both groups convincingly showed that apela is indeed aplnr's first ligand.

Human, mouse and zebrafish Apela orthologs have been updated accordingly and these entries are now available.

UniProtKB news

Evidence in the UniProtKB flat file format

The evidence for annotations in UniProtKB entries has been available for several years in the XML and RDF representation of the data and we have now added this information also to the text format (aka flat file format).

Representation of evidence

This section describes how evidence is represented, independently of the context in which it can be found.

An individual evidence description consists of a mandatory evidence type, represented by a code from the Evidence Codes Ontology (ECO), and, where applicable, the source of the data. The source is usually another database record, represented by the database name and record identifier. For publications that are not in PubMed, we indicate the corresponding UniProtKB reference number instead.

Examples:

  • An evidence type without source: {type}, e.g.
    {ECO:0000305}
    {ECO:0000250}
    {ECO:0000255}
    
  • An evidence type with source: {type|source}, e.g.
    {ECO:0000269|PubMed:10433554}
    {ECO:0000303|Ref.6}
    {ECO:0000305|PubMed:16683188}
    {ECO:0000250|UniProtKB:Q8WUF5}
    {ECO:0000312|EMBL:BAG16761.1}
    {ECO:0000313|EMBL:BAG16761.1}
    {ECO:0000255|HAMAP-Rule:MF_00205}
    {ECO:0000256|HAMAP-Rule:MF_00205}
    {ECO:0000244|PDB:1K83}
    {ECO:0000213|PDB:1K83}
    
  • Several evidence attributions: {type|source, type|source, ...}, e.g.
    {ECO:0000269|PubMed:10433554, ECO:0000303|Ref.6}
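
The three shapes listed above can be handled by one small parser. This is a sketch only: it assumes that attributions are separated by commas, that source identifiers never contain a comma, and that a source starting with "Ref." is a UniProtKB reference number. The Evidence tuple and parse_evidence are hypothetical names.

```python
from typing import List, NamedTuple, Optional


class Evidence(NamedTuple):
    eco: str                   # evidence type, e.g. 'ECO:0000269'
    source_db: Optional[str]   # e.g. 'PubMed', 'Ref', or None without source
    source_id: Optional[str]   # e.g. '10433554', '6', or None without source


def parse_evidence(text: str) -> List[Evidence]:
    """Parse '{type|source, type|source, ...}' into Evidence tuples."""
    text = text.strip()
    if not (text.startswith("{") and text.endswith("}")):
        raise ValueError("evidence must be enclosed in braces: " + text)
    attributions = []
    for part in text[1:-1].split(","):
        eco, _, source = part.strip().partition("|")
        if not source:
            attributions.append(Evidence(eco, None, None))
        elif source.startswith("Ref."):
            # Publication outside PubMed: UniProtKB reference number.
            attributions.append(Evidence(eco, "Ref", source[len("Ref."):]))
        else:
            database, _, identifier = source.partition(":")
            attributions.append(Evidence(eco, database, identifier))
    return attributions


print(parse_evidence("{ECO:0000269|PubMed:10433554, ECO:0000303|Ref.6}"))
```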
    

Change of the representation of different line and annotation types

This section describes in which line and annotation types evidence may be found and where it is placed. We use here the symbolic representation {evidence} as a placeholder for all evidence representations that are described in the previous section.

DE lines

Evidence may be found at the end of subcategory fields, e.g.

DE   RecName: Full=Palmitoyl-protein thioesterase-dolichyl pyrophosphate phosphatase fusion 1 {evidence};
DE   Contains:
DE     RecName: Full=Palmitoyl-protein thioesterase {evidence};
DE              Short=PPT {evidence};
DE              EC=3.1.2.22 {evidence};
DE     AltName: Full=Palmitoyl-protein hydrolase {evidence};
DE   Contains:
DE     RecName: Full=Dolichyldiphosphatase {evidence};
DE              EC=3.6.1.43 {evidence};
DE     AltName: Full=Dolichyl pyrophosphate phosphatase {evidence};
DE   Flags: Precursor;

GN lines

Evidence may be found after each gene designation, e.g.

GN   Name=cysA1 {evidence}; Synonyms=cysA {evidence};
GN   OrderedLocusNames=Rv3117 {evidence}, MT3199 {evidence};
GN   ORFNames=MTCY164.27 {evidence};
GN   and
GN   Name=cysA2 {evidence}; OrderedLocusNames=Rv0815c {evidence}, MT0837
GN   {evidence}; ORFNames=MTV043.07c {evidence};

OG lines

Evidence may be found after an organelle or plasmid, e.g.

OG   Mitochondrion {evidence}.
OG   Plasmid pWR100 {evidence}, Plasmid pINV_F6_M1382 {evidence}, and
OG   Plasmid pCP301 {evidence}.

OX lines

Evidence may be found after the taxonomy identifier, e.g.

OX   NCBI_TaxID=9606 {evidence};

RN lines

Evidence may be found after the reference number, e.g.

RN   [1] {evidence}

RC lines

Evidence may be found after each value, e.g.

RC   STRAIN=C57BL/6J {evidence}, and DBA/2J {evidence}; TISSUE=Brain
RC   {evidence};

KW lines

Evidence may be found after each keyword, e.g.

KW   ATP-binding {evidence}; Cell cycle {evidence}; Cell division {evidence};
KW   DNA replication {evidence};
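
As an illustration of reading keywords together with their optional evidence, the sketch below joins the KW lines, splits them on semicolons and separates a trailing {…} group from each keyword. It assumes that keywords themselves contain neither semicolons nor braces; parse_kw_lines is a hypothetical helper name.

```python
import re

# A keyword optionally followed by one {...} evidence group.
KW_ITEM_RE = re.compile(r"^(?P<keyword>.*?)\s*(?P<evidence>\{[^}]*\})?$")


def parse_kw_lines(lines):
    """Return (keyword, evidence or None) pairs from flat-file KW lines."""
    text = " ".join(line.strip()[2:].strip() for line in lines)  # drop 'KW'
    pairs = []
    for item in text.split(";"):
        item = item.strip().rstrip(".")   # tolerate a final full stop
        if not item:
            continue
        match = KW_ITEM_RE.match(item)
        pairs.append((match.group("keyword"), match.group("evidence")))
    return pairs


print(parse_kw_lines([
    "KW   ATP-binding {ECO:0000255}; Cell cycle {ECO:0000255};",
    "KW   DNA replication.",
]))
```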

CC lines

The evidence location depends on the annotation type.

Unstructured annotations:

Evidence may initially be found at the end of the annotations because this is how they have historically been attributed, e.g.

CC   -!- FUNCTION: Possesses kinase activity. May be involved in
CC       trafficking and/or processing of RNA. {evidence}.

At a later time, we intend to start attributing evidence at a more fine-grained level by placing it after the sentences or paragraphs to which it applies, e.g.

CC   -!- FUNCTION: Possesses kinase activity. {evidence}. May be involved
CC       in trafficking and/or processing of RNA. {evidence}.

Structured annotations:

ALTERNATIVE PRODUCTS:

Evidence may be found behind the values of the Name= and Synonyms= fields. It may also be found in Comment= and Note= fields where it is placed as in unstructured annotations, e.g.

CC   -!- ALTERNATIVE PRODUCTS:
CC       Event=Alternative splicing; Named isoforms=13;
CC         Comment=Additional isoforms seem to exist. {evidence};
CC       Name=1 {evidence}; Synonyms=LST1/A {evidence};
CC         IsoId=O00453-1; Sequence=Displayed;
..
CC       Name=12;
CC         IsoId=O00453-12; Sequence=VSP_047367;
CC         Note=No experimental confirmation available. {evidence};

BIOPHYSICOCHEMICAL PROPERTIES:

In the structured subtopics Absorption and Kinetic parameters evidence may be found at the end of the Abs(max)=, KM= and Vmax= fields. It may also be found in Note= fields and the unstructured subtopics pH dependence, Redox potential and Temperature dependence, where it is placed as in unstructured annotations, e.g.

CC   -!- BIOPHYSICOCHEMICAL PROPERTIES:
CC       Absorption:
CC         Abs(max)=465 nm {evidence};
CC         Note=The above maximum is for the oxidized form. Shows a maximal
CC         peak at 330 nm in the reduced form. These absorption peaks are
CC         for the tryptophylquinone cofactor. {evidence};
CC       Kinetic parameters:
CC         KM=5.4 uM for tyramine {evidence};
CC         Vmax=17 umol/min/mg enzyme {evidence};
CC         Note=The enzyme is substrate inhibited at high substrate
CC         concentrations (Ki=1.08 mM for tyramine). {evidence};

CC   -!- BIOPHYSICOCHEMICAL PROPERTIES:
CC       pH dependence:
CC         Optimum pH is 7-8 for ATPase activity. Is more active at pH 8 to
CC         10 than at pH 5.5. {evidence};
CC       Temperature dependence:
CC         Optimum temperature is 80 degrees Celsius for ATPase activity.
CC         {evidence};

RNA EDITING:

Evidence may be found behind the modified positions as well as in the optional Note= field where it is placed as in unstructured annotations, e.g.

CC   -!- RNA EDITING: Modified_positions=207 {evidence}; Note=Partially
CC       edited. Target of Adar. {evidence};

(Please note that we have taken this occasion to make an additional small format change to this annotation type: We have replaced the full-stop at the end of the annotation with a semi-colon to be consistent with other structured annotation types that consist of a list of Field=Value; items.)

MASS SPECTROMETRY:

In MASS SPECTROMETRY annotations the same evidence applies to all fields (incl. the optional Note= field) and all evidence attributions are thus displayed in a separate field instead of adding them at the end of each field. A new Evidence= field has replaced the previously existing Source= field, e.g.

CC   -!- MASS SPECTROMETRY: Mass=2189.4; Method=Electrospray; Range=167-
CC       186; Note=Monophosphorylated.; Evidence={evidence};

SEQUENCE CAUTION:

In SEQUENCE CAUTION annotations the same evidence applies to all fields (incl. the optional Note= field) and all evidence is thus displayed in a separate new Evidence= field instead of being added at the end of each field, e.g.

CC   -!- SEQUENCE CAUTION:
CC       Sequence=AAL25396.1; Type=Miscellaneous discrepancy; Note=Intron retention.; Evidence={evidence};
CC       Sequence=ABF70206.1; Type=Miscellaneous discrepancy; Note=Intron retention.; Evidence={evidence};
CC       Sequence=CAA32567.1; Type=Erroneous gene model prediction; Evidence={evidence};
CC       Sequence=CAA32568.1; Type=Erroneous gene model prediction; Evidence={evidence};

SUBCELLULAR LOCATION:

Evidence may be found at the same places where previously the non-experimental qualifiers By similarity, Probable and Potential were displayed (see Syntax modification of the 'Subcellular location' subtopic) as well as in the optional Note= field where it is placed as in unstructured annotations, e.g.

CC   -!- SUBCELLULAR LOCATION: Golgi apparatus, trans-Golgi network
CC       membrane {evidence}; Multi-pass membrane protein {evidence}.
CC       Note=Predominantly found in the trans-Golgi network (TGN). Not
CC       redistributed to the plasma membrane in response to elevated
CC       copper levels. {evidence}.
CC   -!- SUBCELLULAR LOCATION: Isoform 2: Cytoplasm {evidence}.
CC   -!- SUBCELLULAR LOCATION: WND/140 kDa: Mitochondrion {evidence}.

DISEASE:

Evidence may be found at the end of the disease description as well as in the optional Note= field where it is placed as in unstructured annotations, e.g.

CC   -!- DISEASE: Sarcoidosis 1 (SS1) [MIM:181000]: An idiopathic,
CC       systemic, inflammatory disease characterized by the formation of
CC       immune granulomas in involved organs. Granulomas predominantly
CC       invade the lungs and the lymphatic system, but also skin, liver,
CC       spleen, eyes and other organs may be involved. {evidence}.
CC       Note=Disease susceptibility is associated with variations
CC       affecting the gene represented in this entry. {evidence}.

FT lines

Evidence may be found at the end of the feature description, e.g.

FT   VARIANT     341    341       P -> L (in AH2; strongly reduced
FT                                activity). {evidence}.
FT                                /FTId=VAR_065665.
FT   CONFLICT     52     53       RT -> KI (in Ref. 8; AAD14329).
FT                                {evidence}.

Changes to the controlled vocabulary of human diseases

New diseases:

Modified diseases:

Deleted disease:

  • Glycogen storage disease 14

Changes to keywords

New keyword:

Modified keyword:

UniProt release 2014_08

Published September 3, 2014

Headline

Ubiquitin caught at its own game

Ubiquitination is a widely used post-translational modification (PTM) in eukaryotic cells. It is involved in a plethora of cellular activities ranging from removal of misfolded and unwanted proteins to signaling in innate immunity, from transcriptional regulation to membrane trafficking. Ubiquitination is the covalent attachment of the small 76-residue protein ubiquitin onto a target protein, most often via an isopeptide bond between the amino group of a lysine side chain and the ubiquitin C-terminus. This process occurs in several steps: a ubiquitin-activation step catalyzed by E1 enzymes, a ubiquitin-conjugation step catalyzed by E2 enzymes, and a step ensuring the target specificity involving E3 ligases. Many different types of ubiquitination exist: monoubiquitination, multi(mono)ubiquitination and polyubiquitination, each type conveying a different signal. Polyubiquitination occurs via further ubiquitination of a single lysine residue on the substrate protein. Ubiquitin contains 7 lysines; each can serve as an acceptor for further elongation and each defines a distinct fate for the modified protein. The classic example is the Lys-48-linked chain which targets the protein bearing it to degradation via the proteasome.

An additional layer of complexity has been unveiled in 3 recent publications: ubiquitin was discovered to be itself subject to another PTM, namely phosphorylation, which confers on it the ability to activate the E3 ubiquitin-protein ligase Parkin (PARK2).

Parkin and the PINK1 kinase are involved in the signaling pathway leading to mitophagy, a specialized program which eliminates damaged mitochondria and hence maintains health. Indeed, defects in either of these proteins cause early-onset Parkinson disease.

Under normal conditions, PINK1 is imported into mitochondria, where it is processed and rapidly degraded. When mitochondria lose membrane potential or amass unfolded proteins, PINK1 accumulates on the outer membrane where it recruits cytosolic Parkin and activates its latent E3 activity. As a result, mitochondrial outer membrane proteins are ubiquitinated and the defective organelle is targeted for destruction.

It is in the Parkin activation step that phosphorylated ubiquitin comes into play. PINK1 directly phosphorylates ubiquitin at Ser-65. Of note, Parkin itself contains a ubiquitin-like domain that is also phosphorylated by PINK1 at Ser-65. All three publications agree that phosphorylated ubiquitin is involved in the PINK1/PARK2 pathway. Nevertheless, Koyano and colleagues found that both ubiquitin and Parkin Ser-65 phosphorylations are needed for full Parkin activation, whereas Kane et al. observed Parkin activation with phospho-ubiquitin alone. While phospho-ubiquitin can be used by Parkin as a substrate for ubiquitination, its Parkin-binding and -activating abilities seem to be separate from its role as a substrate.

As of this release, human Parkin, PINK1 and ubiquitin entries have been updated accordingly and annotations have been transferred to orthologous entries based on sequence similarity. Proteins known to undergo ubiquitination can be retrieved with the keyword Ubl conjugation and proteins involved in the ubiquitination pathway, such as E1, E2 or E3 enzymes, with the keyword Ubl conjugation pathway.

UniProtKB news

New variant types in homo_sapiens_variation.txt.gz on the UniProt FTP site

UniProt would like to announce the addition of two variant types, stop lost and stop gained, to the set of protein altering variants from the 1000 Genomes Project available in the homo_sapiens_variation.txt.gz file. Stop lost and stop gained variants have been selected as the first structural variants to be added to the UniProt variant catalogue because they are two of the most commonly occurring variant types. UniProt expects to add further structural variant types and somatic variants to the available variant types and to include additional species. This file, along with the humsavar.txt file, can now be found in the new dedicated variants directory in the UniProt FTP site. We very much welcome the feedback of the community on our efforts.

Changes to the controlled vocabulary of human diseases

New diseases:

Modified diseases:

Deleted diseases:

  • Ichthyosis, autosomal recessive, with hypotrichosis
  • Loeys-Dietz syndrome 2A
  • Loeys-Dietz syndrome 2B

Changes to the controlled vocabulary for PTMs

New terms for the feature key 'Cross-link' ('CROSSLNK' in the flat file):

  • Isoaspartyl glycine isopeptide (Asn-Gly)
  • Isoaspartyl glycine isopeptide (Asp-Gly)

Deleted terms:

  • Aspartyl isopeptide (Asn)
  • Aspartyl isopeptide (Asp)

Changes to keywords

Modified keyword:

Website news

The UniProt website is changing

We would like to introduce you to the new UniProt website! We have been working on this site behind the scenes for a while and we're glad it's finally time to share it with you.

We redesigned the UniProt website following a user centered design process, involving over 250 users worldwide with varied research backgrounds and use cases. User centered design is a design approach that is grounded in the requirements and expectations of users. They are included at every stage of the process, from gathering requirements to testing the end product.

Some highlights of the changes and improvements:

  • A new homepage and advanced search functionality
  • A new results page interface with easy to use filters
  • A basket to store your favorite proteins and build up your own set
  • New protein entry page content classification and navigation bar
  • New tool output interfaces (e.g. BLAST results)
  • New 'Proteomes' pages for full protein sets from completely sequenced organisms

Contextual help is available on the site as well as UniProt help videos from the UniProt YouTube channel. We look forward to feedback from the scientific community to help improve the site further.

UniProt release 2014_07

Published July 9, 2014

Headline

Lark or owl? PER3 is the answer

Unless you are like Napoleon, who never needed more than 4 hours of sleep at a stretch, being both an early bird and a night owl, you certainly have a diurnal preference. It is not a simple matter of taste; it is a matter of genetics, involving the PER3 gene.

In humans, the PER3 gene exists in 2 versions: a short one and a long one. The length variation depends upon the number of 18 amino-acid tandem repeats in the protein's C-terminus: 4 in the short version, 5 in the long one. Roughly 10% of the population is homozygous for the long allele (PER3 5/5) and 50% for the short allele (PER3 4/4). This polymorphism correlates significantly with extreme diurnal preference, the longer allele being associated with morningness and the shorter allele with eveningness. In addition, PER3 5/5 individuals are more vulnerable to sleep deprivation than their PER3 4/4 counterparts, exhibiting greater cognitive performance impairment. When allowed to take naps, PER3 5/5 individuals show a greater ability to sleep independently of circadian phase, suggesting that the polymorphism modifies the sleep homeostatic response without influencing circadian parameters.

The molecular mechanism of this behavioral difference is not known and there was no animal model to investigate it until recently. Indeed, the 18 amino-acid polymorphism does not exist in non-primate mammals. Earlier this year, Hasan et al. published a study in which they created 2 knock-in mice. These mice contained a "humanized" PER3 exon 18 with either the 4-repeat or 5-repeat allele. The transgenic mice exhibited a phenotypic response to sleep deprivation and recovery consistent with the observations made in humans. 816 genes were differentially expressed in the cortex of Per3 4/4 and Per3 5/5 mice and a similar number in the hypothalamus. At least some of these genes seem to be involved in the regulation of, or response to, sleep, as well as in neuronal development and function. For instance, some isoforms of the Homer1 gene, a marker of sleep homeostasis, were up-regulated in the Per3 5/5 compared to the Per3 4/4 hypothalamus.

With this tool in hand, we may be in a position to start identifying the genetic control of sleep architecture in humans and maybe reveal whether Napoleon's sleep ability was a true genetic oddity, the result of his iron will or just a historical myth.

As of this release, the human PER3 entry has been updated in UniProtKB/Swiss-Prot.

UniProtKB news

Cross-references to CCDS

Cross-references have been added to CCDS, the Consensus CDS project.

CCDS is available at http://www.ncbi.nlm.nih.gov/CCDS/CcdsBrowse.cgi.

The format of the explicit links is:

Resource abbreviation CCDS
Resource identifier CCDS identifier

Cross-references to CCDS may be isoform-specific. The general format of isoform-specific cross-references was described in release 2014_03.

Example: O70554

Show all entries having a cross-reference to CCDS.

Text format

Examples:

O70554

DR   CCDS; CCDS38509.1; -.

P00750

DR   CCDS; CCDS6126.1; -. [P00750-1]
DR   CCDS; CCDS6127.1; -. [P00750-3]

XML format

Examples:

O70554

<dbReference type="CCDS" id="CCDS38509.1"/>

P00750

<dbReference type="CCDS" id="CCDS6126.1">
  <molecule id="P00750-1"/>
</dbReference>
<dbReference type="CCDS" id="CCDS6127.1">
  <molecule id="P00750-3"/>
</dbReference>
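
Code that reads these elements has to treat the molecule child as optional. The sketch below shows one way to do that with Python's standard library. It makes several assumptions: the function name xrefs_with_isoform is hypothetical, the examples above are wrapped in an artificial entry root element purely so that they form one well-formed document, and the XML namespace used in the full UniProtKB XML files is omitted for brevity.

```python
import xml.etree.ElementTree as ET

# The dbReference examples from above, wrapped in an artificial root element.
SNIPPET = """<entry>
<dbReference type="CCDS" id="CCDS38509.1"/>
<dbReference type="CCDS" id="CCDS6126.1">
  <molecule id="P00750-1"/>
</dbReference>
<dbReference type="CCDS" id="CCDS6127.1">
  <molecule id="P00750-3"/>
</dbReference>
</entry>"""


def xrefs_with_isoform(xml_text, resource):
    """Return (identifier, isoform-or-None) pairs for one resource."""
    root = ET.fromstring(xml_text)
    pairs = []
    for ref in root.iter("dbReference"):
        if ref.get("type") != resource:
            continue
        molecule = ref.find("molecule")  # None when not isoform-specific
        isoform = molecule.get("id") if molecule is not None else None
        pairs.append((ref.get("id"), isoform))
    return pairs


ccds_pairs = xrefs_with_isoform(SNIPPET, "CCDS")
```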

Cross-references to GeneReviews

Cross-references have been added to GeneReviews, a resource of expert-authored, peer-reviewed disease descriptions.

GeneReviews is available at http://www.ncbi.nlm.nih.gov/books/NBK1116/.

The format of the explicit links is:

Resource abbreviation GeneReviews
Resource identifier GeneReviews identifier

Example: O00555

Show all entries having a cross-reference to GeneReviews.

Text format

Example: O00555

DR   GeneReviews; CACNA1A; -.

XML format

Example: O00555

<dbReference type="GeneReviews" id="CACNA1A"/>

Changes to the controlled vocabulary of human diseases

New diseases:

Modified diseases:

Changes to the controlled vocabulary for PTMs

New term for the feature key 'Modified residue' ('MOD_RES' in the flat file):

  • L-isoglutamyl histamine

Modified term for the feature key 'Modified residue' ('MOD_RES' in the flat file):

  • N6-crotonyl-L-lysine -> N6-crotonyllysine

Changes to keywords

New keywords:

Modified keywords:

UniParc news

UniParc cross-references with protein and gene names

The UniParc XML format uses dbReference elements to represent cross-references to external database records that contain the same sequence as the UniParc record. Additional information about an external database record is provided with different types of property child elements. We have introduced two new types, "protein_name" and "gene_name", to show the preferred protein and gene name of external database records that provide this information. In this release we have added names for cross-references to UniProtKB and RefSeq. For UniProtKB entries that have several protein or gene names, UniParc shows only the main one, which is the same name that is shown in the UniProtKB FASTA format. We will soon add names for cross-references to ENA, Ensembl, EnsemblGenomes and model organism databases (FlyBase, SGD, TAIR, WormBase).

Examples:

<dbReference type="UniProtKB/Swiss-Prot" id="P05067" version_i="3" active="Y" version="3" created="1991-11-01" last="2014-02-19">
  <property type="NCBI_GI" value="112927"/>
  <property type="NCBI_taxonomy_id" value="9606"/>
  <property type="protein_name" value="Amyloid beta A4 protein"/>
  <property type="gene_name" value="APP"/>
</dbReference>
...
<dbReference type="UniProtKB/Swiss-Prot protein isoforms" id="P05067-2" version_i="1" active="Y" created="2003-03-28" last="2014-02-19">
  <property type="NCBI_taxonomy_id" value="9606"/>
  <property type="protein_name" value="Isoform APP305 of Amyloid beta A4 protein"/>
  <property type="gene_name" value="APP"/>
</dbReference>

This change did not affect the UniParc XSD, but may nevertheless require code changes.
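
As an illustration of the kind of code change that may be needed, the sketch below reads the two new property types and tolerates their absence, since not every external database record provides names. The function name xref_names is hypothetical, and the first example element above is parsed on its own rather than as part of a complete UniParc entry.

```python
import xml.etree.ElementTree as ET

# First example from above, parsed as a stand-alone element.
SNIPPET = """<dbReference type="UniProtKB/Swiss-Prot" id="P05067" version_i="3" active="Y" version="3" created="1991-11-01" last="2014-02-19">
  <property type="NCBI_GI" value="112927"/>
  <property type="NCBI_taxonomy_id" value="9606"/>
  <property type="protein_name" value="Amyloid beta A4 protein"/>
  <property type="gene_name" value="APP"/>
</dbReference>"""


def xref_names(ref):
    """Return the protein and gene name of a dbReference element.

    A name is None when the corresponding property is not present.
    """
    names = {"protein_name": None, "gene_name": None}
    for prop in ref.findall("property"):
        if prop.get("type") in names:
            names[prop.get("type")] = prop.get("value")
    return names


app_names = xref_names(ET.fromstring(SNIPPET))
no_names = xref_names(ET.fromstring('<dbReference type="REFSEQ" id="ZP_06545872"/>'))
```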

FTP site news

Every folder on our FTP server now contains a file called RELEASE.metalink that specifies the size and MD5 checksum of every file in that folder, e.g.
ftp://ftp.uniprot.org/pub/databases/uniprot/knowledgebase/RELEASE.metalink

Metalink is an extensible metadata file format that describes one or more computer files available for download. It facilitates file verification and recovery from data corruption and lists alternate download sources (mirror URIs).

Various command line download tools, e.g. cURL version 7.30 or higher and aria2, support metalink.

Example: The following command will download all files in the current_release/ folder and verify their MD5 checksums:

curl --metalink ftp://ftp.uniprot.org/pub/databases/uniprot/current_release/RELEASE.metalink

The files are downloaded from one of the alternative locations mentioned in the metalink file. If one FTP server goes down during a download, programs can automatically switch to another mirror location. Some programs can also download segments from several FTP locations at the same time, which can make downloads much faster.
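
For tools without metalink support, the same check can be done by hand once the expected size and MD5 checksum of a file have been looked up in RELEASE.metalink. The following is a minimal sketch using only the Python standard library. The function names md5_of_file and verify_download are hypothetical, and reading the expected values out of the metalink file is deliberately left out because its exact layout is not described here.

```python
import hashlib
import os


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # iter() with a sentinel stops when read() returns empty bytes.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path, expected_md5, expected_size):
    """Return True if both the size and the MD5 checksum match."""
    if os.path.getsize(path) != expected_size:
        return False
    return md5_of_file(path) == expected_md5.lower()
```

Checking the size first is a cheap way to reject truncated downloads before the whole file is hashed.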

Please note that UniProt can be downloaded from the consortium member FTP sites at three different geographical locations:

USA: ftp://ftp.uniprot.org/pub/databases/uniprot
UK: ftp://ftp.ebi.ac.uk/pub/databases/uniprot
Switzerland: ftp://ftp.expasy.org/databases/uniprot

This information can be found in our FAQ.

UniProt release 2014_06

Published June 11, 2014

Headline

Everything you always wanted to know about... sperm-egg interaction

To reach the ultimate goal of sexual reproduction, egg fertilization, sperm cells have to run an obstacle course. They have to jump, or rather to swim, through a lot of hoops and hurdles before fusing with the oocyte and forming a zygote. The very first step of this race starts after ejaculation and involves sperm capacitation, a complex process characterized by a series of structural and functional changes, leading to sperm hypermotility that allows it to swim through oviductal mucus. In the ampulla of the fallopian tube, in the immediate surroundings of the oocyte, the spermatozoon meets a hyaluronic acid-rich matrix secreted by cumulus cells that it penetrates with the help of hyaluronidase PH-20/SPAM1. The next impediment is the egg's coat, the zona pellucida. The interaction between the spermatozoon and zona pellucida leads to the acrosomal reaction, in which molecules required for penetrating the zona pellucida are secreted and molecules needed for sperm binding to the egg are exposed. Once through the coat, the sperm cell accesses the perivitelline space and eventually the egg's plasma membrane, called the oolemma. It binds to the oolemma, and the egg and sperm membranes fuse.

Although the overall fertilization process has been known for a long time, a large part of the detailed molecular mechanism is still mysterious. In 2005, Inoue et al. identified Izumo1 as the sperm-specific protein involved in egg attachment. Without Izumo1, fertilization does not occur, at least in mice. It took 9 more years to pinpoint Folr4 as the Izumo1 egg partner. Folr4 is widely conserved across mammals, including marsupials. Contrary to what its name might suggest, Folr4 is not a folate receptor, but it efficiently binds Izumo1 and hence has been renamed Juno, after Jupiter's wife (and sister). The Juno and Izumo1 interaction is an absolute requirement for fertilization. In the absence of Juno, mice display no particular phenotype in daily life, but are totally sterile, although they mate normally.

After fertilization, the egg becomes refractory to further sperm fusion events to prevent polyspermy. This process involves biochemical changes of the oolemma occurring 30-45 minutes after the initial fusion event, as well as hardening of the zona pellucida in a second phase. Juno may play a role in establishing the membrane block to polyspermy. Indeed, it is rapidly shed from the oolemma and redistributed to vesicles within the perivitelline space where it may create an area of "decoy eggs" to neutralize incoming sperm.

This discovery is not yet "everything you always wanted to know about" fertilization; for instance, it does not reveal the fusion mechanism itself. It is nevertheless a major step forward.

As of this release, human and mouse Juno proteins have been updated in UniProtKB/Swiss-Prot.

UniProtKB news

Extension of the UniProtKB accession number format

We have extended the UniProtKB accession number format to 10 alphanumerical characters by adding a third pattern for new UniProtKB accession numbers. Old UniProtKB accession numbers will not change. The valid patterns for UniProtKB accession numbers are:

Accession  1          2      3          4          5          6      7      8          9          10
old        [O,P,Q]    [0-9]  [A-Z,0-9]  [A-Z,0-9]  [A-Z,0-9]  [0-9]
old        [A-N,R-Z]  [0-9]  [A-Z]      [A-Z,0-9]  [A-Z,0-9]  [0-9]
new        [A-N,R-Z]  [0-9]  [A-Z]      [A-Z,0-9]  [A-Z,0-9]  [0-9]  [A-Z]  [A-Z,0-9]  [A-Z,0-9]  [0-9]

The three patterns can be combined into the following regular expression:

[OPQ][0-9][A-Z0-9]{3}[0-9]|[A-NR-Z][0-9]([A-Z][A-Z0-9]{2}[0-9]){1,2}
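
As a usage sketch, the expression can be applied in Python as shown below. The function name is_valid_accession is hypothetical. Note that the expression as given is not anchored, so the sketch uses re.fullmatch to require that the whole string is an accession number; with a search-style call, longer strings that merely contain an accession number would also match.

```python
import re

# Regular expression copied verbatim from the text above.
ACCESSION_RE = re.compile(
    r"[OPQ][0-9][A-Z0-9]{3}[0-9]|[A-NR-Z][0-9]([A-Z][A-Z0-9]{2}[0-9]){1,2}"
)


def is_valid_accession(text):
    """Return True if text is, in its entirety, a valid accession number."""
    return ACCESSION_RE.fullmatch(text) is not None


checks = {
    "P05067": is_valid_accession("P05067"),          # first old pattern
    "A0AUZ9": is_valid_accession("A0AUZ9"),          # second old pattern
    "A0A022YWF9": is_valid_accession("A0A022YWF9"),  # new 10-character pattern
    "A0AUZ9X": is_valid_accession("A0AUZ9X"),        # 7 characters: invalid
    "p05067": is_valid_accession("p05067"),          # lower case: invalid
}
```

A0A022YWF9 is used here only as a string that fits the new pattern, not as a reference to a particular entry.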

Changes to the controlled vocabulary for PTMs

New term for the feature key 'Modified residue' ('MOD_RES' in the flat file):

  • N6-glutaryllysine

UniProt DAS news

We have retired the SAAS data source from our DAS server.

UniProt release 2014_05

Published May 14, 2014

Headline

A flounder... on the rocks!

Some organisms, such as certain vertebrates, plants, fungi and bacteria, have to resist low, subzero temperatures. Their survival relies upon the production of antifreeze molecules. Some insects, like the beetle Upis ceramboides, tolerate freezing to -60°C in midwinter thanks to the production of a compound, called xylomannan, made of a sugar and a fatty acid and located in cell membranes. However, most organisms use antifreeze proteins (AFPs). All AFPs act by binding to small ice crystals to inhibit growth that would otherwise be fatal, but each type of AFP seems to arrive at this end by a different route.

Pseudopleuronectes americanus, commonly called 'winter flounder', is a very common variety of flounder in North America. It lives in cold water and survives thanks to the expression of the AFP Maxi. The 3D structure of the Maxi protein has been recently elucidated, unveiling some very unusual features.

Maxi belongs to the type-I AFP family and consists of a homodimer. Each monomer folds exactly in half so that its N- and C-termini are side by side, hence the dimer looks like a 4-helix rod. It is composed of tandem 11-residue repeats that exhibit the [T/I]-x3-A-x3-A-x2 motif, where x is any residue. The conserved threonine/isoleucine and alanine residues in this motif have been shown to bind ice in monomeric type-I AFPs. In the 3D structure, the internal space generated by the packing of the 4 helices in the 11-residue repeat regions is just wide enough to accommodate a single layer of water. Amazingly, the water layer that occupies the gap consists of over 400 molecules forming an extensive, mainly polypentagonal network. As is the case for most globular proteins, Maxi internal residues are nonpolar, mainly alanines, which obviously is far from optimal for hydrophilic contacts. To overcome this problem, Maxi takes advantage of its backbone carboxyl groups to anchor water molecules and the whole structure is stabilized by water-mediated hydrogen bonding rather than by direct protein association. The positioned water molecules extend outwards between all 4 helices from the core to the surface and they form a network of ordered molecules at the periphery. As a result, this rather hydrophobic protein remains highly solvated and freely soluble in flounder blood under physiological conditions, i.e. at low temperatures. When the temperature rises above 16°C, Maxi irreversibly denatures.

Another surprise came from the observation that the predicted ice-binding residues, expected to face the protein exterior, actually occur on the inward-pointing surfaces of all 4 helices where they cooperate to form and anchor the interior ordered waters. How then does Maxi bind to ice? The current working hypothesis is that the positioned water molecules that extend outwards may form a network available to merge and freeze with the quasi-liquid layer on the surface of ice.

As of this release, the winter flounder antifreeze protein Maxi has been annotated and integrated into UniProtKB/Swiss-Prot. All antifreeze proteins available in UniProtKB/Swiss-Prot can be retrieved with the keyword 'Antifreeze protein'.

UniProtKB news

Update of ECO mapping for evidence

In 2011, we started to use the Evidence Codes Ontology (ECO) to describe the evidence for UniProtKB annotations. Since then, this ontology has been extended and the GO Consortium has published a mapping of their GO evidence codes to ECO. We have adapted our mapping to ECO accordingly to have equivalent evidence codes for UniProtKB and GO annotations. How this affects different UniProtKB distribution formats is described below.

XML and DAS format

In these two formats, ECO codes are used to describe the evidence for UniProtKB annotations. In the UniProtKB XML format, an evidence is represented by an evidence element with a type attribute whose value is an ECO code. In the DAS (features) representation of UniProtKB, an evidence is represented by a METHOD element with an optional cvId attribute whose value is an ECO code.

The table below shows the mapping of previous to new ECO codes.

Previous ECO code  New ECO code
ECO:0000001        ECO:0000305
ECO:0000006        ECO:0000269
ECO:0000034        ECO:0000303
ECO:0000044        ECO:0000250
ECO:0000203        ECO:0000501 and ECO:0000256

The codes ECO:0000312 and ECO:0000313 remain unchanged.
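
Code that stores or compares the previous codes can translate them with a simple lookup. The sketch below encodes the table above. The names ECO_UPDATES and updated_eco_codes are hypothetical, and because ECO:0000203 maps to two new codes, the function returns a tuple of candidates rather than deciding which of the two applies to a given annotation.

```python
# Mapping of previous to new ECO codes, as listed in the table above.
ECO_UPDATES = {
    "ECO:0000001": ("ECO:0000305",),
    "ECO:0000006": ("ECO:0000269",),
    "ECO:0000034": ("ECO:0000303",),
    "ECO:0000044": ("ECO:0000250",),
    "ECO:0000203": ("ECO:0000501", "ECO:0000256"),
}


def updated_eco_codes(code):
    """Return the new ECO code(s) for a previous code.

    Codes that are not in the table are returned unchanged.
    """
    return ECO_UPDATES.get(code, (code,))


experimental = updated_eco_codes("ECO:0000006")
automatic = updated_eco_codes("ECO:0000203")
unchanged = updated_eco_codes("ECO:0000312")
```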

In the future, we will also use ECO:0000255 for UniProtKB annotations.

RDF format

In the UniProtKB RDF format, ECO codes are used to describe the evidence for UniProtKB and GO annotations. An evidence is represented by an evidence property whose value is an ECO code. The evidence property is part of an attribution object which is assigned to a UniProtKB or GO annotation via reification.

The table below shows the mapping of previous to new ECO codes.

GO evidence code  Previous ECO code  New ECO code
EXP               ECO:0000006        ECO:0000269
IBA               ECO:0000308        ECO:0000318
IBD               ECO:0000214        ECO:0000319
IC                ECO:0000001        ECO:0000305
IDA               ECO:0000002        ECO:0000314
IEA               ECO:0000203        ECO:0000501
IEP               ECO:0000008        ECO:0000270
IGC               ECO:0000177        ECO:0000317
IGI               ECO:0000011        ECO:0000316
IKR               ECO:0000216        ECO:0000320
IMP               ECO:0000015        ECO:0000315
IPI               ECO:0000021        ECO:0000353
IRD               ECO:0000215        ECO:0000321
ISA               ECO:0000200        ECO:0000247
ISM               ECO:0000202        ECO:0000255
ISO               ECO:0000201        ECO:0000266
ISS               ECO:0000044        ECO:0000250
NAS               ECO:0000034        ECO:0000303
ND                ECO:0000035        ECO:0000307
RCA               ECO:0000053        ECO:0000245
TAS               ECO:0000033        ECO:0000304

Cross-references for isoform sequences: RefSeq

We have added isoform-specific cross-references to the RefSeq database. The format of these cross-references is as described in release 2014_03.

Cross-references to MaxQB

Cross-references have been added to MaxQB, a database of large proteomics projects.

MaxQB is available at http://maxqb.biochem.mpg.de/mxdb/.

The format of the explicit links is:

Resource abbreviation MaxQB
Resource identifier UniProtKB accession number.

Example: Q6ZSR9

Show all entries having a cross-reference to MaxQB.

Text format

Example: Q6ZSR9

DR   MaxQB; Q6ZSR9; -.

XML format

Example: Q6ZSR9

<dbReference type="MaxQB" id="Q6ZSR9"/>

Removal of the cross-references to ProtClustDB

Cross-references to ProtClustDB have been removed.

Changes to the controlled vocabulary of human diseases

New diseases:

Modified diseases:

Deleted diseases:

  • Short rib-polydactyly syndrome 2B
  • Short rib-polydactyly syndrome 3

UniParc news

UniParc cross-references with multiple taxonomy identifiers

The UniParc XML format uses dbReference elements to represent cross-references to external database records that contain the same sequence as the UniParc record. Additional information about an external database record is provided with different types of property child elements, e.g. the species is represented with a property of the type "NCBI_taxonomy_id" that stores an NCBI taxonomy identifier in its value attribute. In the past, all external database records described a single species.

Example:

<dbReference type="REFSEQ" id="ZP_06545872" version_i="1" active="Y" version="1" created="2010-03-07" last="2013-07-18">
  <property type="NCBI_GI" value="289827083"/>
  <property type="NCBI_taxonomy_id" value="496064"/>
</dbReference>
<dbReference type="REFSEQ" id="ZP_18488583" version_i="1" active="Y" version="1" created="2012-11-25" last="2013-07-18">
  <property type="NCBI_GI" value="425085490"/>
  <property type="NCBI_taxonomy_id" value="1203546"/>
</dbReference>

With the introduction of WP-accessions in the NCBI Reference Sequence Project (RefSeq) database, UniParc needs to represent more than one species per dbReference element.

Example:

<dbReference type="REFSEQ" id="WP_001144069" version_i="1" active="Y" version="1" created="2013-07-19" last="2013-11-12">
  <property type="NCBI_GI" value="447066813"/>
  <property type="NCBI_taxonomy_id" value="496064"/>
  <property type="NCBI_taxonomy_id" value="1203546"/>
</dbReference>

This change did not affect the UniParc XSD, but may nevertheless require code changes.
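
The code change that is most likely to be needed is to stop assuming one value per property type. A minimal sketch, assuming a stand-alone dbReference element and a hypothetical helper name properties_by_type, collects every value of each property type into a list:

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# WP-accession example from above, parsed as a stand-alone element.
SNIPPET = """<dbReference type="REFSEQ" id="WP_001144069" version_i="1" active="Y" version="1" created="2013-07-19" last="2013-11-12">
  <property type="NCBI_GI" value="447066813"/>
  <property type="NCBI_taxonomy_id" value="496064"/>
  <property type="NCBI_taxonomy_id" value="1203546"/>
</dbReference>"""


def properties_by_type(ref):
    """Group property values by type, keeping repeated types in order."""
    grouped = defaultdict(list)
    for prop in ref.findall("property"):
        grouped[prop.get("type")].append(prop.get("value"))
    return dict(grouped)


props = properties_by_type(ET.fromstring(SNIPPET))
taxonomy_ids = props.get("NCBI_taxonomy_id", [])
```

A parser that keeps only the first or the last NCBI_taxonomy_id would silently drop one of the two species in this example.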

UniProt release 2014_04

Published April 16, 2014

Headline

An old unwanted guest being shown the door

Poliomyelitis causes disabling paralysis, notably in children and adolescents. It is an old plague. An early case of poliomyelitis is shown on a 3,000-year-old Egyptian stele. The disease is caused by the poliovirus, an RNA virus that colonizes the gastro-intestinal tract without any symptoms. In rare cases, the virus enters the central nervous system, preferentially infecting and destroying motor neurons, leading to muscle weakness and acute flaccid paralysis.

In the late 1940s, John Enders showed that the virus could be grown in cells cultured in vitro. This observation provided the basis for the generation of poliovirus vaccines during the 1950s. Poliomyelitis is now virtually absent in economically developed countries, and the World Health Organization is currently using the vaccine in a far-reaching plan to eradicate the poliovirus worldwide.

Polioviruses are small (30 nm), non-enveloped icosahedral viruses composed of a capsid and an 8 kb single-stranded RNA genome. Upon entry into a host cell, the poliovirus rearranges cytoplasmic membranes to create double membrane spherical vesicles in which the virus replicates, hidden from the antiviral detectors of the host cell. Once new viral particles are assembled, the host cell undergoes lysis, releasing poliovirus virions.

The poliovirus genome encodes a single polyprotein, which is processed by autocatalytic cleavage into 13 different products that ensure all viral functions from entry and replication to cell exit. The size constraint on the poliovirus genome is enormous, since it has to fit within a 30 nm-wide capsid. In this context, the polyprotein coding strategy is ideal as it allows the greatest economy of genome length versus protein end products.

In order to reduce redundancy in the knowledgebase, UniProtKB/Swiss-Prot describes all the protein products encoded by one gene in a given species in a single entry. Viral proteins are no exception to the rule. Hence, the poliovirus polyprotein is represented in a single UniProtKB/Swiss-Prot entry, which contains the description of 13 final and 4 intermediate chains.

As of this release, the Genome polyprotein entry of poliovirus type 1 (strain Mahoney) has been updated in UniProtKB/Swiss-Prot.

UniProtKB news

Cross-references for isoform sequences: Ensembl Genomes

We have added isoform-specific cross-references to the Ensembl Genomes sections EnsemblFungi, EnsemblMetazoa, EnsemblPlants and EnsemblProtists. The format of these cross-references is as described in release 2014_03.

Changes to the controlled vocabulary of human diseases

New diseases:

Modified diseases:

UniProt release 2014_03

Published March 19, 2014

Headline

Minority report

We are a minority in our own body. Over 90% of our cells are actually not human, but microbial. The majority of these microbes reside in the gut. The gut microbiota is typically dominated by bacteria, more specifically by Bacteroidetes and Firmicutes. The exact composition of gut microbiota varies between individuals and depends upon lifestyle, diet, hygienic preferences, use of antibiotics, etc. Gut microbes have a profound influence on human physiology and nutrition. Among other things, they contribute to harvesting energy from food.

All guidelines for a healthy diet emphasize the necessity of eating fruit, vegetables and whole grains. These products are rich in dietary fibers, i.e. non-starch polysaccharides, most of which cannot be digested by the hydrolases encoded by our genome. Our inherent ability to digest carbohydrates is restricted to starch and simple saccharides, not xyloglucans (XyGs), a family of highly branched plant cell wall polysaccharides, which are abundant in plants. In view of the prevalence of XyGs in our diet, the mechanism of degradation of these complex polysaccharides by bacteria was expected to be important to human energy acquisition, but until recently it was still unclear. Very interesting work by Larsbrink et al., published in February, sheds light on XyG metabolism. The authors identified a polysaccharide utilization locus (PUL) in the genome of a common human gut symbiont, Bacteroides ovatus. This PUL is transcriptionally upregulated in response to growth on galactoxyloglucan. It is predicted to contain 10 genes, including 8 that encode glycoside hydrolases. All of them were subjected to in-depth molecular characterization through reverse genetics, in vitro protein biochemistry and enzymology. Finally, the 3D structure of the endo-xyloglucanase BoGH5A, which generates short XyG oligosaccharides, was solved. This study unraveled all the details of the enzymatic pathways by which the most common dietary polysaccharides are digested in our gut.

Although XyG utilization loci (XyGULs) have been identified in only a few other gut-resident Bacteroidetes, including B. cellulosyliticus, B. uniformis, B. fluxus, Dysgonomonas mossii and D. gadei, most human beings harbor at least one of these Bacteroides XyGULs in their gut, suggesting their importance in human nutrition.

The importance of the gut microbiome goes far beyond an active role in food digestion. It also acts on intestinal function, promoting gut-associated lymphoid tissue maturation, tissue regeneration, gut motility, and morphogenesis of the vascular system surrounding the gut. It additionally affects many other physiopathological aspects, such as the nervous system and bone homeostasis. Not surprisingly, changes in the microbiota composition or a complete lack of a gut microbiota have been shown to affect metabolism, tissue homeostasis and behavior.

As of this release, manually reviewed B. ovatus XyGUL gene products are available in UniProtKB/Swiss-Prot. Let's bet that they will be followed by many more proteins encoded by our other genome(s) in the near future.

UniProtKB news

Cross-references for isoform sequences

Some of the resources to which we link contain information that is specific to an isoform sequence and where this is known we now indicate the corresponding UniProtKB isoform sequence identifier in our cross-references as described below. The first resources for which we provide such isoform-specific cross-references are Ensembl and UCSC.

Text format

The UniProtKB isoform sequence identifier is shown in square brackets at the end of the DR line as an optional field:

DR   ResourceAbbreviation; ResourceIdentifier(; AdditionalField)+. [IsoId]

Examples:

DR   Ensembl; ENST00000281772; ENSP00000281772; ENSG00000144445. [A0AUZ9-1]
DR   Ensembl; ENST00000418791; ENSP00000405724; ENSG00000144445. [A0AUZ9-2]
DR   Ensembl; ENST00000452086; ENSP00000401408; ENSG00000144445. [A0AUZ9-3]
DR   Ensembl; ENST00000457374; ENSP00000393432; ENSG00000144445. [A0AUZ9-3]
DR   UCSC; uc002vds.3; human. [A0AUZ9-1]
DR   UCSC; uc002vdt.3; human. [A0AUZ9-2]
DR   UCSC; uc002vdx.1; human. [A0AUZ9-4]
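
A reader of the text format may need to cope with DR lines both with and without the optional isoform field. The following is a minimal Python sketch of one way to do that, not an official parser. The names `CrossRef`, `DR_LINE` and `parse_dr_line` are hypothetical, and the pattern is inferred only from the examples above.

```python
import re
from typing import List, NamedTuple, Optional


class CrossRef(NamedTuple):
    """One DR line, split into its fields (hypothetical helper type)."""
    resource: str
    identifier: str
    extra: List[str]
    isoform: Optional[str]


# Assumed layout, inferred from the examples: "DR", spaces, fields separated
# by "; ", a closing period, then an optional "[IsoId]" in square brackets.
DR_LINE = re.compile(r"^DR\s+(?P<body>.+?)\.(?:\s+\[(?P<iso>[^\]]+)\])?\s*$")


def parse_dr_line(line: str) -> CrossRef:
    """Parse a DR line; the isoform is None when no [IsoId] is present."""
    match = DR_LINE.match(line)
    if match is None:
        raise ValueError(f"not a DR line: {line!r}")
    fields = [field.strip() for field in match.group("body").split("; ")]
    if len(fields) < 2:
        raise ValueError(f"too few fields in DR line: {line!r}")
    return CrossRef(fields[0], fields[1], fields[2:], match.group("iso"))


ref = parse_dr_line("DR   UCSC; uc002vds.3; human. [A0AUZ9-1]")
print(ref.resource, ref.identifier, ref.isoform)
```

Because the isoform field is optional, code that previously split DR lines on the final period alone would need a similar adjustment to avoid treating the bracketed identifier as part of the last field.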

XML format

To show the UniProtKB isoform sequence identifier in dbReference elements, we added an optional molecule element to the dbReferenceType. For consistency, we also changed the type of the molecule element that is found in the commentType. The XSD has been changed as highlighted below:

<xs:complexType name="commentType">
    ...
                <xs:sequence>
                    <xs:annotation>
                        <xs:documentation>Used in 'subcellular location' annotations.</xs:documentation>
                    </xs:annotation>
                    <!-- <xs:element name="molecule" type="xs:string" minOccurs="0"/> -->
                    <xs:element name="molecule" type="moleculeType" minOccurs="0"/>
                    <xs:element name="subcellularLocation" type="subcellularLocationType" maxOccurs="unbounded"/>
                </xs:sequence>
    ...
    <xs:complexType name="dbReferenceType">
    ...
        <xs:sequence>
            <xs:element name="molecule" type="moleculeType" minOccurs="0"/>
            <xs:element name="property" type="propertyType" minOccurs="0" maxOccurs="unbounded"/>
        </xs:sequence>
        ...
    </xs:complexType>
    ...
    <xs:complexType name="moleculeType">
        <xs:annotation>
            <xs:documentation>Describes a molecule by name or unique identifier.</xs:documentation>
        </xs:annotation>
        <xs:simpleContent>
            <xs:extension base="xs:string">
                <xs:attribute name="id" type="xs:string" use="optional"/>
            </xs:extension>
        </xs:simpleContent>
    </xs:complexType>

Examples:

<dbReference type="Ensembl" id="ENST00000281772">
  <molecule id="A0AUZ9-1"/>
  <property type="protein sequence ID" value="ENSP00000281772"/>
  <property type="gene ID" value="ENSG00000144445"/>
</dbReference>
<dbReference type="Ensembl" id="ENST00000418791">
  <molecule id="A0AUZ9-2"/>
  <property type="protein sequence ID" value="ENSP00000405724"/>
  <property type="gene ID" value="ENSG00000144445"/>
</dbReference>
<dbReference type="Ensembl" id="ENST00000452086">
  <molecule id="A0AUZ9-3"/>
  <property type="protein sequence ID" value="ENSP00000401408"/>
  <property type="gene ID" value="ENSG00000144445"/>
</dbReference>
<dbReference type="Ensembl" id="ENST00000457374">
  <molecule id="A0AUZ9-3"/>
  <property type="protein sequence ID" value="ENSP00000393432"/>
  <property type="gene ID" value="ENSG00000144445"/>
</dbReference>
<dbReference type="UCSC" id="uc002vds.3">
  <molecule id="A0AUZ9-1"/>
  <property type="organism name" value="human"/>
</dbReference>
<dbReference type="UCSC" id="uc002vdt.3">
  <molecule id="A0AUZ9-2"/>
  <property type="organism name" value="human"/>
</dbReference>
<dbReference type="UCSC" id="uc002vdx.1">
  <molecule id="A0AUZ9-4"/>
  <property type="organism name" value="human"/>
</dbReference>
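
As an illustration of how the new molecule element might be consumed, here is a minimal Python sketch that groups cross-references by isoform identifier using the standard library. The wrapping `<entry>` root, the `SNIPPET` constant and the function name `isoform_xrefs` are assumptions made for this example. The snippet is un-namespaced for brevity; if the files you process declare an XML namespace, the tag names would need to be qualified accordingly.

```python
import xml.etree.ElementTree as ET

# A shortened, un-namespaced sample built from the examples above, plus one
# cross-reference without a molecule element (the element is optional).
SNIPPET = """<entry>
<dbReference type="Ensembl" id="ENST00000281772">
  <molecule id="A0AUZ9-1"/>
  <property type="protein sequence ID" value="ENSP00000281772"/>
  <property type="gene ID" value="ENSG00000144445"/>
</dbReference>
<dbReference type="UCSC" id="uc002vds.3">
  <molecule id="A0AUZ9-1"/>
  <property type="organism name" value="human"/>
</dbReference>
<dbReference type="UCSC" id="uc002vdx.1">
  <molecule id="A0AUZ9-4"/>
  <property type="organism name" value="human"/>
</dbReference>
<dbReference type="TreeFam" id="TF328787"/>
</entry>"""


def isoform_xrefs(root):
    """Map each isoform identifier to its (resource, identifier) pairs."""
    grouped = {}
    for ref in root.iter("dbReference"):
        molecule = ref.find("molecule")
        if molecule is None or molecule.get("id") is None:
            continue  # both the element and its id attribute are optional
        pair = (ref.get("type"), ref.get("id"))
        grouped.setdefault(molecule.get("id"), []).append(pair)
    return grouped


for isoform, refs in isoform_xrefs(ET.fromstring(SNIPPET)).items():
    print(isoform, refs)
```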

Changes to the controlled vocabulary of human diseases

New diseases:

Modified diseases:

Changes to the controlled vocabulary for PTMs

New terms for the feature key 'Cross-link' ('CROSSLNK' in the flat file):

  • Isoaspartyl lysine isopeptide (Lys-Asp)

UniProt release 2014_02

Published February 19, 2014

Headline

Epigenetics in the spotlight

In its active form, folate, commonly known as vitamin B9, is a methyl carrier, essential for the biosynthesis of methionine and nucleic acids, most notably thymine, but also purine bases. Methionine synthesis involves first the activation of methionine synthase (MTR) by methionine synthase reductase (MTRR) and then the MTR-catalyzed conversion of homocysteine into methionine concomitant with conversion of 5-methyltetrahydrofolate into tetrahydrofolate. Methionine can be further modified into S-adenosyl methionine which serves as a methyl donor in the biosynthesis of cysteine, carnitine, taurine, lecithin, and phospholipids, among others.

Folate deficiency can result in many health problems, the most notable one being neural tube defects in developing embryos, but the molecular mechanism linking folate metabolism to development remains poorly understood. This is what prompted Padmanabhan et al. to create an animal model to study the impact of abnormal folate metabolism. These authors produced a mouse that contained a gene trap vector inserted in Mtrr gene intron 9. Wild-type Mtrr mRNA was still produced in spite of the insertion, but at lower levels, and folate metabolism was impaired.

When mid-gestation embryos from heterozygous intercrosses were analyzed, it appeared that about half of them displayed developmental defects typical of folate deficiency, ranging from developmental delay to neural tube and heart defects. Surprisingly, wild-type embryos were affected to a similar extent as embryos bearing the mutated gene. Inheritance of the phenotype was not dependent upon the parental genotype, but instead upon that of the maternal grandparents. In other words, Mtrr mutations in either maternal grandparent disrupted the development of their grandchildren, even when the parents and the conceptus were wild-type. These congenital abnormalities persisted in wild-type progeny in generations 4 and 5 of Mtrr mutant maternal ancestors.

What could be the mechanism of this peculiar mode of inheritance? The answer is not yet definite. Because folate plays a key role in one-carbon metabolism, the authors investigated DNA methylation. As expected, global DNA hypomethylation was observed in livers, uteri and placentas. Imprinted loci (differentially methylated regions or DMRs) in wild-type placentas of mid-gestation embryos from heterozygous maternal grandparents were also analyzed. A large proportion of the DMRs assessed in placentas of severely affected embryos had CpG site methylation levels that were statistically different from those of unrelated wild-type C57BL/6 mice. Surprisingly, however, the majority of these sites were hypermethylated and the associated genes down-regulated. There was a positive correlation between epigenetic instability and the severity of the phenotype. Hence, epigenetic instability leading to the misexpression of certain genes may be the cause of developmental phenotypes.

Epigenetic heredity has been reported for Kit and Sox9 genes. In this case, heredity was mediated by RNA, a mechanism rather unlikely for the Mtrr mutations described above. The RNA-mediated heredity observed for Kit and Sox9 required the presence of the tRNA-methyltransferase TRDMT1/DNMT2. Hence, for both phenomena, it seems that the common feature may be methylation, either at the DNA or RNA level.

While awaiting further exciting discoveries in the field of epigenetics, we have already updated MTRR entries with the current knowledge and made them available.

UniProtKB news

Change of the cross-references to PROSITE and HAMAP

The format of the cross-references to the PROSITE and HAMAP databases has been simplified in order to align it with the format of other InterPro member databases.

Text format

Changes for PROSITE:

The optional qualifiers "UNKNOWN", "FALSE_NEG" and "PARTIAL" have been removed. Only matches above the threshold were kept, i.e. cross-references with a "FALSE_NEG" or "PARTIAL" qualifier have been removed.

Examples:

A1RHR2:

Previous format:

DR   PROSITE; PS51257; PROKAR_LIPOPROTEIN; UNKNOWN_1.
DR   PROSITE; PS00922; TRANSGLYCOSYLASE; FALSE_NEG.

New format:

DR   PROSITE; PS51257; PROKAR_LIPOPROTEIN; 1.

O02781:

Previous format:

DR   PROSITE; PS00237; G_PROTEIN_RECEP_F1_1; PARTIAL.
DR   PROSITE; PS50262; G_PROTEIN_RECEP_F1_2; 1.

New format:

DR   PROSITE; PS50262; G_PROTEIN_RECEP_F1_2; 1.
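
The PROSITE examples above can be summarized as a small conversion rule. The Python sketch below is only an illustration of that rule as it appears in the examples: lines with "FALSE_NEG" or "PARTIAL" are dropped, and an "UNKNOWN_" prefix is stripped from the remaining value. The function name `convert_prosite_line` is hypothetical, and the handling of "UNKNOWN_n" values is an assumption based on the single A1RHR2 example.

```python
from typing import Optional

PREFIX = "DR   PROSITE; "
DROPPED_QUALIFIERS = {"FALSE_NEG", "PARTIAL"}


def convert_prosite_line(line: str) -> Optional[str]:
    """Return the new-format line, or None if the cross-reference is removed."""
    if not line.startswith(PREFIX) or not line.endswith("."):
        raise ValueError(f"not a PROSITE DR line: {line!r}")
    accession, entry_name, status = line[len(PREFIX):-1].split("; ")
    if status in DROPPED_QUALIFIERS:
        return None  # matches below the threshold were not kept
    if status.startswith("UNKNOWN_"):
        status = status[len("UNKNOWN_"):]  # assumed: only the count remains
    return f"{PREFIX}{accession}; {entry_name}; {status}."


old_lines = [
    "DR   PROSITE; PS51257; PROKAR_LIPOPROTEIN; UNKNOWN_1.",
    "DR   PROSITE; PS00922; TRANSGLYCOSYLASE; FALSE_NEG.",
]
for old in old_lines:
    print(convert_prosite_line(old))
```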

Changes for HAMAP:

The optional field that described the nature of signature hits ("atypical", "fused" or "atypical/fused") has been removed. Only matches above the threshold were kept, i.e. "atypical" and "atypical/fused" cross-references have been removed if their match score was below the threshold.

Example:

Q9K3D6:

Previous format:

DR   HAMAP; MF_00006; Arg_succ_lyase; 1; fused.
DR   HAMAP; MF_01105; N-acetyl_glu_synth; 1; atypical/fused.

New format:

DR   HAMAP; MF_00006; Arg_succ_lyase; 1.

XML format

Changes for PROSITE:

The optional values "UNKNOWN", "FALSE_NEG" and "PARTIAL" that were stored in a property of type match status have been removed, so that the match status value has become an integer. Only matches above the threshold were kept, i.e. "FALSE_NEG" and "PARTIAL" cross-references have been removed.

Examples:

A1RHR2:

Previous format:

<dbReference type="PROSITE" id="PS51257">
  <property type="entry name" value="PROKAR_LIPOPROTEIN"/>
  <property type="match status" value="UNKNOWN_1"/>
</dbReference>
<dbReference type="PROSITE" id="PS00922">
  <property type="entry name" value="TRANSGLYCOSYLASE"/>
  <property type="match status" value="FALSE_NEG"/>
</dbReference>

New format:

<dbReference type="PROSITE" id="PS51257">
  <property type="entry name" value="PROKAR_LIPOPROTEIN"/>
  <property type="match status" value="1"/>
</dbReference>

O02781:

Previous format:

<dbReference type="PROSITE" id="PS00237">
  <property type="entry name" value="G_PROTEIN_RECEP_F1_1"/>
  <property type="match status" value="PARTIAL"/>
</dbReference>
<dbReference type="PROSITE" id="PS50262">
  <property type="entry name" value="G_PROTEIN_RECEP_F1_2"/>
  <property type="match status" value="1"/>
</dbReference>

New format:

<dbReference type="PROSITE" id="PS50262">
  <property type="entry name" value="G_PROTEIN_RECEP_F1_2"/>
  <property type="match status" value="1"/>
</dbReference>

Changes for HAMAP:

The optional property of type flag that described the nature of signature hits ("atypical", "fused" or "atypical/fused") has been removed. Only matches above the threshold were kept, i.e. "atypical" and "atypical/fused" cross-references have been removed if their match score was below the threshold.

Example:

Q9K3D6:

Previous format:

<dbReference type="HAMAP" id="MF_00006">
  <property type="entry name" value="Arg_succ_lyase"/>
  <property type="flag" value="fused"/>
  <property type="match status" value="1"/>
</dbReference>
<dbReference type="HAMAP" id="MF_01105">
  <property type="entry name" value="N-acetyl_glu_synth"/>
  <property type="flag" value="atypical/fused"/>
  <property type="match status" value="1"/>
</dbReference>

New format:

<dbReference type="HAMAP" id="MF_00006">
  <property type="entry name" value="Arg_succ_lyase"/>
  <property type="match status" value="1"/>
</dbReference>

These changes did not affect the XSD, but may nevertheless require code changes.
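
As an example of the kind of code change that may be needed, the Python sketch below reads the match status property as an integer and returns None when the value is not numeric, so that it does not fail on files in the previous format. The function name `match_count` is hypothetical, and the sample elements are un-namespaced copies of the examples above.

```python
import xml.etree.ElementTree as ET


def match_count(db_reference):
    """Return the 'match status' value as an int, or None if absent or non-numeric."""
    for prop in db_reference.findall("property"):
        if prop.get("type") == "match status":
            value = prop.get("value", "")
            return int(value) if value.isdigit() else None
    return None


new_format = ET.fromstring(
    '<dbReference type="PROSITE" id="PS51257">'
    '<property type="entry name" value="PROKAR_LIPOPROTEIN"/>'
    '<property type="match status" value="1"/>'
    "</dbReference>"
)
old_format = ET.fromstring(
    '<dbReference type="PROSITE" id="PS00922">'
    '<property type="entry name" value="TRANSGLYCOSYLASE"/>'
    '<property type="match status" value="FALSE_NEG"/>'
    "</dbReference>"
)
print(match_count(new_format), match_count(old_format))
```

Code that relied on the HAMAP property of type flag would similarly need to treat that property as absent.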

Cross-references to TreeFam

Cross-references have been added to TreeFam, a database composed of phylogenetic trees inferred from animal genomes.

TreeFam is available at http://www.treefam.org.

The format of the explicit links is:

Resource abbreviation TreeFam
Resource identifier TreeFam unique identifier.

Example: Q8CFE6

Show all entries having a cross-reference to TreeFam.

Text format

Example: Q8CFE6

DR   TreeFam; TF328787; -.

XML format

Example: Q8CFE6

<dbReference type="TreeFam" id="TF328787"/>

Cross-references to BioGrid

Cross-references have been added to BioGrid, a public database that archives and disseminates genetic and protein interaction data from model organisms and humans.

BioGrid is available at http://thebiogrid.org.

The format of the explicit links is:

Resource abbreviation BioGrid
Resource identifier BioGrid unique identifier.
Optional information 1 Number of interactions.

Example: O46201

Show all entries having a cross-reference to BioGrid.

Text format

Example: O46201

DR   BioGrid; 69392; 1.

XML format

Example: O46201

<dbReference type="BioGrid" id="69392">
  <property type="interactions" value="1"/>
</dbReference>

Changes to the controlled vocabulary of human diseases

New diseases:

Modified diseases:

Changes to the controlled vocabulary for PTMs

New terms for the feature key 'Modified residue' ('MOD_RES' in the flat file):

  • N-methylglycine
  • N,N-dimethylglycine
  • N,N,N-trimethylglycine

Deleted term:

  • 3-hydroxyhistidine

UniRef news

Revision of the UniParc records used in the UniRef databases

We have stopped importing UniParc records that correspond to Ensembl proteome sequences into the UniRef databases, as the relevant sequences are now part of UniProtKB. Previously, some sequences from Ensembl proteomes (e.g. from Human, Chicken, Cow) were missing from UniProtKB, but we have recently completed their import into UniProtKB (see FAQ) and thus no longer need to import them via UniParc. The UniRef databases will continue to include UniParc records from the RefSeq and PDB databases that are not in UniProtKB to ensure complete coverage of sequence space.

UniProt release 2014_01

Published January 22, 2014

Headline

Mouse attacks!

In the arid lands of Arizona lives a fierce predator whose howls pierce the desert night, terrifying its prey. This predator is... a mouse, Onychomys torridus, also called the grasshopper mouse. It may sound like a tale straight out of the imagination of Tim Burton or Monty Python, but this mouse really exists. It is carnivorous and it regularly howls just before a kill, although the emitted sound is more a sustained whistle than the actual howl of a wolf. Its prey is no less astonishing, including crickets, other rodents, tarantulas and bark scorpions (Centruroides sculpturatus).

Bark scorpions are not easy prey. They are venomous and inflict intensely painful, sometimes lethal stings. Surprisingly, grasshopper mice do not seem to be seriously bothered by that, and it takes little time before the scorpion is captured, killed and eaten. How can O. torridus ignore the venom, while common house mice are sensitive to it? Overall, grasshopper mice do feel pain normally, but when they are injected with scorpion venom or with a physiological saline solution in their hind paws, they are much more irritated by the control saline solution than by the venom. In grasshopper mice, bark scorpion venom acts as an analgesic.

Venom from Buthidae scorpions initiates acute pain in sensitive mammals, such as house mice, rats and humans, by activating the voltage-gated sodium channel Nav1.7/SCN9A, but has no effect on the Nav1.8/SCN10A sodium channel. Recent experiments by Rowe et al. on freshly isolated O. torridus sensory neurons showed that, in this species, the venom strongly inhibits Nav1.8/SCN10A Na+ currents. These Na+ currents are necessary for the sustained firing and propagation of action potentials. By inhibiting Nav1.8/SCN10A, the scorpion venom blocks pain transmission to the central nervous system, and hence induces analgesia. The diametrically opposed response of rodents towards scorpion venom seems to be due to only 2 residues within the Nav1.8/SCN10A sequence. In O. torridus, a glutamate residue is found at position 859 (E-859) and a glutamine residue at position 862 (Q-862), while in species known to be sensitive to the venom, these residues are swapped: Q-859 and E-862. Site-directed mutagenesis of these 2 residues in the O. torridus sequence (E859Q/Q862E) abolished venom sensitivity. Conversely, mutation of the corresponding glutamine in Mus musculus (Q861E) conferred inhibition by C. sculpturatus venom.

Pain sensitivity is essential for survival, since it helps avoid damaging situations. Hence any change in pain perception has to be finely tuned in order not to be deleterious. O. torridus has evolved a brilliant strategy allowing it to exploit an abundant food resource in its environment, i.e. bark scorpions, while keeping intact its ability to feel the necessary pain.

Persistent pain can turn into a nightmare, and improving our understanding of pain signaling may be a tremendous help in the discovery of new analgesic drugs. Nav1.7/SCN9A is already under close investigation as a potential target for pain prevention. The new and very exciting study by Rowe et al. shows that the Nav1.8/SCN10A channel also plays a crucial role in the transmission of pain signals and may be an interesting target for analgesic development.

As of this release, the fully annotated O. torridus Nav1.8/SCN10A protein is available in UniProtKB/Swiss-Prot.

UniProtKB news

Removal of the cross-references to IPI

Cross-references to IPI have been removed.

IPI closed in 2011. The last release is archived at ftp://ftp.ebi.ac.uk/pub/databases/IPI.

The Ensembl and Ensembl Genomes projects offer access to genomic data from vertebrate and non-vertebrate species respectively.

Complete proteome data is available from UniProtKB.

The last mapping table between UniProtKB and IPI is archived at ftp://ftp.uniprot.org/pub/databases/uniprot/previous_releases/release-2014_01/.

Documents and RSS feeds for UniProt Forthcoming changes and News

We have replaced the documents sp_soon.htm ("UniProt Knowledgebase - Forthcoming changes") and xml_soon.htm ("UniProt Knowledgebase - Forthcoming changes in XML") by a searchable section Forthcoming changes on our website to announce planned changes for all UniProt data sets and file formats in one place and to provide a common RSS feed. The same information can also be downloaded from our FTP site.

Changes that have been implemented are described in our "News archive", which can be searched in the News section of our website, followed via an RSS feed and downloaded from the FTP site. This news includes the historical contents of sp_news.htm ("What's new?"), but not that of xml_news.htm ("What's new in XML?"). The latter file was renamed to xml_news_prior_2014_01.html to archive the XML changes that were implemented before 2014. This file will no longer be updated.

We have generated symbolic links on the FTP site for the files that have been replaced to give everyone time to update their FTP download procedures to the new files' locations:

New version of DASty

Our DAS web client DASty has been redesigned. DASty provides a visual representation of the compilation of protein annotations from different third-party sources. This allows users to get a global overview of all protein annotation available for their protein of interest, from UniProt as well as other sources. The "Third-party data" link that is available on each UniProtKB entry now leads to this new version of DASty. Any bookmarks should be updated accordingly. For instance, the "Third-party data" link for UniProt accession P05067 now links to http://www.ebi.ac.uk/dasty/client/index.html?q=P05067

Changes to the controlled vocabulary of human diseases

New diseases:

Modified diseases:

Changes to the controlled vocabulary for PTMs

Modified terms for the feature key 'Cross-link' ('CROSSLNK' in the flat file):

  • 1-(tryptophan-3-yl)-tryptophan (Trp-Trp) (interchain) -> 1-(tryptophan-3-yl)-tryptophan (Trp-Trp) (interchain with W-...)
  • 5'-tyrosyl-5'-aminotyrosine (Tyr-Tyr) (interchain) -> 5'-tyrosyl-5'-aminotyrosine (Tyr-Tyr) (interchain with Y-...)
  • Glycyl threonine ester (Gly-Thr) (interchain with G-...) -> Glycyl threonine ester (Gly-Thr) (interchain with T-...)

Changes to keywords

New keywords:

Modified keywords:

Deleted keyword:

  • Phage maturation

UniProt release 2013_12

Published December 11, 2013

Headline

The aflatoxin biosynthetic pathway annotated in UniProtKB/Swiss-Prot

Aflatoxins are very important members of the family of mycotoxins that contaminate food and feed crops. More than 14 different aflatoxins have been identified so far. These secondary metabolites are mainly produced by the filamentous fungi Aspergillus flavus and Aspergillus parasiticus. These organisms grow in warm and humid locations, such as those where crops (e.g. rice, maize and ground nuts) are stored.

Intake of aflatoxins has both acute and long-term effects. Acute aflatoxin poisoning leads to effects such as hemorrhagic necrosis of the liver, bile duct proliferation, edema and lethargy. In addition, aflatoxins have immunosuppressive effects and interfere with nutrient uptake, leading to malnutrition (kwashiorkor). The most toxic of the aflatoxins, aflatoxin B1, is the most potent naturally occurring carcinogen known. The carcinogenic effect of aflatoxins is mediated by 2 cytochrome P-450 enzymes, CYP1A2 and CYP3A4. CYP1A2 and CYP3A4 turn the aflatoxins into much more reactive epoxides that react with DNA bases and induce mutations, leading, in the long term, to liver cancer. Overall it is estimated that aflatoxins negatively impact up to 5 billion people who live in warm and humid climates. The presence of dietary aflatoxin is strongly associated with incidences of liver and lung cancers, HIV/AIDS, malaria, growth stunting and childhood malnutrition, and increased risk of adverse birth outcomes in Asia, Africa, and Central America.

To increase the ability to eliminate or reduce aflatoxin contamination, the mycotoxin biosynthetic pathway has been comprehensively studied. The pathway is composed of over 25 enzymatic steps, each step catalyzed by a different enzyme. 13 of these enzymes have been biochemically characterized in sufficient depth to allow the recent attribution of enzyme classification (EC) numbers.

EC numbers are part of a classification system managed by the International Union of Biochemistry and Molecular Biology (IUBMB). They are composed of 4 numbers separated by periods, which together designate both the enzyme and the chemical reaction it catalyzes. In UniProtKB, enzymes are annotated with EC numbers (in 'Names and origin', 'Protein names', 'Recommended name', see for instance pksL1 entry), when these are available.
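
To make the structure of an EC number concrete, here is a minimal Python sketch that splits a complete EC number into its 4 numeric components and looks up the top-level class from the first one. The names `EC_CLASSES`, `EC_PATTERN`, `parse_ec` and `ec_class` are hypothetical. The pattern accepts only fully specified numbers, which is a simplifying assumption: partially specified numbers would need extra handling.

```python
import re

# Top-level enzyme classes, keyed by the first component of the EC number.
# Class 7 was introduced by the IUBMB after this news item was written.
EC_CLASSES = {
    1: "Oxidoreductases",
    2: "Transferases",
    3: "Hydrolases",
    4: "Lyases",
    5: "Isomerases",
    6: "Ligases",
    7: "Translocases",
}

# Assumed format: an optional "EC " prefix followed by four period-separated numbers.
EC_PATTERN = re.compile(r"^(?:EC\s+)?(\d+)\.(\d+)\.(\d+)\.(\d+)$")


def parse_ec(ec: str):
    """Return the four components of a complete EC number as a tuple of ints."""
    match = EC_PATTERN.match(ec.strip())
    if match is None:
        raise ValueError(f"not a complete EC number: {ec!r}")
    return tuple(int(part) for part in match.groups())


def ec_class(ec: str) -> str:
    """Return the name of the top-level class for a complete EC number."""
    first = parse_ec(ec)[0]
    if first not in EC_CLASSES:
        raise ValueError(f"unknown EC class: {first}")
    return EC_CLASSES[first]


print(parse_ec("EC 1.14.13.39"), ec_class("1.14.13.39"))
```

Note that each component can have more than one digit, which is why the components are parsed as whole numbers instead of single characters.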

As of this release, the enzymes involved in aflatoxin biosynthesis have been manually annotated and are publicly available in UniProtKB/Swiss-Prot. The newly characterized enzymes from this pathway belong to oxidoreductase, transferase, hydrolase, and lyase classes of the EC classification system.

UniProtKB news

New human 1000 Genomes Project variants file

UniProt would like to announce the release of a new extension to the humsavar.txt variant catalogue. This new variant file, homo_sapiens_variation.txt.gz, supplements the set of manually curated human variants in humsavar.txt with a catalogue of novel Single Nucleotide Variants (SNVs or SNPs) from the 1000 Genomes Project for both UniProtKB/Swiss-Prot and UniProtKB/TrEMBL sequences. These variants have been automatically mapped to UniProtKB sequences, including isoform sequences, through Ensembl. In addition to defining the position and amino acid change due to each variant, the new file maps each affected UniProtKB record to the corresponding Ensembl gene, transcript and protein identifiers, provides the chromosomal location with allele change and, where possible, gives a cross-reference to OMIM for the variant. This file, along with the humsavar.txt file, can now be found in the new dedicated 'variants' directory on the UniProt FTP site. We very much welcome the feedback of the community on our efforts. In future UniProt releases, we expect to add further data sources for human variants that will include somatic variants, new data fields providing additional details concerning the variants, and variants from additional species.

Cross-references to GuidetoPHARMACOLOGY

Cross-references have been added to GuidetoPHARMACOLOGY, which provides an expert-driven guide to pharmacological targets and the substances that act on them.

GuidetoPHARMACOLOGY is available at http://www.guidetopharmacology.org/

The format of the explicit links in the flat file is:

Resource abbreviation GuidetoPHARMACOLOGY
Resource identifier GuidetoPHARMACOLOGY identifier
Example Q08460:
DR   GuidetoPHARMACOLOGY; 380; -.

Show all the entries having a cross-reference to GuidetoPHARMACOLOGY.

New cross-reference category: Chemistry

A new database category has been added: Chemistry.

Change of the category of the cross-references BindingDB, ChEMBL and DrugBank

The BindingDB, ChEMBL and DrugBank databases have been moved from the category "Other" to the category "Chemistry".

Changes to the controlled vocabulary of human diseases

New diseases:

Modified diseases:

Changes to keywords

New keywords:

Modified keyword:

Deleted keyword:

  • Inhibition of host TBK1-IKBKE-DDX3 complex by virus

UniProt release 2013_11

Published November 13, 2013

Headline

Forever young and cancer-free... in a black hole

In east African grasslands and savannas lives a most bizarre rodent: the naked mole-rat (Heterocephalus glaber). Naked mole-rats are small burrowing rodents, about the size of a mouse. They inhabit underground tunnels, where they form colonies ranging in size from 20 to 300 individuals. Naked mole-rats exhibit eusociality, a lifestyle reminiscent of that of ants or some bees. The colony is ruled by a queen; it has 1 to 3 males who breed only with the queen, while the other female members of the colony are sterile workers or soldiers. But this is not the only singularity of this amazing mammal. Among many other unexpected features, naked mole-rats exhibit exceptional longevity, some reaching ages of 30 years, about 10 times longer than ordinary mice (in a protected environment). They show negligible senescence, no age-related increase in mortality, and high fecundity until death. In addition, they are highly resistant to cancer.

In 2009, it was reported that naked mole-rats may resist cancer thanks to an extremely efficient mechanism of cell contact inhibition, called early contact inhibition (ECI). Contact inhibition is a process that arrests cell growth when cells come in contact with each other or the extracellular matrix. It is a powerful anticancer mechanism. The process of ECI causes naked mole-rat cells to arrest at a much lower density than mouse cells, and the loss of ECI makes naked mole-rat cells more susceptible to malignant transformation.

When culturing naked mole-rat fibroblasts, Tian et al. observed that the culture media became very viscous after a few days, much more so than the media conditioned by human, guinea-pig or mouse cells. This increase in viscosity was due to the increased production of an anionic, nonsulfated glycosaminoglycan: high-molecular-mass hyaluronan (HMM-HA). HMM-HA overproduction was not restricted to tissue culture conditions. It was also observed in vivo, including in brain, heart, kidney and skin. Increased HMM-HA production was due to robust synthesis, via the up-regulation of hyaluronan synthase 2 (Has2), the enzyme catalyzing HMM-HA production, combined with slower degradation, due to the down-regulation of HA-degrading enzymes.

Secreted HMM-HA binds to fibroblasts through the Cd44 cell surface receptor and triggers intracellular signaling, leading to the expression of the cyclin-dependent kinase inhibitor Cdkn2a/p16-INK4a and to the induction of ECI. In naked mole-rat cells, this signaling is further optimized, since these cells exhibit a 2-fold higher affinity for HA as compared to mouse or human cells.

HA is widely distributed and one of the main components of the extracellular matrix. The authors hypothesized that the increased HMM-HA production in the naked mole-rat could have evolved as an adaptation to a subterranean lifestyle to provide flexible skin needed to squeeze through underground tunnels. This adaptation to harsh living conditions would turn out to have additional benefits, such as contributing to cancer resistance.

As of this release, naked mole-rat Has2 has been manually annotated and is publicly available in UniProtKB/Swiss-Prot entry G5AY81.

UniProtKB news

Changes to the controlled vocabulary of human diseases

New diseases:

Modified diseases:

Deleted diseases:

  • Epileptic encephalopathy, Lennox-Gastaut type
  • Knobloch syndrome 2

Changes to the controlled vocabulary for PTMs

New term for the feature key 'Modified residue' ('MOD_RES' in the flat file):

  • N-acetylated lysine

Modified terms for the feature key 'Modified residue' ('MOD_RES' in the flat file):

  • 5-glutamyl N2-arginine -> 5-glutamyl N2-ornithine
  • 5-glutamyl N2-glutamate -> 5-glutamyl glutamate

Changes to keywords

New keyword:

UniProt release 2013_10

Published October 16, 2013

Headline

When the cat's away...

For all creatures, early detection of predators is a matter of survival. Olfaction often plays a crucial role in this regard. Odorant molecules activate specific receptors on sensory neurons. The axons from neurons expressing the same olfactory receptor come together at the same glomeruli, near the surface of the olfactory bulb of the brain. It is generally thought that odorants can be recognized by different receptors and that each glomerulus makes only a small contribution to the global representation of a given odor. However, recent discoveries suggest that the olfactory system may not be as redundant as previously thought.

Mice exhibit innate aversion to volatile amines, such as beta-phenylethylamine (PEA) and isopentylamine (IPA) that are excreted in cat urine. Trace amines robustly activate trace-amine associated receptors (TAARs). There are 15 TAAR genes in mouse. Targeted concomitant deletion of 14 of them (TAAR2 through 9) produces no apparent phenotype. Homozygous mutant mice are healthy and breed normally. The only difference from wild-type and heterozygous littermates is that their aversion to PEA and to cat urine is abolished. This effect is specific, since their response to compounds produced by red fox remains unchanged. Among TAAR genes, TAAR4 is of particular interest, since it is exquisitely sensitive to PEA, with apparent affinities rivaling those seen with mammalian pheromone receptors. Amazingly, knockout of this single gene produces a loss of aversion to PEA and to puma or lynx urine, although homozygous mutant animals still avoid other odorants, such as IPA, exactly as their wild-type and heterozygous littermates do. To our knowledge, this is the first report of an individual main olfactory receptor contributing substantially to odor perception.

This type of exciting discovery reported in the literature triggers yet another innate reaction, that of Swiss-Prot curators to update UniProtKB. The revised mouse TAAR4 entry is now publicly available.

UniProtKB news

Cross-references to PRO

Cross-references have been added to PRO (Protein Ontology), which provides an ontological representation of protein-related entities by explicitly defining them and showing the relationships between them.

PRO is available at http://pir.georgetown.edu/pro/pro.shtml

The format of the explicit links in the flat file is:

Resource abbreviation PRO
Resource identifier PRO identifier
Example O42634:
DR   PRO; PR:O42634; -.

Show all the entries having a cross-reference to PRO.

Changes to the controlled vocabulary of human diseases

New diseases:

Modified diseases:

Deleted diseases:

  • Microphthalmia, isolated, with cataract, 4

Changes to the controlled vocabulary for PTMs

New terms for the feature key 'Modified residue' ('MOD_RES' in the flat file):

  • Methionine (R)-sulfoxide
  • Methionine (S)-sulfoxide

Changes in subcellular location controlled vocabulary

New subcellular location:

UniProt release 2013_09

Published September 18, 2013

Headline

With a little help from my... Lassa virus

Dystroglycan provides a physical link between components of the extracellular matrix, including laminin, and the intracellular actin cytoskeleton. This link is crucial for a number of cellular processes, including laminin and basement membrane assembly, sarcolemmal stability, cell survival, peripheral nerve myelination, cell migration and epithelial polarization.

The dystroglycan protein is extensively glycosylated at multiple sites, and an unusual O-linked glycan is required for proper interaction with extracellular matrix ligands including laminin. Glycosyltransferases responsible for this modification were first identified using classical biochemical techniques, and mutations in the associated genes were identified in patients presenting with one of a number of dystroglycanopathies. These are a heterogeneous group of disorders characterized by muscular dystrophy that can be associated with brain anomalies, mental retardation, eye malformations, and other clinical symptoms. However, until recently, some 50% of newly diagnosed cases of dystroglycanopathy showed no significant association with variants in known glycosyltransferase genes.

To address this issue, Jae et al., 2013 developed a powerful approach to dystroglycanopathy candidate gene identification that exploits another, less beneficial property of dystroglycan. The hemorrhagic Lassa virus binds to glycosylated dystroglycan during infection, the efficiency of which depends on the glycosylation level. By using gene-trap insertion mutagenesis the authors were able to identify genes whose inactivation conferred resistance to Lassa virus infection, which by extension may include regulators of the level of dystroglycan glycosylation. These genes included all those previously known to be associated with a dystroglycanopathy, as well as several novel candidates. Exon sequencing of a panel of patients with severe dystroglycanopathy identified variants in two of them, POMK/SGK196 and TMEM5, while confirming the absence of variants in known dystroglycanopathy genes. The other candidates await further characterization.

We may be about to witness the elucidation of the underlying genetic causes of a range of dystroglycanopathies, disorders associated with defective dystroglycan modification, through the use of a deadly virus that normally targets the affected protein.

As of this release, all proteins involved in dystroglycanopathies can be retrieved from UniProtKB/Swiss-Prot with the keyword Dystroglycanopathy.

UniProtKB news

Removal of the cross-reference to Pathway_Interaction_DB

Cross-references to Pathway_Interaction_DB have been removed.

Changes to the controlled vocabulary of human diseases

New diseases:

Modified diseases:

Deleted diseases:

  • Cataract, pulverulent, juvenile-onset, MAF-related
  • 2-aminoadipic 2-oxoadipic aciduria

Changes to keywords

New keywords:

Modified keywords:

UniProt release 2013_08

Published July 24, 2013

Headline

Girls just want to have … IFNE

Interferons (IFNs) are proteins made and released in answer to the presence of pathogens, such as viruses or bacteria, that trigger the protective defenses of the immune system. In other words, they “interfere” with infections, hence their name. Within the large IFN family, type I IFNs are clustered on a defined locus on chromosome 9p21 in humans and in a region of conserved synteny on chromosome 4 in mice. Their expression is induced by the activation of signaling pathways downstream of pattern-recognition receptors and they all bind to the IFN-alpha cell surface receptor complex consisting of IFNAR1 and IFNAR2 chains, leading to the expression of a whole set of genes.

There is, however, an alien on the type I IFN locus: IFN-epsilon (IFNE). IFNE shares less than 40% amino acid identity with bona fide type I IFNs, such as IFN-alpha or IFN-beta, but it does still bind to IFNAR, as expected for a type I IFN. However, unlike any of the other family members, it is not induced by the activation of any known pattern-recognition receptor pathway, including Toll-like receptor pathways. In addition, while other type I IFNs are mainly produced by haemopoietic cells, IFNE is constitutively expressed by epithelial cells of the female reproductive tract in humans and mice. At first glance, these observations seem to challenge a potential protective function for IFNE.

In a recent publication, Fung et al. reported that IFNE expression varied approximately 30-fold at different stages of the estrous cycle in the mouse uterus, with the highest levels at estrus (when estrogen levels are high), and was reduced during pregnancy (when progesterone levels are high). Similarly, in the human endometrium, IFNE levels were highest in the proliferative phase of the menstrual cycle and lowest in postmenopausal women (when estrogen levels are low). The suspected hormonal regulation could then be confirmed in mice and in humans: IFNE is induced by estrogens and reduced by progesterone. What about IFNE function? Fung et al. demonstrated that IFNE regulates IFN-regulated genes, including IRF7 and ISG15, as well as 2'-5'-oligoadenylate synthetase. What is more, Ifne-/- female mice, whose vaginas were infected with Chlamydia muridarum or herpes simplex virus 2, had more severe clinical disease than wild-type mice, as well as higher levels of virus or bacteria at defined time points after infection. Hence IFNE seems to play an important – though local – protective role against sexually transmitted infections.

These very interesting observations may have pinpointed the cause of susceptibility to infections of the reproductive tract in women on progesterone-containing contraception, i.e. a progesterone-induced decrease in IFNE expression.

In UniProtKB/Swiss-Prot, IFNE entries have been updated accordingly.

UniProtKB news

Cross-references to GeneWiki

Cross-references have been added to GeneWiki, an initiative that aims to create seed articles for every notable human gene.

GeneWiki is available at http://en.wikipedia.org/wiki/Gene_Wiki

The format of the explicit links in the flat file is:

Resource abbreviation GeneWiki
Resource identifier GeneWiki identifier
Example Q96N67:
DR   GeneWiki; Dock7; -.

Show all the entries having a cross-reference to GeneWiki.

Change of the cross-reference GlycoSuiteDB to UniCarbKB

GlycoSuiteDB, an annotated and curated relational database of glycan structures, has been integrated into UniCarbKB, with a new user interface and added functionalities.

We therefore changed the corresponding resource abbreviation from GlycoSuiteDB to UniCarbKB.

Example P02763:

Previous flat file format:
DR   GlycoSuiteDB; P02763; -.
New flat file format:
DR   UniCarbKB; P02763; -.
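
Scripts that select DR lines by the old abbreviation need to look for the new one. As a hedged illustration, the Python sketch below rewrites the abbreviation of a DR line according to a small mapping. The names RENAMED_RESOURCES and rename_resource are hypothetical, and the mapping lists only the rename announced here.

```python
# Old-to-new resource abbreviations; only the rename announced here is listed.
RENAMED_RESOURCES = {"GlycoSuiteDB": "UniCarbKB"}


def rename_resource(line, renamed=RENAMED_RESOURCES):
    """Return a DR line with its resource abbreviation updated if renamed."""
    prefix = "DR   "
    if not line.startswith(prefix):
        return line  # leave all other line types untouched
    abbreviation, separator, rest = line[len(prefix):].partition(";")
    new_abbreviation = renamed.get(abbreviation.strip(), abbreviation)
    return prefix + new_abbreviation + separator + rest


print(rename_resource("DR   GlycoSuiteDB; P02763; -."))
```

Lines for resources that are not in the mapping are returned unchanged.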

UniProtKB/Swiss-Prot is currently linked to this resource from the cross-reference section (DR lines), but we also have some site-specific links from the sequence annotation section (FT CARBOHYD) of relevant UniProtKB/Swiss-Prot entries. An increase in the number of cross-linked entries is planned, including more literature-based glycan data from UniCarbKB.

Removal of the cross-reference to GermOnline

Cross-references to GermOnline have been removed.

Changes to the controlled vocabulary of human diseases

New diseases:

Modified diseases:

Deleted diseases:
  • Cataract, congenital, cerulean type, 3
  • Cataract, congenital, non-nuclear polymorphic, autosomal dominant
  • Cataract, cortical, age-related, 2
  • Cataract-microcornea syndrome
  • Cataract, sutural, with punctate and cerulean opacities
  • Cataract, zonular
  • Hereditary non-polyposis colorectal cancer 3
  • Leukotriene C4 synthase deficiency
  • Neuropathy, congenital amyelinating
  • Pallido-ponto-nigral degeneration
  • Platyspondylic lethal skeletal dysplasia San Diego type
  • Thromboxane synthetase deficiency
  • Weaver syndrome 2

Changes to the controlled vocabulary for PTMs

New terms for the feature key ‘Modified residue’ (‘MOD_RES’ in the flat file):
  • 5-glutamyl N2-arginine
  • 5-glutamyl N2-glutamate

Changes in subcellular location controlled vocabulary

New subcellular location:

UniProt release 2013_07

Published June 26, 2013

Headline

How to go green, or red?

Chlorophyll is the major photosynthetic pigment. It performs the essential processes of harvesting light energy in the antenna complexes and transferring this energy to the reaction centers to produce chemical energy.

The chlorophyll molecule is present in all photosynthetic organisms. It is made up of 2 moieties of distinct origin, chlorophyllide and phytol. The early enzymatic steps of chlorophyllide biosynthesis from glutamyl-tRNA to protoporphyrin IX are shared with the heme biosynthesis pathway. Hence, protoporphyrin IX is the last common reactant for the synthesis of both heme and chlorophyll. To produce chlorophyll, a magnesium chelatase (EC=6.6.1.1) inserts Mg(2+) into the protoporphyrin IX ring, while an iron chelatase (EC=4.99.1.1) inserts Fe(2+) into the ring during heme biosynthesis.

In Arabidopsis thaliana, there are 15 enzymes and 27 genes required for chlorophyll biosynthesis from glutamyl-tRNA to chlorophyll b. Nine proteins are encoded by single-copy genes, and the others are encoded by gene families consisting of two to three members. The magnesium chelatase is a complex of three subunits, CHLI, CHLD and CHLH encoded by 4 different genes. As of this release, all 27 proteins are manually annotated in UniProtKB/Swiss-Prot. They all contain the subtopic PATHWAY: Porphyrin-containing compound metabolism; chlorophyll biosynthesis in ‘General annotation (Comments)’ and the keyword Chlorophyll biosynthesis. This keyword also allows the retrieval of additional proteins involved in the regulation of the process or in the biosynthesis of the long phytol side chain, for example.

Enzymes involved in the biosynthesis of the porphyrins, common to both heme and chlorophyll, are also annotated with the comment PATHWAY: Porphyrin-containing compound metabolism; protoporphyrin-IX biosynthesis.

UniProtKB news

Changes to the controlled vocabulary of human diseases

New diseases:

Modified diseases:

UniProt release 2013_06

Published May 29, 2013

Headline

Back to the wild

Nearly half of our genome consists of mobile elements and their recognizable remnants. These elements are thought to have shaped both our genes and our entire genome, driving genome evolution. However, mobile elements can undergo ‘molecular domestication’, whereby the transposon genes are incorporated into cellular gene expression programs, but are no longer mobile. They can also evolve cellular DNA recombination functions, such as the V(D)J antigen receptor-recombination system. The human genome contains some 50 genes that were derived from transposable elements or transposons, and many are now integral components of cellular gene expression programs.

Human THAP9 is one such transposon-derived gene. It is homologous to Drosophila P element DNA transposase. Both human and Drosophila proteins show a typical site-specific DNA-binding Zn finger domain. Human THAP9 is a single-copy gene and does not contain any terminal inverted repeats or target-site duplications, indicating that it constitutes a bona fide domesticated stationary sequence. It thus came as a surprise that this gene has nevertheless retained the catalytic activity to mobilize P transposable elements in Drosophila and human cells. The physiological relevance of this observation remains elusive, but what is clear is that domesticated transposons may have retained enough “wild” properties to keep our genome on the move.

The human THAP9 entry has been updated accordingly in UniProtKB/Swiss-Prot.

UniProtKB news

Cross-references to SignaLink

Cross-references have been added to SignaLink, an integrated resource to analyze signaling pathway proteins, cross-talks, transcription factors, miRNAs and regulatory enzymes.

SignaLink is available at http://signalink.org/

The format of the explicit links in the flat file is:

Resource abbreviation SignaLink
Resource identifier UniProtKB accession number
Example Q24306:
DR   SignaLink; Q24306; -.

Show all the entries having a cross-reference to SignaLink.

Removal of the cross-reference to HSSP

Cross-references to HSSP have been removed.

Changes to the controlled vocabulary of human diseases

New diseases:

Modified diseases:

Deleted disease:
  • Ichthyosis, lamellar, 1

UniProt release 2013_05

Published May 1, 2013

Headline

Human genetic diseases in UniProtKB/Swiss-Prot

During the past decade, next-generation sequencing (NGS) technologies have accelerated the detection of genetic variants resulting in the rapid discovery of new disease-associated genes. More than 100 causative genes in various Mendelian disorders have been identified by means of whole exome sequencing. However, the wealth of variation data made available by NGS is not sufficient, alone, to understand the mechanisms underlying disease pathogenesis and manifestation. Diseases are the consequences of a series of events that include not only primary mutations in disease-causing genes, but also variations in disease-modifying genes, as well as the combined effects of gene-gene and gene-environment interactions. That is why new approaches to unravel disease mechanisms are based on biological network analysis.

In addition to providing a large amount of information on protein functions, interactions and biological pathways, UniProt pays particular attention to the annotation of human genetic diseases and disease-linked variants. Information on genetic diseases is shown in the ‘Involvement in disease’ subsection of the ‘General Annotation (Comments)’ section. In the current release, over 4,600 phenotypes are described in close to 3,000 human entries. The great majority of UniProtKB disease descriptions have links to the Online Mendelian Inheritance in Man knowledgebase (OMIM), allowing users to retrieve more detailed information.

In order to improve the clarity of medical annotation and to facilitate the retrieval of disease information from UniProtKB, we have modified the format of the subsection ‘Involvement in disease’. The newly modified subsection is organized in 2 parts. Firstly, the disease name, acronym and features are defined using a controlled vocabulary. Secondly, the role of the gene/protein in the disease is described in a ‘Note:’, that allows discrimination between disease-causing, disease-modifying and susceptibility genes. This note, partly written in free text, provides information on the biological context or other interesting information that may not be directly related to the phenotype description, such as the involvement of different proteins in the pathological mechanism. For example, multiple sulfatase deficiency (MSD) is due to the simultaneous decrease of activity of all sulfatases. However, the primary cause is a mutation in SUMF1, an enzyme required for post-translational modification and catalytic activation of these enzymes. This additional information is stored in the ‘Involvement in disease’ note.

Genetic diseases annotated in UniProtKB/Swiss-Prot are indexed in the humdisease.txt file, available for our users as of this release. Each record in this file consists of a disease identifier, acronym, and description, as well as known disease synonyms, links to OMIM, Medical Subject Headings (MeSH) and associated UniProtKB keywords.

UniProtKB news

Complete proteomes for Ensembl species

For UniProt release 2013_05, one new species from Ensembl vertebrates and 3 new Ensembl Genomes have been made available. These are:

Felis catus (Cat)
Brassica rapa subsp. pekinensis (Chinese cabbage)
Hyaloperonospora arabidopsidis (Downy mildew agent)
Magnaporthe poae (Kentucky bluegrass fungus)

In addition to the new imports, existing proteomes derived from Ensembl species have been updated with data from Ensembl release 70.
All predicted protein sequences from an Ensembl Genome are mapped to their UniProtKB counterparts under stringent conditions: 100% identity over 100% of the length of the two sequences is required. Any sequence found to be absent from UniProtKB is imported into the unreviewed component of UniProtKB, UniProtKB/TrEMBL. All UniProtKB entries that map to an Ensembl Genome are used to build the proteome; they are tagged with the keyword Complete proteome and an Ensembl Genome cross-reference is added.
We very much welcome the feedback of the community on our efforts. In future UniProt releases, we expect to make proteomes available for the remaining Ensembl and Ensembl Genomes species currently absent from UniProtKB.
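
The mapping rule described above (100% identity over 100% of the length of both sequences) amounts to exact sequence equality, so it can be illustrated without any alignment step. The Python sketch below is only an illustration under that reading and not the production pipeline. The function name map_predictions and all identifiers and sequences in the usage example are invented.

```python
def map_predictions(ensembl_proteins, uniprotkb_entries):
    """Map Ensembl protein IDs to UniProtKB accessions by exact sequence match.

    Both arguments are dicts of identifier -> amino acid sequence. Requiring
    100% identity over 100% of both lengths amounts to string equality.
    Returns (mapped, unmapped): mapped is Ensembl ID -> sorted accessions,
    unmapped lists the Ensembl IDs whose sequence is absent from UniProtKB.
    """
    by_sequence = {}
    for accession, sequence in uniprotkb_entries.items():
        by_sequence.setdefault(sequence, []).append(accession)

    mapped, unmapped = {}, []
    for ensembl_id, sequence in ensembl_proteins.items():
        if sequence in by_sequence:
            mapped[ensembl_id] = sorted(by_sequence[sequence])
        else:
            unmapped.append(ensembl_id)  # candidate for import into TrEMBL
    return mapped, unmapped


# Toy identifiers and sequences, invented for illustration only.
mapped, unmapped = map_predictions(
    {"ENSP_A": "MKTAYIAK", "ENSP_B": "MKTAYIAKQ"},
    {"ACC_1": "MKTAYIAK", "ACC_2": "MSLLT"},
)
print(mapped, unmapped)
```

In the toy data, ENSP_B differs from ACC_1 by a single extra residue and is therefore reported as unmapped, which shows how strict the condition is.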

Removal of the cross-reference to GenomeReviews

Cross-references to GenomeReviews have been removed.

Changes to keywords

New keywords:

Modified keyword:

UniProt release 2013_04

Published April 3, 2013

Headline

Major progress in adenovirus annotation

Adenoviruses were first isolated by Wallace Rowe in 1953 from adenoid tissue of sick children. These viruses infect a wide range of vertebrates, including humans. Infectious virions are spread primarily via respiratory droplets; however, they can also be spread by fecal routes. Most infections with Human Adenovirus (HAdV) result in upper respiratory tract diseases; they account for about 10% of acute respiratory infections in children. They can also cause fever, diarrhea, pink eye (conjunctivitis), bladder infection (cystitis), rash illness, etc.

HAdV are medium-sized (90-100 nm), non-enveloped icosahedral viruses composed of a capsid and a double-stranded linear DNA genome. The viral genome is approximately 36 kb long. It encodes 37 proteins, which are produced by complex alternative splicing of 6 mRNA transcription units. The viral genome replicates in the host cell nucleus, but never integrates into the host genome. This is the reason why adenoviruses are widely used in gene therapy and anticancer virus vector trials.

The JCVI adenovirus project recently resulted in the sequencing of 150 new HAdV genomes. In order to support the annotation of these new genomes, the community needs a high quality set of data that can serve as a reference. In this context, a collaboration including UniProt, NCBI, JCVI and several field experts has been initiated to update reference adenovirus genomes and proteomes. Gene predictions have been corrected with the most recent proteomic and cDNA sequencing data. This major collaborative effort has resulted in a consistent and up-to-date annotation of the viral genome in NCBI RefSeq and of the HAdV reference proteome in UniProtKB/Swiss-Prot.

UniProtKB news

Removal of MEDLINE identifiers

We have removed the MEDLINE identifiers from the bibliographic database cross-references of literature citations since they have been superseded by PubMed identifiers. The valid bibliographic database names and their associated identifiers are now:

Name      Identifier
PubMed    PubMed Unique Identifier (PMID)
DOI       Digital Object Identifier (DOI)
AGRICOLA  AGRICOLA Unique Identifier

UniProt release 2013_03

Published March 6, 2013

Headline

Latest from the prokaryotic world: bacterial Cas9, a new tool for genome engineering

The CRISPR system (Clustered Regularly Interspaced Short Palindromic Repeat) is a bacterial and archaeal, RNA-based adaptive immune system, which degrades invading genetic material. Very briefly, invading viruses or plasmids are recognized by their complementarity to CRISPR RNA (crRNA) and degraded by dedicated nucleases.

There are 3 major CRISPR systems, with a growing number of recognized subtypes depending on the Cas proteins (CRISPR-associated proteins) used to carry out the various steps of crRNA generation and invading nucleic acid destruction. In type I and III CRISPR systems, different specialized Cas endonucleases generate crRNAs, which then assemble with other Cas proteins to create large crRNA-protein complexes that recognize and degrade invading nucleic acids complementary to the crRNA. Type II CRISPR systems are a little different. In these systems, correct processing of pre-crRNA requires a trans-encoded small RNA (tracrRNA), endogenous RNase III and the Cas9 protein. The tracrRNA serves as a guide for RNase III-aided processing of pre-crRNA. Subsequently the Cas9/crRNA/tracrRNA complex endonucleolytically cleaves linear or circular dsDNA targets complementary to the crRNA. Degradation requires the Cas9 protein and both RNA species. Thus, in type II CRISPR systems, crRNA-guided degradation of DNA relies upon a single protein. This discovery has implications beyond the world of bacteria. Expressing Cas9 with specifically chosen crRNA should allow site-specific genome modifications, knocking out genes on demand not only in bacteria, where it is already relatively simple to do so, but also in higher organisms, such as vertebrates.

And indeed it works! In 2 back-to-back Science articles published online in January of this year, Streptococcus pyogenes strain SF370 Cas9 endonuclease was codon-optimized and targeted to the nucleus in human or mouse cells. In one article, RNase III was engineered in a similar fashion while the tracrRNA and pre-crRNA were expressed either separately or as a hybrid molecule, while in the other, only a hybrid crRNA-tracrRNA was expressed. In both papers, various gene targets were cloned into the crRNA locus, leading to site-specific target cleavage which was subsequently repaired by either nonhomologous end-joining or homologous recombination. While the efficiency of the process varies, introducing multiple targets within a single gene or targeting multiple genes at a time is feasible, allowing for comparatively easy manipulation of a genome of interest. Additionally, no toxicity has been observed upon expression in human cells.

A similar approach has been successfully used not only in other bacteria, but also in zebrafish, as well as in different human cell lines.

The work described above has been carried out using Cas9 from Streptococcus pyogenes strain SF370, and the corresponding UniProtKB/Swiss-Prot entry has been updated, as have the entries for experimentally characterized orthologous proteins in other bacteria (Streptococcus thermophilus strain DGCC7710, Streptococcus thermophilus strain ATCC BAA-491 / LMD-9 and Listeria innocua serovar 6a strain CLIP 11262). Additionally, a new HAMAP rule has been made for the Cas9 family (MF_01480).

UniProtKB news

Cross-references to ChiTaRS

Cross-references have been added to ChiTaRS, a database of human, mouse and fruit fly chimeric transcripts and RNA-sequencing data.

ChiTaRS is available at http://chitars.bioinfo.cnio.es/

The format of the explicit links in the flat file is:

Resource abbreviation ChiTaRS
Resource identifier gene name
Optional information 1 organism name
Example P16320:
DR   ChiTaRS; ATP6AP1; drosophila.

Show all the entries having a cross-reference to ChiTaRS.

Cross-references to SABIO-RK

Cross-references have been added to SABIO-RK, a database of biochemical reaction kinetics.

SABIO-RK is available at http://sabiork.h-its.org/

The format of the explicit links in the flat file is:

Resource abbreviation SABIO-RK
Resource identifier UniProtKB accession number
Example P10172:
DR   SABIO-RK; P10172; -.

Show all the entries having a cross-reference to SABIO-RK.

Removal of the cross-reference to 8 2D gel databases

Cross-references to 2DBase-Ecoli, Aarhus/Ghent-2DPAGE, ANU-2DPAGE, Cornea-2DPAGE, PHCI-2DPAGE, PMMA-2DPAGE, Siena-2DPAGE, and Rat-heart-2DPAGE have been removed.

Removal of the cross-reference to AGD

Cross-references to AGD have been removed.

Gene3D

The Gene3D database no longer provides names for their signatures. The entry name that has been displayed in the cross-references was therefore replaced by a dash (’-’).

Example Q12933:

Previous format:
DR   Gene3D; 2.60.210.10; TRAF-type; 1.
DR   Gene3D; 3.30.40.10; Znf_RING/FYVE/PHD; 1.
New format:
DR   Gene3D; 2.60.210.10; -; 1.
DR   Gene3D; 3.30.40.10; -; 1.
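
Code that still holds lines in the previous format can bring them in line with the new one by blanking the name field. The Python sketch below is a hedged illustration based only on the four-field layout visible in the examples above. The function name blank_gene3d_name is hypothetical.

```python
def blank_gene3d_name(line):
    """Replace the entry name of a Gene3D DR line with a dash."""
    if not line.startswith("DR   Gene3D;"):
        return line  # not a Gene3D cross-reference
    fields = line.split("; ")
    # fields: 'DR   Gene3D', signature identifier, entry name, last field
    if len(fields) == 4:
        fields[2] = "-"
    return "; ".join(fields)


print(blank_gene3d_name("DR   Gene3D; 2.60.210.10; TRAF-type; 1."))
```

Lines already in the new format pass through unchanged, so the function can be applied repeatedly without side effects.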

Changes to keywords

New keyword:

Modified keywords:

UniProt release 2013_02

Published February 6, 2013

Headline

The smoke's devils

The first written evidence of the therapeutic and psychoactive use of Cannabis is attributed to the legendary emperor of China Shen-nung who lived some 5,000 years ago. He stated in his famous herbal “Pen-ts’ao Ching” that “the fruits of hemp, if taken in excess will allow ‘seeing devils’. If taken over a long term, it makes one communicate with spirits and lightens one’s body” (in An archaeological and historical account of Cannabis in China). Until 1942, Cannabis was listed in the United States Pharmacopoeia and it was only in 1971 that most European countries banned Cannabis by adopting the Convention on Psychotropic Substances established by the United Nations.

Although marijuana has been used for centuries, the biological processes underlying its psychoactive effects have long remained a mystery. It is only recently that the cannabinoid biosynthetic pathway has been elucidated. The production of Cannabis’ major psychoactive ingredient, delta-9-tetrahydrocannabinol (THC), starts with the condensation of hexanoyl-CoA with three molecules of malonyl-CoA to yield olivetolic acid (OA). It was postulated that a type III polyketide synthase was catalyzing this reaction, although all type III PKSs from Cannabis characterized so far were only able to produce byproducts instead of OA. A few months ago, it was shown that the inability of OLS/TKS, a cloned tetraketide synthase, to synthesize OA was due to the absence of an accessory protein, olivetolic acid cyclase. In the presence of olivetolic acid cyclase, OA is synthesized. It is then geranylated to form cannabigerolic acid, which is further converted by oxidocyclase enzymes to the major cannabinoids, delta-9-tetrahydrocannabinolic acid (THCA) in “drug-type” Cannabis and cannabidiolic acid (CBDA) in “fiber-type” Cannabis. THCA and CBDA are decarboxylated by a non-enzymatic reaction during storage or smoking to give rise to their chemically neutral forms, THC (the neurologically active substance) and CBD, respectively.

Thanks to the recent publication of the complete sequence of the genome of Cannabis sativa, most enzymes involved in the THC/CBD biosynthetic pathway have been identified and manually annotated in UniProtKB/Swiss-Prot.

UniProtKB news

Cross-references to mycoCLAP

Cross-references have been added to mycoCLAP, a database of fungal genes encoding lignocellulose-active proteins.

mycoCLAP is available at https://mycoclap.fungalgenomics.ca/mycoCLAP/

The format of the explicit links in the flat file is:

Resource abbreviation mycoCLAP
Resource identifier mycoCLAP identifier
Example P55296:
DR   mycoCLAP; MAN26A_PIRSP; -.

Show all the entries having a cross-reference to mycoCLAP.

Changes to keywords

New keyword:

Modified keywords:

Changes to the controlled vocabulary for PTMs

New terms for the feature key ‘Modified residue’ (‘MOD_RES’ in the flat file):
  • (3R)-3-hydroxyarginine
  • (3S)-3-hydroxyhistidine

UniProt release 2013_01

Published January 9, 2013

Headline

Hereditary sensory and autonomic neuropathy type IA: New dietary hope?

Hereditary neuropathies are common neurological conditions characterized by progressive loss of motor and/or sensory function. There are no effective treatments. HSAN1A is one of many hereditary peripheral neuropathies, characterized by axonal degeneration and disappearance of myelin sheaths. The prominent feature of this pathology is sensory abnormalities with a variable degree of motor and autonomic dysfunction. HSAN1A patients most frequently present with decreased sensation in the feet, as well as painless blisters and ulcers, often preceded by hyperpathia and spontaneous shooting or lancinating pain. The loss of sensation, especially pain, leads to the horrible complications of unheeded infections and painless ulcers that can result in amputations of the affected extremities.

The culprits are mutations in the SPTLC1 gene. SPTLC1 is a subunit of serine palmitoyltransferase. It catalyzes the condensation of serine and palmitoyl-CoA, the initial step in the de novo synthesis of sphingolipids.

The most frequent HSAN1A mutation is found at position 133 where a cysteine residue is substituted by a tryptophan (C133W). This mutation induces a shift in the substrate specificity, allowing the condensation of alanine or glycine, instead of serine, and subsequent formation of 2 atypical deoxysphingolipids: 1-deoxy-sphinganine and 1-deoxymethylsphinganine, respectively. These metabolites lack the C1 hydroxyl group of sphinganine and can therefore neither be converted to complex sphingolipids, nor degraded by the classical catabolic pathway. Accumulation of these metabolites is toxic for sensory neurons.

In cultured cells, as well as in transgenic mice, a serine-enriched medium/diet can force the defective enzyme to use serine, hence restoring the original reaction. A pilot study in 14 human patients showed a marked decrease in plasma deoxysphingolipid levels. Unfortunately, only the biochemical effects of the diet were evaluated, while the neurological outcome was not assessed. In addition, the number of patients is too small to draw any conclusion, but it opens a door for a new potentially efficient and simple treatment for a specific type of hereditary neuropathy.

Missense neutral polymorphisms and disease-causing mutations are annotated in UniProtKB/Swiss-Prot in ‘Sequence annotation (Features)’. The SPTLC1 variant C133W has now joined some 68,000 polymorphisms reported in the knowledgebase.

UniRef news

Modification of the UniRef clustering algorithm

UniRef clusters are formed in a hierarchical fashion by the serial application of the CD-HIT algorithm to sequences from UniProtKB and selected UniParc entries. Identical sequences (and sub-fragments) are first clustered to form UniRef100. Then the longest sequence is selected from each UniRef100 cluster as input for clustering in UniRef90. Each UniRef90 cluster in turn provides its longest sequence as input for clustering in UniRef50.

Until now, UniRef90 and UniRef50 clusters have been computed only with identity thresholds of 90% and 50%, respectively. Starting with the first release of 2013, an 80% overlap threshold will also be used for the computation of UniRef90 and UniRef50 clusters. This means that the longest (seed) sequence of each UniRef90 and UniRef50 cluster will have a minimum length overlap of 80% with each of the other member sequences.

Our motivations for introducing this overlap threshold were:
  • to create tighter clusters to support use cases such as sequence similarity searches
  • to improve cluster computation performance by avoiding false positive sequence alignments that arise during clustering

Based on our analyses this change will have a minimal impact on existing cluster topologies (less than 5% increase in the number of clusters and less than 2% changes of the representative sequence) and will at the same time provide a more than five-fold gain in computation time for UniRef50.
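
The combined effect of the identity and overlap thresholds can be sketched in a few lines of Python. This is not the CD-HIT algorithm and not the UniRef pipeline: it assumes that the identity and the aligned length come from some earlier alignment step, and it interprets the overlap as the aligned length divided by the seed length. The function names select_seed and joins_cluster are hypothetical.

```python
def select_seed(sequences):
    """Pick the longest sequence of a cluster as its seed (ties: first wins)."""
    return max(sequences, key=len)


def joins_cluster(seed_length, aligned_length, identity,
                  min_identity, min_overlap=0.80):
    """Check the identity threshold and the overlap threshold together.

    identity is a fraction (0.0 to 1.0) over the aligned region;
    overlap is interpreted here as aligned_length / seed_length.
    """
    overlap = aligned_length / seed_length
    return identity >= min_identity and overlap >= min_overlap


# A 95%-identical fragment covering half of a 400-residue seed meets the
# identity threshold alone, but fails once the overlap threshold applies.
print(joins_cluster(400, 200, 0.95, min_identity=0.90))
print(joins_cluster(400, 360, 0.95, min_identity=0.90))
```

Under these assumptions, short fragments that previously joined a cluster on identity alone are now left out, which is what makes the resulting clusters tighter.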

UniProt release 2012_11

Published November 28, 2012

Headline

RALF, a growing family of plant peptide hormones

The first plant peptide hormone to be identified was systemin. Systemin regulates systemic wound signaling during herbivore and pathogen attacks. Since its discovery in 1991, several other polypeptide signals have been reported in plants, including phytosulfokines and CLAVATA3 and CLAVATA3-related proteins.

In 2001, Pearce et al. used a cell suspension culture assay to identify polypeptide hormones in plant extracts that cause alkalinization of the medium. In addition to systemins, the authors isolated a 5-kDa polypeptide from tobacco leaves that induced rapid alkalinization of the culture medium and the concomitant activation of an intracellular mitogen-activated protein kinase. The peptide has been called RALF for Rapid ALkalinization Factor. The 49-amino acid long active peptide is produced by processing of a 115-amino acid long preproprotein. Genes encoding RALF preproproteins are expressed in various tissues and organs in many different plant species. In Arabidopsis thaliana, the RALF family consists of 36 members. As in tobacco, they are produced by the processing of precursors containing signal peptides and, for some of them, the cleavage of an additional propeptide is required. The presence of disulfide bonds contributes to their stabilization after secretion. One member of the family, RALF1, has been shown to induce an intracellular Ca(2+) increase, likely caused by both Ca(2+) influx across the plasma membrane and release of Ca(2+) from intracellular stores. This mechanism could be common to other RALFs.

Further studies are needed for a better understanding of RALF functions, but as of this release, all Arabidopsis thaliana RALF family members have been manually annotated with all available information.

UniProtKB news

Cross-references to ChEMBL

Cross-references have been added to ChEMBL, a database of bioactive drug-like small molecules.

ChEMBL is available at https://www.ebi.ac.uk/chembldb

The format of the explicit links in the flat file is:

Resource abbreviation ChEMBL
Resource identifier ChEMBL identifier
Example P69332:
DR   ChEMBL; CHEMBL4259; -.

Show all the entries having a cross-reference to ChEMBL.
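
As a minimal sketch of how such a line might be read programmatically, the Python function below splits a DR line into its resource abbreviation and its remaining fields. The function name parse_dr_line is hypothetical and is not part of any UniProt tool; the sketch assumes only the layout visible in the example above (the DR line code, three spaces, semicolon-separated fields and a final full stop).

```python
def parse_dr_line(line):
    """Split a flat-file DR line into (resource, [fields]).

    Assumes the layout 'DR   Resource; field1; field2.' seen in the example;
    this is an illustrative helper, not an official parser.
    """
    if not line.startswith("DR   "):
        raise ValueError("not a DR line: %r" % line)
    body = line[5:].strip()
    # Drop the full stop that terminates the line.
    if body.endswith("."):
        body = body[:-1]
    parts = [part.strip() for part in body.split(";")]
    return parts[0], parts[1:]


resource, fields = parse_dr_line("DR   ChEMBL; CHEMBL4259; -.")
print(resource, fields)
```

The same helper should work for other cross-references that follow this layout, since it makes no assumption about the number of fields after the resource abbreviation.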

Cross-references to PaxDb

Cross-references have been added to PaxDb (Protein Abundance Across Organisms), a comprehensive absolute protein abundance database, which contains whole genome protein abundance information across organisms.

PaxDb is available at http://pax-db.org

The format of the explicit links in the flat file is:

Resource abbreviation PaxDb
Resource identifier UniProtKB accession number
Example P85829:
DR   PaxDb; P85829; -.

Show all the entries having a cross-reference to PaxDb.

Removal of the cross-reference to ECO2DBASE

Cross-references to ECO2DBASE have been removed.

Removal of the cross-reference to TIGR

Cross-references to TIGR have been removed.

New format of the documentation files yeast.txt, yeast chromosome files, pombe.txt and calbican.txt

UniProtKB provides documentation files for some key species. These files list the relevant UniProtKB/Swiss-Prot entries with information like the primary accession number and entry name, gene designations, protein length, cross-references to organism-specific databases and whether a 3D structure is available or not.

We have slightly changed the file format so that all information from one protein is now found on a single line, which should make it easier to parse these files.

The following files are affected by this change:

Yeast
Yeast chromosome I
Yeast chromosome II
Yeast chromosome III
Yeast chromosome IV
Yeast chromosome V
Yeast chromosome VI
Yeast chromosome VII
Yeast chromosome VIII
Yeast chromosome IX
Yeast chromosome X
Yeast chromosome XI
Yeast chromosome XII
Yeast chromosome XIII
Yeast chromosome XIV
Yeast chromosome XV
Yeast chromosome XVI
Candida albicans
Schizosaccharomyces pombe

UniProt release 2012_10

Published October 31, 2012

Headline

CIA: on your Genome service

Life evolved in an anaerobic world and it is thought that iron-sulfur (Fe-S) clusters played a crucial role in this process by facilitating chemical transformations. Once photosynthesis evolved, oxygen became prevalent, threatening Fe-S clusters as they are susceptible to destruction by oxidation. Despite this potential problem, Fe-S clusters are still cofactors in hundreds of proteins. They are required in virtually all organisms from bacteria to humans and are involved not only in ‘redox’ catalysis in some enzymes, but also in many other functions. Interestingly, Fe-S clusters have been found in many proteins involved in DNA repair and replication and telomere length maintenance.

In eukaryotic cells, most biosynthesis of Fe-S clusters occurs in the mitochondria, but it may also occur in the cytosol and nucleus. In the cytosol, Fe-S clusters are escorted and presented to their cytoplasmic and nuclear apoproteins by the conserved cytoplasmic iron-sulfur assembly (CIA) machinery. However, it is not clear how Fe-S clusters are transferred to target apoproteins, nor how target specificity is achieved.

Two recent and elegant publications have shown that the MMS19 protein is associated with the CIA machinery. This protein also binds a subset of cellular Fe-S proteins, specifically nuclear ones involved in DNA metabolism, including the DNA helicases RTEL1, ERCC2 and ERCC3. MMS19 is required for in vivo incorporation of iron into various DNA repair enzymes and, in the absence of MMS19, cells become more sensitive to DNA damage. The authors suggest that MMS19 functions as a platform to facilitate Fe-S cluster transfer to proteins critical for DNA replication and repair. These experiments point to the importance of Fe-S clusters for the maintenance of genome integrity and imply a central role for mitochondria in genomic DNA metabolism.

Despite their interest, we might have missed these publications had one of our users not contacted us to ask for their review and integration into UniProtKB/Swiss-Prot. We immensely value feedback and update requests, and we would like to thank all users who take the time to help us improve UniProtKB. If you would like to contribute, please use the ‘Send feedback’ button in the clickable box found at the top-right corner of each entry, and we will handle your request with high priority.

UniProtKB news

Changes to the controlled vocabulary for PTMs

New term for the feature key ‘Modified residue’ (‘MOD_RES’ in the flat file):
  • N6-crotonyl-L-lysine

UniProt release 2012_09

Published October 3, 2012

Headline

New discovery for an old virus: PA-X, influenza’s twelfth protein

Influenza A virus (IAV) remains a major cause of human mortality and morbidity due to its remarkable genetic variability, which limits vaccine effectiveness. Understanding the determinants of influenza virus molecular biology is a fundamental step for effective control of viral epidemics, and this virus has been the subject of intensive research efforts for more than 60 years. The viral RNA genome was first sequenced in the early 1980s. It comprises eight segments totaling 13.5 kb, which were thought to encode eleven proteins. The eleventh protein, PB1-F2, was characterized in 2001. Coinciding with the 30th anniversary of the first segment 3 sequence, Jagger et al. have published in Science the identification of the twelfth protein of influenza A virus. This protein is expressed through unusual ribosomal frameshifting in the polymerase acidic (PA) protein open reading frame encoded on segment 3. The frameshift product, called PA-X, comprises the endonuclease domain of the viral PA protein with a C-terminal domain encoded by the X-ORF. Its function is to repress cellular gene expression and modulate IAV virulence in a mouse infection model, acting to decrease pathogenicity. It is not surprising to discover a new open reading frame in a small RNA virus 30 years after it was first sequenced, because non-structural viral proteins are often difficult to identify in the midst of host proteins. Moreover, PA-X expression relies on an unusual ribosomal frameshift which could not be predicted. This new finding will allow a better understanding of host-virus interactions and improve the surveillance of new outbreaks.

As of this release, 87 new PA-X entries have been manually annotated in UniProtKB/Swiss-Prot.

UniProtKB news

Changes to keywords

New keywords: Modified keyword:

Changes to the controlled vocabulary for PTMs

New term for the feature key ‘Modified residue’ (‘MOD_RES’ in the flat file):
  • N6-(3,6-diaminohexanoyl)-5-hydroxylysine

UniProt release 2012_08

Published September 5, 2012

Headline

Prokaryotes do it too: CRISPR, an RNA-based adaptive immune system in UniProt

Like all other cellular organisms, bacteria and archaea are constantly bombarded by viruses, and unlike eukaryotes, many are also susceptible to infective plasmids. While we have known about defenses such as restriction-modification systems, blockage of adsorption and/or DNA injection and abortive infection for some time, new ways in which bacteria and archaea defend themselves against these infective agents have been found more recently. One of these involves the clustered regularly interspaced short palindromic repeat (CRISPR) sequences. CRISPR is an RNA-based adaptive immune system, which degrades invading genetic material. The system is mechanistically different from eukaryotic RNA interference (RNAi) and the proteins involved in prokaryotes are not homologous to those in eukaryotes (review).

CRISPRs are repetitive loci on the genome consisting of unique sequences 20-50 bases long (the spacer sequences) interspaced with repeated sequences of about the same length. Examination of the spacer sequences has shown that some are identical to viral and plasmid sequences; they are thought to serve as a “memory” of a previous infection. Bacteria and archaea can have from 0 to 18 CRISPR loci, with between 2 and 249 repeat-spacer units. While many pathogens have CRISPR loci, obligate parasites do not. The CRISPR loci are transcribed and processed to give short CRISPR-derived RNA (crRNA) complementary to a previously-encountered infective agent. It is this crRNA that is at the heart of adaptive prokaryotic immunity.

There are a large number of proteins associated with CRISPR loci, the operon-encoded CRISPR-associated or Cas proteins. The Cas proteins present in each locus have allowed the definition of 3 major CRISPR-Cas systems with further division into a number of subtypes. The number of subtypes will probably continue to increase as more prokaryotic genomes are fully sequenced. Many of these proteins are predicted to be nucleases, helicases and/or RNA binding proteins as is to be expected given the function of CRISPR.

There are 3 stages in CRISPR-Cas mediated immunity:

Stage 1, adaptation or acquisition, is the least well characterized. A short piece of DNA homologous to an invading agent is integrated into the 5’ end of the CRISPR loci. This requires the metal-dependent Cas1 endoribonuclease, the only Cas protein found in all organisms with CRISPR loci, although almost all organisms also encode Cas2, another metal-dependent endoribonuclease which is also thought to be involved in adaptation.

Stage 2, expression or crRNA biogenesis, requires transcription and processing of the CRISPR loci to produce the crRNA. Type I CRISPR systems use one of the related, metal-independent Cas6, Cas6e or Cas6f endoribonucleases to process the precursor, while type III systems use endogenous RNase III to generate the crRNA. It is not yet known which protein produces crRNA in type II systems.

Stage 3, interference, is the destruction of the target (be it virus or plasmid) and is performed by a complex of crRNA and proteins. While this complex is generally thought to recognize invading DNA, the type III-B CRISPR system of Pyrococcus furiosus cleaves target RNA.

While CRISPR-Cas systems can now be assumed to be involved in adaptive immunity, there are tantalizing hints that they may perform other functions as well. In Pseudomonas aeruginosa UCBPP-PA14, the type I CRISPR system does not confer resistance to phages DMS3 or MP22, but is required for DMS3-dependent inhibition of biofilm formation and possibly motility, while in Myxococcus xanthus, a CRISPR system is involved in the regulation of fruiting body development.

We have recently annotated and updated characterized Cas proteins in UniProtKB/Swiss-Prot, although the field moves so quickly that it is impossible to be fully up-to-date with all the latest research. All manually annotated CRISPR-associated protein entries can be retrieved from UniProtKB/Swiss-Prot using the query term ‘CRISPR’ in ‘Protein name’.

UniProtKB news

Cross-references to GenomeRNAi

Cross-references have been added to GenomeRNAi, a database containing phenotypes from RNA interference (RNAi) screens in Drosophila and Homo sapiens.

GenomeRNAi is available at http://genomernai.de/GenomeRNAi/

The format of the explicit links in the flat file is:

Resource abbreviation GenomeRNAi
Resource identifier GenomeRNAi identifier
Example Q9BXP5:
DR   GenomeRNAi; 51593; -.

Show all the entries having a cross-reference to GenomeRNAi

Cross-references to UniPathway

Cross-references have been added to UniPathway, a fully manually curated resource for the representation and annotation of metabolic pathways.

UniPathway provides explicit representations of enzyme-catalyzed and spontaneous chemical reactions, as well as a hierarchical representation of metabolic pathways. All of the pathway data in UniPathway has been extensively cross-linked to existing pathway resources such as KEGG and MetaCyc, as well as sequence resources such as UniProtKB, for which UniPathway provides a controlled vocabulary for pathway annotation.

The format of the explicit links in the flat file is:

Resource abbreviation UniPathway
Resource identifier UniPathway pathway ID (UPA)
Optional information UniPathway enzymatic reaction ID (UER)
Examples Q8LL69:
DR   UniPathway; UPA00842; -.
Q9M6F0:
DR   UniPathway; UPA00842; UER00808.

Show all the entries having a cross-reference to UniPathway
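
The optional field can be handled explicitly when reading these lines. The Python sketch below is illustrative only: the names UniPathwayXref and parse_unipathway are hypothetical, and the code assumes that a dash in the last field means that no enzymatic reaction ID is given, as the two examples above suggest.

```python
from typing import NamedTuple, Optional


class UniPathwayXref(NamedTuple):
    pathway_id: str              # UniPathway pathway ID (UPA)
    reaction_id: Optional[str]   # enzymatic reaction ID (UER), None if '-'


def parse_unipathway(line):
    """Parse a 'DR   UniPathway; UPA...; UER... or -.' line (assumed layout)."""
    prefix = "DR   UniPathway;"
    if not line.startswith(prefix):
        raise ValueError("not a UniPathway DR line: %r" % line)
    # Remove trailing whitespace and the terminating full stop, then split.
    body = line[len(prefix):].rstrip().rstrip(".")
    fields = [field.strip() for field in body.split(";")]
    if len(fields) != 2:
        raise ValueError("unexpected number of fields: %r" % line)
    pathway, optional = fields
    return UniPathwayXref(pathway, None if optional == "-" else optional)


print(parse_unipathway("DR   UniPathway; UPA00842; -."))
print(parse_unipathway("DR   UniPathway; UPA00842; UER00808."))
```

Returning None for the dash keeps the absence of a reaction ID distinct from any real identifier, which may be convenient when the results are filtered later.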

Changes to keywords

New keywords:

UniProt release 2012_07

Published July 11, 2012

Headline

To pee or not to pee

There is a season and a time for every purpose. There is a time to sleep, and Nature has done its best to keep it from being interrupted by an urgent need to urinate. During sound sleep, healthy humans produce less urine than during the daytime and also store more urine, as if the bladder had an increased capacity at night. This is not simply due to the fact that we usually drink less at night, since temporal variation in urine production is maintained in subjects who take food and drink equally during 24 hours. This phenomenon is also observed in rodents, with an inverted clock, the active phase being at night and the resting phase during the day.

The contraction of smooth muscles of the urinary bladder on a sensation of fullness leads to micturition. This event is precisely controlled by regulation of the central and peripheral nerves. It has previously been reported that an increase in connexin-43/GJA1 enhances intercellular electrical and chemical transmission and sensitizes the response of bladder muscles to cholinergic neural stimuli. Connexin-43 is a gap junction protein expressed in the urinary bladder. Gap junctions are channels that directly connect the cytoplasm of two cells, allowing various molecules and ions to pass freely between cells and hence establishing a direct chemical and electrical communication between cells. An increase in connexin-43 levels leads to enhanced intercellular communication and a better response of bladder smooth muscle cells to signals from the nervous system.

Does connexin-43 link urinary bladder capacity to the circadian clock? The answer came from a recent publication by Negoro et al. The authors measured micturition frequency and urine volume using wild-type and heterozygous connexin-43 knockout mice. Both genotypes show the typical day/night variation, but, while the total urine volume is not significantly different, the heterozygous connexin-43 knockout animals exhibit a higher urine volume voided per micturition. This suggests that connexin-43 does not influence the urine volume, but determines the functional capacity of the urinary bladder. Interestingly, connexin-43 expression exhibits a circadian rhythm. mRNA levels peak at the beginning of the active phase and drop by the end of the night, closely followed by protein levels. Circadian connexin-43 expression seems to be transcriptionally regulated by the direct binding of NR1D1/Rev-erbA-alpha to SP1 sites in a biological clock-dependent manner. Connexin-43 expression levels closely correlate with cell-cell communication rates and show an inverse correlation with urine volume per micturition.

Now the pieces of the puzzle give a coherent, although probably still partial, picture: bladder muscle cells have an internal rhythm that generates an oscillation in gap junction function. During the active phase, the intercellular communication is optimal, the sensation of bladder fullness is readily perceived, and animals frequently urinate small volumes. When resting, the decrease in gap junctions leads to a decreased sensitivity to neuronal signals and hence to an increase in bladder capacity. This limits disturbance of sleep by micturition.

As of this release, this new information has been annotated in connexin-43/GJA1 UniProtKB/Swiss-Prot entries.

UniProtKB news

Removal of the cross-reference to CMR

Cross-references to CMR have been removed.

Changes to keywords

New keywords: Modified keywords: Deleted keyword:
  • Dephosphorylation of host translation factors by virus

Changes in subcellular location controlled vocabulary

New subcellular location:

UniProt release 2012_06

Published June 13, 2012

Headline

Fungal prion proteins – disease or evolutionary motor?

The word “prion”, coined in 1982 by Stanley B. Prusiner, is derived from the words “protein” and “infection”. It is used to describe the infectious, non-chromosomal genetic elements that are at the heart of the mammalian transmissible spongiform encephalopathies (TSEs, including scrapie of sheep, “Mad cow disease”, and Creutzfeldt-Jakob disease of humans). It is believed that these diseases are caused by the self-propagating conformational change of a protein, PRNP, or its assembly into an amyloid form.

A prion is an infectious agent made of a protein in a misfolded form. This altered inactive form converts its normal active counterpart into the same inactive form. Three distinct genetic traits have been defined that must be satisfied by a prion: 1. “Curing” of a prion is reversible. In the appropriate conditions, for instance in the absence of a specific molecular chaperone, the protein can reacquire its active conformation. The prion form can nonetheless arise again de novo because the protein is still present in the cell. 2. Overproduction of the protein should increase its frequency of conversion to the prion (infectious) form, whatever the mechanism. 3. Prions being inactive forms of physiologically occurring proteins, the protein-encoding gene should be necessary for propagation of the prion, and inactivating mutations of this gene could produce a similar phenotype to that observed in the presence of the prion. Based on these criteria, prion proteins were also identified in fungi, primarily in the yeast Saccharomyces cerevisiae, for example the well-studied [PSI+], [URE3], and [PIN] prions. These classical, amyloid-forming prion proteins provide an excellent model for the understanding of the disease-forming mammalian prions.

In recent years, several additional fungal prion proteins have been identified. Their study provided two fundamental insights into prion biology. First, a protein does not need to form amyloid aggregates to be infectious. Other mechanisms like covalent autoactivation of an enzyme ([beta]) or even the interaction between two proteins ([GAR+]) can turn proteins into prions. But even more interesting is the fact that some of the fungal prions are not associated with any disease state, but may even have a beneficial role for the host. The Podospora anserina [Het-s] prion confers heterokaryon incompatibility, a process that ensures that during spontaneous, vegetative cell fusion only compatible cells from the same colony survive (non-self-recognition). In S. cerevisiae, the prevalence of transcriptional regulators (Cyc8, Mot3, Sfp1, Swi1 and Ure2) among the yeast prions led to the speculation that prion properties of transcription factors may generate an optimized phenotypic heterogeneity that buffers yeast populations against diverse environmental insults. Even more recent results on the adaptation of cells to anti-fungal drugs by the prion form of the mitochondrial tRNA dimethylallyltransferase ([MOD+]) show that this may also be true for enzymes and support the hypothesis that fungal prions may be beneficial for the host and contribute to cellular adaptation in living organisms.

As of this release, all prion-forming fungal proteins known to date have been reviewed and updated, with a special emphasis put on the prion-forming mechanism and on the consequences and phenotypes of the intracellular prion form. To make a clear distinction between the prion form characteristics and the physiological properties of the soluble cellular protein, the annotations dealing with the prion form have been integrated in a separate subsection (‘Miscellaneous’) in ‘General annotation (Comments)’.

Fungal prion-forming proteins can be retrieved using the keyword ‘Prion’.

UniProtKB news

Complete proteomes for Ensembl Genomes species

Ensembl Genomes species were made available for the first time in UniProt release 2012_04.

For UniProt release 2012_06, 5 new Ensembl Genomes species have been made available:

Amphimedon queenslandica
Gibberella zeae
Brachypodium distachyon
Glycine max
Oryza glaberrima

All predicted protein sequences from an Ensembl Genome are mapped to their UniProtKB counterparts under stringent conditions: 100% identity over 100% of the length of the two sequences is required. Any sequence found to be absent from UniProtKB is imported into the unreviewed component of UniProtKB, UniProtKB/TrEMBL. All UniProtKB entries that map to an Ensembl Genome are used to build the proteome; they are tagged with the keyword Complete proteome and an Ensembl Genome cross-reference is added.
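
The mapping rule can be illustrated with a short Python sketch. Since 100% identity over 100% of the length of both sequences amounts to the two sequences being the same string, an exact-match lookup is enough for illustration. The function name, the dictionary-based inputs and the identifiers in the example are all hypothetical; this is not a description of the actual UniProt pipeline.

```python
def map_to_uniprotkb(predicted_proteins, uniprotkb_entries):
    """Map predicted proteins to UniProtKB entries with identical sequences.

    predicted_proteins: dict of protein ID -> amino acid sequence
    uniprotkb_entries:  dict of accession  -> amino acid sequence
    Returns (mapped, unmapped): 'mapped' is a dict of protein ID -> sorted
    list of matching accessions, 'unmapped' is a list of protein IDs whose
    sequence is absent (candidates for import in this sketch).
    """
    # Index UniProtKB sequences so that each lookup is a dictionary access.
    by_sequence = {}
    for accession, sequence in uniprotkb_entries.items():
        by_sequence.setdefault(sequence.upper(), []).append(accession)

    mapped, unmapped = {}, []
    for protein_id, sequence in predicted_proteins.items():
        hits = by_sequence.get(sequence.upper())
        if hits:
            mapped[protein_id] = sorted(hits)
        else:
            unmapped.append(protein_id)
    return mapped, unmapped


# Made-up identifiers and sequences, for illustration only.
predicted = {"EG0001": "MKTAYIAK", "EG0002": "MKTAYIAKQ"}
existing = {"X00001": "MKTAYIAK"}
print(map_to_uniprotkb(predicted, existing))
```

In this toy input the second protein differs from the existing entry by one extra residue, so it is not mapped, which mirrors the requirement that identity must hold over the full length of both sequences.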

We very much welcome the feedback of the community on our efforts. In future UniProt releases, we expect to make proteomes available for the remaining Ensembl Genomes species currently absent from UniProtKB.

Genome submission for Bos taurus updated to be in line with Ensembl

The underlying genome submission for Bos taurus has been updated to be in sync with the third party assembly of the genome used by Ensembl for their annotations. For details of the Ensembl assembly for Bos taurus, see the Ensembl website.

Changes to cross-reference to PhosSite

The resource identifiers of the cross-references to the Phosphorylation Site Database for Archaea and Bacteria (PhosSite) have changed from a UniProtKB primary accession number to a Phosphorylation Site Database unique identifier for a phosphoprotein.

Example:
Previous format:
DR   PhosSite; P08839; -.
New format:
DR   PhosSite; P0810428; -.

Show all the entries having a cross-reference to PhosSite.
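
Scripts that read PhosSite cross-references may need to tell the two identifier styles apart. The Python sketch below is one possible approach, not an official method: it checks whether the identifier has the shape of a UniProtKB accession number, using the accession pattern published in the UniProt help pages, and treats anything else as a new-style identifier. The function name phossite_id_style is hypothetical, and the assumption that new-style identifiers never look like accession numbers rests only on the single example above.

```python
import re

# Pattern for UniProtKB accession numbers, as given in the UniProt help pages.
ACCESSION_RE = re.compile(
    r"[OPQ][0-9][A-Z0-9]{3}[0-9]|[A-NR-Z][0-9](?:[A-Z][A-Z0-9]{2}[0-9]){1,2}"
)


def phossite_id_style(dr_line):
    """Return 'previous' if the PhosSite identifier looks like a UniProtKB
    accession number, otherwise 'new' (an assumption, see the text above)."""
    prefix = "DR   PhosSite;"
    if not dr_line.startswith(prefix):
        raise ValueError("not a PhosSite DR line: %r" % dr_line)
    # The resource identifier is the first field after the abbreviation.
    identifier = dr_line[len(prefix):].split(";")[0].strip()
    return "previous" if ACCESSION_RE.fullmatch(identifier) else "new"


print(phossite_id_style("DR   PhosSite; P08839; -."))
print(phossite_id_style("DR   PhosSite; P0810428; -."))
```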

UniProt Gene Ontology Annotation

UniProt is a central member of the Gene Ontology Consortium, an initiative founded in 1998 to develop and use a set of ontologies to represent three aspects of biology carried out by gene products from any organism. Terms within the Gene Ontology (GO) describe those molecular functions and biological processes that gene products carry out and the subcellular locations in which they are located.

UniProt curators contribute manual GO annotations to proteins from a wide range of species. In addition, to ensure that UniProt provides a comprehensive GO annotation resource and to avoid duplication of effort, GO annotations are also integrated from more than 30 external model organism and multi-species databases including dictyBase, EcoCyc, FlyBase, Gramene, Human Protein Atlas, IntAct, LifeDB, MGI, PomBase, Reactome, RGD, TAIR, SGD, WormBase and ZFIN.

High-quality automatic GO annotations are also supplied to the UniProt GO annotation set by Ensembl, EnsemblGenomes, InterPro and UniProt prediction pipelines. Such automatic pipelines differently exploit gene orthology data, protein sequence signatures and existing cross-references or keywords from external controlled vocabularies, to infer that a protein has a particular function or subcellular location. The inclusion of such high-quality, automatic annotation predictions ensures the UniProt GO annotation dataset supplies functional information to a wide range of proteins, including those from poorly characterised, non-model organism species. In the May 2012 UniProt release, a total of 125 million GO annotations are supplied for 14.8 million proteins from more than 338,000 taxonomic groups.

GO annotations are present in the ‘Ontologies’ section of UniProtKB entries (see for example P09960) and are available to download from the GOA ftp site. GO annotations can additionally be viewed via the QuickGO browser. We are pleased to announce an addition to the UniProt GO automatic annotation pipelines: UniPathway2GO.

New UniPathway2GO pipeline

In collaboration with the SIB Swiss Institute of Bioinformatics, INRIA (Rhone-Alpes) and Laboratoire d’Ecologie Alpine (Grenoble), UniProt is pleased to announce the inclusion of an additional 113,285 GO annotations that describe the pathway(s) in which 105,041 UniProtKB entries are involved.

UniPathway is a manually curated resource of enzyme-catalyzed and spontaneous chemical reactions that provides a hierarchical representation of metabolic pathways.

Currently 425 UniPathway pathway terms have been manually mapped to GO terms and 48% of these annotations apply a GO term that either uniquely describes a protein’s involvement in a certain process, or supplies a more granular term than is supplied by other automatic annotation methods.

UniProt release 2012_05

Published May 16, 2012

Headline

Sex by deception

All is fair in love and war and… species survival, including the most brazen cheating. In this context, strategies developed by orchids of the genus Ophrys to attract pollinators are astounding. While the majority of flowering plants achieve pollination by exploiting the food-seeking behavior of animals, Ophrys uses alternative ploys that exploit their mate-seeking behavior. These beautiful flowers imitate female insects to attract males, predominantly male hymenoptera. They mimic the insect body through one modified petal, called the labellum, but the misleading cues are not only visual and tactile: they are also chemical. During development the Ophrys labellum accumulates substances that mimic sex pheromones – which consist mostly of cuticular hydrocarbons, such as alkanes and alkenes – that induce the pollinator to attempt mating (pseudocopulation) with the labellum. During pseudocopulation, pollen becomes attached to the hapless suitor, which transfers this pollen to other flowers when it is once again enticed into pseudocopulation.

This pollination system is highly specialized, with each orchid species targeting a single pollinator with chemical cues consisting of alkenes whose specificity is determined by the precise position of the double bonds. This allows even closely related Ophrys species, living in the same environment and in the absence of geographic barriers, to remain reproductively separated, since they attract different insects. The enzymes involved in Ophrys alkene synthesis have been recently identified. The SAD2 desaturase has the catalytic activity and tissue-specific expression pattern (i.e. preferentially in the labellum) expected for a determinant of pollination specificity. Small differences in the expression level or sequence of SAD2 homologs could explain the observed differences in desaturation among Ophrys species, and hence the selective attraction of specific pollinators. Although alignments of orthologous SAD2 sequences from Ophrys sphegodes and O. exaltata indicate striking identity, as yet uncharacterized variations could conceivably affect the precise reaction products.

As of this release, SAD2 gene products have been manually annotated in UniProtKB. They can be retrieved by searching the Swiss-Prot section for SAD2 in ‘Gene names’ (gene:SAD2 AND reviewed:yes). Both available sequences (from O. sphegodes and O. exaltata) can be selected and aligned directly from the search output.

UniProtKB news

Update to Reference proteomes in UniProtKB

With the significant increase in the number of complete genomes sequenced, it is critically important to organize this data in a way that allows users to effectively navigate the growing number of available complete proteome sequences. In collaboration with Ensembl and the NCBI Reference Sequence collection, UniProt began this organization by defining a set of ‘reference proteomes’. These were first introduced in UniProt release 2011_09 and the keyword ‘Reference proteome’ was created to allow their easy retrieval.

The number of reference proteomes has grown from 455 in UniProt release 2011_09 to 549 in release 2012_05. The proteomes have been selected to provide broad coverage of the tree of life, and constitute a representative cross-section of the taxonomic diversity to be found within UniProtKB.

The reference proteome set will be continuously reviewed as new proteomes of interest become available and as existing taxonomic classifications are revised. We would very much welcome feedback on our current list of reference proteomes and suggestions for new candidates via help@uniprot.org.

Link to complete and reference proteomes.

UniProt release 2012_04

Published April 18, 2012

Headline

Of serpents, humans and pain

Pain can be viewed as an indispensable communication tool to warn us that something is wrong, and to help us minimize physical harm to our body. Congenital insensitivity to pain leads to severe problems. Although pain is very useful, persistent pain can turn into a nightmare and the spontaneous reaction of most persons is to seek relief – often by reaching for painkillers. Understanding the mechanism of nociception could help develop treatments that provide relief for millions of people.

Surprisingly, a hint may come from a predator: the Texas coral snake. This beautiful snake with black, yellow and red banding lives in the southern United States and throughout most of Mexico. In the absence of antivenom treatment, the fatality rate of coral snake envenomations is estimated at 10%. Death is primarily due to respiratory or cardiovascular failure. In addition, coral snake bite causes excruciating and unremitting pain.

The culprit is MitTx, a venom toxin active as a heterodimer made of MitTx-alpha and MitTx-beta. MitTx-alpha contains a BPTI/Kunitz domain, found in many protease inhibitors. MitTx-beta belongs to the phospholipase A2 (PLA2) family, but it lacks critical catalytic residues normally found in the active site of related PLA2 enzymes and has been shown to be inactive as a phospholipase. The MitTx heterodimer activates acid-sensing ion channels (ASICs). ASICs are voltage-independent channels expressed in neurons and activated by acid. They are preferentially permeable to Na+, but to a lesser extent can also conduct other cations, such as Ca2+, K+, Li+ and H+. Physiologically, ASICs can be triggered by tissue injuries, inflammation or build-up of lactic acid. This alert system is hijacked by coral snake venom. Whereas protons elicit very transient responses, those evoked by MitTx are dramatically prolonged, reflecting both lack of desensitization and slow reversibility after washout.

At neutral pH, the most robust toxin-evoked responses are observed with the ACCN2 ASIC subtype. However, if the extracellular pH drops below neutrality, the toxin becomes an excellent ACCN3 agonist, essentially enhancing the potency of protons by three orders of magnitude.

Brazilian coral snake venom also activates ACCN2 expressing cells. This very channel had already been shown to be targeted by the PcTx1 toxin from the Trinidad chevron tarantula. In this case, the toxin does not activate the channel by itself, but rather serves as a functional antagonist of proton-evoked responses by locking the channel in a desensitized state.

Animal toxins often act on very restricted targets and have proven to be extremely useful tools for basic research. The identification of MitTx should allow further investigation of the role of ASICs in pain signaling, and eventually the development of new analgesics.

For more information on toxins in UniProtKB, see the Animal toxin annotation program.

UniProtKB news

Complete proteomes for Ensembl Genomes species

The sources of the UniProtKB complete proteomes are genomes in INSDC and Ensembl and now, to further increase the taxonomic coverage, species from Ensembl Genomes will also be incorporated. Ensembl Genomes aims to work with all sections of the scientific community to represent the best annotation for every genome. Its role varies according to the species, from displaying the genome assembly, gene prediction and functional annotation, through to providing a portal through which genomic data from model organism and community databases can be visualised and analysed in their wider context, and also integrated with other data stored in the core repositories maintained by the EBI.

The new species are:
Caenorhabditis japonica
Phytophthora ramorum
Pristionchus pacificus
Strongylocentrotus purpuratus

All predicted protein sequences from an Ensembl Genome are mapped to their UniProtKB counterparts under stringent conditions: 100% identity over 100% of the length of the two sequences is required. Any sequence found to be absent from UniProtKB is imported into the unreviewed component of UniProtKB, UniProtKB/TrEMBL. All UniProtKB entries that map to an Ensembl Genome are used to build the proteome; they are tagged with the keyword Complete proteome and an Ensembl Genomes cross-reference is added.
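
The mapping rule above (100% identity over 100% of the length of both sequences) amounts to exact sequence equality, so it can be illustrated with a simple lookup. The following is a minimal sketch only, not the UniProt pipeline: the function name, the dictionary inputs and the assumption that both sets have already been restricted to the same species are hypothetical.

```python
def map_to_uniprotkb(ensembl_proteins, uniprotkb_entries):
    """Split predicted proteins into mapped ones and ones to import.

    ensembl_proteins:  dict of Ensembl Genomes protein ID -> sequence
    uniprotkb_entries: dict of UniProtKB accession -> sequence
    Both inputs are assumed (hypothetically) to cover a single species.
    """
    # Index UniProtKB by sequence; several entries may share one sequence.
    by_sequence = {}
    for accession, sequence in uniprotkb_entries.items():
        by_sequence.setdefault(sequence.upper(), []).append(accession)

    mapped = {}     # Ensembl ID -> accessions with an identical sequence
    to_import = []  # Ensembl IDs with no identical UniProtKB sequence
    for ensembl_id, sequence in ensembl_proteins.items():
        # Full-length, 100% identity is the same as string equality.
        hits = by_sequence.get(sequence.upper())
        if hits:
            mapped[ensembl_id] = sorted(hits)
        else:
            to_import.append(ensembl_id)
    return mapped, to_import
```

In this sketch, the IDs returned in `to_import` stand for the sequences that would be imported into UniProtKB/TrEMBL, and the accessions in `mapped` for the entries that would be used to build the proteome.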

We very much welcome the feedback of the community on our efforts. In future UniProt releases, we expect to provide proteomes for the remaining Ensembl Genomes species currently absent from UniProtKB.

Update of Complete proteomes with Ensembl release 66

Ensembl release 66 was made available at the end of February 2012 and, in response, the appropriate complete proteomes have been updated in UniProtKB. Of note, the human reference proteome has grown in size by just over 8,000 new UniProtKB entries. This growth is a consequence of the following updates:

  • Incorporation of the latest set of cDNAs from the European Nucleotide Archive and NCBI RefSeq. A total of 224,907 cDNAs are aligned to the current genome showing an increase of 491 cDNAs compared to release 65.
  • New CCDS import – the updated gene set includes 26,437 transcript models.
  • The patches for GRCh37.p6 were annotated using a combination of manual annotation, annotation projected from the primary assembly and annotation derived from cDNA and protein alignment evidence.
  • Update of Havana manual annotation representing data present in Vega release 46 which includes GENCODE release 11.

The proteomes of 35 chordate species are now fully synchronised with Ensembl 66. The species are:
Ailuropoda melanoleuca (Giant panda)
Anolis carolinensis (American chameleon)
Bos taurus (Cow)
Callithrix jacchus (White-tufted-ear marmoset)
Canis familiaris (Dog)
Cavia porcellus (Guinea pig)
Ciona intestinalis (Transparent sea squirt)
Ciona savignyi (Pacific transparent sea squirt)
Danio rerio (Zebrafish)
Equus caballus (Horse)
Gallus gallus (Chicken)
Gasterosteus aculeatus (Three-spined stickleback)
Gorilla gorilla (Lowland gorilla)
Homo sapiens (Human)
Latimeria chalumnae (West Indian ocean coelacanth)
Loxodonta africana (African elephant)
Macaca mulatta (Rhesus macaque)
Meleagris gallopavo (Common turkey)
Monodelphis domestica (Gray short-tailed opossum)
Mus musculus (Mouse)
Myotis lucifugus (Little brown bat)
Nomascus leucogenys (Northern white-cheeked gibbon)
Ornithorhynchus anatinus (Duckbill platypus)
Oryctolagus cuniculus (Rabbit)
Oryzias latipes (Medaka fish)
Otolemur garnettii (Garnett’s greater bushbaby)
Pan troglodytes (Chimpanzee)
Pongo abelii (Sumatran orangutan)
Rattus norvegicus (Rat)
Sarcophilus harrisii (Tasmanian devil)
Sus scrofa (Pig)
Taeniopygia guttata (Zebra finch)
Takifugu rubripes (Japanese pufferfish)
Tetraodon nigroviridis (Spotted green pufferfish)
Xenopus tropicalis (Western clawed frog)

Update to the Tetraodon nigroviridis proteome

The Tetraodon nigroviridis complete proteome has been updated with data from Ensembl release 66. Until now the proteome has reflected the Genoscope gene model annotations provided within the whole genome shotgun project (accession CAAE00000000) that were made available in March 2007. The proteome has been updated to reflect the annotations of the genome using Ensembl’s more conservative, evidence-based pipeline. Although a consequence of this update is a slightly reduced proteome size, the gene model predictions are high-quality and fit well into the Ensembl Compara gene trees. An example of an Ensembl sourced protein sequence is entry H3C526.

Cross-references to EvolutionaryTrace

Cross-references have been added to EvolutionaryTrace, which ranks amino acid residues in a protein sequence by their relative evolutionary importance.

EvolutionaryTrace is available at http://mammoth.bcm.tmc.edu/ETserver.html

The format of the explicit links in the flat file is:

Resource abbreviation EvolutionaryTrace
Resource identifier UniProtKB accession number
Example P06611:
DR   EvolutionaryTrace; P06611; -.

Show all the entries having a cross-reference to EvolutionaryTrace.

Changes to the controlled vocabulary for PTMs

New terms for the feature key ‘Modified residue’ (‘MOD_RES’ in the flat file):
  • 3-hydroxyhistidine
  • (3S)-3-hydroxyaspartate
  • (5R)-5-hydroxylysine
  • (5S)-5-hydroxylysine

UniProt release 2012_03

Published March 21, 2012

Headline

The importance of being manual

Manual annotation is a time-consuming and expensive process, but undoubtedly adds great value to knowledgebases like UniProtKB. A recent and very elegant study on sirtuin-5 illustrates how new functions continue to be discovered within what are thought to be well characterized protein families. Curating this information facilitates its dissemination as well as its subsequent (re)use in automatic annotation and function prediction systems.

Sirtuins, also called Sir2 proteins, are NAD-dependent deacetylases that regulate important biological processes. The name ‘Sir2’ comes from the yeast ‘silent information regulation 2’ gene, a gene involved, among other processes, in transcriptional repression. Sirtuins belong to a family of evolutionarily conserved proteins occurring in all kingdoms. Mammals have seven sirtuins, SIRT1 to SIRT7. Robust deacetylase activity has been demonstrated for mammalian SIRT1 to SIRT3 and the annotation concerning this potential function has been propagated to other paralogues on the basis of their sequence similarity. However, so far SIRT4 to SIRT7 have been shown to have only a very weak deacetylase activity, if any. While this could be due to an inappropriate choice of peptides for the analysis, it could also be envisioned that their physiological activity is different.

A major breakthrough in the field came from the study of the SIRT5 crystal structure. It appeared that the pocket used by SIRT2 to host acetyl groups was much larger in SIRT5, large enough to host a negatively charged acyl group instead. The most common acyl-CoA molecules with a carboxylate group in cells are malonyl-CoA and succinyl-CoA. Hence, malonyl- and succinyl-peptides were produced and tested as substrates for SIRT5. Goal! SIRT5 was actually able to catalyze their hydrolysis, proving it is a desuccinylase and a demalonylase, rather than a deacetylase. This discovery raised another question: do such post-translational modifications (PTMs) exist at all? Lysine succinylation has been shown on E. coli homoserine trans-succinylase, but not on mammalian proteins, and lysine malonylation had never been reported.

The presence of these PTMs was investigated in mitochondria, the organelle hosting SIRT5. Goal! Several proteins were found to be either succinylated or malonylated or both. Among them is CPS1 whose activity has been previously shown to be regulated by SIRT5.

UniProtKB/Swiss-Prot SIRT5 entries have been updated and lysine-succinylation and malonylation have been introduced in the UniProtKB controlled vocabulary of PTMs.

UniProtKB news

Cross-references to DNASU

Cross-references have been added to DNASU, a plasmid repository providing centralized archival and distribution of over 131,000 plasmids and empty vectors, including over 45,000 plasmids containing more than 7,000 human genes.

DNASU is available at http://dnasu.asu.edu

The format of the explicit links in the flat file is:

Resource abbreviation DNASU
Resource identifier DNASU identifier
Example A0EJG6:
DR   DNASU; 1400; -.

Show all the entries having a cross-reference to DNASU.

Changes to keywords

New keyword:

Changes to the controlled vocabulary for PTMs

New term for the feature key ‘Modified residue’ (‘MOD_RES’ in the flat file):
  • Methionine sulfoxide

UniProt release 2012_02

Published February 22, 2012

Headline

Thiamine thiazole synthase: enzyme, catalyst or co-substrate?

Thiamine is a cofactor essential for many biochemical reactions in all living beings. Humans depend on their diet to supply it as vitamin B1, while bacteria, plants and yeast can make their own. They do so by coupling two precursor molecules: a sulfur-containing ring structure known as a thiazole and a nitrogenous pyrimidine.

In eukaryotes, it has been known for some time that thiamine thiazole synthase catalyzes thiazole biosynthesis, but the source of the thiazole sulfur at the heart of the reaction remained elusive. A recent publication unveiled a very unusual mechanism, whereby a sulfide ion is transferred from a conserved cysteine of thiamine thiazole synthase itself to become part of the thiazole precursor in Saccharomyces cerevisiae. This transfer is strictly dependent on the presence of Fe(2+). The donor cysteine, Cys-205, is irreversibly converted to dehydroalanine, leading to the inactivation of the enzyme. Surprisingly the inactivated protein is not degraded, but accumulates in the cell where it can form up to about 1.5% of total cellular protein. Could it have another physiological function? This has yet to be explored, but it has been suggested to play a role in mitochondrial DNA damage tolerance.

Although very rare, the use of a protein as a metabolic reagent has already been observed. The best characterized example is methylated-DNA--protein-cysteine methyltransferase, which repairs O-6 alkylated guanine lesions in DNA by stoichiometrically transferring the alkyl group to a cysteine residue in the enzyme. Here again we face a suicidal reaction, the enzyme being irreversibly inactivated. Interestingly, the inactive enzyme serves as a signal to induce other DNA repair enzymes.

Can such proteins be considered as “enzymes”? An enzyme is defined as a protein that catalyzes chemical reactions of other substances without itself being destroyed or altered upon completion of the reactions. Thiamine thiazole synthase functions as a “one-shot” reagent and therefore does not comply with the definition. At most it can be considered as a catalyst, i.e. a reagent which promotes a reaction and may act repeatedly or only once.

Such an unusual mechanism has led to some inconsistencies. The Enzyme Commission attributed the EC number 2.1.1.63 to methylated-DNA--protein-cysteine methyltransferases, mentioning the ambiguity of this attribution: “This enzyme catalyzes only one turnover and therefore is not strictly catalytic.” Actually the protein is a catalyst, but it is not strictly an enzyme. The later decision not to provide an EC number to thiamine thiazole synthases is more consistent in view of the definition of an enzyme. This inconsistency is also visible in UniProtKB entries which show EC numbers in the ‘Names and origin’ section of methylated-DNA--protein-cysteine methyltransferases, but not in that of thiamine thiazole synthases.

As of this release, thiamine thiazole synthases have been updated in UniProtKB/Swiss-Prot and a new post-translational modification, 2,3-didehydroalanine, has been introduced.

UniProtKB news

Update to the human proteome

The human reference proteome has been updated with data from Ensembl release 65. Ensembl 65 has numerous updates to the human genome including an update of Havana manual annotation representing data present in Vega release 45. As a result, the human reference proteome has increased in size by over 7,000 entries. These new entries correspond to fragment entries that have transcription evidence captured by Havana and as such they are considered valid members of the proteome. Two examples of these fragment entries are H0Y5B1 and H0Y653.

Change of the cross-reference GeneDB_Spombe to PomBase

The Schizosaccharomyces pombe GeneDB was replaced by PomBase, the new model organism database for the fission yeast Schizosaccharomyces pombe. We have therefore changed the corresponding resource abbreviation from GeneDB_Spombe to PomBase.

Change of the category of the cross-reference KO

The KO database has been moved from the category “Family and domain databases” to the category “Phylogenomic databases”.

Removal of the cross-reference NMPDR

Cross-references to NMPDR have been removed.

Changes to the controlled vocabulary for PTMs

New terms for the feature key ‘Modified residue’ (‘MOD_RES’ in the flat file):
  • 2,3-didehydroalanine (Cys)
  • N6-malonyllysine

UniProt release 2012_01

Published January 25, 2012

Headline

What’s in a (species) name?

Carl Linnaeus, the father of taxonomy, was responsible during his life for the naming of nearly 8,000 plants, many animals and the scientific designation for humans: Homo sapiens. Linnaeus used many of his supporters and detractors as inspiration for naming plants. The most beautiful plants were often named in honor of his supporters while his detractors often supplied the names of common weeds or unattractive plants. Rather like an artist signing a painting, Linnaeus signed all his descriptions, his signature becoming over the centuries a simple L followed by a point. Sober. Even to this day taxonomically approved names may use this idea, but less soberly; the red alga Gracilaria chilensis was discovered in 1986 by C.J. Bird, J. McLachlan & E.C. Oliveira, giving us Gracilaria chilensis C.J. Bird, J. McLachlan & E.C. Oliveira, 1986.

Linnaeus advocated the use of commemorative personal names as botanical names. In ‘Critica Botanica’, he commented with humor about the naming of Linnaea borealis: “It is commonly believed that the name of a plant which is derived from that of a botanist shows no connection between the two… [but]... Linnaea was named by the celebrated [Jan Frederik] Gronovius and is a plant of Lapland, lowly, insignificant, disregarded, flowering but for a brief space – after Linnaeus who resembles it”. It may not be an excessively objective statement.

Thunbergia was named in 1780 by Retzius in honor of Carl Peter Thunberg (1743-1828), the Swedish naturalist, and perhaps the greatest pupil of Linnaeus. Kosteletzkya for Vincenz Franz Kosteletzky (1801-1887), Bohemian physician and botanist. Jacobsenia for Hermann Johannes Heinrich Jacobsen (1898-1978), German botanist and curator at Kiel botanic garden… there are many, many more examples.

Latin is still necessary at least to understand the species epithet. Ehrharta longiflora, longiflora referring to the elongate flowers of this species. And Ehrharta? J.F. Ehrhart (1742-1795) was a German botanist, yet another of Linnaeus’ pupils.

Sometimes scientific names bear the names of people who described the species or were instrumental in discovering them. Several archaebacteria have been named in honor of Carl Woese (1928-), famous for defining the archaea in 1977, such as Pyrococcus woesei, Methanobrevibacter woesei or Conexibacter woesei.

Euzebya tangerina, a tangerine-colored bacterium, was named in 2010 after Jean-Paul Euzéby, a French microbiologist who has contributed significantly to microbial systematics, including the Latinization of microbial names.

And Przewalskium albirostris? The Latin etymology of the name suggests that this creature has a white beak (albus: white and rostrum: beak, trunk or proboscis), or a white lip. It was formerly named Cervus albirostris. Cervus means deer!! Now we know: it is a white-lipped deer. But it was renamed Przewalskium albirostris after N.M. Przhevalsky (1839-1888), a Russian geographer.

In view of the names cited above, you may have the feeling of attending a popularity contest in the scientific community, but other kinds of tribute are also possible: a pheasant was named Chrysolophus amherstiae to commemorate Sarah Countess Amherst who sent the first specimen to London in 1828. More recently, a newly discovered bacterium was named Midichloria mitochondrii. Does Midichloria remind you of anything? Schoooooooooooooooo…Luke, I am your father. Midichloria, a gram-negative bacterium, takes its name from the Star Wars microbes, midi-chlorians, which grant the Jedi and the Sith the ability to use the Force. In real life, Midichloria mitochondrii are non-obligate symbionts that reside primarily in the mitochondria.

Of course the appreciation of people deserving a tribute remains questionable. A nice cactus has been called Rebutia einsteinii, but there is no Opuntia oppenheimerii. Why not oppenheimerii? Why a cactus? We leave the question open for the future generations of taxonomists.

UniProtKB news

Changes to the controlled vocabulary for PTMs

New term for the feature key ‘Modified residue’ (‘MOD_RES’ in the flat file):
  • 5-hydroxy-3-methylproline (Ile)
Deleted term:
  • 5-hydroxy-3-methylproline

Website news

Clustal Omega replaces Clustal W as UniProt’s protein alignment program

We have upgraded the alignment web service used to align protein sequences in UniProt from Clustal W to Clustal Omega. This has been made possible with the help of a new bioinformatics analysis tools framework at EMBL-EBI. Clustal Omega is the latest addition to the Clustal family of programs. It offers a significant improvement upon Clustal W in the following areas:

  • Accuracy – Better quality protein sequence alignments.
  • Scalability – Better at aligning larger numbers of sequences.
  • Speed – Faster alignments, making use of multiple processors where present.

Clustal Omega is currently only suitable for aligning protein sequences and not DNA or RNA sequences.

UniProt release 2011_12

Published December 14, 2011

Headline

Between Charybdis and Cilia

A large number of genetic disorders, displaying a widely varying set of symptoms, are highly related in their root cause and can be grouped into a single category. This is the case for the ciliopathies in which the underlying cause is a cilium dysfunction. This emerging class of disease groups very different types of syndromes, including the Alstrom, Bardet-Biedl, Ellis-van Creveld, Joubert, Meckel, Sensenbrenner syndromes and many more.

Cilia are organelles found in almost all vertebrate cells. They contain a ciliary axoneme, i.e. a ring-shaped core of 9 microtubule doublets, which connects the base of the cilium to its tip. This axoneme is covered by the ciliary membrane and projects from a modified centriole, the basal body.

There are two types of cilia: motile cilia and non-motile (primary) cilia. Motile cilia are found in certain types of highly specialized cells and are dedicated to a powerful motion of the extracellular fluid, for example, in the epithelial cells lining the trachea, where they sweep mucus and dirt out of the lungs. By contrast, the majority of cells develop a single, non-motile primary cilium, which typically serves as a sensory organelle. The primary cilium membrane harbours receptors for crucial signaling cascades, most prominently Hedgehog, Wnt, planar cell polarity, FGF, Notch, mTor, PDGF or Hippo signaling. As a result, primary cilia play a role in cell proliferation, polarity, differentiation, tissue maintenance, and nerve growth.

The range of diseases due to cilia defects therefore includes multiple phenotypes that affect different organs (predominantly kidney, eye, liver, bone and brain) and often show overlapping clinical features. Commonly observed clinical manifestations are renal cysts, retinal degeneration, polydactyly, mental retardation, and obesity.

The genetics of ciliopathies is complex. In some cases, identical phenotypes are caused by mutations in different genes. For example, over 15 genes have been shown to be involved in Bardet-Biedl syndrome, and close to 10 and 15 genes in Meckel and Joubert syndromes, respectively. On the other hand, multiple allelism at a single locus can lead to different phenotypes. For example, mutations in CEP290, a centrosomal protein involved in ciliogenesis, cause Bardet-Biedl syndrome type 14, Joubert syndrome type 5, Senior-Loken syndrome type 6, Leber congenital amaurosis type 10 and Meckel syndrome type 4. Additionally, recent studies suggest that ciliopathy loci can be modulated by pathogenic lesions in other ciliary genes to either exacerbate overall severity or induce specific phenotypes.

Variations across multiple sites of the ciliary proteome may influence the clinical outcome and explain the variable penetrance and expressivity of ciliopathies. Examples are the TTC21B and KIF7 genes, which code for two ciliary proteins involved in the regulation of sonic hedgehog signaling. TTC21B mutations primarily cause nephronophthisis type 12 and asphyxiating thoracic dystrophy type 4, but have also been found in patients with Bardet-Biedl syndrome or Meckel-Gruber syndrome carrying disease-causing mutations in other ciliopathy genes. KIF7 mutations are primarily responsible for acrocallosal syndrome, Joubert syndrome type 12, and hydrolethalus syndrome type 2, but may also genetically interact with Bardet-Biedl syndrome genes and contribute to disease manifestation and severity in Bardet-Biedl syndrome patients.

A number of ciliopathies have been annotated in UniProtKB/Swiss-Prot. The newly created keyword Ciliopathy allows users to retrieve all proteins involved in these diseases. More specific keywords can be used to restrict the set of proteins to those associated with special types of ciliopathies, such as Bardet-Biedl syndrome, Joubert syndrome, Kartagener syndrome, Meckel syndrome, Nephronophthisis, Primary ciliary dyskinesia, or Senior-Loken syndrome.

Proteins involved in cilia formation, organization, maintenance and degradation can be retrieved with the keyword Cilium biogenesis/degradation.

UniProtKB news

Cross-references to DMDM

Cross-references have been added to DMDM (Domain Mapping of Disease Mutations), a database in which each disease mutation can be displayed by its gene, protein or domain location. DMDM provides a unique domain-level view where all human coding mutations are mapped on the protein domain.

DMDM is available at http://bioinf.umbc.edu/dmdm/.

The format of the explicit links in the flat file is:

Resource abbreviation DMDM
Resource identifier DMDM identifier
Example Q9N2K0:
DR   DMDM; 44887889; -.

Show all the entries having a cross-reference to DMDM.

Cross-references to PATRIC

Cross-references have been added to PATRIC, a resource which integrates vital information on pathogens, provides key resources and tools to scientists, and helps researchers to analyze genomic, proteomic and other data arising from infectious disease research.

PATRIC is available at http://www.patricbrc.org/.

The format of the explicit links in the flat file is:

Resource abbreviation PATRIC
Resource identifier PATRIC identifier
Optional information 1 PATRIC locus tag
Example A5A616:
DR   PATRIC; 32118368; VBIEscCol129921_1604.
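
The explicit-link format above can be read programmatically. The following is a minimal sketch that handles only the form shown in these examples (the `DR` line code, semicolon-separated fields, a closing period, and `-` for an absent optional field); the function name and the returned dictionary layout are assumptions made for illustration, and other DR line variants are not covered.

```python
def parse_dr_line(line):
    """Parse one flat-file DR line of the form shown in the examples.

    Returns a dict with the resource abbreviation, the resource
    identifier and a list of optional information fields, where a
    '-' placeholder is reported as None.
    """
    if not line.startswith("DR   "):
        raise ValueError("not a DR line: %r" % line)
    body = line[5:].strip()
    # Drop only the single period that terminates the line.
    if body.endswith("."):
        body = body[:-1]
    fields = [field.strip() for field in body.split(";")]
    if len(fields) < 2 or not fields[0] or not fields[1]:
        raise ValueError("DR line needs a resource and an identifier: %r" % line)
    optional = [None if field == "-" else field for field in fields[2:]]
    return {"resource": fields[0], "identifier": fields[1], "optional": optional}
```

Applied to the PATRIC example, the optional list would hold the PATRIC locus tag; applied to a line such as `DR   DMDM; 44887889; -.`, it would hold a single `None`.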

Show all the entries having a cross-reference to PATRIC.

Changes in the controlled vocabulary for PTMs

New term for the feature key ‘Modified residue’ (‘MOD_RES’ in the flat file):

  • N2,N2-dimethylarginine

New terms for the feature key ‘Cross-link’ (‘CROSSLNK’ in the flat file):

  • 2-(4-guanidinobutanoyl)-5-hydroxyimidazole-4-carbothionic acid (Arg-Cys)
  • 5-methyloxazole-4-carboxylic acid (Cys-Thr)
  • 5-methyloxazole-4-carboxylic acid (Thr-Thr)
  • 5-methyloxazoline-4-carboxylic acid (Ser-Thr)
  • Oxazole-4-carboxylic acid (Ile-Ser)
  • Oxazole-4-carboxylic acid (Ser-Ser)
  • Thiazole-4-carboxylic acid (Arg-Cys)
  • Threonine 5-hydroxy-oxazole-4-carbonthionic acid (Thr-Cys)

Changes in subcellular location controlled vocabulary

New subcellular locations:

UniProt release 2011_11

Published November 16, 2011

Headline

Who wants to be a millionaire? The first million HAMAP-annotated entries in UniProtKB/TrEMBL

As humanity explores more environmental and ecological niches, we are discovering a treasure-trove of organisms of which very little, if anything, is known. Sequencing genomes is becoming cheaper, and so to understand this diversity we sequence; but to begin to appreciate a genome’s possibilities, quality annotation is required. HAMAP is an annotation project started over 10 years ago to provide annotation to the massive influx of completely sequenced bacterial and archaeal genomes and is now an integral part of the UniProt Automatic Annotation program.

The HAMAP rules automatically annotate bacterial and archaeal proteins, as well as related plastid-encoded proteins, based on manually-annotated, characterized template entries. These latter entries are used to generate the HAMAP profiles. UniProtKB/TrEMBL entries that belong to a family, i.e. that match a HAMAP profile, acquire annotation based on the manually annotated templates as well as template-based feature propagation. The propagated annotation also includes protein and gene names, general annotation (comments), keywords and GO terms. The annotation templates (http://hamap.expasy.org/families.html), seed alignments used to generate the HAMAP profiles and much more are available on the HAMAP website and will be integrated into the www.uniprot.org automatic annotation portal in the future.
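
The propagation step described above can be pictured as copying template annotation onto every entry that matches a family profile. The sketch below is a simplified illustration, not the HAMAP implementation: profile matching is assumed to have been done already and is passed in as `family_matches`, and the function name, dictionary layout, field names and family identifiers are all hypothetical. Real HAMAP rules are richer than a plain copy.

```python
# Annotation types named in the text as being propagated (field names are hypothetical).
PROPAGATED_FIELDS = ("protein_name", "gene_name", "comments", "keywords", "go_terms", "features")


def propagate_annotation(entries, family_matches, templates):
    """Return annotated copies of the entries that match a family.

    entries:        dict of accession -> dict of existing entry data
    family_matches: dict of accession -> family ID (precomputed profile hits)
    templates:      dict of family ID -> dict of template annotation
    """
    annotated = {}
    for accession, family_id in family_matches.items():
        template = templates.get(family_id)
        if template is None or accession not in entries:
            continue  # no template for this family, or unknown entry
        merged = dict(entries[accession])  # leave the input untouched
        for field in PROPAGATED_FIELDS:
            if field in template:
                merged[field] = template[field]
        merged["annotation_source"] = family_id  # record where it came from
        annotated[accession] = merged
    return annotated
```

Re-running such a function after a template is created or updated would refresh the annotation of all matching entries, which is the effect the text describes for each release.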

Two years ago we wrote a headline highlighting the incorporation of 300,000 HAMAP annotated entries into UniProtKB/Swiss-Prot. Since that time we have discontinued incorporation of these semi-automatically annotated entries into UniProtKB/Swiss-Prot; this annotation is now added to UniProtKB/TrEMBL entries instead, while manually annotated ‘template’ entries (see above) are still integrated into UniProtKB/Swiss-Prot. With this release there are over 1 million bacterial, archaeal and plastid-encoded proteins in UniProtKB/TrEMBL that have been annotated by the HAMAP rules. With each UniProt release, and as families and new template entries are created or updated based on new experiments, entries from all genomes are (re)annotated, enriching them beyond what was known when the genomes were originally submitted to the DNA databases. All these entries are thus improved by this high quality semi-automated annotation, rendering them more useful to the community.

UniProtKB news

Cross-references to KO (KEGG Orthology)

Cross-references have been added to KO consisting of manually defined ortholog groups that correspond to KEGG pathway nodes, BRITE hierarchy nodes, and KEGG module nodes.

KO is available at http://www.genome.jp/kegg/ko.html.

The format of the explicit links in the flat file is:

Resource abbreviation KO
Resource identifier KO identifier
Example P41932:
DR   KO; K06630; -.

Show all the entries having a cross-reference to KO.

Changes to keywords

New keyword:

Changes in the controlled vocabulary for PTMs

New term for the feature key ‘Modified residue’ (‘MOD_RES’ in the flat file):
  • 4-hydroxyglutamate
New terms for the feature key ‘Cross-link’ (‘CROSSLNK’ in the flat file):
  • 3-hydroxypyridine-2,5-dicarboxylic acid (Ser-Cys) (with S-...)
  • 3-hydroxypyridine-2,5-dicarboxylic acid (Ser-Ser) (with C-...)
  • Thiazole-4-carboxylic acid (Glu-Cys)

UniProt release 2011_10

Published October 19, 2011

Headline

The sound of silence

Cytosine methylation is the major and best characterized epigenetic modification of metazoan DNA. It is implicated in long-term gene silencing, X chromosome inactivation, genomic imprinting, etc. 5-methylcytosine (5mC) is recognized by methyl-binding proteins (MBDs), which in turn recruit repressive histone modifiers, such as H3K9 methyltransferases, to establish a heterochromatin state.

Cytosine base methylation is catalyzed by the C5-methyltransferase enzyme family. DNMT3A and DNMT3B methylate DNA de novo. DNMT1 maintains the methylation status across cell divisions. In the absence of DNMT1 activity, DNA methylation is progressively lost since methylation is not replicated onto the newly synthesized strand, leading to passive DNA demethylation.

However, passive DNA demethylation cannot account for the rapid demethylation that occurs in the paternal genome in the zygote within the first 4 hours following fertilization or that observed in primordial germ cells, both of which are independent of DNA replication. While demethylases have been identified in Arabidopsis thaliana, the mechanism of active demethylation in mammals remained elusive (reviews). 2011 has unveiled the central role played by TET family members. These enzymes have already been shown to catalyze the conversion of 5mC into 5-hydroxymethylcytosine (5hmC). 5hmC can be further processed, either by G/T mismatch-specific thymine DNA glycosylase (TDG) or by deamination enzymes, such as APOBEC1 and AICDA/AID, and eventually removed and replaced by unmodified cytosine by the base excision repair mechanism.

Interestingly, TET1 has a role in transcriptional repression, independently of its enzymatic activity. It binds a significant proportion of Polycomb group target genes and associates and colocalizes with the SIN3A co-repressor complex.

These new exciting data pave the way for understanding transcriptional fine-tuning during embryonic development, as well as in adult organisms and will keep us busy updating UniProtKB for quite a while.

UniProtKB news

Changes to keywords

Deleted keyword:

UniProt release 2011_09

Published September 21, 2011

Headline

Reference proteomes in UniProt

With the significant increase in the number of complete genomes sequenced, it is critically important to organise this data in a way that allows users to effectively navigate the growing number of available complete proteome sequences. The approach adopted by UniProt to meet this challenge is to define a set of “reference proteomes” which are “landmarks” in proteome space.

Reference proteomes have been selected to provide broad coverage of the tree of life, and constitute a representative cross-section of the taxonomic diversity to be found within UniProtKB. They include the proteomes of well-studied model organisms and other proteomes of interest for biomedical and biotechnological research. Species of particular importance may be represented by numerous reference proteomes for specific ecotypes or strains of interest.

Currently, UniProt has defined 455 reference proteomes in collaboration with Ensembl and the NCBI Reference Sequence collection. The keyword ‘Reference proteome’ has been created to allow their easy retrieval, and the keyword ‘Virus reference strain’ has been deprecated to reflect this.

The list of reference proteomes will be continuously reviewed as new proteomes of interest become available and as existing taxonomic classifications are revised. We would very much welcome feedback on our current list of reference proteomes and suggestions for new candidates via help@uniprot.org.

Link to complete and reference proteomes.

UniProtKB news

Changes to keywords

Replacement of the keyword ‘Virus reference strain’ by ‘Reference proteome’

We have introduced the more widely applicable keyword ‘Reference proteome’ to replace the keyword ‘Virus reference strain’. All ‘Virus reference strains’ are now defined as ‘Reference proteomes’. See preceding text for further information on ‘Reference proteomes’.

New keyword:

UniProt release 2011_08

Published July 27, 2011

Headline

UniProt collaboration with IMEx for the annotation of protein interactions to MIMIx standard

UniProt is committed to the development and application of workflows and standards in the curation of biological data, its dissemination and exchange, and works with other consortia and data providers to achieve this. An example of this ongoing effort is the collaboration between UniProt and the International Molecular Exchange (IMEx) consortium.

The IMEx consortium is an international collaboration between a group of major public interaction data providers who share curation effort, and work to common curation rules using common standards. Our collaboration with IMEx will increase the flow of curated interaction data into IMEx, and will allow UniProt to leverage existing standards for the curation of protein interaction data and to contribute to the future development of such standards.

The standard we have chosen to adopt for the curation of protein interaction data in UniProt is the “minimum information required for reporting a molecular interaction experiment” standard, or MIMIx. MIMIx provides a useful compromise between free-text descriptions of protein interactions (which are difficult to parse) and the very detailed curation performed within IMEx (which aims to capture most experimental parameters). MIMIx-level annotation requires the accession numbers of the interacting proteins as well as a number of key experimental annotations made using terms from the Proteomics Standards Initiative (PSI) molecular interaction (MI) vocabulary. These annotations cover the type of interaction, the methods used to detect the interaction and identify the participants, the experimental roles of the participants, and the host organism in which the interaction was observed. MIMIx provides information that should be sufficient to allow a trained biologist to evaluate the biological relevance of an experimentally observed interaction.
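As a rough illustration of what a MIMIx-level record has to carry, the sketch below models the fields listed above as a small Python data structure with a completeness check. The class names, field names and the `missing_fields` helper are hypothetical and chosen for this example only; they are not part of any UniProt, IntAct or PSI-MI software, and real annotations would use terms from the PSI-MI vocabulary.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class Participant:
    accession: str              # accession number of the interacting protein
    experimental_role: str      # experimental role of the participant
    identification_method: str  # method used to identify the participant


@dataclass(frozen=True)
class MimixInteraction:
    participants: tuple         # tuple of Participant records
    interaction_type: str       # type of interaction
    detection_method: str       # method used to detect the interaction
    host_organism: str          # organism in which the interaction was observed


def missing_fields(record):
    """Return the names of empty fields in a (hypothetical) MIMIx-level record."""
    missing = [f.name for f in fields(record) if not getattr(record, f.name)]
    for index, participant in enumerate(record.participants):
        missing += [
            f"participants[{index}].{f.name}"
            for f in fields(participant)
            if not getattr(participant, f.name)
        ]
    return missing
```

A record is complete in this toy model when `missing_fields` returns an empty list; anything it reports would have to be curated before the interaction reaches the minimum level described above.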

UniProt curators have begun to curate protein interaction data to MIMIx standards as part of their normal workflow. Interactions are curated directly within the IntAct database, which forms the contact point between UniProt and the wider IMEx consortium. These curated interactions form a small part of the larger IntAct dataset which can be accessed from the IntAct website. A subset of presumably reliable interactions is extracted from the IntAct dataset and made available within the ‘Binary interactions’ section of UniProtKB entries (see for example entry Q13426). From UniProt release 2011_09, export from IntAct to UniProt will be determined using a simple scoring system developed by IntAct, coupled to a score threshold that has been deliberately chosen to exclude interactions supported by only one experimental observation. Further details of how interactions are scored can be found at the IntAct website. This simple score-based filter will be used in combination with a set of defined rules that excludes certain types of data, such as interactions that have been inferred but not experimentally proven.
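The export rule described above can be sketched as a two-step filter. The actual IntAct scoring formula and threshold are not given here, so this sketch substitutes a plain count of experimental observations for the score and assumes a threshold of two; the `Interaction` class, `MIN_OBSERVATIONS` and `exportable` are hypothetical names used for illustration only.

```python
from dataclasses import dataclass


@dataclass
class Interaction:
    pair: tuple             # accessions of the two interacting proteins
    observations: int       # stand-in for the IntAct score (assumption)
    inferred: bool = False  # True if inferred rather than experimentally proven


# Assumption: a threshold of two excludes interactions supported by
# only one experimental observation.
MIN_OBSERVATIONS = 2


def exportable(interactions):
    """Keep interactions that pass the stand-in threshold and the exclusion rule."""
    return [
        interaction
        for interaction in interactions
        if not interaction.inferred and interaction.observations >= MIN_OBSERVATIONS
    ]
```

Keeping the score threshold and the rule-based exclusion as separate conditions mirrors the text, where the two are described as being applied in combination.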

We anticipate that these developments will enhance the availability and usability of high quality protein interaction data within UniProtKB, and promote the use of MIMIx in reporting such data. We welcome feedback on this development and other curation standards.

UniProtKB News

Changes concerning the controlled vocabulary for PTMs

New terms for the feature key ‘Cross-link’ (‘CROSSLNK’ in the flat file):
  • (2-aminosuccinimidyl)acetic acid (Asn-Gly)
  • N,N-(cysteine-1,S-diyl)phenylalanine (Cys-Phe)
New term for the feature key ‘Modified residue’ (‘MOD_RES’ in the flat file):
  • S-bacillithiol cysteine disulfide
Modified terms for the feature key ‘Modified residue’ (‘MOD_RES’ in the flat file):
  • 4-amino-3-isothiazolidinone serine (Cys-Ser) > N,N-(cysteine-1,S-diyl)serine (Cys-Ser)

UniProt release 2011_07

Published June 28, 2011

Headline

Killing myself softly – Bacterial and Archaeal Type II Toxin-Antitoxin modules (TA)

Bacteria produce many kinds of toxins that attack other organisms: cholera toxin, botulinum toxin, aerolysins, insecticidal toxins, extracellular proteases, to name just a few. In the past 5 years it has become apparent that most free-living bacteria produce another kind of internal toxin. These toxins are almost always encoded as bicistronic antitoxin-toxin operons (TA module), where the antitoxin is unstable and neutralizes the toxin. In type I and III systems the antitoxin is a small RNA, while in type II systems the antitoxin is a protein. If the antitoxin levels decrease then the toxin levels increase and toxic effects can be seen at the cellular level. In most type II cases studied, the antitoxin acts as an autoregulator, repressing transcription; frequently but not always, the toxin acts as a corepressor. First identified in bacteriophage and on plasmids, where their role is clearly in plasmid or phage maintenance, the role of the chromosomally encoded toxins in bacteria is hotly debated. Proposed functions include maintenance of mobile genetic elements, programmed cell death, induction of persistence (dormancy), stress response, virulence promotion in a host, and regulation of biofilm formation. The toxin’s role may in fact depend on the physiology of the organism in question. Although they are widespread in Archaea, no function for these toxins has been shown in vivo.

There are many toxin families. The best characterized so far are the bacterial ribonuclease toxins which belong to the MazF, RelE, MqsR, HigB, YoeB and VapC families. Most of these toxins degrade mRNA; some are sequence-specific, some work only in association with ribosomes, while for others the mode is unknown. VapC has been shown to degrade the anticodon loop of tRNAfMet. Other cellular functions are also toxin targets; DNA gyrase is targeted by the ParE toxin, HipA toxin probably acts by inappropriately phosphorylating cellular targets, RatA blocks ribosomal subunit association, PezT/zeta toxin corrupts peptidoglycan synthesis and CbtA (formerly YeeV) toxin binds FtsZ and MreB, inhibiting them, possibly simultaneously. While the toxins form distinct families, their cognate antitoxins do not, although almost all of them have a DNA-binding domain, in accordance with their probable role in operon regulation. To further complicate matters, in Mycobacterium tuberculosis H37Rv cross-talk between toxins and some non-cognate antitoxins has been seen, while in Caulobacter crescentus such cross-talk does not occur. Additionally, potential new toxins are detected quite frequently.

We recently performed a major update of many of the type II TA families in UniProtKB/Swiss-Prot, with particular attention given to the model organisms Mycobacterium tuberculosis strain H37Rv and E. coli K12 / MG1655. 65 TA modules have been annotated in M. tuberculosis and 15 in E. coli; gene names for M. tuberculosis were assigned in collaboration with the TubercuList database. Interestingly, TA modules are more abundant in pathogens than in related non-pathogenic strains. For example, Mycobacterium smegmatis, a non-pathogenic mycobacterium, is predicted to encode only 3 TA modules. Since January 2011 the mode of action of at least 4 toxin families has been elucidated (CbtA, PezT/zeta, RatA and VapC).

Although the PezT/zeta toxin and associated antitoxin module have not been predicted to exist in M. tuberculosis, there are indeed loci belonging to this TA module encoded in the genome. This is currently such a hot topic that integrating the data will keep us busy for quite a while yet.

All manually annotated type II TA module entries can be retrieved from UniProtKB/Swiss-Prot using the query toxin-antitoxin (TA) module.

UniProtKB News

Provision of complete proteome data sets for IPI species by UniProt

Complete proteome data sets are now available for download from the FTP and web sites for the species in the International Protein Index (IPI), which is scheduled for closure this year. IPI is an integrated database which clusters protein sequences from different databases to provide non-redundant complete data sets for selected higher eukaryotic organisms. Since it was launched in 2001, IPI has covered the gaps in the gene predictions between different databases, but since then the situation has improved for many of the most-studied genomes. This is due to a close collaboration between Ensembl, RefSeq and UniProt which aims to provide a standard set of gene predictions for the genomes of interest. The new UniProtKB data sets will therefore provide high-coverage complete proteomes for IPI users. They will be based on existing UniProtKB sequences supplemented by missing high quality predictions imported from Ensembl.

For Homo sapiens, a first pass annotation of the complete proteome was completed by UniProt in 2008 and all entries were incorporated into UniProtKB/Swiss-Prot. Within this UniProtKB/Swiss-Prot complete H. sapiens proteome, approximately 20,000 putative protein-coding genes are represented by one canonical protein sequence, with some entries describing multiple isoform sequences. Since its initial release, the UniProtKB/Swiss-Prot complete H. sapiens proteome has been extensively curated and the Ensembl cross-references (mapped based on sequence identity) are in the process of being manually verified. All predicted protein sequences from Ensembl (except fragments) that were found to be absent from the UniProtKB/Swiss-Prot complete H. sapiens proteome were imported into the unreviewed component of UniProtKB, UniProtKB/TrEMBL (see release 2011_05 headline). These imported UniProtKB/TrEMBL entries were tagged with the keyword ‘Complete proteome’. The aim of this import was to increase the coverage of the existing complete proteome, by supplementing it with those Ensembl protein sequences that had no UniProtKB counterpart. The resulting UniProtKB complete H. sapiens proteome includes both reviewed sequences from UniProtKB/Swiss-Prot (equivalent to an updated version of the complete H. sapiens proteome completed in 2008), now supplemented by unreviewed sequences from UniProtKB/TrEMBL. This process will enable the synchronization of the UniProt set with the CCDS project. This version of the complete H. sapiens proteome provides higher sequence coverage than the preceding version, but now includes sequences that have not been manually reviewed. Users can opt either for this expanded complete H. sapiens proteome or for a reduced version that derives exclusively from UniProtKB/Swiss-Prot.

For the other IPI species (mouse, rat, chicken, zebrafish, cow and dog), we added the keyword ‘Complete proteome’ to the existing UniProtKB/Swiss-Prot entries. We identified those entries in UniProtKB/TrEMBL which mapped to the complete genome in Ensembl and imported the predicted protein sequences (except fragments) from Ensembl which were found not to be present in UniProtKB. The keyword ‘Complete proteome’ was also added to these entries. As for the human counterpart, these proteomes can now be easily retrieved using this keyword. The Ensembl cross-references have been added to the UniProtKB entries on the basis of 100% sequence identity over their full length.
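The import and cross-referencing steps described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the production pipeline: the function name `supplement_proteome` and the dictionary layouts are hypothetical, and "100% sequence identity over their full length" is modelled as plain string equality between sequences.

```python
def supplement_proteome(uniprot, ensembl):
    """Sketch of supplementing a UniProtKB proteome with Ensembl predictions.

    uniprot: dict mapping UniProtKB accession -> protein sequence
    ensembl: dict mapping Ensembl protein ID -> (sequence, is_fragment)
    Returns (cross_references, sequences_to_import).
    """
    # Index existing UniProtKB sequences so identical ones are found directly.
    by_sequence = {}
    for accession, sequence in uniprot.items():
        by_sequence.setdefault(sequence, []).append(accession)

    cross_references = {}  # Ensembl ID -> accessions with an identical sequence
    to_import = {}         # Ensembl predictions with no UniProtKB counterpart
    for ensembl_id, (sequence, is_fragment) in ensembl.items():
        if sequence in by_sequence:
            # Identical over the full length: add a cross-reference only.
            cross_references[ensembl_id] = by_sequence[sequence]
        elif not is_fragment:
            # Absent from UniProtKB and not a fragment: candidate for import.
            to_import[ensembl_id] = sequence
    return cross_references, to_import
```

In this toy model, everything returned in `sequences_to_import` would then be tagged with the keyword ‘Complete proteome’, as would the existing entries of the species.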

We will expand the coverage to other species of interest in the near future and expect this will be very useful for our users as it will eliminate the need to combine data from different databases.

Changes concerning the controlled vocabulary for PTMs

New terms for the feature key ‘Cross-link’ (‘CROSSLNK’ in the flat file):
  • 2-(3-methylbutanoyl)-5-hydroxy-oxazole-4-carbothionic acid (Leu-Cys)
  • Proline 5-hydroxy-oxazole-4-carbothionic acid (Pro-Cys)
New term for the feature key ‘Lipidation’ (‘LIPID’ in the flat file):
  • S-(15-deoxy-Delta12,14-prostaglandin J2-9-yl)cysteine

Changes to keywords

Modified keyword:

UniProt release 2011_06

Published May 31, 2011

Headline

New biocuration pages on UniProt website

One of the central activities of the UniProt Consortium is the biocuration of the UniProt Knowledgebase (UniProtKB). This involves the integration and interpretation of information from a variety of sources as well as accurate and comprehensive representation of the data. The biocuration process adds a wealth of information to UniProtKB records including information related to the role of a protein such as its function, structure, subcellular location, interactions with other proteins, and domain composition, as well as a wide range of sequence features such as active sites and post-translational modifications.

Both manual and automatic approaches are used to add information to UniProtKB records. Manual curation provides high-quality data for experimentally characterised proteins and consists of a critical review of experimental and predicted data for each protein as well as manual verification of each protein sequence. This information is included in the manually reviewed Swiss-Prot section of UniProtKB. In response to the ever-increasing amounts of sequence data, automated methods have been developed by the UniProt Consortium to annotate uncharacterised proteins with a high degree of accuracy and these methods are used to enhance the unreviewed records in UniProtKB/TrEMBL by enriching them with automatic classification and annotation.

In order to keep UniProt users informed of curation practices and priorities within the project, the UniProt website has been updated to include a new section describing UniProt biocuration. This section provides an overview of the manual curation process as well as details of current manual curation priorities. In addition, information is provided about the automatic annotation systems developed and used within the group. Additional useful information, such as statistics, links to related resources and relevant publications, is also provided.

The pages will continue to be updated on a regular basis to provide users with the latest information about the UniProt curation process and activities.

UniProtKB News

Changes concerning the controlled vocabulary for PTMs

New term for the feature key ‘Cross-link’ (‘CROSSLNK’ in the flat file):
  • S-(2-aminovinyl)-D-cysteine (Cys-Cys)

Changes to keywords

Modified keyword:

UniProt release 2011_05

Published May 3, 2011

Headline

Complete proteomes for Homo sapiens and Mus musculus

With the imminent closure of the International Protein Index (IPI), UniProt has pledged to provide comprehensive and non-redundant complete proteomes for all species that are currently covered by this soon-to-be-defunct resource. With this release of UniProtKB, we provide the first version of the complete proteome for Mus musculus and an updated version of the Homo sapiens set.

We describe here how each of these complete proteomes is produced, and outline their major characteristics.

For Homo sapiens, a first pass annotation of the complete proteome was completed by UniProt in 2008 and all entries were incorporated into UniProtKB/Swiss-Prot. Within this UniProtKB/Swiss-Prot complete H. sapiens proteome, approximately 20,000 putative protein-coding genes are represented by one canonical protein sequence, with some entries describing multiple isoform sequences. Since its initial release, the UniProtKB/Swiss-Prot complete H. sapiens proteome has been extensively curated and the Ensembl cross-references – mapped based on sequence identity – are in the process of being manually verified. All predicted protein sequences from Ensembl (except fragments) that were found to be absent from the UniProtKB/Swiss-Prot complete H. sapiens proteome were imported into the unreviewed component of UniProtKB, UniProtKB/TrEMBL. These imported UniProtKB/TrEMBL entries were tagged with the keyword ‘Complete proteome’. The aim of this import was to increase the coverage of the existing complete proteome, by supplementing it with those Ensembl protein sequences that had no UniProtKB counterpart. The resulting UniProtKB complete H. sapiens proteome includes both reviewed sequences from UniProtKB/Swiss-Prot (equivalent to an updated version of the complete H. sapiens proteome completed in 2008), now supplemented by unreviewed sequences from UniProtKB/TrEMBL. This process will enable the synchronization of the UniProt set with the CCDS project. This version of the complete H. sapiens proteome provides higher sequence coverage than the preceding version, but now includes sequences that have not been manually reviewed. Users can opt either for this expanded complete H. sapiens proteome or for a reduced version that derives exclusively from UniProtKB/Swiss-Prot.

For Mus musculus, we added the keyword ‘Complete proteome’ to the existing UniProtKB/Swiss-Prot mouse entries. We identified those entries in UniProtKB/TrEMBL which mapped to the complete genome in Ensembl and imported the predicted protein sequences (except fragments) from Ensembl which were found not to be present in UniProtKB. The keyword ‘Complete proteome’ was also added to these entries. As for the human counterpart, the UniProtKB complete Mus musculus proteome can now be easily retrieved using this keyword. The Ensembl cross-references have been added to the UniProtKB entries on the basis of 100% sequence identity over their full length.

There has been a deliberate introduction of redundancy into the proteomes based on the complete genomes to ensure that all alternative protein variants and isoforms are presented in the set. Over time, these will be merged with the parent entry as is UniProtKB/Swiss-Prot curation policy. We are also evaluating the fragment sequences predicted in the Ensembl complete proteomes for future incorporation.

We very much welcome the feedback of the community on our efforts. We expect to make the remaining IPI species (Gallus gallus, Bos taurus, Danio rerio, Arabidopsis thaliana and Rattus norvegicus) and some additional species of interest (Sus scrofa, Canis familiaris) available soon.

UniProtKB News

Changes to the taxonomy of UniProtKB entries from the model fungal organisms Saccharomyces cerevisiae (YEAST) and Schizosaccharomyces pombe (SCHPO)

Historically, UniProt assigned species identification codes (i.e. the 5-letter mnemonic that forms the second part of the composite UniProtKB entry name) at species-level. All sequences of a given protein from a single species were merged into a single UniProtKB/Swiss-Prot entry, which could therefore contain sequences from many different strains of that species. Discrepancies between individual sequences were annotated as “conflicts” or “variants” in the feature table of the entry.
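To make the structure of the composite entry name concrete, the short sketch below splits an entry name into its protein part and its species identification code. The helper name `split_entry_name` is hypothetical; it only assumes that the mnemonic is whatever follows the last underscore.

```python
def split_entry_name(entry_name):
    """Split a UniProtKB entry name into (protein part, species mnemonic)."""
    protein, separator, mnemonic = entry_name.rpartition("_")
    if not separator or not protein or not mnemonic:
        raise ValueError(f"not a composite entry name: {entry_name!r}")
    return protein, mnemonic
```

For an entry name such as MAL62_YEASX, this returns the protein part MAL62 and the mnemonic YEASX.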

The number of complete genome submissions from different strains of individual species is now increasing at an ever-accelerating rate, and these data have shown that distinct strains of many species can differ considerably, both in gene content and in the sequences of individual shared genes. The sheer amount of complete proteome data and the associated variability between proteomes means that manual merging of individual strains is no longer sustainable, and this approach has been discontinued. UniProt now assigns mnemonic codes at strain-level for complete genome sequences. We are in the process of reassigning many proteomes corresponding to known strains from a species-level taxonomic identifier (defined by the NCBI_TaxID) to the appropriate strain-level taxonomic identifier. In parallel we are also resolving some of the most significant historical cases of strain-level merging.

Here we describe the changes which accompanied the reassignment of the proteomes of the model fungal organisms Saccharomyces cerevisiae (YEAST) and Schizosaccharomyces pombe (SCHPO) to a strain-level taxonomic identifier.

We have changed the taxonomy of all UniProtKB entries of the S. cerevisiae complete proteome from strain S288c (the reference genome sequence stored at SGD) from NCBI_TaxID=4932 to NCBI_TaxID=559292:

OS   Saccharomyces cerevisiae (Baker's yeast).
OC   Eukaryota; Fungi; Dikarya; Ascomycota; Saccharomycotina;
OC   Saccharomycetes; Saccharomycetales; Saccharomycetaceae; Saccharomyces.
OX   NCBI_TaxID=4932;
to
OS   Saccharomyces cerevisiae (strain ATCC 204508 / S288c) (Baker's yeast).
OC   Eukaryota; Fungi; Dikarya; Ascomycota; Saccharomycotina;
OC   Saccharomycetes; Saccharomycetales; Saccharomycetaceae; Saccharomyces.
OX   NCBI_TaxID=559292;

To facilitate recognition and tracking of the entries, those of strain S288c (NCBI_TaxID=559292) have kept the mnemonic YEAST.

All S. cerevisiae entries not originating from the genome strain S288c, or one of the other completely sequenced strains (strain RM11-1a (YEAS1), strain JAY291 (YEAS2), strain AWRI1631 (YEAS6), strain YJM789 (YEAS7) and Lalvin EC1118 (YEAS8)), remain at a species level taxonomic identifier (NCBI_TaxID=4932), for which the new mnemonic YEASX was created (e.g. MAL62_YEASX for P07265).

A similar procedure was applied to the S. pombe proteome. We have changed the taxonomy of all UniProtKB entries of the S. pombe complete proteome from strain 972 (the reference genome sequence stored at S. pombe GeneDB) from NCBI_TaxID=4896 to NCBI_TaxID=284812:

OS   Schizosaccharomyces pombe (Fission yeast).
OC   Eukaryota; Fungi; Dikarya; Ascomycota; Taphrinomycotina;
OC   Schizosaccharomycetes; Schizosaccharomycetales;
OC   Schizosaccharomycetaceae; Schizosaccharomyces.
OX   NCBI_TaxID=4896;
to
OS   Schizosaccharomyces pombe (strain ATCC 38366 / 972) (Fission yeast).
OC   Eukaryota; Fungi; Dikarya; Ascomycota; Taphrinomycotina;
OC   Schizosaccharomycetes; Schizosaccharomycetales;
OC   Schizosaccharomycetaceae; Schizosaccharomyces.
OX   NCBI_TaxID=284812;

All S. pombe entries not originating from the genome strain 972 retain a species-level taxonomic identifier (NCBI_TaxID=4896), for which the new mnemonic SCHPM was created.
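The relationship between the OX line and the mnemonic codes described above can be sketched as a small lookup. The function and constant names are hypothetical, the regular expression only covers the simple OX line form shown in the examples above, and the retention of SCHPO for strain 972 is an assumption made by analogy with YEAST.

```python
import re

# Taxonomic identifiers and mnemonics quoted in the text above.
MNEMONIC_BY_TAXID = {
    559292: "YEAST",  # S. cerevisiae strain S288c
    4932: "YEASX",    # S. cerevisiae, species level
    284812: "SCHPO",  # S. pombe strain 972 (assumed, by analogy with YEAST)
    4896: "SCHPM",    # S. pombe, species level
}

# Matches lines of the form "OX   NCBI_TaxID=559292;"
OX_LINE = re.compile(r"^OX\s+NCBI_TaxID=(\d+);")


def taxid_from_flat_file(lines):
    """Return the NCBI_TaxID from the first OX line, or None if there is none."""
    for line in lines:
        match = OX_LINE.match(line)
        if match:
            return int(match.group(1))
    return None
```

Looking up the returned identifier in `MNEMONIC_BY_TAXID` then gives the species identification code expected for the entry under the new scheme.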

Changes concerning the controlled vocabulary for PTMs

New terms for the feature key ‘Cross-link’ (‘CROSSLNK’ in the flat file):
  • 1-(tryptophan-3-yl)-tryptophan (Trp-Trp) (interchain)
  • S-(2-aminovinyl)-L-cysteine (Cys-Cys)
New terms for the feature key ‘Modified residue’ (‘MOD_RES’ in the flat file):
  • (E)-2,3-didehydrobutyrine
  • (Z)-2,3-didehydrobutyrine
  • 5’-chlorotryptophan
  • L-allo-isoleucine

UniProt release 2011_04

Published April 5, 2011

Headline

The art of defining the unknown

When the human gene C22orf28 was predicted in 1998, its existence was supported by many cDNAs identified by large scale cDNA sequencing projects. Although nothing was known about the protein it encoded, it was conserved in many species in all 3 kingdoms. A consensus pattern was created in the PROSITE database which was distinctive for all related sequences, from bacteria to mammals. This allowed us to classify all these proteins into a single family, named ‘the Uncharacterized Protein Family (UPF) 0027 (UPF0027)’ to unambiguously indicate the lack of functional data.

For several years, the only annotation in the ‘General annotation (Comments)’ section of the C22orf28 entry was: “Belongs to the UPF0027 (rtcB) family.” Things changed at the beginning of this year with the publication of 3 articles that unraveled C22orf28 function in human, archaea and bacteria. In archaea and human, the protein was shown to be involved in the ligation step during tRNA splicing. In bacteria, the ligation activity may be used in the context of tRNA repair. As of this release, all entries belonging to this family have been updated with these new data and UPF0027 has been deleted from UniProtKB and replaced by the RtcB family.

In the course of manual annotation, we have encountered many examples of uncharacterized conserved proteins. This has led to the definition of a total of 765 UPFs which are listed in the upflist.txt file, available from the UniProt documentation pages, along with all associated entries and information on the taxonomic range covered. Within this list, UPFs that have since been characterized, and hence deleted, represent about 28%. They are tagged with a comment indicating the reason for the deletion, for example, for UPF0027, the comment states that it is “now characterized as a family of RNA-splicing ligases”. It should be mentioned that, in parallel with our efforts to create UPFs, the Pfam database has established an analogous classification system based on ‘Domains of Unknown Function’ (DUFs). It currently reports some 3’000 DUFs.

For bench scientists, UPFs provide a pool of exciting targets for future research since protein sequence conservation suggests important, yet unknown, functions. For database maintenance, the classification of uncharacterized proteins into families presents the major advantage of simplifying the update when functional information becomes available for at least one member.

UniProtKB news

Changes to keywords

Modified keyword:

Changes in subcellular location controlled vocabulary

New subcellular locations:

Modified subcellular locations:

UniProt release 2011_03

Published March 8, 2011

Headline

Dealing with erroneous information: a tricky task

There are many reasons for mistakes in databases, including annotation errors. However, sometimes the annotation is correct, but the original source of information contains erroneous data. Histone arginine demethylase JMJD6 is a good example of the problems raised when this happens.

Histone arginine demethylase JMJD6 was discovered in 2000. At the origin of this discovery lies a monoclonal antibody (mAb217) raised against stimulated macrophages. Phosphatidylserine-displaying liposomes inhibited the binding of mAb217 to macrophages and the antibody prevented the uptake of apoptotic cells. These characteristics suggested that mAb217 interacted with a receptor for phosphatidylserine on the membrane. Using this antibody, a 48-kDa protein was isolated and called “Phosphatidylserine receptor” (PTDSR). The effect of the deletion of the corresponding gene was investigated in knockout mice, but the reactivity of mAb217 was not compared between cells from knockout and wild-type animals. When this experiment was finally carried out, the result was quite surprising: similar staining patterns were observed with cells of both genotypes. It appeared that mAb217 could bind weakly to a PTDSR peptide, but that the antibody mainly recognized another membrane-associated protein. Parallel studies revealed that PTDSR was actually a nuclear protein, an unlikely location for a membrane receptor. After several years and many publications in high-profile journals, it was eventually demonstrated that the 48-kDa protein is not a phosphatidylserine receptor, but a dioxygenase that acts in the nucleus as a histone arginine demethylase and a lysyl-hydroxylase. Since then, other genes have emerged as candidates for the role of phosphatidylserine receptor, including STAB2, BAI1 and TIMD4.

Once the true function of the 48-kDa protein had been established, curators were faced with the challenge of updating the existing annotation to reflect this. First of all, the original recommended protein name “Phosphatidylserine receptor” had to be modified into “Bifunctional arginine demethylase and lysyl-hydroxylase JMJD6”. “Phosphatidylserine receptor” became an alternative name, as it is UniProtKB policy to keep all protein names, even obsolete ones, to facilitate the identification of the protein of interest. The problem in this case is that the obsolete name is misleading. In order to clarify this, the update of the ‘General annotation’ section included the addition of a ‘Caution’ comment, as well as the review of other subsections, such as ‘Function’ or ‘Subcellular location’. In the ‘Caution’ comment, the attention of the users is drawn to the ambiguity of the ‘Alternative name’, as well as to the erroneous conclusions reported in published references still cited in the entries. These references describe the original sequence of the protein and as such cannot be simply deleted from the entry, since this contribution has to be acknowledged. In addition, it confirms for the users that this information has been reviewed in the context of the protein and has not been overlooked.

We thus advise our users to carefully read the ‘Caution’ subsections found in entries which have had an interesting evolution (examples). Our users are also encouraged to send us feedback using the option “Contribute” at the top of each entry if they find mistakes or inconsistencies in our entries.

It is possible to track all changes occurring in an entry across releases by clicking on ‘History’, an option available at the top of each entry. For example, the major update of the human PTDSR entry can be visualized by comparing version 27 (which contained the original information) with the current one.

UniProtKB News

Changes concerning the controlled vocabulary for PTMs

New terms for the feature key ‘Cross-link’ (‘CROSSLNK’ in the flat file):
  • Cyclopeptide (His-Asn)
  • Cyclopeptide (His-Asp)
New terms for the feature key ‘Modified residue’ (‘MOD_RES’ in the flat file):
  • N6-succinyllysine
  • Lysino-D-alanine (Lys)

UniProt release 2011_02

Published February 8, 2011

Headline

Automatic annotation of UniProtKB/TrEMBL using PDB-derived data

Producing manual annotations for UniProtKB/Swiss-Prot entries containing 3D-structural data has been a priority from the very start of the Swiss-Prot database in 1986, at which point the PDB archive contained a total of 213 structures. Almost 25 years later, the PDB archive had a record 7,971 structures deposited in 2010 (giving a total of 70,229 structures) of which 7,848 contained a polypeptide chain. Although not all polypeptides are mapped to UniProtKB (immunoglobulins and synthetic sequences are not mapped, for example, as these sequences are not within the scope of UniProtKB), the vast majority are. In addition, most vertebrate proteins whose 3D structure has been deposited to the PDB archive have been manually annotated in the Swiss-Prot section of UniProtKB.

For the mapped polypeptides, about 85,000 PDB cross-references point to 25,000 UniProtKB entries. These are divided between approximately 16,000 UniProtKB/Swiss-Prot and 8,000 UniProtKB/TrEMBL entries with at least one PDB cross-reference. Only three years earlier, close to 3,000 UniProtKB/TrEMBL entries and about 12,000 UniProtKB/Swiss-Prot entries had a PDB cross-reference. The vast majority of these 8,000 TrEMBL entries are from bacteria and other microbes.

In UniProtKB/Swiss-Prot, 3D-structure data are manually annotated and integrated mainly in the ‘Sequence annotation (Features)’ section (see Q9C0B1 as an example). This enables users to find the salient structural information directly in the UniProtKB/Swiss-Prot entry. The situation is different for entries in UniProtKB/TrEMBL which have not yet benefited from the manual addition of these data. Furthermore, with the number of UniProtKB/TrEMBL entries with a PDB cross-reference more than doubling in three years, an ever-growing amount of new and potentially interesting PDB-derived data would remain difficult to access for many UniProtKB users. The new UniProt-PDB import pipeline addresses this issue, and UniProtKB/TrEMBL now contains annotations derived from the PDBe and PDBe Motif databases. These annotations include 15,000 new sequence annotations (‘Features’) that show ligand interactions with small molecules and metal ions (e.g. Q8U2I8, Q8U2V3, Q939U1), and 10,000 new citations in almost 8,000 entries.

The procedure produces feature annotations for about 200 types of small molecules in the PDB archive, which have been hand-picked to offer the most unambiguous biological activity. Typically, these include the most common metals, enzyme cofactors (as detailed in the CoFactor database), post-translationally modified residues, carbohydrates, flavins and nucleotide phosphates. None of this would have been possible without the UniProtKB-PDB mappings data produced and maintained in a collaborative effort between UniProt and the PDBe in the form of the SIFTS project which provides the crucial link between UniProtKB and PDB sequence coordinates down to the residue level.

By including these data, we hope to improve access to experimental, position-specific data about ligand binding sites for the scientific community.

All UniProtKB entries containing 3D-structure data can be retrieved using the keyword 3D-structure. They include UniProtKB/Swiss-Prot and UniProtKB/TrEMBL entries.

UniProtKB news

Cross-references to neXtProt

Cross-references have been added to neXtProt, the human protein knowledge platform.

neXtProt is available at http://www.nextprot.org/.

The format of the explicit links in the flat file is:

Resource abbreviation: neXtProt
Resource identifier: neXtProt unique identifier.
Example (P31946):
DR   neXtProt; NX_P31946; -.

Show all the entries having a cross-reference to neXtProt.

Cross-references to GeneTree

Cross-references have been added to the phylogenetic gene trees that are available at www.ensembl.org and www.ensemblgenomes.org.

The format of the explicit links in the flat file is:

Resource abbreviation GeneTree
Resource identifier GeneTree unique identifier.
Example P32234:
DR   GeneTree; EMGT00050000006238; -.

Show all the entries having a cross-reference to GeneTree.

UniProt release 2011_01

Published January 11, 2011

Headline

An old-timer, but still trendy: 10’000 entries for Arabidopsis thaliana in UniProtKB/Swiss-Prot

Arabidopsis thaliana belongs to the Brassicaceae family that includes the well-known dietary staples cauliflower, broccoli, cabbage, turnip, radish, canola and mustard. The split between the Arabidopsis group and the other crops of the genus Brassica has been estimated at around 43 million years ago. A. thaliana has been widely studied since the 1980s, when the development of T-DNA mediated transformation made the generation of mutants and their study relatively easy. Since then, A. thaliana has become the model of choice for the study of many biological processes of flowering plants, as well as those specific to the Brassica species.

Its utility as a model organism was further boosted by the completion of the whole genome sequence, which revealed a relatively small genome with a low level of duplication compared to other flowering plants, and by the inception of a number of complementary efforts to sequence the transcriptome.

In 2001, the Swiss-Prot group created the Plant Proteome Annotation Program (PPAP) whose main focus is the annotation of proteins and protein families of A. thaliana and rice. One decade on, we have now annotated over 10’000 A. thaliana entries in UniProtKB/Swiss-Prot. This corresponds to around 36% of the A. thaliana proteome according to version 10 of the Arabidopsis thaliana genome annotation from The Arabidopsis Information Resource (TAIR), which estimates the number of protein-coding genes in this organism at 27’416. According to the Multinational Arabidopsis Steering Committee (MASC) report 2010, at least one third of these genes still have no known function, so the work of experimentally characterizing and annotating each gene is far from finished.

All manually annotated A. thaliana entries can be retrieved from UniProtKB/Swiss-Prot using the organism name “Arabidopsis thaliana” (or the taxonomy identifier 3702), with the restriction: “reviewed:yes”.

UniProtKB news

Cross-references to Allergome

Cross-references have been added to Allergome, a platform for allergen knowledge.

Allergome is available at http://www.allergome.org/.

The format of the explicit link in the flat file is:

Resource abbreviation Allergome
Resource identifier Allergome unique identifier
Optional information 1 Allergen name
Example O76821
DR   Allergome; 2; Aca s 13.
DR   Allergome; 3051; Aca s 13.0101.
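
The layout described above (resource abbreviation, resource identifier, optional information) can be read with a few lines of Python. This is only a minimal sketch: the function name parse_dr_line is hypothetical, and it assumes that fields are separated by semicolons, that the line ends with a single period, and that a lone dash marks an empty field, as in the examples shown in these release notes.

```python
def parse_dr_line(line):
    """Split a flat-file DR line into (resource, identifier, optional_fields).

    Assumed layout (taken from the examples in the text): 'DR', whitespace,
    fields separated by ';', one terminating '.'. A lone '-' is an empty field.
    """
    if not line.startswith("DR"):
        raise ValueError("not a DR line: %r" % line)
    body = line[2:].strip()
    if body.endswith("."):
        body = body[:-1]  # remove only the terminating period
    fields = [f.strip() for f in body.split(";")]
    resource, identifier = fields[0], fields[1]
    optional = [None if f == "-" else f for f in fields[2:]]
    return resource, identifier, optional


for dr in ("DR   Allergome; 2; Aca s 13.",
           "DR   Allergome; 3051; Aca s 13.0101."):
    print(parse_dr_line(dr))
```

Removing only the final period, rather than every period, matters here because allergen names such as Aca s 13.0101 contain periods themselves.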

Changes to keywords

New keyword: Modified keyword: Deleted keywords:
  • Core protein
  • Fiber protein
  • Fusion protein
  • Hexon protein
  • Hexon-associated protein
  • Phage recognition

Changes in the controlled vocabulary for PTMs

New term for the feature key ‘Modified residue’ (‘MOD_RES’ in the flat file):
  • N,N-dimethylleucine

UniProt release 2010_12

Published November 30, 2010

Headline

Fishing for new mutations in the human exome

Understanding the role of genetic variants in human health and disease is crucial in modern biology and medicine. The International HapMap Project and, more recently, the 1000 Genomes Project are progressively unveiling the map of human genome variation at the scale of the human population, generating a flood of interesting data. Smaller research projects focused on disease-causing mutations also contribute through the development of new fruitful approaches. One of the current trends in large and small scale projects is exome sequencing. The rationale is that the clear majority of allelic variants known to underlie mendelian disorders disrupt protein-coding sequences. Restricting sequencing to exons decreases the sample size to 2-5% of that of the whole genome, thus saving time and money, while allowing the identification of missense and nonsense mutations, of small insertions and deletions (indels), as well as of splice donor and acceptor site variants. By definition, exome sequencing does not permit the discovery of mutations in non-coding, regulatory or intronic genomic regions which are known to affect disease.

The exome sequencing strategy is proving to be quite effective, as it has recently been used to pinpoint several genes whose mutations are associated with diseases, including DHODH involved in postaxial acrofacial dysostosis (Ng et al., 2010), WDR62 in severe cerebral cortical malformations (Bilguvar et al., 2010) and MLL2 in Kabuki syndrome (Ng et al., 2010).

The annotation of single amino acid polymorphisms (SAPs) has always been a priority in UniProtKB/Swiss-Prot, including not only ‘neutral’ polymorphisms, resulting from normal variations among individuals, but also disease-associated mutations. Thus missense SAPs identified by the exome-sequencing strategy have been quickly annotated and integrated in the ‘Sequence annotation (Features)’ section of their respective entries (Q02127, O43379 and O14686). The associated phenotypes are described in the ‘General annotation (Comments)’ section in ‘Involvement in disease’ (Q02127, O43379 and O14686).

Over the years, we have developed a defined format to describe SAPs in the ‘Sequence annotation (Features)’ section, including dbSNP accession numbers, when they exist, and links to bibliographic references. Disease-causing mutations are tagged, whenever possible, with the official abbreviation of the phenotype provided by the OMIM database. In addition to missense mutations, in-frame indels are also reported (P35453, P02730 or P33897). When it is not possible to represent the whole variation landscape for a given protein within the UniProtKB entry, we try and provide cross-references to specialized resources (see for instance the ‘Web resources’ section in human p53 entry). Our annotation effort does not include the representation of mutations that cause major changes to a protein sequence, such as frameshift mutations or variations at splice sites, as their deleterious effects on protein function are usually obvious.

Close to 63’000 human SAPs are currently stored in UniProtKB/Swiss-Prot and about 30% of them are reported as disease-associated in the literature. SAPs selected from this pool are mapped to reference nucleotide sequences from RefSeq and LRG, following the guidelines established by the Human Genome Variation Society for sequence variant designation, and submitted to dbSNP (see for instance dbSNP/Swiss-Prot variant rs121908210). Thanks to a tight collaboration with Ensembl, all human variants stored in UniProtKB and characterized by a dbSNP accession number (or submitted to dbSNP) can also be accessed from the Ensembl database and viewed in the context of their nucleotide sequence (see variant rs1269215 stored in UniProtKB entry Q9BVK8). Our ultimate goal is to spread information about protein variations to the broadest possible audience.

UniProtKB news

Line length limit

Historically, UniProtKB flat file entries were formatted to not exceed 75 characters per line. This limitation served both to display them nicely on small screens and to allow them to be processed by programs that had memory limitations. Since then, computers have become more powerful and most programs have been adapted accordingly. UniProt has already made a few exceptions to the line length limit for data that cannot be wrapped, such as URLs or DOIs, or where wrapping does not increase readability, such as for protein names and a few cross-references to other databases. For the latter in particular, we have a growing amount of additional information to incorporate. We will continue to wrap lines at 75 characters where it helps to increase readability, but allow for more characters where necessary. The new upper limit is 255 characters per line, as some users still depend on software with this limitation.
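
Users who want to check whether their own tools are affected could count how many lines of a flat file exceed the two thresholds mentioned above. The following Python sketch is only an illustration: the function name and the sample lines are hypothetical, and only the values 75 and 255 come from the text.

```python
WRAP_TARGET = 75   # historical wrapping width (from the text)
HARD_LIMIT = 255   # new upper limit per line (from the text)


def classify_line_lengths(lines):
    """Count lines of at most 75 characters, 76 to 255 characters, and over 255."""
    counts = {"within_75": 0, "over_75": 0, "over_255": 0}
    for line in lines:
        n = len(line.rstrip("\n"))  # ignore the newline itself
        if n <= WRAP_TARGET:
            counts["within_75"] += 1
        elif n <= HARD_LIMIT:
            counts["over_75"] += 1
        else:
            counts["over_255"] += 1
    return counts


# Hypothetical sample lines, not taken from a real entry.
sample = ["ID   TEST_ENTRY", "DR   " + "x" * 100, "DE   " + "y" * 300]
print(classify_line_lengths(sample))
```

A non-zero count in the last bucket would indicate lines that break the stated 255-character limit.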

Changes to cross-references to RefSeq

We have introduced an additional field to the cross-reference (DR line in the flat file) to the NCBI Reference Sequences database to show the RefSeq nucleotide accession number.

The format of the explicit links in the flat file is:

DR   RefSeq; RefSeq protein accession number; RefSeq nucleotide accession number.

Example: P00816

Previous format in the flat file:

DR   RefSeq; AP_000992.1; -.
DR   RefSeq; NP_414874.1; -.

New format:

DR   RefSeq; AP_000992.1; AC_000091.1.
DR   RefSeq; NP_414874.1; NC_000913.2.

Changes to keywords

New keywords:

Changes in subcellular location controlled vocabulary

New subcellular locations:

Changes in the controlled vocabulary for PTMs

New term for the feature key ‘Modified residue’ (‘MOD_RES’ in the flat file):
  • 3’-nitrotyrosine

UniProt release 2010_11

Published November 2, 2010

Headline

Pupylation: a ubiquitin-like tagging system in bacteria

While ubiquitin has been known for decades as a post-translationally conjugated protein degradation tag in eukaryotes, the first identified prokaryotic protein that is functionally analogous to ubiquitin, i.e. prokaryotic ubiquitin-like protein Pup, has only recently been discovered in mycobacteria.

Pup (64 residues) and ubiquitin (76 residues) show neither structural nor sequence homology except for a GG motif near or at the C-terminus. Although both Pup and ubiquitin are attached to the epsilon-amino group of lysine side chains in substrates and target the substrates for degradation by the proteasome, the enzymology of ubiquitination and pupylation and the chemistry of the coupling reaction appear completely different. Ubiquitin is coupled to substrates via the carboxyl group of its C-terminal glycine in a multistep reaction involving several enzymes (see release 2010_10 headline). In the mycobacterial pupylation pathway, the C-terminal glutamine of Pup is first deamidated to glutamate by Dop (deamidase of Pup) after which it is ligated to the substrate lysine of target proteins by proteasome accessory factor A (PafA). Neither Dop nor PafA is similar to ubiquitin-activating enzymes. The covalently Pup-modified protein is then recognized and unfolded by the proteasomal ATPase Mpa and degraded by the proteasome. The very recent discovery of a depupylase activity provided by Dop, able to remove conjugated Pup from target proteins in a manner analogous to the deconjugation of ubiquitin from eukaryotic proteins, strengthens the parallels between the Pup- and ubiquitin-tagging systems of prokaryotes and eukaryotes, respectively. However Mycobacterium appears to have a single Pup ligase to mediate all pupylation and a single depupylase for all pupylated substrates, in contrast to the human genome that encodes hundreds of ubiquitin ligases and dozens of deubiquitinating enzymes.

Taken together, prokaryotes and eukaryotes appear to have developed distinct but parallel mechanisms to regulate protein stability by a similar proteolytic machinery: the proteasome found in all eukaryotes and archaea, and in bacteria of the class Actinobacteria, including the genus Mycobacterium.

All the known pupylation-related proteins in bacteria have now been annotated in UniProtKB/Swiss-Prot.

UniProtKB news

Changes concerning keywords

New keywords: Modified keyword:

Changes in subcellular location controlled vocabulary

New subcellular locations:
  • Filopodium tip
  • Pseudopodium tip
  • Bleb
  • Phagocytic cup

Changes concerning the controlled vocabulary for PTMs

New terms for the feature key ‘Modified residue’ (‘MOD_RES’ in the flat file):
  • N,N,N-trimethylserine
  • N,N-dimethylserine
  • N-methylserine
  • CysO-cysteine adduct

UniProt release 2010_10

Published October 5, 2010

UniProtKB/Swiss-Prot ubiquitin pathway annotation

Post-translational modifications (PTMs) can have a profound effect on protein function. They act as switches to activate or inactivate polypeptides, change their subcellular location, modify protein-protein partnerships, etc. However, no PTM is as versatile as ubiquitination, i.e. the post-translational conjugation of ubiquitin. Ubiquitination can occur on a large range of proteins and not only controls their lifespan, but also expands their functional repertoire (see reviews). In view of its importance in many cellular events, we have decided to qualitatively and quantitatively improve our annotation of proteins involved in the ubiquitin and ubiquitin-like pathways in various species, ranging from plants to mammals. Bacteria and archaea, which have recently been shown to have a ubiquitin-like system for protein degradation, called pupylation, were not neglected (see next release’s headline).

Ubiquitin (see for instance entry P0CG47 featuring one of the human ubiquitin gene products) is a small 76 amino-acid protein that is ubiquitously expressed (hence its name) in all eukaryotic cells and highly conserved among eukaryotic species: human and yeast ubiquitin share 96% sequence identity. Ubiquitination most frequently occurs via an isopeptide bond between a lysine of the target protein and the C-terminal glycine of ubiquitin. Substrates can be monoubiquitinated, via the attachment of a single ubiquitin, or multiubiquitinated, when more than one amino acid is modified with monoubiquitin. Ubiquitin can also be added sequentially to substrates to form ubiquitin chains resulting in polyubiquitination. In ubiquitin polymers, the lysine side chain of one ubiquitin molecule is linked to the C terminus of another ubiquitin molecule, and so on. Ubiquitin contains 7 lysine residues, all of which can contribute to such linkages with a different functional outcome for the target protein. For instance, the most prominent function of ubiquitin is labeling proteins for proteasomal degradation. This signal is conveyed by polyubiquitin chains linked through the ubiquitin lysine-48 side chain (‘Lys-48’-linked chains). ‘Lys-63’-linked polyubiquitin chains function in signal transduction and DNA repair without functioning as a degradation signal. Monoubiquitination has recently been shown to have a signaling function in the endocytic pathway.

Three types of enzyme – E1, E2 and E3 – carry out ubiquitination. E1s activate ubiquitin, E2s pick up the ubiquitin from E1 and, in close collaboration with E3, conjugate it to substrates. E3s have a crucial role in recognition of the substrate. They are either catalytically active and directly transfer the activated ubiquitin to the target, or serve as a scaffold linking catalytic E2 to the appropriate substrate. All eukaryotes encode a very limited number of E1 enzymes (a single gene in many species, 3 in humans), but multiple isozymes of E2 and E3, up to several dozen E2s and many hundreds of E3s. This allows the modification of many proteins in a highly specific and controlled manner.

Ubiquitin modification is only transient: enzymes, known as deubiquitinating enzymes (DUBs), can remove ubiquitin molecules that are attached to proteins. They also show specificity towards the type of ubiquitin linkage. For instance, the BRCC3 metalloprotease specifically cleaves ‘Lys-63’-linked polyubiquitin chains, while the cysteine protease USP15 shows preference for ‘Lys-48’ chains.

The ubiquitin pathway turned out to be even more complex with the discovery of several ubiquitin-like proteins, including SUMO, ISG15, NEDD8, UFM1. These proteins also regulate a vast array of cellular events, such as nuclear transport, transcriptional regulation, apoptosis, protein stability, signalling, protein-protein interactions, etc.

The UniProtKB annotation marathon led to the integration of 940 new eukaryotic entries and annotation of 942 new sites of ubiquitination. Close to 4’000 experimental GO terms have been manually added to UniProtKB entries. 469 proteins directly involved in the process of ubiquitination (and ubiquitin-like conjugation) have been annotated or updated and can now be retrieved with the keyword ‘Ubl conjugation pathway’, along with some other 3’400 manually reviewed entries. Proteins undergoing ubiquitination, including autoubiquitination classically observed in E3 proteins, are tagged with the keyword ‘Ubl conjugation’ and, when known, the effect of the PTM is indicated in the ‘Post-translational modification’ subsection of ‘General Annotation’ (see for instance entry Q9Y243).

UniProtKB News

Changes to cross-references to Ensembl

The cross-references to the Ensembl database have been modified. The optional field describing the species name has been removed, because it is no longer necessary to build a valid URL.

Example:

Previous format in the flat file:

DR   Ensembl; ENST00000220809; ENSP00000220809; ENSG00000104368; Homo sapiens.

New format:

DR   Ensembl; ENST00000220809; ENSP00000220809; ENSG00000104368.
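
For users who still hold files in the previous format, the change amounts to dropping the last field. The Python sketch below illustrates this under stated assumptions: the function name is hypothetical, and an old-style line is recognised simply by having five semicolon-separated fields (resource abbreviation, three identifiers, species name).

```python
def drop_ensembl_species(line):
    """Rewrite an old-style Ensembl DR line to the new format.

    Assumption: an old-style line has exactly five ';'-separated fields,
    the last one being the species name. Other lines are returned as-is.
    """
    body = line[2:].strip()
    if body.endswith("."):
        body = body[:-1]
    fields = [f.strip() for f in body.split(";")]
    if fields[0] == "Ensembl" and len(fields) == 5:
        fields = fields[:4]  # keep transcript, protein and gene identifiers
        return "DR   " + "; ".join(fields) + "."
    return line


old = "DR   Ensembl; ENST00000220809; ENSP00000220809; ENSG00000104368; Homo sapiens."
print(drop_ensembl_species(old))
```

Lines that are already in the new format, or that belong to another resource, are returned unchanged, so the function can be applied to every DR line of an entry.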

UniProt release 2010_09

Published August 10, 2010

Headline

‘De-merge’ of multi-gene entries derived from a single species in UniProtKB/Swiss-Prot

UniProtKB/Swiss-Prot has historically “merged” 100% identical protein sequences from different genes in the same species into one single record. The aim of this approach was to reduce sequence redundancy within the proteome of individual species, facilitating protein identification and the functional annotation of protein sequences. These merged entries provide extensive annotation of the protein sequence itself, as well as information on each of the individual source genes, including cross-references to external gene-centric resources that provide gene models and genomic information.

As the availability and usage of genomic information has greatly increased in recent years, UniProtKB is modifying its merging policy. We have already begun to “de-merge” entries containing multiple individual genes coding for 100% identical protein sequences into individual UniProtKB/Swiss-Prot entries containing a single gene. This will give a gene-centric view of protein space, where the same protein sequence can be represented multiple times by distinct UniProtKB/Swiss-Prot entries, each of which is based on the translation of a single distinct gene. It will allow a cleaner and more logical mapping of gene and genomic resources to UniProtKB, which provide the major point of entry to the resulting proteome for many users. It will also facilitate the annotation of protein features that are uniquely associated with specific copies of duplicated genes, such as alternative splice forms that are found in genes encoded by multiple exons but not in single exon copies derived from retro-transposed cDNAs. This type of information can be most effectively captured in a gene-centric view of protein space, providing a precise description of how genome evolution and structure impact the protein complement of a cell.

One consequence of this change in annotation policy is that the level of protein sequence redundancy in UniProtKB will slightly increase, as multiple identical instances of a given protein sequence may now exist within the proteome of a particular species or strain. The process of de-merging has already begun with a number of proteins from Escherichia coli and Homo sapiens and other vertebrates, and will be an ongoing process in UniProtKB. For pragmatic reasons, there are several multi-gene families which will not be targeted for de-merge in the near future, as the difficulties associated with maintaining these individual annotated sequences are significant. These include the human histone genes and the calmodulins, which will continue to be grouped into one entry for the current time. However for simpler cases, especially those in which the genomic context of the gene affects the properties of the encoded protein, de-merging will be preferred.

The de-merge procedure

In simple cases, the de-merge procedure involves the creation of one new UniProtKB entry for each gene in the current merged UniProtKB entry. A new primary accession number is attributed to each de-merged entry, and the primary accession number of the formerly merged entry is retained as a secondary accession number in each of the resulting de-merged entries. To illustrate how the de-merge procedure affects the representation of protein sequences in UniProtKB, consider the example of the human ubiquitin protein. Ubiquitin in humans is encoded by four distinct genes, RPS27A, UBA52, UBB and UBC. RPS27A and UBA52 include a single ubiquitin moiety as an N-terminal fusion to a ribosomal protein, while UBB and UBC encode distinct poly-ubiquitin chains. In UniProtKB release 2010_08, the human ubiquitin protein sequence was represented by one single UniProtKB entry (UBIQ_HUMAN, P62988), that included the ubiquitin protein sequences derived from all four of the aforementioned genes. For UniProt release 2010_09, these four genes were de-merged into 4 distinct UniProtKB entries corresponding to each of the four ubiquitin genes. Following the de-merge, ubiquitin chains from RPS27A and UBA52 were then re-merged to the entries describing their cognate ribosomal proteins, and are now represented as peptides derived from the translated ubiquitin-ribosomal protein fusion. The final result of this process is four distinct UniProtKB entries that include ubiquitin protein sequences derived from four loci: RPS27A, UBA52, UBB, and UBC. Each of these entries retains the primary accession number of the old merged entry UBIQ_HUMAN (P62988) as a secondary accession number.
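
In the flat file, accession numbers are listed on AC lines, with the primary accession number first and secondary accession numbers after it. The Python sketch below shows how a user might check whether an entry derives from a formerly merged one. The function names are hypothetical, and the AC line used as input is a simplified illustration rather than a copy of a real entry: an actual de-merged entry may list further secondary accession numbers.

```python
def split_accessions(ac_lines):
    """Return (primary, secondaries) from the AC line(s) of one entry.

    Assumes accession numbers are separated by ';' and that the first
    one listed is the primary accession number.
    """
    accessions = []
    for line in ac_lines:
        if not line.startswith("AC"):
            continue
        accessions.extend(a.strip() for a in line[2:].split(";") if a.strip())
    if not accessions:
        raise ValueError("no accession numbers found")
    return accessions[0], accessions[1:]


def derives_from(old_accession, ac_lines):
    """True if old_accession is kept as a secondary accession number."""
    _, secondaries = split_accessions(ac_lines)
    return old_accession in secondaries


# Simplified, illustrative AC line (not copied from a real entry).
example = ["AC   P0CG47; P62988;"]
print(split_accessions(example))
print(derives_from("P62988", example))
```

Because every de-merged entry keeps the old primary accession number as a secondary one, a search for P62988 is expected to lead to all entries that replaced UBIQ_HUMAN.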

UniProtKB news

Cross-references to Protein Model Portal

Cross-references have been added to Protein Model Portal, developed as a module of the PSI-Nature Structural Biology Knowledgebase (http://sbkb.org/). The Protein Model Portal provides a single interface to query simultaneously the existing precomputed models at various sites, and gives access to interactive services for template selection, target-template alignment, model building, and quality assessment. Models are provided by the PSI centers (CSMP, JCSG, MCSG, NESG, NYSGXRC, JCMM), and by independent modeling groups. The task of the portal is to unify the model data from the different sites.

Protein Model Portal is available at http://www.proteinmodelportal.org/.

The format of the explicit links in the flat file is:

Resource abbreviation ProteinModelPortal
Resource identifier UniProtKB accession number
Examples P84155:
DR   ProteinModelPortal; P84155; -.
P27362:
DR   ProteinModelPortal; P27362; -.

Show all the entries having a cross-reference to Protein Model Portal.

Changes to keywords

New keywords:

Changes in subcellular location controlled vocabulary

New subcellular locations:

Changes to the controlled vocabulary for PTMs

New term for the feature key ‘Cross-link’ (‘CROSSLNK’ in the flat file):
  • Glycyl cysteine thioester (Cys-Gly) (interchain with G-...)

Website news

New BLAST features

We have updated the BLAST results view of uniprot.org:

  • All information that is visible or configurable in the results page of a text search in one of the UniProt core datasets is now also available in the results page of a BLAST search. Click the Customize display link on the BLAST results page to see which additional columns you can add to the Detailed BLAST results table. For instance, select Comment, press Show, tick Function, press Show to add the column comment(FUNCTION) to see the available functional annotation of your BLAST hits.
  • BLAST results can now be filtered by dataset and taxonomy. The Filter section will show you in brackets the number of hits for each dataset or taxonomy branch. For instance, after running BLAST against the full UniProtKB dataset, you can filter your results to show only hits that are from Bacteria.
  • When running a BLAST search against UniProtKB, it is possible to project the sequence annotations of the matched UniProtKB entries onto the alignments generated by BLAST. To see an alignment, click on it in the Alignments column of the Detailed BLAST results table, then tick the Annotations that you would like to highlight in the alignment. This allows you to see at a glance if important positions are conserved.

Another new feature is the option to run BLAST searches against UniParc. Please use this with caution, as UniParc is an archive that also contains pseudogenes and incorrect CDS predictions.

Updated look and feel

The website received a small face lift to improve the navigation. The UniProt entry views, as well as the various tools’ results views, now have blue navigation bars at the top and bottom with links that allow you to quickly access different sections of big views. Where applicable, the top bar features a Customize display link that lets you customize the view.

UniProt release 2010_08

Published July 13, 2010

Headline

Viral reference strains: a virtual vaccine against virus pandemic in sequence databases

Viruses are not only the most abundant biological entities on the planet, they are also the most represented taxonomic group in UniProtKB. Without contest the title holder is the HIV-1 virus with about 350’000 entries. Taking into account that the HIV genomes encode about 9 proteins, these entries correspond to the equivalent of about 35’000 complete genomes!

While these numbers reflect the tremendous sequence diversity of viruses, they also make it difficult to find one’s way around, and users looking for general information on a viral species face a dilemma: which one to choose? Retrieving only manually reviewed proteins will still leave the user in doubt as the same viral proteins can be present by the dozen in UniProtKB/Swiss-Prot. For example, which Influenza A Hemagglutinin proteins should be selected preferentially among the 170 reviewed entries?

The UniProt solution to this problem is to define viral reference strains, each being representative of one virus genus, to curate them to the highest quality standards and to continuously maintain their annotation. The reference strains that have been selected are those whose genomes belong to the NCBI Reference Sequence collection (RefSeq). Therefore not only their proteomes, but also their genomes are carefully reviewed. The keyword ‘Virus reference strain’ has been created to allow their easy retrieval. At the current time we have defined 355 viral reference strains. These reference strains contain 12’576 proteins, of which 4’500 entries, most representing double-stranded DNA viruses, have been tagged with the ‘Virus reference strain’ keyword. We are actively updating the remaining 8’000 entries to provide a full set of tagged entries reflecting the diversity of the virus world.

Reference strains allow users to identify the strain with the best and most up-to-date information for any given virus. For bioinformaticians, they present another interesting feature as they can serve as templates for high quality automated annotation of other viruses of the same genus, following a pipeline analogous to the one used in UniProtKB for microbial proteins (see HAMAP program).

The viral reference strains are also accessible via the ViralZone fact sheet which provides links to the corresponding UniProtKB proteome and RefSeq genome (see for instance Influenza A).

UniProtKB News

Format change in the cross-references to WormBase

C. elegans and C. briggsae entries used to have cross-references to both WormPep and WormBase databases. WormPep is no longer active, and all worm sequences are contained in WormBase, a comprehensive database for biological information on worm sequences and annotation. We have therefore removed cross-references to WormPep and modified the WormBase cross-references to include transcript and protein identifiers from WormPep. Proteins with alternative products have one WormBase cross-reference per gene product.

Previous format in the flat file:

DR   WormPep; TranscriptIdentifier; ProteinIdentifier.
DR   WormBase; GeneIdentifier; GeneName.

New format:

DR   WormBase; TranscriptIdentifier; ProteinIdentifier; GeneIdentifier; GeneName.

If there is no GeneName, a dash (‘-’) is stored in that position.

Example: O45818

Previous format in the flat file:

DR   WormBase; WBGene00012019; dkf-2.
DR   WormPep; T25E12.4a; CE18967.
DR   WormPep; T25E12.4b; CE18283.
DR   WormPep; T25E12.4c; CE42507.

New format:

DR   WormBase; T25E12.4a; CE18967; WBGene00012019; dkf-2.
DR   WormBase; T25E12.4b; CE18283; WBGene00012019; dkf-2.
DR   WormBase; T25E12.4c; CE42507; WBGene00012019; dkf-2.

Show all the entries having a cross-reference to WormBase.

Cross-references to WormPep have been removed.
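
The WormBase format change described in this section can be reproduced on the O45818 example with a short Python sketch. It is an illustration only, not the procedure used by UniProt: the function names are hypothetical, and the code assumes one old-style WormBase line (gene identifier and gene name) plus one WormPep line per gene product.

```python
def _fields(line):
    """Split a DR line into its ';'-separated fields, without the final '.'."""
    body = line[2:].strip()
    if body.endswith("."):
        body = body[:-1]
    return [f.strip() for f in body.split(";")]


def merge_wormpep_into_wormbase(dr_lines):
    """Build new-style WormBase DR lines from old-style WormBase/WormPep lines.

    Assumes a single old-style WormBase line per entry; a missing gene
    name is written as '-'.
    """
    gene_id = gene_name = None
    products = []
    for line in dr_lines:
        f = _fields(line)
        if f[0] == "WormBase" and len(f) == 3:
            gene_id, gene_name = f[1], f[2]
        elif f[0] == "WormPep" and len(f) == 3:
            products.append((f[1], f[2]))  # (transcript ID, protein ID)
    if gene_id is None:
        raise ValueError("no old-style WormBase line found")
    return ["DR   WormBase; %s; %s; %s; %s." % (t, p, gene_id, gene_name or "-")
            for t, p in products]


old_lines = [
    "DR   WormBase; WBGene00012019; dkf-2.",
    "DR   WormPep; T25E12.4a; CE18967.",
    "DR   WormPep; T25E12.4b; CE18283.",
    "DR   WormPep; T25E12.4c; CE42507.",
]
for new_line in merge_wormpep_into_wormbase(old_lines):
    print(new_line)
```

Each WormPep line yields one new WormBase line, which matches the rule that proteins with alternative products have one WormBase cross-reference per gene product.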

Changes concerning keywords

New keywords:

Changes concerning the controlled vocabulary for PTMs

New term for the feature key ‘Modified residue’ (‘MOD_RES’ in the flat file):
  • S-(coelenterazin-3a-yl)cysteine

Deleted terms:

  • Glutamyl lysine isopeptide (Gln-Lys) (interchain with K-...)
  • Glutamyl lysine isopeptide (Lys-Gln) (interchain with Q-...)

UniProt release 2010_07

Published June 15, 2010

Headlines

UniProt and the International Nucleotide Sequence Database Collaboration

UniProt has had a very beneficial and long-standing collaboration with the three members of the International Nucleotide Sequence Database Collaboration (INSDC) – the EMBL-Bank, GenBank and the DNA Data Bank of Japan (DDBJ). It began at the most basic level with an exchange of nucleotide and protein sequences, evolved through co-development of the nucleotide entry feature table definition to ensure efficient automatic integration of appropriate protein information into UniProt followed by reciprocal cross-references, and from there has recently progressed to a joint endorsement of protein naming guidelines. This was one outcome of the third NCBI Genome Annotation Workshop in Washington, USA in April 2010 where researchers from life science organizations world-wide collaborated to establish minimal standards for prokaryotic and viral annotation. Extremely productive discussions concerning annotation and underlying problems led to a number of resolutions that were adopted by the international microbial sequencing community. The highlight was the development and acceptance by the community of prokaryotic protein naming guidelines (see file proknameprot.txt) based on an initial proposal from the INSDC and UniProt. Following this agreement, INSDC and UniProt also created a more generalised protein guideline (see file gennameprot.txt) to make this useful for taxa outside cellular prokaryotes. The decision by the INSDC to provide these guidelines for adoption by all submitters to their databases will greatly enhance the annotation of complete genomes and proteomes and ensure that the user community can exploit this data to its full potential. This is a particularly timely and exciting development given the data avalanche. Future plans for the INSDC and UniProt involve collaboration with the NCBI’s Genome project and the Reference Sequence (RefSeq) collection groups to provide synchronized well-annotated genomes and proteomes.

The new files gennameprot.txt and proknameprot.txt are available in UniProt Documents, Nomenclature and guidelines section, and can be accessed from the Documentation/Help pages.

UniProtKB News

New feature key INTRAMEM in the flat file

In addition to the feature keys TOPO_DOM (which describes the topology of regions for transmembrane proteins that span membrane compartments) and TRANSMEM (which describes the extent of the region spanning a membrane), we have introduced a new feature key INTRAMEM in the flat file to describe the extent of a region located in a membrane without crossing it.
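
A program that already handles TOPO_DOM and TRANSMEM needs to recognise the new key as well. The Python sketch below shows one way to collect the three membrane-related keys from FT lines. It rests on several assumptions: the function name is hypothetical, the key and the start and end positions are taken to be the first three whitespace-separated tokens after 'FT', and the sample lines (positions and descriptions) are invented for illustration.

```python
MEMBRANE_KEYS = {"TOPO_DOM", "TRANSMEM", "INTRAMEM"}


def membrane_features(ft_lines):
    """Return a list of (key, start, end) for membrane-related FT lines.

    Continuation lines and lines whose positions are not plain integers
    are skipped in this simplified sketch.
    """
    result = []
    for line in ft_lines:
        tokens = line.split()
        if len(tokens) < 4 or tokens[0] != "FT" or tokens[1] not in MEMBRANE_KEYS:
            continue
        try:
            start, end = int(tokens[2]), int(tokens[3])
        except ValueError:
            continue
        result.append((tokens[1], start, end))
    return result


# Hypothetical FT lines, not taken from a real entry.
ft_sample = [
    "FT   TOPO_DOM      1     20       Cytoplasmic.",
    "FT   TRANSMEM     21     41       Helical.",
    "FT   INTRAMEM     60     75       Helical.",
    "FT   DOMAIN      100    200       Protein kinase.",
]
print(membrane_features(ft_sample))
```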

Cross-references to EnsemblBacteria, EnsemblFungi, EnsemblMetazoa, EnsemblPlants and EnsemblProtists

Cross-references have been added to EnsemblBacteria, EnsemblFungi, EnsemblMetazoa, EnsemblPlants and EnsemblProtists. These databases are part of Ensembl Genomes, which has been created to complement the existing Ensembl site, which focuses on vertebrate genomes.

The format of the explicit links in the flat file is:

Resource abbreviation EnsemblBacteria or EnsemblFungi or EnsemblMetazoa or EnsemblPlants or EnsemblProtists
Resource identifier Transcript ID
Optional information 1 Protein ID
Optional information 2 Gene ID
Examples Q53653:
DR   EnsemblBacteria; EBSTAT00000032812; EBSTAP00000031682; EBSTAG00000032810.
Q07163:
DR   EnsemblFungi; YDR365W-B; YDR365W-B; YDR365W-B.
Q9NDJ2:
DR   EnsemblMetazoa; FBtr0071602; FBpp0071528; FBgn0020306.
DR   EnsemblMetazoa; FBtr0071603; FBpp0071529; FBgn0020306.
DR   EnsemblMetazoa; FBtr0071604; FBpp0071530; FBgn0020306.
P49333:
DR   EnsemblPlants; AT1G66340.1-TAIR; AT1G66340.1-P; AT1G66340-TAIR-G.
Q54L85:
DR   EnsemblProtists; DDB0305146; DDB0305146; DDB_G0286833.
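
The field layout described above can be read back with a few lines of Python. This is a minimal sketch, assuming the DR line looks exactly like the examples shown (line code, three spaces, fields separated by semicolons, a terminating period); the class and function names are hypothetical and not part of any UniProt tool.

```python
from typing import NamedTuple, Optional

# Resource abbreviations listed in the format description above.
ENSEMBL_GENOMES = {
    "EnsemblBacteria", "EnsemblFungi", "EnsemblMetazoa",
    "EnsemblPlants", "EnsemblProtists",
}


class EnsemblGenomesXref(NamedTuple):
    resource: str        # resource abbreviation
    transcript_id: str   # resource identifier
    protein_id: str      # optional information 1
    gene_id: str         # optional information 2


def parse_ensembl_genomes_dr(line: str) -> Optional[EnsemblGenomesXref]:
    """Return the parsed cross-reference, or None for any other line."""
    if not line.startswith("DR   "):
        return None
    body = line[5:].strip()
    # Remove only the single terminating period; identifiers may contain dots.
    if body.endswith("."):
        body = body[:-1]
    fields = [field.strip() for field in body.split(";")]
    if len(fields) != 4 or fields[0] not in ENSEMBL_GENOMES:
        return None
    return EnsemblGenomesXref(*fields)


xref = parse_ensembl_genomes_dr(
    "DR   EnsemblMetazoa; FBtr0071602; FBpp0071528; FBgn0020306."
)
print(xref)
```

Only the final period is removed, because identifiers such as AT1G66340.1-TAIR contain periods themselves.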

Show all the entries having a cross-reference to EnsemblBacteria, EnsemblFungi, EnsemblMetazoa, EnsemblPlants or EnsemblProtists.

Changes concerning keywords

New keywords:

UniProt release 2010_06

Published May 18, 2010

Headlines

UniProt and Ensembl

The Ensembl project was launched in 2000 as a joint project between the EBI and the Wellcome Trust Sanger Institute, some years before the draft human genome was completed. Even at that early stage, it was clear that manual annotation of 3 billion base pairs of sequence would not be able to offer researchers timely access to the latest data. The goal of Ensembl was therefore to automatically annotate the genome, integrate this annotation with other available biological data and make all this publicly available. Since the launch, many more genomes have been added and the range of available data has expanded to include comparative genomics, variation and regulatory data. A collaboration between UniProt and Ensembl was initiated in 2008 to contribute towards the goal of having the complete human proteome available in UniProtKB/Swiss-Prot. A pipeline was established to import those Ensembl sequences not yet in UniProtKB. It is updated with each Ensembl release and is coupled with a quality assurance feedback loop which ensures that the Ensembl predictions benefit from the manual review in UniProtKB. Since then, the scope of Ensembl has been extended to include manual annotation by the Human And Vertebrate Analysis aNd Annotation (Havana) group at the Sanger Institute, which further adds value to the predictions. Ensembl and UniProt are pleased to announce that this collaboration has now been extended to Mus musculus and Rattus norvegicus and will shortly be extended to Gallus gallus and Bos taurus. The provision of a complete set of protein sequences to users is a priority for the UniProt Consortium and this collaboration contributes significantly to this effort.

UniProtKB News

Changes concerning the controlled vocabulary for PTMs

New terms for the feature key ‘Cross-link’ (‘CROSSLNK’ in the flat file):
  • Glycyl serine ester (Gly-Ser) (interchain with S-...)
  • Glycyl threonine ester (Gly-Thr) (interchain with G-...)
New terms for the feature key ‘Modified residue’ (‘MOD_RES’ in the flat file):
  • 7’-hydroxytryptophan

UniProt release 2010_05

Published April 20, 2010

Headlines

Nonsense-mediated mRNA decay: To be or not to be… integrated in UniProtKB

It has been known for over 30 years that, in yeast, nonsense mutations reduce mRNA levels and that the strength of the reduction depends on the position of the nonsense codon within the locus. This observation, followed by many others in a great variety of eukaryotic organisms, led to the concept of ‘Nonsense-mediated mRNA decay’ (NMD), ‘a surveillance mechanism that detects and degrades mRNAs with premature termination codons (PTCs), thereby preventing the production of faulty proteins’. The key question was what Mother Nature considers a ‘premature’ stop. For mammals, a rule was established stating that ‘if a termination codon is more than about 50 nucleotides upstream of the final exon, it is a PTC and the mRNA that harbors it will be degraded’ (see Nagy and Maquat, 1998). Although we know today that NMD is a much more sophisticated mechanism than previously anticipated (see reviews), the ‘50 nucleotide rule’ is still used to predict potential NMD targets and, on this basis, some databases deleted them from their collections. Since many PTCs are generated by alternative splicing (at least one third of the human alternatively spliced mRNAs contain PTCs), several alternatively spliced isoforms have disappeared from databases, victims of the ‘50 nucleotide rule’.
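
As a rough illustration, the ‘50 nucleotide rule’ quoted above can be written as a single comparison. This is a minimal sketch, not the way any database applies the rule: the function name, the use of spliced-mRNA coordinates and the exact definition of the distance are assumptions made here.

```python
def is_predicted_nmd_target(stop_codon_end: int,
                            final_exon_start: int,
                            threshold: int = 50) -> bool:
    """Apply the '50 nucleotide rule' to one spliced mRNA.

    Both positions are 1-based coordinates on the spliced mRNA (an assumed
    convention): the last nucleotide of the termination codon and the
    first nucleotide of the final exon.
    """
    # Nucleotides lying strictly between the termination codon and the
    # final exon. Zero when the codon abuts the final exon, negative when
    # the codon lies inside it.
    distance = final_exon_start - stop_codon_end - 1
    return distance > threshold


# Termination codon far upstream of the final exon: flagged as premature.
print(is_predicted_nmd_target(stop_codon_end=300, final_exon_start=400))
# Termination codon inside the final exon: not flagged.
print(is_predicted_nmd_target(stop_codon_end=450, final_exon_start=400))
```

The sketch captures only the positional rule; it says nothing about how strongly a flagged transcript is actually degraded.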

Eukaryotic cells detect PTCs during the first round of translation undergone by mRNAs freshly exported from the nucleus. During this ‘pioneer’ round of translation, if the ribosome terminates at a termination codon (TC) in the vicinity of the poly(A) tail, PABPC1, a poly(A)-binding protein, sends a signal which promotes proper termination of translation. This results in efficient reinitiation of the ribosome at the 5’ end of the mRNA, and the production of a stable mRNP. If the ribosome terminates at a TC that is too far away from the poly(A) tail for it to receive the PABPC1-mediated translation-termination-promoting signal, the UPF1 protein binds to the stalled ribosome instead, thereby marking this TC as premature. Subsequently, a PTC-specific protein complex forms around UPF1, promoting UPF1 phosphorylation and committing the mRNA to rapid degradation.

It is thought that the physical distance, rather than the number of nucleotides, between a TC and the poly(A) tail is a crucial determinant in defining a TC as premature (Eberle et al., 2008). This distance depends on the 3D structure of the mRNA 3’ UTR. This structure can be modified by altering (1) intramolecular base pairing, (2) interaction of the mRNA with RNA-binding proteins and (3) interactions between the involved proteins through post-translational modifications (PTMs). In other words, it can be regulated in a tissue-specific manner, during development, and by environmental cues.

In higher eukaryotes, an additional level of complexity exists which links PTC detection and mRNA splicing. During pre-mRNA processing, the spliceosome removes intron sequences and a set of proteins called the exon-junction complex (EJC) is deposited 20-24 nucleotides upstream of the sites of intron removal. EJCs located within the ORF are removed from the mRNA by elongating ribosomes, and only EJCs located downstream of the TC will still be present when the first ribosome terminates. In organisms producing a large number of PTC-containing mRNAs by extensive alternative pre-mRNA splicing, such as humans, the EJC may have evolved to facilitate efficient recognition and degradation of these transcripts. An EJC downstream of a TC functions as an NMD enhancer by shortening the time window between UPF1 binding and its phosphorylation, hence promoting mRNA degradation.

NMD rarely downregulates the expression of a transcript completely. More commonly, 10-30% of the PTC-containing transcripts survive and may allow the production of physiologically relevant levels of protein products (Neu-Yilik et al., 2004). This is why in UniProtKB, we favour a conservative approach when dealing with protein isoforms predicted to be encoded by an NMD target mRNA. We do not delete them from the database, but rather tag them with the comment: ‘May be produced at very low levels due to a premature stop codon in the mRNA, leading to nonsense-mediated mRNA decay.’ For instance, in entry Q9HB09 (human Bcl-2-like protein 12), 2 isoforms are described, one of which has been predicted to be an NMD target by Hillman et al., 2004 (see also the ‘References’ section of the entry). In some cases, despite the presence of a PTC in the encoding mRNA, the isoform produced seems to be the predominant form, at least in some tissues (see human Gamma-aminobutyric acid type B receptor subunit 1 isoform 1E in entry Q9UBS5).

Currently in UniProtKB/Swiss-Prot, over 300 protein entries describe isoforms that could be produced at low levels due to NMD. 228 proteins from different species are directly involved in the NMD process itself and can be retrieved from UniProtKB with the keyword ‘Nonsense mediated mRNA decay’.

UniProtKB News

Cross-references to UCD-2DPAGE

Cross-references have been added to the University College Dublin 2-DE Proteome Database (UCD-2DPAGE). The database HSC-2DPAGE, previously hosted at Harefield Hospital (and previously also cross-referenced from UniProtKB/Swiss-Prot), has been integrated into UCD-2DPAGE. UCD-2DPAGE currently contains data from Canis familiaris (dog), Homo sapiens (human), Mus musculus (mouse), Rattus norvegicus (rat) and Saccharomyces cerevisiae (baker’s yeast).

UCD-2DPAGE is available at http://proteomics-portal.ucd.ie:8082/cgi-bin/2d/2d.cgi.

The format of the explicit links in the flat file is:

Resource abbreviation UCD-2DPAGE
Resource identifier UCD-2DPAGE accession number (in most cases the primary UniProtKB accession number)
Examples P02648:
DR   UCD-2DPAGE; P02648; -.
O75112:
DR   UCD-2DPAGE; O75112; -.
DR   UCD-2DPAGE; Q9Y4Z5; -.

Show all the entries having a cross-reference to UCD-2DPAGE.

Changes concerning cross-references to HSC-2DPAGE

Cross-references to HSC-2DPAGE have been removed.

Changes concerning keywords

New keyword:

Changes in subcellular location controlled vocabulary

New subcellular location:

Changes concerning the controlled vocabulary for PTMs

New terms for the feature key ‘Cross-link’ (‘CROSSLNK’ in the flat file):
  • Glycyl serine ester (interchain with G-Cter in ubiquitin)
  • Glycyl threonine ester (interchain with G-Cter in ubiquitin)
New term for the feature key ‘Modified residue’ (‘MOD_RES’ in the flat file):
  • 2-(S-cysteinyl)pyruvic acid O-phosphothioketal

UniProt release 2010_04

Published March 23, 2010

Headlines

UniProtKB wonder web

UniProtKB/Swiss-Prot was the first biomolecular database to include cross-references in its entries, long before the advent of the internet, and a high level of integration with other databases is a hallmark of the resource. UniProtKB is indeed a general interest database, and the cross-references it includes provide users with easy access to relevant additional information from more specialized resources.

The number of cross-references keeps growing. Over the past year, 21 new databases have been added and 6 out of the 8 phylogenomic databases cross-referenced in UniProtKB have been added during the last 10 months. Today 126 databases are explicitly cross-referenced in the knowledgebase. Most links are stored in the ‘Cross-references’ section.

As of this release, the total number of cross-references in UniProtKB/Swiss-Prot has passed 13 million and the average number per entry is over 25. In TrEMBL, the unreviewed section of UniProtKB, the average number of cross-references per entry is roughly half as high (over 11). For both sections, the most represented databases reflect our information sources and annotation strategies. They are:
  1. EMBL-Bank (on average 1.7 cross-references per entry): the vast majority of UniProtKB sequences come from translated CDS submitted to EMBL-Bank/GenBank/DDBJ. It is therefore not surprising that more than 98% of UniProtKB/Swiss-Prot entries contain a cross-reference to the original nucleotide submission(s). For extensively studied organisms, such as human, the average number of EMBL-Bank cross-references may exceed 7.
  2. InterPro (on average 3.1 cross-references per entry): this integrated database classifies proteins at superfamily, family and subfamily levels, predicting the occurrence of functional domains, repeats and important sites. In UniProtKB, we have always paid special attention to domain and family annotation. InterPro predictions are automatically integrated into TrEMBL entries and domain/family annotation is later manually reviewed and completed before integration into Swiss-Prot.
  3. Gene Ontology (GO): UniProtKB annotators manually assign GO terms to all entries they curate and high-quality manually assigned GO terms from other GO Consortium groups are imported to ensure that a comprehensive collection of GO annotations is available through UniProtKB. In addition, UniProtKB incorporates GO terms generated from a range of electronic mapping methods. As a result, the number of GO cross-references per entry is expected to further grow significantly in the near future.

In addition to the “regular” ‘Cross-references’ section, the ‘Web resources’ section offers links to specific web pages or databases whose scope is too specialized to warrant the creation of specific cross-references. For instance, the IARC TP53 mutation database, a repository of somatic and germline TP53 mutations in human cancers, is only available from the human p53 entry. Currently more than 6’500 entries contain ‘Web resources’ sections, which represent some 8’500 additional links. Note that links to relevant databases pepper all sections of Swiss-Prot entries: cross-references to ENZYME are available from the EC numbers provided in the ‘Protein names’ subsection, links to PubMed from the ‘References’ section, etc.

In conclusion, for a complete overview on a given protein, users should use different resources, each of them shedding complementary light on the field. The coexistence of various databases does not imply competition between them, but rather collaboration, to better serve the life science community. UniProtKB may be used to get a manually reviewed summary of the current knowledge and to direct users to more specialized databases, such as organism-oriented, phylogenomic or genome annotation databases, for more detailed information.

For detailed statistics on cross-references, see our release notes, section 5 (‘Statistics for some line types’).

UniProtKB News

Change of release numbers

In the past, we have distinguished major and minor releases of the UniProt knowledgebase and this was reflected in the release number format: major releases were numbered x.0, minor releases were x.1, x.2, etc. We have abandoned this distinction and changed the format to YYYY_XX where YYYY is the calendar year and XX a 2-digit number that is incremented for each release of a given year, e.g. 2010_01, 2010_02, etc. We will archive previous releases on our ftp site for at least 2 years.
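
For scripts that handle archived releases, both numbering schemes described above can be recognised with two regular expressions. This is a minimal sketch: the function name and the returned tuple layout are hypothetical choices, and the patterns are inferred from the format description (x.y and YYYY_XX) rather than from any official specification.

```python
import re
from typing import Tuple

# YYYY_XX: calendar year, underscore, 2-digit release counter.
_YEAR_BASED = re.compile(r"^(\d{4})_(\d{2})$")
# x.y: former major.minor numbering, e.g. 15.15.
_MAJOR_MINOR = re.compile(r"^(\d+)\.(\d+)$")


def parse_release(label: str) -> Tuple[str, int, int]:
    """Return (scheme, first_number, second_number) for a release label."""
    match = _YEAR_BASED.match(label)
    if match:
        return ("year_based", int(match.group(1)), int(match.group(2)))
    match = _MAJOR_MINOR.match(label)
    if match:
        return ("major_minor", int(match.group(1)), int(match.group(2)))
    raise ValueError(f"unrecognised release label: {label!r}")


print(parse_release("2010_04"))
print(parse_release("15.15"))
```

Because the new labels have a fixed width, sorting them as plain strings also orders them chronologically, which is not true of labels such as 15.9 and 15.10 in the former scheme.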

Change of release cycle

UniProt releases are now published every 4 weeks.

Cross-references to GenoList

Cross-references have been added to the GenoList Integrated Environment for the Analysis of Microbial Genomes. GenoList hosts numerous model organism databases for complete microbial genomes, including BuruList, ListiList, MypuList, PhotoList, SagaList and SubtiList which used to be cross-referenced from UniProtKB individually.
Relevant UniProtKB entries from the following organisms are therefore now linked to GenoList:

  • Mycobacterium ulcerans (strain Agy99) (formerly linked to BuruList)
  • Listeria monocytogenes and innocua (formerly linked to ListiList)
  • Mycoplasma pulmonis (formerly linked to MypuList)
  • Photorhabdus luminescens subsp. laumondii (formerly linked to PhotoList)
  • Streptococcus agalactiae serotype III (formerly linked to SagaList)
  • Bacillus subtilis (formerly linked to SubtiList)

GenoList is available at http://genodb.pasteur.fr/cgi-bin/WebObjects/GenoList.woa/

The format of the explicit links in the flat file is:

Resource abbreviation GenoList
Resource identifier Ordered locus name
Examples Q925X3:
DR   GenoList; LIN0124; -.
DR   GenoList; LIN2378; -.
DR   GenoList; LIN2564; -.
P37551:
DR   GenoList; BSU00470; -.

Show all the entries having a cross-reference to GenoList.

Changes concerning cross-references to BuruList, ListiList, MypuList, PhotoList, SagaList and SubtiList.

Cross-references to BuruList, ListiList, MypuList, PhotoList, SagaList and SubtiList have been removed.

Cross-references to ConoServer

Cross-references have been added to the Cone snail toxin database ConoServer. The ConoServer database is a manually curated database dedicated to conopeptides. ConoServer uses standardized names and a genetic and structural classification scheme to present data retrieved from UniProtKB, GenBank, the Protein Data Bank and the literature.

The ConoServer web site incorporates specialized features such as the graphic display of post-translational modifications, which are extensively present in conopeptides. ConoServer manages nucleotide sequences, protein sequences and 3D structures. The aim of this resource is to give a comprehensive overview of the diversity of conopeptides and their uses as drugs, drug leads and diagnostic tools.

ConoServer is available at http://www.conoserver.org/.

The format of the explicit links in the flat file is:

Resource abbreviation ConoServer
Resource identifier ConoServer identifier
Optional information 1 Toxin name
Examples P0C8R2:
DR   ConoServer; 2838; ArIA precursor.
DR   ConoServer; 3450; Sequence 299 from Patent EP1852440.
P0C1W3:
DR   ConoServer; 1574; RVIIIA.

Show all the entries having a cross-reference to ConoServer.

Cross-references to MINT

Cross-references have been added to the Molecular INTeraction database MINT, which focuses on experimentally verified protein-protein interactions mined from the scientific literature by expert curators.

MINT is available at http://mint.bio.uniroma2.it/mint/.

The format of the explicit links in the flat file is:

Resource abbreviation MINT
Resource identifier MINT interactor ID
Examples P00925:
DR   MINT; MINT-517950; -.
P0A887:
DR   MINT; MINT-1243319; -.

Show all the entries having a cross-reference to MINT.

Changes concerning keywords

New keywords:

Modified keyword:

Changes in subcellular location controlled vocabulary

New subcellular location:
  • Host basolateral cell membrane

Changes concerning the controlled vocabulary for PTMs

New term for the feature key ‘Modified residue’ (‘MOD_RES’ in the flat file):
  • S-(dipyrrolylmethanemethyl)cysteine

UniProt release 15.15

Published March 2, 2010

Headlines

Bacillus subtilis, a Gram-positive model bacterium fully annotated in UniProtKB/Swiss-Prot

We are all aware of the importance of model bacterial systems. Escherichia coli K12 is the paradigm for Gram-negative bacteria, but what of Gram-positive bacteria? There is a large variety of these bacteria: some serve us, some are neutral and some infect us, and model systems for them are in demand.

Bacillus subtilis, a rod-shaped, soil- and water-dwelling bacterium originally described as Vibrio subtilis in 1835 by Ehrenberg and renamed in 1872 by Cohn, has served this role for over a century. B.subtilis differentiates to produce endospores, can be made naturally competent for DNA uptake and is a bacteriophage host. In the wild it has been seen to produce over 2 dozen different antibiotics. These characteristics make it an obvious choice as a model system for bacterial differentiation and genetics, as well as a model for other, often more dangerous, bacteria such as Bacillus anthracis, Mycobacterium tuberculosis or Staphylococcus aureus. Additionally, it is used for the production of various industrially interesting enzymes such as amylases and proteases. A substrain, B.subtilis natto, is used to prepare natto, a traditional Japanese dish made from fermented soybeans. Although B.subtilis is not considered pathogenic for any known organism, it has been isolated from patients suffering from various illnesses such as endocarditis and pneumonia, and also occasionally from spoiled food, where it might be responsible for cases of food poisoning.

The genome of B.subtilis 168, a widely used laboratory strain, was sequenced by a large international consortium in 1997 - the 6th bacterium to be fully sequenced. The sequence was updated and reannotated in 2009 by the Institut Pasteur and the Génoscope. In coordination with them we have annotated the complete proteome, providing all 4'192 B.subtilis proteins in UniProtKB/Swiss-Prot, each of which has a cross-reference to the dedicated B.subtilis database SubtiList/GenoList as well as other databases. A list of all B.subtilis UniProtKB/Swiss-Prot entries is available in the bacsu.txt file. This of course provides a snapshot of the knowledge about this first fully manually annotated Gram-positive model organism and will date easily. Despite having been so intently studied for so long, there are many B.subtilis proteins about which we know very little. There will be work for years to come for the B.subtilis (and larger scientific) community as these proteins and their homologues are characterized.

All B.subtilis entries can be retrieved from UniProtKB/Swiss-Prot combining the organism name "Bacillus subtilis" (or the taxonomy identifier 1423) with the keyword 'Complete proteome' (organism:"Bacillus subtilis" AND keyword:"Complete proteome" or organism:1423 AND keyword:181).
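
The two query forms given above follow one pattern, which a short helper can make explicit. This is a minimal sketch that only assembles the query text as written in this release note; the function name is hypothetical, and the query syntax and keyword identifier are taken from the text above and may not match later versions of the website.

```python
from typing import Union


def complete_proteome_query(organism: Union[str, int]) -> str:
    """Build the query text for all 'Complete proteome' entries of an organism.

    A string is treated as an organism name, an integer as a taxonomy
    identifier, mirroring the two forms quoted in the release note.
    """
    if isinstance(organism, int):
        # Identifier form: taxonomy identifier plus keyword identifier 181.
        return f"organism:{organism} AND keyword:181"
    # Name form: both values are quoted because they contain spaces.
    return f'organism:"{organism}" AND keyword:"Complete proteome"'


print(complete_proteome_query("Bacillus subtilis"))
print(complete_proteome_query(1423))
```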

UniProtKB News

Cross-references to EuPathDB

Cross-references have been added to the Eukaryotic Pathogen Database Resources EuPathDB (formerly ApiDB), an integrated database covering the eukaryotic pathogens of the genera Cryptosporidium, Giardia, Leishmania, Neospora, Plasmodium, Toxoplasma, Trichomonas and Trypanosoma. While each of these groups is supported by a taxon-specific database built upon the same infrastructure, the EuPathDB portal offers an entry point to all these resources ("child databases": e.g. ToxoDB, PlasmoDB, CryptoDB...), and the opportunity to leverage orthology for searches across genera.

EuPathDB is available at http://www.eupathdb.org/.

The format of the explicit links in the flat file is:

Resource abbreviation EuPathDB
Resource identifier Combination of the child database name and the accession number in this database concatenated by a ":".
Examples
P84155:
DR   EuPathDB; TritrypDB:LmjF06.1270; -.

Q38FA5:
DR   EuPathDB; TritrypDB:Tb09.160.2970; -.
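
Since the resource identifier joins two values with a ":", a consumer usually needs to split it again. This is a minimal sketch assuming the child database name itself never contains a colon, which holds for the examples above but is not stated in the format description; the function name is hypothetical.

```python
from typing import Tuple


def split_eupathdb_id(identifier: str) -> Tuple[str, str]:
    """Split 'ChildDB:accession' into (child database, accession)."""
    # partition() cuts at the first colon only, so any colon inside the
    # accession itself would be preserved.
    child_db, separator, accession = identifier.partition(":")
    if not separator or not child_db or not accession:
        raise ValueError(f"not a EuPathDB identifier: {identifier!r}")
    return child_db, accession


print(split_eupathdb_id("TritrypDB:LmjF06.1270"))
```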

Show all the entries having a cross-reference to EuPathDB.

Cross-references to ProtClustDB

Cross-references have been added to Entrez Protein Clusters ProtClustDB, a collection of related protein sequences (clusters) which consists of Reference Sequence proteins encoded by complete genomes. This database contains both curated and non-curated clusters. The Protein Clusters database provides easy access to annotation information, publications, domains, structures, and external links and analysis tools including multiple alignments, phylogenetic trees, and genomic neighborhoods (ProtMap).

ProtClustDB is available at http://www.ncbi.nlm.nih.gov/sites/entrez?db=proteinclusters.

The format of the explicit links in the flat file is:

Resource abbreviation ProtClustDB
Resource identifier ProtClustDB accession number.
Examples
P99178:
DR   ProtClustDB; PRK05431; -.

P92693:
DR   ProtClustDB; MTH00098; -.

Show all the entries having a cross-reference to ProtClustDB.

Cross-references to SUPFAM

Cross-references have been added to SUPFAM (the Superfamily database), a database of structural and functional annotation for all proteins and genomes. The SUPFAM annotation is based on a collection of hidden Markov models, which represent structural protein domains at the SCOP superfamily level. A superfamily groups together domains which have an evolutionary relationship. The annotation is produced by scanning protein sequences from over 1,200 completely sequenced genomes against the hidden Markov models.

SUPFAM is available at http://supfam.org.

The format of the explicit links in the flat file is:

Resource abbreviation SUPFAM
Resource identifier SUPFAM superfamily identifier.
Optional information 1 SUPFAM superfamily domain name.
Optional information 2 Number of hits found.
Examples
P08519:
DR   SUPFAM; SSF57440; Kringle-like; 38.
DR   SUPFAM; SSF50494; Pept_Ser_Cys; 1.

P00967:
DR   SUPFAM; SSF56042; AIR_synth_C; 2.
DR   SUPFAM; SSF53328; formyl_transf; 1.
DR   SUPFAM; SSF52440; PreATP-grasp-like; 1.
DR   SUPFAM; SSF55326; PurM_N-like; 2.
DR   SUPFAM; SSF51246; Rudmnt_hyb_motif; 1.
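
The last field of each SUPFAM line is a count, so an entry's lines can be collected into a small table and summed. This is a minimal sketch assuming the lines look exactly like the examples above; the function name is hypothetical.

```python
from typing import Dict, Iterable, Tuple


def supfam_hits(lines: Iterable[str]) -> Dict[str, Tuple[str, int]]:
    """Map each SUPFAM superfamily identifier to (domain name, hit count)."""
    hits: Dict[str, Tuple[str, int]] = {}
    for line in lines:
        if not line.startswith("DR   SUPFAM;"):
            continue  # ignore every other line type and resource
        body = line[5:].strip()
        if body.endswith("."):
            body = body[:-1]  # drop the terminating period
        _resource, superfamily_id, domain_name, count = (
            field.strip() for field in body.split(";")
        )
        hits[superfamily_id] = (domain_name, int(count))
    return hits


entry_lines = [
    "DR   SUPFAM; SSF56042; AIR_synth_C; 2.",
    "DR   SUPFAM; SSF53328; formyl_transf; 1.",
    "DR   SUPFAM; SSF52440; PreATP-grasp-like; 1.",
    "DR   SUPFAM; SSF55326; PurM_N-like; 2.",
    "DR   SUPFAM; SSF51246; Rudmnt_hyb_motif; 1.",
]
table = supfam_hits(entry_lines)
total_hits = sum(count for _name, count in table.values())
print(total_hits)
```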

Show all the entries having a cross-reference to SUPFAM.

Format change in the cross-references to HOVERGEN

The format of the cross-references to the HOVERGEN project has changed: The resource identifier, which was a UniProtKB accession number, has been replaced by a HOVERGEN identifier.

Example:

Previous format:

DR   HOVERGEN; P32754; -.

New format:

DR   HOVERGEN; HBG005987; -.
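
A script that reads files from before and after this change may need to tell the two identifier styles apart. This is a minimal sketch: the function name is hypothetical, and the HBG-plus-digits pattern is only inferred from the example above. The accession pattern is the regular expression UniProt publishes for its accession numbers.

```python
import re

# UniProtKB accession numbers, following the pattern documented by UniProt.
_ACCESSION = re.compile(
    r"^(?:[OPQ][0-9][A-Z0-9]{3}[0-9]"
    r"|[A-NR-Z][0-9](?:[A-Z][A-Z0-9]{2}[0-9]){1,2})$"
)
# Family identifier shape assumed from the example: 'HBG' followed by digits.
_FAMILY_ID = re.compile(r"^HBG\d+$")


def classify_identifier(identifier: str) -> str:
    """Return 'new', 'previous' or 'unknown' for a resource identifier."""
    if _FAMILY_ID.match(identifier):
        return "new"        # family identifier, as in the new format
    if _ACCESSION.match(identifier):
        return "previous"   # UniProtKB accession, as in the previous format
    return "unknown"


print(classify_identifier("P32754"))
print(classify_identifier("HBG005987"))
```

The HOGENOM example elsewhere on this page has the same shape, so the same check would apply to it under the same assumption.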

Show all the entries having a cross-reference to HOVERGEN.

Changes concerning keywords

New keywords:

Changes concerning the controlled vocabulary for PTMs

Modified term for the feature key 'Cross-link' ('CROSSLNK' in the flat file):

New terms:

  • Alanine isoaspartyl cyclopeptide (Ala-Asn)
  • Glycyl cysteine dithioester (Cys-Gly) (interchain with G-...)
  • Trithiocysteine (Cys-Cys)

Modified terms for the feature key 'Lipidation' ('LIPID' in the flat file):

New terms:

  • N-[(12R)-12-hydroxymyristoyl]cysteine
  • N-(12-oxomyristoyl)cysteine

Modified terms for the feature key 'Modified residue' ('MOD_RES' in the flat file):

New terms:

  • S-(4-hydroxycinnamyl)cysteine
  • S-cysteinyl cysteine
  • Tele-(1,2,3-trihydroxypropan-2-yl)histidine

UniProt release 15.14

Published February 9, 2010

Headlines

Bornavirus: another viral stowaway in the human genome

Analysis of the human genome sequence has revealed that our 'book of life' is multi-authored. About 0.5% of human genes are derived from bacteria and 8% of our total genetic material results from viral infections (see also release 2.1 headline). These genomic viral "fossils" are ancient retroviruses, which are known to insert their genetic information into host chromosomal DNA. They do so by producing a DNA copy from their RNA genome by use of a viral enzyme, called reverse transcriptase. The viral DNA then integrates into the host genome, becoming a permanent part of the cell.

A recent Japanese study has unveiled another viral stowaway in the human gene pool. Several copies of the bornavirus N gene turn out to be part of the human genome and of other mammalian genomes, including chimpanzees, gorillas and African elephants. These genes are remnants of a bornavirus which presumably infected proto-hominids, and other species, some forty million years ago. This ancient virus has disappeared and nowadays bornaviruses are known to infect mainly horses, inducing neurological diseases.

This discovery came as a surprise since the bornaviral RNA genome is not known to be retrocopied into DNA at any stage of the viral replication cycle and never integrates into the host genome. This unusual integration into our ancestors' genome may have helped them survive a pathogenic virus or may have played a role in primate evolution. As often in evolutionary biology, there are many more questions than answers, but this serves as a useful reminder that human evolution does not rely only on our own intrinsic potential, but also on a tight interaction with other living species in our environment.

A bornavirus-derived gene is actually expressed in human cells. It is called 'Endogenous Borna-like N element' (EBLN-1) and can be retrieved from UniProtKB/Swiss-Prot using the accession number Q6P2I7.

UniProtKB News

Changes concerning keywords

New keyword:

Changes concerning the controlled vocabulary for PTMs

Modified terms for the feature key 'Modified residue' ('MOD_RES' in the flat file):

New terms:

  • Diiodotyrosine
  • Glycyl adenylate
  • Iodotyrosine
  • Threonine methyl ester

Modified term for the feature key 'Cross-link' ('CROSSLNK' in the flat file):

New term:

  • Glycyl cysteine dithioester (Gly-Cys) (interchain with C-...)

UniProt release 15.13

Published January 19, 2010

Headlines

XMRV complete proteome in UniProtKB/Swiss-Prot

Despite the 118 human pathogenic viruses identified so far, our knowledge of these pathogens is still incomplete. Several human pathologies are suspected to be induced by unknown viruses. In this context, a new virus was isolated from human prostate in 2006 and was named 'Xenotropic Moloney murine leukemia virus-Related Virus' (XMRV). This retrovirus is the first representative of the gammaretrovirus genus to be isolated in humans. These retroviruses are known to induce various cancers in their host and a causal link with prostate cancer was suspected. This link was experimentally established but later refuted and thus remains a matter of debate. The same virus has been recently associated with chronic fatigue syndrome (CFS): XMRV has been isolated in 4% of healthy subjects, and in 67% of CFS patients. Large scale epidemiological studies must be performed to establish with certainty whether these correlations are relevant.

Where did XMRV come from? Retroviruses identified in patients with CFS or prostate cancer are highly related (more than 90% DNA sequence identity) to a group of mouse viruses called xenotropic murine leukemia virus (MLV). Xenotropic MLVs are endogenous retroviruses, i.e. the viral DNA is stably integrated in the mouse genome. Mice produce low levels of the virus - a few infectious particles per ml of blood - but the virus cannot reinfect mouse tissues. Instead it spreads to other species, such as humans, which is the reason for the term 'xenotropic', meaning the virus can grow in species other than the species of origin. Therefore it makes sense to hypothesize that XMRV is a xenotropic MLV that crossed from mice to humans.

The mode of transmission of XMRV is largely unknown. It could be via transfusion, intravenous drug use, or by other blood-borne routes, but other modes of transmission (respiratory, sexual, etc.) cannot be excluded.

It will take time to answer the numerous questions raised by the discovery of XMRV. In terms of treatment, the good news is that some of the anti-retroviral drugs used for treating AIDS can immediately be tested for their efficacy against CFS. Indeed, susceptibility of XMRV to AZT has recently been demonstrated.

The complete proteome of XMRV has been annotated along with that of the well-studied MLV which is 65% (env) to 85% (gag-pol) identical and has served as a model for XMRV functional annotation.

UniProtKB News

Cross-references to eggNOG

Cross-references have been added to eggNOG (evolutionary genealogy of genes: Non-supervised Orthologous Groups), a database of orthologous groups of genes. The orthologous groups are annotated with functional description lines (derived by identifying a common denominator for the genes based on their various annotations) and with functional categories (derived from the original COG/KOG categories).

eggNOG is available at http://eggnog.embl.de/.

The format of the explicit links in the flat file is:

Resource abbreviation eggNOG
Resource identifier eggNOG cluster identifier.
Example
P33887:
DR   eggNOG; maNOG10115; -.

Show all the entries having a cross-reference to eggNOG.

Format change in the cross-references to HAMAP

The format of the cross-references to the HAMAP database has changed in order to align it with the format of other InterPro member databases.

Previous format:

Resource abbreviation HAMAP
Resource identifier HAMAP unique identifier for a protein family.
Optional information 1 Nature of hits found. The values are either 'fused', 'atypical', 'atypical/fused' or '-': 'fused' indicates that the family signature does not cover the entire protein; 'atypical' means that the protein is divergent in sequence or has mutated functional sites and should not be included in family datasets; 'atypical/fused' is a combination of the previous two cases; '-' is a placeholder for an empty field.
Optional information 2 Number of hits found, which is generally 1, rarely 2 for the fusion of identical domains/proteins.
Examples
P12743:
DR   HAMAP; MF_00326; -; 1.

Q9K3D6:
DR   HAMAP; MF_00006; fused; 1.
DR   HAMAP; MF_01105; atypical/fused; 1.

New format:

Resource abbreviation HAMAP
Resource identifier HAMAP unique identifier for a protein family signature.
Optional information 1 HAMAP entry name for a protein family.
Optional information 2 Number of hits found, which is generally 1, rarely 2 for the fusion of identical domains/proteins.
Optional information 3 Nature of hits found. The values are either 'fused', 'atypical', 'atypical/fused' or '-': 'fused' indicates that the family signature does not cover the entire protein; 'atypical' means that the protein is divergent in sequence or has mutated functional sites and should not be included in family datasets; 'atypical/fused' is a combination of the previous two cases; '-' is a placeholder for an empty field.
Examples
DR   HAMAP; MF_00326; Ribosomal_L7Ae; 1; -.

DR   HAMAP; MF_00006; Arg_succ_lyase; 1; fused.
DR   HAMAP; MF_01105; N-acetyl_glu_synth; 1; atypical/fused.
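
Code that reads HAMAP cross-references may need to accept both layouts during a transition. The sketch below is one possible way to do so and is not part of any UniProt tool: the function name `parse_hamap_dr` and the returned dictionary keys are hypothetical, and the previous and new formats are told apart only by their number of fields.

```python
HIT_NATURES = {"fused", "atypical", "atypical/fused", "-"}


def parse_hamap_dr(line):
    """Parse a HAMAP DR line in either the previous or the new format."""
    if not line.startswith("DR"):
        raise ValueError("not a DR line: %r" % line)
    body = line[2:].strip().rstrip(".")
    fields = [field.strip() for field in body.split(";")]
    if fields[0] != "HAMAP":
        raise ValueError("not a HAMAP cross-reference: %r" % line)
    if len(fields) == 4:
        # Previous format: identifier; nature of hits; number of hits.
        identifier, nature, count = fields[1:]
        name = None
    elif len(fields) == 5:
        # New format: identifier; entry name; number of hits; nature of hits.
        identifier, name, count, nature = fields[1:]
    else:
        raise ValueError("unexpected number of fields: %r" % line)
    if nature not in HIT_NATURES:
        raise ValueError("unknown nature of hits: %r" % nature)
    return {
        "id": identifier,
        "name": name,
        "hits": int(count),
        "nature": None if nature == "-" else nature,  # '-' is an empty field
    }


print(parse_hamap_dr("DR   HAMAP; MF_00006; fused; 1."))
print(parse_hamap_dr("DR   HAMAP; MF_00006; Arg_succ_lyase; 1; fused."))
```

Both calls describe the same cross-reference; only the new format carries the HAMAP entry name.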

Show all the entries having a cross-reference to HAMAP.

Format change in the cross-references to HOGENOM

The format of the cross-references to the HOGENOM project has changed: the resource identifier, which was a UniProtKB accession number, has been replaced by a HOGENOM identifier.

Example:

Previous format:

DR   HOGENOM; P0A9I1; -.

New format:

DR   HOGENOM; HBG676713; -.

Show all the entries having a cross-reference to HOGENOM.

Changes concerning keywords

New keywords:

Changes in controlled vocabulary for subcellular locations

New subcellular locations:

  • Barrier septum
  • Cell septum
  • Cell tip
  • Photoreceptor inner segment
  • Photoreceptor outer segment

Changes concerning the controlled vocabulary for PTMs

Modified terms for the feature key 'Modified residue' ('MOD_RES' in the flat file):

New terms:

  • 5-glutamyl 2-aminoadipic acid
  • 5-glutamyl N2-lysine

UniProt release 15.12

Published December 15, 2009

Headlines

Through the Looking-Glass

All amino acids but glycine can exist in either of two optical isomers, called L- or D-amino acids, which are mirror images of each other. However, we have been taught for decades that proteins that occur in nature are made out of L-forms. There are some well-known exceptions, of course, but restricted to prokaryotes. Indeed, D-forms are abundant components of the peptidoglycan cell walls of bacteria, and are also observed in bacterial natural antibiotics, such as actinomycin D, bacitracin or tetracycline. These latter are quite unusual peptides that are synthesized by multienzyme complexes in a stepwise fashion without the participation of mRNA. It has also been observed that the mammalian brain contains high levels of free D-serine, which appears to be a physiological coagonist of N-methyl D-aspartate receptors (NMDARs) and, as such, may act as a neurotransmitter in the brain, but this activity is carried out by the amino acid itself and does not occur within the context of a polypeptide. The isolation, in the 1980s, of naturally occurring animal peptides containing D-amino acids challenged the dogma, leading to the discovery of a new post-translational modification (PTM): L- to D-isomerization.

In 1981, Montecucchi et al., looking for enkephalin-related peptides in various amphibia, isolated dermorphin from the skin of Phyllomedusa sauvagei. Dermorphin is produced by 2 different precursors: cleavage of Dermorphin-1 gives rise to 4 mature dermorphins and that of Dermorphin-2 to 5 mature peptides, all of which have the identical sequence: YAFGYPS. This heptapeptide binds with high affinity and selectivity to mu-type opioid receptors and appears to be a thousand times more potent than morphine in inducing deep long-lasting analgesia when injected into mice or rats. Interestingly, the second amino acid of dermorphin is D-alanine. A synthetic isomer, containing L-alanine at that position, is virtually devoid of biological activity.

This discovery was followed by many others. Deltorphins, another class of frog opioid peptides, also characterized by a D-amino acid at position 2, were isolated. Another amphibian, Bombina variegata, was shown to express antimicrobial D-amino acid-containing peptides, called bombinins, on its skin. Arthropoda, such as spiders, lobsters and crayfish, and Mollusca entered the game. Cone snail peptide toxins have been extensively studied in this context and they currently represent 60% of all animal D-amino acid-containing proteins annotated in UniProtKB/Swiss-Prot. A single mammal appears on the list: platypus with 2 peptides, C-type natriuretic peptide 39 and Defensin-like peptide 2/4, expressed in its venom gland.

Animal D-amino acid-containing proteins are synthesized on ribosomes following a classical mRNA template; unusual codons have not been observed. In addition, some of them have been isolated from their biological source with both L- and D-amino acid at the appropriate position. These observations suggested that L- to D-amino acid isomerization is a bona fide PTM. An enzyme catalyzing the conversion of an Omega-agatoxin-Aa4b serine (at position 46 of the mature peptide, 81 in the precursor) from L- to D-form has been isolated from the funnel-web spider Agelenopsis aperta and its partial sequence is available in UniProtKB/TrEMBL. A similar mammalian activity has been characterized from platypus venom.

L- to D-amino acid isomerization presents significant advantages. The modified peptides become more resistant to protease degradation and hence much more stable. In addition, X-ray crystallography studies have shown that the isomerization creates new structures, such as peculiar beta-turns. The creation of these new structural elements seems crucial for interaction with specific partners, opiate receptors for instance, and may act as a switch that turns on protein activity.

L- to D-amino acid isomerization could be more frequent than initially thought. It cannot be predicted by software tools and is not detectable by any of the standard techniques used in proteomics. It was only discovered when a synthetic peptide with the same sequence of L-amino acids appeared to be biologically inactive. We could be facing a novel strategy of multicellular organisms to circumvent stereochemical limitations imposed by the genetic code in an effort to increase molecular diversity.

In UniProtKB, all D-amino acid-containing proteins can be retrieved using the keyword 'D-amino acid'. To restrict the search to animal proteins, add 'Metazoa' to the taxonomy field.

UniProtKB News

Cross-references to ArachnoServer

Cross-references have been added to ArachnoServer, a spider toxin database. ArachnoServer is a manually curated database containing information on the sequence, three-dimensional structure, and biological activity of protein toxins derived from spider venom.

ArachnoServer is available at http://www.arachnoserver.org/.

The format of the explicit links in the flat file is:

Resource abbreviation ArachnoServer
Resource identifier ArachnoServer unique identifier.
Optional information 1 Toxin name.
Examples
P61232:
DR   ArachnoServer; AS000384; beta-hexatoxin-Mg1a.
DR   ArachnoServer; AS000417; beta-hexatoxin-Mr1a.

Q7M485:
DR   ArachnoServer; AS000160; Sphingomyelinase D (LrSicTox1) (N-terminal fragment).

Show all the entries having a cross-reference to ArachnoServer.

Cross-references to InParanoid

Cross-references have been added to InParanoid, a database of eukaryotic ortholog groups. The InParanoid database is a collection of pairwise comparisons between complete proteomes, currently 35 in number. The InParanoid program uses the pairwise similarity scores between two complete proteomes, calculated using NCBI-Blast, to construct orthology groups.

InParanoid is available at http://inparanoid.sbc.su.se/.

The format of the explicit links in the flat file is:

Resource abbreviation InParanoid
Resource identifier UniProtKB accession number.
Example
P10038:
DR   InParanoid; P10038; -.

Show all the entries having a cross-reference to InParanoid.

Changes concerning keywords

New keyword:

Changes in subcellular location controlled vocabulary

New subcellular location:

  • Host multivesicular body

Changes concerning the controlled vocabulary for PTMs

New terms for the feature key 'Cross-link' ('CROSSLNK' in the flat file):

  • Glutamyl lysine isopeptide (Gln-Lys) (interchain with K-...)
  • Glutamyl lysine isopeptide (Lys-Gln) (interchain with Q-...)

Modified terms for the feature key 'Modified residue' ('MOD_RES' in the flat file):

  • Glutamyl 5-glycerylphosphorylethanolamine -> 5-glutamyl glycerylphosphorylethanolamine

UniProt release 15.11

Published November 24, 2009

Headlines

Why do we keep dubious sequences in UniProtKB? How to discard them from a protein set?

More than 99% of the protein sequences provided by UniProtKB come from the translations of coding sequences (CDS) submitted to the EMBL-Bank/GenBank/DDBJ nucleotide sequence resources. These CDS are either generated by the application of gene prediction programs to genomic DNA sequences or via the hypothetical translation of cloned cDNAs (see FAQ). These methods themselves provide varying degrees of support for the existence of a protein, which may be further supplemented in some cases by other types of evidence (such as mass spectrometry data or evidence from direct protein sequencing).

In July 2007, a new topic was introduced into UniProtKB to indicate the evidence for the existence of a given protein, called 'Protein existence' (PE). 5 levels of evidence have been defined: 1. evidence at protein level (e.g. clear identification by mass spectrometry), 2. evidence at transcript level (e.g. the existence of a putative coding cDNA), 3. inferred by homology (a predicted protein which has been assigned membership of a defined protein family in UniProtKB), 4. predicted (a predicted protein which has not yet been assigned membership of a defined protein family in UniProtKB) and 5. uncertain (e.g. dubious sequences, such as those derived from the erroneous translation of a pseudogene or non-coding RNA). Currently in UniProtKB/Swiss-Prot, the vast majority (71%) of the entries are found in the PE3 category. PE1 and PE2 each represent approximately 13% of the total number of entries, PE4 3% and PE5 only 0.3%.

Entries that are attributed an existence level of 5 (PE5) are also tagged with the term "Putative" in the 'Protein names' section (see for example the "Putative annexin A2-like protein") and, in the 'General annotation (Comments)' section, with a 'Caution' subsection warning the user of a possible problem. The caution subsections accompanying a PE5 entry usually are of the type: "Could be the product of a pseudogene", "Product of a dubious CDS prediction" or "Product of a dubious gene prediction".

The PE section is included in the UniProtKB search engine. It is thus possible to retrieve all entries corresponding to a defined PE level - and thereby exclude all PE5 proteins. For human proteins this can be achieved by searching for: (organism:"Homo sapiens (Human) [9606]" AND reviewed:yes) NOT existence:uncertain. This search allows the retrieval of 19'835 entries, indicating that "uncertain" proteins represent 2.4% of the total human entries. Currently PE5 entries represent only 0.3% of all UniProtKB/Swiss-Prot. The higher proportion of sequences identified as uncertain or dubious in Homo sapiens may be a product of the continuous manual curation and review of these sequences by groups of the CCDS consortium, such as HAVANA, as well as UniProt curators.
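
The exclusion query quoted above can also be assembled programmatically for other organisms. The following is a minimal sketch that only builds the query string; the function name `exclude_uncertain_query` is hypothetical, and the field names (`organism`, `reviewed`, `existence`) are taken from the example in the text, so they may not match the syntax of later versions of the UniProt website.

```python
def exclude_uncertain_query(organism, reviewed_only=True):
    """Build a search query that excludes PE5 ('uncertain') entries.

    The syntax mirrors the example given in the text; it is an assumption
    that the same field names apply to other organisms.
    """
    clauses = ['organism:"%s"' % organism]
    if reviewed_only:
        clauses.append("reviewed:yes")
    return "(%s) NOT existence:uncertain" % " AND ".join(clauses)


print(exclude_uncertain_query("Homo sapiens (Human) [9606]"))
# (organism:"Homo sapiens (Human) [9606]" AND reviewed:yes) NOT existence:uncertain
```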

One may ask the question: why not delete PE5 sequences from UniProtKB and provide only the most reliable sequences? As stated above, UniProt is continuously reviewing all protein sequences. This process can result in both the removal of some PE5 entries (in which evidence of pseudogenization is overwhelming for instance) as well as the upgrade of other PE5 entries (such as the putative E.coli pseudogene ymiA which has now been found to produce a protein product and which has now acquired a PE of 1 or the human mitochondrial ATP synthase subunit epsilon-like protein). However, many putative pseudogene sequences may be expected to remain in UniProtKB for some time as it can be difficult to prove the non-existence of a protein, and for certain loci some doubts may always persist. To give our users the opportunity to work on the most complete protein set, we have chosen to keep all PE5 sequences with the appropriate 'Caution' comments, leaving to the users the final decision whether to retrieve them or not (using the exclusion mechanism described above). Note that the sequences which are removed from UniProtKB can subsequently be retrieved from the UniParc archive if so desired.

Finally, please remember that the PE assignment is made at the level of the UniProtKB entry and not at the level of individual isoform sequences; hence, dubious alternative isoform sequences cannot be excluded from a protein set by the UniProtKB search engine. However, comments about the evidence supporting the existence of any given isoform can be found in the 'Note:' for that isoform in the 'Alternative products' section (which lists all protein isoforms for each entry). For instance, isoforms that have been identified only once through large scale sequencing are tagged with the comment "No experimental confirmation available". Note that UniProt may include isoforms that contain retained introns (as these may be physiologically relevant) as well as isoforms that contain a premature stop codon and thus could be the target for nonsense-mediated mRNA decay (NMD). The mechanism of NMD involves a first round of translation before the premature stop codon is detected (often referred to as "pioneer translation"), and so at least one protein is synthesized from each NMD target mRNA. In addition, some of the predicted NMD targets appear to be the most abundant isoforms in certain tissues (see for instance the human GABA-B receptor 1 isoform 1E).

For additional information, see the document describing the criteria used to assign the PE level of entries and the UniProtKB user manual.

UniProtKB News

Cross-references to OrthoDB

Cross-references have been added to OrthoDB, a database of orthologous groups. OrthoDB presents a catalog of eukaryotic orthologous protein-coding genes. Orthology refers to the last common ancestor of the species under consideration, and thus OrthoDB explicitly delineates orthologs at each radiation along the species phylogeny.

OrthoDB is available at http://cegg.unige.ch/orthodb.

The format of the explicit links in the flat file is:

Resource abbreviation OrthoDB
Resource identifier OrthoDB cluster number.
Example
P00915:
DR   OrthoDB; EOG90KBJT; -.

Show all the entries having a cross-reference to OrthoDB.

Cross-references to PhylomeDB

Cross-references have been added to PhylomeDB, a database for complete collections of gene phylogenies. PhylomeDB allows users to interactively explore the evolutionary history of genes through the visualization of phylogenetic trees and multiple sequence alignments.

PhylomeDB is available at http://phylomedb.org/.

The format of the explicit links in the flat file is:

Resource abbreviation PhylomeDB
Resource identifier UniProtKB accession number.
Example
Q8GTR4:
DR   PhylomeDB; Q8GTR4; -.

Show all the entries having a cross-reference to PhylomeDB.

Changes concerning keywords

New keywords:

Deleted keywords:

  • Phorbol-ester binding
  • Plant toxin

Changes in subcellular location controlled vocabulary

New subcellular locations:

  • Host synapse
  • Target cell membrane
  • Target membrane

UniProt release 15.10

Published November 3, 2009

Headlines

What are UniProt 'Complete proteomes'? How to retrieve them?

The need for users to access and download complete proteomes is unquestionable and the role of a database like UniProtKB is to meet this demand. The issue looks quite simple: there are more and more fully sequenced genomes. These genomes should contain at least minimal annotation, such as gene predictions, and translation of the predicted coding regions (CDSs) should provide a global perspective of the likely proteome of a given organism. The situation is actually more complex. The development of new sequencing techniques is generating a flood of data, which are often left as they have been produced. Databases have to deal with this ever-growing amount of data. The aim of this headline is to provide you with some tips on how we currently approach the problem, keeping in mind that the situation is rapidly evolving.

In order to give our users access to the proteomes of organisms whose genome has been fully sequenced, we have created the 'Complete proteomes' pages. Currently, the proteomes of 1'428 organisms are available from these pages: 60% are bacteria, 30% viruses, 5.5% eukaryota and 4.5% archaea. Note that the term 'organism' is used in a broad sense and also includes strains or subspecies. Indeed, each completely sequenced strain is assigned a separate taxonomic identifier and is processed like an independent organism. A striking example of this approach is provided by Escherichia coli, for which no fewer than 24 strain-specific proteomes can be downloaded separately.

A minority of the UniProt proteomes have been entirely manually reviewed and are found in UniProtKB/Swiss-Prot. These include 8 microbial (Methanocaldococcus jannaschii, 3 subspecies of Buchnera aphidicola, Escherichia coli (strain K12), Haemophilus influenzae, Mycoplasma genitalium and Mycoplasma pneumoniae) and 3 eukaryotic species (Saccharomyces cerevisiae, Schizosaccharomyces pombe, and last, but not least Homo sapiens). The current proteomes are as stable as new discoveries allow. New proteins may be identified and will have to be annotated.

However, most proteomes comprise 2 components, i.e. a manually reviewed protein set (Swiss-Prot) and an automatically annotated one (TrEMBL), and both are automatically combined to generate a non-redundant proteome. The proportion of Swiss-Prot versus TrEMBL entries is variable and depends upon the organism. For instance, 93% of the Bacillus subtilis proteome has been manually reviewed, while the reverse is true for Bacillus cereus for which 93% of the proteome is only automatically annotated and found in the TrEMBL section of UniProtKB. Note that the B.subtilis proteome will be fully in the Swiss-Prot section by the end of the year.

A third category of proteomes exists for organisms whose genomes have submission/annotation problems that prevent the production of a non-redundant protein set or have problems regarding the gene model predictions. These proteomes can be downloaded from Integr8 using the direct link provided on the 'Complete proteomes' pages. This concerns 38 organisms, including some important model organisms, such as Danio rerio (Zebrafish) and Chlamydomonas reinhardtii.

To be included in the 'Complete proteomes' pages, an organism must have a completely sequenced genome, i.e. fully closed and exhibiting either good gene prediction models or good quality transcriptome/proteome data. That is why for bacterial and archaeal genomes, whole-genome shotguns (WGS) and draft sequences are not included. However, we have to adapt to data availability, thus for fungi, WGS sequences are taken into consideration, as they often are the only available ones.

Another requirement is that all proteins in the set are mapped to the genome. The notable exception is the human proteome, which is as yet only partially mapped. It should be noted, however, that all human protein entries have been manually reviewed, thus ensuring they meet the UniProtKB/Swiss-Prot quality standards, and are continuously updated, allowing us to progressively increase the mapping to the genome (and to add many other interesting annotations).

All complete proteomes are available from the UniProt taxonomy resource. A direct link is provided from the UniProt homepage. In addition to providing the taxonomic information about a given species, these pages offer several options, such as the retrieval of all UniProtKB entries for a taxon (a set that may contain redundant entries) or the retrieval of the non-redundant complete proteome (see for example the Dictyostelium discoideum (Slime mold) page), including the proteomes provided by the Integr8 resource. For the 1'390 complete proteomes entirely stored in UniProtKB, all entries have been tagged with the keyword 'Complete proteome' allowing their easy retrieval directly from the database, bypassing the taxonomy pages.

For complementary information, see FAQ.

If you have questions on that subject - or any other - do not hesitate to contact us.

UniProtKB News

Format change in the cross-references to OMA

The format of the cross-references to the OMA project has changed: the resource identifier, which was a UniProtKB accession number, has been replaced by an OMA group fingerprint. The optional information field 1 is now a dash '-'.

Example:

Previous format:

DR   OMA; P39899; YANTHIA.

New format:

DR   OMA; YANTHIA; -.
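
The change can be expressed as a small conversion from the previous layout to the new one. This is a minimal sketch under the assumption that every previous-format line has exactly the shape shown in the example; the function name `convert_oma_dr` is hypothetical and is not part of any UniProt tool.

```python
def convert_oma_dr(line):
    """Rewrite a previous-format OMA DR line into the new format.

    Previous: DR   OMA; <accession>; <fingerprint>.
    New:      DR   OMA; <fingerprint>; -.
    Lines that are already in the new format are returned unchanged.
    """
    prefix = "DR   OMA; "
    if not line.startswith(prefix) or not line.endswith("."):
        raise ValueError("not an OMA DR line: %r" % line)
    fields = [field.strip() for field in line[len(prefix):-1].split(";")]
    if len(fields) != 2:
        raise ValueError("unexpected number of fields: %r" % line)
    first, second = fields
    if second == "-":
        return line  # already in the new format
    # The old optional field (the fingerprint) becomes the identifier.
    return "%s%s; -." % (prefix, second)


print(convert_oma_dr("DR   OMA; P39899; YANTHIA."))
# DR   OMA; YANTHIA; -.
```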

Changes concerning keywords

New keywords:

Changes in subcellular location controlled vocabulary

New subcellular locations:

  • Host thylakoid
  • Thylakoid

UniProt release 15.9

Published October 13, 2009

Headlines

Trichophyton tonsurans: an uninvited guest at the World Judo Championships

The World Judo Championships 2009 took place a few weeks ago in Rotterdam, the Netherlands. A sword of Damocles was hanging over this competition. Its name: Trichophyton tonsurans, a fungal parasite. Japanese academics have raised the alarm: the national sports of sumo and judo may decline because of the rapid spread of this skin-eating fungus. The infection is similar to athlete's foot. It is highly infectious and difficult to treat. It causes itchy red patches on the neck, face and upper body. It often affects the scalp and eventually attacks hair follicles, causing baldness. This distribution is consistent with areas of contact during the grappling that is at the heart of sumo and judo sports, suggesting that the fungus spreads by direct skin-to-skin contact.

Trichophyton tonsurans is just one member of a large family of fungal parasites, called dermatophytes. Dermatophytes are not opportunists, but true pathogens that infect nonliving, cornified layers of the skin, hair and nail in warm and moist environments suitable for proliferation. They are the most common agents of superficial mycoses.

The virulence of dermatophytes is largely due to the secretion of many different proteolytic enzymes. Their genomes encode dozens of secreted proteases. To improve the digestion efficiency of infected tissues, the pathogens secrete proteolytic "cocktails" composed of endo- and exoproteases.

During keratin degradation and digestion, dermatophytes also excrete sulphite through an efflux pump (SSU1). Sulphite reduces cystines, which are abundant in keratins, into cysteine and S-sulphocysteine. As a result, the proteins become more prone to hydrolysis by the secreted "protease cocktail". SSU1 may also play an additional role. Indeed, living in a cyst(e)ine-rich environment, such as the epidermal stratum corneum, hair and nails, may have the fatal drawback of sulphur toxicity. Thus, by excreting excess sulphur as sulphate and sulphite, the pump may also protect dermatophytes from poisoning.

Two large families of secreted endoproteases have been identified in dermatophytes: the subtilisin-like endoproteases SUB1 through SUB7 and the metalloproteinases, also called fungalysins. The exoproteases comprise dipeptidylpeptidases, such as DPP4 and DPP5, aminopeptidases, such as LAP1 and LAP2, as well as carboxypeptidases, such as MCPA, MCPB, SCPA and SCPB. All these proteins have been manually annotated and integrated into UniProtKB/Swiss-Prot.

Orthologous proteins have also been identified in Trichophyton rubrum, the predominant causative agent for superficial dermatomycosis, Arthroderma benhamiae, another dermatophyte triggering severe inflammatory responses in humans, Trichophyton equinum causing ringworm in horses, Nannizzia otae, also known as Microsporum canis, a common zoophilic fungal parasite, and several other less studied dermatophyte species.

In addition to virulence factors, the complete proteome of Nannizzia otae is now available in UniProtKB.

Dermatophytes are fascinating examples of evolutionary adaptation. These fungi have developed sophisticated weapons at our expense to achieve their goal: survival. Like David against Goliath, they have a good probability of winning the battle and sumo wrestlers may well lose their top-knots.

As of this release, 110 dermatophyte virulence factors have been manually annotated and integrated into UniProtKB/Swiss-Prot.

UniProtKB News

Cross-references to Genevestigator

Cross-references have been added to Genevestigator, a reference expression database and meta-analysis system. It allows biologists to study the expression and regulation of genes in a broad variety of contexts by summarizing information from hundreds of microarray experiments into easily interpretable results.

Genevestigator is available at https://www.genevestigator.com/.

The format of the explicit links in the flat file is:

Resource abbreviation Genevestigator
Resource identifier UniProtKB accession number.
Example
P04637:
DR   Genevestigator; P04637; -.

Changes concerning keywords

New keyword:

Changes in subcellular location controlled vocabulary

Modified subcellular locations:

  • Glycosome lumen -> Glycosome matrix
  • Glyoxysome lumen -> Glyoxysome matrix

Deleted subcellular locations:

  • Host lipid droplet membrane
  • Lipid droplet membrane

Changes concerning the controlled vocabulary for PTMs

New terms for the feature key 'Cross-link' ('CROSSLNK' in the flat file):

  • S-Lysyl-methionine sulfilimine (Met-Lys) (interchain with K-...)
  • S-Lysyl-methionine sulfilimine (Lys-Met) (interchain with M-...)

UniProt release 15.8

Published September 22, 2009

Headlines

300'000 HAMAP cross-references in UniProtKB/Swiss-Prot

Bacteria and archaea can live in pretty much every environmental niche we know of. From the bottom of the ocean floor to arctic ice, from wastewater treatment sludge to animal- and plant-associated environments, bacteria and archaea are everywhere. To explore this diversity, the number of bacterial (and to a lesser extent archaeal) genomes being sequenced is rising practically exponentially, giving rise to huge numbers of protein sequences that are annotated to varying degrees of quality. Using these data appropriately, however, requires high-quality annotation. To supply it, we started the HAMAP project (High-quality Automatic and Manual Annotation of microbial Proteins) in 2000. In this project, proteins from complete bacterial and archaeal proteomes, together with related plastid proteins, are automatically annotated based on manually created annotation templates for complete protein annotation, with template-based feature propagation. The annotation templates and much more are available on the HAMAP website. Since January 2008, the sequences annotated by the HAMAP pipeline that fulfill all of its stringent criteria have been entering automatically into the Swiss-Prot section of UniProtKB (see release 54.7 news).

There are now 304'013 UniProtKB/Swiss-Prot entries with a HAMAP cross-reference line to at least one of the 1'595 HAMAP families; 278'635 are bacterial, 14'601 are archaeal and 10'777 are encoded in plastids. Note that some of these entries are the templates for their families (see for example P31120); they include extra information not propagated to all members (for example biophysical chemical characterization, mutagenesis experiments, 3D structures, induction and so on) that has allowed their use as models to annotate further entries (compare entry B6I1P9 containing propagated annotation based on the family rule MF_01554 with model entry P31120).

This large number of semi-automatically annotated entries means that nearly 60% of UniProtKB/Swiss-Prot consists of HAMAP entries; add to this the approximately 30'000 other bacterial and archaeal entries that are not members of a HAMAP family and you find that the total number of bacterial and archaeal entries in UniProtKB/Swiss-Prot begins to reflect their preponderance in nature...

UniProtKB News

Changes in subcellular location controlled vocabulary

New subcellular locations:

  • Acrosome outer membrane
  • Spindle pole

Changes concerning the controlled vocabulary for PTMs

New terms for the feature key 'Cross-link' ('CROSSLNK' in the flat file):

  • 3-(O4'-tyrosyl)-valine (Val-Tyr)
  • 5-(methoxymethyl)thiazole-4-carboxylic acid (Val-Cys)
  • 5-methyloxazole-4-carboxylic acid (Ser-Thr)
  • Glutamyl lysine isopeptide (Glu-Lys) (interchain with K-...)
  • Glutamyl lysine isopeptide (Lys-Glu) (interchain with E-...)
  • Oxazole-4-carboxylic acid (Cys-Ser)
  • Oxazole-4-carboxylic acid (Gly-Ser)
  • Oxazoline-4-carboxylic acid (Cys-Ser)
  • Thiazole-4-carboxylic acid (Gly-Cys)

New terms for the feature key 'Modified residue' ('MOD_RES' in the flat file):

  • (3R)-3-hydroxyasparagine
  • (3R)-3-hydroxyaspartate
  • (3R,4R)-3,4-dihydroxyproline
  • (3S)-3-hydroxyasparagine
  • 1-amino-2-propanone
  • 4,5,5'-trihydroxyleucine
  • 5-glutamyl polyglutamate
  • 5-glutamyl polyglycine
  • 5-hydroxy-3-methylproline
  • Cyclo[(prolylserin)-O-yl] cysteinate
  • Glutamyl 5-glycerylphosphorylethanolamine
  • N-carbamoylalanine
  • N6-(pyridoxal phosphate)lysine
  • N6-(retinylidene)lysine
  • N6-biotinyllysine
  • N6-lipoyllysine
  • O-(2-aminoethylphosphoryl)serine
  • O-(2-cholinephosphoryl)serine
  • O-(pantetheine 4'-phosphoryl)serine
  • O-(phosphoribosyl dephospho-coenzyme A)serine
  • O-AMP-threonine

Deleted terms:

  • 2-oxazoline-4-carboxylic acid (Cys-Ser)
  • 3-hydroxy-5-methylproline
  • 5-methyloxazole (Ser-Thr)
  • Oxazole (Cys-Ser)
  • Oxazole (Gly-Ser)
  • Thiazole (Gly-Cys)
  • Thiazoline-4-carboxylic acid (Thr-Cys)

UniProt release 15.7

Published September 1, 2009

Headlines

Formyl peptide receptors: the missing link between olfaction and immune system

Olfaction plays a major role in the social life of many animals, including mammals, and in their interaction with the environment. In most mammals, the olfactory system has 2 components. 1) The main system is located in the nasal olfactory epithelium (OE) and detects environmental odors, such as those emitted by food and predators. 2) The accessory system is located in the vomeronasal organ (VNO) and detects pheromones. The VNO is linked directly to the brain's emotional centers, such as amygdala and hypothalamus, which control basic drives, hormonal levels, and instinctive behaviours, while OE signals are sent to higher cortical and limbic areas. As a result, signals conveyed by VNO trigger immediate reactions.

Recently, a new family of vomeronasal chemoreceptors has been identified, termed the formyl peptide receptors. In the mouse, 5 formyl peptide receptors are expressed in VNO: Fpr-rs1 (also called Fpr3), Fpr-rs3, Fpr-rs4, Fpr-rs6, and Fpr-rs7. Fpr-rs1, as well as another member of the family, Fpr1, have been previously shown to be expressed within granulocytes, monocytes and macrophages of the immune system. Their ligands include N-formyl-methionyl peptides (fMLP) released by Gram-negative bacteria, HIV-derived peptides, the antimicrobial peptide CRAMP, lipoxin A4, etc. Upon ligand recognition, these chemoreceptors stimulate chemotaxis of the immune cells to the site of infection or tissue damage.

Interestingly, VNO Fpr-rs respond to various degrees to most of the stimuli that affect their relatives in the immune system. Sensitivity to disease/inflammation-related ligands presents major advantages, such as the detection of spoiled food. Although Fpr-rs agonists are mostly produced in tissues and serum after inflammation, they are also present in some bodily fluids, such as urine. This could allow their olfactory detection by conspecifics, leading to the rapid isolation of sick individuals and hence minimizing the risk of disease spreading within a community.

As of this release, all 5 VNO Fpr-rs are annotated and available from UniProtKB/Swiss-Prot.

UniProtKB News

Cross-references to STRING

Cross-references have been added to STRING, a resource of known and predicted protein-protein interactions, which quantitatively integrates interaction data from four sources for a large number of organisms, and transfers information between these organisms where applicable.

STRING is available at http://string-db.org/.

The format of the explicit links in the flat file is:

Resource abbreviation STRING
Resource identifier UniProtKB accession number.
Example
P17735:
DR   STRING; P17735; -.
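
The explicit-link layout described above (resource abbreviation, resource identifier, then optional fields, separated by "; " and closed by a period) can be read with a few lines of Python. This is only a minimal sketch, not an official UniProt parser: the function name parse_dr_line is hypothetical, and treating a lone "-" as an unused optional field is an assumption drawn from the example.

```python
def parse_dr_line(line):
    """Split one flat-file DR line into (resource, identifier, optional_fields)."""
    if not line.startswith("DR "):
        raise ValueError("not a DR line: %r" % line)
    body = line[2:].strip()
    # The closing period terminates the line and is not part of the last field.
    if body.endswith("."):
        body = body[:-1]
    fields = [field.strip() for field in body.split("; ")]
    if len(fields) < 2:
        raise ValueError("DR line needs a resource and an identifier: %r" % line)
    resource, identifier = fields[0], fields[1]
    # Assumption: "-" is a placeholder for an optional field that is not used.
    optional = [field for field in fields[2:] if field != "-"]
    return resource, identifier, optional


print(parse_dr_line("DR   STRING; P17735; -."))
```

Splitting on "; " rather than on every period or semicolon keeps identifiers that contain dots or colons intact.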

Changes in subcellular location controlled vocabulary

New subcellular locations:

  • Host nucleus outer membrane
  • Host smooth endoplasmic reticulum
  • Host smooth endoplasmic reticulum membrane

UniProt release 15.6

Published July 28, 2009

Headlines

Microsporidian polar tube: a molecular syringe in UniProtKB/Swiss-Prot

Microsporidia are ubiquitous, obligate intracellular spore-forming fungal parasites which infect a wide range of invertebrates and vertebrates. They are common pathogens responsible for opportunistic infections in immunodeficient humans, such as HIV-infected patients or patients being treated with immunosuppressive drugs. The most common microsporidian associated with AIDS is Enterocytozoon bieneusi which induces chronic diarrhea in HIV-infected individuals. However, since no animal model for E.bieneusi is available, most of the experimental studies on microsporidia have been carried out on Encephalitozoon cuniculi. This microsporidium, which commonly infects rodents, has also been reported to infect humans. Its complete proteome is available in UniProtKB.

Microsporidia are primitive organisms lacking fundamental organelles found in other eukaryotes, such as stacked Golgi apparatus, peroxisomes or mitochondria. However, they have a mitochondrial relic organelle called the mitosome which does not contain any DNA. As a result, to persist in the environment, they have to parasitize the cells of higher organisms.

How do they achieve their goal? The microsporidian intracellular developmental cycle leads to a terminal sporogenic phase producing small spores which are critical for their host-to-host transmission. The unicellular spores have a resistant wall protecting a mononucleate or binucleate sporoplasm (the infectious apparatus of the spore) and an extrusion apparatus consisting of a single polar tube with an anterior attachment complex. Once the target cell is recognized, the polar tube acts as a syringe: it pierces the host cell membrane and rapidly "injects" the sporoplasm into the host cell.

3 polar tube proteins have been identified: PTP1, PTP2, and PTP3. The major polar tube protein, PTP1, accounts for at least 70% of the mass of the polar tube. Before the polar tube can act, the spore has to recognize the host cell and stick to its surface. This role is played by EnP1, which is involved in the adhesion of spores to host cell surface glycosaminoglycans. Orthologous proteins have been identified: EnP1, PTP1 and PTP2 in Encephalitozoon intestinalis and PTP1 and PTP2 in Encephalitozoon hellem. These 2 microsporidian species infect man and cause intestinal infections, keratoconjunctivitis, and respiratory infections.

All these infectious proteins are available in UniProtKB with the following accession numbers:

UniProtKB News

Cross-references to CTD

Cross-references have been added to the Comparative Toxicogenomics Database, which elucidates molecular mechanisms by which environmental chemicals affect human disease. Chemical-gene/protein interactions and chemical- and gene-disease relationships are curated from the published literature, and integrated with diverse data to facilitate environmental health research.

CTD is available at http://ctd.mdibl.org/.

The format of the explicit links in the flat file is:

Resource abbreviation CTD
Resource identifier NCBI geneID.
Example
Q9YIC3:
DR   CTD; 395652; -.

Cross-references to Ensembl

We have changed the format of the cross-reference lines to Ensembl. The DR Ensembl lines have been extended in order to include identifiers for transcripts and peptides.

Resource abbreviation Ensembl
Resource identifier Ensembl unique identifier for a transcript.
Optional information 1 Ensembl unique identifier for a protein.
Optional information 2 Ensembl unique identifier for a gene.
Optional information 3 Species name.
Example
O43462:
DR   Ensembl; ENST00000379484; ENSP00000368798; ENSG00000012174; Homo sapiens.
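
For scripts that consume the extended Ensembl lines, the four fields listed above can be mapped onto named attributes. The sketch below is illustrative only: the names EnsemblXref and parse_ensembl_dr are hypothetical, and it assumes that every DR Ensembl line carries exactly the transcript, protein, gene and species fields in that order.

```python
from typing import NamedTuple


class EnsemblXref(NamedTuple):
    transcript_id: str  # resource identifier
    protein_id: str     # optional information 1
    gene_id: str        # optional information 2
    species: str        # optional information 3


def parse_ensembl_dr(line):
    """Parse an extended 'DR   Ensembl; ...' line into an EnsemblXref."""
    parts = line.split(None, 1)
    if len(parts) != 2 or parts[0] != "DR":
        raise ValueError("not a DR line: %r" % line)
    body = parts[1].rstrip()
    if body.endswith("."):
        body = body[:-1]  # drop the terminating period
    fields = body.split("; ")
    if fields[0] != "Ensembl" or len(fields) != 5:
        raise ValueError("unexpected Ensembl DR layout: %r" % line)
    return EnsemblXref(*fields[1:])


xref = parse_ensembl_dr(
    "DR   Ensembl; ENST00000379484; ENSP00000368798; ENSG00000012174; Homo sapiens."
)
print(xref.gene_id, xref.species)
```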

Changes concerning keywords

New keywords:

UniProt release 15.5

Published July 7, 2009

Headlines

New insights into drug development with Polyketide synthases

Polyketides are secondary metabolites produced by numerous organisms, from bacteria, fungi and plants to animals. Polyketides are structurally very diverse: thousands of different polyketides have already been discovered, and they possess a wealth of biological activities, including antimicrobial, antifungal and antiparasitic functions. They endow their producing organism with increased fitness in an environment full of competitors. And not only the producers! We also take advantage of these compounds and many are in commercial use as natural insecticides, cholesterol-lowering agents, antitumor drugs or immunosuppressors.

Polyketides are synthesized by an important family of enzymes, called polyketide synthases (PKSs). PKSs are large multifunctional proteins, with an average length of over 2'500 amino acids, up to almost 5'000 amino acids, often bearing several different catalytic activities. Polyketide biosynthesis proceeds by the assembly of simple blocks, such as propionyl-CoA, butyryl-CoA or acetyl-CoA, in a process that closely parallels fatty acid biosynthesis. The fascinating diversity of polyketides arises through various mechanisms: use of different starter molecules, different chain extension substrates, generation of chiral centers, functional group modifications, such as cyclization, etc.

The social amoeba Dictyostelium discoideum lives in the soil and feeds on a variety of bacteria and fungi. In its natural habitat, D. discoideum has several rivals, such as bacteria, nematodes, and Dictyostelium caveatum. However, this slime mold is not defenseless and it has been shown, for instance, to be able to repel nematodes "by secreting compounds". It appears today that D. discoideum has at least 40 functional PKS genes and 5 probable pseudogenes. This is the largest number of PKSs of all known genomes.

These proteins are very interesting. Understanding the exact enzymatic mechanisms in play and the role of each PKS module in the generation of polyketide diversity may allow us to engineer new PKSs that could produce new active compounds. The way may be paved for the discovery of fundamentally new types of drugs!

As of this release, all D. discoideum PKSs can be retrieved from UniProtKB/Swiss-Prot.

UniProtKB News

Cross-references to UCSC genome browser

Cross-references have been added to the UCSC genome browser, which contains the reference sequences and working draft assemblies for a large collection of genomes and also provides a portal to the ENCODE project.

UCSC is available at http://genome.ucsc.edu/.

The format of the explicit links in the flat file is:

Resource abbreviation UCSC
Resource identifier UCSC GeneID.
Optional information 1 Species name.
Examples
P09496:
DR   UCSC; uc003zzc.1; human.
DR   UCSC; uc003zzd.1; human.
DR   UCSC; uc003zze.1; human.

P13631:
DR   UCSC; uc001scd.1; human.
DR   UCSC; uc001sce.1; human.

Changes concerning keywords

Modified keyword:

UniProt release 15.4

Published June 16, 2009

Headlines

Dioxygenases: from antigenic variation to myeloid malignancies

Beta-D-glucopyranosyloxymethyluracil, also called base J, was the first hypermodified base to be identified in eukaryotic DNA, in 1993, in the nucleus of Trypanosoma brucei. Base J has been shown to be present in all kinetoplastids analyzed, in the related marine flagellate Diplonema and in Euglena gracilis, a unicellular alga closely related to the Kinetoplastida (see review). Not only is base J absent from a variety of other protozoa, fungi and vertebrates, but most organisms lacking base J contain DNA glycosylases attacking hydroxymethyldeoxyuridine (HOMedU), thus actively preventing the appearance of this intermediate of base J synthesis. Mammals even contain a highly active dedicated HOMedU glycosylase.

The biosynthesis of base J has been characterized. It requires 2 dioxygenases: JBP1 and JBP2. But the precise function of this DNA modification is not clear. It seems to play a role in Trypanosoma or Leishmania antigenic variation. It was first suspected to be involved in gene silencing, but this hypothesis lacks support. A current idea is that it may regulate homologous recombination at telomeres where most genes encoding Variant Surface Glycoproteins (VSG) are located.

Although the existence of such a complex DNA modification seemed unlikely in vertebrates, Tahiliani et al. (2009) performed a computational search and found JBP homologs throughout metazoans, including man where 3 homologs - TET1, TET2, TET3 - were identified. Human TET1 was unambiguously shown to be able to catalyze the conversion from 5-methylcytosine to 5-hydroxymethylcytosine (hmC). Was it just a pure exercise in style? Actually not. HmC is present in mouse embryonic stem cells. Moreover it appears to be quite abundant in mouse brain, where it constitutes up to 0.6% of total nucleotides in Purkinje cells and 0.2% in granule cells (Kriaucionis and Heintz, 2009).

Human TET1 has been known since 2002 to be involved in some acute leukemias, where it plays the role of the fusion partner of MLL in the translocation t(10;11)(q22;q23) (Ono et al., 2002). In the first months of 2009, several articles pointed at TET2 mutations that contribute to pathogenesis of a wide spectrum of myeloid malignancies, including myelodysplastic syndromes, myeloproliferative disorders, acute myeloid and chronic myelomonocytic leukemias.

A new exciting area of investigation is now open to understand the physiological function of TET/JBP family members, which may be quite crucial in view of the dramatic consequences of their mutation. As of this release, the manually annotated protein sequences of these enzymes are available from UniProtKB/Swiss-Prot: JBP1, including Trypanosoma cruzi isoenzymes JBP1A and JBP1B, JBP2, TET1, TET2 and TET3.

UniProtKB News

Changes concerning cross-references to LinkHub

Cross-references to LinkHub have been removed.

Changes concerning keywords

New keywords:

Changes in subcellular location controlled vocabulary

New subcellular locations:

  • Mitosome
  • Mitosome envelope
  • Mitosome inner membrane
  • Mitosome intermembrane space
  • Mitosome matrix
  • Mitosome membrane
  • Mitosome outer membrane
  • Spore polar tube

Modified subcellular locations:

  • Host intracytoplasmic membrane -> Host endomembrane system
  • Intracytoplasmic membrane -> Endomembrane system

UniProt release 15.3

Published May 26, 2009

Headlines

Rotavirus: a serial killer in UniProtKB/Swiss-Prot

Rotaviruses can infect humans, as well as other vertebrates. They cause severe diarrheal disease and dehydration of infants in both developed and developing countries. An estimated 0.6-0.8 million children aged 5 and under die from rotavirus-induced severe dehydrating diarrhea each year. Although mortality due to rotavirus infection is much higher in developing than in developed countries, infection frequency is remarkably similar. In temperate climates, rotavirus disease is seasonal, peaking in winter. Rotavirus gastroenteritis is transmitted by the fecal-oral route and characterized by watery stools, vomiting and fever. Commercially available vaccines are effective in preventing infection.

The virus infects the mature enterocytes of the small intestine and induces structural changes in the intestinal epithelium, secretion of a viral enterotoxin by infected cells, impaired absorption, cellular and tight junction damage and stimulation of intestinal motility, leading to watery diarrhea.

Rotaviruses have a segmented double-stranded RNA genome (dsRNA) protected by a three-layered capsid resistant to the acidic pH of the stomach. The genome is composed of 11 segments coding for about 12 proteins. One of its features is that its dsRNA genome is never completely uncoated during replication. Only the outermost layer is lost following entry into the host cell. Replication of the viral genome thus occurs within a protective shell to avoid detection and degradation by the host cell.

Seven different species of rotavirus have been described: A, B, C, D, E, F and G. Humans are primarily infected by species A, but also by species B and C. All seven species cause disease in other vertebrates. As of this release, sequences representative of all currently known rotavirus A, B, and C species have been annotated in UniProtKB/Swiss-Prot. This represents 480 entries, from 100 distinct strains, 40 of which are of human origin.

In addition to manually annotated sequences and functional information, we paid special attention to viral taxonomy. We decided to follow the recent recommendations for genome-based classification, in addition to the older antigenic classification system. This allows us to better reflect the frequent rearrangements (segment exchanges) that occur between strains. As a result, a detailed taxonomy is provided for each RNA segment/protein (see for instance Q3ZK61).

For more detailed information on rotaviruses, see the ViralZone portal.

UniProtKB News

Cross-references to PMAP-CutDB

Cross-references have been added to the CutDB - Proteolytic event database. PMAP-CutDB is one of the first systematic efforts to build an easily accessible collection of documented proteolytic events for natural proteins in vivo or in vitro. A CutDB entry is defined by a unique combination of these three attributes: protease, protein substrate and cleavage site.

PMAP-CutDB is available at http://www.proteolysis.org/.

The format of the explicit links in the flat file is:

Resource abbreviation PMAP-CutDB
Resource identifier UniProtKB accession number.
Examples
P02760:
DR   PMAP-CutDB; P02760; -.

Q02383:
DR   PMAP-CutDB; Q02383; -.

Removal of the ec2dtosp.txt document file

The document ec2dtosp.txt, which listed the Escherichia coli Gene-protein database (ECO2DBASE) entries cross-referenced in UniProtKB/Swiss-Prot, has been removed.

Changes concerning keywords

New keyword:

UniProt release 15.2

Published May 5, 2009

Headlines

Fission yeast: the third eukaryotic complete proteome in UniProtKB/Swiss-Prot

Schizosaccharomyces pombe, the fission yeast, was isolated in 1893 by P. Lindner from East African millet beer, for which it was named, 'pombe' meaning 'beer' in Swahili. The genus name reflects both its relationship to budding yeast (-saccharomyces), and the most striking feature that distinguishes it from other yeast species, i.e. reproduction by fission (Schizo-). Although both S.pombe and Saccharomyces cerevisiae are yeasts, they are genetically as divergent from each other as both are from man. Unlike S.cerevisiae, S.pombe did not acquire its fame for its beer-making talents - beer made with S.pombe seems to have quite an unsavoury acidic taste - but for the great scientific achievements its study permitted.

As mentioned above, fission yeast divides not by budding, but by medial fission, a process that resembles higher eukaryotic cell division. The organism grows exclusively through its cell tips and divides upon reaching the appropriate size, producing 2 daughter cells of equal size. Thus a simple measure of its length gives an estimate of which cell cycle phase the cell is in. This approach allows the isolation of cell cycle mutants (cdc), based on the presence of elongated cells due to continuous cell growth in the absence of cell division. This feature makes S.pombe a first-rate model organism to study cell division. The characterization of cdc mutants led to the discovery of cyclin-dependent kinases, which was eventually recognized with the 2001 Nobel Prize in Physiology or Medicine.

In 2002, S.pombe was the 6th eukaryotic organism to have its genome fully sequenced. As of this release, it is the 3rd eukaryotic organism, after S.cerevisiae and Homo sapiens, for which the complete proteome is available in UniProtKB/Swiss-Prot. This set represents 4'957 manually curated protein sequence entries, containing data from the scientific literature and numerous cross-references, including links to GeneDB_Spombe, the fission yeast community database. A list of all S.pombe UniProtKB/Swiss-Prot entries is available in the pombe.txt file.

The S.pombe proteome we provide today is not a static one. We will keep revisiting and updating the entries as the science develops further. Analysis of S.pombe and S.cerevisiae proteins, coupled with phylogenetic studies, will allow the identification and annotation of homologous proteins in other organisms.

UniProtKB News

Cross-references to OMA

Cross-references have been added to the OMA project. The OMA project is a massive cross-comparison of complete genomes to identify the evolutionary relation between any pair of proteins. The main features of OMA are the large number of genomes from all kingdoms of life, the strict verification of orthology assignments and the determination of the phylogenetic relationship between any two proteins.

OMA is available at http://www.omabrowser.org/.

The format of the explicit links in the flat file is:

Resource abbreviation OMA
Resource identifier UniProtKB accession number.
Optional information 1 OMA group fingerprint.
Examples
P39899:
DR   OMA; P39899; YANTHIA.

Q9Y6C2:
DR   OMA; Q9Y6C2; EGLENKP.

Changes concerning keywords

New keyword:

Changes in subcellular location controlled vocabulary

New subcellular locations:

  • Host phagosome
  • Host phagosome membrane
  • Host presynaptic cell membrane
  • Host synapse

New subcellular topology:

  • GPI-like-anchor

UniProt release 15.1

Published April 14, 2009

Headlines

Hepatitis Delta virus, a living fossil virus of the old RNA world?

Hepatitis delta virus (HDV) is unique in virology, and has continued to fascinate since its discovery 30 years ago. HDV is a defective virus parasitizing hepatitis B virus (HBV) infected cells. Clinically, HDV infection results in more severe acute and chronic liver disease than that caused by HBV alone.

Only 1'680 nucleotides long, the HDV genome is the smallest known to infect man. The virus comprises one single gene, encoding the small Hepatitis Delta Antigen (S-HDAg). To compensate for this limited protein-coding capacity, HDV relies on a unique molecular mechanism to hijack host functions and the extraordinary dynamics of its RNA genome.

All known RNA viruses code for an RNA-dependent RNA polymerase to replicate/transcribe their genome, since eukaryotic host cells are unable to replicate RNA genomes. All but HDV; surprisingly, the S-HDAg seems to modify the activity of human DNA-dependent RNA polymerase II, turning it into an RNA-dependent RNA polymerase! Not only is this activity unique in molecular biology, but it also has many implications in the field of molecular evolution: life is thought to have started as RNA. HDV highlights the potential ability of human RNA polymerase II to switch back to an activity presumably forgotten for hundreds of millions of years.

HDV genome replication pushes this nostalgia for the ancient RNA world even further. Rolling circle genome replication produces a ssRNA composed of numerous repeats of the viral genome. All viruses known to use rolling circle replication rely on proteins to cleave the genome concatemer. All but HDV; cleavage occurs via an autocatalytic ribozyme activity encoded in the RNA genome.

HDV needs HBV co-infection only to borrow its capsid and budding mechanism. This function is carried out by a longer isoform of HDAg with an additional 19 to 20 amino acids (L-HDAg). Again HDV relies on a unique mechanism to produce this isoform; the genomic RNA is edited at one specific site by a human RNA adenosine deaminase (ADAR1). Somehow, edited genomes are unable to replicate, ensuring that the unedited version remains predominant.

The lesson from this quite unusual virus is that evolution does not always result in the creation of new tools, but sometimes it allows an existing tool to learn old and long forgotten tricks.

UniProtKB News

Cross-references to CAZy

Cross-references have been added to the Carbohydrate-Active enZymes database CAZy. CAZy describes the families of structurally-related catalytic and carbohydrate-binding modules (or functional domains) of enzymes that degrade, modify, or create glycosidic bonds.

CAZy is available at http://www.cazy.org/.

The format of the explicit links in the flat file is:

Resource abbreviation CAZy
Resource identifier CAZy family number.
Optional information 1 CAZy family name.
Examples
P30590:
DR   CAZy; GT2; Glycosyltransferase Family 2.

P32775:
DR   CAZy; CBM48; Carbohydrate-Binding Module Family 48.
DR   CAZy; GH13; Glycoside Hydrolase Family 13.
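
In the examples above, each CAZy resource identifier combines a class prefix (GT, CBM, GH) with a family number. When grouping entries by class, the two parts can be separated as sketched below. The function name split_cazy_family is hypothetical, and the pattern only covers identifiers of the form shown in these examples; any other identifier shape is rejected rather than guessed at.

```python
import re

# Uppercase class prefix followed by a family number, e.g. "GT2" or "CBM48".
_FAMILY_PATTERN = re.compile(r"^([A-Z]+)(\d+)$")


def split_cazy_family(family_id):
    """Return (class_prefix, family_number) for identifiers such as 'GH13'."""
    match = _FAMILY_PATTERN.match(family_id)
    if match is None:
        raise ValueError("unrecognized CAZy family identifier: %r" % family_id)
    return match.group(1), int(match.group(2))


for identifier in ("GT2", "CBM48", "GH13"):
    print(identifier, split_cazy_family(identifier))
```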

Changes concerning keywords

New keyword:

Changes in subcellular location controlled vocabulary

New subcellular locations:

  • Host
  • Host apical cell membrane
  • Host caveola
  • Host cell
  • Host cell envelope
  • Host cell inner membrane
  • Host cell junction
  • Host cell membrane
  • Host cell outer membrane
  • Host cell projection
  • Host cell surface
  • Host cell wall
  • Host cellular thylakoid
  • Host cellular thylakoid membrane
  • Host cis-Golgi network
  • Host cis-Golgi network membrane
  • Host cytoplasm
  • Host cytoplasmic vesicle
  • Host cytoplasmic vesicle membrane
  • Host cytoskeleton
  • Host cytosol
  • Host endoplasmic reticulum
  • Host endoplasmic reticulum lumen
  • Host endoplasmic reticulum membrane
  • Host endoplasmic reticulum-Golgi intermediate compartment
  • Host endoplasmic reticulum-Golgi intermediate compartment membrane
  • Host endosome
  • Host endosome membrane
  • Host extracellular space
  • Host filopodium
  • Host Golgi apparatus
  • Host Golgi apparatus membrane
  • Host intracytoplasmic membrane
  • Host late endosome
  • Host late endosome membrane
  • Host lipid droplet
  • Host lipid droplet membrane
  • Host lysosome
  • Host lysosome membrane
  • Host membrane
  • Host microsome
  • Host microsome membrane
  • Host mitochondrion
  • Host mitochondrion envelope
  • Host mitochondrion inner membrane
  • Host mitochondrion membrane
  • Host mitochondrion outer membrane
  • Host nucleolus
  • Host nucleoplasm
  • Host nucleus
  • Host nucleus envelope
  • Host nucleus inner membrane
  • Host nucleus lamina
  • Host nucleus matrix
  • Host nucleus membrane
  • Host perinuclear region
  • Host periplasm
  • Host plasmodesma
  • Host rough endoplasmic reticulum
  • Host rough endoplasmic reticulum membrane

UniProt release 15.0

Published March 24, 2009

Headlines

A UniProtKB major release (15.0)

UniProt Knowledgebase release 15.0 includes Swiss-Prot release 57.0 and TrEMBL release 40.0.

Release 57.0 of 24-Mar-09 of UniProtKB/Swiss-Prot contains 428'650 sequence entries, comprising 154'416'236 amino acids abstracted from 177'584 references. 36'053 sequences have been added since release 56.0, the sequence data of 2'010 existing entries have been manually updated and the annotations of 368'500 entries have been revised.

Release 40.0 of 24-Mar-2009 of UniProtKB/TrEMBL contains 7'537'442 sequence entries, comprising 2'459'135'421 amino acids. 1'700'878 sequences have been added since release 39.0, the sequence data of 24'829 existing entries have been updated and the annotations of 4'218'268 entries have been revised. This represents an increase of 31%.

The following improvements were carried out in the last 8 months:

  • We have structured the 'Pathway' subsection of the 'General annotation (Comments)' section (comment line (CC) topic PATHWAY in the flat file), using the controlled vocabulary provided by the UniPathway resource, in order to improve the consistency of annotation and to allow its content to be parsed. To accompany this change, we have created a new document, pathlist.txt, which describes the controlled vocabulary used in the 'Pathway' subsection of the 'General annotation (Comments)' section.
  • We have enriched the controlled vocabulary for subcellular location description with 14 new terms. 6 have been deleted.
  • We have enriched the controlled vocabulary for post-translational modification description with 30 new terms: 9 for the subsection 'Modified residue' of the 'Sequence annotation (Features)' section (FT 'MOD_RES' in the flat file) and 21 for the subsection 'Cross-link' (FT 'CROSSLNK' in the flat file).
  • We have added cross-references to 9 new databases, bringing the total number of explicit cross-references to 111: Bgee, GeneCards, IPI, NextBio, Pathway_Interaction_DB, PRIDE, TCDB, Xenbase and to Plant Ontology (PO) in the tisslist.txt file. Cross-references to StyGene have been removed.
  • We have added 17 new keywords and 1 has been deleted. 10 of the newly created keywords are medical ones.

UniProtKB News

Change in the 'Encoded on' subsection (OG line in the flat file): from 'Chromatophore' to 'Organellar chromatophore'

After discussion with experts in the field and consultation with the Gene Ontology experts we have changed the OG line describing proteins encoded by the chromatophore of Paulinella chromatophora from:

      OG   Plastid; Chromatophore.
     
to:
      OG   Plastid; Organellar chromatophore.
     

Changes concerning cross-references to StyGene

Cross-references to StyGene have been removed.

Removal of the 'salty.txt' document

The document salty.txt, listing Salmonella typhimurium strain LT2 entries, gene names and cross-references to StyGene, has been removed.

Changes concerning keywords

New keywords:

Modified keyword:

Changes in subcellular location controlled vocabulary

New subcellular location:

  • Pollen coat

Modified subcellular locations:

  • Chromatophore -> Organellar chromatophore
  • Chromatophore inner membrane -> Organellar chromatophore inner membrane
  • Chromatophore intermembrane space -> Organellar chromatophore intermembrane space
  • Chromatophore membrane -> Organellar chromatophore membrane
  • Chromatophore outer membrane -> Organellar chromatophore outer membrane
  • Chromatophore stroma -> Organellar chromatophore stroma
  • Chromatophore thylakoid -> Organellar chromatophore thylakoid
  • Chromatophore thylakoid lumen -> Organellar chromatophore thylakoid lumen
  • Chromatophore thylakoid membrane -> Organellar chromatophore thylakoid membrane
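
All nine renamed terms follow the same pattern: the leading 'Chromatophore' becomes 'Organellar chromatophore'. Users who keep local copies of the old vocabulary could update them along the lines sketched below. The function name rename_location is hypothetical, and the prefix rule is an assumption based only on the list above.

```python
OLD_PREFIX = "Chromatophore"
NEW_PREFIX = "Organellar chromatophore"


def rename_location(term):
    """Map an old 'Chromatophore ...' term to its 'Organellar chromatophore ...' form."""
    # Match the bare term or the term followed by a space, so that unrelated
    # terms that merely begin with the same letters are left alone.
    if term == OLD_PREFIX or term.startswith(OLD_PREFIX + " "):
        return NEW_PREFIX + term[len(OLD_PREFIX):]
    return term


print(rename_location("Chromatophore thylakoid lumen"))
print(rename_location("Pollen coat"))
```

Because the new terms start with 'Organellar', applying the function twice leaves already updated terms unchanged.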

UniProt release 14.9

Published March 3, 2009

Headlines

Hush, Little Fly...

Most organisms slumber and so do flies. As in humans, caffeine or amphetamines keep them awake, while antihistamines make them fall asleep. Not surprisingly, prolonged sleep deprivation can lead to lethality. These and other similarities prompted researchers to use Drosophila melanogaster as a model organism to study the genetic basis of sleep.

Sleep is regulated by two main processes: circadian and homeostatic. The first says it is time to sleep, the second signals the need to rest, independently of the hour of the day. In July 2008, Koh et al. showed that in Drosophila, mutations in a single protein, Quiver, aptly renamed Sleepless by the authors, deeply perturb the homeostatic control. Loss of this protein causes an extreme reduction in sleep (>80%). About 9% of the flies don't sleep at all. Although the mutants had a shortened lifespan, they were still capable of flying and mating!

Quiver is thought to act through the regulation of the Shaker K+ channel, lowering membrane excitability by modulating its expression and activity. It could thus be a signaling molecule that links homeostatic sleep drive to neuronal excitability.

Although Quiver is well-conserved in other insect species and a potential ortholog has been identified in C.elegans, there are no obvious homologs in vertebrates. However, many members of the Shaker potassium channel family are known from yeast to humans.

If there is indeed a common mechanism for sleep control between humans and flies, one might envision relieving some form of insomnia by acting on K+ channels. In the meantime, we advise you to keep counting sheep, or flies, when you can't get to sleep...

Quiver is now available in UniProtKB/Swiss-Prot and the first Protein Spotlight issue of this year has been devoted to this protein.

UniProtKB News

Cross-references to TCDB

Cross-references have been added to the Transport Classification Database TCDB. TCDB details a comprehensive IUBMB-approved classification system for membrane transport proteins known as the Transporter Classification (TC) system. The TC system is analogous to the Enzyme Commission (EC) system for classification of enzymes, but additionally incorporates phylogenetic information.

TCDB is available at http://www.tcdb.org/.

The format of the explicit links in the flat file is:

Resource abbreviation TCDB
Resource identifier Transporter Classification number.
Optional information 1 Transporter Classification family name.
Examples
P0A903:
DR   TCDB; 1.B.33.1.3; outer membrane protein insertion porin (OmpIP) family.

P0AC02:
DR   TCDB; 1.B.33.1.3; outer membrane protein insertion porin (OmpIP) family.

O60840:
DR   TCDB; 1.A.1.11.11; voltage-gated ion channel (VIC) superfamily.
DR   TCDB; 1.A.1.11.15; voltage-gated ion channel (VIC) superfamily.
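
The Transporter Classification numbers in these examples have five dot-separated components. The sketch below splits them into named parts. The names TCNumber and parse_tc_number are hypothetical, and the component labels (class, subclass, family, subfamily, transport system) reflect our reading of the TC system rather than anything stated in the flat file itself.

```python
from typing import NamedTuple


class TCNumber(NamedTuple):
    tc_class: int   # 1st component, a number
    subclass: str   # 2nd component, a letter
    family: int     # 3rd component
    subfamily: int  # 4th component
    system: int     # 5th component


def parse_tc_number(tc_number):
    """Split a TC number such as '1.B.33.1.3' into its five components."""
    parts = tc_number.split(".")
    if len(parts) != 5:
        raise ValueError("expected 5 components in TC number: %r" % tc_number)
    return TCNumber(int(parts[0]), parts[1], int(parts[2]), int(parts[3]), int(parts[4]))


print(parse_tc_number("1.B.33.1.3"))
print(parse_tc_number("1.A.1.11.15"))
```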

Cross-references to Pathway_Interaction_DB

Cross-references have been added to the Pathway Interaction Database Pathway_Interaction_DB. The Pathway Interaction Database is a highly-structured, curated collection of information about known biomolecular interactions and key cellular processes assembled into signaling pathways.

Pathway_Interaction_DB is available at http://pid.nci.nih.gov/.

The format of the explicit links in the flat file is:

Resource abbreviation Pathway_Interaction_DB
Resource identifier Short pathway name.
Optional information 1 Full pathway name.
Examples
O00422:
DR   Pathway_Interaction_DB; hdac_classi_pathway; Signaling events mediated by HDAC Class I.
DR   Pathway_Interaction_DB; hedgehog_glipathway; Hedgehog signaling events mediated by Gli proteins.
DR   Pathway_Interaction_DB; smad2_3nuclearpathway; Regulation of nuclear SMAD2/3 signaling.
DR   Pathway_Interaction_DB; telomerasepathway; Regulation of Telomerase.

O14640:
DR   Pathway_Interaction_DB; ps1pathway; Presenilin action in Notch and Wnt signaling.
DR   Pathway_Interaction_DB; wnt_canonical_pathway; Canonical Wnt signaling pathway.

Changes concerning keywords

New keywords:

Modified keyword:

UniProt release 14.8

Published February 10, 2009

Headlines

The UniProtKB/Swiss-Prot bronze medal is awarded to the plant Arabidopsis thaliana

With 7'764 manually annotated entries, Arabidopsis thaliana is now the third most represented species in UniProtKB/Swiss-Prot, behind Homo sapiens (human) and Mus musculus (mouse). This corresponds to about 25% of the complete proteome of A.thaliana, which can be retrieved from UniProtKB using the keyword 'Complete proteome'.

The members of the Plant Proteome Annotation Program (PPAP) are very proud of this third position. As shown in a study on Olympic medalists by Medvec et al. (1995), competitors who won the bronze medal are significantly happier with their award than those who won the silver medal. The silver medalists tend to be frustrated at having missed out on the gold, while the bronze medalists are simply happy to have received any honor at all.

UniProtKB News

Cross-references to Plant Ontology (PO) in the tisslist.txt file

Cross-references to Plant Ontology (PO) have been added in the tisslist.txt file. Each term in this tissue list can be mapped to the corresponding eVOC term and, as of this release, also to a Plant Ontology (PO) term.

Examples:

ID   Aleurone.
AC   TS-0027
SY   Aleurone layer.
DR   PO; PO:0005360; aleurone layer.
//
ID   Embryo.
AC   TS-0229
SY   Embryonic; Embryonic tissue; Whole embryo; Parthenogenote.
DR   eVOC; EV:0300001; development-stage: embryo.
DR   PO; PO:0009009; embryo.
//
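
To illustrate how the new PO cross-references could be used, here is a minimal Python sketch that reads entries in the layout shown above and maps each tissue name to its PO identifiers. The names parse_tisslist and po_terms are hypothetical, and the sketch assumes that every entry ends with a '//' line and that each DR line has exactly three '; '-separated fields.

```python
def parse_tisslist(text):
    """Yield one dict per tissue entry; an entry ends with a '//' line."""
    entry = {"SY": [], "DR": []}
    for line in text.splitlines():
        if line.startswith("//"):
            yield entry
            entry = {"SY": [], "DR": []}
            continue
        code, value = line[:2], line[5:].strip()
        if code == "ID":
            entry["ID"] = value.rstrip(".")
        elif code == "AC":
            entry["AC"] = value
        elif code == "SY":
            # Synonyms share one line, separated by semicolons.
            entry["SY"].extend(s.strip() for s in value.rstrip(".").split(";"))
        elif code == "DR":
            database, identifier, label = value.rstrip(".").split("; ", 2)
            entry["DR"].append((database, identifier, label))


def po_terms(entries):
    """Map each tissue name to the list of its PO identifiers."""
    return {
        e["ID"]: [ident for db, ident, _ in e["DR"] if db == "PO"]
        for e in entries
    }


SAMPLE = """ID   Aleurone.
AC   TS-0027
SY   Aleurone layer.
DR   PO; PO:0005360; aleurone layer.
//
ID   Embryo.
AC   TS-0229
SY   Embryonic; Embryonic tissue; Whole embryo; Parthenogenote.
DR   eVOC; EV:0300001; development-stage: embryo.
DR   PO; PO:0009009; embryo.
//
"""

if __name__ == "__main__":
    print(po_terms(parse_tisslist(SAMPLE)))
```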

Cross-references to IPI

Cross-references have been added to the International Protein Index (IPI). IPI maintains a database of cross-references between the primary data sources, provides minimally redundant yet maximally complete sets of proteins, and maintains stable identifiers for proteomes of higher eukaryotic organisms.

IPI is available at http://www.ebi.ac.uk/IPI/IPIhelp.html.

The format of the explicit links in the flat file is:

Resource abbreviation IPI
Resource identifier IPI unique identifier.
Examples
Q8NFR9:
DR   IPI; IPI00168887; -.
DR   IPI; IPI00177866; -.
DR   IPI; IPI00747706; -.
DR   IPI; IPI00789075; -.
DR   IPI; IPI00876915; -.

P03898:
DR   IPI; IPI00716083; -.

Cross-references to Bgee

Cross-references have been added to Bgee, a database for Gene Expression Evolution. Bgee allows users to retrieve and compare gene expression patterns between animal species. It maps heterogeneous expression data (currently EST, Affymetrix and in situ hybridization data) onto anatomical and developmental ontologies.

Bgee is available at http://bgee.unil.ch/bgee/bgee.

The format of the explicit links in the flat file is:

Resource abbreviation Bgee
Resource identifier UniProtKB accession number.
Examples
Q9Z351:
DR   Bgee; Q9Z351; -.

P62835:
DR   Bgee; P62835; -.

Changes in subcellular location controlled vocabulary

New subcellular locations:

  • Membrane raft
  • Perispore
  • Prospore
  • Prospore membrane
  • Spore coat
  • Spore core membrane
  • Spore cortex
  • Spore outer membrane

Deleted subcellular locations:

  • Endospore
  • Endospore coat
  • Endospore cortex
  • Endospore exosporium
  • Endospore intermembrane
  • Endospore outer membrane

Changes in PTM controlled vocabulary

New PTMs:

  • (3R,4R)-4,5-dihydroxyisoleucine
  • (3R,4S)-4-hydroxyisoleucine
  • (3S,4R)-3,4-dihydroxyisoleucine
  • 2'-cysteinyl-6'-hydroxytryptophan sulfoxide (Trp-Cys)
  • 2-oxazoline-4-carboxylic acid (Cys-Ser)
  • 3-hydroxy-5-methylproline
  • 3-hydroxyphenylalanine
  • 3-hydroxyvaline
  • 5-amino-piperideine-2,5-dicarboxylic acid (Ser-Cys) (with S-...)
  • 5-amino-piperideine-2,5-dicarboxylic acid (Ser-Ser) (with C-...)
  • 5-methoxythiazole-4-carboxylic acid (Val-Cys)
  • 5-methyloxazole (Ser-Thr)
  • 5-methylthiazole-4-carboxylic acid (Asn-Cys)
  • Cyclopeptide (Ile-Pro)
  • Cyclopeptide (Leu-Leu)
  • Decarboxylated threonine
  • N6-(ADP-ribosyl)lysine
  • O-methylthreonine
  • Pyridine-2,5-dicarboxylic acid (Ser-Cys) (with S-...)
  • Pyridine-2,5-dicarboxylic acid (Ser-Ser) (with C-...)
  • Thiazole-4-carboxylic acid (Asn-Cys)
  • Thiazole-4-carboxylic acid (Cys-Cys)
  • Thiazole-4-carboxylic acid (Ile-Cys)
  • Thiazole-4-carboxylic acid (Phe-Cys)
  • Thiazole-4-carboxylic acid (Pro-Cys)
  • Thiazole-4-carboxylic acid (Ser-Cys)
  • Thiazole-4-carboxylic acid (Thr-Cys)
  • Thiazole-4-carboxylic acid (Val-Cys)
  • Thiazoline-4-carboxylic acid (Phe-Cys)
  • Thiazoline-4-carboxylic acid (Thr-Cys)

UniProt release 14.7

Published January 20, 2009

Headlines

UniPathway, a metabolic door to UniProtKB/Swiss-Prot

Due to the importance of using standardized nomenclature, annotations in UniProtKB/Swiss-Prot are progressively moving towards structured controlled vocabularies. In this context, the UniPathway project (a collaborative project involving the SIB and INRIA) aims at providing an extra resource dedicated to the exploration of metabolism using a structured controlled vocabulary for concisely describing the role of a protein in metabolism.

The metabolism of living organisms can be understood as a network of biochemical reactions, generally catalyzed by enzymes. Dealing with this network as a whole is a complex task and a classical approach is to divide it into more manageable segments, called pathways. This approach is always somewhat arbitrary and depends upon the final usage. Usually, a first level of segmentation is achieved on the basis of biological criteria. For instance, one could consider the sub-network of all reactions involved in amino-acid biosynthesis or, more specifically, in L-lysine biosynthesis only, or even more specifically, in L-lysine biosynthesis via the AAA pathway. This results in a series of coarse- to fine-grained divisions (the coarsest is called a 'super-pathway').

Whenever possible, we further refine this first-level segmentation into a second-level one, in order to split the pathways into linear segments (i.e. sub-networks without branches) called 'sub-pathways'. Such a fine-grained segmentation allows representation of pathway variants. Indeed, depending on the organism (or set of organisms), the chemical route from one compound to another can be performed in different ways. It is important to represent these variations within the same pathway since UniProtKB covers a large number of species. In addition, it offers a convenient way to label the enzymatic reactions that constitute a metabolic pathway by their relative position ('step') in the sub-pathway.

The role of a protein in metabolism is described in the 'Pathway' subsection of the 'General annotation (Comments)' section. The syntax is 'super-pathway; pathway; sub-pathway: step n/m'. For examples of metabolic pathway annotations, see: P49367, P38998 and P11454. In this last example, the biochemical reactions of the pathway are not yet known. P11454 was therefore only annotated at the level of the pathway.

In the current version of UniProtKB/Swiss-Prot, close to 82'000 entries are annotated with the UniPathway controlled vocabulary. The UniProt web site supplies direct links to the UniPathway web server that provides more detailed information on pathways, sub-pathways and biochemical reactions.

UniProtKB News

New document on pathway controlled vocabulary

The document pathlist.txt is available by ftp and on the Web site. It describes the controlled vocabulary used in the 'Pathway' subsection of the 'General annotation (Comments)' section in the following format:

---------  -------------------------------   ----------------------------
Line code  Content                           Occurrence in an entry
---------  -------------------------------   ----------------------------
ID         Identifier                        Once; starts an entry
AC         Accession number                  Once
CL         UniPathway class                  Once
DE         Definition                        Once or more
SY         Synonym(s)                        Optional; once or more
HI         Relationship is-a                 Optional; once or more
HP         Relationship part-of              Optional; once or more
DR         Cross-reference(s)                Optional; once or more
//         Terminator                        Once; ends an entry

Example:

ID   D-alanine biosynthesis.
AC   UPA00042
CL   Pathway.
DE   Biosynthesis of D-alanine. D-alanine is used either as an energy
DE   source or as a component of bacterial cell wall, where it is directly
DE   involved in the cross-linking of adjacent peptidoglycan chains. In
DE   Gram-positive bacteria, D-alanine can also be found to variable
DE   extents in cell wall teichoic acid and lipoteichoic acid residues.
SY   D-2-aminopropionic acid biosynthesis.
HI   UPA00402; amino-acid biosynthesis.
DR   GO; GO:0030632; P:D-alanine biosynthetic process.
DR   KEGG; map00252; Alanine and aspartate metabolism.
DR   KEGG; map00473; D-Alanine metabolism.
DR   MetaCyc; ALADEG-PWY.
//

Syntax modification of the 'Pathway' subsection

We have structured the 'Pathway' subsection of the 'General annotation (Comments)' section (comment line (CC) topic PATHWAY in the flat file), using the controlled vocabulary provided by the UniPathway resource, in order to improve the consistency of annotation and to make its content easier to parse.

The new format of PATHWAY topic in the flat file is:

CC   -!- PATHWAY: Super-pathway; Pathway(; Sub-pathway: Enzymatic_reaction)?([regulation])?.
     
Where:
  • Super-pathway: Describes a class of metabolic pathways, e.g. Amino-acid biosynthesis
  • Pathway: Describes a metabolic pathway, e.g. L-lysine biosynthesis via AAA pathway
  • Sub-pathway: Describes a linear sequence of enzymatic reactions in the format:
    final_product from initial_substrate
    where final_product and initial_substrate are the labels of the corresponding chemical compounds, e.g. L-alpha-aminoadipate from 2-oxoglutarate
  • Enzymatic_reaction: Describes the enzymatic reaction catalyzed by the protein in the format:
    step n/m
    where n is the relative position of the enzymatic reaction in the sub-pathway and m is the total number of enzymatic reactions in the sub-pathway.
  • [regulation]: Indicates that a protein acts as transcriptional regulator of the genes coding for enzymes of the pathway.

Note: Perl-style multipliers indicate whether a pattern (as delimited by parentheses) is optional.

Examples:

P49367:
      CC   -!- PATHWAY: Amino-acid biosynthesis; L-lysine biosynthesis via AAA
      CC       pathway; L-alpha-aminoadipate from 2-oxoglutarate: step 2/4.
     
P0A877:
      CC   -!- PATHWAY: Amino-acid biosynthesis; L-tryptophan biosynthesis; L-
      CC       tryptophan from chorismate: step 5/5.
     
P95477:
      CC   -!- PATHWAY: Siderophore biosynthesis; pseudomonine biosynthesis.
     
P52957:
      CC   -!- PATHWAY: Mycotoxin biosynthesis; sterigmatocystin biosynthesis
      CC       [regulation].
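
The syntax above can be turned into a regular expression fairly directly. The following Python sketch is one way to do so, under stated assumptions: the names PATHWAY_RE, join_cc_lines and parse_pathway are hypothetical, the labels are assumed not to contain semicolons, and wrapped CC lines are assumed to be joined with a space unless the previous line ends with a hyphen (as in the P0A877 example).

```python
import re

# Super-pathway; Pathway(; Sub-pathway: step n/m)?( [regulation])?.
PATHWAY_RE = re.compile(
    r"^(?P<super>[^;]+); (?P<pathway>[^;]+?)"
    r"(?:; (?P<sub>.+?): step (?P<n>\d+)/(?P<m>\d+))?"
    r"(?: (?P<regulation>\[regulation\]))?\.$"
)


def join_cc_lines(lines):
    """Join the wrapped lines of one CC topic into a single string."""
    text = ""
    for line in lines:
        part = line.strip()[2:].strip()  # drop the 'CC' line code
        # Assumption: a line ending in '-' was wrapped inside a word.
        if text and not text.endswith("-"):
            text += " "
        text += part
    return text


def parse_pathway(lines):
    """Parse the CC lines of one PATHWAY topic into a dict."""
    text = join_cc_lines(lines)
    prefix = "-!- PATHWAY: "
    if not text.startswith(prefix):
        raise ValueError("not a PATHWAY topic")
    match = PATHWAY_RE.match(text[len(prefix):])
    if match is None:
        raise ValueError("unexpected PATHWAY syntax")
    n, m = match.group("n"), match.group("m")
    return {
        "super": match.group("super"),
        "pathway": match.group("pathway"),
        "sub": match.group("sub"),
        "step": (int(n), int(m)) if n is not None else None,
        "regulation": match.group("regulation") is not None,
    }


if __name__ == "__main__":
    print(parse_pathway([
        "CC   -!- PATHWAY: Amino-acid biosynthesis; L-lysine biosynthesis via AAA",
        "CC       pathway; L-alpha-aminoadipate from 2-oxoglutarate: step 2/4.",
    ]))
```

The pathway group is matched lazily so that an optional trailing [regulation] flag is captured separately instead of becoming part of the pathway label.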
     

Changes concerning keywords

New keywords:

UniProt release 14.6

Published December 16, 2008

Headlines

GeneCards: yet another means to get human gene chromosomal location

UniProtKB aims to be a central hub for biological information on proteins. While the protein sequence is described in depth at the residue level in the 'Sequence annotation (Features)' section of UniProtKB/Swiss-Prot entries, the general context in which the protein exists and functions (mostly provided in the 'General annotation (Comments)' section) is kept at a general interest level. Users interested in more detailed information are invited to deepen their knowledge by looking into the original publications (in the 'References' section) and making use of the numerous cross-references, mostly found in the 'Cross-references' section that is becoming larger and larger with each release.

In the current release, we have added cross-references to GeneCards. This database focuses on human genes. The information provided by GeneCards is automatically extracted from more than 50 databases, some of which are manually annotated, such as OMIM and UniProtKB/Swiss-Prot. While much of the information provided by GeneCards overlaps with that found in UniProtKB/Swiss-Prot, it also contains additional data which complement our annotations.

GeneCards indicates very precisely the chromosomal location of each gene, not only at the chromosome (sub)bands, but also at the level of base pairs, clearly indicating from which end of the chromosome the position is calculated (see for instance ATP10A). This type of information is not currently provided directly in UniProtKB/Swiss-Prot entries, but can be accessed through links to other databases, such as Ensembl and now GeneCards. Note, however, that we provide a complete list of all human proteins, chromosome by chromosome, on the 'human-centric' page on the ExPASy server. For each chromosome, the list can be downloaded from the UniProt ftp site (see for instance all proteins encoded on chromosome 1).

UniProtKB News

Cross-references to GeneCards

Cross-references have been added to GeneCards. GeneCards is a searchable, integrated database of human genes that provides concise genomic, proteomic, transcriptomic, genetic and functional information on all known and predicted human genes.

GeneCards is available at http://www.genecards.org/.

The format of the explicit links in the flat file is:

Resource abbreviation GeneCards
Resource identifier GeneCards unique identifier.
Examples
Q6PCB8:
DR   GeneCards; GC05M049731; -.

P69905:
DR   GeneCards; GC16P000162; -.
DR   GeneCards; GC16P000166; -.

Cross-references to PRIDE

Cross-references have been added to the PRIDE (PRoteomics IDEntifications) database. PRIDE is a centralized, standards-compliant, public data repository for proteomics data.

PRIDE is available at http://www.ebi.ac.uk/pride/.

The format of the explicit links in the flat file is:

Resource abbreviation PRIDE
Resource identifier UniProtKB accession number.
Examples
Q9Y5P4:
DR   PRIDE; Q9Y5P4; -.

P25296:
DR   PRIDE; P25296; -.

Changes concerning cross-references to IntAct

We have changed the format of the cross-reference lines to IntAct to add the number of interactions.

Optional information 1 Number of interactions.
Example
O01802:
DR   IntAct; O01802; 12.

UniProt release 14.5

Published November 25, 2008

Headlines

The plastid: the most important organelle!

The world is full of plastids. Most of us know the green photosynthetic chloroplast which houses the machinery that fixes CO2 (with O2 as a "mere" by-product) and synthesizes sugars, lipids, amino acids, etc.; in short, the basis of our food chain. Found in plants and algae, chloroplasts are absolutely essential to life as we know it.

Plastids contain DNA; they are the remnants of a cyanobacterium that was engulfed by a eukaryotic heterotroph which had previously engulfed an alphaproteobacterium that eventually became the mitochondrion. These are primary endosymbiotic events; the organism that was taken up by the host was not digested but survived in the cytoplasm, eventually transferring genes to the host nucleus and being in effect enslaved. Most of these transferred gene products are imported back into their respective organelles using transit peptides. Plastid genomes now contain between 28 and 250 protein-coding genes. The primary plastid endosymbiosis gave rise to 3 lineages: green algae, red algae and the glaucophytes. Subsequent engulfment of green or red algae by other eukaryotes has given rise to secondary endosymbionts, which in some cases have been engulfed again, sometimes with plastid replacement, to give an array of tertiary endosymbionts. These secondary and tertiary events gave rise to (among others) cryptophytes, diatoms, heterokont algae and apicomplexa, the last of which are organisms such as Plasmodium that are no longer photosynthetic. To further complicate matters, it was thought that there were only 2 primary endosymbiotic events; recent work, however, on a thecate amoeba, Paulinella chromatophora, has cast doubt on this assumption.

Due to their small size, plastids are easily sequenced. A list of fully sequenced plastid genomes, their genes and the nomenclature of known plastid-encoded proteins can be found in our document plastid.txt.

In UniProtKB, we indicate whether a protein is encoded by plastid, mitochondrial or plasmid DNA in the 'Names and origin' section, 'Encoded on' subsection (OG line in the flat file). 6 categories have been created for plastids:

  • 'Plastid; Chloroplast' indicates the organism is photosynthetic, whether of primary, secondary or higher endosymbiotic events.
  • 'Plastid; Non-photosynthetic plastid' is used when the organism is from a photosynthetic lineage but genetically unable to photosynthesize, as happens with some parasitic plants (Epifagus virginiana, Aneura mirabilis), a parasitic "green" alga (Helicosporidium sp. subsp. Simulium jonesii) and a euglenoid (Astasia longa).
  • 'Plastid; Cyanelle' is used for the plastid of the glaucophyte algae. It has the remnants of a cell wall between its surrounding membranes.
  • 'Plastid; Apicoplast' is used for plastids from the non-photosynthetic apicomplexan parasites such as Plasmodium, Toxoplasma and Eimeria, which cause malaria, toxoplasmosis and coccidian diseases respectively. Although the plastid remnant has a reduced coding capacity, it is essential for cell survival and is interesting as a drug target.
  • 'Plastid; Chromatophore' is used for the plastid of the thecate amoeba Paulinella chromatophora, which has a very large endosymbiont genome (1.0 Mb, encoding almost 900 proteins).
  • 'Plastid' (without any qualifier) is used for some parasitic plants (mostly from the genus Cuscuta) which may be briefly photosynthetic when very young.

Currently, in UniProtKB/Swiss-Prot, there are close to 11'000 entries encoded by a plastid genome; 10'130 by chloroplasts, 145 by cyanelles, 142 by non-photosynthetic plastids, 18 by apicoplasts, 22 by chromatophores and 165 by unspecified types of plastids.

UniProtKB News

Changes concerning keywords

New keywords:

Modified keyword:

Deleted keyword:

  • Structural protein

Website News

New UniParc query field 'isoform'

The existing query field uniprot allows you to search UniParc for the canonical sequence of a UniProtKB entry, e.g. uniprot:P00750. With the new query field isoform you can retrieve the UniParc record that corresponds to the sequence of a specific UniProtKB isoform, e.g. isoform:P00750-2 or you can retrieve all isoforms of a UniProtKB entry, e.g. isoform:P00750-*.

This can also be done with the website's toolbar:

  1. Select Search in: Sequence Archive (UniParc)
  2. Click on Fields » to open the query builder
  3. Select Field: UniProtKB isoform ID
  4. Type the identifier, e.g. P00750-2
  5. Click on Add & Search

Programmatic search for UniRef and UniParc identifiers in UniProtKB

The URLs to search for UniRef and UniParc identifiers in UniProtKB are going to change in the following way:

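-------  ----------------------------  ------------------------------
         Valid until release 14.6      Valid from release 14.5
-------  ----------------------------  ------------------------------
UniRef   cluster:*                     cluster:(*)
         e.g. cluster:UniRef50_Q8WZ42  e.g. cluster:(UniRef50_Q8WZ42)
UniParc  sequence:*                    sequence:(*)
         e.g. sequence:UPI0000D7E631   e.g. sequence:(UPI0000D7E631)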

Please change your queries before release 14.6 by adding parentheses around the identifier.
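
For scripts that build such queries, the change could be applied mechanically. The following Python sketch shows one way to add the parentheses to old-style queries; the function name add_parentheses is hypothetical, and the sketch assumes that identifiers consist only of letters, digits and underscores.

```python
import re

# Matches cluster:ID or sequence:ID when ID is not already parenthesized.
_OLD_STYLE = re.compile(r"\b(cluster|sequence):(?!\()(\w+)")


def add_parentheses(query):
    """Rewrite cluster:ID and sequence:ID as cluster:(ID) and sequence:(ID)."""
    return _OLD_STYLE.sub(r"\1:(\2)", query)


if __name__ == "__main__":
    print(add_parentheses("cluster:UniRef50_Q8WZ42"))
    print(add_parentheses("sequence:UPI0000D7E631"))
```

Queries that already use the new form are left unchanged, so the function can be applied more than once without harm.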

The web interface for searching UniRef and UniParc identifiers in UniProtKB remains unchanged:

  1. Select Search in: Protein Knowledgebase (UniProtKB)
  2. Click on Fields » to open the query builder
  3. Select Field: UniRef ID (or UniParc ID)
  4. Type the identifier, e.g. UniRef50_Q8WZ42
  5. Click on Add & Search

UniProt release 14.4

Published November 4, 2008

Headlines

One thousand legs and a few toxins

Have you ever faced an elongated and dorso-ventrally flattened arthropod? If yes, it could have been a scolopendra or one of its cousins of the "numberless feet" family, i.e. the Myriapoda subphylum. If you were lucky enough not to be stung, you avoided intense local or radiating pain, redness, edema, local hyperthermia, superficial necrosis, or even systemic symptoms such as nausea, emesis, sudoresis, anxiety and depression.

What is the cause of these symptoms? Information about scolopendra venom composition is very limited, probably due to the lack of severe systemic symptoms and fatalities in adults. However, in 2007, a group of researchers studied the neglected group of scolopenders (see Rates et al., 2007), using a structure-to-function proteomic approach in order to better understand the complexity of the venoms of two Brazilian scolopendra species: Scolopendra viridicornis nigra and Scolopendra angulata. 23 proteins have been characterized and their N-termini sequenced. As of this release, they are all available in UniProtKB/Swiss-Prot.

UniProtKB News

Cross-references to NextBio

Cross-references have been added to NextBio. NextBio is a life science search engine that enables researchers and clinicians to access and understand the world's life sciences information. NextBio contains amongst other things gene-centric data for human, mouse, rat, fly, worm and yeast.

NextBio is available at http://www.nextbio.com/.

The format of the explicit links in the flat file is:

Resource abbreviation NextBio
Resource identifier NextBio unique identifier.
Examples
O95793:
DR   NextBio; 26468; -.

P55002:
DR   NextBio; 291402; -.

Cross-references to Xenbase

Cross-references have been added to Xenbase, a Xenopus laevis and Xenopus tropicalis biology and genomics resource. Xenbase is a model organism database integrating a diverse array of biological and genomic data on the frogs Xenopus laevis and Xenopus (Silurana) tropicalis. Data is collected from other databases, high-throughput screens and scientific literature and is integrated into a number of database modules covering subjects such as community, literature, gene and genomic analysis.

The Xenbase resource is available at http://www.xenbase.org/.

The format of the explicit links in the flat file is:

Resource abbreviation Xenbase
Resource identifier Xenbase accession number.
Optional information 1 Gene name.
Examples
Q7ZXH3:
DR   Xenbase; XB-FEAT-942651; map3k7ip3.

P02281:
DR   Xenbase; XB-FEAT-5722946; -.
DR   Xenbase; XB-FEAT-5717970; hist1h2bj.
DR   Xenbase; XB-FEAT-5719554; hist1h2bk.

Changes concerning keywords

New keyword:

UniProt release 14.3

Published October 14, 2008

Headlines

The SIB Swiss Institute of Bioinformatics celebrates its 10th anniversary

The SIB Swiss Institute of Bioinformatics, one of the 3 founder members of the UniProt Consortium, was established 10 years ago, on 30th March 1998, thanks to the enthusiasm and dedication of a small number of outstanding Swiss scientists. In these past 10 years, the SIB has evolved into a federation of 25 research and service groups based in 5 locations in Switzerland: Basel, Berne, Geneva, Lausanne and Zurich. It comprises a total of close to 300 members affiliated to the best universities and institutes of Switzerland.

The SIB has 3 main missions: research, services and training. It develops and maintains databases, such as UniProtKB/Swiss-Prot (in collaboration with the EBI and PIR), PROSITE, SWISS-2DPAGE, CleanEx, SWISS-MODEL Repository and STRING (in collaboration with the EMBL). It also creates and supplies software for the global life science research community, such as Melanie, MSight and SWISS-MODEL. It manages several bioinformatics core facilities that provide informatics and statistical support, services or advice to life scientists, thus enabling them to conduct their research projects and analyse the resulting data. The SIB is also responsible for a number of bioinformatics courses, which are part of the undergraduate curriculum of Swiss universities, as well as a Doctoral School open to graduate students.

The SIB 10th anniversary was celebrated throughout the year with various events across Switzerland, such as conferences and exhibitions for the public at large. However, a landmark was reached on September 24th with a one-day conference, followed by a gala dinner peppered with music and speeches. Last, but not least, Zoltán Kutalik was awarded the first annual SIB Young Bioinformatician Award. The award is for the "Ping Pong" code, published in Nature Biotechnology (May 2008), which makes it possible to virtually check human cell lines for their sensitivity to thousands of drugs.

Happy anniversary SIB and many happy returns!

UniProt release 14.2

Published September 23, 2008

Headlines

Additional bibliography information in UniProtKB

As a comprehensive and high-quality resource of protein sequence and functional information, UniProtKB strives to provide comprehensive literature citations associated with protein sequences and their characterization. Currently about two thirds of the UniProtKB PubMed citations are found in UniProtKB/Swiss-Prot, as a result of active integration in the course of manual curation.

In order to keep up with the explosive growth of literature and to give our users access to additional publications, we decided to integrate additional sources of literature from other annotated databases into UniProtKB. For this purpose we selected 5 external databases: Entrez Gene (GeneRIFs), SGD, MGI, GAD and PDB, and extracted citations that were mapped to UniProtKB entries. This additional bibliography is available from the 'References' section by clicking on 'Additional computationally mapped references'.

This procedure allowed the addition of about 283'000 PubMed citations in close to 110'000 UniProtKB entries. 85% of these references did not exist previously in UniProtKB.

In the future, we plan to apply this pipeline to more databases that could be used as sources of protein bibliography, including model organism databases, such as FlyBase and WormBase. We believe this additional protein bibliography information will allow our users to better explore the existing knowledge of their proteins of interest.

UniProt release 14.1

Published September 2, 2008

Headlines

First draft of the complete human proteome available in UniProtKB/Swiss-Prot

The UniProt consortium is pleased to announce that a manually annotated representation of all the currently known human protein-coding genes is available in this release of UniProtKB/Swiss-Prot. This represents 20,325 entries. More than a third of these contain additional sequences representing isoforms generated by alternative splicing, alternative promoter usage and/or alternative translation initiation, resulting in close to 34,000 human protein sequences. Approximately 46,000 single amino acid polymorphisms (SAPs), mostly disease-linked, are also described, as well as 60,000 post-translational modifications (PTMs) (for additional statistics, click here).

It is not the first time that UniProtKB/Swiss-Prot has provided a fully annotated proteome for a model organism (for example E.coli or S.cerevisiae) and there are many more planned in the near and more distant future (A.thaliana, B.subtilis, D.discoideum, mouse, rice, S.aureus, S.pombe, etc). But we do not expect that there will ever be anything as important as this proteome. For the first time, we can present to the life sciences community a clean set of what we believe to be a full (although still imperfect!) representation of human proteins. It is the ultimate goal of the life sciences to fully understand Homo sapiens at the molecular level and we hope this set will significantly contribute to this extraordinary adventure.

There are still many challenging tasks in front of us. We will create entries for newly discovered human proteins, review and update the existing set, increase the number of splice variants, explore the full range of PTMs and continue to build a comprehensive view of protein variation in the human population. The characterization at the molecular level will need to be placed in its physiological context: subcellular location, tissue expression, protein/protein interaction, etc. And last but not least, we all want to understand the role of all these actors of our life processes.

The way is paved, but the road will be long before we fully understand life at a molecular level.

UniProtKB News

Changes concerning keywords

New keywords:

Changes in subcellular location controlled vocabulary

New subcellular locations:

  • Bacterial flagellum filament
  • Bacterial flagellum hook
  • Chromatophore inner membrane
  • Chromatophore intermembrane space
  • Chromatophore outer membrane

UniProt release 14.0

Published July 22, 2008

Headlines

A UniProtKB major release (14.0) on our brand-new UniProt website

After almost one year of beta testing, the UniProt consortium is proud to announce the release of its new official unified website: a new interface, a new search engine and many new options to serve you better. The content of the various databases we provide is unchanged, apart from the improvements we carry out with each new release, as we have done since the very beginning of our existence, some 20 years ago in the case of our oldest database, UniProtKB/Swiss-Prot. Many documents are available on the Documentation/help page, including FAQs. However, don't hesitate to contact us with any further questions, remarks or update requests.

UniProt Knowledgebase release 14.0 includes Swiss-Prot release 56.0 and TrEMBL release 39.0.

Release 56.0 of 22-Jul-08 of UniProtKB/Swiss-Prot contains 392'667 sequence entries, comprising 141'217'034 amino acids abstracted from 172'036 references. 36'631 sequences have been added since release 55.0, the sequence data of 605 existing entries has been updated and the annotations of 356'036 entries have been revised.

The following improvements were carried out in the last 5 months:

  • We have structured the UniProtKB DE lines. The new format includes 3 categories:
    • 'RecName' is the protein name recommended by the UniProt Consortium,
    • 'AltName' represents synonyms found in the literature or in other databases,
    • 'SubName' is the name provided by the submitters of the underlying nucleotide sequence. It is found in UniProtKB/TrEMBL only.
    Three subcategories allow the fine tuning of the nomenclature:
    • Abbreviations and acronyms are available in the 'Short' subcategory,
    • WHO INN (International Nonproprietary Names) are found in the 'INN' subcategory,
    • EC (Enzyme nomenclature) numbers are located in the 'EC' subcategory.
    Each block of DE lines may also contain the sections: 'Includes' or 'Contains' and the field 'Flags' which indicates, for instance, whether the sequence shown is a fragment and/or a precursor.
  • We have changed the FASTA headers in order to make them compatible with the -o option of the NCBI's program formatdb.
  • We have added a new term to the list of valid plastid values in the OG line: Chromatophore.
  • We have enriched the controlled vocabulary for post-translational modification descriptions with 20 new terms: 12 for the feature key 'CROSSLNK' and 8 for the feature key 'MOD_RES'.
  • We have added cross-references to 7 new databases, bringing the total number of explicit cross-references to 102: AGRICOLA, Candida Genome Database (CGD), HOGENOM, HOVERGEN, NMPDR, ProMEX, and BindingDB.
  • We have removed cross-references to HIV and TRANSFAC.
  • We have added 22 new keywords and 2 have been deleted. 12 of the newly created keywords are medical ones.
  • We have added 1 new document:

UniProtKB News

Change of the protein names section (DE line in the flat file)

Up to now, the UniProtKB protein names section (DE lines in the flat file) listed protein names in a computer-parsable format, but with a minimal amount of structure. In UniProtKB/Swiss-Prot, the description started with the recommended name of the protein, and additional alternative names were indicated between parentheses. In UniProtKB/TrEMBL, the description is derived directly from the underlying nucleotide entry and its accuracy relies on the information provided by the submitter of the nucleotide entry, unless it has been improved by automatic annotation procedures.

Consistent nomenclature is indispensable for communication, literature searching and entry retrieval. The protein names provided in the description lines of UniProtKB/Swiss-Prot are widely used by life scientists and often propagated during the annotation of new genomic sequences. For these reasons we have structured the UniProtKB DE lines more explicitly: we introduced 3 categories, as well as several subcategories, of protein names:

---------  -----------  -------------------------  -----------------------------------------------
Category   Subcategory  Cardinality                Description
field      field
---------  -----------  -------------------------  -----------------------------------------------
RecName:                1 in UniProtKB/Swiss-Prot  The name recommended by the UniProt consortium.
                        0-1 in UniProtKB/TrEMBL
           Full=        1                          The full name.
           Short=       0-n                        An abbreviation of the full name or an acronym.
           EC=          0-n                        An Enzyme Commission number.
AltName:                0-n                        A synonym of the recommended name.
           Full=        0-1                        The full name.
           Short=       0-n                        An abbreviation of the full name or an acronym.
           EC=          0-n                        An Enzyme Commission number.
AltName:   Allergen=    0-1                        See allergen.txt.
AltName:   Biotech=     0-1                        A name used in a biotechnological context.
AltName:   CD_antigen=  0-n                        See cdlist.txt.
AltName:   INN=         0-n                        The international nonproprietary name: A
                                                   generic name for a pharmaceutical substance or
                                                   active pharmaceutical ingredient that is
                                                   globally recognized and is a public property.
SubName:                0 in UniProtKB/Swiss-Prot  A name provided by the submitter of the
                        0-n in UniProtKB/TrEMBL    underlying nucleotide sequence.
           Full=        1                          The full name.
           EC=          0-n                        An Enzyme Commission number.

Each name is shown on a separate line; lines may therefore exceed 75 characters.

A block of DE lines may further contain multiple Includes: and/or Contains: sections and a separate field Flags: to indicate whether the protein sequence is a precursor or a fragment:

Field       Cardinality   Value
Includes:   0-n           A block of protein names as described in the table above.
Contains:   0-n           A block of protein names as described in the table above.
Flags:      0-1           Precursor and/or Fragment or Fragments

Examples:

P09919:

Previous format:

DE   Granulocyte colony-stimulating factor precursor (G-CSF) (Pluripoietin)
DE   (Filgrastim) (Lenograstim).

New format:

DE   RecName: Full=Granulocyte colony-stimulating factor;
DE            Short=G-CSF;
DE   AltName: Full=Pluripoietin;
DE   AltName: INN=Filgrastim;
DE   AltName: INN=Lenograstim;
DE   Flags: Precursor;

Q10743:

Previous format:

DE   ADAM 10 precursor (EC 3.4.24.81) (A disintegrin and metalloproteinase
DE   domain 10) (Mammalian disintegrin-metalloprotease) (Kuzbanian protein
DE   homolog) (CD156c antigen) (Fragment).

New format:

DE   RecName: Full=ADAM 10;
DE            EC=3.4.24.81;
DE   AltName: Full=A disintegrin and metalloproteinase domain 10;
DE   AltName: Full=Mammalian disintegrin-metalloprotease;
DE   AltName: Full=Kuzbanian protein homolog;
DE   AltName: CD_antigen=CD156c;
DE   Flags: Precursor; Fragment;

Q07908:

Previous format:

DE   Arginine biosynthesis bifunctional protein argJ [Includes: Glutamate
DE   N-acetyltransferase (EC 2.3.1.35) (Ornithine acetyltransferase)
DE   (Ornithine transacetylase) (OATase); Amino-acid acetyltransferase
DE   (EC 2.3.1.1) (N-acetylglutamate synthase) (AGS)] [Contains: Arginine
DE   biosynthesis bifunctional protein argJ alpha chain; Arginine
DE   biosynthesis bifunctional protein argJ beta chain].

New format:

DE   RecName: Full=Arginine biosynthesis bifunctional protein argJ;
DE   Includes:
DE     RecName: Full=Glutamate N-acetyltransferase;
DE              EC=2.3.1.35;
DE     AltName: Full=Ornithine acetyltransferase;
DE              Short=OATase;
DE     AltName: Full=Ornithine transacetylase;
DE   Includes:
DE     RecName: Full=Amino-acid acetyltransferase;
DE              EC=2.3.1.1;
DE     AltName: Full=N-acetylglutamate synthase;
DE              Short=AGS;
DE   Contains:
DE     RecName: Full=Arginine biosynthesis bifunctional protein argJ alpha chain;
DE   Contains:
DE     RecName: Full=Arginine biosynthesis bifunctional protein argJ beta chain;
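
As a rough illustration of how the structured DE block might be consumed, the following Python sketch groups the names by section. The function name parse_de_block and the layout of the returned dictionary are assumptions made for this example; they are not part of any UniProt tool.

```python
import re


def parse_de_block(lines):
    """Group structured DE lines by section (hypothetical helper, not a UniProt API)."""
    result = {"main": {"names": []}, "Includes": [], "Contains": [], "Flags": []}
    section = result["main"]
    current = None  # name entry that receives continuation fields such as Short= or EC=
    for line in lines:
        if not line.startswith("DE"):
            continue
        text = line[2:].strip()
        if text in ("Includes:", "Contains:"):
            # Start a new sub-block of names.
            section = {"names": []}
            result[text[:-1]].append(section)
            current = None
            continue
        if text.startswith("Flags:"):
            flags = text[len("Flags:"):].split(";")
            result["Flags"] = [flag.strip() for flag in flags if flag.strip()]
            continue
        match = re.match(r"(RecName|AltName|SubName):\s*(.*)", text)
        if match:
            current = {"category": match.group(1), "fields": []}
            section["names"].append(current)
            text = match.group(2)
        key, _, value = text.partition("=")
        if current is not None and value:
            current["fields"].append((key.strip(), value.rstrip(";").strip()))
    return result


# A subset of the Q10743 lines shown above.
example = [
    "DE   RecName: Full=ADAM 10;",
    "DE            EC=3.4.24.81;",
    "DE   AltName: Full=A disintegrin and metalloproteinase domain 10;",
    "DE   AltName: CD_antigen=CD156c;",
    "DE   Flags: Precursor; Fragment;",
]
parsed = parse_de_block(example)
print(parsed["main"]["names"][0])
print(parsed["Flags"])
```

The sketch relies on each name occupying its own line, which is what makes the new format easier to parse than the previous parenthesised one.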

Changes in the FASTA header line

The UniProtKB FASTA headers were unfortunately incompatible with the -o option of the NCBI's program formatdb. We have been working with the NCBI to remedy this and changes were required on both sides. The new version of formatdb now accepts a database code for UniProtKB/TrEMBL, and we have modified our UniProtKB FASTA headers accordingly. For consistency reasons, we also changed the FASTA headers of the other UniProt databases.

UniProtKB

>db|UniqueIdentifier|EntryName ProteinName OS=OrganismName[ GN=GeneName] PE=ProteinExistence SV=SequenceVersion
Where:
  • db is 'sp' for UniProtKB/Swiss-Prot and 'tr' for UniProtKB/TrEMBL.
  • UniqueIdentifier is the primary accession number of the UniProtKB entry.
  • EntryName is the entry name of the UniProtKB entry.
  • ProteinName is the recommended name of the UniProtKB entry as annotated in the RecName field from release 14.0 on. For UniProtKB/TrEMBL entries without a RecName field, the SubName field is used. The 'precursor' attribute is excluded, 'Fragment' is included with the name if applicable.
  • OrganismName is the scientific name of the organism of the UniProtKB entry.
  • GeneName is the first gene name of the UniProtKB entry. If there is no gene name, OrderedLocusName or ORFname, the GN field is not listed.
  • ProteinExistence is the numerical value describing the evidence for the existence of the protein.
  • SequenceVersion is the version number of the sequence.

Examples:

>sp|Q8I6R7|ACN2_ACAGO Acanthoscurrin-2 (Fragment) OS=Acanthoscurria gomesiana GN=acantho2 PE=1 SV=1
>sp|P27748|ACOX_RALEH Acetoin catabolism protein X OS=Ralstonia eutropha (strain ATCC 17699 / H16 / DSM 428 / Stanier 337) GN=acoX PE=4 SV=2
>sp|P04224|HA22_MOUSE H-2 class II histocompatibility antigen, E-K alpha chain OS=Mus musculus PE=1 SV=1

>tr|A3SA23|A3SA23_9RHOB TonB dependent, hydroxamate-type ferrisiderophore, outer membrane receptor OS=Sulfitobacter sp. EE-36 GN=EE36_08023 PE=3 SV=1
>tr|Q8N2H2|Q8N2H2_HUMAN CDNA FLJ90785 fis, clone THYRO1001457, moderately similar to H.sapiens protein kinase C mu OS=Homo sapiens PE=2 SV=1

Alternative isoforms (this only applies to UniProtKB/Swiss-Prot):
>sp|IsoID|EntryName Isoform IsoformName of ProteinName OS=OrganismName[ GN=GeneName]
Where:
  • IsoID is the isoform identifier as assigned in the ALTERNATIVE PRODUCTS section of the UniProtKB entry.
  • IsoformName is the isoform name as annotated in the ALTERNATIVE PRODUCTS Name field of the UniProtKB entry.
ProteinExistence and SequenceVersion do not apply to alternative isoforms (ProteinExistence is dependent on the number of cDNA sequences, which is not known for individual isoforms).

Example:

>sp|Q4R572-2|1433B_MACFA Isoform Short of 14-3-3 protein beta/alpha OS=Macaca fascicularis GN=YWHAB
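
The UniProtKB header layout described above can be read with a single regular expression. The following Python sketch is one possible way to do so; the names HEADER_RE and parse_uniprotkb_header are hypothetical, and the pattern assumes that gene names contain no spaces and that only the fields listed above are present.

```python
import re

# PE= and SV= are optional because isoform headers omit them.
HEADER_RE = re.compile(
    r"^>(?P<db>sp|tr)\|(?P<identifier>[^|]+)\|(?P<entry_name>\S+)\s+"
    r"(?P<protein_name>.+?)\s+OS=(?P<organism>.+?)"
    r"(?:\s+GN=(?P<gene>\S+))?"
    r"(?:\s+PE=(?P<existence>\d+))?"
    r"(?:\s+SV=(?P<version>\d+))?$"
)


def parse_uniprotkb_header(header):
    """Return the header fields as a dict; absent optional fields are None."""
    match = HEADER_RE.match(header.strip())
    if match is None:
        raise ValueError(f"not a UniProtKB FASTA header: {header!r}")
    return match.groupdict()


canonical = parse_uniprotkb_header(
    ">sp|P04224|HA22_MOUSE H-2 class II histocompatibility antigen, E-K alpha chain "
    "OS=Mus musculus PE=1 SV=1"
)
isoform = parse_uniprotkb_header(
    ">sp|Q4R572-2|1433B_MACFA Isoform Short of 14-3-3 protein beta/alpha "
    "OS=Macaca fascicularis GN=YWHAB"
)
print(canonical["organism"], canonical["gene"])
print(isoform["identifier"], isoform["version"])
```

All values are returned as strings, so callers that need numeric protein existence or sequence version values would have to convert them.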

UniRef

>UniqueIdentifier ClusterName n=Members Tax=Taxon RepID=RepresentativeMember
Where:
  • UniqueIdentifier is the primary accession number of the UniRef cluster.
  • ClusterName is the name of the UniRef cluster.
  • Members is the number of UniRef cluster members.
  • Taxon is the scientific name of the lowest common taxon shared by all UniRef cluster members.
  • RepresentativeMember is the entry name of the representative member of the UniRef cluster.

Example:

>UniRef100_A5DI11 Elongation factor 2 n=1 Tax=Pichia guilliermondii RepID=EF2_PICGU
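
A UniRef header can be split into its fields in the same spirit. The sketch below is only an illustration: UNIREF_RE and parse_uniref_header are hypothetical names, and the pattern assumes that the cluster name never contains the text " n=" followed by a number.

```python
import re

UNIREF_RE = re.compile(
    r"^>(?P<identifier>\S+)\s+(?P<cluster_name>.+?)\s+n=(?P<members>\d+)"
    r"\s+Tax=(?P<taxon>.+?)\s+RepID=(?P<representative>\S+)$"
)


def parse_uniref_header(header):
    """Return the UniRef header fields, with the member count as an int."""
    match = UNIREF_RE.match(header.strip())
    if match is None:
        raise ValueError(f"not a UniRef FASTA header: {header!r}")
    fields = match.groupdict()
    fields["members"] = int(fields["members"])
    return fields


cluster = parse_uniref_header(
    ">UniRef100_A5DI11 Elongation factor 2 n=1 Tax=Pichia guilliermondii RepID=EF2_PICGU"
)
print(cluster["cluster_name"], cluster["members"], cluster["taxon"])
```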

UniParc

>UniqueIdentifier status=Status
Where:
  • UniqueIdentifier is the primary accession number of the UniParc entry.
  • Status is 'active' if the UniParc entry has at least one active cross-reference, and 'inactive' if it does not have any active cross-references.

Example:

>UPI0000000005 status=active

UniMES

>UniqueIdentifier ProteinName OS=OrganismName[ Pep=SourcePeptideIdentifier] SV=SequenceVersion
Where:
  • UniqueIdentifier is the primary accession number of the UniMES entry.
  • ProteinName is the protein name of the UniMES entry.
  • OrganismName is the scientific name of the organism (group) of the UniMES entry.
  • SourcePeptideIdentifier is the (optional) peptide identifier provided by the submitter.
  • SequenceVersion is the version number of the sequence.

Example:

>MES00000000005 Putative uncharacterized protein GOS_3018412 (Fragment) OS=marine metagenome Pep=JCVI_PEP_1096688850003 SV=1

Archived UniProtKB sequence versions

>db|UniqueIdentifier archived from Release ReleaseNumber ReleaseDate SV=SequenceVersion
Where:
  • db is 'sp' for UniProtKB/Swiss-Prot and 'tr' for UniProtKB/TrEMBL.
  • UniqueIdentifier is the primary accession number of the UniProtKB entry.
  • ReleaseNumber refers to the release from which the sequence was archived (Swiss-Prot or TrEMBL release numbers for releases prior to the first UniProt release, and both UniProt and Swiss-Prot or TrEMBL release numbers for releases after the first UniProt release).
  • ReleaseDate is the date of the release from which the sequence was archived.
  • SequenceVersion is the version number of the sequence.

Examples:

"pre-UniProt":
>sp|P05067 archived from Release 18.0 01-MAY-1991 SV=3
>tr|Q55167 archived from Release 17.0 01-JUN-2001 SV=1
"post-UniProt":
>sp|P05067 archived from Release 9.2/51.2 28-NOV-2006 SV=3
>tr|A0RTJ8 archived from Release 11.0/36.0 29-MAY-2007 SV=1
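
The two release-number styles can be told apart by the presence of a slash. The following Python sketch, with the hypothetical names ARCHIVED_RE and parse_archived_header, assumes (as the examples above suggest) that the UniProt release number comes first and the Swiss-Prot or TrEMBL release number second.

```python
import re
from datetime import date

MONTHS = {name: number for number, name in enumerate(
    ["JAN", "FEB", "MAR", "APR", "MAY", "JUN",
     "JUL", "AUG", "SEP", "OCT", "NOV", "DEC"], start=1)}

ARCHIVED_RE = re.compile(
    r"^>(?P<db>sp|tr)\|(?P<identifier>\S+) archived from Release "
    r"(?P<release>\S+) (?P<day>\d{2})-(?P<month>[A-Z]{3})-(?P<year>\d{4}) "
    r"SV=(?P<version>\d+)$"
)


def parse_archived_header(header):
    """Split an archived-sequence header into its fields."""
    match = ARCHIVED_RE.match(header.strip())
    if match is None:
        raise ValueError(f"not an archived UniProtKB header: {header!r}")
    uniprot_release, slash, member_release = match.group("release").partition("/")
    if not slash:
        # "pre-UniProt" style: a single Swiss-Prot or TrEMBL release number.
        uniprot_release, member_release = None, match.group("release")
    return {
        "db": match.group("db"),
        "identifier": match.group("identifier"),
        "uniprot_release": uniprot_release,
        "member_release": member_release,
        "release_date": date(int(match.group("year")),
                             MONTHS[match.group("month")],
                             int(match.group("day"))),
        "version": int(match.group("version")),
    }


old = parse_archived_header(">tr|Q55167 archived from Release 17.0 01-JUN-2001 SV=1")
new = parse_archived_header(">sp|P05067 archived from Release 9.2/51.2 28-NOV-2006 SV=3")
print(old["uniprot_release"], old["member_release"], old["release_date"])
print(new["uniprot_release"], new["member_release"], new["release_date"])
```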

Chromatophore: a new organelle value in the 'Encoded on' subsection (OG line in the flat file).

We have added Chromatophore to the list of valid plastid values in the OG (OrGanelle) line in the flat file. The chromatophore is the photosynthetic inclusion found in Paulinella chromatophora, a photosynthetic thecate amoeba. It encodes and houses the machinery necessary for photosynthesis and CO2 fixation; it also has the genetic capacity to synthesize some amino acids, some fatty acids and a few cofactors. It is not yet clear whether the chromatophore derives from the same endosymbiotic event that is thought to have led to all other plastids. The chromatophore genome of P. chromatophora has been sequenced (PubMed:18356055) and been found to be just over 1 Mb, approximately 9 times larger than the average photosynthetic plastid and approximately 1/3 smaller than the smallest cyanobacterial genome.

Example:

OG   Plastid; Chromatophore.

On the website, the information is found in the 'Names and origin' section, 'Encoded on' subsection (see for instance B1X5D6).

Cross-references to BindingDB

Cross-references have been added to The Binding Database. BindingDB is a public, web-accessible database of measured binding affinities, focusing chiefly on the interactions of proteins considered to be drug-targets with small, drug-like molecules.

The Binding Database is available at http://www.bindingdb.org/.

The format of the explicit links in the flat file is:

Resource abbreviation BindingDB
Resource identifier UniProtKB accession number.
Examples
P50613:
DR   BindingDB; P50613; -.

P68850:
DR   BindingDB; P68850; -.
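
For readers who process the flat file programmatically, a DR line of this shape can be split on its semicolons. The sketch below is a minimal illustration; parse_dr_line is a hypothetical name, and the code assumes that no field itself contains a semicolon.

```python
def parse_dr_line(line):
    """Split a DR line into database, identifier and optional fields ('-' becomes None)."""
    if not line.startswith("DR"):
        raise ValueError(f"not a DR line: {line!r}")
    body = line[2:].strip()
    if body.endswith("."):
        body = body[:-1]  # drop the terminating full stop
    database, identifier, *optional = [field.strip() for field in body.split(";")]
    return {
        "database": database,
        "identifier": identifier,
        "optional": [None if field == "-" else field for field in optional],
    }


xref = parse_dr_line("DR   BindingDB; P50613; -.")
print(xref)
```

Only the final full stop is removed, so identifiers that contain dots themselves are left intact.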

UniProt decoy databases

The target-decoy search strategy, which has become widespread and is recommended in journal guidelines, consists of attaching a decoy database to a forward database and searching MS/MS spectra against this composite database. It is more stringent than a simple search, and makes it possible to estimate the false discovery rate.
For this strategy to be efficient, the decoy database has to preserve the general composition of the target database while minimizing the peptide sequence overlap between the target and the decoy.
We developed a new algorithm that shuffles proteins and keeps re-shuffling each tryptic peptide until it no longer matches with any peptide from the original database. This method ensures that no tryptic peptide is shared between the target and decoy databases.

Decoy versions of UniProtKB/Swiss-Prot, UniProtKB/TrEMBL and UniRef100 can now be retrieved in FASTA format from our public FTP site.
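
The general idea of shuffling proteins and re-shuffling tryptic peptides can be sketched in a few lines of Python. This is not the algorithm used to build the distributed decoy databases: the function names, the cleavage rule (after K or R, but not before P), the fixed C-terminal residue and the max_tries limit are all assumptions made for this illustration, and the toy sequences are invented. Unlike the method described above, this sketch does not guarantee that no peptide is shared, so an overlap check is included.

```python
import random
import re


def tryptic_peptides(sequence):
    """Cleave after K or R unless the next residue is P (assumed digestion rule)."""
    return [p for p in re.split(r"(?<=[KR])(?!P)", sequence) if p]


def build_decoys(proteins, seed=0, max_tries=100):
    """Shuffle each protein, then re-shuffle peptides that still match a target peptide."""
    rng = random.Random(seed)
    target_peptides = {p for seq in proteins for p in tryptic_peptides(seq)}
    decoys = []
    for seq in proteins:
        residues = list(seq)
        rng.shuffle(residues)  # shuffle the whole protein first
        parts = []
        for peptide in tryptic_peptides("".join(residues)):
            # Keep the last residue in place and shuffle the rest of the peptide.
            body, last = list(peptide[:-1]), peptide[-1]
            tries = 0
            while "".join(body) + last in target_peptides and tries < max_tries:
                rng.shuffle(body)
                tries += 1
            parts.append("".join(body) + last)
        decoys.append("".join(parts))
    return decoys


def shared_peptides(proteins, decoys):
    """Return the tryptic peptides found in both the target and the decoy set."""
    targets = {p for seq in proteins for p in tryptic_peptides(seq)}
    return targets & {p for seq in decoys for p in tryptic_peptides(seq)}


targets = ["MKWVTFISLLLLFSSAYSR", "GVFRRDTHK"]  # toy sequences invented for this example
decoys = build_decoys(targets)
print(decoys)
print(shared_peptides(targets, decoys))
```

Because only the order of residues changes, each decoy keeps the length and amino acid composition of its target. Very short peptides, such as a lone K or R, cannot be changed by shuffling and may still be reported by shared_peptides.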

Changes concerning keywords

New keywords:

Deleted keywords:

  • Inner membrane
  • Outer membrane

Changes in subcellular location controlled vocabulary

New subcellular locations:

  • Chromatophore
  • Chromatophore membrane
  • Chromatophore stroma
  • Chromatophore thylakoid
  • Chromatophore thylakoid lumen
  • Chromatophore thylakoid membrane

Changes in PTM controlled vocabulary

New terms in the 'Amino acid modifications' subsection (feature key 'CROSSLNK' in the flat file):

  • (2-aminosuccinimidyl)acetic acid (Asp-Gly)
  • 5'-tyrosyl-5'-aminotyrosine (Tyr-Tyr) (interchain)
  • Cyclopeptide (Ala-Ile)
  • Cyclopeptide (Leu-Trp)
  • Cyclopeptide (Pro-Met)
  • Cyclopeptide (Pro-Tyr)
  • Cyclopeptide (Ser-Gly)
  • Isoaspartyl lysine isopeptide (Lys-Asn)
  • Lysine tyrosylquinone (Tyr-Lys)
  • S-cysteinyl 3-(oxidosulfanyl)alanine (Cys-Cys)
  • Threonyl lysine isopeptide (Lys-Thr) (interchain with T-...)
  • Threonyl lysine isopeptide (Thr-Lys) (interchain with K-...)

New terms in the 'Amino acid modifications' subsection (feature key 'MOD_RES' in the flat file):

  • Aminomalonic acid (Ser)
  • Aspartate 1-(chondroitin 4-sulfate)-ester
  • Aspartic acid 1-[(3-aminopropyl)(5'-adenosyl)phosphono]amide
  • Aspartyl isopeptide (Asp)
  • Beta-decarboxylated aspartate
  • N-D-glucuronoyl glycine
  • O-AMP-tyrosine
  • O-UMP-tyrosine

UniProt release 13.6

Published July 1, 2008

Headlines

Transient pleasures of the mind

Symmetry and round objects, including round numbers, easily fascinate the human mind. Thus, UniProtKB is happy to announce that we have a double set of round numbers to celebrate: UniProtKB/Swiss-Prot now contains over 50'000 cross-references to PDB and over 5'000 mammalian entries with experimental 3D-structures.

It is deeply satisfactory to see the 3D-structure of a protein. 3D-structures show the interactions between proteins and other macromolecules, and between proteins and small ligands, such as metal ions, substrates and inhibitors. Determining the 3D-structure is an important step for elucidating the mode of action of a well-characterized protein, and it provides a starting point for the classification of an uncharacterized protein and the prediction of its physiological role.

UniProtKB provides access to protein 3D-structures via cross-references to PDB (see for example P00734). The number of structures is constantly increasing, and quite frequently several structures have been determined for a given protein. Thus, the 50'000 cross-references to PDB in UniProtKB/Swiss-Prot correspond to more than 12'700 individual entries. Over 5'000 of these (about 40%) are from mammalian model organisms, including close to 3'300 human entries, while bacteria and archaea account for over 4'500 of the entries with links to PDB. Escherichia coli strain K12 is currently the best studied organism at the structural level, with 1'035 out of its 4'339 proteins (almost 25%) having at least one link to a PDB entry. Close to 6'000 additional links to PDB are in UniProtKB/TrEMBL, corresponding to another 3'500 entries.

Thanks to the efforts of individual laboratories and structural proteomics groups, the number of experimental 3D-structures is rapidly increasing, and so the symmetrical roundness of the present numbers is a very transient phenomenon. Soon for every new protein there may be a family member with an experimental 3D-structure, even for membrane proteins. That is definitely something to look forward to.

UniProtKB News

Cross-references to AGRICOLA

In the flat file, the RX (Reference cross-reference) line is an optional line which is used to indicate cross-references to bibliographic databases. We have introduced cross-references to AGRICOLA, the National Agricultural Library's catalog of citations to agricultural literature. The valid bibliographic database names and their associated identifiers are now:

Name       Identifier
MEDLINE    Eight-digit MEDLINE Unique Identifier (UI)
PubMed     PubMed Unique Identifier (PMID)
DOI        Digital Object Identifier (DOI)
AGRICOLA   AGRICOLA Unique Identifier

Example:

Q01901:
     RX   AGRICOLA=IND20450567;
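
An RX line can be turned into a small mapping from database name to identifier. The Python sketch below is an illustration only: RX_ITEM_RE and parse_rx_line are hypothetical names, and the second example line uses made-up identifiers. The pattern ends an identifier only at a semicolon followed by a space or the end of the line, because some older DOIs contain semicolons themselves.

```python
import re

# One item per valid bibliographic database name listed above.
RX_ITEM_RE = re.compile(r"(MEDLINE|PubMed|DOI|AGRICOLA)=(.+?);(?=\s|$)")


def parse_rx_line(line):
    """Return {database name: identifier} for one RX line."""
    line = line.strip()
    if not line.startswith("RX"):
        raise ValueError(f"not an RX line: {line!r}")
    return dict(RX_ITEM_RE.findall(line[2:]))


agricola = parse_rx_line("RX   AGRICOLA=IND20450567;")
# Made-up identifiers, used only to show a DOI that contains a semicolon.
mixed = parse_rx_line("RX   PubMed=12345678; DOI=10.1000/example;2-X;")
print(agricola)
print(mixed)
```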
    

Changes concerning keywords

New keyword:

UniProt release 13.5

Published June 10, 2008

Headlines

Over 100 cross-references in UniProtKB/Swiss-Prot

UniProtKB/Swiss-Prot was the first biomolecular database to include cross-references in its entries. As of this release, we provide our users with 101 explicit links (stored in the various distributed file formats, flat text, XML and RDF/XML) and 23 implicit links (available only from web servers, such as UniProt and ExPASy). Most cross-references can be found in the 'Cross-references' section of the entry (see for example Q9FK25), some are in the 'Sequence annotation' section (the Feature table in the flat file) (see for example cross-references to dbSNP in Q969T7). The dbxref.txt document provides a list of the databases cross-referenced in UniProtKB/Swiss-Prot. This document is available on the UniProt website and by ftp.

Additional links pepper almost every section of a UniProtKB/Swiss-Prot entry. They include cross-references to PubMed which are located in the 'References' section (see for example P0A790) and cross-references to the ENZYME database available through the EC numbers in the 'Names and origin' section (see for example Q00955). Moreover the 'Web resources' section is dedicated to databases or web pages that are specific for a single protein (see for example P04637). Note that the dbxref.txt document does not list these 'special' links.

Historically, a 'hundred' was a geographic division referring to the amount of land sufficient to sustain one hundred families. With over 120 cross-references, we hope to sustain many more research groups in quest of protein information.

UniProtKB News

Cross-references to HOGENOM

Cross-references have been added to HOGENOM, a database of homologous genes from fully sequenced organisms. HOGENOM allows one to select sets of homologous genes among species, and to visualize multiple alignments and phylogenetic trees. It is also possible to search for orthologous genes in a wide range of taxa. Thus HOGENOM is particularly useful for comparative sequence analysis, phylogeny and molecular evolution studies. More generally, HOGENOM gives an overall view of what is known about a particular gene family.

The HOGENOM database is available at http://pbil.univ-lyon1.fr/databases/hogenom.php.

The format of the explicit links in the flat file is:

Resource abbreviation HOGENOM
Resource identifier UniProtKB accession number.
Examples
P0A9I1:
DR   HOGENOM; P0A9I1; -.

P49642:
DR   HOGENOM; P49642; -.

Cross-references to HOVERGEN

Cross-references have been added to HOVERGEN, a database of homologous vertebrate genes. HOVERGEN allows one to select sets of homologous genes among vertebrate species, and to visualize multiple alignments and phylogenetic trees. Thus HOVERGEN is particularly useful for comparative sequence analysis, phylogeny and molecular evolution studies. More generally, HOVERGEN gives an overall view of what is known about a particular gene family.

The HOVERGEN database is available at http://pbil.univ-lyon1.fr/databases/hovergen.php.

The format of the explicit links in the flat file is:

Resource abbreviation HOVERGEN
Resource identifier UniProtKB accession number.
Examples
P31946:
DR   HOVERGEN; P31946; -.

Q91ZB4:
DR   HOVERGEN; Q91ZB4; -.

Changes concerning keywords

New keywords:

UniProt release 13.4

Published May 20, 2008

Headlines

Swiss-Prot in the Wonderland of protein names

Successful basic research requires various skills from scientists, not only creativity, but also precision, critical analysis of experimental results, reconsideration of the starting hypotheses, continuous controls and days, nights and weekends of - sometimes tedious - work in the lab. Thus when proteins are eventually purified, genes are cloned and a nice story is wrapped around the data, one of the rewards is to name the proteins/genes. There lies the fun.

Telling names can be useful for remembering a function or a phenotype. Interaction of Drosophila Cleopatra mutants with the asp gene product is lethal. Indeed, Cleopatra, Ancient Egypt's queen, allegedly committed suicide by way of an asp bite. Groucho mutants have more bristles than the norm on their face, much like Groucho Marx. Ken and Barbie protein mutants lack external genitalia... In Arabidopsis thaliana, Superman mutants have extra stamens (male genitals) in their flowers, and fans of the famous cartoon will not be surprised to learn that Kryptonite protein suppresses the function of Superman.

Acronyms are another part of the naming game. You would expect the RING1 protein to have a specific 3D structure related to its name, round, for instance. Actually, RING stands for "Really Interesting New Gene". In the same vein, you would not expect POSH to be any ordinary protein and yet all it contains are "Plenty Of SH3" domains! JAK1 kinase has two phosphate-transferring domains and was named after Janus, the Roman god of gates, usually depicted with two heads looking in opposite directions. However, JAK is also said to be 'Just Another Kinase', one among the hundreds of essential kinases described so far. And last, but not least, the Drosophila INDY protein refers to the movie "Monty Python and the Holy Grail", in which a live person about to be buried rightly protests: 'I'm Not Dead Yet!', which is hardly surprising since mutations in this gene result in a near doubling of the average adult life-span. For more amazing protein/gene names, see the excellent website established by Mikael Niku and Mikko Taipale.

Scientific creativity can be somewhat hampered by economic actors. The Pokemon oncogene for instance - which stands for POK erythroid myeloid ontogenic factor - had to be withdrawn after the US branch of the Japanese video-game franchise Pokémon threatened researchers with legal action. The protein ended up with the far more sober - not to say boring - name of 'Zinc finger and BTB domain-containing protein 7A' (ZBTB7A).

Much ink has been spilled over the lack of standardization of protein names. Inconsistency among orthologs, family members and so on makes the systematic search through the literature a complicated task. UniProt provides a few guidelines for protein naming. Such a document should help to improve consistency, keeping a given protein's 'hypokeimenon', while not curbing creativity!

UniProtKB News

Cross-references to CGD

Cross-references have been added to the Candida Genome Database. CGD is a resource for genomic sequence data and gene and protein information for Candida albicans. CGD is based on the Saccharomyces Genome Database and is funded by the National Institute of Dental and Craniofacial Research at the US National Institutes of Health.

The Candida Genome Database is available at http://www.candidagenome.org/.

The format of the explicit links in the flat file is:

Resource abbreviation CGD
Resource identifier CGD identifier.
Optional information 1 Gene name.
Examples
O74198:
DR   CGD; CAL0006397; ERG6.

Q59TD3:
DR   CGD; CAL0079252; MED8.

Changes concerning keywords

New keyword:

UniProt release 13.3

Published April 29, 2008

Headlines

6 million entries in UniProtKB

Once upon a long, long time... This is how all fairy tales start, but was it really so long ago? No, it was December 2003 - just 4 and a half years ago, but it seems like ages - that UniProtKB was born. It was a beautiful baby, 1,220,020 entries fat and well supported by its 2 legs: the large TrEMBL and the small but knowledgeable Swiss-Prot. And the baby put on weight: on average 1,500 protein sequence entries per day in 2004 and during the first half of 2005. The more you have, the more you want, and from the middle of 2005 up to the beginning of 2007, UniProtKB was integrating about 3,500 new entries per day. And it hasn't stopped since: currently we are integrating approximately 5,000 entries per day and this number keeps growing. As a result, we are happy to announce that UniProtKB has reached the significant milestone of 6,074,524 entries. Note that this tremendous growth is not due to the submission of environmental samples, which are stored in another UniProt database: UniMES.

May we all live happily ever after and extract knowledge from this flood of data!

UniProtKB News

Cross-references to NMPDR

Cross-references have been added to the National Microbial Pathogen Data Resource. NMPDR is a National Institute of Allergy and Infectious Diseases (NIAID)-funded Bioinformatics Resource Center that supports research in selected Category B pathogens. NMPDR contains the complete genomes of approximately 50 strains of pathogenic bacteria as well as >400 other genomes that provide a broad context for comparative analysis across the three phylogenetic domains. NMPDR integrates complete, public genomes with expertly curated biological subsystems to provide the most consistent genome annotations. Subsystems are sets of functional roles related by a biologically meaningful organizing principle, which are built over large collections of genomes; they provide researchers with consistent functional assignments in a biologically structured context.

The National Microbial Pathogen Data Resource is available at http://www.nmpdr.org/.

The format of the explicit links in the flat file is:

Resource abbreviation NMPDR
Resource identifier NMPDR protein identifier.
Examples
Q88K84:
DR   NMPDR; fig|160488.1.peg.2385; -.

Q1QN15:
DR   NMPDR; fig|323097.3.peg.1480; -.

Changes concerning keywords

New keywords:

UniProt release 13.2

Published April 8, 2008

Headlines

Dictyostelium discoideum on the move

Dictyostelium discoideum is a social amoeba known for its ability to alternate between unicellular and multicellular forms. Thanks to the availability of powerful molecular genetic tools, it is a convenient model to study fundamental cellular processes, such as cytokinesis, motility, phagocytosis, chemotaxis, signal transduction and aspects of development, including cell sorting, pattern formation and cell-type determination. It is one of 9 nonmammalian model organisms recognized by the National Institutes of Health (NIH) for their utility in the study of fundamental molecular processes of medical importance.

The 34 Mb genome of Dictyostelium discoideum was sequenced and assembled by an international consortium in 2005. Its gene-dense chromosomes encode approximately 12,500 predicted proteins, a high proportion of which have long, repetitive amino acid tracts.

In order to improve the coverage of functional annotation of Dictyostelium discoideum proteins, the UniProt consortium and dictyBase jointly organized a one-week Dictyostelium discoideum protein annotation jamboree in the SIB Swiss Institute of Bioinformatics in Geneva last month. During this special event, more than 1,000 proteins were annotated by UniProtKB curators and about 30 gene models were corrected by dictyBase curators. In addition, more than 300 gene and protein names were standardized.

The close collaboration between UniProtKB and dictyBase will continue until the completion of Dictyostelium discoideum proteome annotation, planned for 2010.

The current UniProtKB/Swiss-Prot release contains 1,803 fully annotated Dictyostelium discoideum entries, which represent about 15% of the complete proteome. A complete non-redundant set of Dictyostelium discoideum proteins can be retrieved from UniProtKB with the keyword 'Complete proteome'.

UniProtKB News

New document listing all secondary accession numbers.

The document sec_ac.txt, available by ftp, lists all secondary accession number(s) in UniProtKB (UniProtKB/Swiss-Prot and UniProtKB/TrEMBL), together with their corresponding current primary accession number.

Changes concerning cross-references to HIV

Cross-references to HIV have been removed.

Changes concerning cross-references to TRANSFAC

Cross-references to TRANSFAC have been removed.

Changes concerning keywords

New keywords:

UniProt release 13.1

Published March 18, 2008

Headlines

A small but deadly pathogen: Hepatitis B virus

The Hepatitis B virus (HBV) causes transient and chronic infections of the liver and constitutes a major cause of human disease. It is estimated that more than 5% of the global population carries the virus, and deaths from liver cancer caused by HBV probably exceed one million per year (see WHO factsheet). An effective vaccine has been available for nearly 20 years, but its high cost still hampers disease control in the developing world.

This killer virus has a surprisingly small genome, about 3.2 kb, which nevertheless encodes 5 proteins through overlapping open reading frames. It replicates by reverse-transcribing genomic RNA to partial dsDNA through a unique mechanism, and thus belongs to a particular family: the Hepadnaviridae.

The virus specifically infects hepatocytes, and most symptoms in an acute infection result from the killing of infected cells by the host immune system. In a few cases, the virus manages to down-regulate the host immunity and establishes a chronic infection. A viral protein secreted into the blood is suspected to be involved in chronicity: the HBeAg protein may specifically deplete T-helper lymphocytes, thereby suppressing the ability to mount a strong cytotoxic response against infected hepatocytes.

Our current knowledge of the virus is rather poor due to the lack of cell culture systems allowing in vitro viral propagation. Much of what we know is derived from the study of other closely related hepadnaviruses, such as the woodchuck hepatitis virus (WHV) and the ground squirrel hepatitis virus (GSHV).

In the current UniProtKB/Swiss-Prot release, all hepatitis B virus entries have been updated, and 51 strains representative of the 8 genotypes infecting humans have been annotated. Animal hepatitis B virus entries have also been revisited, notably WHV and GSHV.

UniProtKB News

Cross-references to ProMEX

Cross-references have been added to the Protein Mass spectra EXtraction database. ProMEX is a mass spectral library consisting of tryptic peptide product ion spectra generated by liquid chromatography coupled to ion trap mass spectrometry (LC-ITMS) and was developed using samples derived from Arabidopsis thaliana and Medicago truncatula. The database serves as a reference and can be used for protein identification in uncharacterized samples. Protein identification by ProMEX is linked to other molecular levels of biological organization such as metabolite, pathway and transcript data. The database is further connected to annotation and classification services.

The Protein Mass spectra EXtraction database is available at http://promex.mpimp-golm.mpg.de/.

The format of the explicit links in the flat file is:

Resource abbreviation ProMEX
Resource identifier UniProtKB accession number.
Examples
O80448:
DR   ProMEX; O80448; -.

P49200:
DR   ProMEX; P49200; -.

Changes concerning keywords

New keywords:

UniProt release 13.0

Published February 26, 2008

Headlines

UniProtKB major release (13.0)

UniProt Knowledgebase release 13.0 includes Swiss-Prot release 55.0 and TrEMBL release 38.0.

Release 55.0 of 26-Feb-08 of UniProtKB/Swiss-Prot contains 356'194 sequence entries, comprising 127'836'513 amino acids abstracted from 165'776 references. 80'183 sequences have been added since release 54.0, the sequence data of 1'411 existing entries has been updated and the annotations of 262'009 entries have been revised.

The following improvements were carried out in the last 7 months:

UniProtKB News

New representation of non-standard amino acids (selenocysteine and pyrrolysine)

The non-standard amino acids selenocysteine and pyrrolysine used to be annotated in the 'Sequence annotation' section, 'Amino acid modifications' subsection, under the feature keys 'Selenocysteine' and 'Modified residue', respectively. In the sequence, selenocysteine was represented by the one-letter code 'C' and pyrrolysine by 'K'. In order to annotate these and future non-standard amino acids more adequately, we created a new key 'Non-standard residue'. The type of non-standard residue involved is indicated in the 'description' of the feature key. Sequences now accommodate the IUPAC/IUBMB recommended one-letter codes 'U' for selenocysteine and 'O' for pyrrolysine.

In the flat file, selenocysteine used to be described with the feature key SE_CYS and pyrrolysine with the more generic feature key MOD_RES. We have replaced these keys with the new feature key NON_STD (non-standard). The type of non-standard residue involved is indicated in the 'description' of the NON_STD key.

Former annotation in the flat file:

     ID   BTHD_DROME              Reviewed;         249 AA.
     ..
     FT   SE_CYS       37     37
     ..
     MPPKRNKKAE APIAERDAGE ELDPNAPVLY VEHCRSCRVF RRRAEELHSA LRERGLQQLQ
                                            *
    
     ID   MTBB1_METAC             Reviewed;         467 AA.
     ..
     FT   MOD_RES     356    356       Pyrrolysine (Probable).
     ..
     RAVNFMKAAV QASPIPCHVD MGMGVGGIPM LETPPVDAVT RASKAMVEVA GVDGIKIGVG
                                                                 *
    

Current annotation:

     ID   BTHD_DROME              Reviewed;         249 AA.
     ..
     FT   NON_STD      37     37       Selenocysteine.
     ..
     MPPKRNKKAE APIAERDAGE ELDPNAPVLY VEHCRSURVF RRRAEELHSA LRERGLQQLQ
     *
    
     ID   MTBB1_METAC             Reviewed;         467 AA.
     ..
     FT   NON_STD     356    356       Pyrrolysine (Probable).
     ..
     RAVNFMKAAV QASPIPCHVD MGMGVGGIPM LETPPVDAVT RASKAMVEVA GVDGIOIGVG
     *
    

UniProtKB/Swiss-Prot entries describing selenocysteine- and pyrrolysine-containing sequences can be retrieved with the 'Selenocysteine' and 'Pyrrolysine' keywords, respectively.
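As an illustration of the new representation, the following minimal Python sketch reads NON_STD feature lines and looks up the one-letter code at the annotated position. The function name `non_std_residues` and the regular expression are assumptions made for this example, not part of any UniProt tool; the input is the BTHD_DROME excerpt shown above.

```python
import re

# Matches a flat-file feature line such as:
#   FT   NON_STD      37     37       Selenocysteine.
NON_STD_RE = re.compile(r"^FT\s+NON_STD\s+(\d+)\s+(\d+)\s+(.*)$")


def non_std_residues(ft_lines, sequence):
    """Return (position, one-letter code, description) for each NON_STD feature."""
    # Sequence lines are printed in blocks of 10 residues separated by spaces.
    seq = "".join(sequence.split())
    found = []
    for line in ft_lines:
        m = NON_STD_RE.match(line.strip())
        if m is None:
            continue  # not a NON_STD feature (e.g. the former SE_CYS key)
        start = int(m.group(1))
        description = m.group(3).rstrip(".")
        # Feature positions are 1-based; Python strings are 0-based.
        found.append((start, seq[start - 1], description))
    return found


ft = ["FT   NON_STD      37     37       Selenocysteine."]
seq = "MPPKRNKKAE APIAERDAGE ELDPNAPVLY VEHCRSURVF RRRAEELHSA LRERGLQQLQ"
print(non_std_residues(ft, seq))  # [(37, 'U', 'Selenocysteine')]
```

Run on the former annotation (feature key SE_CYS, residue 'C'), the same function returns an empty list, which is one simple way to tell the two formats apart.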

Cross-references to PhosphoSite

Cross-references have been added to PhosphoSite, an expert-curated knowledgebase of information focused on protein phosphorylation mainly in vertebrates. In addition to phosphorylation sites curated from the literature, large numbers of new unpublished sites discovered by MS/MS analyses are being added regularly.

The Phosphorylation site database is available at http://phosphosite.cellsignal.com/.

The format of the explicit links in the flat file is:

Resource abbreviation PhosphoSite
Resource identifier UniProtKB accession number.
Examples
P01266:
DR   PhosphoSite; P01266; -.

Q9JMH6:
DR   PhosphoSite; Q9JMH6; -.
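The explicit links above all follow the same DR line layout: resource abbreviation, resource identifier, then optional information fields, separated by semicolons and closed by a period. A minimal Python sketch of a reader for such lines could look like this; `parse_dr` is a hypothetical helper name, and the sketch assumes that no field contains a semicolon.

```python
def parse_dr(line):
    """Split a DR line into (resource, identifier, list of optional fields)."""
    stripped = line.strip()
    if not stripped.startswith("DR   "):
        raise ValueError("not a DR line: %r" % line)
    body = stripped[5:].strip()
    # The line is closed by a single period; identifiers may contain periods
    # themselves, so only the final one is removed.
    if body.endswith("."):
        body = body[:-1]
    fields = [f.strip() for f in body.split(";")]
    if len(fields) < 2:
        raise ValueError("DR line needs a resource and an identifier")
    return fields[0], fields[1], fields[2:]


print(parse_dr("DR   PhosphoSite; P01266; -."))
# ('PhosphoSite', 'P01266', ['-'])
```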

Cross-references to 2DBase-Ecoli

Cross-references have been added to 2DBase-Ecoli, the 2D-PAGE database of Escherichia coli. The 2DBase-Ecoli database currently contains 12 gels comprising 1185 protein spots, in which 723 proteins were identified and annotated. Individual protein spots in the existing gels can be displayed, queried, analysed and compared in a tabular format based on various functional categories, enabling quick subsequent analysis.

The 2D-PAGE Database of Escherichia coli is available at http://2dbase.techfak.uni-bielefeld.de/.

The format of the explicit links in the flat file is:

Resource abbreviation 2DBase-Ecoli
Resource identifier UniProtKB accession number.
Examples
P02930:
DR   2DBase-Ecoli; P02930; -.

P04816:
DR   2DBase-Ecoli; P04816; -.

Changes concerning keywords

New keywords:

Changes in subcellular location controlled vocabulary

New subcellular locations:

  • Extravirionic side
  • Intravirionic side

Changes in PTM controlled vocabulary

New terms in the 'Amino acid modifications' subsection (feature key 'MOD_RES' in the flat file):

  • 2'-methylsulfonyltryptophan
  • 4,5-dihydroxyleucine
  • (3R,4S)-3,4-dihydroxyproline
  • Cyclopeptide (Ala-Pro)
  • D-serine (Cys)
  • D-serine (Ser)
  • D-threonine
  • N,N-dimethylalanine
  • N-acetylisoleucine
  • O-(5'-phospho-DNA)-serine
  • O-(5'-phospho-DNA)-tyrosine
  • O-(5'-phospho-RNA)-serine
  • O-(5'-phospho-RNA)-tyrosine
  • S-glutathionyl cysteine

UniProt release 12.8

Published February 5, 2008

Headlines

Over 20,000 fungal proteins manually annotated in UniProtKB/Swiss-Prot

Almost exactly one year after the integration of the complete proteome of Saccharomyces cerevisiae into UniProtKB/Swiss-Prot (see news), we have increased the number of manually annotated fungal entries to more than 20 000.

The fungal kingdom includes very diverse organisms, from unicellular to multicellular, from microscopic to macroscopic. Fungi have essential roles in many ecological processes. They are required for nutrient cycling within ecosystems, since they recycle dead organic matter into useful nutrients. Many plants would not survive without symbiotic fungi called mycorrhizae, which live in their roots and supply essential nutrients. Fungi are also economically important: they provide numerous drugs (such as penicillin) and food (such as mushrooms), and are used for their ability to ferment different sugars to produce bread, wine, beer and even soy sauce.

Fungi are also responsible for a great number of severe plant and animal diseases. Fungal infections, also called mycotic infections, may affect the skin or the internal organs of the body. Severe mycotic infections, such as histoplasmosis and candidiasis, are potentially life-threatening. Fungal diseases are very difficult to treat since fungi are eukaryotic organisms that share many properties with animal or human cells. Plant diseases caused by fungi include rusts and smuts, as well as leaf, root, and stem rot. They can cause severe damage to crop production.

Moreover, many fungi are important model organisms for studying the genetics and molecular biology of eukaryotes.

It is therefore not surprising that many fungi were targeted for complete genome sequencing. No fewer than 32 complete fungal genomes have been submitted to public sequence databases to date. Using the S. cerevisiae and Schizosaccharomyces pombe fully annotated proteomes as templates, we are progressively annotating orthologous proteins in these newcomers, in order to provide our users with a high-quality fungal protein dataset that will better reflect the diversity of this kingdom.

UniProtKB News

Cross-references to World-2DPAGE

Cross-references have been added to World-2DPAGE, the public repository of 2D-gel data. All 2D gel data to be published in the journal Proteomics needs to be available on the web. The World-2DPAGE repository hosts the data for resources that cannot build and maintain a web interface. There are currently two data sources submitted to World-2DPAGE, which are numbered consecutively:

  • 0001: CGL14067-2DPAGE, Corynebacterium glutamicum entries
  • 0002: NIBR 2D-PAGE, Staphylococcus aureus Mu50 entries

The format of the explicit links in the flat file is:

Resource abbreviation World-2DPAGE
Resource identifier Database name and database accession number (usually from UniProtKB), separated by a colon.
Examples
P61108:
DR   World-2DPAGE; 0002:P61108; -.

P77845:
DR   World-2DPAGE; 0001:P77845; -.

Cornea-2DPAGE, DOSAC-COBS-2DPAGE, HSC-2DPAGE, REPRODUCTION-2DPAGE, SWISS-2DPAGE

In cross-references to Cornea-2DPAGE, DOSAC-COBS-2DPAGE, HSC-2DPAGE, REPRODUCTION-2DPAGE and SWISS-2DPAGE, the optional information field 1 used to be the species origin. The species information has become obsolete/redundant since UniProtKB/Swiss-Prot no longer contains entries describing the same protein from different species (see Release 6.7). We have therefore replaced the species information by "-".

Examples:

Previous format:

DR   SWISS-2DPAGE; P04217; HUMAN.
DR   Cornea-2DPAGE; P04217; HUMAN.
DR   DOSAC-COBS-2DPAGE; P04217; HUMAN.
DR   REPRODUCTION-2DPAGE; P04217; HUMAN.

New format:

DR   SWISS-2DPAGE; P04217; -.
DR   Cornea-2DPAGE; P04217; -.
DR   DOSAC-COBS-2DPAGE; P04217; -.
DR   REPRODUCTION-2DPAGE; P04217; -.

New document on human and mouse protein kinases.

The document pkinfam.txt, available by ftp and on the Web site, provides the classification of human and mouse protein kinases into subfamilies or subgroups, as developed by Gerard Manning. The classification from Diego Miranda-Saavedra has also been taken into account.

This document contains all UniProtKB/Swiss-Prot human and mouse protein kinase entries, subdivided into 10 subfamilies or subgroups. Each gene name is followed by the corresponding human and/or mouse UniProtKB/Swiss-Prot entry name (and accession number).

Changes concerning keywords

New keyword:

UniProt release 12.7

Published January 15, 2008

Headlines

Addition of more than 40’000 microbial entries derived from automated annotation in UniProtKB

Thanks to genome sequencing efforts, there has been a tremendous rise in the number of submitted protein sequences. And this is only the beginning, as faster and cheaper sequencing methods will greatly increase the rate at which new genomes are sequenced.

Semi-automated annotation methods are necessary in order to provide the users with a maximum number of annotated protein sequences. The approach used by UniProtKB/Swiss-Prot differs from most other automated methods as the bulk of the annotation procedure is still performed manually, since we want to make sure that we produce high quality annotation with a minimal amount of incorrect inferences.

Our first automatic annotation project is called HAMAP, which stands for High-quality Automated and Manual Annotation of microbial Proteomes. In the context of this project, proteins from complete bacterial and archaeal proteomes, together with the related plastid proteins, are automatically annotated based on manually created family rules for complete protein annotation, with template-based feature propagation. We are very aware of the danger posed by automatic annotation procedures and have been extremely careful in the implementation of the pipeline, establishing many checks and conditional propagation in order to ensure that automatic annotation will produce data of a quality up to that of manual curation.

At this release, we have begun the procedure to integrate automatically into UniProtKB/Swiss-Prot the entries annotated by the HAMAP automated pipeline; over 40’000 bacterial and archaeal entries were integrated. This is the largest number of entries ever integrated at one release.

It must be noted that the planned introduction of ‘evidence tags’ should allow us to unambiguously flag whether an information item has been derived manually or automatically. For the time being, all entries annotated by the HAMAP pipeline have a cross-reference to HAMAP (for an example see entry Q02JM4).

UniProtKB News

Cross-references to dictyBase

The DictyBase database was renamed dictyBase. We changed the database name in the relevant cross-references (DR lines in the flat file) accordingly.

Example:

DR   dictyBase; DDB0201569; manA.

Cross-references to PDBsum

Cross-references have been added to the PDBsum database. PDBsum provides an overview of every macromolecular structure deposited in the Protein Data Bank (PDB), giving schematic diagrams of the molecules in each structure and of the interactions between them.

The PDBsum database is available at http://www.ebi.ac.uk/pdbsum.

The format of the explicit links in the flat file is:

Resource abbreviation PDBsum
Resource identifier PDB entry name.
Examples
Q07540:
DR   PDBsum; 2FQL; -.
DR   PDBsum; 2GA5; -.

P78536:
DR   PDBsum; 1BKC; -.
DR   PDBsum; 1ZXC; -.
DR   PDBsum; 2A8H; -.
DR   PDBsum; 2DDF; -.
DR   PDBsum; 2FV5; -.
DR   PDBsum; 2FV9; -.
DR   PDBsum; 2I47; -.

Cross-references to VectorBase

Cross-references have been added to the Invertebrate Vectors of Human Pathogens database. VectorBase is a NIAID Bioinformatics Resource Center for Invertebrate Vectors of Human Pathogens. VectorBase annotates and maintains vector genomes providing an integrated resource for the research community.

The VectorBase database is available at http://www.vectorbase.org/index.php.

The format of the explicit links in the flat file is:

Resource abbreviation VectorBase
Resource identifier VectorBase Gene ID.
Optional information 1 Species name.
Examples
Q17KX3:
DR   VectorBase; AAEL001551; Aedes aegypti.

Q7PD39:
DR   VectorBase; AGAP005024; Anopheles gambiae.
DR   VectorBase; AGAP005025; Anopheles gambiae.

Release of new species-specific documents

There are 9 new documents for several Brucella, Rickettsia and Coxiella complete proteomes, listing all the UniProtKB/Swiss-Prot entries from these proteomes and their corresponding gene designations.

The documents contain, for each relevant UniProtKB/Swiss-Prot entry, the corresponding ordered locus name, entry name, accession number, sequence length and gene name(s).

Changes concerning keywords

New keywords:

Modified keywords:

Changes in subcellular location controlled vocabulary

New subcellular location:

  • Cyanelle stroma

UniMES News

New clustered sequence sets

The UniProt Metagenomic and Environmental Sequences (UniMES) database is a repository specifically developed for metagenomic and environmental data.

We now provide UniMES clusters, i.e. clustered sets (unimes_cluster100.fasta and unimes_cluster90.fasta) of sequences at two resolutions (100% and >90%). In unimes_cluster100.fasta, identical sequences and subfragments from unimes.fasta are placed into a single cluster.

The unimes_cluster90.fasta file is built by clustering the unimes_cluster100.fasta representative sequences (the longest sequence in each cluster) using the CD-HIT algorithm (Li W., Jaroszewski L., and Godzik A., Bioinformatics, 17: 282-283, 2001) such that each cluster is composed of sequences that have at least 90% sequence identity to the representative sequence. Only the representative sequences of the clusters are present in these files.

UniMES is available in the subdirectory current_release/unimes of the UniProt ftp servers (UniProt, EBI and ExPASy).

UniProt release 12.6

Published December 4, 2007

Headlines

Complete proteome for Arabidopsis thaliana in UniProtKB

Arabidopsis thaliana was the first plant to have its genome completely sequenced. A first round of annotation was performed in 2001 by the Arabidopsis Genome Initiative. The genome was later reannotated and is now maintained by The Arabidopsis Information Resource (TAIR) which assumes primary responsibility for Arabidopsis genome annotation.

As the genome sequencing was being completed, Swiss-Prot initiated the Plant Proteome Annotation Program (PPAP) whose main focus is the annotation of Arabidopsis (and rice) plant-specific proteins and protein families.

This ongoing program has so far produced more than 6'200 manually annotated Arabidopsis thaliana protein sequences in UniProtKB/Swiss-Prot. In addition, close to 44'000 Arabidopsis entries are available in UniProtKB/TrEMBL with a certain level of redundancy. Thus, the total number of protein sequences in UniProtKB for this model plant is much higher than the current estimate of 27'029 protein-encoding genes (see TAIR7 release of April 2007). To get around this problem, a non-redundant set of Arabidopsis proteins, including nuclear, mitochondrial and chloroplastic proteins, was created as of this release and the selected entries have been labelled with the keyword 'Complete proteome' to allow easy retrieval.

The current complete proteome contains a total of 29'315 entries: 6'241 Arabidopsis thaliana in UniProtKB/Swiss-Prot and 23'074 in UniProtKB/TrEMBL.

Arabidopsis thaliana is the third 'green plant' (Viridiplantae) for which a complete nonredundant protein set has been created in UniProtKB. The other two are the unicellular green algae Ostreococcus tauri and Ostreococcus lucimarinus.

UniProtKB News

Changes concerning keywords (KW line)

Deleted keyword:

  • Interferon induction

UniProt release 12.5

Published November 13, 2007

Headlines

Acanthamoeba polyphaga mimivirus, a "giant" virus in UniProtKB/Swiss-Prot

Mimivirus (for mimicking microbe) is a new viral genus containing a single identified species, Acanthamoeba polyphaga mimivirus (APMV), discovered by Didier Raoult's lab in 1992 within the amoeba Acanthamoeba polyphaga while working on Legionellosis. The virion has a non-enveloped, icosahedral capsid with a diameter of 400 nm and protein filaments projecting from its surface. The capsid contains the internal core surrounded by an internal lipid layer. Its linear, double-stranded DNA genome is roughly 1.2 million bp in length, the largest viral genome known so far. Its replication cycle, genome and capsid structure place it into the nucleocytoplasmic large DNA viruses (NCLDVs), which include amongst others the poxviruses and iridoviruses.

This virus is amazing in many ways. It is the largest virus ever isolated, with a genome size and complexity comparable to that of a small bacterium. A thorough bioinformatics analysis carried out by the group of Jean-Michel Claverie uncovered 909 potential protein-coding genes. Some of these proteins belong to families that are shared with all or some NCLDVs, many have eukaryotic counterparts and there are quite a number of ORFans (no sequence similarity to proteins from other genomes). It was a surprise to find an appreciable number of genes coding for proteins involved in metabolism, DNA repair pathways and, most surprising, genes encoding a partially functional protein translation apparatus. Mimivirus does indeed encode four aminoacyl-tRNA synthetases (ArgRS, CysRS, MetRS, TyrRS), as well as various translation initiation, elongation and termination factors. It is very intriguing to find, in a virus, genes corresponding to central components of the protein translation machinery, a biochemical process widely thought to be an exclusive signature of cellular organisms.

The discovery of this amazing virus has led to the concept of "giant" virus and implies that there is an overlap in terms of particle dimension, genome size, and genetic complexity between the viral and cellular organism worlds.

A special effort has been made in the UniProtKB/Swiss-Prot database to provide the complete, fully annotated mimivirus proteome. We have also integrated all proteomics and structural information that has been made available by the groups of Jean-Michel Claverie and Chantal Abergel.

To get all UniProtKB mimivirus entries, click here.

UniProtKB News

Format change in the ptmlist.txt document file

The ptmlist.txt document, which is available by ftp and on the Web site, describes the post-translational modifications (PTMs) that are annotated in UniProtKB/Swiss-Prot entries in the sequence annotation section (Features) (FT lines in the flat file) in the subsections "Cross-link" (CROSSLNK key in the flat file), "Lipidation" (LIPID key in the flat file) and "Modified residue" (MOD_RES key in the flat file). The document was in a format that was suitable for computer applications (e.g. ExPASy's proteomics tools), but which was not very human readable. The new file format should improve this.

Previous format:

N,N-dimethylproline  MOD_RES P  BB Nter C2H4  28.031300  28.06  in  e:6446,7586,33682  Methylation  FT=MOD_RES%20dimethylproline&wild=1  AA0066  MOD:00075

New format:

ID   N,N-dimethylproline
AC   PTM-0179
FT   MOD_RES
TG   Proline.
PA   Amino acid backbone.
PP   N-terminal.
CF   C2 H4
MM   28.031300
MA   28.06
LC   Intracellular localisation.
TR   Eukaryota; taxId:6446 (Sipunculus nudus), taxId:7586 (Echinodermata), taxId:33682 (Euglenozoa).
KW   Methylation.
DR   RESID:AA0066.
DR   MOD:00075.
//

With the following definitions of the line types:

  ---------  ---------------------------     ----------------------
  Line code  Content                         Occurrence in an entry
  ---------  ---------------------------     ----------------------
  ID         Identifier (FT description)     Once; starts a PTM entry.
  AC         Accession (PTM-xxxx)            Once.
  FT         Feature key                     Once.
  TG         Target                          Once; two targets separated
                                             by a dash in case of intrachain
                                             crosslinks.
  PA         Position of the modified        Optional, once.
             amino acid
  PP         Position of the modification    Optional, once.
             in the polypeptide
  CF         Correction formula              Optional, once.
  MM         Monoisotopic mass difference    Optional, once.
  MA         Average mass difference         Optional, once.
  LC         Cellular location               Optional, once; alternatives
                                             can be proposed.
  TR         Taxonomic range                 Optional, once or more.
  KW         Keyword                         Optional, once or more.
  DR         Cross-reference to PTM          Optional, once or more.
             databases
  //         Terminator                      Once; ends an entry.
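To show how the new line-oriented layout lends itself to parsing, here is a minimal Python sketch that groups the lines of a ptmlist.txt-style text into entries. The function name `parse_ptm_entries` is hypothetical, and the sketch assumes that every line starts with a two-character line code and that each entry is closed by `//`, as in the table above.

```python
def parse_ptm_entries(text):
    """Yield one dict per PTM entry, mapping each line code to a list of values."""
    entry = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line == "//":
            # Terminator: hand back the finished entry and start a new one.
            if entry:
                yield entry
            entry = {}
            continue
        code, value = line[:2], line[2:].strip()
        # TR, KW and DR may occur more than once, so every code maps to a list.
        entry.setdefault(code, []).append(value)
    # Lines after the last terminator are ignored on purpose.


EXAMPLE = """\
ID   N,N-dimethylproline
AC   PTM-0179
FT   MOD_RES
TG   Proline.
PA   Amino acid backbone.
PP   N-terminal.
CF   C2 H4
MM   28.031300
MA   28.06
LC   Intracellular localisation.
TR   Eukaryota; taxId:6446 (Sipunculus nudus), taxId:7586 (Echinodermata), taxId:33682 (Euglenozoa).
KW   Methylation.
DR   RESID:AA0066.
DR   MOD:00075.
//
"""

for ptm in parse_ptm_entries(EXAMPLE):
    print(ptm["ID"][0], ptm["AC"][0], float(ptm["MM"][0]))
```

Keeping every value as a list avoids special cases for the line codes that may repeat; callers that expect a single value simply take the first element.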

Changes concerning cross-references to PDB

We added an additional field to the cross-reference (DR line in the flat file) to the PDB database to show the resolution of structures that were determined by X-ray crystallography or electron microscopy.

For the chain names, we now use the remediated data from wwPDB; as a result, the chain names have changed for some entries.

Previous format:

DR   PDB; ENTRY_NAME; METHOD; CHAIN.

New format:

DR   PDB; ENTRY_NAME; METHOD; RESOLUTION; CHAIN.

Examples:

Q20728:
DR   PDB; 1LPL; X-ray; 1.77 A; A=135-229.
Q5HEB7:
DR   PDB; 2I8C; X-ray; 2.46 A; A/B=1-356.

A dash indicates that we found no information about the resolution or that the field is not applicable (for NMR structures and theoretical models).

Examples:

P02768:
DR   PDB; 2ESG; X-ray; -; C=25-609.
P12872:
DR   PDB; 1LBJ; NMR; -; A=26-47.
P0AC41:
DR   PDB; 2AD0; Model; -; A=1-588.
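A reader for the new five-field PDB cross-reference might be sketched as follows in Python. The name `parse_pdb_dr` and the returned dictionary keys are assumptions chosen for this example; the resolution is converted to a number of angstroms, and a dash is mapped to `None`.

```python
def parse_pdb_dr(line):
    """Parse 'DR   PDB; ENTRY_NAME; METHOD; RESOLUTION; CHAIN.' into a dict."""
    stripped = line.strip()
    if not stripped.startswith("DR   "):
        raise ValueError("not a DR line: %r" % line)
    body = stripped[5:].strip()
    if body.endswith("."):
        body = body[:-1]  # drop the closing period only
    fields = [f.strip() for f in body.split(";")]
    if len(fields) != 5 or fields[0] != "PDB":
        raise ValueError("not a PDB cross-reference in the new format")
    _, entry, method, resolution, chains = fields
    # '1.77 A' becomes 1.77; '-' means unknown or not applicable.
    value = None if resolution == "-" else float(resolution.split()[0])
    return {"entry": entry, "method": method,
            "resolution": value, "chains": chains}


print(parse_pdb_dr("DR   PDB; 1LPL; X-ray; 1.77 A; A=135-229."))
print(parse_pdb_dr("DR   PDB; 1LBJ; NMR; -; A=26-47."))
```

A line in the previous four-field format raises `ValueError`, which makes it easy to spot files that have not yet been converted.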

Cross-references to CleanEx

Cross-references have been added to the CleanEx database of gene expression profiles. CleanEx is a database which provides access to public gene expression data via unique approved gene symbols and which represents heterogeneous expression data produced by different technologies in a way that facilitates joint analysis and cross-dataset comparison.

The CleanEx database is available at http://www.cleanex.isb-sib.ch/.

The format of the explicit links in the flat file is:

Resource abbreviation CleanEx
Resource identifier Combination of a species code and a gene identifier.
Examples
O08788:
DR   CleanEx; MM_DCTN1; -.

P78358:
DR   CleanEx; HS_CTAG1A; -.
DR   CleanEx; HS_CTAG1B; -.

Changes concerning keywords (KW line)

Modified keyword:

UniProt release 12.4

Published October 23, 2007

Headlines

More controlled vocabulary in the 'Subcellular location' subsection

Over 160'000 UniProtKB/Swiss-Prot entries (56%) contain a subcellular location description in the General Annotation section (CC lines in the flat file). We have standardized the content of these comments with the concomitant creation of a controlled vocabulary and a new, parsable flat-file format.

The subcellular location controlled vocabularies are stored in a new document (subcell.txt) which provides, for each individual UniProtKB location, topology or orientation term, the corresponding definition, as well as other relevant information, such as synonyms, hierarchies or mapped GO terms.

The format of the 'Subcellular location' subtopic has changed from free text to a more structured format. When required for the accurate description of a complex biological situation, free text is still used in the 'Note' (see for example O43918). In addition, since release 11.0, this subsection can occur more than once per entry, allowing specific annotation for each isoform, chain or peptide in separate subsections.

UniProtKB News

New document listing the controlled vocabularies used in the 'Subcellular location' subsection

The document subcell.txt, available by ftp and on the Web site, lists the controlled vocabularies used in the 'Subcellular location' subsection (CC SUBCELLULAR LOCATION lines in the flat file), their definitions and further information such as synonyms or relevant GO terms in the following format:

     ---------  -------------------------------   ----------------------------------------------
     Line code  Content                           Occurrence in an entry
     ---------  -------------------------------   ----------------------------------------------
     ID         Identifier (location)             Once; starts an entry
     IT         Identifier (topology)             Once; starts a 'topology' entry
     IO         Identifier (orientation)          Once; starts an 'orientation' entry
     AC         Accession (SL-xxxx)               Once
     DE         Definition                        Once or more
     SY         Synonyms                          Optional; Once or more
     SL         Content of subc. loc. lines       Once
     HI         Hierarchy ('is-a')                Optional; Once or more
     HP         Hierarchy ('part-of')             Optional; Once or more
     KW         Associated keyword (accession)    Optional; Once or more
     GO         Gene ontology (GO) mapping        Optional; Once or more
     WW         Interesting links or references   Optional; Once or more
     //         Terminator                        Once; ends an entry
    

Example:

     ID   Cyanelle.
     AC   SL-0082
     DE   A cyanelle is a photosynthetic organelle of glaucocystophyte algae.
     DE   Cyanelles are surrounded by a double membrane and, in between, a
     DE   peptidoglycan wall. Thylakoid membrane architecture and the presence
     DE   of carboxysomes are cyanobacteria-like. Historically, the term
     DE   cyanelle is derived from a classification as endosymbiotic
     DE   cyanobacteria, and thus is not fully correct.
     SY   Muroplast; Cyanoplast.
     SL   Plastid, cyanelle.
     HI   Plastid.
     KW   KW-0194
     GO   GO:0009842; cyanelle
     //
    

Syntax modification of the 'Subcellular location' subtopic

We have structured the 'Subcellular location' subtopic (CC SUBCELLULAR LOCATION lines in the flat file) in order to improve the consistency of annotation and to allow its content to be parsed.

The new format of SUBCELLULAR LOCATION in the flat file is:

     CC   -!- SUBCELLULAR LOCATION:(( Molecule:)?( Location\.)+)?( Note=Free_text( Flag)?\.)?
    
Where:
  • Molecule: Isoform, chain or peptide name
  • Location = Subcellular_location( Flag)?(; Topology( Flag)?)?(; Orientation( Flag)?)?
    • Subcellular_location: SL-line of subcell.txt ID-record
    • Topology: SL-line of subcell.txt IT-record
    • Orientation: SL-line of subcell.txt IO-record
    • Flag = \(By similarity|Probable|Potential\)

Note: Perl-style multipliers indicate whether a pattern (as delimited by parentheses) is optional (?) or may occur 1 or more times (+). Alternative values are separated by a pipe symbol (|).

Examples:

P32755:
     CC   -!- SUBCELLULAR LOCATION: Cytoplasm. Endoplasmic reticulum membrane;
     CC       Peripheral membrane protein. Golgi apparatus membrane; Peripheral
     CC       membrane protein.
    
Q96QV1:
     CC   -!- SUBCELLULAR LOCATION: Cell membrane; Peripheral membrane protein
     CC       (By similarity). Secreted (By similarity). Note=The last 22 C-
     CC       terminal amino acids may participate in cell membrane attachment.
     CC   -!- SUBCELLULAR LOCATION: Isoform 2: Cytoplasm (Probable).
    
P35670:
     CC   -!- SUBCELLULAR LOCATION: Golgi apparatus, trans-Golgi network
     CC       membrane; Multi-pass membrane protein (By similarity).
     CC       Note=Predominantly found in the trans-Golgi network (TGN). Not
     CC       redistributed to the plasma membrane in response to elevated
     CC       copper levels.
     CC   -!- SUBCELLULAR LOCATION: Isoform 2: Cytoplasm.
     CC   -!- SUBCELLULAR LOCATION: WND/140 kDa: Mitochondrion.
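The structured syntax can be read with a few lines of code. The Python sketch below splits one SUBCELLULAR LOCATION comment into molecule, locations and note. All function names are hypothetical, and the sketch makes simplifying assumptions that are not stated in the format description: a wrapped line ending in '-' is joined to the next line without a space, molecule names and location terms contain neither ': ' nor a period, and a flag at the end of the note is left inside the note text.

```python
import re

FLAG_RE = re.compile(r"\s*\((By similarity|Probable|Potential)\)$")


def join_cc_lines(lines):
    """Join the wrapped CC lines of one comment into a single string."""
    text = ""
    for line in lines:
        part = line.strip()[2:].strip()  # drop the 'CC' line code
        if part.startswith("-!-"):
            part = part[3:].strip()
        # Assumption: a line ending in '-' was wrapped inside a hyphenated word.
        sep = "" if (not text or text.endswith("-")) else " "
        text += sep + part
    return text


def split_flag(term):
    """Return (term without flag, flag or None)."""
    m = FLAG_RE.search(term)
    if m:
        return term[:m.start()].strip(), m.group(1)
    return term.strip(), None


def parse_subcellular_location(lines):
    text = join_cc_lines(lines)
    topic = "SUBCELLULAR LOCATION:"
    if not text.startswith(topic):
        raise ValueError("not a SUBCELLULAR LOCATION comment")
    body, _, note = text[len(topic):].strip().partition("Note=")
    body = body.strip()
    molecule = None
    if ": " in body:
        molecule, body = body.split(": ", 1)
    locations = []
    for chunk in body.split("."):
        if chunk.strip():
            # Each location is 'Subcellular_location; Topology; Orientation',
            # where the last two parts are optional.
            locations.append([split_flag(t) for t in chunk.split(";")])
    return {"molecule": molecule, "locations": locations,
            "note": note.strip() or None}


q96qv1 = [
    "CC   -!- SUBCELLULAR LOCATION: Cell membrane; Peripheral membrane protein",
    "CC       (By similarity). Secreted (By similarity). Note=The last 22 C-",
    "CC       terminal amino acids may participate in cell membrane attachment.",
]
print(parse_subcellular_location(q96qv1))
```

Because the note is split off first, periods and colons inside the free text do not interfere with the location parsing.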
    

Modification of the EC (Enzyme Commission) number format

EC numbers are used to describe enzyme reactions and are based on the recommendations of the Nomenclature Committee of the International Union of Biochemistry and Molecular Biology (IUBMB). The EC numbers and the reactions they describe are stored in the ENZYME and IntEnz databases.

In the UniProt Knowledgebase some enzymes are assigned so-called partial EC numbers, in which part of the number is replaced by dashes (e.g. EC 3.4.24.-). This happens in the following situations:

  1. The catalytic activity of the protein is not known exactly.
  2. The protein catalyzes a reaction that is known, but not yet included in the IUBMB EC list.

To distinguish these two meanings, we have started to use the letter 'n' with a preliminary number instead of a dash '-' for the latter case. The retrofit of those existing EC numbers of proteins in UniProtKB that catalyze a reaction that is known, but not yet included in the IUBMB EC list will be an ongoing process.

Examples:

The catalytic activity of the protein is not known exactly:

Q9VAC5:
     DE   ADAM 17-like protease precursor (EC 3.4.24.-).
    

The protein catalyzes a reaction that is known, but not yet included in the IUBMB's EC list:

Q01468:
     DE   4-oxalocrotonate tautomerase (EC 5.3.2.n1) (4-OT).
    
Q8IV42:
     DE   L-seryl-tRNA(Sec) kinase (EC 2.7.1.n3) (O-phosphoseryl-tRNA(Sec)
     DE   kinase) (PSTK).
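The distinction between the two cases can be checked mechanically. The following Python sketch classifies an EC number as complete, partial (dashes) or preliminary ('n' followed by a number). The function name `classify_ec` and the three labels are assumptions made for this illustration.

```python
import re

PRELIMINARY_RE = re.compile(r"n\d+")


def classify_ec(ec):
    """Return 'complete', 'partial' or 'preliminary' for an EC number string."""
    parts = ec.split(".")
    if len(parts) != 4:
        raise ValueError("an EC number has four levels: %r" % ec)
    if any(PRELIMINARY_RE.fullmatch(p) for p in parts):
        # Reaction is known but not yet included in the IUBMB EC list.
        return "preliminary"
    if "-" in parts:
        # Catalytic activity is not known exactly.
        return "partial"
    if all(p.isdigit() for p in parts):
        return "complete"
    raise ValueError("unrecognised EC number: %r" % ec)


for ec in ("3.4.24.-", "5.3.2.n1", "2.7.1.n3"):
    print(ec, classify_ec(ec))
```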
    

UniProt release 12.3

Published October 2, 2007

Headlines

Oryza sativa (rice) species separated into japonica and indica subspecies in UniProtKB/Swiss-Prot entries

Although it has been a rule in UniProtKB/Swiss-Prot to merge all protein sequences encoded by the same gene in one species into a single record to avoid redundancy, this rule sometimes has to be adapted to specific cases. For example, this rule applied to rice entries, causing sequences from various rice cultivars to be merged and entries tagged with the unique taxonomic identifier (ID) for Oryza sativa species: 4530.

However, O. sativa comprises 2 subspecies: japonica and indica. A classification at subspecies level is already effective in several databases, including UniProtKB/TrEMBL, and most scientists use it when submitting new sequences. In EMBL/DDBJ/GenBank, there are over 1.2 million japonica and almost 360,000 indica sequences, coming mainly from large scale genome, cDNA or EST sequencing projects. The completion of both the japonica and the indica genomes and the analysis of multiple sets of subspecies-specific transcripts revealed a significant number of sequence variations and a divergence of expression pattern between japonica and indica subspecies. In order to provide clearer information to its users, UniProtKB/Swiss-Prot had to adopt this classification and separate indica and japonica subspecies in rice entries.

Most rice entries contained exclusively japonica sequences and were quickly updated with the appropriate taxonomic ID. But over 220 rice entries contained merged sequences of japonica and indica subspecies and had to be "de-merged". This task was undertaken by the PPAP (Plant Proteome Annotation Program) team. Common information was kept in both japonica and indica entries, while expression patterns or other subspecies-specific experimental evidence was transferred where it belongs. Today all rice entries are classified into either japonica or indica subspecies, with the exception of very few entries where subspecies was not specified. When available, cultivars are indicated in the reference section. Each entry also provides cross-references to either japonica (cultivar nipponbare) or indica (cultivar 93-11) genomic sequences.

The gene nomenclature system ('OS' code) defined by RAP-DB and/or TIGR for the japonica cultivar nipponbare can be found in japonica entries in the gene names subsection (Ordered Locus Names). RAP-DB locus identifiers are listed in the rice.txt file.

To get all reviewed UniProtKB Japonica entries, click here.

To get all reviewed UniProtKB Indica entries, click here.

To get all reviewed UniProtKB rice entries, click here.

The mnemonic species identification code in the entry name makes it possible to quickly identify the subspecies to which a protein belongs: ORYSJ is the code for japonica, ORYSI for indica, and the old ORYSA code indicates that the subspecies is not specified. The list of rice cultivars can be found in the strains.txt file.
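As a small illustration, the species code can be read from the part of the entry name after the underscore. In this Python sketch the function name `rice_subspecies` and the entry names starting with 'ABC1' are hypothetical placeholders; only the three species codes come from the text above.

```python
RICE_CODES = {
    "ORYSJ": "japonica",
    "ORYSI": "indica",
    "ORYSA": "subspecies not specified",
}


def rice_subspecies(entry_name):
    """Map an entry name to a rice subspecies label, or None for non-rice entries."""
    # Entry names have the form PROTEIN_SPECIES; keep what follows the last '_'.
    _, _, species_code = entry_name.rpartition("_")
    return RICE_CODES.get(species_code)


print(rice_subspecies("ABC1_ORYSJ"))  # hypothetical entry name
print(rice_subspecies("BTHD_DROME"))  # fruit fly entry, so None
```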

UniProtKB News

Changes concerning the comment line (CC) topic MASS SPECTROMETRY

To be consistent with other comment line topics in the flat file, we have changed the field tags of the topic MASS SPECTROMETRY. At the same time, we have extracted literature references into a new field, Source=, and replaced all molecule descriptions by isoform identifiers.

Previous format:

     CC   -!- MASS SPECTROMETRY: MW=mass(; MW_ERR=error)?; METHOD=method;
     CC       RANGE=ranges( (molecule))?; NOTE=(references|free_text (references)).
    

New format:

     CC   -!- MASS SPECTROMETRY: Mass=mass(; Mass_error=error)?; Method=method;
     CC       Range=ranges( (IsoformID))?(; Note=free_text)?; Source=references;
    

Examples:

P61409:

Previous format:

     CC   -!- MASS SPECTROMETRY: MW=3979.9; METHOD=Electrospray; RANGE=1-31;
     CC       NOTE=Ref.1, Ref.2.
    

New format:

     CC   -!- MASS SPECTROMETRY: Mass=3979.9; Method=Electrospray; Range=1-31;
     CC       Source=Ref.1, Ref.2;
    
P04653:

Previous format:

     CC   -!- MASS SPECTROMETRY: MW=23638.14; MW_ERR=3.0; METHOD=Electrospray;
     CC       RANGE=16-214 (P04653-2; Allele A); NOTE=With eleven phosphate
     CC       groups (Ref.2).
    

New format:

     CC   -!- MASS SPECTROMETRY: Mass=23638.14; Mass_error=3.0; Method=Electrospray;
     CC       Range=16-214 (P04653-2); Note=Allele A, with 11 phosphate groups;
     CC       Source=PubMed:7601973;
    

Note that literature references of the form Ref.n are replaced by PubMed identifiers where this is possible.
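The new field layout is straightforward to read mechanically. The following is a minimal sketch, not an official parser: it assumes that one MASS SPECTROMETRY comment is passed in as a list of CC lines and that field values contain no semicolons. The function name `parse_mass_spectrometry` is hypothetical.

```python
def parse_mass_spectrometry(cc_lines):
    """Parse one new-format MASS SPECTROMETRY comment into a dict of fields."""
    # Drop the 'CC' line code and join continuation lines with single spaces.
    text = " ".join(line.strip()[2:].strip() for line in cc_lines)
    prefix = "-!- MASS SPECTROMETRY:"
    if not text.startswith(prefix):
        raise ValueError("not a MASS SPECTROMETRY comment")
    fields = {}
    # Assumption: ';' only ever separates fields and never occurs in a value.
    for part in text[len(prefix):].split(";"):
        part = part.strip()
        if not part:
            continue
        key, _, value = part.partition("=")
        fields[key] = value
    return fields
```

Applied to the new-format P04653 example above, this yields the keys `Mass`, `Mass_error`, `Method`, `Range`, `Note` and `Source`, with all values kept as strings.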

Cross-references to RefSeq

Cross-references have been added to the NCBI Reference Sequences database. The Reference Sequence (RefSeq) collection aims to provide a comprehensive, integrated, non-redundant set of sequences, including genomic DNA, transcript (RNA), and protein products for taxonomically diverse organisms including eukaryotes, bacteria, and viruses. RefSeq is a baseline for medical, functional, and diversity studies; it provides a stable reference for genome annotation, gene identification and characterization, mutation and polymorphism analysis, expression studies, and comparative analyses.

The RefSeq database is available at http://www.ncbi.nlm.nih.gov/RefSeq/.

The format of the explicit links in the flat file is:

Resource abbreviation RefSeq
Resource identifier RefSeq protein accession ID.
Examples
O34697:
DR   RefSeq; NP_390916.1; -.

Q8IN81:
DR   RefSeq; NP_524397.2; -.
DR   RefSeq; NP_732344.1; -.
DR   RefSeq; NP_732345.1; -.
DR   RefSeq; NP_732346.1; -.
DR   RefSeq; NP_732347.1; -.
DR   RefSeq; NP_732348.1; -.
DR   RefSeq; NP_732349.1; -.
DR   RefSeq; NP_732350.1; -.

Cross-references to GeneID

Cross-references have been added to the Database of genes from NCBI RefSeq genomes. Entrez Gene is the NCBI's database for gene-specific information. It does not include all known or predicted genes; instead Entrez Gene focuses on the genomes that have been completely sequenced, that have an active research community to contribute gene-specific information, or that are scheduled for intense sequence analysis. The content of Entrez Gene represents the result of curation and automated integration of data from NCBI's Reference Sequence project (RefSeq), from collaborating model organism databases, and from many other databases available from NCBI. Records are assigned unique, stable and tracked integers as identifiers. The content (nomenclature, map location, gene products and their attributes, markers, phenotypes, and links to citations, sequences, variation details, maps, expression, homologs, protein domains and external databases) is updated as new information becomes available. Entrez Gene is a step forward from NCBI's LocusLink, with both a major increase in taxonomic scope and improved access through the many tools associated with NCBI Entrez.

The GeneID database is available at http://www.ncbi.nlm.nih.gov/sites/entrez?db=gene.

The format of the explicit links in the flat file is:

Resource abbreviation GeneID
Resource identifier GeneID accession ID.
Examples
P63272:
DR   GeneID; 6827; -.

P74750:
DR   GeneID; 951978; -.
DR   GeneID; 953863; -.
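Explicit links such as the RefSeq and GeneID lines above share the same `DR` layout, so they can be collected with one small routine. This is a sketch under the assumption that each DR line holds semicolon-separated fields and ends with a full stop; the function name `parse_dr_lines` is hypothetical.

```python
from collections import defaultdict


def parse_dr_lines(lines):
    """Group cross-reference identifiers by resource abbreviation."""
    xrefs = defaultdict(list)
    for line in lines:
        line = line.strip()
        if not line.startswith("DR"):
            continue  # ignore anything that is not a DR line
        # Remove the line code and the trailing full stop, then split fields.
        body = line[2:].strip().rstrip(".")
        fields = [field.strip() for field in body.split(";")]
        if len(fields) < 2:
            continue  # malformed line: no identifier present
        resource, identifier = fields[0], fields[1]
        xrefs[resource].append(identifier)
    return dict(xrefs)
```

For the P74750 example, the result maps `GeneID` to the two identifiers `951978` and `953863`, in the order in which they appear.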

UniProt release 12.2

Published September 11, 2007

Headlines

Yeast PDR5: the first adopted protein in UniProtKB/Swiss-Prot

While progress in laboratory techniques allows the production of an ever-increasing flood of data, these data are still insufficiently exploited. One reason for this bottleneck is the lack of efficient integration into databases, making data more difficult, sometimes almost impossible, to access. The current information flow consists of two steps. First, scientists providing knowledge encode it in the format of a given journal. Then database curators have to decode and standardize it to make it computer-parsable and usable for further research.

In order to reduce this time-consuming and error-prone process and to make the most of expert scientists, UniProtKB/Swiss-Prot proposes a new strategy called 'Adopt a Protein', where researchers can adopt one or more specific proteins. 'Foster parents' make sure that the information concerning their favourite protein(s) is up-to-date. UniProtKB/Swiss-Prot provides them with a draft with the correct sequence, up-to-date sequence analysis predictions and a description of the main topics that require annotation, such as protein names, bibliographic references, comments and protein features. The input of 'foster parents' is acknowledged in the entry.

The yeast Saccharomyces cerevisiae is a popular model organism used in hundreds of laboratories around the world, and its genome has been fully sequenced and extensively studied over the past decade. Moreover, the yeast community has a long tradition of sharing information. Therefore, the yeast proteome has been chosen as a test platform to initiate the 'Adopt a Protein' scheme.

This release contains the first fully annotated adopted protein: PDR5. PDR5 is a 160-kDa yeast pleiotropic ABC efflux transporter of multiple drugs localized in the plasma membrane. It belongs to the ABC (ATP-binding cassette) transporter family, PDR subfamily. The PDR subfamily is specific to fungi and plants and exhibits distinctive structural features, such as extended extracellular loops and a degenerate ATP binding domain. Yeast strains lacking PDR5 are used for toxicity tests, whereas those overexpressing PDR5 are used for screening antifungal sensitizers.

PDR5 has been adopted by Professor André Goffeau from the Catholic University of Louvain (Belgium). We are grateful to him for committing precious time to help produce an annotation useful to the whole community. We hope that PDR5 is only the first member of a big adopted family! If you want to become a 'foster parent', please contact the UniProtKB/Swiss-Prot Fungal Proteome Annotation Program.

UniProtKB News

Changes concerning the section 'Web resources'

In the flat file, the 'Web resources' section is located in comment lines (CC). To be consistent with other comment lines, we have changed this topic from

     CC   -!- WEB RESOURCE: NAME=resource_name(; NOTE=free_text)?; URL="url".
    
to
     CC   -!- WEB RESOURCE: Name=resource_name(; Note=free_text)?; URL="url";
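For readers who maintain files in the previous layout, the change can be sketched as a simple line rewrite. The snippet assumes the whole comment sits on a single unwrapped line; the function name `convert_web_resource` and the resource used in the usage note are hypothetical.

```python
import re


def convert_web_resource(line):
    """Convert an old-format WEB RESOURCE comment line to the new format.

    Assumes the comment is not wrapped over several CC lines.
    """
    if "-!- WEB RESOURCE:" not in line:
        return line  # leave all other lines untouched
    # Field tags change from upper case to capitalized form.
    line = re.sub(r"(WEB RESOURCE: )NAME=", r"\1Name=", line, count=1)
    line = re.sub(r"; NOTE=", "; Note=", line, count=1)
    # The old format ended with a full stop after the quoted URL.
    line = re.sub(r'"\.\s*$', '";', line)
    return line
```

A made-up line such as `CC   -!- WEB RESOURCE: NAME=Example DB; URL="http://example.org/".` would come out with `Name=` and a closing semicolon in place of the full stop.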
    

Format change in the dbxref.txt and jourlist.txt document files

The dbxref.txt file lists the names, abbreviations and URLs of all databases cross-referenced in UniProtKB. The jourlist.txt file lists the titles and abbreviations of all journals cited in UniProtKB/Swiss-Prot. We have added a new field, AC, to assign a stable identifier to each record in these files.

Examples:

dbxref.txt

     AC    : DB-0022
     Abbrev: EMBL
     Name  : EMBL nucleotide sequence database
     Ref   : Nucleic Acids Res. 35:D16-D20(2007); PubMed=17148479; DOI=10.1093/nar/gkl913;
     LinkTp: Explicit
     Server: http://www.ebi.ac.uk/embl/
     Db_URL: www.ebi.ac.uk/htbin/expasyfetch?%s
     Cat   : Sequence databases
    

jourlist.txt

     AC    : JN-1120
     Abbrev: J. Mol. Biol.
     Title : Journal of Molecular Biology
     ISSN  : 0022-2836
     e-ISSN: 1089-8638
     CODEN : JMOBAK
     Short : JMB
     Publis: Elsevier Science
     Server: http://www.elsevier.com/locate/issn/00222836
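Both files use the same 'Key: value' layout, so a record can be read into a dictionary and then looked up by its new AC identifier. The following is only a sketch: it assumes the lines of one record have already been separated from the rest of the file, and the function name `parse_record` is hypothetical.

```python
def parse_record(lines):
    """Parse the 'Key: value' lines of one dbxref.txt or jourlist.txt record."""
    record = {}
    for line in lines:
        # Split at the first colon only, since values such as URLs contain colons.
        key, sep, value = line.partition(":")
        if not sep or not key.strip():
            continue  # skip blank or malformed lines
        record[key.strip()] = value.strip()
    return record
```

Records parsed this way can be indexed by their stable identifier, for example with `{rec["AC"]: rec for rec in records}`.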
    

UniProt release 12.1

Published August 21, 2007

Headlines

Over 18,500 phosphorylation sites identified by mass spectrometry in UniProtKB/Swiss-Prot

Phosphorylation is a key reversible modification that regulates protein function, subcellular localization, stability, and interactions. It is believed that up to 30% of all eukaryotic proteins may be phosphorylated.

During the last few years, phosphoproteomics has greatly improved due to the optimization of enrichment protocols for phosphoproteins and phosphopeptides, better fractionation techniques using chromatography, and improvement of mass spectrometry instrumentation. Thanks to these developments, it is now possible to analyze entire phosphorylation sets rapidly. However, protein and phosphorylation site identification by mass spectrometry is crucially dependent on the quality and completeness of the biological resource used for analysis.

In UniProtKB/Swiss-Prot, we make a special effort to document post-translational modifications and especially phosphorylation sites, using data from the literature. We have incorporated data from 38 high-quality phosphoproteomics studies which have allowed us to annotate or confirm 18,556 phosphorylation sites in 6,493 protein entries, mainly from human (45%), mouse (27%) and yeast (25%), but also from rat, Arabidopsis thaliana and bacteria. These high-throughput studies can be easily recognized among other UniProtKB references through the [LARGE SCALE ANALYSIS] tag appearing in the 'cited for' topic in the 'References' section (RP line in the flat file).

Click here to obtain the complete list of UniProtKB/Swiss-Prot entries having at least one phosphorylation site found in proteomic studies.

UniProtKB News

Changes concerning cross-references to RZPD-ProtExp

Cross-references to the RZPD-ProtExp have been removed.

UniProt release 12.0

Published July 24, 2007

Headlines

UniProtKB/Swiss-Prot major release (54.0)

Release 54.0 of 24-Jul-07 of UniProtKB/Swiss-Prot contains 276'256 sequence entries, comprising 101'466'206 amino acids abstracted from 158'294 references. 7'104 sequences have been added since release 53.0: this represents a 3% increase. In addition, the sequence data of 690 existing entries have been updated and the annotations of 269'152 entries have been revised.

The following improvements were carried out in the last 2 months:

  • We have introduced a new topic to indicate the evidence for the existence of a given protein. 5 levels of evidence have been defined: 1. evidence at protein level (e.g. clear identification by mass spectrometry), 2. evidence at transcript level (e.g. Northern blot), 3. inferred by homology (strong sequence similarity to known proteins in related species), 4. predicted and 5. uncertain (e.g. dubious sequences that could be the erroneous translation of a pseudogene).
  • We have added cross-references to 3 new databases: DisProt, PeptideAtlas, and PharmGKB.
  • We have enriched the controlled vocabulary for post-translational modification description with 14 new terms: 8 for the feature key 'Cross-link' (mostly for the description of cyclopeptides), 1 for the feature key 'Lipidation' and 5 for the feature key 'Modified residue'.
  • We have added 13 new keywords, 6 have been modified and 1 deleted.

UniProt Knowledgebase release 12.0 includes Swiss-Prot release 54.0 and TrEMBL release 37.0.

UniProtKB News

Introduction of a new topic to document protein existence

Most protein sequences are derived from translations of gene predictions. Some of them exhibit strong sequence similarity to known proteins in closely related species. For other proteins there is experimental evidence, such as Edman sequencing, clear identification by mass spectrometry (MSI), X-ray or NMR structure, detection by antibodies, etc. To indicate these different levels of evidence for the existence of a protein, we have introduced a new topic in each entry. The 'protein existence' topic does not describe the accuracy or correctness of a sequence displayed in UniProtKB, but the evidence for the existence of a protein. It may happen that the protein sequence is not entirely accurate, especially for sequences derived from gene predictions from genomic sequences.

In the flat file, this information is provided in the new PE line. The format is:

     PE   Level: Evidence;
    
With the following values:
  • 1: Evidence at protein level
  • 2: Evidence at transcript level
  • 3: Inferred from homology
  • 4: Predicted
  • 5: Uncertain

Example:

     PE   1: Evidence at protein level;
    

The document 'Criteria for protein existence' lists the criteria used to assign a PE level to entries.

The PE line is found between the DR and KW lines of UniProtKB entries.
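A reader of the flat file can check a PE line against the five levels listed above. The sketch below assumes the exact line layout shown in the example; the names `PE_LEVELS`, `PE_PATTERN` and `parse_pe_line` are hypothetical.

```python
import re

# The five levels and their wording, as listed above.
PE_LEVELS = {
    1: "Evidence at protein level",
    2: "Evidence at transcript level",
    3: "Inferred from homology",
    4: "Predicted",
    5: "Uncertain",
}

PE_PATTERN = re.compile(r"^PE\s+([1-5]): (.+);$")


def parse_pe_line(line):
    """Return (level, evidence) for a PE line, or raise ValueError."""
    match = PE_PATTERN.match(line.strip())
    if match is None:
        raise ValueError("not a valid PE line: %r" % line)
    level = int(match.group(1))
    evidence = match.group(2)
    # Reject lines whose wording does not belong to the stated level.
    if PE_LEVELS[level] != evidence:
        raise ValueError("level %d does not match %r" % (level, evidence))
    return level, evidence
```

Returning the level as an integer makes it easy to filter entries, for instance keeping only those with level 1 or 2.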

Modification of the citation of direct submissions to UniProtKB

In the flat file, the format of the RL (Reference Location) line for submissions used to be:

     RL   Submitted (MMM-YYYY) to DatabaseName.
    

Where:

  • MMM is the month
  • YYYY is the year
  • DatabaseName is one of:
    • the EMBL/GenBank/DDBJ databases
    • Swiss-Prot
    • the PDB data bank
    • the PIR data bank

We have replaced the DatabaseName value Swiss-Prot by UniProtKB.
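Tools that compare old and new flat files may want to normalize the older wording. As a minimal sketch, assuming a three-letter month abbreviation and the RL layout shown above, the rewrite could look as follows; the names `RL_SUBMISSION` and `update_rl_line` are hypothetical.

```python
import re

# Matches only direct submissions whose DatabaseName is exactly 'Swiss-Prot'.
RL_SUBMISSION = re.compile(
    r"^(RL\s+Submitted \([A-Za-z]{3}-\d{4}\) to )Swiss-Prot\.$"
)


def update_rl_line(line):
    """Replace the DatabaseName 'Swiss-Prot' by 'UniProtKB' in a submission RL line."""
    return RL_SUBMISSION.sub(r"\1UniProtKB.", line)
```

Lines citing the other database names, or any other kind of RL line, pass through unchanged.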

Change in UniProtKB release frequency

UniProtKB release frequency used to be every other week. From now on, UniProtKB will be updated every 3 weeks.

Changes concerning keywords (KW line)

New keywords:

Modified keywords:

Deleted keyword:

  • Hypothetical protein

UniProt release 11.3

Published July 10, 2007

Headlines

Knottins or how to knit in the protein world

Knottins (also called inhibitor cystine knots or ICKs) are small disulfide-rich proteins characterized by a special "disulfide through disulfide knot". This knot is obtained when one disulfide bridge crosses the macrocycle formed by two other disulfides and the interconnecting backbone (disulfide 3-6 goes through disulfides 1-4 and 2-5).

The knottin structure is found in many unrelated families, such as plant protease inhibitors, cyclotides, toxins from cone snails, spiders, insects, horseshoe crabs and scorpions, gurmarin-like peptides, agouti-related proteins, and antimicrobial peptides.

In collaboration with Laurent Chiche (CNRS, Montpellier), about 450 UniProtKB/Swiss-Prot entries have been updated with knottin structural information. They can be retrieved with the newly introduced keyword Knottin.

Examples:

UniProtKB News

Cross-references to PharmGKB

Cross-references have been added to the PharmGKB database. PharmGKB curates information that establishes knowledge about the relationships among drugs, diseases and genes, including their variations and gene products. It is a repository for genetic, genomic, molecular and cellular phenotype data and clinical information about people who have participated in pharmacogenomics research studies. The data includes, but is not limited to, clinical and basic pharmacokinetic and pharmacogenomic research in the cardiovascular, pulmonary, cancer, pathways, metabolic and transporter domains.

The PharmGKB database is available at http://www.pharmgkb.org/.

The format of the explicit links in the flat file is:

Resource abbreviation PharmGKB
Resource identifier PharmGKB accession ID.
Example
Q96S55:
DR   PharmGKB; PA134982239; -.

Changes concerning keywords

New keywords:

UniProt release 11.2

Published June 26, 2007

Headlines

Obesity in the spotlight

Over the last 40 years, overweight and obesity have become a central health issue in a growing number of countries. Obesity comorbidities are severe and include cardiovascular diseases, diabetes, musculoskeletal disorders and some cancers. The two fundamental causes of obesity are clearly identified as an increased intake of high-fat and energy-dense diets and a decrease of physical activity. However, there is growing evidence that certain gene products have a direct or indirect influence on body mass.

In 1999, the mouse Fto gene was cloned and called Fatso, because of its large size (at least 250 kb). By a curious coincidence, the human orthologous protein was recently shown to predispose to childhood and adult obesity. The main culprits are intronic variations in the FTO gene. Carriers of one (or two) inherited copy (copies) of the variants have an increased risk of obesity of 30% or 70%, respectively. The function of Fatso is not yet known. This protein, along with other proteins involved in the development of obesity, can be retrieved from the UniProtKB/Swiss-Prot using the keyword Obesity.

UniProtKB News

Changes concerning keywords

New keyword:

Modified keywords:

UniProt release 11.1

Published June 12, 2007

Headlines

4,000 bovine entries in UniProtKB/Swiss-Prot

UniProtKB/Swiss-Prot is happy to announce the annotation of over 4,000 entries of a very popular animal in Switzerland, almost a national symbol: Bos taurus, in other words, the cow.

Those of you who have visited the Swiss Alps may know that their gorgeous scenery is definitely associated with the sound of cowbells in summer pasture. Similarly, the modern biology landscape would be poorer without bovine sequences, obviously not in a decorative role, but as a key element for our understanding of human biology.

The domesticated cow is extensively used in biomedical research, as an animal model and also as a source of biological material. Remember that bovine insulin was the first sequenced protein and was used for decades to treat diabetes. The first draft of the bovine genome sequence was released in October 2004 by the Human Genome Sequencing Center of the Baylor College of Medicine. The human and bovine genomes are more similarly organized than when either is compared to the mouse. Despite its interest, only a few large scale cDNA sequencing projects have been initiated. Currently more than 70% of the UniProtKB/Swiss-Prot bovine sequences come from translation of cDNA sequences produced by the NIH Mammalian Gene Collection and the Agricultural Research Service, US Department of Agriculture.

UniProtKB News

Cross-references to PeptideAtlas

Cross-references to the PeptideAtlas database have been added. PeptideAtlas is a multi-organism, publicly accessible compendium of peptides that have been identified in a large set of tandem mass spectrometry proteomics experiments. All results of sequence searching have subsequently been processed through PeptideProphet to derive a probability of correct identification for all results in a uniform manner to ensure a high-quality database. All peptides have been mapped to Ensembl and can be viewed as custom tracks on the Ensembl Genome Browser.

The format of the explicit links in the flat file is:

Resource abbreviation PeptideAtlas
Resource identifier UniProtKB accession number.
Example
P08524:
DR   PeptideAtlas; P08524; -.

Cross-references to DisProt

Cross-references to the Database of Protein Disorder (DisProt) have been added. DisProt is a curated database that provides information about proteins that lack fixed 3D structure in their putatively native states, either in their entirety or in part. DisProt is a collaborative effort between the Center for Computational Biology and Bioinformatics at Indiana University School of Medicine and the Center for Information Science and Technology at Temple University.

The format of the explicit links in the flat file is:

Resource abbreviation DisProt
Resource identifier DisProt accession number.
Example
P07293:
DR   DisProt; DP00228; -.
DR   DisProt; DP00440; -.

UniProt release 11.0

Published May 29, 2007

Full statistics and release notes

Headlines

UniProtKB/Swiss-Prot major release (53.0)

Release 53.0 of 29-May-07 of UniProtKB/Swiss-Prot contains 269'293 sequence entries, comprising 98'902'758 amino acids abstracted from 156'204 references. 9'228 sequences have been added since release 52.0: this represents a 3.5% increase. In addition, the sequence data of 734 existing entries have been updated and the annotations of 210'454 entries have been revised.

The following improvements were carried out in the last 3 months:

  • We have created a new comment subsection: SEQUENCE CAUTION. This subsection is used for all the comments related to submitted sequences which differ from the sequence shown in the entry because of conflicts, such as frameshifts, erroneous gene model predictions, erroneous translation or other discrepancies that cannot be described in the feature 'Conflict'.
  • We have added cross-references to 3 new databases: BuruList, a database dedicated to the analysis of Mycobacterium ulcerans genome, Orphanet, a database dedicated to information on rare diseases and orphan drugs, and PseudoCAP, a Pseudomonas genome database.

UniProt Knowledgebase release 11.0 includes Swiss-Prot release 53.0 and TrEMBL release 36.0.

We are pleased to announce a new UniProt database, the UniProt Metagenomic and Environmental Sequences (UniMES) database, a repository specifically developed for metagenomic and environmental data. UniMES is available in FASTA format on the UniProt ftp servers, in the new subdirectory current_release/unimes.

UniProtKB News

New ftp directory for UniProt Metagenomic and Environmental Sequences (UniMES)

We have added a new subdirectory, current_release/unimes, to the UniProt ftp servers to distribute metagenomic and environmental sequences.

New comment subsection SEQUENCE CAUTION

We have introduced a new subsection in the General Annotation (Comments) section, SEQUENCE CAUTION, to describe protein sequence reports that differ from the sequence that is shown in UniProtKB due to conflicts that are not described in the SEQUENCE CONFLICT lines (Features), such as frameshifts, erroneous gene model predictions, etc. This kind of information was previously reported in the topic CAUTION together with other warnings that are unrelated to sequence conflicts.

The format of the SEQUENCE CAUTION topic is:

     CC   -!- SEQUENCE CAUTION:
     CC       Sequence=Sequence; Type=Type;[ Positions=Positions;][ Note=Note;]
    

Where:

  • Sequence is the sequence which differs from the UniProtKB sequence. It is described by one of:
    • an EMBL protein identifier (with version number)
    • an EMBL accession number.
    • a literature reference (e.g. Ref. 3).
  • Type describes the cause for the sequence difference(s) and is one of:
    • Frameshift
    • Erroneous initiation
    • Erroneous termination
    • Erroneous gene model prediction
    • Erroneous translation
    • Miscellaneous discrepancy
  • Positions describes the UniProtKB sequence position(s) or range(s) of the difference(s) where possible. Sometimes the term 'Several' is used to indicate that there are many differences.
  • Note is an optional free text explanation.

These lines are not wrapped and their length may therefore exceed 75 characters.

Examples:

Q93W20:
     Previous annotation:
     CC   -!- CAUTION: Ref.2 (BAA97015) sequence differs from that shown due to
     CC       erroneous gene model prediction. The predicted gene At5g49940 has
     CC       been split into 2 genes: At5g49940 and At5g49945.
     New annotation:
     CC   -!- SEQUENCE CAUTION:
     CC       Sequence=BAA97015.1; Type=Erroneous gene model prediction; Note=The predicted gene At5g49940 has been split into 2 genes: At5g49940 and At5g49945;
    
Q83M39:
     Previous annotation:
     CC   -!- CAUTION: Ref.1 and Ref.2 sequences differ from that shown due to a
     CC       stop codon at position 273 which was translated as Gln to extend
     CC       the sequence.
     New annotation:
     CC   -!- SEQUENCE CAUTION:
     CC       Sequence=AAN42076.1; Type=Erroneous termination; Positions=273; Note=Translated as Gln;
     CC       Sequence=AAP15953.1; Type=Erroneous termination; Positions=273; Note=Translated as Gln;
    
P17814:
     Previous annotation:
     CC   -!- CAUTION: Ref.1 (CAA36850) sequence differs from that shown due to
     CC       a frameshift in position 496.
     CC   -!- CAUTION: Ref.1 (CAA36850) sequence differs from that shown due to
     CC       erroneous gene model prediction.
     New annotation:
     CC   -!- SEQUENCE CAUTION:
     CC       Sequence=CAA36850.1; Type=Erroneous gene model prediction;
     CC       Sequence=CAA36850.1; Type=Frameshift; Positions=496;
    
P0A7B3:
     Previous annotation:
     CC   -!- CAUTION: Ref.4 (X07863) sequence differs from that shown due to
     CC       several frameshifts.
     CC   -!- CAUTION: Ref.5 (Y00357) sequence differs from that shown due to
     CC       frameshifts in positions 204, 215 and 282.
     New annotation:
     CC   -!- SEQUENCE CAUTION:
     CC       Sequence=X07863; Type=Frameshift; Positions=Several;
     CC       Sequence=Y00357; Type=Frameshift; Positions=204, 215, 282;
    
P27612:
     Previous annotation:
     CC   -!- CAUTION: Ref.2 (AAA39943) sequence differs from that shown due to
     CC       frameshifts in positions 4, 32, and 42.
     CC   -!- CAUTION: Ref.2 (AAA39943) sequence differs from that shown due to
     CC       contaminating sequence.
     CC   -!- CAUTION: Ref.3 sequence differs from that shown due to a
     CC       frameshift in position 697.
     New annotation:
     CC   -!- SEQUENCE CAUTION:
     CC       Sequence=AAA39943.1; Type=Miscellaneous discrepancy; Note=Several frameshifts and contaminating sequence;
     CC       Sequence=Ref.3; Type=Frameshift; Positions=697;
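Because each SEQUENCE CAUTION item sits on a single unwrapped line, it can be matched with one regular expression. The sketch below assumes the field order given in the format description (Sequence, Type, then the optional Positions and Note); the names `CAUTION_PATTERN` and `parse_sequence_caution` are hypothetical.

```python
import re

# Sequence and Type are mandatory; Positions and Note are optional.
CAUTION_PATTERN = re.compile(
    r"^CC\s+Sequence=(?P<sequence>[^;]+); Type=(?P<type>[^;]+);"
    r"(?: Positions=(?P<positions>[^;]+);)?"
    r"(?: Note=(?P<note>.+);)?$"
)


def parse_sequence_caution(lines):
    """Return one dict per 'Sequence=...' line; missing optional fields are None."""
    entries = []
    for line in lines:
        match = CAUTION_PATTERN.match(line.strip())
        if match:  # the '-!- SEQUENCE CAUTION:' header line does not match
            entries.append(match.groupdict())
    return entries
```

For the P0A7B3 example this gives two entries of type Frameshift, one with positions `Several` and one with `204, 215, 282`; splitting the positions further is left to the caller.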
    

Change in comment subsection SUBCELLULAR LOCATION

From now on, the comment subsection SUBCELLULAR LOCATION may occur more than once per entry.

Cross-references to PseudoCAP

Cross-references have been added to the Pseudomonas aeruginosa Community Annotation Project database. This database provides genome annotation of P. aeruginosa strain PAO1 and of other Pseudomonas species, acting as a valuable comparative resource for P. aeruginosa research, as well as being useful for the larger Pseudomonas research community. Over the coming year this database will be further enhanced toward more focus on comparative analysis of P. aeruginosa isolates and more specific information about putative drug and vaccine targets.

The Pseudomonas aeruginosa Community Annotation Project database is available at http://www.pseudomonas.com/.

The format of the explicit links in the flat file is:

Resource abbreviation PseudoCAP
Resource identifier Ordered locus name.
Example
Q9I576:
DR   PseudoCAP; PA0865; -.

Cross-references to Orphanet

Cross-references have been added to the Orphanet database. This database is dedicated to information on rare diseases and orphan drugs. It aims to improve management and treatment of genetic, auto-immune or infectious rare diseases, rare cancers, or not yet classified rare diseases. ORPHANET offers services adapted to the needs of patients and their families, health professionals and researchers, support groups and industry.

The Orphanet database is available at http://www.orpha.net/consor/cgi-bin/home.php?Lng=GB.

The format of the explicit links in the flat file is:

Resource abbreviation Orphanet
Resource identifier Orphanet unique disease identifier.
Optional information 1 Name of the disease.
Example
P26439:
DR   Orphanet; 418; Adrenal hyperplasia, congenital.
DR   Orphanet; 3185; Stein-Leventhal syndrome.

Changes concerning keywords

New keyword:

UniProt release 10.5

Published May 15, 2007

Headlines

While UniProt is a central resource for biologists, some specialized information is beyond the scope of our database. Therefore we link UniProtKB entries to more specialized resources:

  • the cross-reference section links an entry to up to almost 100 different databases
  • the web resource section links an entry to other related web pages

We recently added links to the free encyclopedia Wikipedia in the web resource section. Proteins with a link to Wikipedia are mainly of medical or pharmaceutical interest. Wikipedia articles may describe the discovery of the protein and its use in medicine.

Examples:

UniProt release 10.4

Published May 1, 2007

Headlines

We have introduced the oldest fossil protein sequence to date into UniProtKB/Swiss-Prot, i.e. several peptides from collagen (P0C2W2, P0C2W3, P0C2W4) which were extracted from a 68-million-year-old dinosaur: Tyrannosaurus rex. These collagen sequences were obtained by mass spectrometry analysis directly from soft tissue that remained in fossilized bones, which were unearthed from rocks in the Hell Creek Formation of eastern Montana, US.

Interestingly, Tyrannosaurus rex collagen is similar to chicken collagen, and similarities have also been found with frog and newt protein. The finding is consistent with the idea that we can trace a direct evolutionary line between birds and dinosaurs (for more information: PMID 17431180).

The discovery of protein in the soft tissue of dinosaur bone is a surprise - it was not thought that such organic material could survive this long. "The pathways of cellular decay are well known for modern organisms. And extrapolations predict that all organic matter vanishes within 100,000 years, maximum" (BBC news).

Until now, the oldest fossil protein sequence in UniProtKB/Swiss-Prot was a RuBisCO large subunit from a fossil leaf of a Miocene (17-20 million years old) Magnolia, P30828 (see headline release 1.7).

You can get all these aged proteins by clicking on the keyword Extinct organism protein.

Other reference:

Protein Spotlight (May 2004) Small blast from the past

UniProtKB News

Changes concerning keywords

Modified keyword:

UniProt release 10.3

Published April 17, 2007

Headlines

Complete set of Arabidopsis thaliana F-box proteins

F-box proteins play a major role in the ubiquitin conjugation pathway. They are involved in the third step of this pathway. Most F-box proteins contain a conserved F-box domain near the N-terminus and a variable region. The F-box domain can interact with Cullin and one of the SKP1 proteins to form an E3 SCF (SKP1/Cullin/F-box) ubiquitin ligase complex. The variable region interacts with a specific protein, which is, in turn, ubiquitinated and thus targeted for degradation. This variable region confers the specificity of the SCF complex.

The whole set of Arabidopsis thaliana F-box proteins, more than 630 sequences, has been manually reviewed and integrated into UniProtKB/Swiss-Prot. About 120 wrong gene model predictions have been corrected, including 26 F-box proteins obtained by splitting erroneous gene predictions covering more than one gene. This represents one of the largest protein families of a given species that has ever been integrated into UniProtKB/Swiss-Prot.

In A. thaliana, almost half of the F-box proteins contain a combination of different domains, which is used to define subgroups:

>300 FB F-box alone
91 FBL F-box associated with LRR-repeat
124 FBK F-box associated with Kelch-repeat
30 FBD F-box associated with FBD
41 FDL F-box associated with FBD and LRR-repeat
4 FBLK F-box associated with LRR-repeat and Kelch-repeat

Among this large protein family, fewer than 30 members have been characterized: their functions are various and include flowering, circadian cycle, hormone signaling, and plant defense.

Related entries:

UniProt release 10.2

Published April 3, 2007

Headlines

Spider dermonecrotic toxin family

Loxosceles is the genus of spiders that includes the infamous brown recluse spider Loxosceles reclusa. These spiders, also called violin spiders or fiddleback spiders because of violin-like marks on their cephalothorax, are brownish-yellow in color, and spin small, irregular webs under rocks, or in nooks and crannies of your house. These spiders are found in the USA, South America, Europe and Africa. Their most characteristic feature is actually their eyes: most spiders have eight eyes, but Loxosceles have six, arranged in three pairs, or dyads, that sit side-by-side.

The bite of a Loxosceles spider is not deadly, but it is very unpleasant - the venom is necrotoxic, causing tissue to die and fall off. Pain usually doesn't begin until 6-12 hours after the bite occurs. Loxosceles' necrotoxic venom is cytotoxic and hemolytic. It contains at least 8 enzymes; the enzyme thought to be responsible for most of the destructive effects is called Sphingomyelinase D. This enzyme catalyzes the hydrolysis of sphingomyelin and causes hemolysis and dermonecrosis.

The annotation of this family of toxins has just been updated in UniProtKB/Swiss-Prot (e.g. Q8I914 and P83045).

UniProtKB News

Cross-references to BuruList

Cross-references have been added to the Mycobacterium ulcerans genome database. This database is dedicated to the analysis of the genome of Mycobacterium ulcerans, the Buruli ulcer bacillus: BuruList. BuruList provides a complete dataset of DNA and protein sequences derived from the epidemic strain Agy99, linked to the relevant annotations and functional assignments. It allows one to easily browse through these data and retrieve information, using various criteria (gene names, location, keywords, etc.).

The Mycobacterium ulcerans genome database is available at http://genolist.pasteur.fr/BuruList/.

The format of the explicit links in the flat file is:

Resource abbreviation BuruList
Resource identifier Ordered locus name.
Example
A0PW55:
DR   BuruList; MUL_4631; -.

Changes concerning keywords

New keyword:

UniProt release 10.1

Published March 20, 2007

Headlines

Koala genome invaded by a new retrovirus

Endogenous retroviruses are vestiges of ancestral viral infections that were incorporated long ago into a host's genome. Surprisingly, 8% of the human genome is composed of such "fossil" viruses (1). The most recent endogenization event known so far involved a porcine virus that entered its host approximately 5,000 years ago.

Recently, a new endogenous retrovirus was identified in Australian koala populations.

Koalas were largely exterminated on mainland southern Australia in the late nineteenth century. Populations were established on a small number of islands in the early 1900s and have remained isolated since the 1920s. These populations have since been used to restock the mainland.

The new Koala retrovirus (KoRV) has only been found in mainland populations, suggesting that this virus entered the koala species in the last 100 years (2). This retrovirus is both endogenous and fully functional, meaning that it spreads both by contact and by heredity, and is still in the process of invading the koala genome. KoRV is very similar to Gibbon Ape Leukemia Virus (GALV), and these two retroviruses are thought to have diverged very recently. This suggests a scenario in which a monkey retrovirus has crossed species to enter a newly established koala population and has started to colonize the koala genome.

The KoRV is unique in that we are observing the initial entry of a new family of endogenous retrovirus into a wild host genome. The dynamic interaction between this virus and its new host provides a unique opportunity to study the process of endogenization and its impact on species development and evolution.

Related entries:

References:

  1. Griffiths D.J.
    Endogenous retroviruses in the human genome sequence
    Genome Biology 2:reviews1017.1-1017.5 (2001).
  2. Tarlinton R.E., Meers J., Young P.R.
    Retroviral invasion of the koala genome
    Nature 442:79-81 (2006).

UniProt release 10.0

Published March 6, 2007

Full statistics and release notes

Headlines

UniProtKB/Swiss-Prot major release (52.0)

UniProt Knowledgebase release 10.0 includes Swiss-Prot release 52.0 and TrEMBL release 35.0.

UniProtKB/Swiss-Prot release 52.0 of March 6, 2007 contains 260'175 sequence entries, comprising 95'002'661 amino acids abstracted from 152'564 references. 18'986 sequences have been added since release 51.0: this represents an increase of 7.3%. In addition, the annotations of 190'910 entries have been revised.

Many improvements were carried out in the last 4 months:

  • We have reintroduced the initiator methionine: the UniProtKB sequence data now always corresponds to the precursor form of a protein, i.e. before post-translational modifications such as cleavage of the signal peptide or other events (except of course when the sequence data are derived from direct protein sequencing).
  • We have extended the UniProtKB accession number format: the first character can now be any of the 26 letters (instead of only O, P and Q).
  • We have added cross-references to 7 new databases and introduced a new molecule type for the EMBL cross-references (Viral_cRNA).

UniProtKB/Swiss-Prot (flat file version) reached 1 gigabyte (GB) in size with this major release! For comparison, the human genome contains 0.791175 GB of data (3.1647 x 10^9 base pairs encoded at 2 bits each).
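
The comparison figure can be checked with a few lines of arithmetic. This is only a sketch: it assumes a decimal gigabyte (10^9 bytes) and the 2-bit encoding mentioned in the text, and the function name is invented for illustration.

```python
# Back-of-the-envelope check of the genome-size comparison.
BASE_PAIRS = 3_164_700_000   # 3.1647 x 10^9 base pairs (figure from the text)
BITS_PER_BASE = 2            # A, C, G and T each fit in 2 bits
BYTES_PER_GB = 10**9         # assumption: decimal gigabyte


def genome_size_gb(base_pairs: int = BASE_PAIRS) -> float:
    """Return the size in GB of a genome packed at 2 bits per base pair."""
    total_bits = base_pairs * BITS_PER_BASE
    total_bytes = total_bits // 8
    return total_bytes / BYTES_PER_GB


print(genome_size_gb())
```

With these assumptions the result agrees with the 0.791175 GB quoted in the text.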

UniProtKB News

Format change in the dbxref.txt document file

The dbxref.txt file lists the names and abbreviations and URLs of all databases cross-referenced in the UniProt Knowledgebase. We have added a new optional field, "Ref". This field contains the database reference in the following format:

Ref   : Journal_abbrev Volume:First_page-Last_page(YYYY); [PubMed=Pubmed_identifier; ][DOI=Digital_object_identifier;]

Example:

     Abbrev: PROSITE
     Name  : PROSITE; a protein domain and family database
     Ref   : Nucleic Acids Res. 34:D227-D230(2006); PubMed=16381852; DOI=10.1093/nar/gkj063;
     LinkTp: Explicit
     Server: http://www.expasy.org/prosite/
     Db_URL: www.expasy.org/cgi-bin/get-prosite-raw.pl?%s
     Cat   : Family and domain databases
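
As an illustration, the "Ref" field could be split into its components roughly as follows. This is a minimal sketch based only on the format line and the PROSITE example in this note; the function name and the returned dictionary keys are our own choices, not part of any UniProt tool.

```python
import re

# Journal_abbrev Volume:First_page-Last_page(YYYY);
REF_PATTERN = re.compile(
    r"^(?P<journal>.+?)\s+(?P<volume>[^:\s]+):"
    r"(?P<first_page>[^-\s]+)-(?P<last_page>[^(\s]+)"
    r"\((?P<year>\d{4})\);"
)


def parse_ref(value: str) -> dict:
    """Parse the value of a 'Ref' field from a dbxref.txt record."""
    value = value.strip()
    match = REF_PATTERN.match(value)
    if match is None:
        raise ValueError(f"Unrecognised Ref value: {value!r}")
    fields = match.groupdict()
    # PubMed and DOI are optional and follow the citation.
    rest = value[match.end():]
    pubmed = re.search(r"PubMed=(\d+);", rest)
    doi = re.search(r"DOI=([^;]+);", rest)
    fields["pubmed"] = pubmed.group(1) if pubmed else None
    fields["doi"] = doi.group(1) if doi else None
    return fields


example = ("Nucleic Acids Res. 34:D227-D230(2006); "
           "PubMed=16381852; DOI=10.1093/nar/gkj063;")
print(parse_ref(example))
```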
    

Changes concerning keywords

New keyword:

UniProt release 9.7

Published February 20, 2007

Headlines

Complete human kinome in UniProtKB/Swiss-Prot!

Phosphorylation by protein kinases is a universal and fundamental cell-signalling process in eukaryotic cells. A comprehensive catalog of predicted human kinases was published in 2002 (Manning et al.).

We have annotated the 518 protein kinases predicted to exist and, when necessary, revised their sequences. The human kinome as defined by Manning et al. is now complete in UniProtKB/Swiss-Prot!

These protein kinases are subdivided into 10 groups:

  • AGC (containing PKA, PKG and PKC families)
  • CAMK (Calcium/Calmodulin-dependent protein kinase)
  • CK1 (Casein kinase 1)
  • CMGC (containing CDK, MAPK, GSK3 and CLK families)
  • RGC (Receptor guanylate cyclases)
  • STE (homologs of yeast Sterile 7, 11 and 20 kinases)
  • TK (Tyrosine kinase)
  • TKL (Tyrosine kinase-like)
  • Atypical
  • Other

In addition to these 518 protein kinases, there is currently one family of lipid kinases which is being fully characterized: the phosphatidylinositol 3-kinase (PI3 kinase) family (PI3 kinome). This emerging family appears to also include the phosphatidylinositol 4-kinases (PI4 kinases). PI4 kinases and PI3 kinases share the same catalytic kinase domain. However, this domain is only distantly related to the catalytic domain of the protein kinases, and as a consequence they belong to a separate family. This lipid kinase family will soon be integrated into UniProtKB/Swiss-Prot.

Mouse kinase orthologs are all in the process of being integrated into UniProtKB/Swiss-Prot. By providing annotated and up-to-date human and mouse kinomes to the scientific community, our knowledgebase becomes a central reference portal for kinases.

UniProtKB News

Cross-references to CYGD

Cross-references have been added to the MIPS Comprehensive Yeast Genome Database. This database aims to present information on the molecular structure and functional network of the entirely sequenced, well-studied model eukaryote, the budding yeast Saccharomyces cerevisiae. In addition the data of various projects on related yeasts are used for comparative analysis.

The CYGD is available at http://mips.gsf.de/genre/proj/yeast.

The format of the explicit links in the flat file is:

Resource abbreviation CYGD
Resource identifier Ordered locus name.
Example
P35688:
DR   CYGD; YDL240w; -.

Cross-references to EMBL: new molecule type

We added the value Viral_cRNA to the controlled vocabulary of the field MoleculeType of the cross-references to the EMBL nucleotide sequence database. The format of the DR EMBL line is:

DR   EMBL; AccessionNumber; ProteinID; StatusIdentifier; MoleculeType.

The controlled vocabulary of the field MoleculeType is:

  • Genomic_DNA
  • Genomic_RNA
  • pre-RNA
  • mRNA
  • Unassigned_DNA
  • Unassigned_RNA
  • Other_DNA
  • Other_RNA
  • Viral_cRNA
  • -
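
A DR EMBL line could be checked against this controlled vocabulary along the following lines. This is a sketch only: the function name is ours, and the sample line uses made-up accession and protein identifiers purely to show the layout.

```python
# Controlled vocabulary of the MoleculeType field, as listed in this note.
MOLECULE_TYPES = {
    "Genomic_DNA", "Genomic_RNA", "pre-RNA", "mRNA",
    "Unassigned_DNA", "Unassigned_RNA", "Other_DNA", "Other_RNA",
    "Viral_cRNA", "-",
}


def parse_embl_dr(line: str) -> dict:
    """Split a 'DR   EMBL; ...' line and validate its MoleculeType."""
    if not line.startswith("DR   EMBL;"):
        raise ValueError("Not a DR EMBL line")
    body = line[len("DR"):].strip().rstrip(".")
    parts = [part.strip() for part in body.split(";")]
    if len(parts) != 5:
        raise ValueError(f"Expected 5 fields, found {len(parts)}")
    _, accession, protein_id, status, molecule_type = parts
    if molecule_type not in MOLECULE_TYPES:
        raise ValueError(f"Unknown MoleculeType: {molecule_type!r}")
    return {
        "accession": accession,
        "protein_id": protein_id,
        "status": status,
        "molecule_type": molecule_type,
    }


# Hypothetical line: the identifiers below are placeholders, not real entries.
print(parse_embl_dr("DR   EMBL; X00000; CAA00000.1; -; Viral_cRNA."))
```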

Changes concerning keywords

New keywords:

UniProt release 9.6

Published February 6, 2007

Headlines

One million comment lines in UniProtKB/Swiss-Prot!

Annotation is the focal point of our effort to maintain and develop UniProtKB/Swiss-Prot. Much of our manual annotation is found in the comment lines, which aim to provide a summary of what is known about a protein. There are 27 different types of comment line, which are arranged according to what we designate as 'topics'.

Recently, we reached the milestone of 1 million CC topic lines. About 97 % of the UniProtKB/Swiss-Prot entries contain at least one CC topic line and, currently, there is an average of 4 different CC topic lines per entry.

Comment lines are mainly free text, but we have already set up a standardised format as well as the use of controlled vocabularies for several topics (ALTERNATIVE PRODUCTS, BIOPHYSICOCHEMICAL PROPERTIES, CATALYTIC ACTIVITY, DISEASE, INTERACTION, MASS SPECTROMETRY, PATHWAY, RNA EDITING, SIMILARITY, TOXIC DOSE...). Standardisation of two further topics, SUBCELLULAR LOCATION and CAUTION, is also under way (more: Forthcoming changes).

The most represented CC topics in UniProtKB/Swiss-Prot are:

  • SIMILARITY: describes the similarities (at the sequence or structure level) of a protein with other proteins or families of proteins. It can be found in 91 % of the entries. This topic can however be considered an 'outsider' in this ranking, as its content is mostly based on protein sequence analysis (family and domain scans or similarity searches such as BLAST) rather than on literature-based annotation.
  • FUNCTION: gives a general description of the function(s) of a protein. It is found in 68 % of the entries. Note that over 40 % of this functional data has been experimentally proven.
  • SUBCELLULAR LOCATION: describes the subcellular location of the mature protein; this information is found in 54 % of the entries (39 % experimentally proven).
  • SUBUNIT: describes the quaternary structure of a protein and any kind of interactions with other proteins or protein complexes (found in 37 % of the entries).
  • CATALYTIC ACTIVITY: describes the reaction(s) catalyzed by an enzyme (found in 35 % of the entries). There are at least as many CC CATALYTIC ACTIVITY line(s) as complete EC number(s) in the 'Protein name' section.

Such a distribution reflects the type of experimental biological data which is available for a protein sequence nowadays in the scientific literature.

The data found in UniProtKB/Swiss-Prot are continuously updated and, since annotators are constantly improving their skills in literature-based information retrieval, the 'depth' of manual annotation is always increasing. This is highlighted by the fact that we have increased the average number of CC topics per entry from 3.5 to 4 since March 2004.

UniProtKB News

Cross-references to Cornea-2DPAGE

Cross-references have been added to the Human Cornea 2-DE database, a two-dimensional polyacrylamide gel electrophoresis federated database available at the Aarhus University (Denmark).

The Cornea-2DPAGE is available at http://www.cornea-proteomics.com/.

The format of the explicit links in the flat file is:

Resource abbreviation Cornea-2DPAGE
Resource identifier UniProtKB accession number.
Optional information 1 Organism common name.
Example
P31946:
DR   Cornea-2DPAGE; P31946; HUMAN.

Cross-references to DOSAC-COBS-2DPAGE

Cross-references have been added to the DOSAC-COBS 2D Page, a two-dimensional polyacrylamide gel electrophoresis federated database available at the DOSAC and COBS genome and proteome laboratory (La Maddalena, Italy).

The DOSAC-COBS-2DPAGE is available at http://www.dosac.unipa.it/2d/.

The format of the explicit links in the flat file is:

Resource abbreviation DOSAC-COBS-2DPAGE
Resource identifier UniProtKB accession number.
Optional information 1 Organism common name.
Example
P15531:
DR   DOSAC-COBS-2DPAGE; P15531; HUMAN.

Cross-references to REPRODUCTION-2DPAGE

Cross-references have been added to the REPRODUCTION-2DPAGE, a two-dimensional polyacrylamide gel electrophoresis database available at the laboratory of Reproductive Medicine, Nanjing Medical University, P. R. China.

The REPRODUCTION-2DPAGE is available at http://reprod.njmu.edu.cn/cgi-bin/2d/2d.cgi.

The format of the explicit links in the flat file is:

Resource abbreviation REPRODUCTION-2DPAGE
Resource identifier UniProtKB accession number.
Optional information 1 Organism common name.
Example
P32119:
DR   REPRODUCTION-2DPAGE; P32119; HUMAN.

UniProt release 9.5

Published January 23, 2007

Headlines

Reintroduction of the initiator methionine

In UniProtKB/Swiss-Prot, the sequence data corresponds to the precursor form of a protein, i.e. before post-translational modifications such as cleavage of the signal peptide or other processing. However, for historical reasons, a notable exception was made: when the initiator methionine was post-translationally removed, the sequence stored in UniProtKB/Swiss-Prot did not include the methionine and instead started with the second residue.

As a consequence, our sequence data differed from that shown in other sequence databases where the initiator methionine is usually not removed. This discrepancy was confusing for users and was the subject of one of the most frequently asked questions to UniProtKB/Swiss-Prot.

This is no longer the case. With this release, the initiator methionine has been reintroduced in all UniProtKB/Swiss-Prot entries (over 10'000) in which it is cleaved. This was a major change, since all amino acid positions described in these entries have been updated to reflect the new sequence numbering.

The cleavage of the initiator methionine is still indicated by the INIT_MET line in the feature table, but the sequence position is 1 instead of 0. We have also added the comment Removed in the description field of the INIT_MET line to indicate that the initiator methionine is indeed removed post-translationally.

Example P51487:

Previous format:

FT   INIT_MET      0      0
FT   CHAIN         1    400       Phosrestin-1.
...
SQ   SEQUENCE   400 AA;  44781 MW;  DA786D7E9FFB4A29 CRC64;
      VVSVKVFKK ATPNGKVTFY LGRRHFIDHF DYIDPVDGVI VVDPDYLKNR KVFAQLATIY

New format:

FT   INIT_MET      1      1       Removed.
FT   CHAIN         2    401       Phosrestin-1.
...
SQ   SEQUENCE   401 AA;  44912 MW;  1212C2422CD35A94 CRC64;
     MVVSVKVFKK ATPNGKVTFY LGRRHFIDHF DYIDPVDGVI VVDPDYLKNR KVFAQLATIY
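
The renumbering shown in this example amounts to prepending one residue and shifting every feature position by one. The sketch below illustrates the idea on the first residues of P51487; the function name and the tuple layout for features are assumptions made for the example, not the procedure actually used for the release.

```python
from typing import List, Tuple

Feature = Tuple[str, int, int]  # (feature key, start, end)


def reintroduce_init_met(
    sequence: str, features: List[Feature]
) -> Tuple[str, List[Feature]]:
    """Prepend the initiator methionine and shift all positions by one.

    In the previous format INIT_MET was placed at position 0, so the
    same +1 shift moves it to position 1.
    """
    new_sequence = "M" + sequence
    new_features = [(key, start + 1, end + 1) for key, start, end in features]
    return new_sequence, new_features


# First residues of the P51487 sequence in the previous format.
old_fragment = "VVSVKVFKK"
old_features = [("INIT_MET", 0, 0), ("CHAIN", 1, 400)]

new_fragment, new_features = reintroduce_init_met(old_fragment, old_features)
print(new_fragment)
print(new_features)
```

Applied to the old feature table, this gives INIT_MET at 1..1 and CHAIN at 2..401, matching the new format above.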

UniProt release 9.4

Published January 9, 2007

Headlines

Complete yeast proteome in UniProtKB/Swiss-Prot

Brewer's yeast and baker's yeast are two common names for the species Saccharomyces cerevisiae, for which the scientifically correct name was first applied to a strain observed in malt circa 1837. These common names neatly reflect the major interests this organism holds for the majority of people. It is one of the earliest "domesticated" organisms, and while initially appreciated for its alcohol producing or dough leavening capabilities, the simple yeast soon became an important organism for research too.

The ease with which yeast can be cultivated and genetically manipulated made it a useful tool in the early days of biotechnological and biomedical research, where it was utilized for the production of pharmaceuticals and enzymes (a name derived from the Greek 'en zyme', meaning 'in yeast' or 'in leaven'). S. cerevisiae has subsequently proven to be an extremely useful experimental model system for the study of the basic biological structures and processes of the eukaryotic cell. It is therefore not surprising that it was one of the first eukaryotic species targeted by large-scale sequencing efforts, and in 1996, researchers were able to celebrate the completion of the first eukaryotic genome sequence.

One decade later, and coincident with the 20th anniversary of Swiss-Prot, yeast is again in the headlines, representing the first complete eukaryotic proteome integrated into Swiss-Prot, the manually curated section of the UniProt knowledgebase. In the current release of UniProtKB/Swiss-Prot there are more than 6000 yeast entries containing every gene of the yeast genome believed to code for a protein. Each entry contains literature-curated annotations and numerous cross-references, the locus identifier, which maps a protein to its corresponding genomic locus, and a cross-reference to the Saccharomyces Genome Database (SGD; www.yeastgenome.org), the community-designated repository for the reference genome sequence. A summary of all yeast entries including these references is listed in the file yeast.txt.

In the 10 years since the initial release of the S. cerevisiae genome, the annotation of protein encoding genes has continually evolved. New open reading frames have been identified and existing predicted ORFs have been revised or retired. In collaboration with SGD we have revisited and updated all entries for which the protein sequence has been changed since the initial release in order to provide users with a set of yeast proteins that corresponds to the most current view of the yeast proteome.

Ten years of post-genomic research have yielded a wealth of information on yeast proteins and we will continually revisit yeast entries to update their functional annotation. S. cerevisiae continues to be at the forefront of experimental molecular biology, particularly in the field of proteomics, and the availability of the complete proteome in UniProtKB/Swiss-Prot will facilitate the mapping and integration of results from large-scale proteomic studies. S. cerevisiae will also serve in the future as one of the model systems for functional annotation in UniProtKB/Swiss-Prot. As one of the best-characterized of the eukaryotic organisms, its proteins will provide many templates for the creation and annotation of fungal-specific or broader eukaryotic protein families.


UniProtKB News

Cross-references to MaizeGDB

We changed the Data bank identifier for the Maize Genetics and Genomics Database MaizeGDB from MaizeDB to MaizeGDB.

Example:

DR   MaizeDB; 58111; -.

has changed to

DR   MaizeGDB; 58111; -.

Changes concerning keywords

New keyword:

UniProt release 9.3

Published December 12, 2006

Headlines

Major update of a re-emerging pathogen: Dengue virus.

Dengue is a mosquito-borne virus found in tropical and sub-tropical regions around the world, predominantly in urban and semi-urban areas in south-east Asia, Africa, and South America.

During the 1970s, the disease had become sporadic due to an active vector control program. However, since the 1980s, both the virus and its vector have re-emerged and spread even wider than before: the disease is now found in more than 100 countries. The reasons for this re-emergence probably include the growing extension of urban areas and the suspension of the vector control program.

The virus is transmitted to humans through the bite of an infected Aedes aegypti mosquito and subsequently replicates in skin dendritic cells before infecting lymph nodes and blood cells. The symptoms are fever and pain, which can last for up to 7 days. In rare cases, human infection leads to dengue haemorrhagic fever (DHF), a potentially fatal complication. Today, DHF affects most Asian countries and has become a leading cause of hospitalisation and death among children.

Some 2500 million people -- two fifths of the world's population -- are now at risk from dengue. WHO currently estimates there may be 50 million cases of dengue infection worldwide every year. The mild autumn of 2006 has favoured the spread of the vector and has been responsible for a major outbreak of dengue in India, with many reported cases in New Delhi.

The growing number of dengue virus sequences (currently over 3400 in UniProtKB/TrEMBL) and the absence of a taxonomic nomenclature complicate the identification of medical samples.

In the current UniProtKB/Swiss-Prot release, we have adopted a systematic nomenclature for 28 representative dengue strains, in which we indicate the country and the year of isolation alongside the strain name.

Example: Dengue virus type 2 (strain TH-36)
becomes: Dengue virus type 2 (strain Thailand/TH-36/1958)

The virus (+)RNA genome codes for a single polyprotein, which is cleaved into more than 12 products. 32 representative dengue virus polyproteins have been annotated and are available in UniProtKB/Swiss-Prot (e.g. P33478).


UniProtKB News

Protein Spotlight document

Protein Spotlight (ISSN 1424-4721 - http://www.expasy.org/spotlight/) is a monthly review written by the Swiss-Prot team of the SIB Swiss Institute of Bioinformatics. Spotlight articles describe a specific protein or family of proteins in an informal tone.

The protsprot document lists for each Protein Spotlight article the corresponding UniProtKB/Swiss-Prot entries that it cites.

Cross-references to DIP

Cross-references have been added to the Database of Interacting Proteins (DIP). The DIP database catalogs experimentally determined interactions between proteins. It combines information from a variety of sources to create a single, consistent set of protein-protein interactions. The data stored within the DIP database are curated both manually, by expert curators, and automatically, using computational approaches that utilize the knowledge about protein-protein interaction networks extracted from the most reliable, core subset of the DIP data.

The DIP is available at http://dip.doe-mbi.ucla.edu/.

The format of the explicit links in the flat file is:

Resource abbreviation DIP
Resource identifier DIP accession number.
Examples
Q9W1K5:
DR   DIP; DIP:19601N; -.

P41597:
DR   DIP; DIP:5833N; -.
DR   DIP; DIP:5839N; -.

Cross-references to Reactome

The resource identifier of the cross-references to the Reactome database has been modified. The resource identifier was a Reactome unique identifier for a protein, which was identical to the Swiss-Prot primary AC number of that protein. Now it is a stable Reactome identifier. In addition, the pathway name is given as an optional information field.

Resource identifier Reactome identifier.
Optional information 1 Pathway name.
Examples
P61978:
DR   Reactome; REACT_1675.1; mRNA Processing.
DR   Reactome; REACT_71.1; Gene Expression.

P62191:
DR   Reactome; REACT_152.2; Cell Cycle, Mitotic.
DR   Reactome; REACT_1538.1; Cell Cycle Checkpoints.
DR   Reactome; REACT_383.2; DNA Replication.
DR   Reactome; REACT_6185.3; HIV Infection.
DR   Reactome; REACT_6850.1; Cdc20:Phospho-APC/C mediated degradation of Cyclin A.
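
All the explicit links described in these notes share the same DR line layout (resource abbreviation, resource identifier, optional information), so a single small parser can handle them. The sketch below is an illustration under that assumption; the function name and returned keys are ours, and it relies on fields being separated by a semicolon followed by a space, as in the examples shown.

```python
def parse_dr_line(line: str) -> dict:
    """Split a flat-file DR line into resource, identifier and optional fields."""
    if not line.startswith("DR   "):
        raise ValueError("Not a DR line")
    body = line[len("DR   "):].strip()
    if body.endswith("."):
        body = body[:-1]          # drop the terminating full stop only
    fields = body.split("; ")
    if len(fields) < 2:
        raise ValueError("A DR line needs a resource and an identifier")
    return {
        "resource": fields[0],
        "identifier": fields[1],
        # '-' marks an empty optional field, so it is left out.
        "optional": [field for field in fields[2:] if field != "-"],
    }


print(parse_dr_line("DR   Reactome; REACT_152.2; Cell Cycle, Mitotic."))
print(parse_dr_line("DR   DIP; DIP:19601N; -."))
```

Splitting on "; " rather than on every semicolon or comma keeps pathway names such as "Cell Cycle, Mitotic" intact.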

Cross-references to GermOnline

The resource identifier of the cross-references to the GermOnline database has been modified. The resource identifier was a GermOnline identifier for a gene. Now it is a gene identifier from any source, e.g. Ensembl or model organism database. In addition, the organism name is given as an optional information field.

Resource identifier Gene identifier from any source, e.g. Ensembl or model organism database.
Optional information 1 Organism name.
Examples
P02766:
DR   GermOnline; ENSG00000118271; Homo sapiens.

P32559:
DR   GermOnline; YMR023C; Saccharomyces cerevisiae.

Changes concerning keywords

New keyword:

Modified keywords:

UniProt release 9.2

Published November 28, 2006

Headlines

All known human G protein-coupled receptor proteins in UniProtKB/Swiss-Prot

The Human Proteome Initiative (HPI) aims to annotate all known human protein sequences, as well as their mammalian orthologs. The G protein-coupled receptor proteins (GPCRs), also known as seven transmembrane receptors (7TM receptors), form one of the largest protein families in mammalian genomes. These proteins are involved in all types of stimulus-response pathways, from intercellular communication to physiological senses, including taste, smell, and vision (opsin receptors). Many diseases are linked to GPCRs and half of the drugs produced by the pharmaceutical industry target GPCRs. A special emphasis has been given to this family in the HPI project.

In the current release, all known and potential human G protein-coupled receptor proteins are annotated and integrated in UniProtKB/Swiss-Prot. 775 human GPCRs are now available in our knowledgebase. About half of all GPCRs are presumed to be involved in the sense of smell. For the remaining half, the active ligand has been documented when available, but about 20% of human GPCRs are still orphans. Most mouse and rat orthologs have also been annotated.

All G protein-coupled receptor proteins annotated in UniProtKB/Swiss-Prot are classified by family and listed in the file 7tmrlist.txt.


UniProtKB News

Cross-references to Gene3D

Cross-references have been added to the Gene3D Structural and Functional Annotation of Protein Families database. The Gene3D database provides a combined structural, functional and evolutionary view of the protein world. It is focussed on providing structural annotation for protein sequences without structural representatives -- including the complete proteomes of over 240 different species. The protein sequences have also been clustered into whole-chain families so as to aid functional prediction. The structural annotation is generated using HMM models based on the CATH domain families; CATH is a repository for manually deduced protein domains.

The Gene3D is available at http://cathwww.biochem.ucl.ac.uk:8080/Gene3D/.

The format of the explicit links in the flat file is:

Resource abbreviation Gene3D
Resource identifier Gene3D identifier.
Optional information 1 Gene3D entry name
Examples
Q12933:
DR   Gene3D; G3DSA:3.90.890.10; SIAH-type; 1.
DR   Gene3D; G3DSA:2.60.210.10; TRAF-type; 1.

Q04311:
DR   Gene3D; G3DSA:1.25.40.20; ANK; 1.

Changes concerning keywords

New keyword:

UniProt release 9.1

Published November 14, 2006

Headlines

CD antigens: molecular markers of cell differentiation

The CD nomenclature was proposed and established in 1982 at the first International Workshop and Conference on Human Leukocyte Differentiation Antigens (HLDA). This system was intended for the classification of monoclonal antibodies (mAbs), generated in many laboratories around the world, against various cell surface molecules on leukocytes (white blood cells). The data were collated and analyzed by the statistical procedure of 'cluster analysis'. This analytical method identified clusters of antibodies with very similar patterns of binding to leukocytes at various stages of differentiation: hence the use of the abbreviation 'CD' for 'cluster of differentiation'. CD antibodies are used widely for research, differential diagnosis, and the monitoring and treatment of disease.

The HLDA workshops assign each CD on the basis of the reactivity of at least two mAbs to one human antigen; the provisional indicator 'w' (for example CDw293) is sometimes given to an imperfectly characterized cluster or to a cluster represented by only one mAb.

Gradually, the use of the CD nomenclature has expanded to many other cell types such as endothelial and stromal cells. Therefore the 8th HLDA conference (HLDA8) decided in 2004 that the acronym HLDA would be replaced by HCDM for "Human Cell Differentiation Molecules".

All currently defined human CD antigens (a total of 361 in this release) are annotated and integrated in UniProtKB/Swiss-Prot. A CD antigen appears in an entry as a synonym of the protein name (e.g. CD305 antigen for Leukocyte-associated immunoglobulin-like receptor 1). The CD name is also propagated to all orthologous mammalian proteins, so that human CD antigens and their orthologs in other mammals can be retrieved easily.

UniProtKB News

Extension of the UniProtKB accession number format

Before release 9.1, UniProtKB accession numbers consisted of 6 alphanumerical characters in the following format:

1 2 3 4 5 6
[O,P,Q] [0-9] [A-Z, 0-9] [A-Z, 0-9] [A-Z, 0-9] [0-9]

Due to the large increase in the number of protein sequences in UniProtKB, we had to extend the existing accession number format by allowing the first character to be any of the 26 letters (instead of only O, P and Q). To avoid assigning accession numbers identical to those which have been used by the International Nucleotide Sequence Database, the extension in the first position goes along with a restriction in the third position which can only be a letter. The new format for UniProtKB accession numbers is therefore:

1 2 3 4 5 6
[A-N,R-Z] [0-9] [A-Z] [A-Z, 0-9] [A-Z, 0-9] [0-9]
[O,P,Q] [0-9] [A-Z, 0-9] [A-Z, 0-9] [A-Z, 0-9] [0-9]
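
The two patterns above translate directly into a regular expression. The following sketch checks a string against the 6-character format as described in this note only; the function name is ours, and any later changes to the accession number format are not covered.

```python
import re

# Either the new pattern (first letter A-N or R-Z, third position a letter)
# or the original pattern (first letter O, P or Q).
ACCESSION_PATTERN = re.compile(
    r"^(?:[A-NR-Z][0-9][A-Z][A-Z0-9][A-Z0-9][0-9]"
    r"|[OPQ][0-9][A-Z0-9][A-Z0-9][A-Z0-9][0-9])$"
)


def is_valid_accession(accession: str) -> bool:
    """Return True if the string matches the accession format described above."""
    return ACCESSION_PATTERN.match(accession) is not None


for candidate in ("P08185", "A0PW55", "Q3ASY8", "A01234"):
    print(candidate, is_valid_accession(candidate))
```

"A01234" is rejected because, with a first letter outside O, P and Q, the third position must be a letter.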

Cross-references to DrugBank

Cross-references have been added to the DrugBank database. The DrugBank database is a unique bioinformatics and cheminformatics resource that combines detailed drug (i.e. chemical, pharmacological and pharmaceutical) data with comprehensive drug target (i.e. sequence, structure, and pathway) information. The database contains many drug entries including FDA-approved small molecule drugs, FDA-approved biotech (protein/peptide) drugs, nutraceuticals and experimental drugs. Additionally, protein (i.e. drug target) sequences are linked to these drug entries. Each DrugCard entry contains more than 80 data fields with half of the information being devoted to drug/chemical data and the other half devoted to drug target or protein data.

The DrugBank database is available at http://redpoll.pharmacy.ualberta.ca/drugbank/.

The format of the explicit links in the flat file is:

Resource abbreviation DrugBank
Resource identifier DrugBank accession number.
Optional information 1 Drug generic name.
Example
P08185:
DR   DrugBank; DB00240; Alclometasone.
DR   DrugBank; DB00394; Beclomethasone.
DR   DrugBank; DB01410; Ciclesonide.
DR   DrugBank; DB00663; Flumethasone Pivalate.
DR   DrugBank; DB00180; Flunisolide.
DR   DrugBank; DB00591; Fluocinolone Acetonide.
DR   DrugBank; DB01047; Fluocinonide.
DR   DrugBank; DB00324; Fluorometholone.
DR   DrugBank; DB00846; Flurandrenolide.
DR   DrugBank; DB00588; Fluticasone Propionate.
DR   DrugBank; DB00596; Halobetasol Propionate.
DR   DrugBank; DB00253; Medrysone.
DR   DrugBank; DB00648; Mitotane.
DR   DrugBank; DB01384; Paramethasone.
DR   DrugBank; DB00860; Prednisolone.
DR   DrugBank; DB00896; Rimexolone.
DR   DrugBank; DB00620; Triamcinolone.

Cross-references to euHCVdb

Cross-references have been added to the European Hepatitis C Virus database. The development of the European Hepatitis C Virus database (euHCVdb) started in 1999 as the French HCV Database. The euHCVdb is mainly oriented towards protein sequence, structure and function analyses and structural biology of HCV.

The European Hepatitis C Virus database is available at https://euhcvdb.ibcp.fr/euHCVdb/.

The format of the explicit links in the flat file is:

Resource abbreviation euHCVdb
Resource identifier EMBL Accession number.
Examples
P26664:
DR   euHCVdb; AF271632; -.
DR   euHCVdb; M62321; -.

P27953:
DR   euHCVdb; X53136; -.

Cross-references to dbSNP

Explicit links are present in the FT VARIANT lines of protein sequence entries of Hominidae to the Single Nucleotide Polymorphism database (dbSNP). We will prefix dbSNP identifiers in human FT VARIANT lines by "rs". NCBI/dbSNP has rs and ss numbers, but we only refer to SNPs with rs numbers.

Examples:

P08185:
 FT   VARIANT     246    246       S -> A (in dbSNP:rs2228541).
 FT                                /FTId=VAR_024350.
P06307:
 FT   VARIANT      32     32       G -> E (in dbSNP:rs11571848).
 FT                                /FTId=VAR_018818.
 FT   VARIANT      95     95       R -> W (in dbSNP:rs3774395).
 FT                                /FTId=VAR_024452.
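
With the "rs" prefix in place, dbSNP identifiers can be pulled out of FT VARIANT lines with a simple pattern. The sketch below is illustrative only and assumes the "dbSNP:rs..." notation shown in the examples; the function name is ours.

```python
import re
from typing import Iterable, List

DBSNP_PATTERN = re.compile(r"dbSNP:(rs\d+)")


def extract_dbsnp_ids(ft_lines: Iterable[str]) -> List[str]:
    """Collect the dbSNP rs identifiers cited in feature table lines."""
    identifiers = []
    for line in ft_lines:
        identifiers.extend(DBSNP_PATTERN.findall(line))
    return identifiers


# FT lines of P06307, copied from the example above.
lines = [
    " FT   VARIANT      32     32       G -> E (in dbSNP:rs11571848).",
    " FT                                /FTId=VAR_018818.",
    " FT   VARIANT      95     95       R -> W (in dbSNP:rs3774395).",
    " FT                                /FTId=VAR_024452.",
]
print(extract_dbsnp_ids(lines))
```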

UniProt release 9.0

Published October 31, 2006

Full statistics and release notes

Headlines

UniProtKB/Swiss-Prot major release (51.0)

UniProt Knowledgebase release 9.0 includes Swiss-Prot release 51.0 and TrEMBL release 34.0.

Release 51.0 of 31-Oct-06 of UniProtKB/Swiss-Prot contains 241'242 sequence entries, comprising 88'541'632 amino acids abstracted from 148'048 references.

19'061 sequences have been added since release 50.0, the sequence data of 1'336 existing entries has been updated and the annotations of 222'181 entries have been revised.

Many improvements were carried out in the last 5 months:

  • We have changed the format of the ID line, replacing the terms STANDARD/PRELIMINARY by Reviewed/Unreviewed, and removing the legacy field MoleculeType.
  • We have also standardized the FASTA header lines of UniProtKB and UniRef entries.
  • Cross-references to five new databases have been added, and the cross-references to Gene Ontology (GO) have been modified to include the source database for the mapping.
  • There are two new documents: Index of Oryza sativa entries and their corresponding gene designations, and Protein naming guidelines.

UniProtKB News

ID (IDentification) line

The format of the ID line was:

ID   EntryName DataClass; MoleculeType; SequenceLength.

We have changed the values of the DataClass field as described in this table:

Old DataClass New DataClass Description
STANDARD Reviewed Entries that have been manually reviewed and annotated by UniProtKB curators (Swiss-Prot section of the UniProt Knowledgebase).
PRELIMINARY Unreviewed Computer-annotated entries that have not been reviewed by UniProtKB curators (TrEMBL section of the UniProt Knowledgebase).

We have also dropped the field MoleculeType, which was a legacy of compatibility with the EMBL flat file format. The new format of the ID line is:

ID   EntryName DataClass; SequenceLength.

Examples:

ID   CYC_PIG                 Reviewed;         104 AA.
ID   Q3ASY8_CHLCH            Unreviewed;     36805 AA.
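
For readers updating parsers, the new ID line can be matched with one regular expression. This is a sketch under the assumption that fields are separated by whitespace exactly as in the two examples; `parse_id_line` and the dictionary keys are illustrative names, not part of any UniProt library.

```python
import re

# New format: ID   EntryName DataClass; SequenceLength.
ID_RE = re.compile(r"^ID\s+(\S+)\s+(Reviewed|Unreviewed);\s+(\d+) AA\.$")

# Mapping from the table above, useful when converting old files.
OLD_TO_NEW_DATA_CLASS = {"STANDARD": "Reviewed", "PRELIMINARY": "Unreviewed"}


def parse_id_line(line):
    """Split a new-style ID line into entry name, data class and length."""
    match = ID_RE.match(line.rstrip())
    if match is None:
        raise ValueError("not a new-style ID line: %r" % line)
    return {
        "entry_name": match.group(1),
        "data_class": match.group(2),
        "length": int(match.group(3)),
    }


print(parse_id_line("ID   CYC_PIG                 Reviewed;         104 AA."))
```

An old-style line that still carries the MoleculeType field does not match this pattern and raises `ValueError`, which makes leftover legacy files easy to spot.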

FASTA header line

We have standardized the FASTA header line of UniProtKB and UniRef entries in the following way:

Format for UniProtKB

>UniqueIdentifier|EntryName ProteinName - OrganismName
  • UniqueIdentifier is the primary accession number of the UniProtKB entry, or, in the case of entries that describe several protein isoforms, an isoform identifier.
  • EntryName is the entry name of the UniProtKB entry.
  • ProteinName is the recommended or submitted protein name of the UniProtKB entry. (This is the name before the first bracket, excluding 'precursor' but including 'Fragment' if appropriate - see the Protein name documentation.)
  • OrganismName is the scientific name of the organism of the UniProtKB entry.

Examples:

>P24856|ANP_NOTCO Ice-structuring glycoprotein (Fragment) - Notothenia coriiceps neglecta
>P51650-1|SSDH_RAT Succinate semialdehyde dehydrogenase - Rattus norvegicus
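
The standardized header can be taken apart as sketched below. The function name `parse_fasta_header` is hypothetical, and the sketch assumes that the last occurrence of " - " separates the protein name from the organism name, which holds for the examples above but is not guaranteed by the format description.

```python
def parse_fasta_header(header):
    """Parse '>UniqueIdentifier|EntryName ProteinName - OrganismName'."""
    if not header.startswith(">"):
        raise ValueError("FASTA header must start with '>'")
    identifier, sep, rest = header[1:].rstrip().partition("|")
    if not sep:
        raise ValueError("missing '|' after the unique identifier")
    entry_name, _, description = rest.partition(" ")
    # Assumption: the final " - " is the protein/organism separator.
    protein_name, sep, organism = description.rpartition(" - ")
    if not sep:
        raise ValueError("missing ' - ' before the organism name")
    # An isoform identifier looks like 'P51650-1'; a plain accession has no '-'.
    accession, _, isoform = identifier.partition("-")
    return {
        "identifier": identifier,
        "accession": accession,
        "isoform": isoform or None,
        "entry_name": entry_name,
        "protein_name": protein_name,
        "organism": organism,
    }


print(parse_fasta_header(
    ">P51650-1|SSDH_RAT Succinate semialdehyde dehydrogenase - Rattus norvegicus"
))
```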

Cross-references to Human Protein Atlas

Cross-references have been added to the Human Protein Atlas. The Human Protein Atlas shows the expression and localization of proteins in a large variety of normal human tissues and cancer cells. The data is presented as high resolution images representing immunohistochemically stained tissue sections. Each antibody in the database has been used for immunohistochemical staining of both normal and cancer tissue. The immunohistochemical protocols used result in a brown-black staining, localized where an antibody has bound to its corresponding antigen.

The format of the explicit links in the flat file is:

Resource abbreviation HPA
Resource identifier Antibody identifier.
Examples
O75843:
DR   HPA; HPA004106; -.

P08183:
DR   HPA; CAB001716; -.

Cross-references to Gene Ontology (GO)

The last field of the cross-references to the Gene Ontology (GO) database has been modified. This field displays the GO evidence code and we have appended to this the source database from which the cross-reference was obtained, separated by a colon.

Examples:

Q15738:
DR   GO; GO:0005783; C:endoplasmic reticulum; IDA:LIFEdb.

P13569:
DR   GO; GO:0005524; F:ATP binding; TAS:ProtInc.
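
A parser that previously read the last field as a bare evidence code now has to split it at the colon. The sketch below illustrates one way to do this; `parse_go_xref` is a hypothetical name, and the code assumes that no GO term name contains the "; " field separator.

```python
def parse_go_xref(line):
    """Parse a 'DR   GO; ...' line with the 'Evidence:Source' last field."""
    if not line.startswith("DR   GO; "):
        raise ValueError("not a GO cross-reference line")
    body = line[len("DR   "):].rstrip()
    if body.endswith("."):
        body = body[:-1]
    fields = body.split("; ")
    if len(fields) != 4:
        raise ValueError("expected 4 fields, got %d" % len(fields))
    _, go_id, term, last = fields
    aspect, _, name = term.partition(":")
    # New in this release: evidence code and source database, colon-separated.
    evidence, _, source = last.partition(":")
    return {
        "go_id": go_id,
        "aspect": aspect,
        "term": name,
        "evidence": evidence,
        "source": source or None,
    }


print(parse_go_xref("DR   GO; GO:0005783; C:endoplasmic reticulum; IDA:LIFEdb."))
```

Lines written before the change carry no source, in which case `source` is `None`.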

Changes concerning keywords

UniProt release 8.9

Published October 17, 2006

Headlines

Human polymorphisms: juggling with health and disease

Recent advances in genomics and proteomics promise to give new insights into the molecular mechanisms of diseases and hopefully will lead to the discovery of novel treatments. The integration of phenotype descriptions along with sequence data, genetic information, as well as physiological, biochemical and structural knowledge may help understand the chain of events leading from a molecular defect to a pathology. In this context, UniProtKB/Swiss-Prot provides the scientific community with a wealth of information on genetic diseases, disease-linked variants and polymorphisms.

In the current release, over 2,000 human entries contain a disease description in the comment section under the topic "involvement in disease". The disease description is short, but it is supplemented with links to the OMIM database, allowing the retrieval of more detailed information about genetic disorders. Additional links to gene-specific databases can be found under web resources.

At the sequence level, close to 28,500 human single amino acid polymorphisms (SAPs) are described, more than half of which are associated with a disease state and about 30% of which are linked to the Single Nucleotide Polymorphism database (dbSNP). SAPs are described in the feature table and characterized by a unique identifier (FTId), which gives access to the variant web pages. These pages display a synopsis of relevant information for a given variant, including references, sequence context, as well as residue conservation throughout evolution and structural data, when available (example). Mutations that cause major changes to a protein sequence (as is the case for most frameshift mutations) are not and will not be considered to be relevant to UniProtKB/Swiss-Prot, as their deleterious effect on a given protein function is usually obvious.

Finally, our medical annotation effort also consists of the creation of keywords to allow easy retrieval of proteins involved in complex disorders and genetically heterogeneous diseases.

Currently about 100 "medical" keywords have been created, and the list is growing.


UniProtKB News

Changes concerning keywords

UniProt release 8.8

Published October 3, 2006

Headlines

Over 1,000 rice proteins annotated

Rice (Oryza sativa) is the most important food crop in the world and part of the daily diet of over half of the human population. It is grown in 114 countries worldwide and provides 50-80% of the calorie consumption in a number of Southeast Asian countries (see world rice statistics).

In the current release, over 1,000 rice entries have been completed in UniProtKB/Swiss-Prot. How?

Following the completion of the first genome sequence of the model plant Arabidopsis thaliana, in 2001 the Swiss-Prot group initiated the Plant Proteome Annotation Program, which focuses on the annotation of plant-specific proteins and protein families. Our major effort was directed towards Arabidopsis, but the completion of the Oryza sativa (cultivar Nipponbare) genome sequence by the IRGSP prompted us to broaden our focus.

Each manually annotated rice entry already contains the TIGR locus identifiers - which map each protein to the corresponding gene in the rice genome - and will soon also include RAP loci. Amongst the numerous cross-references in rice entries is the link to Gramene which gives access to comparative grass genomics. We also plan to link our entries to RAP-DB in the near future, which will provide links to genomic data and genome annotation.

We are currently concentrating on the annotation of well-characterized proteins for which experimental data are available. The function of a number of rice proteins reflects physiological trait adaptation and grain property evolution owing to centuries of selection by farmers (over 100,000 rice varieties exist throughout the world).

As an example, large areas of Southeast Asia are flooded during the monsoon season. Deepwater rice copes with this by way of rapid internode elongation (up to 25 cm/day), and expansin A4 contributes by causing the cell walls to slacken and expand.

What is more, a primary factor that decreases rice crop yield is coastal salinity and the accumulation of salts in irrigated land. Pokkali, an indica variety of lowland rice, is classified as highly tolerant, because it contains a specific potassium-sodium cotransporter (HKT2), which mediates increased potassium uptake with external sodium accumulation.

Finally, the grain texture of cooked rice is essential in various food cultures. A generic classification exists between long grain, medium grain and short grain rice, where the first is separate and fluffy and the last more moist, sticky and tender. The proportion of long chain amylopectin is correlated with firmer cooked rice. A starch synthase (SSII-3), which synthesizes long chain amylopectin, is barely active in the sticky cultivar japonica Nipponbare; however, a variation of 4 amino acids leads to an increased activity in firmer indica varieties.

All rice proteins annotated in UniProtKB/Swiss-Prot are classified by chromosome locus (ordered locus name starting with "Os") and listed in the file rice.txt. In the future, we plan to manually annotate every rice gene family and to develop semi-automated annotation tools to complete rice proteome annotation.

UniProt release 8.7

Published September 19, 2006

Headlines

The search for the origin of HIV-1

The origin of Human immunodeficiency virus 1 (HIV-1) has been the subject of hot debate for more than twenty years. In 1999, American, Japanese and French researchers claimed to have discovered an indisputable link between a chimpanzee virus from central West Africa called SIVcpz (Simian Immunodeficiency Virus from chimpanzees) and HIV-1. SIVcpz is 70-90% identical to HIV-1 and does not appear to cause illness in chimpanzees.

However, since SIVcpz was only found in a few chimpanzees held in captivity, the possibility existed that another yet unidentified species could be the natural reservoir of both HIV-1 and SIVcpz.

A recent study provides for the first time a clear picture of the origin of HIV-1 and the seeds of the AIDS pandemic. New strains of SIVcpz have been identified in wild chimpanzees from Cameroon. These new strains are more closely related to human HIV-1 than to any Simian viruses.

There are three HIV-1 lineages: M (Major), O (Outlier) and N (New). The new SIVcpz isolate MB66 turned out to be more closely related to HIV-1 group M than to any Simian virus (see a similarity search for SIVcpz MB66 gag-pol protein). Moreover, another wild virus, SIVcpz isolate EK505, is very closely related to HIV-1 group N. This suggests that at least two independent SIVcpz transfers from chimpanzee to man occurred in this region. HIV-1 group M presumably crossed species early in the 20th century. HIV-1 group N may have infected humans more recently.

The authors of the study also postulate that "given the extensive genetic diversity and phylogeographical clustering of SIVcpz now recognised and the vast areas of west central Africa not yet sampled, it is quite possible that still other SIVcpz lineages exist that could pose risks for human infection and prove problematic for HIV diagnostics and vaccines."

Proteins from SIVcpz isolates MB66 and EK505 are fully annotated and available from UniProtKB/Swiss-Prot.


UniProtKB News

Cross-references to KEGG

The KEGG (Kyoto Encyclopedia of Genes and Genomes) database is part of the research projects of the Kanehisa Laboratories in the Bioinformatics Center of Kyoto University and the Human Genome Center of the University of Tokyo. The aim of this bioinformatics resource is to provide as far as possible a complete computer representation of the cell, the organism, and the biosphere, which will enable computational prediction of higher-level complexity of cellular processes and organism behaviors from genomic and molecular information.

The format of the explicit links in the flat file is:

Resource abbreviation KEGG
Resource identifier KEGG's organism code for the genome and gene number, separated by a colon.
Examples
P54609:
DR   KEGG; ath:At3g09840; -.

O43623:
DR   KEGG; hsa:6591; -.

UniProt release 8.6

Published September 5, 2006

Headlines

3D-structure information for over 10,000 proteins

3D-structure information is now available for over 10,000 proteins in UniProtKB/Swiss-Prot (containing more than 36,000 individual cross-references to PDB).

Protein structures not only delight the eye, they shed light on protein architecture and provide proof for the existence of a given protein fold. They are indispensable to determine the interactions of a protein with its ligands (substrates, ions, cofactors or regulatory molecules) and provide solid proof for post-translational modifications. Likewise, 3D-structures pinpoint the exact position of residues that cause a genetic disease when mutated (e.g. Q8NBK3). They help to design experiments and make it possible to attribute a function to so-far hypothetical proteins (e.g. Q46856).

UniProtKB aims to be fully synchronized with PDB and provide access to information about protein 3D-structures via cross-references to PDB, and by giving high priority to the annotation of proteins with known 3D-structures. A semi-automated mapping procedure was established in collaboration with the Macromolecular Structure Database (MSD), so that the whole PDB archive could be mapped to UniProtKB.


UniProtKB News

Cross-references to ArrayExpress

ArrayExpress is a public repository database for microarray gene expression data. We introduced a new cross-reference to ArrayExpress, which stores gene-indexed expression profiles from a curated subset of experiments in the repository.

The format of the explicit links in the flat file is:

Resource abbreviation ArrayExpress
Resource identifier UniProtKB primary AC.
Example
O00139:
DR   ArrayExpress; O00139; -.

Cross-references to PeroxiBase

PeroxiBase is a database that centralizes most of the sequences encoding members of the peroxidase superfamilies (class I, II and III peroxidase superfamily, glutathione peroxidases, NADPH oxidases, and animal peroxidases), in order to follow the evolution of peroxidases among living organisms and to compile information concerning putative functions and transcription regulation.

The format of the explicit links in the flat file is:

Resource abbreviation PeroxiBase
Resource identifier PeroxiBase accession number.
Optional information 1 PeroxiBase entry name
Example
O23044:
DR   PeroxiBase; 79; AtPrx03.
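
The cross-reference formats described in this release share one layout: resource abbreviation, resource identifier, then optional information fields, with "-" marking an empty field. A generic reader can be sketched as follows; `parse_dr_line` is an illustrative name, and the sketch assumes that no field contains a semicolon.

```python
def parse_dr_line(line):
    """Return (resource, identifier, optional_fields) for a DR line."""
    if not line.startswith("DR   "):
        raise ValueError("not a DR line")
    body = line[5:].strip()
    if body.endswith("."):
        body = body[:-1]
    # Assumption: ';' only ever appears as the field separator.
    fields = [field.strip() for field in body.split(";")]
    if len(fields) < 2:
        raise ValueError("a DR line needs a resource and an identifier")
    resource, identifier = fields[0], fields[1]
    # '-' is the placeholder for an absent optional field.
    optional = [None if field == "-" else field for field in fields[2:]]
    return resource, identifier, optional


print(parse_dr_line("DR   PeroxiBase; 79; AtPrx03."))
print(parse_dr_line("DR   ArrayExpress; O00139; -."))
```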

Changes concerning keywords

PTMs

Terms introduced in the controlled vocabulary for PTMs:

Terms for the feature key 'CROSSLNK':

  • 3-(S-cysteinyl)-tyrosine (Cys-Tyr)
  • 3'-(S-cysteinyl)-tyrosine (Tyr-Cys)
  • 4-amino-3-isothiazolidinone serine (Cys-Ser)
  • Peptide (Met-Gly) (interchain with G-...)
  • Tryptophyl-tyrosyl-methioninium (Trp-Tyr)
  • Tryptophyl-tyrosyl-methioninium (Tyr-Met)

Terms for the feature key 'LIPID':

  • 3'-geranyl-2',N2-cyclotryptophan
  • GPI-like-anchor amidated alanine
  • GPI-like-anchor amidated asparagine
  • N6-myristoyl lysine
  • N6-palmitoyl lysine
  • S-stearoyl cysteine

Terms for the feature key 'MOD_RES':

  • 1-thioglycine
  • 2-oxobutanoic acid
  • Citrulline
  • Cysteine sulfenic acid (-SOH)
  • Cysteine sulfinic acid (-SO2H)
  • Cysteinyl-selenocysteine (Sec-Cys)
  • Methionine sulfone
  • N2-succinyltryptophan
  • Nitrated tyrosine
  • O-acetylserine
  • O-acetylthreonine
  • S-selanylcysteine

UniProt release 8.5

Published August 22, 2006

Headlines

10'000 species in UniProtKB/Swiss-Prot

We now have 10'000 different species represented in UniProtKB/Swiss-Prot for which protein entries are stored in the knowledgebase. Ten times more species are stored in UniProtKB/TrEMBL. Each species present in UniProtKB/Swiss-Prot is curated: the curation consists of the verification of the scientific name validity, the consistency of the lineage and the existence of a common name and/or synonym. You think the taxonomy is indigestible? Have a look at the following recipe ;-)

Pizza recipe

Pizza is not a new program, it is really a delicious and tasteful recipe!

Pizza crust: Toppings: Homemade tomato sauce:
(*) Lactobacillus helveticus is used for the manufacture of these 2 cheeses

Add fresh Saccharomyces cerevisiae to the water and stir until dissolved. Add Beta vulgaris sugar, Olea europaea oil, salt and Triticum aestivum powder. On lightly floured board, knead dough until smooth and elastic. Place in a bowl and let rise in a warm place until volume has doubled.

Heat Olea europaea oil in a wide frying pan over medium heat; add Allium cepa and cook for about 10 minutes until softened, stirring often. Turn the heat up to high and add Allium sativum, herbs (Ocimum basilicum, Origanum vulgare and Petroselinum crispum) and Lycopersicon esculentum paste. Add Capsicum annuum powder and season to taste with salt.

Let simmer for at least 30 minutes.

Roll dough into a large circle, place on greased baking sheet, press around edges to form 2 cm rim. Cover with homemade tomato sauce. Layer toppings on dough in order listed. Bake at 240°C for 13 minutes until nicely coloured. You can top the pizza with a few leaves of Diplotaxis tenuifolia (it tastes hotter than Eruca sativa).

You uncovered 18 species in our recipe, but 9'982 other species are now in UniProtKB/Swiss-Prot.

ENJOY :)

UniProt release 8.4

Published July 25, 2006

Headlines

Happy anniversary, Swiss-Prot!

On July 21st 1986, the first Swiss-Prot release was created. It contained close to 4,000 protein sequence entries and was produced by a single graduate student, Amos Bairoch, at the University of Geneva. In 1996, while Swiss-Prot was rapidly growing (60,000 entries) and was used worldwide, the granting agencies could not find a solution to finance it. Without the support of thousands of users, Swiss-Prot would not be celebrating its 20th anniversary today! This financial crisis was solved by the creation of the SIB Swiss Institute of Bioinformatics, and additional resources were provided by license fees paid by commercial users, Swiss-Prot remaining freely accessible to the academic community.

The first Swiss-Prot annotators used to annotate protein sequences concomitant with the submission of the nucleotide coding sequences to the EMBL database. However, the increase of submissions made it impossible to keep pace. In collaboration with the European Bioinformatics Institute (EBI), a solution was found with the creation of TrEMBL, a computer-annotated supplement to Swiss-Prot in 1996, which contained roughly 60,000 entries in its first release.

In 2006, a staff of 60 annotators at the SIB and the EBI, supported by a dedicated programming team, is maintaining Swiss-Prot. Close to 250,000 entries are currently in the knowledgebase. Interestingly, 10 years were necessary to reach the first 50,000 protein sequence entries, while 50,000 proteins can now be manually annotated in about 18 months. In parallel, TrEMBL's exponential growth has resulted in a database containing close to 3 million entries.

Since 2002, both databases have been at the heart of the UniProt project and together they constitute the UniProt Knowledgebase (UniProtKB), one of 3 UniProt components. UniProt is produced by a collaboration between 3 institutes, SIB, EBI and PIR (Protein Information Resource). This single, centralized, authoritative resource for protein sequences and functional information aims to make protein data available, to facilitate their retrieval and to provide new tools to help in their analysis. Since Swiss-Prot became UniProtKB/Swiss-Prot, access to the knowledgebase has been free again for commercial users. Currently 160 persons are involved in the UniProt services to the scientific community.

The means have changed, but the 20 year old key idea of a graduate student to share knowledge is still, and more than ever, vivid.


UniProtKB News

Large scale analyses

The term 'LARGE SCALE ANALYSIS' was added in RP lines for references that report results from large-scale screens, to indicate that these results have not been extensively studied.

AC   P33304
RP   PHOSPHORYLATION [LARGE SCALE ANALYSIS] AT SER-22, AND MASS
RP   SPECTROMETRY.
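
Users who wish to filter on this marker can join the wrapped RP lines and test for the bracketed term. This is only a sketch: the helper names are hypothetical, and it treats all RP lines passed in as belonging to a single reference, so the lines of each reference block have to be supplied separately.

```python
LARGE_SCALE_MARKER = "[LARGE SCALE ANALYSIS]"


def rp_text(reference_lines):
    """Join the RP lines of one reference into a single string."""
    parts = [line[5:].strip() for line in reference_lines
             if line.startswith("RP   ")]
    return " ".join(parts)


def is_large_scale(reference_lines):
    """True if the reference position text carries the large-scale marker."""
    return LARGE_SCALE_MARKER in rp_text(reference_lines)


lines = [
    "AC   P33304",
    "RP   PHOSPHORYLATION [LARGE SCALE ANALYSIS] AT SER-22, AND MASS",
    "RP   SPECTROMETRY.",
]
print(rp_text(lines))
print(is_large_scale(lines))
```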

UniProt release 8.3

Published July 11, 2006

Headlines

Of mice and men: over 10'000 orthologous sequence pairs in UniProtKB/Swiss-Prot

Comparing orthologous proteins between mammalian species is very helpful to understand the biological basis underlying disease susceptibility or responsiveness to drugs, or simply to understand what makes us human and not simply another great ape.

Human protein sequences and those of all available mammalian orthologous sequences are annotated and compared in the frame of the UniProtKB/Swiss-Prot HPI annotation program (Human Proteomics Initiative). During the annotation process, sequence length, alternative splicing isoforms or even polymorphisms can be validated. In order to provide our users with a coherent view of mammalian proteomes, similar isoforms are shown for orthologous proteins from all mammalian species whenever possible.

The laboratory mouse is a widely used model organism and thus many murine sequences are available for annotation. It is currently the most highly represented non-human mammal with more than 11'000 entries, and 91% of these entries are orthologous to human proteins. Human-mouse orthologous pairs share 85% identity on average. About 36% of these pairs have identical sequence length and share 94% identity. The most highly conserved proteins are involved in core biological processes such as mRNA processing and transport, translation and ubiquitin-dependent protein degradation. In contrast, fast evolving proteins generally play roles in immunity, reproduction and signal transduction.

The percentage of identity between orthologous protein pairs in the most highly represented mammals in UniProtKB/Swiss-Prot is shown in the table below:

              Orangutan   Bovine   Mouse     Rat
 Human            97.43    87.37   85.46   85.80
 Orangutan                 89.34   87.44   87.20
 Bovine                            83.99   84.80
 Mouse                                     93.48

UniProtKB/Swiss-Prot entries for orthologous proteins usually share the same protein mnemonic code in the entry name (ID line) and thus can be easily identified.

UniProtKB News

Changes concerning keywords

  • Polyprotein (deleted)

UniProt release 8.2

Published June 27, 2006

Headlines

Looking for Titin

"I am looking for Titine" Charlie Chaplin sang in Modern Times. While for many people Titin brings back memories of this song, for the scientific community the meaning is completely different. Titin is a giant sarcomeric protein of roughly 35'000 aa. Protein analysis programs used to crash when encountering huge proteins, and a protein had to be shorter than 10'000 aa to be integrated into UniProtKB/Swiss-Prot. Modern times finally arrived and bioinformatics has improved by leaps and bounds. Programs are now able to deal with huge proteins and titin has finally been integrated into UniProtKB/Swiss-Prot.

Titin is a long (up to 1 micron), slender and flexible strand, frequently with a large globule at one end. It has a complex modular structure that varies depending on the splicing events. In its longest form it may contain up to 132 fibronectin type-III domains, 152 Ig-like domains, 9 Kelch, 17 RCC1, 14 TPR, 15 WD and 31 PEVK repeats and 1 protein kinase domain. Titin functions as a mechanical sensor through its interaction with many other proteins, such as myomesins, tropomyosins, myosins, actins, myopalladin, etc. By providing connections at the level of individual microfilaments, it contributes to the fine balance of forces between the two halves of the sarcomere and thus to muscle extensibility. In non-muscle cells, it seems to play a role in chromosome condensation and segregation during mitosis.

Needless to say, the titin-seeking of Charlie Chaplin was a legitimate demand, because all human beings need titin in their life.


UniProtKB News

Overview of Oryza sativa (rice) entries

The document rice.txt lists all the Oryza sativa (rice) entries in UniProtKB/Swiss-Prot. For each UniProtKB/Swiss-Prot rice entry, there is the corresponding chromosome locus, the UniProtKB/Swiss-Prot accession number, the UniProtKB/Swiss-Prot entry name, the description and the gene names.

Changes concerning keywords

UniProt release 8.1

Published June 13, 2006

Headlines

Man gave names to all the... proteins

We have spent many years curating all kinds of proteins from all kinds of species. One recurring challenge is to offer an easily searchable and consistent knowledgebase dealing, in particular, with many ambiguities and discrepancies regarding protein names. Nomenclature is not only indispensable for communication, but also for literature search and entry retrieval. We feel that our experience in this field can be valuable, and that we can play a role in helping the standardization of protein nomenclature.

To take up this challenge, we created a new document which describes guidelines used by UniProtKB/Swiss-Prot annotators to give each entry the most appropriate name, called the "Recommended name" (RN). In short, an RN should follow the approved nomenclature, if it exists, and should be unique and attributed to all orthologs. Other rules deal mostly with the syntax of submitted protein names in order to have consistent and reproducible RNs in spite of the variability observed in various submissions. If our RN differs from the submitted one, the latter is kept as "alternative name". In this way we enhance the searchability, as well as the consistency, of our database.

We sincerely hope that researchers will adhere as much as possible to these guidelines for naming new proteins when publishing or submitting their data. This will make their results easily searchable, allow tracking of a given protein across related organisms and help us in our continuing effort to standardize nomenclature.


UniProtKB News

Protein naming guidelines

The document nameprot.txt lists a number of rules for naming proteins. UniProt is constantly striving to further standardize the nomenclature for a given protein across related organisms. In this context, we try to use these rules to attribute a recommended name to all the proteins of UniProtKB/Swiss-Prot. We also hope that authors and laboratories will follow these rules as much as possible when naming new proteins.

Cross-references to RZPD-ProtExp

Cross-references have been added to the RZPD-ProtExp. RZPD Deutsches Ressourcenzentrum fuer Genomforschung is a non-profit service center for genomics and proteomics research. We introduced a new cross-reference to the RZPD "Full ORF Clones" product, which is a collection of validated ORF protein expression clones containing the complete coding sequences for genes.

The RZPD-ProtExp database is available at http://www.rzpd.de/products/orfclones/.

The format of the explicit links is:

Resource abbreviation RZPD-ProtExp
Resource identifier Clone name.
Examples
Q8NHQ1:
DR   RZPD-ProtExp; IOH13284; -.
DR   RZPD-ProtExp; IOH22331; -.
DR   RZPD-ProtExp; W0600; -.

Q9NP90:
DR   RZPD-ProtExp; IOH42108; -.
DR   RZPD-ProtExp; U1183; -.

Changes concerning keywords

UniProt release 8.0

Published May 30, 2006

Headlines

UniProtKB/Swiss-Prot major release (50.0)

Release 50.0 of 30-May-2006 of UniProtKB/Swiss-Prot contains 222'289 sequence entries, comprising 81'585'146 amino acids abstracted from 142'438 references.

15'220 sequences have been added since release 49.0, the sequence data of 953 existing entries has been updated and the annotations of 190'604 entries have been revised. This represents an increase of 8%.

Many improvements were carried out in the last 3 months:

  • In order to improve the consistency of annotation of pre- and co-translational events, we have modified the syntax of the comment line topic ALTERNATIVE PRODUCTS, and the feature key VARSPLIC was replaced by VAR_SEQ.
  • We have also replaced the CC line topic DATABASE by WEB RESOURCE to clarify the conceptual difference between the content of these lines and the DR (Database cross-Reference) lines.
  • For viral UniProtKB entries, a new line type, the OH line, was introduced to indicate the host(s) either as a specific organism or taxonomic group of organisms.
  • Cross-references to several databases have been added.

UniProt Knowledgebase release 8.0 includes Swiss-Prot release 50.0 and TrEMBL release 33.0.

Full statistics and release notes

UniProtKB News

Replacement of the feature key VARSPLIC by VAR_SEQ

Pre-translational events have so far been represented by several feature keys, e.g. alternative splicing and promoter usage were annotated with the VARSPLIC feature key, alternative initiation with the INIT_MET feature key and RNA editing with the VARIANT feature key. In order to improve the consistency of annotation of pre- and co-translational events, we have removed the feature key VARSPLIC and introduced the new feature key VAR_SEQ for the description of alternative splicing, alternative promoter usage, alternative initiation and ribosomal frameshifting. The INIT_MET feature key remains, but its usage is now restricted to the annotation of initiator methionine cleavage. We will continue to use the VARIANT feature key and the comment line topic RNA EDITING to describe RNA editing.

Syntax modification of the comment line (CC) topic ALTERNATIVE PRODUCTS

In order to improve the consistency of annotation of pre- and co-translational events, we have modified the syntax of the comment line topic ALTERNATIVE PRODUCTS. This modification allows programs to reconstruct alternative sequences according to the corresponding feature identifiers not only for alternative splicing events, as with the old syntax, but also for alternative promoter usage and alternative initiation events.

The new format of ALTERNATIVE PRODUCTS is:

 CC   -!- ALTERNATIVE PRODUCTS:
 CC       Event=Event(, Event)*; Named isoforms=Number_of_isoforms;
(CC         Comment=Free_text;)?
(CC       Name=Isoform_name;( Synonyms=Synonym(, Synonym)*;)?
 CC         IsoId=Isoform_identifier(, Isoform_identifier)*;
 CC         Sequence=(Displayed|External|Not described|Feature_identifier(, Feature_identifier)*);
(CC         Note=Free_text;)?)+

Note: Variable values are represented in italics. Perl-style multipliers indicate whether a pattern (as delimited by parentheses) is optional (?), may occur 0 or more times (*), or 1 or more times (+). Alternative values are separated by a pipe symbol (|).

The "Event" item lists one or a combination of the following values:

  • Alternative promoter usage
  • Alternative splicing
  • Alternative initiation
  • Ribosomal frameshifting

The "Note" item may specify the event(s), if there are several.

Example:

CC   -!- ALTERNATIVE PRODUCTS:
CC       Event=Alternative splicing, Alternative initiation; Named isoforms=3;
CC         Comment=Isoform 1 and isoform 2 arise due to the use of two
CC         alternative first exons joined to a common exon 2 at the same
CC         acceptor site but in different reading frames, resulting in two
CC         completely different isoforms;
CC       Name=1; Synonyms=p16INK4a;
CC         IsoId=O77617-1; Sequence=Displayed;
CC       Name=3;
CC         IsoId=O77617-2; Sequence=VSP_004099;
CC         Note=Produced by alternative initiation at Met-35 of isoform 1;
CC       Name=2; Synonyms=p19ARF;
CC         IsoId=O77618-1; Sequence=External;
..
FT   VAR_SEQ       1     34       Missing (in isoform 3).
FT                                /FTId=VSP_004099.
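
To illustrate how a program might read the new syntax, the sketch below splits an ALTERNATIVE PRODUCTS block into its header items and per-isoform items. The function name is hypothetical, the input is a trimmed version of the example above (the Comment item is omitted for brevity), and the code assumes that free-text values contain neither ";" nor a line-wrapped "=".

```python
def parse_alternative_products(cc_lines):
    """Return (header_items, isoforms) from the CC lines of one block."""
    text = " ".join(line[5:].strip() for line in cc_lines
                    if line.startswith("CC   "))
    prefix = "-!- ALTERNATIVE PRODUCTS:"
    if not text.startswith(prefix):
        raise ValueError("not an ALTERNATIVE PRODUCTS block")
    header, isoforms = {}, []
    for token in text[len(prefix):].split(";"):
        key, sep, value = token.strip().partition("=")
        if not sep:
            continue  # empty remainder after the final ';'
        if key == "Name":
            isoforms.append({"Name": value})  # each Name starts a new isoform
        elif isoforms:
            isoforms[-1][key] = value
        else:
            header[key] = value  # Event, Named isoforms, Comment
    return header, isoforms


block = [
    "CC   -!- ALTERNATIVE PRODUCTS:",
    "CC       Event=Alternative splicing, Alternative initiation; Named isoforms=3;",
    "CC       Name=1; Synonyms=p16INK4a;",
    "CC         IsoId=O77617-1; Sequence=Displayed;",
    "CC       Name=3;",
    "CC         IsoId=O77617-2; Sequence=VSP_004099;",
    "CC         Note=Produced by alternative initiation at Met-35 of isoform 1;",
    "CC       Name=2; Synonyms=p19ARF;",
    "CC         IsoId=O77618-1; Sequence=External;",
]
header, isoforms = parse_alternative_products(block)
print(header)
for isoform in isoforms:
    print(isoform["Name"], isoform["IsoId"], isoform["Sequence"])
```

The `Sequence` value of isoform 3 is the feature identifier VSP_004099, which is the key for looking up the matching VAR_SEQ feature when reconstructing that isoform's sequence.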

Replacement of the comment line (CC) topic DATABASE by WEB RESOURCE

We have replaced the CC line topic DATABASE by WEB RESOURCE to clarify the conceptual difference between the content of these lines and the DR (Database cross-Reference) lines. At the same time we have simplified the format by suppressing the 'FTP=' field, which is no longer in use.

The format of the DATABASE topic was:

CC   -!- DATABASE: NAME=ResourceName[; NOTE=FreeText][; WWW=WWWAddress][; FTP=FTPAddress].

The format of the WEB RESOURCE topic is:

CC   -!- WEB RESOURCE: NAME=ResourceName[; NOTE=FreeText]; URL=WWWAddress.

The length of these lines may exceed 75 characters because long URL addresses are not wrapped into multiple lines.
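
For users who maintain parsers, the mapping between the two formats can be sketched in Python as follows. This is a minimal sketch, not an official conversion tool: the function name is our own, the example line uses a hypothetical resource, and the code assumes that no field value contains the separator "; ".

```python
import re

OLD_RE = re.compile(r"^CC   -!- DATABASE: (.*)\.$")

def database_to_web_resource(line):
    """Rewrite an old DATABASE line in the WEB RESOURCE format (hypothetical helper)."""
    match = OLD_RE.match(line)
    if match is None:
        raise ValueError("not a DATABASE comment line")
    # Split 'KEY=value' fields; split on the first '=' only, since URLs may contain '='.
    fields = dict(part.split("=", 1) for part in match.group(1).split("; "))
    if "WWW" not in fields:
        raise ValueError("no WWW field to convert into URL")
    parts = ["NAME=" + fields["NAME"]]
    if "NOTE" in fields:
        parts.append("NOTE=" + fields["NOTE"])
    parts.append("URL=" + fields["WWW"])  # the FTP field is dropped
    return "CC   -!- WEB RESOURCE: " + "; ".join(parts) + "."

# Hypothetical example line, not taken from a real entry.
old_line = ("CC   -!- DATABASE: NAME=ExampleDB; NOTE=Hypothetical resource; "
            "WWW=http://example.org/db; FTP=ftp://example.org/pub.")
new_line = database_to_web_resource(old_line)
print(new_line)
```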

Introduction of the new line type OH (Organism Host) for viral hosts

A virus is a living organism only if we consider it associated with its host. The viral taxonomy is arbitrarily based on the nature of viral genomes, and viruses of the same family can infect a wide range of hosts. There are numerous virus-host interactions, which we intend to annotate. We have therefore introduced a new line type in viral UniProtKB entries, the OH line, to indicate the host(s), either as a specific organism or as a taxonomic group of organisms.

The format of the OH line is:

OH   NCBI_TaxID=TaxID; HostName.

The HostName consists of the official name and, optionally, a common name and/or synonym. The length of an OH line may exceed 75 characters.

Example:

OS   Tomato black ring virus (strain E) (TBRV).
OC   Viruses; ssRNA positive-strand viruses, no DNA stage; Comoviridae;
OC   Nepovirus; Subgroup B.
OX   NCBI_TaxID=12277;
OH   NCBI_TaxID=4681; Allium porrum (Leek).
OH   NCBI_TaxID=4045; Apium graveolens (Celery).
OH   NCBI_TaxID=161934; Beta vulgaris (Sugar beet).
OH   NCBI_TaxID=38871; Fraxinus (ash trees).
OH   NCBI_TaxID=4236; Lactuca sativa (Garden lettuce).
OH   NCBI_TaxID=4081; Lycopersicon esculentum (Tomato).
OH   NCBI_TaxID=39639; Narcissus pseudonarcissus (Daffodil).
OH   NCBI_TaxID=3885; Phaseolus vulgaris (Kidney bean) (French bean).
OH   NCBI_TaxID=35938; Robinia pseudoacacia (Black locust).
OH   NCBI_TaxID=23216; Rubus (bramble).
OH   NCBI_TaxID=4113; Solanum tuberosum (Potato).
OH   NCBI_TaxID=13305; Tulipa.
OH   NCBI_TaxID=3603; Vitis.
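
A minimal Python sketch for reading OH lines is shown below. The function name and the returned tuple are our own choices. The sketch assumes that parenthesised common names and synonyms are not nested.

```python
import re

OH_RE = re.compile(r"^OH   NCBI_TaxID=(\d+); (.+)\.$")

def parse_oh_line(line):
    """Return (tax_id, official_name, other_names) for one OH line (hypothetical helper)."""
    match = OH_RE.match(line)
    if match is None:
        raise ValueError("not an OH line: %r" % line)
    tax_id = int(match.group(1))
    host_name = match.group(2)
    # The official name comes first; common names and synonyms follow in parentheses.
    official_name = host_name.split(" (", 1)[0]
    other_names = re.findall(r"\(([^()]*)\)", host_name)
    return tax_id, official_name, other_names

oh_lines = [
    "OH   NCBI_TaxID=4681; Allium porrum (Leek).",
    "OH   NCBI_TaxID=3885; Phaseolus vulgaris (Kidney bean) (French bean).",
    "OH   NCBI_TaxID=13305; Tulipa.",
]
hosts = [parse_oh_line(line) for line in oh_lines]
for host in hosts:
    print(host)
```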

UniProt release 7.7

Published May 16, 2006

Headlines

Tox-Prot

In order to provide the scientific community with a summary of the current knowledge on animal protein toxins, the Swiss-Prot group initiated the Tox-Prot annotation project. The aim of this program is the annotation of all toxin proteins produced by venomous animals, such as snakes, scorpions, spiders, jellyfish, insects, cone snails, sea anemones, lizards, some fish, and platypus.

Toxins are small (usually less than 100 amino acids) and extremely stable. They undergo numerous post-translational modifications and they have very specific targets. The 3D structure of about 15% of the known toxins has been solved, which provides clues to their specificity. Many toxins, such as those synthesized by cone snails (conotoxins), can be used as drugs and some are currently being tested in clinical trials.

At the level of annotation and ultimately for the sequence retrieval by our users, the lack of a systematic nomenclature represents a real problem. This issue is currently being addressed by the scientific community and an official nomenclature has been developed for potassium channel scorpion toxins: Tytgat et al. (1999) and Rodriguez de la Vega and Possani (2004). With the help of Dr. Ricardo C. Rodriguez de la Vega and Prof. Lourival D. Possani, we have created a document which provides links between the official nomenclature and the associated UniProtKB/Swiss-Prot entries.

Finally, we would like to draw your attention to the fact that many toxin sequences are not submitted to any database and have to be manually retrieved by Swiss-Prot annotators. This step slows down the annotation process and may be error-prone. We would thus like to encourage researchers to share their data by submitting them to public databases such as EMBL/GenBank/DDBJ for nucleic acid sequences, and UniProtKB/Swiss-Prot for protein sequences. Experimental data can also be directly submitted by e-mail. For any comments or suggestions concerning the Tox-Prot annotation project, don't hesitate to contact us.


UniProtKB News

Nomenclature of scorpion potassium channel toxins

The document scorpktx.txt lists the potassium-channel-specific scorpion toxins known to date, according to the nomenclature system first described in 1999 by Tytgat et al. and extended in de la Vega et al. For each UniProtKB/Swiss-Prot scorpion potassium channel toxin, this document contains the UniProtKB/Swiss-Prot accession number, the UniProtKB/Swiss-Prot entry name and the systematic name, together with other names of the toxin.

Changes concerning keywords

UniProt release 7.6

Published May 2, 2006

Headlines

5,000 rat entries in UniProtKB/Swiss-Prot

While in imperial China the rat was associated with creativity, honesty and generosity, western culture tends to see rats as vicious, unclean, parasitic animals that steal food and spread disease. Whatever your feelings towards this small rodent are, for modern biologists rats have proved to be a good animal model for many human diseases, such as diabetes, arthritis and cardiovascular diseases. Despite the rat's importance for medical research, and although its genome sequence was published in 2004, the amount of sequence data available is still much smaller than for the other two best studied mammals, namely human and mouse. The number of rat ESTs at the NCBI is only 11% of that of human and 18% of that of mouse. Not many high-throughput cDNA sequencing projects have been initiated and, in the NIH Mammalian Gene Collection, rat sequences represent less than one quarter of human ones. As a result, this trend is also observed in UniProtKB/Swiss-Prot, which is highly dependent upon submissions of sequence data to the public DNA sequence databases EMBL/GenBank/DDBJ. In UniProtKB/Swiss-Prot, rat is the third best represented mammal, after human and mouse. With more than 5,000 entries, it is still underrepresented compared to human and mouse. However, new rat entries are continuously integrated in order to represent all mammalian orthologs of human proteins. Our final aim is to provide our users with a complete set of rat proteins.


UniProtKB News

Sequences with over 10,000 amino acids in UniProtKB/Swiss-Prot

The first sequence with over 10,000 amino acids has entered the Swiss-Prot section of the UniProt Knowledgebase.

Changes concerning keywords

UniProt release 7.5

Published April 18, 2006

Headlines

The ComX pheromone

ComX, a pheromone involved in a major quorum sensing system in bacilli, is post-translationally modified by strain-specific prenylation.

In order to acquire genetic competence once the cell density has reached a critical threshold, bacteria have developed a sophisticated quorum-sensing system. This system proceeds through the release of a pheromone, comX, that activates a two-component system, which eventually propagates the signal into the cell. Both components of this system have been identified: they are the sensor histidine kinase, comP, and the response regulator, the transcription factor comA.

The crucial pheromone ComX is produced as an inactive precursor which is activated by 2 post-translational modifications (PTM): the prenylation of a conserved tryptophan residue and a proteolytic cleavage. The protein comQ is thought to catalyze the maturation of comX. ComX and comQ sequences show striking variability among different strains, as do the prenyl derivatives. The mass of the prenyl groups linked to comX has been determined by mass spectrometry. Surprisingly, three different masses were observed: 120Da, 136Da and 205Da, depending on the strain studied. The 136 and 205Da forms are thought to consist of farnesyl and geranyl groups, respectively. The structure of the 205Da prenyl group in strain W23/RO-E-2 was recently obtained (Q8VL79). It consists of a geranyl group bound to a cyclic tryptophan. Interestingly enough, the nature of the prenyl group was shown to depend on the comX sequence itself rather than on the origin of the modifying enzyme comQ.

To our knowledge, this is the first report of a post-translational prenylation catalyzed by a bacterial enzyme, despite the universal availability of the necessary isoprenoid substrates and the existence of various other lipid-modified proteins in this kingdom. The exact function of comX prenylation is not known, but, by analogy with the situation in eukaryotes, it may provide anchoring to membrane structures.

Of note, this PTM, like all other PTMs in UniProtKB/Swiss-Prot, is annotated using controlled vocabulary.


UniProtKB News

Changes in the tisslist.txt file

The tisslist.txt file lists the tissues that are used in the "TISSUE" topic of the references of UniProtKB/Swiss-Prot entries. It has been changed in the following way:

Each term in the list has an accession number and optionally further relevant information such as synonyms and mappings to eVOC terms. eVOC contains four ontologies - Anatomical system, Cell type, Developmental stage, Pathology - which provide appropriate sets of detailed terms that describe the sample source of human experimental material such as cDNA and SAGE libraries.

The file contains the following line types:

---------  ---------------------------     ----------------------
Line code  Content                         Occurrence in an entry
---------  ---------------------------     ----------------------
ID         Identifier (tissue)             Once; starts an entry
AC         Accession (TS-xxxx)             Once
SY         Synonyms                        Optional; Once or more
DR         eVOC ontologies (eVOC) mapping  Optional; Once or more
//         Terminator                      Once; ends an entry

Examples:

ID   Embryonic lung fibroblast.
AC   TS-0254
DR   eVoc; 0100042; anatomical-system: lung.
DR   eVoc; 0200032; cell-type: fibroblast.
DR   eVoc; 0300001; development-stage: embryo.
//
ID   Mammary tumor.
AC   TS-0597
SY   Breast tumor; Mammary gland tumor; Mammary tumour.
DR   eVoc; 0100124; anatomical-system: breast.
DR   eVoc; 0400051; pathology: tumour.
//
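
The line types above can be read with a short Python sketch such as the following. The function name and the dictionary keys are our own assumptions. The two-letter line code is taken from columns 1-2 and the content from column 6 onwards, as in the examples.

```python
def parse_tisslist(text):
    """Parse tisslist.txt-style entries into a list of dicts (hypothetical helper)."""
    entries = []
    current = None
    for line in text.splitlines():
        code, value = line[:2], line[5:].strip()
        if code == "ID":
            # ID starts a new entry.
            current = {"id": value.rstrip("."), "ac": None, "synonyms": [], "evoc": []}
        elif code == "AC":
            current["ac"] = value
        elif code == "SY":
            names = value.rstrip(".").split(";")
            current["synonyms"].extend(n.strip() for n in names if n.strip())
        elif code == "DR":
            # 'eVoc; identifier; ontology: term.'
            _, identifier, term = [p.strip() for p in value.rstrip(".").split(";", 2)]
            current["evoc"].append((identifier, term))
        elif code == "//":
            # The terminator ends the entry.
            entries.append(current)
            current = None
    return entries

sample = """\
ID   Embryonic lung fibroblast.
AC   TS-0254
DR   eVoc; 0100042; anatomical-system: lung.
DR   eVoc; 0200032; cell-type: fibroblast.
DR   eVoc; 0300001; development-stage: embryo.
//
ID   Mammary tumor.
AC   TS-0597
SY   Breast tumor; Mammary gland tumor; Mammary tumour.
DR   eVoc; 0100124; anatomical-system: breast.
DR   eVoc; 0400051; pathology: tumour.
//
"""
tissues = parse_tisslist(sample)
for tissue in tissues:
    print(tissue["ac"], tissue["id"], tissue["synonyms"])
```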

Changes concerning keywords

UniProt release 7.4

Published April 4, 2006

Headlines

Complete proteome for Drosophila melanogaster

The keyword complete proteome has been added to all Drosophila melanogaster entries in UniProtKB. This is the second metazoan to have the keyword added, the other one being Caenorhabditis elegans. Eleven other eukaryotes have the keyword added: ten complete fungal genomes and Plasmodium yoelii yoelii. The presence of this keyword allows easy retrieval of a complete non-redundant set of proteins from the Drosophila melanogaster genome (nuclear and mitochondrial) across the Swiss-Prot and TrEMBL sections of the UniProt Knowledgebase. To add the keyword, all fruit fly UniProtKB/Swiss-Prot entries have been updated with the genome project reference (Adams et al., 2000, Science 287:2185-2195), along with other relevant updates as appropriate.

UniProtKB Release 7.4 has 2361 Drosophila melanogaster entries in UniProtKB/Swiss-Prot and 25453 entries in UniProtKB/TrEMBL. Addition of the keyword 'Complete proteome' will allow the retrieval of the complete nonredundant proteome consisting of 16229 entries, 2329 from UniProtKB/Swiss-Prot and 13900 from UniProtKB/TrEMBL. The proteome can also be downloaded from our FTP server or from Integr8.


UniProtKB News

Cross-references to UniGene

Cross-references have been added to UniGene, a sequence database which provides the automatic partition of GenBank sequences into a non-redundant set of gene-oriented clusters. Each UniGene cluster contains sequences that represent a unique gene, as well as related information such as the tissue types in which the gene has been expressed and map location.

Examples:

Q9ZNT7:
DR   UniGene; At.24021.
DR   UniGene; At.64486.

P59990:
DR   UniGene; Hs.505267.

UniProt release 7.3

Published March 21, 2006

Headlines

Chikungunya virus annotation

The Chikungunya virus has caused a severe outbreak on the French island of Reunion and also in Mauritius, Seychelles, Mayotte and Madagascar, all located off the southeast coast of Africa. The virus is not deadly, but causes severe fever, rash, arthritis and joint pain. These symptoms are at the origin of the name Chikungunya, which means in Swahili "that which bends up".

The virus belongs to the large family of Togaviridae, genus Alphavirus. This family includes exotic viruses like O'nyong nyong and Igba Oro, which are very closely related to Chikungunya and actually induce the same disease in humans. Interestingly, the name O'nyong nyong comes from the Nilotic language of Uganda and Sudan and means "weakening of the joints".

Chikungunya is transmitted by mosquitos, in which it infects salivary glands, but the natural host reservoir consists of different types of monkeys.

The molecular strategy used by Alphaviruses to replicate and hijack cellular defense is very surprising:

After virus entry into the target cell, the mRNA(+) genome is translated into a nonstructural polyprotein, which discreetly starts to replicate the genome in the cytoplasm. After this early phase, in which the virus avoids cellular defense by restraining its activity, the nonstructural polyprotein is processed into four proteins. The virus then goes at full strength to replicate large amounts of its genome, and also innocently transcribes a subgenomic 26S RNA. This replication has a drawback: it creates dsRNA by genome and antigenome hybridization.

No eukaryotic cell can accept such an offence: dsRNA is a signature of viral infection. The host cell reacts violently: human PKR is strongly activated by the dsRNA, resulting in a complete shutoff of cellular translation through inactivation of early initiation of translation factor EIF2A.

But this powerful cellular defense was expected by the virus. The 26S mRNA possesses a unique feature in biology: an enhancer element which allows the mRNA to be translated independently of EIF2A. This 26S RNA codes for the structural proteins, which are now the only proteins synthesized by the cell! These structural proteins form new virions which bud from the doomed cell to find new targets.

The following are examples of new Alphavirus entries in release 7.3:


UniProtKB News

Changes concerning keywords

UniProt release 7.2

Published March 7, 2006

Headlines

The most frequently updated entry

The most obvious way to quantify the work done by UniProtKB/Swiss-Prot is to count the increase in new entries. However, the integration of new entries is only part of the annotation work. Providing the scientific community with high quality data also - and maybe mostly - involves time-consuming updates of older entries. Thanks to the introduction of sequence and annotation version numbers and to the creation of the UniProtKB Sequence/Annotation Version Database, it is now possible to know when and how a UniProtKB/Swiss-Prot entry has been updated.

Sequences shown at the bottom of each entry are relatively stable. In 83% of entries, the sequence has not been updated since its integration in the knowledgebase; in 15% of the entries, it has been updated once. So far, the maximal number of sequence updates is 6, and this is observed in only 8 entries. By contrast, the annotation has to be constantly reviewed.

Currently the average UniProtKB/Swiss-Prot entry has been updated more than 50 times. The most frequently updated entry is human coagulation factor IX, which has been reviewed 103 times while its sequence has been updated only once. The human coagulation factor IX entry was created in the first UniProtKB/Swiss-Prot release in July 1986. Since then, over 50 references have been added, mostly dealing with polymorphisms and disease-causing mutations. This is also reflected at the level of the feature table, where the number of described variants has risen from 1 to 145, most of them associated with hemophilia. While the presence of gamma-carboxyglutamate was already well-established 20 years ago and the sites of N-glycosylation suspected, other post-translational modifications, such as sites of O-glycosylation, phosphorylation and sulfation, were described, and thus annotated, later. Information on secondary structure was added to the entry in 1994, as well as the first link to the 3D structure submitted to PDB. Today 7 cross-references to PDB are provided.

Science is going forward and we, at Swiss-Prot, are doing our best to keep pace. Nevertheless, we need the user community to help us in this task. All our entries are equipped with a "Submit update" button and we greatly encourage you to use it every time your favourite protein is not up-to-date in UniProtKB/Swiss-Prot, or if it is not yet integrated.


UniProtKB News

Cross-references to GenomeReviews

Cross-references have been added to GenomeReviews, a genome annotation database which provides an up-to-date, standardised and comprehensively annotated view of the genomic sequence of organisms with completely deciphered genomes.

The format of the explicit links in the flat file is:

Resource abbreviation   GenomeReviews
Resource identifier     GenomeReviews accession number.
Optional information 1  Ordered locus name, or, if it does not exist, gene name.

Examples:

P08409:
DR   GenomeReviews; U00096_GR; b0016.
DR   GenomeReviews; U00096_GR; b0582.
DR   GenomeReviews; U00096_GR; b2394.

Q92YD2:
DR   GenomeReviews; AE006469_GR; betB2.

Changes concerning keywords

UniProt release 7.1

Published February 21, 2006

Headlines

Over 25'000 protein polymorphisms annotated in UniProtKB/Swiss-Prot

UniProtKB/Swiss-Prot pays particular attention to the annotation of protein polymorphisms as well as disease mutations, due to their importance for the understanding of genetic diseases. Information on genetic variations and diseases is annotated in the comment section under the topic CC DISEASE and in the feature table using the key FT VARIANT. Literature reports used for data extraction are also cited in the entry, and data are manually checked prior to integration into the knowledgebase. Links to OMIM are provided whenever possible. As UniProtKB/Swiss-Prot is a 'proteocentric' resource, we do not annotate frameshifts or nonsense mutations as their deleterious effect on the protein is usually obvious. We therefore concentrate on tracking and storing data on amino acid substitutions, small deletions or insertions.

We have currently reached a total of 25'255 variants in 4'196 human sequences: 98% of the variants are single amino acid polymorphisms (SAP). Association of a variant with a disease is annotated according to literature reports. In the current release, 13'581 SAPs are disease-associated, 9'451 are neutral polymorphisms and 1'816 are unclassified.


UniProtKB News

The UniProtKB Sequence/Annotation Version Database (UniSave)

The introduction of more exact sequence and entry modification dates allowed us to introduce a new service: the UniProtKB Sequence/Annotation Version Database (UniSave) is a comprehensive archive of UniProtKB/Swiss-Prot and UniProtKB/TrEMBL entry versions. Unlike the UniProt Knowledgebase, which contains only the latest Swiss-Prot and TrEMBL entry and sequence versions, the UniProtKB Sequence/Annotation Version Database provides access to all versions of these entries. This makes it possible to track sequence changes and to find out when a given annotation appeared in an entry and how it evolved.

All archived entry versions are available through the UniProtKB Sequence/Annotation Version Database in flat file and fasta format. Any two given entry versions can be compared to each other.

Changes concerning keywords

UniProt release 7.0

Published February 7, 2006

Full statistics and release notes

Headlines

UniProtKB/Swiss-Prot major release (49.0)

Release 49.0 of 07-Feb-2006 of UniProtKB/Swiss-Prot contains 207'132 sequence entries, comprising 75'438'310 amino acids abstracted from 139'151 references. 12'815 sequences have been added since release 48, the sequence data of 991 existing entries has been updated and the annotations of all entries have been revised. This represents an increase of 7%.

Many improvements were carried out in the last 5 months. In particular, we have changed from showing only the dates corresponding to full UniProtKB releases in the DT lines to displaying the date of the biweekly release at which an entry is integrated or updated. We dropped the information concerning the release number and introduced entry and sequence version numbers in the DT lines.

Cross-references to several databases have been added, and we have changed our copyright statement. (Please read below for further details)

UniProtKB News

Dates and Versions

We changed from showing only the dates corresponding to full UniProtKB releases in the DT lines to displaying the date of the biweekly release at which an entry is integrated or updated. We dropped the information concerning the release number and introduced entry and sequence version numbers in the DT lines.

The new format of the three DT lines is:

DT   DD-MMM-YYYY, integrated into UniProtKB/database_name.
DT   DD-MMM-YYYY, sequence version version_number.
DT   DD-MMM-YYYY, entry version version_number.

Example for UniProtKB/Swiss-Prot:

DT   01-JAN-1998, integrated into UniProtKB/Swiss-Prot.
DT   15-OCT-2001, sequence version 3.
DT   01-APR-2004, entry version 14.

Example for UniProtKB/TrEMBL:

DT   01-FEB-1999, integrated into UniProtKB/TrEMBL.
DT   15-OCT-2000, sequence version 2.
DT   15-DEC-2004, entry version 5.

The sequence version number of an entry is incremented by one when its amino acid sequence is modified. The entry version number is incremented by one whenever any data in the flat file representation of the entry is modified.
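
As an illustration, the three DT lines can be read with the following Python sketch. The function name and the returned keys are our own assumptions. Month abbreviations are mapped explicitly so that the result does not depend on the system locale.

```python
import re
from datetime import date

DT_RE = re.compile(
    r"^DT   (?P<date>\d{2}-[A-Z]{3}-\d{4}), "
    r"(?:integrated into UniProtKB/(?P<db>[\w-]+)"
    r"|sequence version (?P<seq>\d+)"
    r"|entry version (?P<entry>\d+))\.$"
)
MONTHS = {name: number for number, name in enumerate(
    "JAN FEB MAR APR MAY JUN JUL AUG SEP OCT NOV DEC".split(), start=1)}

def parse_date(text):
    """Convert DD-MMM-YYYY into a datetime.date."""
    day, month, year = text.split("-")
    return date(int(year), MONTHS[month], int(day))

def parse_dt_lines(lines):
    """Collect dates and version numbers from the DT lines of one entry (hypothetical helper)."""
    info = {}
    for line in lines:
        match = DT_RE.match(line)
        if match is None:
            raise ValueError("unexpected DT line: %r" % line)
        when = parse_date(match.group("date"))
        if match.group("db"):
            info["database"] = match.group("db")
            info["integrated"] = when
        elif match.group("seq"):
            info["sequence_version"] = int(match.group("seq"))
            info["sequence_date"] = when
        else:
            info["entry_version"] = int(match.group("entry"))
            info["entry_date"] = when
    return info

dt_info = parse_dt_lines([
    "DT   01-JAN-1998, integrated into UniProtKB/Swiss-Prot.",
    "DT   15-OCT-2001, sequence version 3.",
    "DT   01-APR-2004, entry version 14.",
])
print(dt_info)
```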

We retrofitted the entry and sequence version numbers, as well as all dates, using archived UniProtKB releases.

Addition of a feature (FT) key CHAIN over the whole sequence length

The feature key CHAIN was previously only used to describe processed protein sequences. In UniProtKB/Swiss-Prot, we have now added a "FT CHAIN" to all entries that had neither a "FT CHAIN" nor a "FT PEPTIDE". This led to the addition of a "FT CHAIN", covering the full length of the sequence, to more than 170 000 entries. In this way, all mature proteins in UniProtKB/Swiss-Prot are described in the feature lines.

Release of a new document about post-translational modifications

The controlled vocabulary for the post-translational modifications (PTMs) that are annotated in the UniProtKB feature table has moved from the UniProtKB User Manual to a separate document, ptmlist.txt.

The document contains, for each individual modification, the controlled vocabulary term and its associated feature key, and additional information, such as the amino acid that can be modified and the mass difference, the position on the protein sequence, the taxonomic distribution, the protein location, associated keyword(s), as well as links to the UniProtKB entries that contain the annotation, and to the corresponding entry in the RESID Database of Protein Modifications.

Copyright

We have changed the copyright statement in all UniProt Knowledgebase entries. All UniProtKB/Swiss-Prot and UniProtKB/TrEMBL entries, as well as all documents, now contain the statement:
CC   -----------------------------------------------------------------------
CC   Copyrighted by the UniProt Consortium, see http://www.uniprot.org/terms
CC   Distributed under the Creative Commons Attribution-NoDerivs License
CC   -----------------------------------------------------------------------

Cross-References to PptaseDB

Cross-references have been added to the Prokaryotic Protein Phosphatase Database, a database which provides information concerning prokaryotic and archaeal phosphatases for which experimental evidence exists demonstrating phosphatase activity. The Prokaryotic Protein Phosphatase Database is available at http://vigen.biochem.vt.edu/p3d/p3d.htm.

The format of the explicit links in the flat file is:

Resource abbreviation   PptaseDB
Resource identifier     PptaseDB unique phosphatase identifier.

Example:

O52787:
DR   PptaseDB; P3D040495; -.

Cross-references to BioCyc

Cross-references have been added to BioCyc, a collection of Pathway/Genome Databases. Each Pathway/Genome Database describes the genome and metabolic pathways of a single organism, with the exception of the MetaCyc database, which is a reference source on metabolic pathways from many organisms. BioCyc is available at http://www.biocyc.org/.

Implicit links to the EcoCyc database were previously provided in the NiceProt view of relevant Swiss-Prot entries on ExPASy.

The format of the explicit links in the flat file is:

Resource abbreviation   BioCyc
Resource identifier     BioCyc database code and identifier, separated by a colon.

Examples:

P21170:
DR   BioCyc; EcoCyc:ARGDECARBOXBIO-MONOMER; -.

Q9HCC0:
DR   BioCyc; MetaCyc:MONOMER-10082; -.
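
Since the resource identifier combines two values, a parser has to split it at the colon. The Python sketch below shows one way to do this; the function name is our own and the code only handles the BioCyc DR line shown in the examples above.

```python
def parse_biocyc_dr(line):
    """Return (database_code, identifier) from a BioCyc DR line (hypothetical helper)."""
    prefix = "DR   BioCyc; "
    if not (line.startswith(prefix) and line.endswith(".")):
        raise ValueError("not a BioCyc DR line: %r" % line)
    # Drop the prefix and the final period, then separate the two fields.
    resource_id, optional = line[len(prefix):-1].split("; ")
    # The database code precedes the first colon.
    database_code, identifier = resource_id.split(":", 1)
    return database_code, identifier

biocyc_refs = [
    parse_biocyc_dr("DR   BioCyc; EcoCyc:ARGDECARBOXBIO-MONOMER; -."),
    parse_biocyc_dr("DR   BioCyc; MetaCyc:MONOMER-10082; -."),
]
print(biocyc_refs)
```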

UniProt release 6.9

Published January 24, 2006

Headlines

Mammalian, Xenopus and Zebrafish Gene Collections: a goldmine for high-quality sequences

High-quality nucleotide sequences derived from high-throughput sequencing projects, such as those generated by the NIH Gene Collection (GC) initiatives, are extremely valuable for a protein sequence database like UniProtKB/Swiss-Prot. More than 99.98% of the UniProtKB/Swiss-Prot sequences are generated by translation of nucleotide sequences rather than by direct protein sequencing. In this context, high-quality nucleotide sequences provide a rapid and easy way to control the accuracy of the sequences. Differences between sequences may point to the existence of polymorphisms, and many alternative splicing isoforms have been introduced thanks to these projects.

Launched in 1999, the Mammalian Gene Collection (MGC) is an NIH multi-institutional initiative. Its goal is to identify and sequence cDNA clones containing a full-length open reading frame. Initially aimed at human and mouse sequences, it was further expanded to rat and bovine clones. Two additional projects enriched the first initiative: they deal with Xenopus (XGC) and Zebrafish (ZGC). The sequences obtained by these projects are submitted to the EMBL/GenBank/DDBJ databases; the submitted CDS are translated and automatically integrated into UniProtKB/TrEMBL. The UniProtKB/TrEMBL entries can then be manually annotated and integrated into UniProtKB/Swiss-Prot. Following the principle of non-redundancy, sequences derived from the same gene in the same species are merged into one UniProtKB/Swiss-Prot entry. This is reflected at the level of cross-references. For instance, currently, the average number of distinct nucleotide sequence cross-references per human entry is close to 5. This implies that each human sequence has been confirmed, on average, by 5 independent submitted sequences, and thus the accuracy of the sequences shown in UniProtKB/Swiss-Prot entries is quite high.

Currently, close to 16'000 UniProtKB/Swiss-Prot entries contain data from GC submissions. Considering the various species involved, it means that MGC data are found in more than 60% of the human entries, more than 50% of mouse entries, 25% of rat entries, but only 2% of bovine entries. ZGC data can be found in close to 55% of zebrafish entries and XGC in close to 20% of Xenopus laevis entries and 85% of Xenopus tropicalis entries.


UniProtKB News

Cross-references to MIM

Various MIM cross-references can be present in a single UniProtKB/Swiss-Prot human entry. They were annotated in the DR lines according to a format that does not distinguish between MIM entries describing a gene and MIM entries describing a phenotype:

DR   MIM; 608463; -.

We added a field to the DR MIM line to allow users and programs to distinguish between MIM "gene" and "phenotype" entries.

The new format of the DR MIM line is:

DR   MIM; MIM_identifier; token.

Where token is one of the following values:

gene
MIM entries which describe a gene
phenotype
MIM entries which describe a phenotype
gene+phenotype
MIM entries which describe both a gene and a phenotype

Examples:

DR   MIM; 608463; gene.
DR   MIM; 603813; phenotype.
DR   MIM; 124080; gene+phenotype.
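
Programs can use the new token as in the following Python sketch. The function names are our own. Lines in the old format, which carry "-" instead of a token, are rejected by this sketch.

```python
MIM_TOKENS = {"gene", "phenotype", "gene+phenotype"}

def parse_mim_dr(line):
    """Return (mim_identifier, token) from a DR MIM line (hypothetical helper)."""
    prefix = "DR   MIM; "
    if not (line.startswith(prefix) and line.endswith(".")):
        raise ValueError("not a DR MIM line: %r" % line)
    identifier, token = line[len(prefix):-1].split("; ")
    if token not in MIM_TOKENS:
        raise ValueError("unknown MIM token: %r" % token)
    return identifier, token

def describes_gene(token):
    """True for 'gene' and 'gene+phenotype'."""
    return "gene" in token.split("+")

def describes_phenotype(token):
    """True for 'phenotype' and 'gene+phenotype'."""
    return "phenotype" in token.split("+")

mim_refs = [parse_mim_dr(line) for line in (
    "DR   MIM; 608463; gene.",
    "DR   MIM; 603813; phenotype.",
    "DR   MIM; 124080; gene+phenotype.",
)]
gene_ids = [ident for ident, token in mim_refs if describes_gene(token)]
print(gene_ids)
```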

Changes concerning keywords

Deleted keywords:

  • Feather
  • Plasma

UniProt release 6.8

Published January 10, 2006

Headlines

Major update of influenza A viruses: H5N1 pathogenicity

Influenza A viruses are named depending on their surface protein subtype, H for hemagglutinin and N for neuraminidase. There are 16 known H subtypes and 9 known N subtypes for influenza A virus. All of them infect birds; a few, such as H1N1, H1N2 and H3N2, can infect humans.

'Avian influenza' is used to name viruses commonly restricted to birds, such as the H5, H7 and H9 subtypes. Most avian influenza subtypes cause very mild diseases, but the H5 and H7 subtypes can cause outbreaks involving massive deaths in domestic poultry. During these outbreaks, sporadic transmission to humans has been reported. Fortunately humans are dead end hosts for these viruses, i.e. infected humans do not transmit the virus. Although a few human cases of H7N7 and H9N2 have been documented, the major threat remains the H5N1 subtype.

H5N1 is not a new virus: it was isolated from birds in Scotland back in 1959 (hemagglutinin: P09345). It became famous after the first big outbreak in 1997 in Hong Kong, where 1.5 million poultry were affected and destroyed, and 18 human cases occurred, six of which were fatal (hemagglutinin: O56140). This was the first time that direct transmission of an avian influenza A virus from birds to humans had been found.

In 2003 two cases of H5N1 occurred in Hong Kong, one fatal. How or where these two family members were infected was not determined. In 2004 and 2005, severe outbreaks happened in Thailand, Vietnam, Cambodia and Indonesia, for a total of 130 human cases, 70 of which were fatal. Most of these cases occurred as a result of people having direct or close contact with infected poultry; however, a few cases of human-to-human spread of H5N1 have occurred.

Why is H5N1 so deadly in poultry and humans? Presumably because of small sequence variations in hemagglutinin.

Hemagglutinin is present at the virion surface, and its function is both to bind the cellular receptor and to induce fusion of the viral and target cell membranes. In order to be able to promote fusion, the protein must be cleaved. In common influenza A viruses, the cleavage site is specific to proteases present in the respiratory tract. Hence influenza infection is restricted to this organ.

H5 and H7 have a completely different cleavage site, rich in arginine and lysine residues (RRRKKR in Hong Kong 1997: O56140), which can be processed by ubiquitous proteases: furins. This results in an infection of almost all host organs, and an acute pathology which can be quickly fatal.

Few antiviral drugs are effective against influenza. Zanamivir (Relenza) and oseltamivir (Tamiflu) are inhibitors of the neuraminidase (e.g. Q9W7Y7); amantadine and rimantadine are inhibitors of the ion channel M2 protein (e.g. O70632). Unfortunately drug resistance evolves rapidly, and a case of H5N1 resistant to Tamiflu has already been reported (Nature 437:1108-1108(2005)).

The following are examples of updated influenza entries:

H5N1 isolated from human, in Hong Kong 1997:

Hemagglutinin: HEMA_IAHO3 (O56140)
Neuraminidase: NRAM_IAHO3 (Q9W7Y7)
M1: M1_IAHO3 (Q77Y95)
M2: M2_IAHO3 (O70632)
NS1: NS1_IAHO3 (O56264)
NEP: NEP_IAHO3 (O56263)
PB1-F2: PB1F2_IAHO3 (P0C0U0)
Nucleoprotein: NCAP_IAHO3 (O92784)
PA: PA_IAHO3 (O89752)
PB1: RDRP_IAHO3 (Q9WLS3)
PB2: PB2_IAHO3 (O56266)

H5N1 isolated from chicken in 1959:

Hemagglutinin: HEMA_IACKS (P09345)



UniProtKB News

Format change in the dbxref.txt document file

The dbxref.txt file lists the names, abbreviations and URLs of all databases cross-referenced in the UniProt Knowledgebase. We have added a new mandatory field, "Cat". This field contains the database category, and will allow us to display cross-references in our entry view in a more user-friendly and explicit manner.

Currently used categories are:

  • 2D gel databases
  • 3D structure databases
  • Enzyme and pathway databases
  • Family and domain databases
  • Gene expression databases
  • Ontologies
  • Organism-specific gene databases
  • Other
  • PTM databases
  • Polymorphism databases
  • Protein family/group databases
  • Protein-protein interaction databases
  • Sequence databases

Example:

Abbrev: EcoGene
Name  : Escherichia coli strain K12 genome database
LinkTp: Explicit
Server: http://www.ecogene.org/
Db_URL: www.ecogene.org/geneInfo.php?eg_id=%s
Cat   : Organism-specific gene databases
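
As an illustration only, a record in this layout could be read with a few lines of Python. This is a minimal sketch, not an official parser: the function names and the placeholder identifier used in the usage note are assumptions, and the only rule enforced is the one stated above, namely that "Cat" is mandatory.

```python
EXAMPLE = """\
Abbrev: EcoGene
Name  : Escherichia coli strain K12 genome database
LinkTp: Explicit
Server: http://www.ecogene.org/
Db_URL: www.ecogene.org/geneInfo.php?eg_id=%s
Cat   : Organism-specific gene databases
"""


def parse_dbxref_record(text):
    """Turn one 'Key: value' block into a dict (hypothetical helper)."""
    record = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        # Split on the first colon only, so URLs in the value stay intact.
        key, sep, value = line.partition(":")
        if not sep:
            raise ValueError("not a 'Key: value' line: %r" % line)
        record[key.strip()] = value.strip()
    if "Cat" not in record:
        raise ValueError("mandatory field 'Cat' is missing")
    return record


def entry_url(record, identifier):
    """Fill the %s placeholder of Db_URL with a resource identifier."""
    return record["Db_URL"].replace("%s", identifier)
```

For example, `parse_dbxref_record(EXAMPLE)["Cat"]` gives the category string, and `entry_url(record, "EG00000")` substitutes a made-up identifier into the URL template.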

Changes concerning keywords

Modified keywords:

Deleted keywords:

  • Chorion
  • Multigene family
  • Myelin
  • Seminal vesicle
  • Sperm
  • T-cell
  • Testis

UniProt release 6.7

Published December 20, 2005

Headlines

'De-merge' of multi-species in UniProtKB/Swiss-Prot

UniProtKB/Swiss-Prot, as a non-redundant protein database, used to "merge" entries originating from different species if they were 100% conserved. In merged entries, the source organisms were listed in the OS (Organism Species) lines, e.g. actin, P03996 (ACTA_HUMAN):

OS   Homo sapiens (Human), Mus musculus (Mouse), Rattus norvegicus (Rat),
OS   Bos taurus (Bovine), and Oryctolagus cuniculus (Rabbit).

However, the OC (Organism Classification) lines only contained the taxonomy of the first listed species, and the "species part" of the entry name was built on the first organism in the list ("_HUMAN").

As the type of information on proteins has greatly evolved, and more and more species-specific data have been documented, Swiss-Prot had to adapt and change its merging policy. While creating two or more entries for an identical sequence may seem to contradict the principle of non-redundancy at the sequence level, it does make sense from the annotation point of view. The new policy makes it possible to clarify which item of information has been proven for which organism. Even if a protein has the same sequence in two or more different organisms, there may be evidence for different post-translational modifications, sequence variants, alternative splicing, protein-protein interactions, tissue specificity, and implication in diseases. Moreover, since some organism-specific scientific communities use different gene name nomenclatures, it is important to reflect such species-specific nomenclature usage.

With this release, we have completed the de-merging of all the UniProtKB/Swiss-Prot entries (almost 6'000) that contained information relative to two or more distinct species.

The primary accession number of a formerly merged entry has been retained as a secondary accession number in all of the resulting de-merged entries. A new primary accession number has been attributed to all de-merged entries.

In the example above: ACTA_HUMAN (old primary AC: P03996, old secondary AC: P04108) has been de-merged into:

----------  --------------  --------------------
Entry name  New primary AC  Secondary ACs
----------  --------------  --------------------
ACTA_BOVIN  P62739          P03996 P04108 Q862W5
ACTA_HUMAN  P62736          P03996 P04108
ACTA_MOUSE  P62737          P03996 P04108
ACTA_RABIT  P62740          P03996 P04108
ACTA_RAT    P62738          P03996 P04108 P70476
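
One practical consequence is that a retired primary accession number no longer identifies a single entry. The sketch below encodes the table above as a plain Python dictionary (the variable and function names are illustrative assumptions) and looks up every de-merged entry that carries a given secondary accession number.

```python
# Entry name -> (new primary AC, secondary ACs), copied from the table above.
DEMERGED = {
    "ACTA_BOVIN": ("P62739", ["P03996", "P04108", "Q862W5"]),
    "ACTA_HUMAN": ("P62736", ["P03996", "P04108"]),
    "ACTA_MOUSE": ("P62737", ["P03996", "P04108"]),
    "ACTA_RABIT": ("P62740", ["P03996", "P04108"]),
    "ACTA_RAT": ("P62738", ["P03996", "P04108", "P70476"]),
}


def entries_for_secondary(accession):
    """Return the entry names that list `accession` as a secondary AC."""
    return sorted(
        name
        for name, (_primary, secondary) in DEMERGED.items()
        if accession in secondary
    )
```

Looking up the old primary accession P03996 returns all five de-merged entries, whereas P70476 resolves to ACTA_RAT only.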

UniProtKB News

Changes concerning keywords

Modified keywords:

Deleted keywords:

  • Seed
  • Seed embryo

Changes concerning the controlled vocabulary for PTMs

New terms for the feature key 'CROSSLNK':

  • Cyclopeptide (Arg-Cys)
  • Cyclopeptide (Cys-Arg)

New terms for the feature key 'MOD_RES':

  • ADP-ribosylasparagine
  • ADP-ribosylserine
  • Cysteine methyl disulfide

UniProt release 6.6

Published December 6, 2005

Headlines

200'000 entries in UniProtKB/Swiss-Prot

The Swiss-Prot group is happy to announce that a total number of 200'000 manually annotated entries has been reached in UniProtKB/Swiss-Prot. It took 15 years and 2 months to reach the first 100'000 entries in September 2001, but only 4 years and 2 months to reach 200'000 entries. The first (P99999) and the 200'000th (Q52V10) entries deal with human and common squirrel monkey cytochrome c, respectively. Both were created by Amos Bairoch, who founded the knowledgebase, the first in July 1986 and the last in November 2005.

We would like to acknowledge our users, who provide continuous support by suggesting entry updates, by sharing their expertise in order to increase annotation quality or simply by using UniProtKB/Swiss-Prot.

Many thanks also to all annotators, programmers, system administrators, administrative support persons, members of the UniProt Consortium, who have contributed to this major achievement.

UniProt release 6.5

Published November 22, 2005

Headlines

Keyword hierarchies and categories

We have changed the structure of the UniProtKB keyword list, and would like to take this opportunity to describe some concepts behind the use of the keywords in UniProtKB/Swiss-Prot.

UniProtKB/Swiss-Prot entries are tagged with keywords. Keywords help summarize the contents of individual entries, simplify retrieval of sets of entries, and allow entries to be grouped easily according to different aspects such as biological processes, molecular function, subcellular location, domains, ligands, sequence modifications and diseases.


The keywords are described in the keywlist.txt file using the following format:


---------  ---------------------------     ----------------------
Line code  Content                         Occurrence in an entry
---------  ---------------------------     ----------------------
ID         Identifier (keyword)            Once; starts an entry
AC         Accession (KW-xxxx)             Once
DE         Definition                      Once or more
SY         Synonyms                        Optional; Once or more
GO         Gene ontology (GO) mapping      Optional; Once or more
HI         Hierarchy                       Optional; Once or more
CA         Category                        Once
//         Terminator                      Once; ends an entry

Example of a complete keyword description:

ID   Calcium channel.
AC   KW-0107
DE   Cell membrane glycoprotein forming a channel in a biological membrane
DE   selectively permeable to calcium ions. Calcium is essential for a
DE   variety of bodily functions, such as neurotransmission, muscle
DE   contraction and proper heart function.
GO   GO:0005262; calcium channel activity
HI   Molecular function: Ionic channel; Calcium channel.
HI   Biological process: Transport; Ion transport; Calcium transport; Calcium channel.
HI   Ligand: Calcium; Calcium channel.
CA   Molecular function.
//

Some keywords are by definition supersets or subsets of others. Such hierarchical relationships are stated in HI lines:

HI   Category: Keyword(1); ...; Keyword(n); Described keyword.

From the previous example we can infer that a UniProtKB/Swiss-Prot entry that is tagged with the keyword "Calcium channel" will at least have the following additional keywords appear in the KW line:

KW   Calcium; Calcium transport; Ion transport; Ionic channel; Transport.
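
The inference described above can be sketched in Python. This is an illustrative reading of the HI line format shown in the example, not the procedure used by the curators: each HI line is split into its category and its semicolon-separated path, and every term that precedes the described keyword is collected.

```python
HI_LINES = [
    "HI   Molecular function: Ionic channel; Calcium channel.",
    "HI   Biological process: Transport; Ion transport; Calcium transport; Calcium channel.",
    "HI   Ligand: Calcium; Calcium channel.",
]


def implied_keywords(hi_lines, keyword):
    """Collect the broader keywords implied by the HI lines of `keyword`."""
    implied = set()
    for line in hi_lines:
        body = line[2:].strip()          # drop the 'HI' line code
        if body.endswith("."):
            body = body[:-1]             # drop the terminating period
        _category, _, path = body.partition(":")
        terms = [term.strip() for term in path.split(";")]
        # The last term of each path is the described keyword itself.
        if terms and terms[-1] == keyword:
            implied.update(terms[:-1])
    return sorted(implied)


def kw_line(keywords):
    """Format a list of keywords as a KW line."""
    return "KW   " + "; ".join(keywords) + "."
```

Applied to the "Calcium channel" example, `kw_line(implied_keywords(HI_LINES, "Calcium channel"))` reproduces the KW line shown above.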

This formalization of the relationships between keywords enables our curators (assisted by automated procedures) to ensure coherence, and to increase the coverage of UniProtKB/Swiss-Prot entries with keywords describing both specific and more general concepts. This in turn facilitates the retrieval of complete and coherent entry sets by keyword. The current UniProtKB/Swiss-Prot release contains close to one million keywords in almost 200'000 entries.

A "Category" is a top-level keyword that never appears directly in UniProtKB/Swiss-Prot entries. Categories are described along with the other keywords, but are introduced by an IC rather than an ID line using the following format:

---------  ---------------------------     ----------------------
Line code  Content                         Occurrence in an entry
---------  ---------------------------     ----------------------
IC         Identifier (category)           Once; starts a category entry
AC         Accession (KW-xxxx)             Once
DE         Definition                      Once or more

Example of a category description:

IC   PTM.
AC   KW-9991
DE   Keywords assigned to proteins because their sequences can differ from
DE   the mere translation of their corresponding genes, due to some post-
DE   translational modification.

Changes concerning keywords

New keywords:

UniProt release 6.4

Published November 8, 2005

Headlines

A city-sized crowd of authors

UniProtKB/Swiss-Prot is a manually annotated protein knowledgebase. This involves not only sequence curation, but also a critical review of the scientific literature. All references used to create an entry are always cited, whether they are complex publications or simple submissions. Currently, there are more than 1'600 journals referenced in UniProtKB/Swiss-Prot, and more than 210'000 distinct authors.

Bringing all these authors together for a meeting would result in a gathering of a size similar to that of Geneva, the city where UniProtKB/Swiss-Prot was created and is still based... A small Swiss city with an international vocation, a little like UniProtKB/Swiss-Prot, which began as a local project of a graduate student with quite an ambitious aim: providing the scientific community throughout the world with a central hub for sharing biological knowledge. To achieve this goal, the Swiss-Prot group very quickly developed a strong and fruitful collaboration with the European Bioinformatics Institute (EBI) and more recently with the Protein Information Resource (PIR) of the Georgetown University Medical Center in the USA (http://www.uniprot.org/). The Swiss-Prot group is not only international but also interdisciplinary. It brings together various backgrounds: biologists, biochemists, programmers, mathematicians, wet lab experts, students, etc. This teamwork is what ensures the quality of the knowledgebase.


UniProtKB News

Cross-references to LinkHub

Cross-references have been added to LinkHub, a database providing links to different genomics and protein resources. LinkHub is available at http://hub.gersteinlab.org/.

The format of the explicit links in the flat file is:

Resource abbreviation LinkHub
Resource identifier UniProtKB primary AC.
Example
O00623:
DR   LinkHub; O00623; -.

UniProt release 6.3

Published October 25, 2005

Headlines

Over 100'000 prokaryotic entries in UniProtKB/Swiss-Prot

We have reached more than 100'000 prokaryotic (bacterial and archaeal) entries in UniProtKB/Swiss-Prot. Of these, just over 10'000 are archaeal entries. To deal with the enormous increase in the amount of available prokaryotic protein sequences, the Swiss-Prot group started the HAMAP project, which aims to automatically annotate, at high throughput but with no decrease in quality, proteins from complete microbial genomes that belong to a family (well-defined or uncharacterized).

The HAMAP annotation system is based on manually curated family rules, which contain the information, derived from searches of the available literature, that can be safely propagated to all members of the family. Profiles are generated from an alignment of seed members; these profiles, in turn, are used to scan all available sequences in Swiss-Prot and TrEMBL to identify family members. The goal of the system is not to search for distant sequence similarities, but to annotate only the proteins that can be conservatively assigned to a HAMAP family. Cases and conditions are included in most rules so that warnings are generated if some conserved features, such as active sites or metal-binding amino acids, are not present in a given protein sequence, or if there are other problems, such as size or taxonomic range. All entries that contain warnings are subjected to manual verification. In fact, since the implementation of the system and for the time being, ALL the proteins that have been annotated using the HAMAP family rules have been manually verified to check the reliability of the HAMAP annotation module.

More than 1'200 family rules are available, and almost 70'000 prokaryotic entries belong to one (or more) HAMAP families. From the HAMAP website it is possible to scan protein sequences for matches against HAMAP families, and it is also possible to submit a whole proteome, by confidential ftp, to be scanned against the collection of HAMAP family rules.


UniProtKB News

Changes concerning keywords

New keywords:

Deleted keywords:

  • Dehydrin

UniProt release 6.2

Published October 11, 2005

Headlines

Albumin: the most popular entry in UniProtKB/Swiss-Prot

With 1'997 clicks in September 2005 from 702 different sites, albumin can be considered the most popular protein in UniProtKB/Swiss-Prot. Albumin is also the most abundant plasma protein, at 35-50 g/l (75% of protein molecules in plasma), a concentration higher by a factor of 10^10 than that of cytokines such as interleukin-6. This broad range of protein concentrations in plasma makes the identification of low-abundance proteins quite a difficult task, comparable to finding an individual human being by searching through the population of the entire world. Albumin is a multifunctional protein with ligand-binding and transport properties, antioxidant functions and enzymatic activities. Physiologically, it is responsible for maintaining colloid osmotic pressure and may influence microvascular integrity and aspects of the inflammatory pathway, including neutrophil adhesion and the activity of cell signaling moieties (for a review, see PubMed: 15915465).

In our "Hit Parade", albumin is followed by p53 (P04637) (1'862 clicks from 430 different sites), the EGF receptor (P00533) (1'414 clicks from 316 different sites) and insulin (P01308) (1'162 clicks from 361 different sites).

We regularly analyze our server access logs in order to determine annotation priorities. In particular, the most frequently requested entries from UniProtKB/TrEMBL are queued for manual annotation and "promotion" into UniProtKB/Swiss-Prot.


UniProtKB News

Definition of a further molecule type in the DR EMBL line

In the cross-references to the EMBL nucleotide sequence database the term pre-RNA has been added as a valid value for the optional information field 3 (MOLECULE_TYPE).

The format of the DR EMBL line is:

DR   EMBL; ACCESSION_NUMBER; PROTEIN_ID; STATUS_IDENTIFIER; MOLECULE_TYPE.

The controlled vocabulary of the MOLECULE_TYPE now consists of:

  • Genomic_DNA
  • Genomic_RNA
  • pre-RNA
  • mRNA
  • Unassigned_DNA
  • Unassigned_RNA
  • Other_DNA
  • Other_RNA
  • -
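
A consumer of the flat file could check the fifth field of a DR EMBL line against this vocabulary. The sketch below assumes the line layout shown above; the function name is illustrative, and the accession number and protein identifier in the usage note are made-up placeholders, not real records.

```python
# Controlled vocabulary of MOLECULE_TYPE, as listed above.
MOLECULE_TYPES = frozenset({
    "Genomic_DNA", "Genomic_RNA", "pre-RNA", "mRNA",
    "Unassigned_DNA", "Unassigned_RNA", "Other_DNA", "Other_RNA", "-",
})


def embl_molecule_type(dr_line):
    """Return the MOLECULE_TYPE of a DR EMBL line, or raise ValueError."""
    body = dr_line[2:].strip()           # drop the 'DR' line code
    if body.endswith("."):
        body = body[:-1]                 # drop the terminating period
    fields = [field.strip() for field in body.split(";")]
    if len(fields) != 5 or fields[0] != "EMBL":
        raise ValueError("not a five-field DR EMBL line: %r" % dr_line)
    molecule_type = fields[4]
    if molecule_type not in MOLECULE_TYPES:
        raise ValueError("unknown molecule type: %r" % molecule_type)
    return molecule_type
```

With the placeholder line `DR   EMBL; X00000; CAA00000.1; -; pre-RNA.` the function returns `pre-RNA`; a value outside the list raises `ValueError`.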

Cross-references to MAIZE-2DPAGE

Cross-references to the Maize-2DPAGE have been removed.

UniProt release 6.1

Published September 27, 2005

Headlines

10'000 mouse entries

The threshold of 10'000 mouse entries is about to be reached in UniProtKB/Swiss-Prot. 35% of them contain information about isoforms generated by alternative splicing. When these are taken into account, the total number of mouse sequences in UniProtKB/Swiss-Prot is close to 13'000. The second most represented rodent in UniProtKB/Swiss-Prot is rat, with about 4'600 entries.

The majority of the mouse entries contain sequence data generated by one of the two high-throughput cDNA sequencing projects: the NIH Mammalian Gene Collection (MGC) and the RIKEN (Rikagaku Kenkyusho, Institute of Physical and Chemical Research) mouse full-length cDNA encyclopedia. 97% of the entries also include cross-references to the Mouse Genome Informatics (MGI) database which provides access to additional data on the genetics, genomics and biology of the laboratory mouse.

Our aim is to annotate all mouse proteins along with the orthologous sequences in other mammals, especially Homo sapiens, in order to provide a complete and comprehensive view of mammalian proteomes.


UniProtKB News

Annotation changes concerning the feature key METAL

The feature key METAL describes the binding of metal ions. Previously, more than one metal ion could be listed in the description field when several ions bind to the same sequence residue. We have now restricted the annotation to a single metal ion per FT METAL line. Example:

FT   METAL        61     61       Copper and zinc.

became:

FT   METAL        61     61       Copper.
FT   METAL        61     61       Zinc.
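
For files produced before this change, the same split could be approximated as follows. This is a rough sketch under the assumption that ions in the old description were joined by commas or the word "and", as in the example; the function name is hypothetical and more complex descriptions would need manual review.

```python
import re

# 'FT   METAL  <from>  <to>  <description>.' with flexible spacing.
METAL_LINE = re.compile(r"(FT\s+METAL\s+\d+\s+\d+\s+)(.+)\.\s*$")


def split_metal_line(line):
    """Split one multi-ion FT METAL line into one line per ion."""
    match = METAL_LINE.match(line)
    if match is None:
        raise ValueError("not an FT METAL line: %r" % line)
    prefix, description = match.groups()
    ions = [ion for ion in re.split(r",\s*|\s+and\s+", description) if ion]
    # Capitalize only the first letter so the rest of the name is untouched.
    return [prefix + ion[0].upper() + ion[1:] + "." for ion in ions]
```

Applied to the "Copper and zinc." line above, this yields the two lines shown; a line that already names a single ion is returned unchanged as a one-element list.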

UniProt release 6.0

Published September 13, 2005

Headlines

UniProtKB/Swiss-Prot major release (48.0)

Release 48.0 of 13-Sep-2005 of UniProtKB/Swiss-Prot contains 194'317 sequence entries, comprising 70'391'852 amino acids abstracted from 133'723 references. 11'963 sequences have been added since release 47, the sequence data of 1'095 existing entries has been updated and the annotations of 93'692 entries have been revised. This represents an increase of 7%.

Many improvements were carried out in the last 4 months. In particular, we have expanded our system of feature identifiers (FTIds): Feature keys concerning protein processing (CHAIN, PEPTIDE, PROPEP) have been tagged by a new feature identifier with the prefix PRO. We also changed the format of the OG Chloroplast and Cyanelle lines, to be able to indicate more precisely the kind of plastid organelle.

UniProt Knowledgebase release 6.0 includes Swiss-Prot release 48.0 and TrEMBL release 31.0. For more information you can also read the Full statistics and release notes for the UniProt Knowledgebase, i.e. Swiss-Prot and TrEMBL.


UniProtKB News

Changes in the OG (OrGanelle) line

We changed the format of the OG Chloroplast and Cyanelle lines, to be able to indicate more precisely the kind of plastid organelle. So far we defined the following lines:

OG   Plastid.
OG   Plastid; Apicoplast.
OG   Plastid; Chloroplast.
OG   Plastid; Cyanelle.
OG   Plastid; Non-photosynthetic plastid.

The line "OG Plastid" is used when the type of plastid - from which the gene coding for a protein originates - is unknown. This will be the case for most TrEMBL entries.

The line "OG Plastid; Apicoplast" is used for plastid-type organelles from the apicomplexan parasites. These plastids are not photosynthetic, and encode a different suite of proteins than do photosynthetic organisms.

The line "OG Plastid; Chloroplast" is used for plastids from all organisms able to perform photosynthesis except the glaucophyte algae (see next).

The line "OG Plastid; Cyanelle" is used for plastids from the glaucophyte algae.

The line "OG Plastid; Non-photosynthetic plastid" is used for plastids derived from non-photosynthetic, but not apicomplexan, organisms. Examples of such organisms are the land plant Epifagus virginiana, the chlorophyte alga Prototheca wickerhamii and the euglenoid Astasia longa, none of which encode the genes necessary for photosynthesis on their plastid genome.
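
To show how the new two-level value might be consumed, here is a small Python sketch. It only covers the five plastid lines defined above; the function name and the "unknown" return value for a bare "OG   Plastid." line are illustrative choices, not part of the format.

```python
PLASTID_TYPES = frozenset({
    "Apicoplast", "Chloroplast", "Cyanelle", "Non-photosynthetic plastid",
})


def plastid_type(og_line):
    """Return the plastid type of an OG line.

    Returns None for non-plastid organelles and "unknown" when the
    line only says "Plastid" (type of plastid not known).
    """
    body = og_line[2:].strip()           # drop the 'OG' line code
    if body.endswith("."):
        body = body[:-1]
    parts = [part.strip() for part in body.split(";")]
    if parts[0] != "Plastid":
        return None
    if len(parts) == 1:
        return "unknown"
    if parts[1] not in PLASTID_TYPES:
        raise ValueError("unexpected plastid type: %r" % parts[1])
    return parts[1]
```

For instance, `plastid_type("OG   Plastid; Cyanelle.")` returns `"Cyanelle"`, while `plastid_type("OG   Plastid.")` returns `"unknown"`.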

Cross-references to TAIR

Cross-references have been added to TAIR, The Arabidopsis Information Resource, which is a model organism database providing a centralized, curated gateway to Arabidopsis biology. TAIR is available at http://arabidopsis.org. Implicit links to this database were previously provided in the NiceProt view of relevant Swiss-Prot entries on ExPASy.

The format of the explicit links in the flat file is:

Resource abbreviation TAIR
Resource identifier TAIR unique locus identifier.
Example
P33487:
DR   TAIR; At4g02980; -.

Introduction of a new feature identifier

The system of feature identifiers has been expanded. All feature keys concerning protein processing (CHAIN, PEPTIDE, PROPEP) have been tagged with the new feature identifier with the prefix PRO. We now have 4 types of feature identifiers:

  • prefix CAR relevant to the feature key CARBOHYD (e.g. /FTId=CAR_123456.)
  • prefix VAR relevant to the feature key VARIANT (e.g. /FTId=VAR_123456.)
  • prefix VSP relevant to the feature key VARSPLIC (e.g. /FTId=VSP_123456.)
  • new prefix PRO relevant to CHAIN, PEPTIDE, PROPEP (e.g. /FTId=PRO_1234567890.)

Examples:

Q9W568:
FT   CHAIN        23    611       Halfway protein.
FT                                /FTId=PRO_0000021413.
P15515:
FT   PEPTIDE      20     57       Histatin 1.
FT                                /FTId=PRO_0000021416.
Q7XAD0:
FT   PROPEP       25     48
FT                                /FTId=PRO_0000021449.
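
Feature identifiers of the four types can be pulled out of FT lines with a regular expression. In the sketch below the digit counts (6 for CAR, VAR and VSP, 10 for PRO) are an assumption read off the examples above rather than a stated rule, and the function name is illustrative.

```python
import re

# Assumed shapes: CAR_/VAR_/VSP_ + 6 digits, PRO_ + 10 digits.
FTID_PATTERN = re.compile(r"/FTId=((?:CAR|VAR|VSP)_\d{6}|PRO_\d{10})\.")

FT_LINES = [
    "FT   CHAIN        23    611       Halfway protein.",
    "FT                                /FTId=PRO_0000021413.",
    "FT   PEPTIDE      20     57       Histatin 1.",
    "FT                                /FTId=PRO_0000021416.",
    "FT   PROPEP       25     48",
    "FT                                /FTId=PRO_0000021449.",
]


def extract_ftids(lines):
    """Return every feature identifier found in the given FT lines."""
    return [
        match.group(1)
        for line in lines
        for match in FTID_PATTERN.finditer(line)
    ]
```

Running `extract_ftids(FT_LINES)` on the three examples above returns the three PRO identifiers in order.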

Changes concerning keywords

New keywords:

Deleted keywords:

  • Bombesin family
  • Bradykinin
  • Embryo
  • Erythrocyte
  • Galectin
  • Glucagon family
  • Insulin family
  • Lipocalin
  • Macrophage
  • Pancreas
  • Pentaxin
  • Phytochrome
  • Pituitary
  • Placenta
  • Platelet
  • Pyrokinin
  • Rubredoxin
  • Selectin
  • Serpin
  • Tachykinin
  • Thionin
  • Thymus
  • Whey

UniProt release 5.8

Published August 30, 2005

Headlines

Escherichia coli inner membrane proteome

We have integrated into UniProtKB/Swiss-Prot the results obtained by Gunnar von Heijne's group on the inner membrane proteome of Escherichia coli (see Daley D.O. et al., Science 308:1321-3, 2005; PubMed ID: 15919996).

von Heijne's group has applied the PhoA/GFP fusion approach to derive topology models for almost the entire E. coli inner membrane proteome. More than 500 entries concerning membrane proteins had their subcellular location updated and the topology added.

Integral membrane proteins account for the coding capacity of 20 to 30% of the genes in typical organisms and are critically important for many cellular functions. However, owing to their hydrophobic and amphiphilic nature, membrane proteins are difficult to study, and they account for less than 1% of the known high-resolution protein structures. Overexpression, purification, biochemical analysis, and structure determination are all far more challenging than for soluble proteins, and membrane proteins have rarely been considered in proteomics or structural genomics contexts to date.


Changes concerning keywords

New keywords:

UniProt release 5.7

Published August 16, 2005

Headlines

Integration of data from an enzyme genomics project

We have integrated into UniProtKB/Swiss-Prot the results of Aled Edwards' and Alexander Yakunin's group, which were summarized in FEMS Microbiology Reviews 29:263-279 (2005) (PubMed: 15808744). Using general enzymatic assays to screen individually purified proteins for enzymatic activity, they have identified activity for 36 previously uncharacterized proteins of E. coli, T. maritima, T. acidophilum, M. jannaschii and P. aeruginosa.

The sequencing of complete genomes produces an increasing number of CDSs which are annotated as "hypothetical proteins". Approximately 40% of the protein sequences deposited in databases do not have any characterized function. This hinders progress and research in many areas ranging from genome annotation to metabolic engineering. It is therefore of fundamental importance to carry on with experimental verification of the function of these proteins and, equally important, to integrate the results into the database. One of the major priorities in Swiss-Prot is to be up to date with respect to this kind of new finding, and we strive to integrate new characterizations as quickly as possible. We urge all groups obtaining these results to submit update requests to us. We will treat these requests with the highest priority.

UniProt release 5.6

Published August 2, 2005

Headlines

The dramatic outbreak of Severe Acute Respiratory Syndrome virus may be due to two mutations in the virus spike protein

The SARS coronavirus is a new human pathogen that emerged in Asia in 2002-2003. The animal reservoir of the virus is presumably the palm civet, whose meat is a delicacy in Southern China.

The virus induces acute respiratory distress in humans and is deadly in 10% of all cases. It enters pulmonary cells through binding of the viral spike protein (human isolate Tor2 and palm civet isolate SZ3: P59594) to angiotensin-converting enzyme 2 (ACE2) (palm civet: Q56NL1; human: Q9BYF1). These proteins have recently been annotated or updated in UniProtKB/Swiss-Prot.

It has recently been shown that two amino acid mutations in the palm civet SARS spike protein, Lys-479 and Ser-487, are sufficient for the virus to acquire the ability to bind human ACE2 efficiently (see EMBO J. 24:1634-1643(2005); PubMed: 15791205).

The severity of the 2002-2003 epidemic was presumably due to those two amino acid mutations, which gave an animal virus the opportunity to cause a major infection in the human species.

Changes concerning keywords

Deleted keyword:

  • Trans-acting factor

UniProt release 5.5

Published July 19, 2005

Headlines

Orangutan, the most represented non-human primate in UniProtKB/Swiss-Prot

With more than 500 entries, Pongo pygmaeus (Orangutan) is now the most represented non-human primate. Most of these entries are built around sequence data generated by a cDNA sequencing project launched by the German cDNA Consortium in 2003. Almost 4'000 entries submitted by the consortium are still in UniProtKB/TrEMBL. It should be noted however that some of these sequences are fragments and are thus not a priority for UniProtKB/Swiss-Prot annotation. We plan to manually annotate as many Orangutan entries as possible, starting with sequences orthologous to human ones already described in Swiss-Prot.

There are currently almost 15'500 primate entries in Swiss-Prot, 82% of which describe human proteins.


UniProtKB News

Obsolete file uniprot_trembl_varsplic.fasta.gz

All UniProtKB/TrEMBL entries with annotated alternative splicing events (KW Alternative splicing) have been moved to UniProtKB/Swiss-Prot. Thus the file uniprot_trembl_varsplic.fasta.gz became obsolete and has been removed from the ftp site.

Please note that UniProtKB/TrEMBL still includes splice isoforms, but each in an individual entry and not merged into one single entry.

Cross-references to Genew

The format of the UniProtKB cross-reference to the Human Gene Nomenclature Database Genew has changed: The term Genew has been replaced by HGNC, which stands for HUGO Gene Nomenclature Committee.

Example:

DR   Genew; HGNC:12849; YWHAB.

has changed to

DR   HGNC; HGNC:12849; YWHAB.
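
Scripts that still receive the old form could normalize it with a one-line substitution. The following is a minimal sketch with an illustrative function name; it only rewrites the database abbreviation and leaves the identifiers untouched.

```python
import re


def rename_genew(line):
    """Rewrite an old 'DR Genew' line to the new 'DR HGNC' form."""
    # Keep the original spacing after the DR line code.
    return re.sub(r"^(DR\s+)Genew;", r"\1HGNC;", line)
```

Lines that do not start with a Genew cross-reference are returned unchanged.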

Changes concerning keywords

New keyword:

Deleted keyword:

  • Calcium-binding

Changes concerning the controlled vocabulary for PTMs

New terms for the feature key 'CROSSLNK':

  • Cyclopeptide (Arg-Cys) (interchain with C-...)
  • Cyclopeptide (Cys-Arg) (interchain with R-...)

New term for the feature key 'MOD_RES':

  • N6-carboxylysine

UniProt release 5.4

Published July 5, 2005

UniProtKB News

Modified wording of reldate.txt in UniProt Knowledgebase ftp directories

The official names for the manually and automatically annotated sections of the UniProt Knowledgebase are UniProtKB/Swiss-Prot and UniProtKB/TrEMBL.

To reflect this, we have changed the wording of the reldate.txt file in UniProt Knowledgebase ftp directories:

from

UniProt Release 5.4 consists of:
Swiss-Prot Release 47.4 of 05-Jul-2005
TrEMBL Release 30.4 of 05-Jul-2005

to

UniProt Knowledgebase Release 5.4 consists of:
UniProtKB/Swiss-Prot Release 47.4 of 05-Jul-2005
UniProtKB/TrEMBL Release 30.4 of 05-Jul-2005

Format change in the dbxref.txt document file

The dbxref.txt file lists the names, abbreviations and URLs of all databases cross-referenced in the UniProt Knowledgebase. We have added a new field, "Note", which is optional. This field will be used, among other things, to list obsolete abbreviations for the cross-referenced databases.

Example:

Abbrev: MGI
Name  : Mouse genome database (MGD) from Mouse Genome Informatics (MGI)
LinkTp: Explicit
Server: http://www.informatics.jax.org/
Db_URL: www.informatics.jax.org/searches/accession_report.cgi?id=%s
Note  : Obsolete abbreviation: MGD

Multiple comment line (CC) topics COFACTOR

From now on, the CC line topic COFACTOR can occur more than once per entry. When an enzyme can bind several cofactors, each of them is indicated in a separate topic.

Example:

CC   -!- COFACTOR: Binds 1 2Fe-2S cluster per subunit (By similarity).
CC   -!- COFACTOR: Binds 1 Fe(2+) ion per subunit (By similarity).
CC   -!- COFACTOR: Binds 5 heme groups covalently per monomer.
CC   -!- COFACTOR: Binds 1 calcium ion per monomer.
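
Code that previously assumed a single COFACTOR topic per entry needs to collect a list instead. The sketch below is an assumption-laden illustration: the function name is hypothetical, and it only handles topics that fit on one CC line, as in the examples above, without joining continuation lines.

```python
import re

COFACTOR_LINE = re.compile(r"CC\s+-!-\s+COFACTOR:\s*(.*)")


def cofactor_topics(cc_lines):
    """Return the text of every single-line COFACTOR topic, in order."""
    topics = []
    for line in cc_lines:
        match = COFACTOR_LINE.match(line)
        if match:
            topics.append(match.group(1).strip())
    return topics
```

Given the last two example lines above, the function returns both cofactor descriptions rather than only the first.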

Changes concerning keywords

New keyword:

Deleted keywords:

  • Testosterone
  • Submandibular gland

Changes concerning the controlled vocabulary for PTMs

New term for the feature key 'CROSSLNK':

  • 2-tetrahydro-2-pyridyl-5-imidazolinone (Lys-Gly)

UniProt release 5.3

Published June 21, 2005

Headlines

Hydrogenosomal genome encoded proteins

It was recently found (see Nature 434:74-79(2005); PubMed=15744302) that some anaerobic ciliates such as Nyctotherus ovalis (which thrives in the hindgut of cockroaches!) have retained a rudimentary hydrogenosomal genome. Hydrogenosomes are double-membraned subcellular structures that generate hydrogen while making the energy-storage compound ATP. They are found in certain eukaryotic unicellular organisms that inhabit oxygen-deficient environments.

The hydrogenosomal genome of N. ovalis is only 14 kb long and seems to encode 11 different proteins, including 5 subunits of the NADH dehydrogenase (complex I) and 2 ribosomal proteins. The genome and the encoded proteins are highly similar to their mitochondrial genome-encoded counterparts, thus establishing an evolutionary link between mitochondria and hydrogenosomes.

We are in the process of annotating the proteins from the N. ovalis hydrogenosomal genome.

This is linked with the introduction of "Hydrogenosome" in the list of valid values in the OG line.


UniProtKB News

New OG (OrGanelle) line value: Hydrogenosome

We have added "Hydrogenosome" to the list of valid values in the OG line.

Example Q5DUX5: OG Hydrogenosome.

Changes concerning keywords

New keywords:

Changes concerning the controlled vocabulary for PTMs

New terms for the feature key 'MOD_RES':

  • (E)-2,3-didehydrotyrosine
  • 2,3-didehydrotyrosine
  • Aspartyl aldehyde
  • N6-formyllysine
  • Phosphoarginine
  • Serine microcin E492 siderophore ester

New terms for the feature key 'LIPID':

  • O-decanoyl serine
  • O-decanoyl threonine
  • O-octanoyl threonine

UniProt release 5.2

Published June 7, 2005

Headlines

Uncleaved N-terminal translocation signals

The translocation of a protein to another subcellular compartment requires the existence of at least one translocation signal specific to the relevant trafficking mechanism across the membrane. Proteins destined for secretion, incorporation into the plasma membrane, chloroplast, cyanelle, microbodies or the mitochondrial matrix usually possess an N-terminal transfer signal, which is cleaved during the transfer process. Recently, some proteins have been found to bypass this cleavage step. In some cases, the uncleaved signal peptide even confers important functional properties on the protein (P27169).

In the current Swiss-Prot release, the annotation of 23 protein entries indicates an uncleaved signal sequence (e.g. O95445), and the transit peptide of the mitochondrial 3-ketoacyl-CoA thiolase was reported not to be removed (P42765).

UniProt release 5.1

Published May 24, 2005

Headlines

Average of more than 10 cross-references per UniProtKB/Swiss-Prot entry

As described in the UniProt Knowledgebase user manual, integration with other data resources is one of the priorities of UniProtKB/Swiss-Prot.

This is reflected in the high number of cross-references: UniProtKB/Swiss-Prot currently contains more than 1.8 million explicit cross-references, which translates to just over 10 links per Swiss-Prot entry. 68 external databases are referenced in this manner, in addition to the 32 databases to which we link via implicit links, created on the fly by the ExPASy server. All these resources are listed in the List of databases cross-referenced in Swiss-Prot.

The most represented types of cross-references are those to the family and domain classification databases, i.e. InterPro and its member databases, as well as, obviously, those to the nucleotide sequence database (DR EMBL), our main source for sequence data.

UniProt release 5.0

Published May 10, 2005

Headlines

UniProtKB/Swiss-Prot major release (47.0)

Release 47.0 of UniProtKB/Swiss-Prot contains 181'571 sequence entries, comprising 65'742'349 amino acids abstracted from 128'438 references. 11'531 sequences have been added since release 46, the sequence data of 841 existing entries has been updated and the annotations of 166'572 entries have been revised. This represents an increase of 6%.

Many improvements were carried out in the last 3 months. In particular, we have introduced 5 new feature keys in order to better describe different types of regions in a protein sequence, and we added an additional qualifier to our cross-references to nucleotide sequence databases, the molecule type.

All the recent changes to the Swiss-Prot format are described in detail in the continuously updated document:

UniProt Knowledgebase release 5.0 includes Swiss-Prot release 47.0 and TrEMBL release 30.0.

Full statistics and release notes

UniProtKB News

Format change in the DR line

The DR (Database cross-Reference) lines are used as pointers to information in external data resources that is related to UniProtKB entries. Until now, the format of a DR line was:

DR   DATABASE_IDENTIFIER; PRIMARY_IDENTIFIER; SECONDARY_IDENTIFIER[; TERTIARY_IDENTIFIER].

We have introduced a fourth identifier, changing the DR line format to:

DR   DATABASE_IDENTIFIER; PRIMARY_IDENTIFIER; SECONDARY_IDENTIFIER[; TERTIARY_IDENTIFIER][; QUATERNARY_IDENTIFIER].

The cross-references to EMBL are affected by this modification (see below).

Changes in the cross-references to EMBL

The biological source of the molecule has been added as a quaternary identifier to the cross-references (DR lines) to the EMBL database.

Former format:

DR   EMBL; ACCESSION_NUMBER; PROTEIN_ID; STATUS_IDENTIFIER.

New format:

DR   EMBL; ACCESSION_NUMBER; PROTEIN_ID; STATUS_IDENTIFIER; MOLECULE_TYPE.

The molecule type is a controlled vocabulary and currently includes:

  • Genomic_DNA
  • Genomic_RNA
  • mRNA
  • Unassigned_DNA
  • Unassigned_RNA
  • Other_DNA
  • Other_RNA
  • -

Examples:

DR   EMBL; M68939; AAA26107.1; -; Genomic_DNA.
DR   EMBL; U56386; AAB72034.1; -; mRNA.
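
As an illustration of the extended DR line format, the following Python sketch splits a DR line into its database name and up to four identifiers. The function name parse_dr_line and the returned dictionary keys are hypothetical and are not part of any UniProt tool; the sketch assumes that fields are separated by semicolons and that the line ends with a single period.

```python
def parse_dr_line(line):
    """Split a DR line into database name and up to four identifiers.

    Hypothetical helper: assumes ';' separates fields and that the
    line is terminated by one period.
    """
    if not line.startswith("DR   "):
        raise ValueError("not a DR line: %r" % line)
    body = line[5:].strip()
    if body.endswith("."):
        body = body[:-1]  # drop the terminating period only
    fields = [field.strip() for field in body.split(";")]
    if not 3 <= len(fields) <= 5:
        raise ValueError("unexpected number of fields: %r" % line)
    names = ["database", "primary", "secondary", "tertiary", "quaternary"]
    record = dict.fromkeys(names)      # identifiers that are absent stay None
    record.update(zip(names, fields))
    return record


new_style = parse_dr_line("DR   EMBL; M68939; AAA26107.1; -; Genomic_DNA.")
print(new_style["quaternary"])
```

Lines in the former format are accepted as well; their quaternary identifier is simply left as None.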

Cross-references to PANTHER

Cross-references have been added to PANTHER, which stands for Protein ANalysis THrough Evolutionary Relationships, a classification system that was designed to classify proteins (and their genes) in order to facilitate high-throughput analysis. Proteins have been classified according to families and subfamilies, molecular functions, biological processes and pathways. PANTHER is available at https://panther.appliedbiosystems.com/.

The format of the explicit links in the flat file is:

Resource abbreviation: PANTHER
Resource identifier: PANTHER's unique identifier for a protein family or sub-family.
Optional information 1: PANTHER's entry name for a protein family or sub-family.
Optional information 2: Number of domains found, which is generally 1, rarely 2 for the fusion of identical domains/proteins.
Example (O59826):
DR   PANTHER; PTHR11732:SF69; KCNAB_channel; 1.

New feature (FT) keys and redefinition of existing FT keys

The feature keys DOMAIN and SITE were used to describe distinct types of regions in a protein sequence and we found this situation unsatisfactory. We therefore redefined these two feature keys and introduced 5 new ones.

Redefinition of the feature keys DOMAIN and SITE:

  • DOMAIN - Extent of a domain, which is defined as a specific combination of secondary structures organized into a characteristic three-dimensional structure or fold. Example:
    FT   DOMAIN       37     94       Ig-like.
  • SITE - Any interesting single amino-acid site on the sequence, that is not defined by another feature key. It can also apply to an amino acid bond which is represented by the positions of the two flanking amino acids. Example:
    FT   SITE        176    177       Cleavage (by trypsin) (Probable).

Description of the 5 new feature keys:

  • COILED - Extent of a coiled-coil region. Example:
    FT   COILED      100    165       Potential.
  • COMPBIAS - Extent of a compositionally biased region. Example:
    FT   COMPBIAS     43     57       Pro/Thr-rich.
  • MOTIF - Short (<=20 amino acids) sequence motif of biological interest. Example:
    FT   MOTIF       153    154       Di-leucine motif.
  • REGION - Extent of a region of interest in the sequence. Example:
    FT   REGION       23     54       Binds to CD4.
  • TOPO_DOM - Topological domain. Example:
    FT   TOPO_DOM    139    182       Cytoplasmic (Potential).

The introduction of these new feature keys makes it possible to establish a clear sorting order for feature tables. The following order is used:

1. Molecule processing
     * INIT_MET, SIGNAL, PROPEP, TRANSIT, CHAIN, PEPTIDE
2. Regions
     * TOPO_DOM, TRANSMEM
     * DOMAIN, REPEAT
     * CA_BIND, ZN_FING, DNA_BIND, NP_BIND
     * REGION
     * COILED
     * MOTIF
     * COMPBIAS
3. Sites
     * ACT_SITE
     * METAL
     * BINDING
     * SITE
4. Amino acid modifications (pre and PTM)
     * SE_CYS
     * MOD_RES
     * LIPID
     * CARBOHYD
     * DISULFID
     * CROSSLNK
5. Natural variations
     * VARSPLIC
     * VARIANT
6. Experimental info
     * MUTAGEN
     * UNSURE
     * CONFLICT
     * NON_CONS
     * NON_TER
7. Secondary structure
     * HELIX, TURN, STRAND

Keys of equal priority (listed on one line above) are ordered according to sequence positions.
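
The sorting order above can be sketched in Python as follows. This is an illustrative sketch, not the code used to produce Swiss-Prot: the names FEATURE_KEY_GROUPS, PRIORITY and sort_features are hypothetical, features are represented as simple (key, start, end) tuples, and breaking ties on the end position is an assumption (the text only says "sequence positions").

```python
# One inner list per line of the sorting order; keys in the same inner
# list have equal priority.
FEATURE_KEY_GROUPS = [
    ["INIT_MET", "SIGNAL", "PROPEP", "TRANSIT", "CHAIN", "PEPTIDE"],
    ["TOPO_DOM", "TRANSMEM"],
    ["DOMAIN", "REPEAT"],
    ["CA_BIND", "ZN_FING", "DNA_BIND", "NP_BIND"],
    ["REGION"], ["COILED"], ["MOTIF"], ["COMPBIAS"],
    ["ACT_SITE"], ["METAL"], ["BINDING"], ["SITE"],
    ["SE_CYS"], ["MOD_RES"], ["LIPID"], ["CARBOHYD"], ["DISULFID"], ["CROSSLNK"],
    ["VARSPLIC"], ["VARIANT"],
    ["MUTAGEN"], ["UNSURE"], ["CONFLICT"], ["NON_CONS"], ["NON_TER"],
    ["HELIX", "TURN", "STRAND"],
]

# Map each feature key to the rank of its priority group.
PRIORITY = {
    key: rank
    for rank, group in enumerate(FEATURE_KEY_GROUPS)
    for key in group
}


def sort_features(features):
    """Sort (key, start, end) tuples by key priority, then by position.

    An unknown feature key raises KeyError.
    """
    return sorted(features, key=lambda f: (PRIORITY[f[0]], f[1], f[2]))


# Toy input reusing the positions of the example FT lines above.
table = [("SITE", 176, 177), ("DOMAIN", 37, 94), ("TOPO_DOM", 139, 182)]
for key, start, end in sort_features(table):
    print("FT   %-8s %6d %6d" % (key, start, end))
```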

UniProt release 4.6

Published April 26, 2005

Headlines

1 million cysteine residues in UniProtKB/Swiss-Prot

The total number of cysteine residues in Swiss-Prot has reached the 1 million mark. While there is nothing special about this number, we thought it was interesting in the context of the natural bias in the amino acid composition of proteins. There is more than an 8-fold difference between the frequency of the rarest amino acid (tryptophan at 1.15%) and that of the most frequent one (leucine at 9.64%). There are a number of reasons for this compositional bias: one is the degeneracy of the genetic code (which allows from 1 to 6 different triplets to code for a specific amino acid), and another is the prevalence of hydrophobic aliphatic residues such as leucine or isoleucine in transmembrane domains and in signal sequences.

UniProtKB News

New Swiss-Prot document: pathway.txt

The new Swiss-Prot document pathway.txt includes an index of CC PATHWAY lines. For each step of an annotated pathway, a list of Swiss-Prot entries is given that are annotated to participate in that pathway.

Change in cross-references MGI (former MGD)

Mouse Genome Informatics have asked us to use the acronym MGI in our cross-references to the Mouse Genome Database, which we used to refer to as "MGD". We changed the database name in the relevant cross-references (DR lines) accordingly.

Example:

AC   P07724;
DR   MGI; MGI:87991; Alb1.

The Index of MGD entries referenced in Swiss-Prot (mgdtosp.txt) keeps its name, and so does the "special selections file" (mgd.seq.gz) containing all entries with "DR MGI" lines.

Changes concerning keywords

New keywords:

Modified keywords:

Deleted keyword:

  • Hypothalamus

UniProt release 4.5

Published April 12, 2005

UniProtKB News

Cross-references to SMR

Cross-references have been added to the SWISS-MODEL Repository, which is a database of annotated three-dimensional comparative protein structure models generated by the fully automated homology-modelling pipeline SWISS-MODEL. The repository is developed at the Biozentrum Basel within the SIB Swiss Institute of Bioinformatics and is available at http://swissmodel.expasy.org/repository.

The format of the explicit links in the flat file is:

Resource abbreviation: SMR
Resource identifier: SWISS-MODEL's unique identifier for a protein, which is identical to the UniProtKB primary accession number of that protein.
Optional information 1: Range(s) covered by the structural model.
Example (P11416):
DR   SMR; P11416; 87-161, 182-416.
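
A minimal Python sketch of how the range field of such a cross-reference might be interpreted. The function names are hypothetical, and the sketch assumes that ranges are comma-separated, 1-based and inclusive at both ends, which the format description above does not state explicitly.

```python
def parse_smr_ranges(text):
    """Turn a range field such as '87-161, 182-416.' into (start, end) pairs."""
    ranges = []
    for part in text.rstrip(".").split(","):
        start, end = part.strip().split("-")
        ranges.append((int(start), int(end)))
    return ranges


def covered_residues(ranges):
    """Count residues covered, assuming inclusive, non-overlapping ranges."""
    return sum(end - start + 1 for start, end in ranges)


ranges = parse_smr_ranges("87-161, 182-416.")
print(ranges, covered_residues(ranges))
```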

UniProt release 4.4

Published March 29, 2005

Headlines

Adding the keyword 'Complete proteome' to fungal entries

The keyword 'Complete proteome' is added to UniProtKB entries which originate from an organism whose genome has been completely sequenced. Until recently this keyword was only used for proteins from complete bacterial or archaeal genomes. We want to gradually increase the scope of this keyword to other groups of species. As a first step, we have now added this keyword to entries originating from 8 complete fungal genomes, namely: Ashbya gossypii, Candida glabrata, Debaryomyces hansenii, Encephalitozoon cuniculi, Kluyveromyces lactis, Saccharomyces cerevisiae, Schizosaccharomyces pombe and Yarrowia lipolytica.

The presence of this keyword makes it easy to retrieve a complete non-redundant set of proteins from a specified genome across the Swiss-Prot and TrEMBL sections of the UniProt Knowledgebase.

UniProtKB News

New Swiss-Prot document: humsavar.txt

The new Swiss-Prot document humsavar.txt includes an index of sequence variation in human proteins. For each variant annotated in the feature table (FT) of the Swiss-Prot entry of a human protein, the following information is indicated:

  • gene name
  • Swiss-Prot entry name and accession number
  • feature identifier (FTId)
  • protein sequence position
  • amino acid substitution
  • type of variant (polymorphism or disease mutation; unclassified, if unknown)
  • disease name for disease variants

A text-only version of this index can be downloaded by ftp.

Changes concerning keywords

New keywords:

UniProt release 4.3

Published March 15, 2005

UniProtKB News

Changes concerning keywords

New keyword:

UniProt release 4.2

Published March 1, 2005

Headlines

More than 10'000 additional sequences encoded on splice variants in Swiss-Prot

Swiss-Prot is a non-redundant protein knowledgebase where the protein sequences from the same organism originating from the same gene are merged into one entry. When alternative products are produced by alternative splicing, the number of isoforms and their properties are indicated in the comment lines under the "Alternative products" topic, and the "Alternative splicing" keyword is added to the entry. The sequences of the alternative forms, if known, are described in the feature table under the key name "VARSPLIC". E.g. PRKN2_HUMAN (O60260).

More often than not, it is the longest isoform that is shown in a Swiss-Prot entry. The additional isoforms can be reconstructed from the annotation. We assign a unique identifier to each isoform (IsoId), and a unique feature identifier (FTId) to each "VARSPLIC" feature described in the feature table. Each IsoId comes with a list of FTIds which serve as "instructions" to be applied in order to reconstruct the isoform's sequence.
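
The idea of applying "VARSPLIC" features as instructions can be sketched in Python as follows. This is only an illustration under stated assumptions: each edit is represented as a hypothetical (start, end, replacement) tuple with 1-based inclusive positions on the displayed sequence, an empty replacement stands for a missing segment, edits do not overlap, and the toy sequence is invented. It does not parse real FT lines.

```python
def apply_varsplic(sequence, edits):
    """Rebuild an isoform from the displayed sequence and a list of edits.

    Edits are applied from the C-terminal end backwards, so the positions
    of the edits still to be applied remain valid.
    """
    result = sequence
    for start, end, replacement in sorted(edits, reverse=True):
        result = result[:start - 1] + replacement + result[end:]
    return result


displayed = "MKTAYIAKQR"            # invented 10-residue sequence
isoform_edits = [
    (3, 5, "GG"),                   # residues 3-5 replaced by GG
    (9, 10, ""),                    # residues 9-10 missing
]
print(apply_varsplic(displayed, isoform_edits))
```

Applying the edits in reverse order of position avoids having to shift coordinates after each replacement changes the sequence length.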

The number of these additional Swiss-Prot recreated splice variants has recently reached the 10'000 mark.

More than half are human isoforms. More than 80% are from mammals.

A fasta-formatted file containing all splice variants annotated in Swiss-Prot and TrEMBL can be downloaded for use with similarity search programs. Most sequence analysis and proteomic tools on ExPASy, e.g. BLAST or Aldente, have been adapted to take into account, in addition to all Swiss-Prot and TrEMBL entries, all other annotated splice isoforms.

UniProtKB News

Changes concerning keywords

New keyword:

UniProt release 4.1

Published February 15, 2005

Headlines

Massive number of changes to entry names

We recently allowed entry names to consist of up to 11 characters instead of 10. An entry name consists of two parts, a prefix which is a mnemonic code representing the protein name and a suffix which is a mnemonic species identification code. Example: RECA_BACSU is the entry name for the recA protein of Bacillus subtilis. The increase from 10 to 11 characters allows the protein name mnemonic to increase from 4 to 5 characters.

Thanks to this change, we are now able to assign more meaningful entry names to a significant number of entries. In the past month we have gone through almost all of the Swiss-Prot entries and have checked to see if they could benefit from an entry name update. As a consequence of this process we updated more than 35'000 entry names (about 20% of all the entries in Swiss-Prot). In about 33'000 cases we created ID prefixes consisting of 5 characters and in the rest of the cases we changed existing prefixes of 3 to 4 characters to more meaningful and consistent prefixes of the same length.

Owing to these massive changes in entry names, it is probable that the names of some protein entries that you are using have changed. We therefore remind our users that we provide a tool, the IDtracker, which makes it possible to trace the identifiers (ID) of protein entries. You can use this tool to enquire about the whereabouts of one or more entry names and to obtain the newly assigned names and primary accession numbers.

It can seem paradoxical that we insist on warning users that they should always use accession numbers when citing an entry, yet strive to provide meaningful entry names. The reason for this dichotomy is simple: if you want to refer to specific entries in a publication or in any other document, you need to ensure that your reference is stable, unique and unambiguous. Accession numbers provide such a mechanism: they are stable identifiers, and you can be sure that you will always be able to track down a specific entry if you use an accession number. But when you want to access an entry from a web server or a sequence analysis program, it is much easier to remember an entry name than an accession number. Most of us find it easier to remember something like "APOA5_HUMAN" than something like "Q6Q788".

UniProtKB News

Cross-references to GeneDB_Spombe

We changed the Data bank identifier for the Schizosaccharomyces pombe GeneDB Prototype from GeneDB_SPombe to GeneDB_Spombe.

Cross-references to LegioList

Cross-references have been added to LegioList, a database which provides a complete dataset of DNA and protein sequences derived from L. pneumophila strain Paris and strain Lens, linked to the relevant annotations and functional assignments. LegioList is available at http://genolist.pasteur.fr/LegioList/.

The format of the explicit links in the flat file is:

Resource abbreviation: LegioList
Resource identifier: Ordered locus name.
Example (Q5X2T6):
DR   LegioList; lpp2301; -.

UniProt release 4.0

Published February 1, 2005

Headlines

UniProtKB/Swiss-Prot major release (46.0)

Release 46.0 of Swiss-Prot contains 168'297 sequence entries, comprising 61'443'278 amino acids abstracted from 124'910 references.

4'537 sequences have been added since release 45, the sequence data of 866 existing entries has been updated and the annotations of 77'494 entries have been revised. This represents an increase of 3%.

Many improvements were carried out in the last 3 months. In particular, we have extended the format for ID lines in both Swiss-Prot and TrEMBL.

UniProt Knowledgebase release 4.0 includes Swiss-Prot release 46.0 and TrEMBL release 29.0.

Full statistics and release notes

UniProtKB News

Modification to the ftp server directory structure

In order to provide access to the last major Swiss-Prot and TrEMBL releases (as opposed to the biweekly releases) via the UniProt ftp servers, ftp.uniprot.org/databases/uniprot, ftp.expasy.org/databases/uniprot and ftp.ebi.ac.uk/databases/uniprot, we changed the directory structure of our ftp sites.

In addition to the possibility of downloading the complete databases, we provide the data in the form of taxonomic divisions for archaea, bacteria, fungi, human, invertebrates, mammals, plants, rodents, vertebrates, viruses and unclassified.

The new structure will be:

/databases/uniprot
     /current_release
          /knowledgebase
              /complete
                 uniprot_sprot.dat.gz
                 uniprot_sprot.fasta.gz
                 uniprot_sprot.xml.gz
                 uniprot_trembl.dat.gz
                 uniprot_trembl.fasta.gz
                 uniprot_trembl.xml.gz
                 etc
              /taxonomic_divisions
                 uniprot_sprot_archaea.dat.gz
                 uniprot_trembl_archaea.dat.gz
                 uniprot_sprot_bacteria.dat.gz
                 uniprot_trembl_bacteria.dat.gz
                 etc
          /uniref
     /previous_releases
          /release1.0
              /knowledgebase
              /uniref
          /release2.0
              /knowledgebase
              /uniref
          etc
Symbolic links will be established for the following existing directories:
/databases/uniprot/knowledgebase to
               /databases/uniprot/current_release/knowledgebase
/databases/uniprot/uniref        to
                /databases/uniprot/current_release/uniref

On ftp.expasy.org and ftp.ebi.ac.uk:
/databases/swiss-prot/release_compressed to
                /databases/uniprot/previous_releases/releaseX.0/knowledgebase
/databases/trembl/release_compressed     to
                /databases/uniprot/previous_releases/releaseX.0/knowledgebase

The directory on ExPASy that used to contain uncompressed Swiss-Prot releases, /databases/swiss-prot/release/, will be removed.
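
For scripts that fetch files from the ftp servers, the layout above can be captured in a few path helpers. The following Python sketch is only an illustration: the function names are hypothetical, and the file naming patterns are inferred from the file names listed in the directory tree, not from a formal specification.

```python
import posixpath

ROOT = "/databases/uniprot"


def complete_dataset_path(section, fmt):
    """Path of a complete data set in the current release.

    section is 'sprot' or 'trembl'; fmt is 'dat', 'fasta' or 'xml'.
    """
    filename = "uniprot_%s.%s.gz" % (section, fmt)
    return posixpath.join(ROOT, "current_release", "knowledgebase",
                          "complete", filename)


def division_path(section, division):
    """Path of a taxonomic division file (flat-file format) in the current release."""
    filename = "uniprot_%s_%s.dat.gz" % (section, division)
    return posixpath.join(ROOT, "current_release", "knowledgebase",
                          "taxonomic_divisions", filename)


def previous_release_dir(release):
    """Knowledgebase directory of an archived major release, e.g. '2.0'."""
    return posixpath.join(ROOT, "previous_releases",
                          "release" + release, "knowledgebase")


print(complete_dataset_path("sprot", "fasta"))
print(previous_release_dir("2.0"))
```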

Please note that if you are interested in complete proteomes, you can download:

Extension of the TrEMBL entry name format

Previously, TrEMBL used the accession number as the entry name. With this release, TrEMBL entry names are composed of the accession number and an organism identification code (O95417_HUMAN, Q9VVG0_DROME, P71025_BACSU, Q9SR52_ARATH, etc.). The speclist.txt file lists the organism identification codes which are used to build the "organism" part of an entry name in Swiss-Prot. This file has been extended to include codes to be used in TrEMBL. As it is not possible to manually assign organism codes to all species represented in TrEMBL in a reasonable timeframe, it was decided to define "virtual" codes that regroup organisms at a certain taxonomic level. Such codes are prefixed by the number "9" and generally correspond to a "pool" of organisms, which can be as wide as a kingdom. Here are some examples of such codes:

9BACT B      2: N=Bacteria
9CNID E   6073: N=Cnidaria
9FUNG E   4751: N=Fungi
9REOV V  10880: N=Reoviridae
9TETR E  32523: N=Tetrapoda
9VIRI E  33090: N=Viridiplantae
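
The example lines above can be read with a short Python sketch such as the following. The function names are hypothetical. The sketch assumes the column layout seen in the examples (code, a single letter, a numeric taxon identifier followed by a colon, then "N=" and the name); the meaning of the single letter is not defined in this text and is treated here as a kingdom code, which is an assumption.

```python
import re

# Layout inferred from the example lines only.
SPECLIST_LINE = re.compile(r"^(\S{1,5})\s+([A-Z])\s+(\d+):\s+N=(.+)$")


def parse_speclist_line(line):
    """Parse one organism code line into a dictionary."""
    match = SPECLIST_LINE.match(line.strip())
    if match is None:
        raise ValueError("unrecognized speclist line: %r" % line)
    code, kingdom, taxon_id, name = match.groups()
    return {
        "code": code,
        "kingdom": kingdom,                 # assumed to be a kingdom code
        "taxon_id": int(taxon_id),
        "name": name,
        "virtual": code.startswith("9"),    # "virtual" codes start with 9
    }


def organism_code(entry_name):
    """Return the organism part of an entry name such as O95417_HUMAN."""
    return entry_name.rsplit("_", 1)[1]


print(parse_speclist_line("9TETR E  32523: N=Tetrapoda"))
print(organism_code("Q9VVG0_DROME"))
```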

TrEMBL entries are widely used for sequence analysis such as similarity search, multiple sequence alignments or phylogenetic analysis. The extension of the entry name will simplify the species identification in the analysis results.

Change of the entry name in many Swiss-Prot entries

In the last release we introduced to Swiss-Prot the first entry with the new format of the entry name. With this release, many entry names have changed to the new format.

New comment line (CC) topic: INTERACTION

The CC line topic INTERACTION is used to convey information relevant to binary protein-protein interactions. It is automatically derived from the IntAct database and is updated on a monthly basis. The occurrence is one INTERACTION topic per entry, with each binary interaction being presented in a separate line. Each data line can be longer than 75 characters.

Interactions can be derived by any appropriate experimental method, but must be confirmed by a second experiment if they result from a single yeast two-hybrid experiment. Interactions from large-scale experiments are included only if the authors assign them a high confidence.

The format of the CC line topic INTERACTION is:


CC   -!- INTERACTION:
CC       {{SP_Ac:identifier[ (xeno)]}|Self}; NbExp=n; IntAct=IntAct_Protein_Ac, IntAct_Protein_Ac;

where

SP_Ac - The Swiss-Prot or TrEMBL accession number of the interacting protein. If appropriate, the IsoId is used instead to specify the relevant interacting protein isoform.
identifier - Describes the interacting protein. It is derived from the Swiss-Prot or TrEMBL GN line and thus presents either a "gene name", an "ordered locus name" or an "ORF name". When no GN line is available, a dash is indicated instead.
(xeno) - An optional qualifier indicating that the interacting proteins are derived from different species. This may be due to the experimental set-up or may reflect a pathogen-host interaction.
Self - Reflects a self-association; the corresponding current entry's SP_Ac and 'identifier' are not given/repeated.
NbExp=n - The number of experiments in IntAct supporting the interaction.
IntAct_Protein_Ac - The IntAct accession number of an interacting protein. The first IntAct_Protein_Ac refers to the protein or an isoform of the current entry, the second refers to the interacting protein or isoform.

Within the CC INTERACTION topic, homomeric interactions are listed before the heteromeric interactions; the latter are sorted alphanumerically according to the 'identifier'.

"IntAct=IntAct_Protein_Ac, IntAct_Protein_Ac" identifies the interaction in IntAct by using the two IntAct protein identifiers.

Examples of interaction lines are given below. The CC INTERACTION topics are not complete; only the interaction lines being explained are shown.

CC   -!- INTERACTION:
CC       P11450:fcp3c; NbExp=1; IntAct=EBI-126914, EBI-159556;

In this typical example, the current protein interacts with P11450, which is further characterized by "fcp3c", derived from its GN line, where the gene name is given as "Fcp3C". The interaction is supported by one experiment stored in IntAct. Experimental details for this interaction can be found by querying IntAct with "EBI-126914, EBI-159556".

CC   -!- INTERACTION:
CC       Q9W1K5-1:cg11299; NbExp=1; IntAct=EBI-133844, EBI-212772;
CC       ...

The current protein interacts with an isoform of Q9W1K5 defined by the IsoId Q9W1K5-1.

CC   -!- INTERACTION:
CC       Q8NI08:-; NbExp=1; IntAct=EBI-80809, EBI-80799;

No gene name information for the interacting protein is available.

CC   -!- INTERACTION:
CC       Self; NbExp=1; IntAct=EBI-123485, EBI-123485;

The protein self-associates.

CC   -!- INTERACTION:
CC       Q8C1S0:2410018m14rik (xeno); NbExp=1; IntAct=EBI-394562, EBI-398761;

The source organisms of the interacting proteins are different.

CC   -!- INTERACTION:
CC       P51617:irak1; NbExp=1; IntAct=EBI-448466, EBI-358664;
CC       P51617:irak1; NbExp=1; IntAct=EBI-448472, EBI-358664;

Different isoforms of the current protein are shown to interact with the same protein (P51617). This is reflected by different IntAct_Protein_Acs for the current protein.
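
The interaction line format can be read with a regular expression, as in the following Python sketch. It is an illustration only: the names INTERACTION_LINE and parse_interaction_line are hypothetical, and the sketch assumes that identifiers contain no spaces or semicolons and that IntAct accession numbers always look like "EBI-" followed by digits, as in the examples above.

```python
import re

INTERACTION_LINE = re.compile(
    r"^CC\s+"
    r"(?:(?P<self>Self)"
    r"|(?P<accession>[^:;\s]+):(?P<identifier>[^;\s]+)(?P<xeno> \(xeno\))?)"
    r"; NbExp=(?P<nbexp>\d+)"
    r"; IntAct=(?P<first>EBI-\d+), (?P<second>EBI-\d+);$"
)


def parse_interaction_line(line):
    """Parse one interaction data line; return None for any other line."""
    match = INTERACTION_LINE.match(line.rstrip())
    if match is None:
        return None
    identifier = match.group("identifier")
    return {
        "self": match.group("self") is not None,
        "accession": match.group("accession"),      # may be an IsoId
        "identifier": None if identifier in (None, "-") else identifier,
        "xeno": match.group("xeno") is not None,
        "experiments": int(match.group("nbexp")),
        "intact": (match.group("first"), match.group("second")),
    }


line = "CC       Q8C1S0:2410018m14rik (xeno); NbExp=1; IntAct=EBI-394562, EBI-398761;"
print(parse_interaction_line(line))
```

The topic header line ("CC   -!- INTERACTION:") does not match the pattern, so the function returns None for it and a caller can simply skip it.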

Example entry with many interaction lines: Q02821.

New Swiss-Prot document: similar.txt

There is a new Swiss-Prot document: similar.txt: Index of CC SIMILARITY lines. This index lists all names of families and domains occurring in CC SIMILARITY lines of Swiss-Prot entries.

Changes concerning keywords

New keywords:

UniProt release 3.5

Published January 4, 2005

UniProtKB News

Extension of the Swiss-Prot entry name format

We endeavor to assign meaningful entry names that facilitate the identification of the proteins and the species of origin. Swiss-Prot uses a general purpose naming convention that can be symbolized as X_Y, where X is a mnemonic code of alphanumeric characters representing the protein name, the '_' sign serves as a separator, and the Y is a mnemonic species identification code of at most 5 alphanumeric characters representing the biological source of the protein.

The entry name used to consist of up to ten uppercase alphanumeric characters. We have now extended the mnemonic code for the protein name from up to 4 characters to up to 5 characters, so entry names can from now on consist of up to 11 characters.
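
Programs that validate entry names may need to be adapted to the longer format. The following Python sketch checks a name against the X_Y convention described above; the names ENTRY_NAME, parse_entry_name and fits_old_format are hypothetical, and the pattern covers only the Swiss-Prot convention given here (up to 5 uppercase alphanumeric characters on each side of the '_' separator).

```python
import re

ENTRY_NAME = re.compile(r"^([A-Z0-9]{1,5})_([A-Z0-9]{1,5})$")


def parse_entry_name(name):
    """Return (protein mnemonic, species code) or raise ValueError."""
    match = ENTRY_NAME.match(name)
    if match is None:
        raise ValueError("not a valid entry name: %r" % name)
    return match.group(1), match.group(2)


def fits_old_format(name):
    """True if the name also fits the former 10-character limit."""
    protein, _species = parse_entry_name(name)
    return len(protein) <= 4


print(parse_entry_name("TINA1_DROME"), fits_old_format("TINA1_DROME"))
```

A fixed-width field of 10 characters, for example, would truncate a name such as TINA1_DROME, which is the kind of problem the single test entry in this release is meant to reveal.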

As this modification might have an impact on many programs, we introduced in this release only one Swiss-Prot entry with an entry name in the new format: TINA1_DROME (Q9W0Y1). With UniProtKB release 4.0 at the beginning of February, we will change the entry names of many Swiss-Prot entries.

We strongly advise users to cite Swiss-Prot entries by their unique and stable identifier, which is the first (primary) accession number of an entry. It happens occasionally that entries are only referred to by the entry name. As we will soon change the entry names of thousands of entries, we provide the tool IDtracker, which allows users of the Swiss-Prot protein knowledgebase to trace the identifiers (ID) of protein entries.

Cross-references to Ensembl

Cross-references have been added to the Ensembl database, a bioinformatics project that organizes biological information around the sequences of large genomes. Ensembl is available at http://www.ensembl.org.

The format of the explicit links in the flat file is:

Resource abbreviation: Ensembl
Resource identifier: Ensembl unique identifier for a gene.
Optional information 1: Species name.
Example (O43462):
DR   Ensembl; ENSG00000012174; Homo sapiens.

UniProt release 3.4

Published December 21, 2004

Headlines

Annotation of TRPA1 and TRPM8 transient receptors involved in sensing cold

Mammals detect temperature with specialized neurons in the peripheral nervous system. Two transient receptor-class channels, TRPA1 and TRPM8, have been implicated in sensing cold.

The first of these channels, TRPA1, is activated by temperatures below 17 degrees Celsius, which corresponds to the noxious cold threshold. The second channel, TRPM8, plays a role in sensing less extreme cool temperatures, being activated below 25 degrees Celsius.

Interestingly, the TRPM8 channel is also activated by products such as eucalyptol or menthol, which may provide a ready explanation for the sensation of coolness triggered by these flavors.

See:

UniProtKB News

Changes in the RP (Reference Position) line

We changed the following items of the RP line:

  • 'SEQUENCE FROM N.A.' became 'NUCLEOTIDE SEQUENCE'
    This topic can be tagged with an optional qualifier indicating the origin of the sequence: 'NUCLEOTIDE SEQUENCE [ORIGIN OF SEQUENCE DATA]', where valid names are:
    • GENOMIC DNA: the individual gene has been sequenced
    • GENOMIC RNA: the individual gene has been sequenced
    • MRNA: the individual cDNA has been sequenced
    If the data originates from a genome project or cDNA large-scale project, the qualifier is preceded by the term "LARGE SCALE"; this generally indicates that the relevant publication describes the project rather than giving further details on the protein of interest.
    • LARGE SCALE GENOMIC DNA: the gene has been sequenced as part of a genome project
    • LARGE SCALE MRNA: the gene has been sequenced as part of a large-scale cDNA project
  • 'SEQUENCE' became 'PROTEIN SEQUENCE'
  • 'REVISION(S)' became 'SEQUENCE REVISION'

Example:

RP   NUCLEOTIDE SEQUENCE [LARGE SCALE MRNA] (ISOFORM 1), PROTEIN SEQUENCE
RP   OF 108-131; 220-231 AND 349-393, CHARACTERIZATION, AND MUTAGENESIS OF
RP   ARG-336.

If 2 qualifiers apply, both are indicated, separated by a '/'.

Example:

RP   NUCLEOTIDE SEQUENCE [GENOMIC DNA / MRNA].
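
A short Python sketch showing how the optional qualifiers might be extracted from RP lines. The function names are hypothetical; the sketch assumes that an RP statement may wrap over several lines, that each line starts with the two-character line code followed by three spaces, and that two qualifiers are separated by '/' inside the square brackets.

```python
import re

QUALIFIER = re.compile(r"NUCLEOTIDE SEQUENCE \[([^\]]+)\]")


def join_rp_lines(lines):
    """Join wrapped RP lines into one string without the line code."""
    return " ".join(line[5:].strip() for line in lines)


def sequence_origins(rp_text):
    """Return the list of origin qualifiers, or [] if none is given."""
    match = QUALIFIER.search(rp_text)
    if match is None:
        return []
    return [qualifier.strip() for qualifier in match.group(1).split("/")]


def is_large_scale(qualifier):
    """True for qualifiers preceded by the term LARGE SCALE."""
    return qualifier.startswith("LARGE SCALE ")


text = join_rp_lines(["RP   NUCLEOTIDE SEQUENCE [GENOMIC DNA / MRNA]."])
print(sequence_origins(text))
```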

New comment line (CC) topic: BIOPHYSICOCHEMICAL PROPERTIES

A new comment line (CC) topic has been introduced: BIOPHYSICOCHEMICAL PROPERTIES. This topic is used to convey information relevant to biophysical and physicochemical data and information on pH dependence, temperature dependence, kinetic parameters, redox potentials, and maximal absorption.

The format of this comment block is:

CC   -!- BIOPHYSICOCHEMICAL PROPERTIES:
CC       Absorption:
CC         Abs(max)=xx nm;
CC         Note=free_text;
CC       Kinetic parameters:
CC         KM=xx unit for substrate [(free_text)];
CC         Vmax=xx unit enzyme [free_text];
CC         Note=free_text;
CC       pH dependence:
CC         free_text;
CC       Redox potential:
CC         free_text;
CC       Temperature dependence:
CC         free_text;

A BIOPHYSICOCHEMICAL PROPERTIES block must contain at least one of the properties Absorption, Kinetic parameters, pH dependence, Redox potential, Temperature dependence and may have any combination of these properties (ordered as indicated above). The meaning of these subtopics is as follows:

Property - Description
Absorption - Indicates the wavelength at which photoreactive proteins such as opsins and DNA photolyases show maximal absorption.
Kinetic parameters - Mentions the Michaelis-Menten constant (KM) and maximal velocity (Vmax) of enzymes.
pH dependence - Describes the optimum pH for enzyme activity and/or the variation of enzyme activity with pH variation.
Redox potential - Reports the value of the standard (midpoint) oxido-reduction potential(s) for electron transport proteins.
Temperature dependence - Indicates the optimum temperature for enzyme activity and/or the variation of enzyme activity with temperature variation; the thermostability/thermolability of the enzyme is also mentioned when it is known.

Examples:

CC   -!- BIOPHYSICOCHEMICAL PROPERTIES:
CC       Absorption:
CC         Abs(max)=395 nm;
CC         Note=Exhibits a smaller absorbance peak at 470 nm. The
CC         fluorescence emission spectrum peaks at 509 nm with a shoulder
CC         at 540 nm;

CC   -!- BIOPHYSICOCHEMICAL PROPERTIES:
CC       Kinetic parameters:
CC         KM=62 mM for glucose;
CC         KM=90 mM for maltose;
CC         Vmax=0.20 mmol/min/mg enzyme with glucose as substrate;
CC         Vmax=0.11 mmol/min/mg enzyme with maltose as substrate;
CC         Note=Acetylates glucose, maltose, mannose, galactose, and
CC         fructose with a decreasing relative rate of 1, 0.55, 0.20, 0.07,
CC         0.04;

CC   -!- BIOPHYSICOCHEMICAL PROPERTIES:
CC       Kinetic parameters:
CC         KM=1.76 uM for chlorophyll;
CC       pH dependence:
CC         Optimum pH is 7.5. Active from pH 5.0 to 9.0;
CC       Temperature dependence:
CC         Optimum temperature is 45 degrees Celsius. Active from 30 to 60
CC         degrees Celsius;
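
A block of this kind can be turned into a simple data structure, as in the following Python sketch. It is an illustration under stated assumptions: the function name parse_biophys_block is hypothetical, subtopic lines are recognized by their exact names, and statements are assumed to be terminated by ';' with no semicolon occurring inside free text.

```python
SUBTOPICS = (
    "Absorption",
    "Kinetic parameters",
    "pH dependence",
    "Redox potential",
    "Temperature dependence",
)


def parse_biophys_block(lines):
    """Map each subtopic of one CC block to its list of statements."""
    chunks = {}
    current = None
    for line in lines:
        text = line[2:].strip()            # drop the "CC" line code
        if text.startswith("-!-"):
            continue                       # topic header line
        if text.endswith(":") and text[:-1] in SUBTOPICS:
            current = text[:-1]
            chunks[current] = []
            continue
        if current is None:
            raise ValueError("content before any subtopic: %r" % line)
        chunks[current].append(text)
    # Re-join wrapped lines, then split into ';'-terminated statements.
    return {
        name: [s.strip() for s in " ".join(parts).split(";") if s.strip()]
        for name, parts in chunks.items()
    }


block = [
    "CC   -!- BIOPHYSICOCHEMICAL PROPERTIES:",
    "CC       Kinetic parameters:",
    "CC         KM=62 mM for glucose;",
    "CC         KM=90 mM for maltose;",
]
print(parse_biophys_block(block))
```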

Changes concerning keywords

New keywords:

Modified keyword:

Deleted keywords:

  • DNA priming
  • Noncapsid protein
  • Nonstructural protein
  • Nucleocapsid

Changes concerning the controlled vocabulary for PTMs

New terms for the feature key 'CROSSLNK':

  • 2-iminomethyl-5-imidazolinone (Glu-Gly)
  • 2-iminomethyl-5-imidazolinone (Met-Gly)
  • 5-imidazolinone (Asn-Gly)
  • 5-imidazolinone (Lys-Gly)

UniProt release 3.3

Published December 7, 2004

UniProtKB News

Changes concerning keywords

New keywords:

UniProt release 3.2

Published November 23, 2004

Headlines

Major update of C.elegans entries

We have recently finished a major update of Caenorhabditis elegans entries in Swiss-Prot. The following tasks were carried out:

  • Update and "homogenization" of all references to the genome project and sequence revisions;
  • Addition of cross-references to WormBase;
  • Re-annotation of many existing entries using literature references;
  • Manual annotation of about one hundred new entries.

UniProtKB News

Removal of the file submit.txt

The file submit.txt is no longer distributed. Information on how to submit sequence data, updates or corrections can be found in the submissions and updates help.

UniProt release 3.1

Published November 9, 2004

UniProtKB News

Conversion of Swiss-Prot to mixed-case characters

The conversion of Swiss-Prot entries from all UPPER CASE to MiXeD CaSe is now completed. This modification does not apply to the following line types:

  • ID (IDentification) line
  • AC (ACcession number) line
  • RP (Reference Position) line
  • SQ (SeQuence) line
  • Amino acid sequence

Changes concerning keywords

New keywords:

UniProt release 3.0

Published October 25, 2004

Headlines

UniProtKB/Swiss-Prot major release (45.0)

Release 45.0 of Swiss-Prot contains 163'235 sequence entries, comprising 59'631'787 amino acids abstracted from 120'520 references. 6'183 sequences have been added since release 44, the sequence data of 2'851 existing entries has been updated and the annotations of 71'220 entries have been revised. This represents an increase of 4%.

Many improvements were carried out in the last 3 months at the level of the DR, CC, KW and FT lines.

UniProt Knowledgebase release 3.0 includes Swiss-Prot release 45.0 and TrEMBL release 28.0.

Full statistics and release notes

UniProtKB News

UniProtKB release notes: relnotes.html

With release 3.0 we introduce the UniProtKB release notes, which replace the Swiss-Prot release notes (rnote_sp.html) and TrEMBL release notes (rnote_tr.html). The UniProtKB release notes include the release statistics of both databases, the status of the model organisms and various other useful information. They can be downloaded from ftp://ftp.uniprot.org/pub/databases/uniprot/knowledgebase/docs/.

Changes concerning keywords

New keywords:

UniProt release 2.7

Published October 11, 2004

UniProtKB News

Cross-references to H-InvDB

Cross-references have been added to the human gene database H-Invitational Database (H-InvDB), which provides information on annotated full-length cDNA clones available from six high throughput cDNA sequencing projects. The H-Invitational Database is available at http://www.h-invitational.jp/.

The format of the explicit links in the flat file is:

Resource abbreviation H-InvDB
Resource identifier H-InvDB's unique identifier for a cDNA cluster.
Example
P78314:
DR   H-InvDB; HIX0004037; -.

Cross-references to WormBase

We have added cross-references to WormBase, which provides information concerning the genetics, genomics and biology of C. elegans and some related nematodes. WormBase is available at http://www.wormbase.org/.

The identifiers of the appropriate DR line are:

Resource abbreviation WormBase
Resource identifier WormBase's unique identifier for a gene.
Optional information 1 Gene designation.
Example
DR   WormBase; WBGene00006806; unc-74.
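
The DR lines in the examples above share one layout: the line code, three spaces, semicolon-separated fields and a terminating period. The following is a minimal sketch of how such a line might be split into its fields, assuming exactly that layout. The function name parse_dr_line is hypothetical and not part of any UniProt tool.

```python
def parse_dr_line(line):
    """Split a DR line into (resource, identifier, optional fields)."""
    if not line.startswith("DR   "):
        raise ValueError("not a DR line: %r" % line)
    body = line[5:].rstrip()
    if body.endswith("."):
        body = body[:-1]  # drop the terminating period
    fields = [field.strip() for field in body.split(";")]
    if len(fields) < 2:
        raise ValueError("DR line needs a resource and an identifier")
    resource, identifier = fields[0], fields[1]
    # A dash '-' marks an optional field that carries no information.
    extras = [None if field == "-" else field for field in fields[2:]]
    return resource, identifier, extras


print(parse_dr_line("DR   WormBase; WBGene00006806; unc-74."))
print(parse_dr_line("DR   H-InvDB; HIX0004037; -."))
```

Mapping the dash to None keeps "no optional information" distinct from a real value such as a gene designation.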

Changes concerning keywords

New keywords:

Deleted keywords:

  • Semen
  • Yolk

UniProt release 2.6

Published September 27, 2004

UniProtKB News

Changes concerning keywords

Deleted keywords:

  • Connective tissue
  • Eggshell
  • Endothelial cell
  • Eosinophil

UniProt release 2.5

Published September 13, 2004

UniProtKB News

Changes concerning keywords

Deleted keyword:

  • Parotid gland

Changes concerning the controlled vocabulary for PTMs

We are continuously overhauling the annotation of post-translational modifications (PTMs). For the feature key MOD_RES, the newly introduced controlled vocabularies for PTMs are:

  • Flavin binding: All entries with annotated flavin-binding sites have the keyword Flavoprotein.

     FMN phosphoryl serine (Keyword: FMN)
     FMN phosphoryl threonine (Keyword: FMN)
     O-8alpha-FAD tyrosine (Keyword: FAD)
     S-4a-FMN cysteine (Keyword: FMN)
     S-6-FMN cysteine (Keyword: FMN)
     S-8alpha-FAD cysteine (Keyword: FAD)
     Tele-8alpha-FAD histidine (Keyword: FAD)
     Tele-8alpha-FMN histidine (Keyword: FMN)
    
  • Hydroxylation: All entries with annotated hydroxylation sites have the keyword Hydroxylation.

     4,5-dihydroxylysine
     3,4-dihydroxyproline
    
  • Other new controlled vocabularies for PTMs that are annotated with the feature key MOD_RES:

     O-(sn-1-glycerophosphoryl)serine
    

UniProt release 2.4

Published August 31, 2004

Headlines

1'500 cited journals

It is interesting to note that information relevant to the scope of Swiss-Prot is found in a continuously increasing number of scientific journals. Currently Swiss-Prot cites 1'500 different journals. Only 5 years ago, this number was slightly less than 1'000. Out of those 1'500 journals, 157 are either no longer published or have changed their names. It is also noteworthy that about 50% of these 1'500 journals are cited fewer than four times in the knowledgebase. At the other extreme, only 106 journals are cited more than 100 times.

UniProtKB News

Release notes: rnote_sp.html & rnote_tr.html

The TrEMBL release notes (rnote_tr.html) were added to the documents distributed with the UniProtKB release. The name of the Swiss-Prot release notes changed from relnotes.html to rnote_sp.html accordingly. These documents can all be downloaded from ftp://ftp.uniprot.org/pub/databases/uniprot/knowledgebase/docs/.

UniProt release 2.3

Published August 16, 2004

UniProtKB News

New RL line structure for electronic publications

Electronic publications have been indicated in the RL line with the '(er)' prefix that stands for electronic resource:

RL   (er) Free text.

Example:

RL   (er) Plant Gene Register PGR98-023.

Removal of the submission references to HIV data bank

We replaced all submission references to the HIV data bank by publications, thus RL lines of the type:

RL   Submitted (XXX-YYYY) to the HIV data bank.

no longer exist in Swiss-Prot.

Change in cross-references to PDB

The structure determination method (X-ray, NMR, etc.) and the mapping of the extent of the cross-reference on the sequence have been introduced as the secondary and tertiary identifiers, respectively, of the PDB cross-reference line.

Former format:

DR   PDB; ENTRY_NAME; REVISION_DATE.

New format:

DR   PDB; ENTRY_NAME; Method; CHAIN[S]=RANGE.

The methods are controlled vocabulary and currently include:

  • X-ray (for X-ray crystallography)
  • NMR (for NMR spectroscopy)
  • EM (for electron microscopy and cryo-electron diffraction)
  • Fiber (for fiber diffraction)
  • IR (for infrared spectroscopy)
  • Model (for predicted models)
  • Neutron (for neutron diffraction)

Example:

DR   PDB; 1NB3; X-ray; A/B/C/D=116-335, P/R/S/T=98-105.

The tertiary identifier indicates the chain(s) and the corresponding range, of which the structure has been determined. If the range is unknown, a dash is given rather than the range positions. Example:

DR   PDB; 1IYJ; X-ray; B/D=-.

If both the chains and the range are unknown, a single dash is used. Example:

DR   PDB; 1N12; X-ray; -.

With the introduction of the new format, DR PDB lines can become longer than 75 characters.
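
As an illustration of the new format, the sketch below reads a DR PDB line into the entry name, the method and a per-chain mapping of sequence ranges. It assumes the three variants shown above (ranges given, range unknown, chains and range unknown). The function name parse_pdb_dr is hypothetical.

```python
def parse_pdb_dr(line):
    """Return (entry, method, chains); chains maps chain -> (start, end) or None."""
    prefix = "DR   PDB; "
    if not line.startswith(prefix):
        raise ValueError("not a DR PDB line")
    body = line[len(prefix):].rstrip()
    if body.endswith("."):
        body = body[:-1]  # drop the terminating period
    entry, method, coverage = [part.strip() for part in body.split(";", 2)]
    chains = {}
    if coverage != "-":  # a lone dash: chains and range are both unknown
        for group in coverage.split(","):
            names, _, span = group.strip().partition("=")
            if span == "-":  # chains known, range unknown
                positions = None
            else:
                start, end = span.split("-")
                positions = (int(start), int(end))
            for name in names.split("/"):
                chains[name] = positions
    return entry, method, chains


print(parse_pdb_dr("DR   PDB; 1NB3; X-ray; A/B/C/D=116-335, P/R/S/T=98-105."))
```

Because such lines can exceed 75 characters, the sketch reads each line whole and never relies on a fixed width.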

UniProt release 2.2

Published July 30, 2004

UniProtKB News

Changes concerning the controlled vocabulary for PTMs

We are continuously overhauling the annotation of post-translational modifications (PTMs). For the feature key MOD_RES, the newly introduced controlled vocabularies for PTMs (PTMlist.txt) are:

  • Methylation: All entries with annotated methylation sites have the keyword Methylation.

     N6,N6,N6-trimethyl-5-hydroxylysine
     N6-poly(methylaminopropyl)lysine
    
  • Hydroxylation: All entries with annotated hydroxylation sites have the keyword Hydroxylation.

     3-hydroxyproline
     4-hydroxyarginine
    
  • Unidentified N-terminal blocking modifications:

     Blocked amino end (Leu)
    
  • Other new controlled vocabularies for PTMs that are annotated with the feature key MOD_RES:

     Glycine radical (Keyword: Organic radical)
     Pentaglycyl murein peptidoglycan amidated alanine (Keyword: Peptidoglycan-anchor)
     Tryptophylquinone
    

UniProt release 2.1

Published July 19, 2004

Headlines

Annotation of HERV protein sequences

The human genome contains a number of human endogenous retroviruses (HERVs). These proviruses (the integrated form of retroviral DNA) are retroviral sequences that are transmitted vertically as part of the host germ line. A number of HERV 'families' have been identified, each derived from an independent colonisation event.

Some proviruses display open reading frames with coding capacity for a variety of viral-like proteins (Env, Gag, Pol, Pro, etc.). We have already annotated in Swiss-Prot a significant number of HERV proteins. We only include such potential proteins if they meet one of these three criteria: i) there is evidence of their expression by the host, ii) the derived sequence encodes a full-length protein, iii) the protein has a potential cellular function.

UniProtKB News

Change in the keyword line (KW)

Keywords are now stored in alphabetical order on the KW lines of both Swiss-Prot and TrEMBL entries.

Format change in the comment line (CC) topic: MASS SPECTROMETRY

We have slightly changed the format for the comment line topic MASS SPECTROMETRY, which reports the exact molecular weight of a protein or part of a protein as determined by mass spectrometric methods. The modifications concern the topic RANGE, which has become mandatory, and the introduction of the new mandatory topic NOTE, which is used to indicate the relevant reference number.

New format:

CC   -!- MASS SPECTROMETRY: MW=XXX[; MW_ERR=XX][; METHOD=XX]; RANGE=XX-XX[ (Name)]; NOTE={Free text (Ref.n)|Ref.n}.

Where:

  • 'MW=XXX' is the determined molecular weight (MW);
  • 'MW_ERR=XX' (optional) is the accuracy or error range of the MW measurement;
  • 'METHOD=XX' (optional) is the ionization method;
  • 'RANGE=XX-XX[ (Name)]' (mandatory) is used to indicate what part of the protein sequence entry corresponds to the molecular weight. In case of multiple products, the name of the relevant isoform is enclosed.
  • 'NOTE={Free text (Ref.n)|Ref.n}' (mandatory) indicates the relevant reference, which optionally can be preceded by a comment in free text format.

Example:

CC   -!- MASS SPECTROMETRY: MW=32875.93; METHOD=MALDI;
CC       RANGE=1-284 (Isoform 3); NOTE=Ref.6.
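
To make the topic structure concrete, here is a minimal sketch that parses a MASS SPECTROMETRY comment block and checks the mandatory topics MW, RANGE and NOTE. It assumes that the free text in NOTE contains no semicolon. The function name parse_mass_spec and the keys of the returned dictionary are hypothetical.

```python
import re


def parse_mass_spec(lines):
    """Parse one MASS SPECTROMETRY comment block into a dict of its topics."""
    # Drop the 'CC' line code and rejoin the wrapped lines.
    text = " ".join(line[2:].strip() for line in lines)
    prefix = "-!- MASS SPECTROMETRY:"
    if not text.startswith(prefix):
        raise ValueError("not a MASS SPECTROMETRY comment")
    body = text[len(prefix):].strip()
    if body.endswith("."):
        body = body[:-1]  # drop the terminating period
    topics = {}
    for item in body.split(";"):
        key, _, value = item.strip().partition("=")
        topics[key] = value
    for required in ("MW", "RANGE", "NOTE"):
        if required not in topics:
            raise ValueError("missing mandatory topic " + required)
    match = re.fullmatch(r"(\d+)-(\d+)(?: \((.+)\))?", topics["RANGE"])
    if match is None:
        raise ValueError("malformed RANGE: " + topics["RANGE"])
    return {
        "mw": float(topics["MW"]),
        "mw_err": topics.get("MW_ERR"),   # optional topic
        "method": topics.get("METHOD"),   # optional topic
        "range": (int(match.group(1)), int(match.group(2))),
        "name": match.group(3),           # isoform name, if any
        "note": topics["NOTE"],
    }


block = [
    "CC   -!- MASS SPECTROMETRY: MW=32875.93; METHOD=MALDI;",
    "CC       RANGE=1-284 (Isoform 3); NOTE=Ref.6.",
]
print(parse_mass_spec(block))
```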

Cross-references to AGD

We have added cross-references to Ashbya genome database, available at http://agd.unibas.ch/.

The identifiers of the appropriate DR line are:

Resource abbreviation AGD
Resource identifier AGD's unique identifier for a gene. This is generally the OLN (Ordered Locus Name) for that gene (eg: AAR059C), except for mitochondrial genes where AGD uses an identifier based on the gene name (eg: AgCOB1).
Optional information 1 None; a dash '-' is stored in that field.
Example
Q00063:
DR   AGD; AAR059C; -.

Changes concerning the controlled vocabulary for PTMs

We are continuously overhauling the annotation of post-translational modifications (PTMs). For the feature key MOD_RES, the newly introduced controlled vocabularies for PTMs are:

  • Hydroxylation: All entries with annotated hydroxylation sites have the keyword Hydroxylation.

     4-hydroxyproline
    
  • Unidentified N-terminal blocking modifications:

     Blocked amino end (Asx)
     Blocked amino end (Xaa)
    
  • Other new controlled vocabularies for PTMs that are annotated with the feature key MOD_RES:

     Pros-8alpha-FAD-histidine (Keyword: FAD)
     N6-murein peptidoglycan lysine
    

UniProt release 2.0

Published July 5, 2004

Headlines

UniProtKB/Swiss-Prot major release (44.0)

Release 44.0 of Swiss-Prot contains 153'825 sequence entries, comprising 56'599'343 amino acids abstracted from 117'387 references. 6'633 sequences have been added since release 43, the sequence data of 582 existing entries has been updated and the annotations of 139'855 entries have been revised. This represents an increase of 4%.

Many improvements were carried out in the last 3 months at the level of the GN, RX, CC and FT lines.

Full statistics and release notes

UniProtKB News

Incorporation of new entries into the biweekly UniProtKB releases of Swiss-Prot and TrEMBL

The files provided in the ftp directory /databases/uniprot/knowledgebase/new/ (and known as TrEMBL_New) have been removed. These files contained new sequence entries and sequences to be used to update existing Swiss-Prot or TrEMBL entries. Until now, these entries were integrated into Swiss-Prot and TrEMBL mostly only at full releases of these databases. We now incorporate these new and updated sequences into the biweekly UniProtKB releases of Swiss-Prot and TrEMBL (/databases/uniprot/knowledgebase/uniprot_sprot* and /databases/uniprot/knowledgebase/uniprot_trembl*), and, therefore, the distribution of these files is no longer necessary.

New format for the GN (Gene Name) line

We have introduced a new format for the GN (Gene Name) line and all gene names have been converted to mixed case. The new format is more structured than the previous one, in order to distinguish between three types of information:

  1. Gene names (a.k.a. gene symbols). The name(s) used to represent a gene. As there can be more than one name assigned to a gene, we make a distinction between the one which we believe should be used as the official gene name and the other names, which are listed as "Synonyms".
  2. Ordered locus names (a.k.a. OLN, ORF numbers, CDS numbers or Gene numbers). A name used to represent an ORF in a completely sequenced genome or chromosome. It is generally based on a prefix representing the organism and a number which usually represents the sequential ordering of genes on the chromosome. Depending on the genome sequencing center, numbers are attributed only to protein-coding genes, or also to pseudogenes, or also to tRNAs and other features. Examples: HI0934, Rv3245c, At5g34500, YER456W.
  3. ORF names (a.k.a. Sequencing names or Contig names or Temporary ORFNames). A name temporarily attributed by a sequencing project to an open reading frame. This name is generally based on a cosmid numbering system. Examples: MtCY277.28c, SYGP-ORF50, SpBC2F12.04, C06E1.1, CG10954.

The new format of the GN line is:

GN   Name=<name>; Synonyms=<name1>[, <name2>...]; OrderedLocusNames=<name1>[, <name2>...];
GN   ORFNames=<name1>[, <name2>...];

None of the above four tokens are mandatory. But a "Synonyms" token can only be present if there is a "Name" token.

If there is more than one gene, GN line blocks for the different genes are separated by the following line:

GN   and

Wrapping is done preferentially at a semicolon, otherwise at a comma.

Examples:

GN   Name=atpG; Synonyms=uncG, papC;
GN   OrderedLocusNames=b3733, c4659, z5231, ECs4675, SF3813, S3955;

GN   ORFNames=SPAC1834.11c;

GN   Name=cysA1; Synonyms=cysA; OrderedLocusNames=Rv3117, MT3199;
GN   ORFNames=MTCY164.27;
GN   and
GN   Name=cysA2; OrderedLocusNames=Rv0815c, MT0837; ORFNames=MTV043.07c;
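
The rules above (four optional tokens, "Synonyms" only together with "Name", gene blocks separated by "GN   and", wrapping at a semicolon or a comma) can be sketched as a small parser. This is an illustrative reading of the format, not the code used to produce the entries. The function name parse_gn_lines is hypothetical.

```python
def parse_gn_lines(lines):
    """Parse the GN lines of one entry into a list of per-gene dicts."""
    genes, chunks = [], []

    def flush():
        # Lines may be wrapped after a comma, so rejoin them before splitting.
        text = " ".join(chunks)
        gene = {}
        for token in text.split(";"):
            token = token.strip()
            if not token:
                continue
            key, _, value = token.partition("=")
            names = [name.strip() for name in value.split(",")]
            gene[key] = names[0] if key == "Name" else names
        if "Synonyms" in gene and "Name" not in gene:
            raise ValueError("a Synonyms token requires a Name token")
        if gene:
            genes.append(gene)
        chunks.clear()

    for line in lines:
        content = line[2:].strip()  # drop the 'GN' line code
        if content == "and":        # separator between two genes
            flush()
        else:
            chunks.append(content)
    flush()
    return genes


entry = [
    "GN   Name=cysA1; Synonyms=cysA; OrderedLocusNames=Rv3117, MT3199;",
    "GN   ORFNames=MTCY164.27;",
    "GN   and",
    "GN   Name=cysA2; OrderedLocusNames=Rv0815c, MT0837; ORFNames=MTV043.07c;",
]
for gene in parse_gn_lines(entry):
    print(gene)
```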

Cross-references to IntAct

We have added cross-references to IntAct, the Protein interaction database and analysis system available at http://www.ebi.ac.uk/intact/.

The identifiers of the appropriate DR line are:

Resource abbreviation IntAct
Resource identifier The Swiss-Prot primary AC number for the protein. This is used by IntAct as a link to all the interactions in which that protein is involved.
Example
P14653:
DR   IntAct; P14653; -.

Change in cross-references to Reactome (former GK)

The Genome Knowledgebase (GK) was renamed to Reactome. We changed the database name in the relevant cross-references (DR lines) accordingly.

Example:

DR   Reactome; Q9BZJ0; -.

Changes concerning the controlled vocabulary for PTMs

We are continuously overhauling the annotation of post-translational modifications (PTMs). For the feature key MOD_RES, the newly introduced controlled vocabularies for PTMs are:

  • Hydroxylation: All entries with annotated hydroxylation sites have the keyword Hydroxylation.

     3,4-dihydroxyarginine
     3,5-dihydroxylysine
    
  • Unidentified N-terminal blocking modifications:

     Blocked amino end (Ala)
     Blocked amino end (Arg)
     Blocked amino end (Asp)
     Blocked amino end (Cys)
     Blocked amino end (Gln)
     Blocked amino end (Glu)
     Blocked amino end (Gly)
     Blocked amino end (Ile)
     Blocked amino end (Met)
     Blocked amino end (Pro)
     Blocked amino end (Ser)
     Blocked amino end (Thr)
     Blocked amino end (Val)
    
  • Unidentified C-terminal blocking modifications:

     Blocked carboxyl end (His)
    

UniProt release 1.12

Published June 21, 2004

Headlines

Noah's ark or biodiversity in Swiss-Prot

While the crux of Swiss-Prot annotation is targeted toward a number of model organisms (human, Arabidopsis, Drosophila, E. coli, etc.), there is a continual increase in the number of species that are represented in the knowledgebase. Currently Swiss-Prot contains sequences originating from about 8'550 different species. For ~50% of these species there is only one associated entry in Swiss-Prot. This is often a protein whose gene is used for building phylogenetic trees, such as RuBisCO, cytochrome b or hemoglobin. On the other end of the spectrum, the 20 most represented species cover about 40% of the database (60'000 sequences). The most represented species is of course ourselves, with Homo sapiens filling up 7% of Swiss-Prot.

UniProtKB News

Digital Object Identifier (DOI) in the RX line

The Digital Object Identifier (DOI) is a system for identifying and exchanging intellectual property in the digital environment. We introduced the new optional identifier "DOI" to the RX line. It is used to store the Digital Object Identifier of a cited document. The format for this RX line topic is:

DOI=Digital_object_identifier;

The order of the optional topics in an RX line is:

RX   [MEDLINE=Medline_identifier; ][PubMed=Pubmed_identifier; ][DOI=Digital_object_identifier;]

Example:

RX   MEDLINE=97291283; PubMed=9145897; DOI=10.1007/s00248-002-2038-4;

Note: The length of a DOI is not restricted. If the topic DOI does not fit into an RX line that already contains a topic, a further RX line will be created, which may be longer than 76 characters.
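
Since the DOI topic may sit on the same RX line as the other topics or on a further RX line, a reader of the format has to merge all RX lines of a reference. A minimal sketch follows, assuming that no identifier contains a semicolon. The function name parse_rx_lines is hypothetical.

```python
def parse_rx_lines(lines):
    """Collect the MEDLINE, PubMed and DOI topics of one reference block."""
    topics = {}
    for line in lines:
        if not line.startswith("RX   "):
            raise ValueError("not an RX line: %r" % line)
        for item in line[5:].split(";"):
            item = item.strip()
            if not item:
                continue
            # Split on the first '=' only, as a DOI may itself contain '='.
            key, _, value = item.partition("=")
            topics[key] = value
    return topics


print(parse_rx_lines(
    ["RX   MEDLINE=97291283; PubMed=9145897; DOI=10.1007/s00248-002-2038-4;"]
))
```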

New line type: RG (Reference Group)

The new reference line 'RG' (Reference Group) has been introduced to list the consortium name associated with a given citation. The RG line is mainly used in submission reference blocks, but can also be used in paper references, if the working group is cited as an author in the paper.

Note: RA (Reference Author) and RG line can be present in the same reference block; at least one RA or RG line is mandatory per reference block.

The same line type has recently been introduced in the EMBL nucleotide sequence database.

The format for this line is:

RG   Consortium_name;

Examples:

RG   The C. elegans sequencing consortium;
RG   The Brazilian network for HIV isolation and characterization;
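
The rule that every reference block carries at least one RA or RG line lends itself to a simple check. The sketch below extracts the consortium names and enforces that rule. It assumes each RG line holds one complete name. The function name reference_groups and the sample block are hypothetical.

```python
def reference_groups(block):
    """Return the consortium names found on the RG lines of a reference block."""
    has_author = any(line.startswith("RA   ") for line in block)
    groups = [
        line[5:].strip().rstrip(";")
        for line in block
        if line.startswith("RG   ")
    ]
    if not has_author and not groups:
        raise ValueError("a reference block needs at least one RA or RG line")
    return groups


# Made-up reference block, for illustration only.
block = [
    "RN   [1]",
    "RG   The C. elegans sequencing consortium;",
]
print(reference_groups(block))
```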

Cross-references to EchoBASE

We have added cross-references to EchoBASE, the integrated post-genomic database for E. coli, available at http://www.biolws1.york.ac.uk/echobase/.

The identifiers of the appropriate DR line are:

Resource abbreviation EchoBASE
Resource identifier EchoBASE's unique identifier for a gene.
Example
O32528:
DR   EchoBASE; EB4119; -.

UniProt release 1.11

Published June 7, 2004

Headlines

Fungi and Swiss-Prot

As we are reaching the 10'000 entries mark for fungi in Swiss-Prot, we believe it is useful to inform our users that we are actively working on speeding up the annotation and re-annotation of fungal protein sequences, most notably those originating from the two model organisms Saccharomyces cerevisiae and Schizosaccharomyces pombe. We are currently building up a fungal annotation group which will soon consist of four annotators, three in Geneva and one in Hinxton.

UniProt news

Changes concerning keywords

New keywords:


Changes concerning the controlled vocabulary for PTMs

We are continuously overhauling the annotation of post-translational modifications (PTMs). For the feature key MOD_RES, the initially introduced controlled vocabularies for PTMs are:

  • Bromination: All entries with annotated bromination sites have the keyword Bromination.

     Bromohistidine
     6'-bromotryptophan
    
  • Deamidation

     Deamidated asparagine
     Deamidated glutamine
    
  • Formylation: All entries with annotated formylation sites have the keyword Formylation.

     N-formylmethionine
     N-formylglycine
    
  • Hydroxylation: All entries with annotated hydroxylation sites have the keyword Hydroxylation.

     Hydroxyproline
     3-hydroxyasparagine
     3-hydroxyaspartate
     5-hydroxylysine
     3-hydroxytryptophan
    
  • Iodination: All entries with annotated iodination sites have the keyword Iodination.

     Thyroxine
     Triiodothyronine
    
  • Other new controlled vocabularies for PTMs that are annotated with the feature key MOD_RES:

     3',4'-dihydroxyphenylalanine
     3-methylthioaspartic acid
     3-oxoalanine (Cys)
     3-oxoalanine (Ser)
     4-carboxyglutamate (Keyword: Gamma-carboxyglutamic acid)
     ADP-ribosylarginine
     ADP-ribosylcysteine
     Allysine
     Aspartyl isopeptide (Asn)
     Cysteine persulfide
     Pentaglycyl murein peptidoglycan amidated threonine
     PolyADP-ribosyl glutamic acid
     Pyruvic acid (Ser)
     Pyruvic acid (Cys)
     S-nitrosocysteine (Keyword: S-nitrosylation)
    
  • Unidentified modifications

     Alanine derivative
     Arginine derivative
     Cysteine derivative
     Glutamine derivative
     Isoleucine derivative
     Lysine derivative
     Methionine derivative
     Tryptophan derivative
    

UniProt release 1.10

Published May 24, 2004

UniProtKB news

New comment line (CC) topic: TOXIC DOSE

We have introduced a new comment (CC) line topic: TOXIC DOSE. This topic is used to store information on the poisoning potential (acute toxicity) of a toxin.

Generally this topic holds information on the LD(50) and PD(50). LD stands for "Lethal Dose". LD(50) is the amount of a toxin, given all at once, which causes the death of 50% (one half) of a group of test animals.

PD(50) stands for "Paralytic dose". It is the amount of a toxin, which causes the paralysis of 50% of a group of test animals.

Examples:

CC   -!- TOXIC DOSE: PD(50) is 1.72 mg/kg by injection in blowfly larvae.
CC   -!- TOXIC DOSE: LD(50) is 0.015 mg/kg by intravenous injection for
CC       sarafotoxin-A and sarafotoxin-B, and 0.3 mg/kg for sarafotoxin-C.

Changes concerning keywords

New keywords:


Changes concerning the controlled vocabulary for PTMs

We are continuously overhauling the annotation of post-translational modifications (PTMs). For the feature key MOD_RES, the initially introduced controlled vocabularies for PTMs are:

  • Acetylation: All entries with annotated acetylation sites have the keyword Acetylation.

    N-acetylalanine
    N-acetylaspartate
    N-acetylcysteine
    N-acetylglutamate
    N-acetylglycine
    N-acetylmethionine
    N-acetylproline
    N-acetylserine
    N-acetylthreonine
    N-acetyltyrosine
    N-acetylvaline
    N2-acetylarginine
    N6-acetyllysine
    
  • Amidation: All entries with annotated amidation sites have the keyword Amidation.

    Alanine amide
    Arginine amide
    Aspartic acid 1-amide
    Asparagine amide
    Cysteine amide
    Glutamic acid 1-amide
    Glutamine amide
    Glycine amide
    Histidine amide
    Isoleucine amide
    Leucine amide
    Lysine amide
    Methionine amide
    Phenylalanine amide
    Proline amide
    Serine amide
    Threonine amide
    Tryptophan amide
    Tyrosine amide
    Valine amide
    
  • Isomerization: All entries with annotated isomerization sites have the keyword D-amino acid.

    D-alanine (Ala)
    D-alanine (Ser)
    D-asparagine
    D-allo-isoleucine
    D-leucine
    D-methionine
    D-phenylalanine
    D-serine
    D-tryptophan
    
  • Other new controlled vocabularies for PTMs that are annotated with the feature key MOD_RES:

    2',4',5'-topaquinone (Keyword: TPQ)
    3-phenyllactic acid
    N6-1-carboxyethyl lysine
    2,3-didehydroalanine (Ser)
    2,3-didehydrobutyrine
    (Z)-2,3-didehydrotyrosine
    

UniProt release 1.9

Published May 4, 2004

Headlines

Swiss-Prot reaches 150'000 entries

With this release the number of Swiss-Prot entries has reached the 150'000 mark. It took about 9.5 years to reach the 50'000 entries mark (January 1996), almost 6 more years to reach 100'000 entries (September 2001) and about 2.5 years to the current 150'000 entries.

The continuous increase in the speed of annotation is due to a number of factors: the increase in the number of annotators working for Swiss-Prot at SIB and EBI, the increase in the productivity of these annotators, the implementation and improvement of software tools that help to automate some annotation tasks, facilitated access to many third-party resources, and the gradual rise in the quality of the underlying DNA sequences as well as of genomic and cDNA annotation.

UniProtKB News

Changes concerning the controlled vocabulary for PTMs

We are continuously overhauling the annotation of post-translational modifications (PTMs). Methylation sites are described in the description field of the feature key MOD_RES; all entries with such a site contain the keyword 'Methylation'. The initially defined controlled vocabulary for methylation sites is listed below:

N-methylalanine
N,N,N-trimethylalanine
Omega-N-methylated arginine
Omega-N-methylarginine
Asymmetric dimethylarginine
Symmetric dimethylarginine
5-methylarginine
N5-methylarginine
N4-methylasparagine
N4,N4-dimethylasparagine
S-methylcysteine
Cysteine methyl ester
2-methylglutamine
N5-methylglutamine
Glutamate methyl ester (Gln)
Glutamate methyl ester (Glu)
Methylhistidine
Pros-methylhistidine
Tele-methylhistidine
N-methylisoleucine
N-methylleucine
Leucine methyl ester
N6-methylated lysine
N6-methyllysine
N6,N6-dimethyllysine
N6,N6,N6-trimethyllysine
Lysine methyl ester
N-methylmethionine
N-methylphenylalanine
N,N-dimethylproline
N-methyltyrosine

Changes concerning keywords

Deleted keyword:

  • Egg white

UniProt release 1.8

Published April 26, 2004

Headlines

Two new completely annotated microbial proteomes

In the framework of the HAMAP project we not only annotate specified microbial protein families, but we also aim to completely annotate all the proteins from a number of selected microbial genomes.

We maintain pages that list complete bacterial and archaeal proteomes and which report the status of completion of the annotations in Swiss-Prot.

We have now completed the annotation of two more microbial genomes, namely those of Buchnera aphidicola (subsp. Baizongia pistaciae) and Methanococcus jannaschii. The total number of microbial genomes where all proteins are annotated in Swiss-Prot is now 8 and more are yet to come.

UniProtKB news

Cross-references to Structural and functional annotation of Arabidopsis thaliana gene and protein families (GeneFarm)

We have added cross-references to the Structural and functional annotation of Arabidopsis thaliana gene and protein families (GeneFarm), available at http://genoplante-info.infobiogen.fr/Genefarm/index.htpl.

The identifiers of the appropriate DR line are:

Resource abbreviation GeneFarm
Resource identifier GeneFarm's unique identifier for a gene.
Optional information 1 GeneFarm's identifier for a gene family.
Example
O04500:
DR   GeneFarm; 1671; 91.

Changes concerning the controlled vocabulary for PTMs

We are continuously overhauling the annotation of post-translational modifications (PTMs). Sulfation sites are described in the description field of the feature key MOD_RES; all entries with such a site contain the keyword 'Sulfation'. The initially defined controlled vocabulary for sulfation sites is listed below:

Sulfotyrosine
Sulfoserine
Sulfothreonine

UniProt release 1.7

Published April 13, 2004

Headlines

Extinct organisms and Swiss-Prot...

Did you know that Swiss-Prot contains proteins originating from extinct organisms? Since the beginning of the 90s, various groups have sequenced gene fragments from a variety of extinct organisms. Most of the time, the resulting sequences are too small or too fragmentary to be translated into protein sequences. But this is not always the case, and we harbor a few complete or partial sequences originating from species that existed on earth in various periods of time.

For example we have a RuBisCO large subunit from a fossil leaf of a Miocene (17-20 Myr old) Magnolia, P30828.

Much more recent is a complete cytochrome b sequence from a Siberian mammoth, P92658.

But what is more interesting for those interested in the longevity of proteins is the complete sequence of an osteocalcin from a steppe bison. This sequence was obtained by mass spectrometry directly from permafrost fossilized bones, about 55-56 Kyr old, P83489.

UniProtKB news

Cross-references to Oxford GlycoProteomics 2-DE database (OGP)

We have added cross-references to the Oxford GlycoProteomics 2-DE database (OGP), available at http://proteomewww.bioch.ox.ac.uk/2d/2d.html.

The identifiers of the appropriate DR line are:

Resource abbreviation OGP
Resource identifier OGP's unique identifier for a protein, which is identical to the Swiss-Prot primary AC number of that protein.
Example
P31946:
DR   OGP; P31946; -.

Cross-references to 2-DE database of rat heart

We have added cross-references to the 2-DE database of rat heart, at German Heart Institute Berlin, available at http://www.mpiib-berlin.mpg.de/2D-PAGE/RAT-HEART/2d/.

The identifiers of the appropriate DR line are:

Resource abbreviation Rat-heart-2DPAGE
Resource identifier Rat-heart-2DPAGE's unique identifier for a protein, which is identical to the Swiss-Prot primary AC number of that protein.
Example
P03996:
DR   Rat-heart-2DPAGE; P03996; -.

Changes concerning keywords

New keyword:


Changes concerning the controlled vocabulary for PTMs

We are continuously overhauling the annotation of post-translational modifications (PTMs). Phosphorylation sites are described in the description field of the feature key MOD_RES; all entries with such a site contain the keyword 'Phosphorylation'. The initially defined controlled vocabulary for phosphorylation sites is listed below:

Phosphocysteine
4-aspartylphosphate
Phosphohistidine
Tele-phosphohistidine
Pros-phosphohistidine
Phosphoserine
Phosphothreonine
Phosphotyrosine

UniProt release 1.6

Published March 29, 2004

Headlines

UniProtKB/Swiss-Prot major release (43.0)

Release 43.0 of Swiss-Prot contains 146'720 sequence entries, comprising 54'093'154 amino acids abstracted from 113'719 references. 10'760 sequences have been added since release 42, the sequence data of 663 existing entries has been updated and the annotations of 44'948 entries have been revised. This represents an increase of 8%.

Many improvements were carried out in the last 6 months at the level of the CC and FT lines.

Full statistics and release notes

UniProtKB news

Discontinuation of the plain text versions of the user manual and release notes

Both the userman.txt and relnotes.txt files have been replaced by HTML-formatted versions, userman.html and relnotes.html. The plain text versions of these files are no longer available.

UniProt release 1.5

Published March 15, 2004

UniProtKB news

Cross-references to Rat Genome Database (RGD)

We have added cross-references to the Rat Genome Database (RGD), available at http://rgd.mcw.edu/, which collects data from rat genetic and genomic research efforts and provides curation of mapped positions for quantitative trait loci, known mutations and other phenotypic data.

The identifiers of the appropriate DR line are:

Resource abbreviation RGD
Resource identifier RGD's unique identifier for a gene.
Optional information 1 RGD's gene symbol.
Example
O08557:
DR   RGD; 70968; Ddah1.

Changes concerning keywords

New keyword:

Changes concerning the controlled vocabulary for PTMs

New terms for the feature key 'CROSSLNK':

  • Isoaspartyl glycine isopeptide (Gly-Asn)
  • Isoglutamyl glycine isopeptide (Gly-Glu)

New term for the feature key 'LIPID':

  • GPI-anchor amidated carboxyl end

UniProt release 1.4

Published March 1, 2004

Headlines

More than 500'000 comment blocks in Swiss-Prot

One of the important aspects of the annotation process is to provide, for each protein, a description of a number of meaningful biological elements such as the function or role of a protein, its subcellular location, its membership in a specific family, etc. All of this information is stored in the comments field (CC). Comments are organized by topics, 24 types of which are currently defined. A specific comment can consist of several sentences or other textual elements, which are grouped into what we term a comment block.

The total number of comment blocks has now reached the 500'000 mark, which corresponds to an average of 3.5 blocks per Swiss-Prot entry.

UniProtKB news

Changes concerning keywords

New keyword:

UniProt release 1.3

Published February 16, 2004

UniProtKB news

Changes concerning keywords

New keyword:

UniProt release 1.2

Published February 2, 2004

Headlines

SPIN - the new web tool for sequence submission to Swiss-Prot

A new web-based tool, SPIN, is available for submitting directly sequenced protein sequences and their biological annotations to the Swiss-Prot Protein Knowledgebase. SPIN guides you through a sequence of WWW forms allowing interactive submission. The information required to create a database entry will be collected during this process.

Annotation updates for existing Swiss-Prot entries are highly appreciated and should be submitted via the "Submit Update" button at the top of any entry. User update requests are treated with a high priority by our annotators.

UniProt release 1.1

Published January 16, 2004

Headlines

10,000 different citations for JBC in Swiss-Prot

The Journal of Biological Chemistry (generally known as JBC) has always been a gold mine for publications directly relevant to the scope of Swiss-Prot. Starting with the first release in 1986 and up to now it has always been the most cited journal in Swiss-Prot. We are now citing about 10,000 different JBC papers in Swiss-Prot. This is almost twice the value for the next most cited journal, PNAS (Proceedings of the National Academy of Sciences of the U.S.A.).

It is also noteworthy that JBC was the first major life science journal to be available as full text on the WWW. It is therefore a good opportunity to thank the JBC editorial board and its staff for the great service they are providing to the Life Sciences community.

UniProtKB news

Changes concerning keywords

New keywords:

Deleted keywords:

  • Liver
  • Muscle
  • Nerve
  • Neurone

New documentation file strains.txt

The strain information is usually given in the RC line of the reference block, but can also be indicated in the Organism (OS) line of a database entry. Strain names are a controlled vocabulary, and we have created a list of the strains and their synonyms. This information is now available in the documentation file strains.txt, together with the mnemonic species identification code representing the biological source of the protein in the knowledgebase.

New format of the documentation file keywlist.txt

Keywords are a controlled vocabulary, and their annotation follows strict rules. As biological terms can have several meanings, we have added to each keyword in the list a definition of its usage in the knowledgebase, along with further information such as synonyms or relevant GO terms.

Please note that the file header has changed and that each keyword entry is now formatted as follows:

Line code  Content                         Occurrence in an entry

ID         Identifier (keyword)            Once; starts an entry
AC         Accession (KW-xxxx)             Once
DE         Definition                      Once or more
SY         Synonyms                        Optional; Once or more
GO         Gene ontology (GO) mapping      Optional; Once or more
HI         Hierarchy                       Optional; Once or more
WW         Interesting WWW site            Optional; Once or more
CA         Category                        Once
//         Terminator                      Once; ends an entry

Example of a keyword definition entry:

ID   Acetoin catabolism.
AC   KW-0006
DE   Protein involved in the degradation of acetoin (3-hydroxy-2-butanone).
DE   Acetoin is a component of the butanediol cycle (butanediol
DE   fermentation) in microorganisms.
SY   Acetoin degradation.
GO   GO:0045150; acetoin catabolism
CA   Biological process; Pathway.
//
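
The line-code layout above lends itself to a simple reader. The following is a minimal sketch, not an official tool: the function name `parse_keyword_entries` is hypothetical, and it assumes that every line starts with a two-character line code followed by three spaces, and that "//" ends an entry.

```python
SAMPLE = """\
ID   Acetoin catabolism.
AC   KW-0006
DE   Protein involved in the degradation of acetoin (3-hydroxy-2-butanone).
DE   Acetoin is a component of the butanediol cycle (butanediol
DE   fermentation) in microorganisms.
SY   Acetoin degradation.
GO   GO:0045150; acetoin catabolism
CA   Biological process; Pathway.
//
"""


def parse_keyword_entries(text):
    """Return one dict per entry, mapping each line code to its content lines."""
    entries = []
    current = {}
    for raw in text.splitlines():
        line = raw.rstrip()
        if not line:
            continue
        if line.startswith("//"):
            # The terminator closes the entry collected so far.
            if current:
                entries.append(current)
            current = {}
            continue
        # Assumed layout: 2-character line code, 3 spaces, then the content.
        code, content = line[:2], line[5:]
        current.setdefault(code, []).append(content)
    return entries


entry = parse_keyword_entries(SAMPLE)[0]
definition = " ".join(entry["DE"])  # DE may occur more than once
print(entry["ID"][0], entry["AC"][0])
print(definition)
```

Keeping every line code as a list reflects the "Once or more" occurrences in the table; optional codes such as HI or WW are simply absent from the dict when an entry does not use them.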

Change in the name of the files containing deleted AC numbers

As the UniProt knowledgebase documentation now covers both Swiss-Prot and TrEMBL, we have renamed the "deleteac.txt" files containing deleted AC numbers to

  • "delac_sp.txt" (for deleted Swiss-Prot ACs)
  • "delac_tr.txt" (for deleted TrEMBL ACs).

Discontinuation of the embltosp.txt index file

The embltosp.txt file, which contained an index of EMBL Nucleotide Sequence Database entries referenced in Swiss-Prot, is no longer available.

UniProt release 1.0

Published December 15, 2003

Headlines

First release of UniProt

Release 42.7 of Swiss-Prot is integrated in the first release of UniProt, the Universal Protein Resource. Swiss-Prot and TrEMBL are the two sections of the UniProt Knowledgebase.

Further reading about the first UniProt release:

Old Swiss-Prot releases

Published November 28, 2003

Swiss-Prot release 42.6 of 28-Nov-2003

New comment line (CC) topic RNA EDITING

We have introduced a new comment (CC) line topic: 'RNA EDITING'. This topic is used to convey information relevant to all types of RNA editing that lead to one or more amino acid changes.

The format of this comment block is:

CC   -!- RNA EDITING: Modified_positions={x[, y, z, ...] | Not_applicable | Undetermined}[; Note=Text].

Examples:

CC   -!- RNA EDITING: Modified_positions=393, 431, 452, 495.
CC   -!- RNA EDITING: Modified_positions=59, 78, 94, 98, 102, 121; Note=The
CC       stop codon at position 121 is created by RNA editing. The nonsense
CC       codon at position 59 is modified to a sense codon.
CC   -!- RNA EDITING: Modified_positions=Not_applicable; Note=Some
CC       positions are modified by RNA editing via nucleotide insertion or
CC       deletion. The initiator methionine is created by RNA editing.

The free text in the 'Note' is standardized.

All entries with such a topic have the keyword RNA editing.
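
To make the format concrete, here is a small sketch of how such a comment block could be read. The helper names `join_cc_block` and `parse_rna_editing` are hypothetical, the "Listed" status label is invented for this example, and the code assumes the block follows the format given above exactly.

```python
def join_cc_block(lines):
    """Join the physical CC lines of one comment block into a single string."""
    parts = []
    for line in lines:
        text = line[2:].strip()       # drop the 'CC' line code
        if text.startswith("-!-"):    # drop the block marker on the first line
            text = text[3:].strip()
        parts.append(text)
    return " ".join(parts)


def parse_rna_editing(comment):
    """Parse 'RNA EDITING: Modified_positions=...[; Note=Text].'"""
    prefix = "RNA EDITING: Modified_positions="
    if not comment.startswith(prefix):
        raise ValueError("not an RNA EDITING comment")
    value, _, note = comment[len(prefix):].partition("; Note=")
    value = value.rstrip(".")
    if value in ("Not_applicable", "Undetermined"):
        status, positions = value, []
    else:
        # 'Listed' is a label made up for this sketch.
        status = "Listed"
        positions = [int(p) for p in value.split(",")]
    return {"status": status, "positions": positions, "note": note or None}


block = [
    "CC   -!- RNA EDITING: Modified_positions=59, 78, 94, 98, 102, 121; Note=The",
    "CC       stop codon at position 121 is created by RNA editing. The nonsense",
    "CC       codon at position 59 is modified to a sense codon.",
]
parsed = parse_rna_editing(join_cc_block(block))
print(parsed["positions"])
print(parsed["note"])
```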

Changes concerning keywords

New keyword:

Swiss-Prot release 42.5 of 21-Nov-2003

Headlines: Monkey business!

Comparing the human genome with those of higher apes such as chimpanzees, gibbons, gorillas and orangutans was for a long time a wish of many life scientists.

It is becoming a reality thanks to various sequencing initiatives targeted toward the elucidation of primate genomic sequences. However, it will take some time before a significant number of high-quality complete protein sequences are available. In the meantime, we are trying to ensure that whenever a higher ape sequence that corresponds to a cognate human protein is available, that sequence gets annotated very quickly.

For example, in the last two weeks, the number of annotated chimpanzee protein sequences in Swiss-Prot has doubled.

Swiss-Prot release 42.4 of 14-Nov-2003

Content changes in the speclist.txt document file

The speclist.txt file lists the organism identification codes which are used to build the "organism" part of an entry name (examples: ARATH, BACSU, DROME, HUMAN, etc.). For each organism code, this file contains the corresponding NCBI taxonomic database node identifier (TaxID) as well as the official (scientific) name and, optionally, the common name and synonym.

Up to now, organism identification codes were only used in Swiss-Prot, where all species represented in the database are associated with such a code. The TrEMBL section of the combined UniProt knowledgebase will soon also make use of entry names that are based on the species of origin. As it is not possible to manually assign organism codes to all species represented in TrEMBL within a reasonable time frame, it was decided to define "virtual" codes that regroup organisms at a certain taxonomic level. Such codes are prefixed by the number "9" and generally correspond to a "pool" of organisms, which can be as wide as a kingdom. Here are some examples of such codes:

9BACT B      2: N=Bacteria
9CNID E   6073: N=Cnidaria
9FUNG E   4751: N=Fungi
9REOV V  10880: N=Reoviridae
9TETR E  32523: N=Tetrapoda
9VIRI E  33090: N=Viridiplantae

The list of all the "9" codes that have been defined has now been integrated as a subsection of the speclist.txt file.
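
For readers who process speclist.txt programmatically, the sketch below reads lines shaped like the examples above and flags the virtual "9" codes. It is an assumption-laden illustration: the names `SpeciesCode` and `parse_species_line` are hypothetical, the second column is treated as an opaque one-letter code, and only the "N=" (scientific name) line is handled.

```python
import re
from collections import namedtuple

SpeciesCode = namedtuple("SpeciesCode", "code kingdom taxid name is_virtual")

# Assumed layout: code, one-letter column, TaxID followed by ':', then 'N=name'.
_LINE = re.compile(r"^(\w{1,5})\s+([A-Z])\s+(\d+):\s+N=(.+)$")


def parse_species_line(line):
    """Parse one 'code letter taxid: N=name' line from the examples above."""
    match = _LINE.match(line.strip())
    if match is None:
        raise ValueError("unrecognised speclist line: %r" % line)
    code, kingdom, taxid, name = match.groups()
    # Virtual codes are the ones prefixed by the number 9.
    return SpeciesCode(code, kingdom, int(taxid), name, code.startswith("9"))


for text in ("9BACT B      2: N=Bacteria", "9TETR E  32523: N=Tetrapoda"):
    print(parse_species_line(text))
```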

Changes concerning the controlled vocabulary for PTMs

New terms for the Feature key 'LIPID':

  • GPI-anchor amidated residue
  • Omega-hydroxyceramide glutamate ester
  • Phosphatidylethanolamine amidated glycine

Changes concerning keywords

New keyword:

Swiss-Prot release 42.3 of 07-Nov-2003

Headlines: More than 10'000 human proteins have been annotated

In the framework of the HPI project, we have annotated more than 10'000 proteins (almost 10'300). The number of genes represented is not exactly equal to the number of proteins, for at least four reasons:

  • We have entries that describe proteins encoded by more than one gene but whose amino acid sequences are 100% identical;
  • We sometimes are unable to describe highly divergent splice isoforms in one entry and these genes are therefore represented by two or more Swiss-Prot entries;
  • For MHC histocompatibility antigens, immunoglobulin and T cell receptors, we often have several entries representing groups of alleles;
  • A very small number of human entries probably represent "bogus" proteins originating from either pseudogenes or from contaminants.

But even taking the above factors into account, we do have more than 10'000 protein-encoding genes represented in Swiss-Prot.

Cross-references to DictyBase

We have added cross-references to the DictyBase database (available at http://dictybase.org/), an online informatics resource for Dictyostelium discoideum. DictyBase's goals are to provide a single portal for access to Dictyostelium genome information and curated Dictyostelium literature, to facilitate access to experimental resources such as the Dictyostelium stock center, and to provide an online presence for the Dictyostelium community.

The identifiers of the appropriate DR line are:

Resource abbreviation DictyBase
Resource identifier DictyBase's unique identifier for a gene.
Optional information 1 DictyBase's gene symbol.
Example
P34092:
DR   DictyBase; DDB0002013; myoB.

Cross-references to DictyDb

Due to the availability of DictyBase (see above) and in agreement with the maintainers of both databases, we have removed all cross-references to the DictyDb database.

Cross-references to PhotoList

We have added cross-references to the PhotoList database (available at http://genolist.pasteur.fr/PhotoList/), a database dedicated to the analysis of the genome of Photorhabdus luminescens strain TT01.

The identifiers of the appropriate DR line are:

Resource abbreviation PhotoList
Resource identifier PhotoList's unique identifier for an ORF.
Example
Q8KM01:
DR   PhotoList; plu1253; -.

Changes concerning keywords

New keyword:

Deleted keywords:

  • B-cell
  • Bone

Swiss-Prot release 42.1 of 24-Oct-2003

Format change in the jourlist.txt document file

The jourlist.txt file lists the titles and abbreviations of all journals cited in Swiss-Prot. This file also includes other types of information such as ISSN and CODEN identifiers, publishers, web sites, etc. As of this release, we have added a field for the ISSN of the electronic (on-line) version of journals. This field, termed "e-ISSN", is optional.

Example:

Abbrev: Acta Haematol.
Title : Acta Haematologica
ISSN  : 0001-5792
e-ISSN: 1421-9662
CODEN : ACHAAH
Publis: Karger AG
Server: http://www.karger.com/journals/aha/
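
A record like the one above can be read into a dictionary with very little code. This is only a sketch: `parse_journal_record` is a hypothetical name, and it assumes each line is a field name, a colon, then the value, as in the example. Splitting on the first colon only keeps URLs in the Server field intact.

```python
def parse_journal_record(lines):
    """Map each 'Field : value' line of one journal record to a dict entry."""
    record = {}
    for line in lines:
        key, sep, value = line.partition(":")  # split on the first colon only
        if not sep:
            continue  # skip lines that carry no field
        record[key.strip()] = value.strip()
    return record


record = parse_journal_record([
    "Abbrev: Acta Haematol.",
    "Title : Acta Haematologica",
    "ISSN  : 0001-5792",
    "e-ISSN: 1421-9662",
    "CODEN : ACHAAH",
    "Publis: Karger AG",
    "Server: http://www.karger.com/journals/aha/",
])
# The e-ISSN field is optional, so read it with a default.
print(record["Abbrev"], record.get("e-ISSN", "(no e-ISSN)"))
```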

Changes concerning keywords

New keyword:

Deleted keywords:

  • Alkylation
  • Brain
  • Cartilage

Swiss-Prot release 42.0 of 10-Oct-2003

Headlines: New major release is available (42.0)

Release 42.0 of Swiss-Prot contains 135'850 sequence entries, comprising 50'046'799 amino acids abstracted from 109'694 references. 13'374 sequences have been added since release 41, the sequence data of 1'298 existing entries has been updated and the annotations of 45'617 entries have been revised. This represents an increase of 11%.

Many improvements were carried out in the last 6 months at the level of the CC and FT lines. All the recent changes to Swiss-Prot format are described in detail in the continuously updated document:

Swiss-Prot release 41.26 of 04-Oct-2003

Controlled vocabulary in the feature (FT) key LIPID

We have revised the annotation of post-translationally modified amino acids in lipoproteins, and made a major overhaul of the controlled vocabulary. Lipid annotation that was covered by feature (FT) keys other than LIPID, e.g. cholesterol-binding, has been moved accordingly.

The currently defined controlled vocabulary for the feature descriptions of 'LIPID' FT lines is listed below:

Cholesterol glycine ester
Cis-14-hydroxy-10,13-dioxo-7-heptadecenoic acid aspartate ester
GPI-anchor amidated alanine
GPI-anchor amidated asparagine
GPI-anchor amidated aspartate
GPI-anchor amidated cysteine
GPI-anchor amidated glycine
GPI-anchor amidated serine
GPI-anchor amidated threonine
GPI-like-anchor amidated glycine
GPI-like-anchor amidated serine
N-myristoyl glycine
N-palmitoyl cysteine
N(6)-myristoyl lysine
N(6)-palmitoyl lysine
O-octanoyl serine
O-palmitoyl serine
O-palmitoyl threonine
Phosphotidylethanolamine amidated glycine
S-12-hydroxyfarnesyl cysteine
S-archaeol cysteine
S-diacylglycerol cysteine
S-farnesyl cysteine
S-geranylgeranyl cysteine
S-myristoyl cysteine
S-palmitoleyl cysteine
S-palmitoyl cysteine

Swiss-Prot release 41.24 of 19-Sep-2003

Changes concerning keywords

Deleted keyword:

  • T-DNA

Swiss-Prot release 41.22 of 29-Aug-2003

Changes concerning keywords

Modified keywords:

New keyword:

Swiss-Prot release 41.21 of 22-Aug-2003

Changes concerning keywords

Modified keyword:

New keyword:

Swiss-Prot release 41.20 of 16-Aug-2003

Case and wording change for submissions to Swiss-Prot in reference location (RL) lines

While proceeding with the conversion to mixed case of the different line types of a Swiss-Prot entry, we have decided to do the same for the name of our database, i.e. we are now using "Swiss-Prot" (instead of the former "SWISS-PROT") as the prevalent way of referring to it. This change affects the Swiss-Prot RL (reference location) lines of entries which were submitted directly to Swiss-Prot, and which the authors have not (yet) published. At the same time, we have changed the wording of those lines.

Former format:

RL   Submitted (MAY-2002) to the SWISS-PROT data bank.

New format:

RL   Submitted (MAY-2002) to Swiss-Prot.

Note: RL lines concerning submissions to EMBL/GenBank/DDBJ, PDB and other databases are not affected by this modification.

New comment line (CC) topic ALLERGEN

We have introduced a new comment (CC) line topic type: ALLERGEN. This topic is used to convey information relevant to allergenic proteins.

The format of this comment block is:

CC   -!- ALLERGEN: Text.

Examples:

 P19121:
CC   -!- ALLERGEN: Causes an allergic reaction in human. Binds IgE. It is a
CC       partially heat-labile allergen that may cause both respiratory and
CC       food-allergy symptoms in patients with the bird-egg syndrome.
 Q28050:
CC   -!- ALLERGEN: Causes an allergic reaction in human. Minor allergen of
CC       bovine dander.

Swiss-Prot release 41.18 of 25-Jul-2003

Headlines: Annotation of microbial H(+)-translocating pyrophosphatases

We have annotated the microbial H(+)-translocating pyrophosphatases present in the acidocalcisome, the first eukaryotic organelle to be found in bacteria.

Acidocalcisomes are organelles that have an acidic nature and a high electron density, and contain high concentrations of calcium, magnesium, pyrophosphate and polyP. They were originally found in unicellular eukaryotes, such as Toxoplasma gondii and trypanosomatids. It has been postulated that acidocalcisomes may have an important role as an energy source and in the regulation of intracellular pH, calcium concentration and osmotic conditions.

Now the group of Roberto Docampo has found them in the bacterium Agrobacterium tumefaciens. This is the first organelle to be found in bacteria that has a direct counterpart in eukaryotes. The typical characteristic of the acidocalcisome is the presence of a number of pumps and exchangers, one of which is the H(+)-translocating pyrophosphatase (H+-PPase). This pump generates a proton motive force and may be responsible for the synthesis of pyrophosphate. H+-PPases are found in several bacteria and archaea, and at present it is unknown whether any of these are also localized in acidocalcisomes. As these pumps are present in some pathogenic bacteria but not in humans, drugs that target them might be effective against these infections.

Changes concerning keywords

Modified keyword:

Swiss-Prot release 41.17 of 19-Jul-2003

Cross-references to GermOnline

We have added cross-references to the GermOnline database (available at http://germonline.unibas.org/), which is maintained by the Genome Bioinformatics group of the SIB Swiss Institute of Bioinformatics. GermOnline is a gateway for gametogenesis. Its goals are to provide rapid access to a comprehensive compilation of genes, expression data and functions implicated in germline development, meiosis, gamete formation, and gamete function in 11 key model systems and H. sapiens. At this time, the majority of cross-references in Swiss-Prot concern Saccharomyces cerevisiae gene expression data.

The identifiers of the appropriate DR line are:

Resource abbreviation GermOnline
Resource identifier GermOnline's identifier for a gene.
Example
P58012:
DR   GermOnline; 305011; -.

Swiss-Prot release 41.16 of 11-Jul-2003

Changes concerning keywords

New keywords:

Swiss-Prot release 41.14 of 27-Jun-2003

Changes concerning keywords

New keywords:

Swiss-Prot release 41.12 of 16-Jun-2003

New feature key CROSSLNK, and removal of the feature keys THIOETH and THIOLEST

The feature key CROSSLNK has been introduced to describe bonds between amino acids, which are formed posttranslationally within a peptide or between peptides, such as isopeptidic bonds, carbon-carbon linkages, carbon-nitrogen linkages, thioether bonds, thiolester bonds, and backbone condensations.

Format:

FT   CROSSLNK    from     to      Description.

The initially defined controlled vocabulary is listed below:

1'-histidyl-3'-tyrosine (His-Tyr)
2-cysteinyl-L-phenylalanine (Cys-Phe)
2-cysteinyl-D-phenylalanine (Cys-Phe)
2-cysteinyl-D-allo-threonine (Cys-Thr)
2-iminomethyl-5-imidazolinone (Gln-Gly)
2-oxazoline (Cys-Ser)
2'-(S-cysteinyl)histidine (Cys-His)
3-cysteinyl-aspartic acid (Cys-Asp)
3'-histidyl-3-tyrosine (His-Tyr)
3'-(S-cysteinyl)-tyrosine (Cys-Tyr)
4-cysteinyl-glutamic acid (Cys-Glu)
4'-cysteinyl-tryptophylquinone (Cys-Trp)
5-imidazolinone (Ser-Gly)
5-imidazolinone (Ala-Gly)
5-imidazolinone (Cys-Gly)
Beta-methyllanthionine (Cys-Thr)
Beta-methyllanthionine (Thr-Cys)
Beta-methyllanthionine sulfoxide (Cys-Thr)
Isoaspartyl glycine isopeptide (Asn-Gly)
Isoaspartyl lysine isopeptide (Lys-Asn) (interchain with N-...)
Isoaspartyl lysine isopeptide (Asn-Lys) (interchain with K-...)
Isodityrosine (Tyr-Tyr)
Isoglutamyl cysteine thioester (Gln-Cys)
Isoglutamyl lysine isopeptide (Lys-Gln)
Isoglutamyl lysine isopeptide (Gln-Lys)
Isoglutamyl lysine isopeptide (Gln-Lys) (interchain with K-...)
Isoglutamyl lysine isopeptide (Lys-Gln) (interchain with Q-...)
Lanthionine (Ser-Cys)
Lanthionine (Cys-Ser)
Lysinoalanine (Lys-Ser)
Lysine tyrosylquinone (Lys-Tyr)
Lysinoalanine (Ser-Lys)
Lysyl topaquinone (Lys-Tyr)
N-isoaspartyl cysteine isopeptide (Asn-Cys)
Oxazole (Cys-Ser)
Oxazole (Gly-Ser)
Pyrroloquinoline quinone (Glu-Tyr)
S-(2-aminovinyl)-D-cysteine (Ser-Cys)
S-(2-aminovinyl)-3-methyl-D-cysteine (Thr-Cys)
Thiazole (Gly-Cys)
Thiazole (Ser-Cys)
Thiazole (Phe-Cys)
Thiazole (Cys-Cys)
Thiazole (Lys-Cys)
Tryptophan tryptophylquinone (Trp-Trp)
Glycyl lysine isopeptide (Gly-Lys) (interchain with K-...)
Glycyl lysine isopeptide (Lys-Gly) (interchain with G-...)
Ubiquitinyl cysteine thioester (Cys)

Examples:

 P01024:
FT   CROSSLNK   1010   1013       Isoglutamyl cysteine thioester (Cys-Gln).
 P29827:
FT   CROSSLNK     60     77       Beta-methyllanthionine (Cys-Thr).
FT   CROSSLNK     63     73       Lanthionine (Ser-Cys).
FT   CROSSLNK     64     70       Beta-methyllanthionine (Cys-Thr).
FT   CROSSLNK     65     78       Lysinoalanine (Ser-Lys).
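
As a rough illustration of how the CROSSLNK lines above might be consumed, the sketch below extracts the two positions, the description, and the residue pair given in parentheses. The name `parse_crosslnk` is hypothetical, and the code assumes whitespace-separated integer positions and a description that ends with a period, as in the examples.

```python
import re

# Assumed shape: 'FT', the key, two integer positions, then the description.
_FT = re.compile(r"^FT\s+CROSSLNK\s+(\d+)\s+(\d+)\s+(.+?)\.?$")
# A residue pair such as '(Cys-Thr)': two three-letter amino acid codes.
_PAIR = re.compile(r"\(([A-Z][a-z]{2})-([A-Z][a-z]{2})\)")


def parse_crosslnk(line):
    """Parse one single-line 'FT   CROSSLNK from to Description.' feature."""
    match = _FT.match(line.rstrip())
    if match is None:
        raise ValueError("not a CROSSLNK feature line: %r" % line)
    start, end, description = match.groups()
    pair = _PAIR.search(description)
    return {
        "from": int(start),
        "to": int(end),
        "description": description,
        "residues": pair.groups() if pair else None,
    }


print(parse_crosslnk("FT   CROSSLNK     60     77       Beta-methyllanthionine (Cys-Thr)."))
```

Terms without a residue pair, such as "Ubiquitinyl cysteine thioester (Cys)", yield `None` for the residues rather than raising an error.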

Note: The feature keys THIOETH and THIOLEST have been removed. Various bonds between amino acids that used to be described by the feature keys BINDING, MOD_RES or SITE will progressively, in groups according to the type of PTM, be modified and indicated by CROSSLNK. Disulfide bonds occur so often in proteins that we decided to keep the special feature key DISULFID to annotate this kind of linkage.

Changes concerning keywords

New keywords:

Swiss-Prot release 41.10 of 30-May-2003

Reference Comment (RC) line topics may span lines

The RC (Reference Comment) lines store comments relevant to the reference cited, currently under 5 distinct topics: PLASMID, SPECIES, STRAIN, TISSUE and TRANSPOSON. It is not always possible to list all information within one line. Therefore we allow multiple RC lines, in which one topic may span more than one line. Example:

 Q9EVG8:
RC   STRAIN=AZ.026, DC.005, GA.039, GA2181, IL.014, IN.018, KY.172, KY2.37,
RC   LA.013, MN.001, MNb027, MS.040, NY.016, OH.036, TN.173, TN2.38,
RC   UT.002, AL.012, AZ.180, MI.035, VA.015, and IL2.17;
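
Parsers that previously read RC lines one at a time need to join consecutive RC lines before splitting on topics. The following is a minimal sketch of that idea, with a hypothetical function name `parse_rc_lines`. It assumes topics are written as TOPIC=value, separated by ";", with values separated by commas and an optional "and" before the last value, as in the example above. A two-value list written "A and B" without a comma is not handled.

```python
def parse_rc_lines(lines):
    """Join consecutive RC lines, then split them into topics and values."""
    joined = " ".join(line[5:].strip() for line in lines if line.startswith("RC"))
    topics = {}
    for token in joined.split(";"):
        token = token.strip()
        if not token:
            continue
        topic, _, value = token.partition("=")
        values = [v.strip() for v in value.split(",")]
        # Drop the 'and' that precedes the last value of a list.
        values = [v[4:] if v.startswith("and ") else v for v in values]
        topics[topic.strip()] = [v for v in values if v]
    return topics


rc = parse_rc_lines([
    "RC   STRAIN=AZ.026, DC.005, GA.039, GA2181, IL.014, IN.018, KY.172, KY2.37,",
    "RC   LA.013, MN.001, MNb027, MS.040, NY.016, OH.036, TN.173, TN2.38,",
    "RC   UT.002, AL.012, AZ.180, MI.035, VA.015, and IL2.17;",
])
print(len(rc["STRAIN"]), rc["STRAIN"][0], rc["STRAIN"][-1])
```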

Cross-references to Genome Knowledgebase (GK)

We have added cross-references to the Genome Knowledgebase (GK) (available at http://www.genomeknowledge.org/), which is a collaboration among Cold Spring Harbor Laboratory, The European Bioinformatics Institute, and The Gene Ontology Consortium to develop a curated resource of core pathways and reactions in human biology.

The identifiers of the appropriate DR line are:

Resource abbreviation GK
Resource identifier GK's unique identifier for a protein, which is identical to the Swiss-Prot primary AC number of that protein.
Example
Q9BZJ0:
DR   GK; Q9BZJ0; -.

Cross-references to PIR SuperFamilies of iProClass

We have added cross-references to the PIR SuperFamilies of iProClass (available at http://pir.georgetown.edu/iproclass/), which is an integrated protein classification database.

The identifiers of the appropriate DR line are:

Resource abbreviation PIRSF
Resource identifier iProClass superfamily number.
Optional information 1 Name for a superfamily.
Optional information 2 Number of hits found in the sequence, which is generally '1'.
Example
O28076:
DR   PIRSF; PIRSF006414; FTR; 1.

Swiss-Prot release 41.9 of 24-May-2003

Changes concerning keywords

New keyword:

Swiss-Prot release 41.5 of 23-Apr-2003

Headlines: SARS coronavirus protein sequences are available

We have made a first annotation run of the proteins potentially encoded by the SARS (Severe Acute Respiratory Syndrome) coronavirus. The following entries are available:

  • Nucleocapsid protein (P59595)
  • E1 glycoprotein (P59596)
  • E2 glycoprotein (P59594)
  • Envelope protein (P59637)
  • Replicase polyprotein 1ab (P59641)
  • Hypothetical protein X1 (P59632)
  • Hypothetical protein X2 (P59633)
  • Hypothetical protein X3 (P59634)
  • Hypothetical protein X4 (P59635)
  • Hypothetical protein 5 (P59636)

Changes concerning keywords

New keywords:

Swiss-Prot release 41.8 of 16-May-2003

Headlines: Complete update of PDB cross-references

We have completely updated our cross-references to PDB. Thanks to work done by the EBI and Geneva Swiss-Prot groups in collaboration with the EBI MSD (Macromolecular Structure Database) group, we have mapped PDB structural data to the relevant Swiss-Prot and TrEMBL entries at the atom level. This work has led to the introduction of cross-references to PDB in TrEMBL and a very significant increase in the number of these cross-references in Swiss-Prot. More than 6'000 cross-references were added and the number of Swiss-Prot entries that are linked to PDB is now above 5'300 (versus about 3'600 before this work was carried out).

Swiss-Prot release 41.3 of 04-Apr-2003

Changes concerning keywords

New keywords:

Swiss-Prot release 41.1 of 25-Mar-2003

New syntax of the CC line topic ALTERNATIVE PRODUCTS

In Swiss-Prot release 41.1 (and in the accompanying TrEMBL release), a new format was introduced for "CC ALTERNATIVE PRODUCTS" lines. The new format is more structured than the previous format. Associated with these changes are the introduction of stable identifiers for each named splice isoform in all entries that describe more than one splice isoform, and the extension of feature identifiers, previously only used for human VARIANT and certain CARBOHYD features, to VARSPLIC features in entries from all species.

The new format of the CC line topic ALTERNATIVE PRODUCTS is:

CC   -!- ALTERNATIVE PRODUCTS:
CC       Event=Alternative promoter;
CC         Comment=Free text;
CC       Event=Alternative splicing; Named isoforms=n;
CC         Comment=Optional free text;
CC       Name=Isoform_1; Synonyms=Synonym_1[, Synonym_n];
CC         IsoId=Isoform_identifier_1[, Isoform_identifier_n];
CC         Sequence=Displayed;
CC         Note=Free text;
CC       Name=Isoform_n; Synonyms=Synonym_1[, Synonym_n];
CC         IsoId=Isoform_identifier_1[, Isoform_identifier_n];
CC         Sequence=VSP_identifier_1 [, VSP_identifier_n];
CC         Note=Free text;
CC       Event=Alternative initiation;
CC         Comment=Free text;

The qualifiers are described in the table below:

Topic Description
Event Biological process that results in the production of the alternative forms (Alternative promoter, Alternative splicing, Alternative initiation).
Format: Event=controlled vocabulary;
Example: Event=Alternative splicing;
Named isoforms Number of isoforms listed in the 'Name' topics; currently only used for 'Event=Alternative splicing'.
Format: Named isoforms=number;
Example: Named isoforms=6;
Comment Any comments concerning one or more isoforms; optional for 'Alternative splicing'; in case of 'Alternative promoter' and 'Alternative initiation' there is always a 'Comment' of free text, which includes relevant information on the isoforms.
Format: Comment=free text;
Example: Comment=Experimental confirmation may be lacking for some isoforms;
Name A common name for an isoform used in the literature or assigned by Swiss-Prot; currently only available for spliced isoforms.
Format: Name=common name;
Example: Name=Alpha;
Synonyms Synonyms for an isoform as used in the literature; optional; currently only available for spliced isoforms.
Format: Synonyms=Synonym_1[, Synonym_n];
Example: Synonyms=B, KL5;
IsoId Unique identifier for an isoform, consisting of the Swiss-Prot accession number, followed by a dash and a number.
Format: IsoId=acc#-isoform_number[, acc#-isoform_number];
Example: IsoId=P05067-1;
Sequence Information on the isoform sequence. The term 'Displayed' indicates that the sequence is shown in the entry; a list of feature identifiers (VSP_#) indicates that the isoform is annotated in the feature table, where the FTIds enable programs to create the sequence of a splice variant; 'External' indicates that the accession number of the IsoId does not correspond to the accession number of the current entry; 'Not described' indicates that the sequence of the isoform is unknown.
Format: Sequence=VSP_#[, VSP_#]|Displayed|External|Not described;
Example: Sequence=Displayed;
Example: Sequence=VSP_000013, VSP_000014;
Example: Sequence=External;
Example: Sequence=Not described;
Note Lists isoform-specific information; optional.
Format: Note=Free text;
Example: Note=No experimental confirmation available;

Example of the CC lines and the corresponding FT lines for an entry with alternative splicing (Q15746):

...
CC  -!- ALTERNATIVE PRODUCTS:
CC      Event=Alternative splicing; Named isoforms=6;
CC      Name=1;
CC        IsoId=Q15746-4; Sequence=Displayed;
CC      Name=2;
CC        IsoId=Q15746-5; Sequence=VSP_000040;
CC      Name=3A;
CC        IsoId=Q15746-6; Sequence=VSP_000041, VSP_000043;
CC      Name=3B;
CC        IsoId=Q15746-7; Sequence=VSP_000040, VSP_000041, VSP_000042;
CC      Name=4;
CC        IsoId=Q15746-8; Sequence=VSP_000041, VSP_000042;
CC      Name=del-1790;
CC        IsoId=Q15746-9; Sequence=VSP_000044;
...
FT   VARSPLIC    437    506       VSGIPKPEVAWFLEGTPVRRQEGSIEVYEDAGSHYLCLLKA
FT                                RTRDSGTYSCTASNAQGQVSCSWTLQVER -> G (in
FT                                isoform 2 and isoform 3B).
FT                                /FTId=VSP_004791.
FT   VARSPLIC   1433   1439       DEVEVSD -> MKWRCQT (in isoform 3A,
FT                                isoform 3B and isoform 4).
FT                                /FTId=VSP_004792.
FT   VARSPLIC   1473   1545       Missing (in isoform 4).
FT                                /FTId=VSP_004793.
FT   VARSPLIC   1655   1705       Missing (in isoform 3A and isoform 3B).
FT                                /FTId=VSP_004794.
FT   VARSPLIC   1790   1790       Missing (in isoform Del-1790).
FT                                /FTId=VSP_004795.

...

The corresponding modules of the Swiss-Prot parser Swissknife have been modified, and Release 1.31 of Swissknife can be downloaded.
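
For those not using Swissknife, the structured format can also be read with a short script. The sketch below is not the Swissknife implementation; `parse_alternative_products` is a hypothetical name, and the code assumes that qualifiers are separated by ";" and that free text in 'Comment' and 'Note' contains no semicolon. Each 'Event' starts a new event record and each 'Name' starts a new isoform record; other qualifiers attach to whichever record was opened last.

```python
def parse_alternative_products(cc_lines):
    """Group the qualifiers of an ALTERNATIVE PRODUCTS block by event and isoform."""
    text = " ".join(line[2:].strip() for line in cc_lines)  # drop 'CC'
    text = text.replace("-!-", "", 1).strip()
    header = "ALTERNATIVE PRODUCTS:"
    if not text.startswith(header):
        raise ValueError("not an ALTERNATIVE PRODUCTS block")
    events, isoforms, current = [], [], None
    for token in text[len(header):].split(";"):
        key, _, value = token.strip().partition("=")
        key, value = key.strip(), value.strip()
        if not key:
            continue
        if key == "Event":
            current = {"Event": value}
            events.append(current)
        elif key == "Name":
            current = {"Name": value}
            isoforms.append(current)
        elif current is not None:
            if key in ("Synonyms", "IsoId", "Sequence"):
                # These qualifiers may hold a comma-separated list.
                current[key] = [v.strip() for v in value.split(",")]
            else:
                current[key] = value
    return {"events": events, "isoforms": isoforms}


result = parse_alternative_products([
    "CC  -!- ALTERNATIVE PRODUCTS:",
    "CC      Event=Alternative splicing; Named isoforms=6;",
    "CC      Name=1;",
    "CC        IsoId=Q15746-4; Sequence=Displayed;",
    "CC      Name=2;",
    "CC        IsoId=Q15746-5; Sequence=VSP_000040;",
    "CC      Name=3A;",
    "CC        IsoId=Q15746-6; Sequence=VSP_000041, VSP_000043;",
    "CC      Name=3B;",
    "CC        IsoId=Q15746-7; Sequence=VSP_000040, VSP_000041, VSP_000042;",
    "CC      Name=4;",
    "CC        IsoId=Q15746-8; Sequence=VSP_000041, VSP_000042;",
    "CC      Name=del-1790;",
    "CC        IsoId=Q15746-9; Sequence=VSP_000044;",
])
for isoform in result["isoforms"]:
    print(isoform["Name"], isoform["IsoId"], isoform["Sequence"])
```

A simple consistency check is to compare the 'Named isoforms' value of the event with the number of isoform records that were collected.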

Cross-references to Gene Ontology (GO)

We have added cross-references to the Gene Ontology (GO) database (available at http://www.geneontology.org/), which provides controlled vocabularies for the description of the molecular function, biological process and cellular component of gene products.

The identifiers of the appropriate DR line are:

Resource abbreviation GO
Resource identifier GO's unique identifier for a GO term.
Optional information 1 A 1-letter abbreviation for one of the 3 ontology aspects, separated from the GO term by a colon. If the term is longer than 46 characters, the first 43 characters are indicated followed by 3 dots ('...'). The abbreviations for the 3 distinct aspects of the ontology are P (biological Process), F (molecular Function), and C (cellular Component).
Optional information 2 3-character GO evidence code. The meaning of the evidence codes is: IDA=inferred from direct assay, IMP=inferred from mutant phenotype, IGI=inferred from genetic interaction, IPI=inferred from physical interaction, IEP=inferred from expression pattern, TAS=traceable author statement, NAS=non-traceable author statement, IC=inferred by curator, ISS=inferred from sequence or structural similarity.
Examples
Q9XTD2:
DR   GO; GO:0008601; F:protein phosphatase type 2A, regulator acti...; IPI.
DR   GO; GO:0000080; P:G1 phase of mitotic cell cycle; IDA.
DR   GO; GO:0008285; P:negative regulation of cell proliferation; IDA.
DR   GO; GO:0006470; P:protein amino acid dephosphorylation; IDA.

P04406:
DR   GO; GO:0005737; C:cytoplasm; NAS.
DR   GO; GO:0004365; F:glyceraldehyde 3-phosphate dehydrogenase (p...; NAS.
DR   GO; GO:0006096; P:glycolysis; NAS.
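
The field layout and the truncation rule described above can be sketched as follows. The names `abbreviate_term` and `parse_go_dr_line` are hypothetical, and the parser assumes exactly four fields separated by "; " with a terminating period, as in the examples.

```python
ASPECTS = {"P": "biological process", "F": "molecular function", "C": "cellular component"}


def abbreviate_term(term):
    """Apply the stated rule: terms over 46 characters keep 43 characters plus '...'."""
    return term[:43] + "..." if len(term) > 46 else term


def parse_go_dr_line(line):
    """Parse 'DR   GO; GO:nnnnnnn; A:term; EVIDENCE.' into its parts."""
    if not line.startswith("DR   GO; "):
        raise ValueError("not a GO cross-reference: %r" % line)
    body = line[5:].rstrip()
    if body.endswith("."):
        body = body[:-1]  # drop the terminating period
    fields = body.split("; ")
    if len(fields) != 4:
        raise ValueError("expected 4 fields, got %d" % len(fields))
    _, go_id, aspect_term, evidence = fields
    aspect, _, term = aspect_term.partition(":")
    return {
        "id": go_id,
        "aspect": ASPECTS[aspect],
        "term": term,
        "truncated": term.endswith("..."),
        "evidence": evidence,
    }


parsed_go = parse_go_dr_line(
    "DR   GO; GO:0008601; F:protein phosphatase type 2A, regulator acti...; IPI."
)
print(parsed_go)
```

Because truncation discards characters, the full term name cannot be recovered from the line itself; the sketch only flags it with `truncated`.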

Changes concerning keywords

New keywords:

Deleted keyword:

  • Amphibian skin

Swiss-Prot release 41.0 of 28-Feb-2003

Progress in the conversion of Swiss-Prot to mixed-case characters

We are gradually converting Swiss-Prot entries from all UPPER CASE to MiXeD CaSe. With this release the RC (Reference Comment) line topic STRAIN and the CC line topic CATALYTIC ACTIVITY have been converted.

"Nucleomorph" added to the OrGanelle (OG) line

The OG (OrGanelle) line indicates from which genome a gene for a protein originates. Until now, the defined terms in the OG line were "Chloroplast", "Cyanelle", "Mitochondrion" and "Plasmid". The term "Nucleomorph", which denotes the residual nucleus of an algal endosymbiont that resides inside its host cell, has been added.

Multiple RP lines

Starting with release 41, there can be more than one RP (Reference Position) line per reference in a Swiss-Prot entry. The RP line describes the extent of the work carried out by the authors of the reference, e.g. the type of molecule that has been sequenced, protein characterization, PTM characterization, protein structure analysis, variation detection, etc.

As the number of experimental results per publication has increased over the years, the limitation of a single RP line per reference no longer allowed us to add all the information while maintaining a consistent format. Therefore we decided to permit multiple RP lines.

Example:

RP   SEQUENCE FROM N.A., SEQUENCE OF 23-42 AND 351-365, AND
RP   CHARACTERIZATION.

Cross-references to Schizosaccharomyces pombe GeneDB Prototype

We have added cross-references to the Schizosaccharomyces pombe GeneDB Prototype (available at http://www.genedb.org/genedb/pombe/index.jsp), which contains all S. pombe known and predicted protein coding genes, pseudogenes and tRNAs. It is hosted by the Sanger Institute.

The identifiers of the appropriate DR line are:

Resource abbreviation: GeneDB_SPombe
Resource identifier: GeneDB's unique identifier for a S. pombe gene.
Example:
DR   GeneDB_SPombe; SPAC9E9.12c; -.

Cross-references to Genew

We have added cross-references to the Human Gene Nomenclature Database Genew (available at http://www.gene.ucl.ac.uk/cgi-bin/nomenclature/searchgenes.pl), which provides data for all human genes which have approved symbols. It is managed by the HUGO Gene Nomenclature Committee (HGNC).

The identifiers of the appropriate DR line are:

Resource abbreviation: Genew
Resource identifier: HGNC's unique identifier for a human gene.
Optional information 1: HGNC's approved gene symbol.
Example:
DR   Genew; HGNC:5217; HSD3B1.

Cross-references to Gramene

We have added cross-references to the Gramene database, a comparative mapping resource for grains (available at http://www.gramene.org/).

The format of the explicit links in the flat file is:

Resource abbreviation: Gramene
Resource identifier: Unique identifier for a protein, which is identical to the Swiss-Prot primary AC number of that protein.
Example:
DR   Gramene; Q06967; -.

Cross-references to HAMAP

We have added cross-references to the collection of orthologous microbial protein families, generated manually by expert curators of the HAMAP (High-quality Automated and Manual Annotation of microbial Proteomes) project in the framework of the Swiss-Prot protein knowledgebase. The data is accessible at /sprot/hamap/families.html.

The identifiers of the appropriate DR line are:

Resource abbreviation: HAMAP
Resource identifier: HAMAP unique identifier for a microbial protein family.
Optional information 1: The value is one of '-', 'fused', 'atypical' or 'atypical/fused'. The value '-' is a placeholder for an empty field; 'fused' indicates that the family rule does not cover the entire protein; 'atypical' indicates that the protein is divergent in sequence or has mutated functional sites and should not be included in family datasets. The value 'atypical/fused' indicates that both of these apply.
Optional information 2: Number of domains found in the protein, generally '1', rarely '2' for the fusion of 2 identical domains.
Example:
DR   HAMAP; MF_00012; -; 1.
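
The field layout described above can be sketched as a small parser. This is only an illustration under stated assumptions: the function name `parse_hamap_dr` and the dictionary keys are hypothetical, and the second input line in the usage note is a made-up example, not a line taken from Swiss-Prot.

```python
def parse_hamap_dr(line):
    """Split a HAMAP DR line into the fields described above."""
    if not line.startswith("DR   "):
        raise ValueError("not a DR line")
    # Drop the line code and the terminating period, then split on ';'.
    fields = [f.strip() for f in line[5:].rstrip().rstrip(".").split(";")]
    if len(fields) != 4 or fields[0] != "HAMAP":
        raise ValueError("not a HAMAP cross-reference")
    _, family_id, status, domains = fields
    flags = status.split("/")  # '-' yields ['-'], so both flags stay False
    return {
        "family_id": family_id,
        "atypical": "atypical" in flags,
        "fused": "fused" in flags,
        "domains": int(domains),
    }


print(parse_hamap_dr("DR   HAMAP; MF_00012; -; 1."))
```

With a hypothetical line such as `DR   HAMAP; MF_00012; atypical/fused; 2.`, both flags would be set and the domain count would be 2.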

Cross-references to Phosphorylation Site Database

We have added cross-references to the Phosphorylation Site Database, PhosSite (available at http://vigen.biochem.vt.edu/xpd/xpd.htm), which provides access to information from scientific literature concerning prokaryotic proteins that undergo covalent phosphorylation on the hydroxyl side chains of serine, threonine or tyrosine residues.

The identifiers of the appropriate DR line are:

Resource abbreviation: PhosSite
Resource identifier: Unique identifier for a phosphoprotein, which is identical to the Swiss-Prot primary AC number of that protein.
Example:
DR   PhosSite; P00955; -.

Cross-references to TIGRFAMs

We have added cross-references to TIGRFAMs, a protein family database available at http://www.tigr.org/TIGRFAMs/.

The identifiers of the appropriate DR line are:

Resource abbreviation: TIGRFAMs
Resource identifier: TIGRFAMs' unique identifier for a protein family.
Optional information 1: TIGRFAMs' entry name for a protein family.
Optional information 2: Number of hits found in the sequence.
Example:
DR   TIGRFAMs; TIGR00630; uvra; 1.

Cross-references to CarbBank

We have removed the Swiss-Prot cross-references to CarbBank.

Cross-references to GCRDb

We have removed the Swiss-Prot cross-references to GCRDb.

Cross-references to Mendel

We have removed the Swiss-Prot cross-references to Mendel.

Cross-references to YEPD

We have removed the Swiss-Prot cross-references to the yeast electrophoresis protein database (YEPD).

Explicit links to dbSNP in FT VARIANT lines of human sequence entries

In human protein sequence entries we have introduced explicit links to the Single Nucleotide Polymorphism database (dbSNP) from the feature description of FT VARIANT keys.

The format of such links is:

FT   VARIANT    from     to   description (IN dbSNP:accession_number).
FT                                /FTId=VAR_number.
Example:
FT   VARIANT      65     65       T -> I (IN dbSNP:1065419).
FT                                /FTId=VAR_012009.
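
A minimal sketch of how such links might be extracted, assuming the link always appears as `(IN dbSNP:<digits>)` on the first line of the VARIANT feature, as in the example above. The names `DBSNP_LINK` and `dbsnp_accessions` are hypothetical, and descriptions that wrap onto continuation lines are not handled here.

```python
import re

# Matches the explicit link format shown above and captures the accession.
DBSNP_LINK = re.compile(r"\(IN dbSNP:(\d+)\)")


def dbsnp_accessions(ft_lines):
    """Return dbSNP accession numbers found on FT VARIANT lines."""
    found = []
    for line in ft_lines:
        if line.startswith("FT   VARIANT"):
            found.extend(DBSNP_LINK.findall(line))
    return found


feature = [
    "FT   VARIANT      65     65       T -> I (IN dbSNP:1065419).",
    "FT                                /FTId=VAR_012009.",
]
print(dbsnp_accessions(feature))
```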

Feature key SIMILAR became obsolete

The feature key SIMILAR was used to describe the extent of a similarity with another protein sequence. Nowadays, most domains with similarity to other proteins are known regions described in domain and family databases, which are annotated in Swiss-Prot with the feature key DOMAIN or REPEAT and the comment (CC) line topic SIMILARITY; thus the feature key SIMILAR became obsolete and will not be used again.

Version of SP in XML format

A distribution version of Swiss-Prot and TrEMBL in XML format is being developed. The first draft of the XML specification was released for public review on February 21, 2002.

Swiss-Prot release 44.0

Published July 5, 2004
Swiss-Prot Protein Knowledgebase
Release Notes

Release 44.0 of 05-Jul-2004

Swiss-Prot release 44.0 release notes
Content


Introduction
Status of the model organisms
Swiss-Prot protein knowledgebase release 44.0 statistics
We need your help


See also Recent changes and Forthcoming changes.

Introduction

Release 44.0 of 05-Jul-04 of Swiss-Prot contains 153'871 sequence entries, comprising 56'608'159 amino acids abstracted from 117'396 references. 6'669 sequences have been added since release 43, the sequence data of 582 existing entries has been updated and the annotations of 139'865 entries have been revised. This represents an increase of 4%.

Release Date Number of entries Number of amino acids
2.0 09/86 3'939 900'163
3.0 11/86 4'160 969'641
4.0 04/87 4'387 1'036'010
5.0 09/87 5'205 1'327'683
6.0 01/88 6'102 1'653'982
7.0 04/88 6'821 1'885'771
8.0 08/88 7'724 2'224'465
9.0 11/88 8'702 2'498'140
10.0 03/89 10'008 2'952'613
11.0 07/89 10'856 3'265'966
12.0 10/89 12'305 3'797'482
13.0 01/90 13'837 4'347'336
14.0 04/90 15'409 4'914'264
15.0 08/90 16'941 5'486'399
16.0 11/90 18'364 5'986'949
17.0 02/91 20'024 6'524'504
18.0 05/91 20'772 6'792'034
19.0 08/91 21'795 7'173'785
20.0 11/91 22'654 7'500'130
21.0 03/92 23'742 7'866'596
22.0 05/92 25'044 8'375'696
23.0 08/92 26'706 9'011'391
24.0 12/92 28'154 9'545'427
25.0 04/93 29'955 10'214'020
26.0 07/93 31'808 10'875'091
27.0 10/93 33'329 11'484'420
28.0 02/94 36'000 12'496'420
29.0 06/94 38'303 13'464'008
30.0 10/94 40'292 14'147'368
31.0 02/95 43'470 15'335'248
32.0 11/95 49'340 17'385'503
33.0 02/96 52'205 18'531'384
34.0 10/96 59'021 21'210'389
35.0 11/97 69'113 25'083'768
36.0 07/98 74'019 26'840'295
37.0 12/98 77'977 28'268'293
38.0 07/99 80'000 29'085'965
39.0 05/00 86'593 31'411'114
40.0 10/01 101'602 37'315'215
41.0 02/03 122'564 44'986'459
42.0 10/03 135'850 50'046'799
43.0 03/04 146'720 54'093'154
44.0 07/04 153'871 56'608'159
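
The figures in this table use an apostrophe as the thousands separator. The sketch below, with the hypothetical helper `to_int`, reads the last two rows and derives the growth in entries and the average sequence length. Truncated to whole numbers, the results agree with the 4% increase and the average length of 367 amino acids quoted in these notes; whether the published values were computed this way is an assumption.

```python
def to_int(text):
    """Convert a figure such as 153'871 to an integer."""
    return int(text.replace("'", ""))


rows = [
    "43.0 03/04 146'720 54'093'154",
    "44.0 07/04 153'871 56'608'159",
]

parsed = []
for row in rows:
    release, date, entries, residues = row.split()
    parsed.append((release, date, to_int(entries), to_int(residues)))

(_, _, prev_entries, _), (_, _, entries, residues) = parsed

# Integer division truncates, which is the assumption being illustrated.
growth_percent = (entries - prev_entries) * 100 // prev_entries
average_length = residues // entries
print(growth_percent, average_length)
```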

Status of the model organisms

We have selected a number of organisms that are the target of genome sequencing and/or mapping projects and for which we intend to:

  • be as complete as possible. All sequences available at a given time should be immediately included in Swiss-Prot. This also includes sequence corrections and updates;
  • provide a higher level of annotation;
  • provide cross-references to specialized database(s) that contain, among other data, some information about the genes that code for these proteins;
  • provide specific indexes and documents.

From our efforts to annotate human sequence entries as completely as possible arose the HPI project, and the bacterial model organisms became the focus of the HAMAP project. Here is the current status of the model organisms which are not covered by these two projects:

Organism        Database cross-references   Index file     Number of sequences
A.thaliana      None yet                    arath.txt      2'853
C.albicans      None yet                    calbican.txt   289
C.elegans       Wormpep                     celegans.txt   2'482
D.discoideum    DictyBase                   dicty.txt      322
D.melanogaster  FlyBase                     fly.txt        2'048
M.musculus      MGD                         mgdtosp.txt    7'853
S.cerevisiae    SGD                         yeast.txt      4'948
S.pombe         GeneDB_SPombe               pombe.txt      2'459

Swiss-Prot protein knowledgebase release 44.0 statistics

                        
                        1.  INTRODUCTION
                        
                        Release 44.0 of 05-Jul-04 of Swiss-Prot contains 153871 sequence entries,
                        comprising 56608159 amino acids abstracted from 117396 references.
                        
                        6669 sequences have been added since release 43, the sequence data of
                        582 existing entries has been updated and the annotations of
                        139865 entries have been revised. This represents an increase of 4%.
                        
                        
                        2.  AMINO ACID COMPOSITION
                        
                        2.1  Composition in percent for the complete database
                        
                        Ala (A) 7.80   Gln (Q) 3.93   Leu (L) 9.62   Ser (S) 6.89
                        Arg (R) 5.29   Glu (E) 6.59   Lys (K) 5.93   Thr (T) 5.46
                        Asn (N) 4.22   Gly (G) 6.93   Met (M) 2.37   Trp (W) 1.16
                        Asp (D) 5.30   His (H) 2.27   Phe (F) 4.02   Tyr (Y) 3.09
                        Cys (C) 1.57   Ile (I) 5.91   Pro (P) 4.85   Val (V) 6.69
                        
                        Asx (B) 0.000  Glx (Z) 0.000  Xaa (X) 0.01
                        
                        
                        2.2  Classification of the amino acids by their frequency
                        
                        Leu, Ala, Gly, Ser, Val, Glu, Lys, Ile, Thr, Asp, Arg, Pro, Asn, Phe,
                        Gln, Tyr, Met, His, Cys, Trp
                        
                        
                        3.  TAXONOMIC ORIGIN
                        
                        Total number of species represented in this release of Swiss-Prot: 8554
                        
                        The first twenty species represent 59439 sequences:  38.6 % of the total
                        number of entries.
                        
                        
                        3.1 Table of the frequency of occurrence of species
                        
                        Species represented 1x: 4095
                                            2x: 1323
                                            3x:  673
                                            4x:  442
                                            5x:  274
                                            6x:  261
                                            7x:  196
                                            8x:  154
                                            9x:  131
                                           10x:   83
                                       11- 20x:  348
                                       21- 50x:  269
                                       51-100x:   96
                                         >100x:  209
                        
                        
                        3.2  Table of the most represented species
                        
                        ------  ---------  --------------------------------------------
                        Number  Frequency  Species
                        ------  ---------  --------------------------------------------
                        1      11072  Homo sapiens (Human)
                        2       7853  Mus musculus (Mouse)
                        3       4948  Saccharomyces cerevisiae (Baker's yeast)
                        4       4837  Escherichia coli
                        5       3841  Rattus norvegicus (Rat)
                        6       2853  Arabidopsis thaliana (Mouse-ear cress)
                        7       2748  Bacillus subtilis
                        8       2482  Caenorhabditis elegans
                        9       2459  Schizosaccharomyces pombe (Fission yeast)
                        10       2048  Drosophila melanogaster (Fruit fly)
                        11       1782  Methanococcus jannaschii
                        12       1773  Haemophilus influenzae
                        13       1670  Escherichia coli O157:H7
                        14       1474  Bos taurus (Bovine)
                        15       1431  Salmonella typhimurium
                        16       1396  Mycobacterium tuberculosis
                        17       1315  Escherichia coli O6
                        18       1273  Shigella flexneri
                        19       1098  Gallus gallus (Chicken)
                        20       1086  Mycobacterium bovis
                        21       1009  Salmonella typhi
                        22        986  Pseudomonas aeruginosa
                        23        947  Archaeoglobus fulgidus
                        24        946  Synechocystis sp. (strain PCC 6803)
                        25        890  Xenopus laevis (African clawed frog)
                        26        874  Sus scrofa (Pig)
                        27        782  Rhizobium meliloti (Sinorhizobium meliloti)
                        28        765  Vibrio cholerae
                        29        740  Aquifex aeolicus
                        30        731  Oryctolagus cuniculus (Rabbit)
                        31        730  Yersinia pestis
                        32        687  Mycoplasma pneumoniae
                        33        666  Pasteurella multocida
                        34        612  Streptomyces coelicolor
                        35        606  Mycobacterium leprae
                        36        605  Treponema pallidum
                        37        600  Bacillus halodurans
                        38        597  Vibrio parahaemolyticus
                        39        572  Buchnera aphidicola (subsp. Acyrthosiphon pisum)
                        40        570  Methanobacterium thermoautotrophicum
                        41        561  Buchnera aphidicola (subsp. Schizaphis graminum)
                        42        558  Helicobacter pylori (Campylobacter pylori)
                        43        553  Anabaena sp. (strain PCC 7120)
                        44        547  Vibrio vulnificus
                        45        545  Rickettsia prowazekii
                        46        539  Helicobacter pylori J99 (Campylobacter pylori J99)
                        47        528  Staphylococcus aureus (strain Mu50 / ATCC 700699)
                        48        527  Staphylococcus aureus (strain N315)
                        49        510  Staphylococcus aureus (strain MW2)
                        50        507  Buchnera aphidicola (subsp. Baizongia pistaciae)
                        51        502  Lactococcus lactis (subsp. lactis) (Streptococcus lactis)
                        52        501  Zea mays (Maize)
                        53        486  Mycoplasma genitalium
                        54        485  Pseudomonas putida (strain KT2440)
                        55        484  Ralstonia solanacearum (Pseudomonas solanacearum)
                        56        482  Staphylococcus epidermidis
                        57        481  Listeria monocytogenes
                        58        480  Pseudomonas syringae (pv. tomato)
                        59        476  Listeria innocua
                        60        471  Agrobacterium tumefaciens (strain C58 / ATCC 33970)
                        61        467  Neisseria meningitidis (serogroup B)
                        62        465  Neisseria meningitidis (serogroup A)
                        63        463  Rhizobium loti (Mesorhizobium loti)
                        64        460  Xanthomonas campestris (pv. campestris)
                        65        454  Clostridium acetobutylicum
                        66        453  Thermotoga maritima
                        67        448  Caulobacter crescentus
                        68        428  Oryza sativa (Rice)
                        69        426  Bradyrhizobium japonicum
                        70        425  Deinococcus radiodurans
                        71        425  Canis familiaris (Dog)
                        72        424  Streptococcus pneumoniae
                        73        424  Xylella fastidiosa
                        74        423  Xanthomonas axonopodis (pv. citri)
                        75        420  Chlamydia trachomatis
                        76        418  Pyrococcus horikoshii
                        77        417  Borrelia burgdorferi (Lyme disease spirochete)
                        78        414  Bacillus anthracis
                        79        414  Pyrococcus abyssi
                        80        412  Xylella fastidiosa (strain Temecula1 / ATCC 700964)
                        81        408  Chlamydia pneumoniae (Chlamydophila pneumoniae)
                        82        403  Rhizobium sp. (strain NGR234)
                        83        398  Chlamydia muridarum
                        84        396  Brucella melitensis
                        85        395  Brucella suis
                        86        395  Clostridium perfringens
                        87        388  Halobacterium sp. (strain NRC-1 / ATCC 700922 / JCM 11081)
                        88        387  Corynebacterium glutamicum (Brevibacterium flavum)
                        89        384  Methanosarcina acetivorans
                        90        382  Shewanella oneidensis
                        91        376  Methanosarcina mazei (Methanosarcina frisia)
                        92        371  Sulfolobus solfataricus
                        93        371  Pyrococcus furiosus
                        94        369  Campylobacter jejuni
                        95        362  Thermoanaerobacter tengcongensis
                        96        359  Bacillus cereus (strain ATCC 14579 / DSM 31)
                        97        358  Nicotiana tabacum (Common tobacco)
                        98        354  Streptococcus pyogenes
                        99        353  Rickettsia conorii
                        100        350  Lactobacillus plantarum
                        101        347  Ovis aries (Sheep)
                        102        343  Oceanobacillus iheyensis
                        103        339  Streptococcus pneumoniae (strain ATCC BAA-255 / R6)
                        104        331  Synechococcus elongatus (Thermosynechococcus elongatus)
                        105        329  Aeropyrum pernix
                        106        324  Streptococcus mutans
                        107        322  Dictyostelium discoideum (Slime mold)
                        108        318  Chlorobium tepidum
                        109        313  Neurospora crassa
                        110        313  Staphylococcus aureus
                        111        310  Streptococcus pyogenes (serotype M18)
                        112        305  Streptococcus pyogenes (serotype M3)
                        113        304  Brachydanio rerio (Zebrafish) (Danio rerio)
                        114        302  Methanopyrus kandleri
                        115        298  Vibrio vulnificus (strain YJ016)
                        116        295  Pisum sativum (Garden pea)
                        117        293  Sulfolobus tokodaii
                        118        289  Candida albicans (Yeast)
                        119        288  Photorhabdus luminescens (subsp. laumondii)
                        120        283  Thermoplasma acidophilum
                        121        279  Triticum aestivum (Wheat)
                        122        273  Enterococcus faecalis (Streptococcus faecalis)
                        123        268  Bacteriophage T4
                        124        268  Hordeum vulgare (Barley)
                        125        264  Corynebacterium efficiens
                        126        261  Glycine max (Soybean)
                        127        260  Fusobacterium nucleatum (subsp. nucleatum)
                        128        256  Lycopersicon esculentum (Tomato)
                        129        254  Vaccinia virus (strain Copenhagen)
                        130        252  Cavia porcellus (Guinea pig)
                        131        252  Solanum tuberosum (Potato)
                        132        252  Rhodobacter capsulatus (Rhodopseudomonas capsulata)
                        133        250  Haemophilus ducreyi
                        134        249  Pseudomonas putida
                        135        247  Bordetella pertussis
                        136        246  Pyrobaculum aerophilum
                        137        244  Bordetella bronchiseptica (Alcaligenes bronchisepticus)
                        138        244  Thermoplasma volcanium
                        139        244  Streptomyces avermitilis
                        140        237  Streptococcus agalactiae (serotype III)
                        141        237  Streptococcus agalactiae (serotype V)
                        142        236  Nitrosomonas europaea
                        143        235  Spinacia oleracea (Spinach)
                        144        235  Ureaplasma parvum (Ureaplasma urealyticum biotype 1)
                        145        234  Pan troglodytes (Chimpanzee)
                        146        234  Leptospira interrogans
                        147        233  Bordetella parapertussis
                        148        231  Bacillus stearothermophilus
                        149        225  Chromobacterium violaceum
                        150        220  Porphyra purpurea
                        151        215  Wigglesworthia glossinidia brevipalpis
                        152        213  Chlamydophila caviae
                        153        209  Chlamydomonas reinhardtii
                        154        203  Klebsiella pneumoniae
                        
                        
                        
                        
                        3.3  Taxonomic distribution of the sequences
                        
                        Kingdom        sequences (% of the database)
                        Archaea            8744 (  6%)
                        Bacteria          66231 ( 43%)
                        Eukaryota         70273 ( 46%)
                        Viruses            8623 (  6%)
                        
                        
                        Within Eukaryota:
                        
                        Category            sequences (% of Eukaryota) (% of the complete database)
                        Human                  11072 ( 16%)           (  7%)
                        Other Mammalia         19434 ( 28%)           ( 13%)
                        Other Vertebrata        6534 (  9%)           (  4%)
                        Viridiplantae          11178 ( 16%)           (  7%)
                        Fungi                  10000 ( 14%)           (  6%)
                        Insecta                 3901 (  6%)           (  3%)
                        Nematoda                2723 (  4%)           (  2%)
                        Other                   5431 (  8%)           (  4%)
                        
                        
                        4.  SEQUENCE SIZE
                        
                        Repartition of the sequences by size (excluding fragments)
                        
                        From   To  Number             From   To   Number
                           1-  50    2862             1001-1100     1329
                          51- 100   10453             1101-1200      958
                         101- 150   15265             1201-1300      704
                         151- 200   14332             1301-1400      521
                         201- 250   15122             1401-1500      402
                         251- 300   12991             1501-1600      260
                         301- 350   13697             1601-1700      198
                         351- 400   12349             1701-1800      137
                         401- 450    9503             1801-1900      153
                         451- 500    8093             1901-2000      123
                         501- 550    6185             2001-2100       71
                         551- 600    4179             2101-2200      120
                         601- 650    3526             2201-2300      104
                         651- 700    2493             2301-2400       68
                         701- 750    2130             2401-2500       62
                         751- 800    1808             >2500          408
                         801- 850    1443
                         851- 900    1512
                         901- 950    1080
                         951-1000     906
                        
                        
                        The average sequence length in Swiss-Prot is 367 amino acids.
                        
                        The shortest sequence is   GWA_SEPOF (P83570):     2 amino acids.
                        The longest sequence is   SNE1_HUMAN (Q8NF91):  8797 amino acids.
                        
                        
                        5.  JOURNAL CITATIONS
                        
                        Note: the following citation statistics reflect the number of distinct
                        journal citations.
                        
                        Total number of journals cited in this release of Swiss-Prot: 1474
                        
                        
                        5.1 Table of the frequency of journal citations
                        
                        Journals cited 1x:  541
                                       2x:  193
                                       3x:  100
                                       4x:   67
                                       5x:   62
                                       6x:   29
                                       7x:   32
                                       8x:   30
                                       9x:   26
                                      10x:   12
                                  11- 20x:  116
                                  21- 50x:  109
                                  51-100x:   52
                                    >100x:  105
                        
                        
                        5.2  List of the most cited journals in Swiss-Prot
                        
                        Nb    Citations   Journal name
                        --    ---------   -------------------------------------------------------------
                        1        10481   Journal of Biological Chemistry
                        2         5498   Proceedings of the National Academy of Sciences of the U.S.A.
                        3         3894   Journal of Bacteriology
                        4         3715   Nucleic Acids Research
                        5         3622   Gene
                        6         2893   FEBS Letters
                        7         2877   Biochemical and Biophysical Research Communications
                        8         2651   Biochemistry
                        9         2589   European Journal of Biochemistry
                        10         2427   The EMBO Journal
                        11         2280   Nature
                        12         2215   Biochimica et Biophysica Acta
                        13         1992   Journal of Molecular Biology
                        14         1928   Genomics
                        15         1791   Cell
                        16         1749   Molecular and Cellular Biology
                        17         1384   Biochemical Journal
                        18         1327   Science
                        19         1168   Molecular Microbiology
                        20         1160   Plant Molecular Biology
                        21         1150   Molecular and General Genetics
                        22          916   Journal of Biochemistry
                        23          875   Virology
                        24          865   Human Molecular Genetics
                        25          824   Journal of Cell Biology
                        26          777   Nature Genetics
                        27          707   Genes and Development
                        28          691   Journal of Virology
                        29          680   The American Journal of Human Genetics
                        30          651   Plant Physiology
                        31          645   Human Mutation
                        32          643   Oncogene
                        33          587   Journal of Immunology
                        34          572   Infection and Immunity
                        35          557   Yeast
                        36          536   Structure
                        37          524   Journal of General Virology
                        38          519   Archives of Biochemistry and Biophysics
                        39          504   Microbiology
                        40          496   Development
                        41          486   FEMS Microbiology Letters
                        42          456   Nature Structural Biology
                        43          449   Genetics
                        44          425   Human Genetics
                        45          407   Current Genetics
                        46          405   Blood
                        47          371   Molecular and Biochemical Parasitology
                        48          356   Applied and Environmental Microbiology
                        49          340   Journal of Clinical Investigation
                        50          324   Mammalian Genome
                        51          324   Protein Science
                        52          320   Molecular Endocrinology
                        53          320   Developmental Biology
                        54          310   Cancer Research
                        55          305   Immunogenetics
                        56          299   DNA and Cell Biology
                        57          298   Journal of Molecular Evolution
                        58          295   Neuron
                        59          291   Molecular Biology of the Cell
                        60          289   The Journal of Experimental Medicine
                        61          287   Mechanisms of Development
                        62          287   Acta Crystallographica, Section D
                        63          276   The Plant Cell
                        64          273   Biological Chemistry Hoppe-Seyler
                        65          268   Journal of Cell Science
                        66          260   Endocrinology
                        67          245   DNA Sequence
                        68          244   The Plant Journal
                        69          237   Journal of Neuroscience
                        70          233   Journal of General Microbiology
                        71          225   Molecular Biology and Evolution
                        72          220   The Journal of Clinical Endocrinology and Metabolism
                        73          218   Journal of Neurochemistry
                        74          217   Brain Research. Molecular Brain Research
                        75          213   Hoppe-Seyler's Zeitschrift fur Physiologische Chemie
                        76          204   Toxicon
                        77          201   Cytogenetics and Cell Genetics
                        78          194   Molecular Cell
                        79          188   Comparative Biochemistry and Physiology
                        80          187   American Journal of Physiology
                        81          186   Bioscience, Biotechnology, and Biochemistry
                        82          172   Molecular Pharmacology
                        83          169   Current Biology
                        84          169   Antimicrobial Agents and Chemotherapy
                        85          156   DNA
                        86          150   Proteins
                        87          147   Journal of Investigative Dermatology
                        88          145   Journal of Medical Genetics
                        89          145   Tissue Antigens
                        90          143   DNA Research
                        91          139   Biochimie
                        92          138   Molecular Plant-Microbe Interactions
                        93          138   Peptides
                        94          136   Virus Research
                        95          134   American Journal of Medical Genetics
                        96          133   Genome Research
                        97          131   Bioorganicheskaia Khimiia
                        98          121   Hemoglobin
                        99          119   Molecular and Cellular Endocrinology
                        100          119   European Journal of Immunology
                        101          114   Agricultural and Biological Chemistry
                        102          113   Biology of Reproduction
                        103          111   Plant and Cell Physiology
                        104          105   Archives of Microbiology
                        105          105   Insect Biochemistry and Molecular Biology
                        
                        
                        6.  STATISTICS FOR SOME LINE TYPES
                        
                        The following table summarizes the total number of some Swiss-Prot lines,
                        as well as the number of entries with at least one such line, and the
                        frequency of the lines.
                        
                                                           Total    Number of  Average
                        Line type / subtype                number   entries    per entry
                        ---------------------------------  -------- ---------  ---------
                        
                        References (RL)                     298130              1.94
                        Journal                          263016    143235    1.71
                        Submitted to EMBL/GenBank/DDBJ    32324     27279    0.21
                        Submitted to Swiss-Prot             553       551   <0.01
                        Unpublished observations            550       546   <0.01
                        Plant Gene Register                 487       476   <0.01
                        Book citation                       468       456   <0.01
                        Thesis                              267       265   <0.01
                        Submitted to other databases        224       223   <0.01
                        Unpublished results                 127       125   <0.01
                        Patent                              112       110   <0.01
                        Worm Breeder's Gazette                2         2   <0.01
                        
                        Comments (CC)                       545146              3.54
                        SIMILARITY                       156437    135011    1.02
                        FUNCTION                          99698     97752    0.65
                        SUBCELLULAR LOCATION              73990     73990    0.48
                        CATALYTIC ACTIVITY                54307     51090    0.35
                        SUBUNIT                           47288     47288    0.31
                        PATHWAY                           25931     24787    0.17
                        COFACTOR                          18085     18085    0.12
                        TISSUE SPECIFICITY                16988     16988    0.11
                        PTM                                9783      8735    0.06
                        MISCELLANEOUS                      9207      8477    0.06
                        ALTERNATIVE PRODUCTS               5686      5686    0.04
                        DOMAIN                             5590      4956    0.04
                        CAUTION                            4756      4427    0.03
                        INDUCTION                          4253      4253    0.03
                        DEVELOPMENTAL STAGE                4031      4031    0.03
                        DISEASE                            2664      1959    0.02
                        ENZYME REGULATION                  2177      2177    0.01
                        DATABASE                           1395      1318    0.01
                        MASS SPECTROMETRY                  1374      1221    0.01
                        POLYMORPHISM                        471       460   <0.01
                        ALLERGEN                            355       355   <0.01
                        RNA EDITING                         309       309   <0.01
                        TOXIC DOSE                          240       239   <0.01
                        BIOTECHNOLOGY                        81        81   <0.01
                        PHARMACEUTICAL                       50        50   <0.01
                        
                        Features (FT)                       873999              5.68
                        DOMAIN                           124150     38268    0.81
                        TRANSMEM                          97932     21284    0.64
                        TURN                              62471      4664    0.41
                        METAL                             57798     14418    0.38
                        CONFLICT                          57695     20265    0.37
                        STRAND                            57258      4167    0.37
                        CARBOHYD                          52677     12962    0.34
                        DISULFID                          49344     12983    0.32
                        HELIX                             45101      4521    0.29
                        REPEAT                            33410      4770    0.22
                        ACT_SITE                          33214     20127    0.22
                        VARIANT                           28763      5294    0.19
                        CHAIN                             27263     22076    0.18
                        NP_BIND                           20649     14463    0.13
                        SIGNAL                            17074     17072    0.11
                        MOD_RES                           15649      8688    0.10
                        SITE                              12072      7397    0.08
                        BINDING                           11742      8456    0.08
                        VARSPLIC                          10969      4897    0.07
                        NON_TER                           10714      8182    0.07
                        ZN_FING                            9852      3597    0.06
                        MUTAGEN                            7514      2088    0.05
                        INIT_MET                           6428      6383    0.04
                        PROPEP                             5402      4594    0.04
                        DNA_BIND                           4758      4461    0.03
                        LIPID                              4440      2981    0.03
                        TRANSIT                            2858      2833    0.02
                        PEPTIDE                            2849      1155    0.02
                        CA_BIND                            2092       868    0.01
                        NON_CONS                            889       446    0.01
                        CROSSLNK                            472       373   <0.01
                        UNSURE                              328       137   <0.01
                        SE_CYS                              172       117   <0.01
                        
                        Cross-references (DR)              1487862              9.67
                        InterPro                         316271    139768    2.06
                        EMBL                             298547    146929    1.94
                        Pfam                             181303    132718    1.18
                        PROSITE                          135505     85173    0.88
                        PIR                               89978     82514    0.58
                        HSSP                              65647     65647    0.43
                        PRINTS                            56437     45801    0.37
                        GO                                53141     15932    0.35
                        TIGRFAMs                          51354     44911    0.33
                        HAMAP                             44550     44433    0.29
                        ProDom                            38396     36940    0.25
                        SMART                             37035     28184    0.24
                        PDB                               22660      6143    0.15
                        TIGR                              15410     15336    0.10
                        Genew                             10039      9991    0.07
                        MIM                                9767      8104    0.06
                        MGD                                7506      7486    0.05
                        SGD                                4988      4935    0.03
                        GermOnline                         4928      4877    0.03
                        EcoGene                            4228      4226    0.03
                        EchoBASE                           4159      4127    0.03
                        MEROPS                             3903      3805    0.03
                        PIRSF                              3226      3226    0.02
                        WormPep                            2765      2462    0.02
                        SubtiList                          2698      2697    0.02
                        TRANSFAC                           2682      2395    0.02
                        FlyBase                            2616      2554    0.02
                        GeneDB_SPombe                      2468      2438    0.02
                        RGD                                2450      2448    0.02
                        IntAct                             1898      1898    0.01
                        TubercuList                        1424      1388    0.01
                        StyGene                            1385      1382    0.01
                        SWISS-2DPAGE                       1105      1105    0.01
                        ListiList                           958       897    0.01
                        Reactome                            707       707   <0.01
                        Leproma                             610       606   <0.01
                        Gramene                             560       555   <0.01
                        MaizeDB                             412       407   <0.01
                        HIV                                 370       354   <0.01
                        GeneFarm                            362       361   <0.01
                        REBASE                              361       356   <0.01
                        ECO2DBASE                           351       299   <0.01
                        OGP                                 351       351   <0.01
                        DictyBase                           324       322   <0.01
                        PhotoList                           288       288   <0.01
                        ZFIN                                281       281   <0.01
                        GlycoSuiteDB                        259       259   <0.01
                        SagaList                            238       237   <0.01
                        PHCI-2DPAGE                         237       237   <0.01
                        MypuList                            165       165   <0.01
                        Aarhus/Ghent-2DPAGE                 128        98   <0.01
                        Siena-2DPAGE                        103       103   <0.01
                        HSC-2DPAGE                           85        85   <0.01
                        COMPLUYEAST-2DPAGE                   59        59   <0.01
                        PhosSite                             53        53   <0.01
                        PMMA-2DPAGE                          51        51   <0.01
                        Maize-2DPAGE                         39        39   <0.01
                        Rat-heart-2DPAGE                     28        28   <0.01
                        ANU-2DPAGE                           13        13   <0.01
                        
                        
                        7.  MISCELLANEOUS STATISTICS
                        
                        Total number of distinct authors cited in Swiss-Prot: 185909
                        
                        Total number of entries encoded on a chloroplast: 3584
                        Total number of entries encoded on a mitochondrion: 2892
                        Total number of entries encoded on a cyanelle: 145
                        Total number of entries encoded on a plasmid: 2761
                        
                        Number of fragments: 8324
                        Number of additional sequences encoded on splice variants: 8638
                        
                        
                        
                        
                     
We need your help

We welcome feedback from our users. We would especially appreciate your notifying us if you find that sequences belonging to your field of expertise are missing from the database. We also would like to be notified about annotations to be updated, if, for example, the function of a protein has been clarified or if new information about post-translational modifications has become available. To facilitate this feedback we offer, on the ExPASy WWW server, a form that allows the submission of updates and/or corrections to Swiss-Prot:

It is also possible, from any entry in Swiss-Prot displayed by the ExPASy server, to submit updates and/or corrections for that particular entry. Finally, you can also send your comments by electronic mail to the address:

Note that all update requests are assigned a unique identifier of the form UR-Xnnnn (example: UR-A0123). This identifier is used internally by the Swiss-Prot staff at SIB and EBI to track requests and is also used in e-mail exchanges with the persons who have submitted a request.
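The identifier format described above lends itself to a simple check. The following is a minimal sketch in Python, assuming that `X` stands for a single uppercase letter and `nnnn` for exactly four digits, as the example UR-A0123 suggests. The function name `is_update_request_id` is hypothetical and is not part of any Swiss-Prot tool.

```python
import re

# Assumed pattern: "UR-", one uppercase letter, four digits (e.g. UR-A0123).
# The exact character classes are inferred from the single example in the text.
UR_PATTERN = re.compile(r"UR-[A-Z][0-9]{4}")


def is_update_request_id(text: str) -> bool:
    """Return True if text looks like an update request identifier."""
    return UR_PATTERN.fullmatch(text) is not None


print(is_update_request_id("UR-A0123"))
```

Using `fullmatch` rather than `search` rejects strings that merely contain an identifier, which may or may not be what is wanted when scanning e-mail text.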

Swiss-Prot release 43.0

Published March 29, 2004
Swiss-Prot Protein Knowledgebase
Release Notes

Release 43.0 of 29-Mar-2004

Content


Introduction
Status of the model organisms
Swiss-Prot protein knowledgebase release 43.0 statistics
We need your help


See also Recent changes and Forthcoming changes.

Introduction

Release 43.0 of 29-Mar-2004 of Swiss-Prot contains 146'720 sequence entries, comprising 54'093'154 amino acids abstracted from 113'719 references. Since release 42, 10'760 sequences have been added, the sequence data of 663 existing entries have been updated and the annotations of 44'948 entries have been revised. This represents an increase of about 8% in the number of entries relative to release 42.
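The growth figure can be checked against the entry counts of releases 42.0 and 43.0 (135'850 and 146'720 entries, as listed in the release table). This is a small sketch that assumes the percentage refers to the change in the number of entries; the helper name `percent_increase` is illustrative only.

```python
# Entry counts quoted in these release notes for releases 42.0 and 43.0.
entries_release_42 = 135_850
entries_release_43 = 146_720


def percent_increase(old: int, new: int) -> float:
    """Relative growth from old to new, expressed in percent."""
    return (new - old) / old * 100


growth = percent_increase(entries_release_42, entries_release_43)
print(f"{growth:.1f}%")  # prints 8.0%
```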

Release Date Number of entries Number of amino acids
2.0 09/86 3'939 900'163
3.0 11/86 4'160 969'641
4.0 04/87 4'387 1'036'010
5.0 09/87 5'205 1'327'683
6.0 01/88 6'102 1'653'982
7.0 04/88 6'821 1'885'771
8.0 08/88 7'724 2'224'465
9.0 11/88 8'702 2'498'140
10.0 03/89 10'008 2'952'613
11.0 07/89 10'856 3'265'966
12.0 10/89 12'305 3'797'482
13.0 01/90 13'837 4'347'336
14.0 04/90 15'409 4'914'264
15.0 08/90 16'941 5'486'399
16.0 11/90 18'364 5'986'949
17.0 02/91 20'024 6'524'504
18.0 05/91 20'772 6'792'034
19.0 08/91 21'795 7'173'785
20.0 11/91 22'654 7'500'130
21.0 03/92 23'742 7'866'596
22.0 05/92 25'044 8'375'696
23.0 08/92 26'706 9'011'391
24.0 12/92 28'154 9'545'427
25.0 04/93 29'955 10'214'020
26.0 07/93 31'808 10'875'091
27.0 10/93 33'329 11'484'420
28.0 02/94 36'000 12'496'420
29.0 06/94 38'303 13'464'008
30.0 10/94 40'292 14'147'368
31.0 02/95 43'470 15'335'248
32.0 11/95 49'340 17'385'503
33.0 02/96 52'205 18'531'384
34.0 10/96 59'021 21'210'389
35.0 11/97 69'113 25'083'768
36.0 07/98 74'019 26'840'295
37.0 12/98 77'977 28'268'293
38.0 07/99 80'000 29'085'965
39.0 05/00 86'593 31'411'114
40.0 10/01 101'602 37'315'215
41.0 02/03 122'564 44'986'459
42.0 10/03 135'850 50'046'799
43.0 03/04 146'720 54'093'154
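The rows of this table use an apostrophe as the thousands separator, which most numeric parsers do not accept directly. The following sketch shows one way to turn a row into typed fields, assuming the four whitespace-separated columns shown above; `parse_release_row` is a hypothetical helper, not an existing Swiss-Prot utility.

```python
def parse_release_row(row: str) -> tuple[str, str, int, int]:
    """Split a row such as "43.0 03/04 146'720 54'093'154" into typed fields."""
    release, date, entries, residues = row.split()
    # Strip the apostrophe thousands separators before converting to int.
    return release, date, int(entries.replace("'", "")), int(residues.replace("'", ""))


rows = [
    "42.0 10/03 135'850 50'046'799",
    "43.0 03/04 146'720 54'093'154",
]
parsed = [parse_release_row(r) for r in rows]

# Amino acids per entry for each parsed release.
for release, _date, entries, residues in parsed:
    print(release, round(residues / entries, 1))
```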

Status of the model organisms

We have selected a number of organisms that are the target of genome sequencing and/or mapping projects and for which we intend to:

  • be as complete as possible. All sequences available at a given time should be immediately included in Swiss-Prot. This also includes sequence corrections and updates;
  • provide a higher level of annotation;
  • provide cross-references to specialized database(s) that contain, among other data, some information about the genes that code for these proteins;
  • provide specific indexes and documents.

Our efforts to annotate human sequence entries as completely as possible gave rise to the HPI project, and the bacterial model organisms became the focus of the HAMAP project. The current status of the model organisms that are not covered by these two projects is as follows:

Organism        Database cross-references  Index file    Number of sequences
A.thaliana      None yet                   arath.txt     2'591
C.albicans      None yet                   calbican.txt  286
C.elegans       Wormpep                    celegans.txt  2'458
D.discoideum    DictyDB                    dicty.txt     319
D.melanogaster  FlyBase                    fly.txt       1'967
M.musculus      MGD                        mgdtosp.txt   7'326
S.cerevisiae    SGD                        yeast.txt     4'930
S.pombe         GeneDB_SPombe              pombe.txt     2'386

Swiss-Prot protein knowledgebase release 43.0 statistics

                        
                        
                        1.  INTRODUCTION
                        
                        Release 43.0 of 29-Mar-2004 of Swiss-Prot contains 146720 sequence entries,
                        comprising 54093154 amino acids abstracted from 113719 references.
                        
                        10760 sequences have been added since release 42, the sequence data of
                        663 existing entries has been updated and the annotations of
                        44948 entries have been revised. This represents an increase of 8%.
                        
                        
                        2.  AMINO ACID COMPOSITION
                        
                        2.1  Composition in percent for the complete database
                        
                        Ala (A) 7.79   Gln (Q) 3.92   Leu (L) 9.60   Ser (S) 6.89
                        Arg (R) 5.28   Glu (E) 6.59   Lys (K) 5.93   Thr (T) 5.47
                        Asn (N) 4.23   Gly (G) 6.93   Met (M) 2.37   Trp (W) 1.16
                        Asp (D) 5.30   His (H) 2.27   Phe (F) 4.03   Tyr (Y) 3.09
                        Cys (C) 1.56   Ile (I) 5.91   Pro (P) 4.85   Val (V) 6.70
                        
                        Asx (B) 0.000  Glx (Z) 0.000  Xaa (X) 0.01
                        
                        
                        2.2  Classification of the amino acids by their frequency
                        
                        Leu, Ala, Gly, Ser, Val, Glu, Lys, Ile, Thr, Asp, Arg, Pro, Asn, Phe,
                        Gln, Tyr, Met, His, Cys, Trp
                        
                        
                        3.  TAXONOMIC ORIGIN
                        
                        Total number of species represented in this release of Swiss-Prot: 8424
                        
                        The first twenty species represent 57715 sequences:  39.3 % of the total
                        number of entries.
                        
                        
                        3.1 Table of the frequency of occurrence of species
                        
                        Species represented 1x: 4065
                                            2x: 1283
                                            3x:  661
                                            4x:  431
                                            5x:  269
                                            6x:  262
                                            7x:  197
                                            8x:  149
                                            9x:  127
                                           10x:   88
                                       11- 20x:  344
                                       21- 50x:  250
                                       51-100x:   93
                                         >100x:  205
                        
                        
                        3.2  Table of the most represented species
                        
                        ------  ---------  --------------------------------------------
                        Number  Frequency  Species
                        ------  ---------  --------------------------------------------
                        1      10691  Homo sapiens (Human)
                        2       7326  Mus musculus (Mouse)
                        3       4930  Saccharomyces cerevisiae (Baker's yeast)
                        4       4835  Escherichia coli
                        5       3726  Rattus norvegicus (Rat)
                        6       2712  Bacillus subtilis
                        7       2591  Arabidopsis thaliana (Mouse-ear cress)
                        8       2458  Caenorhabditis elegans
                        9       2386  Schizosaccharomyces pombe (Fission yeast)
                        10       1967  Drosophila melanogaster (Fruit fly)
                        11       1773  Haemophilus influenzae
                        12       1772  Methanococcus jannaschii
                        13       1647  Escherichia coli O157:H7
                        14       1438  Bos taurus (Bovine)
                        15       1406  Salmonella typhimurium
                        16       1393  Mycobacterium tuberculosis
                        17       1284  Escherichia coli O6
                        18       1210  Shigella flexneri
                        19       1090  Gallus gallus (Chicken)
                        20       1080  Mycobacterium bovis
                        21        980  Salmonella typhi
                        22        962  Pseudomonas aeruginosa
                        23        941  Synechocystis sp. (strain PCC 6803)
                        24        937  Archaeoglobus fulgidus
                        25        873  Xenopus laevis (African clawed frog)
                        26        850  Sus scrofa (Pig)
                        27        766  Rhizobium meliloti (Sinorhizobium meliloti)
                        28        743  Vibrio cholerae
                        29        738  Aquifex aeolicus
                        30        725  Oryctolagus cuniculus (Rabbit)
                        31        695  Yersinia pestis
                        32        687  Mycoplasma pneumoniae
                        33        647  Pasteurella multocida
                        34        605  Mycobacterium leprae
                        35        603  Treponema pallidum
                        36        601  Streptomyces coelicolor
                        37        586  Bacillus halodurans
                        38        572  Buchnera aphidicola (subsp. Acyrthosiphon pisum)
                        39        570  Vibrio parahaemolyticus
                        40        560  Buchnera aphidicola (subsp. Schizaphis graminum)
                        41        557  Methanobacterium thermoautotrophicum
                        42        557  Helicobacter pylori (Campylobacter pylori)
                        43        543  Rickettsia prowazekii
                        44        542  Anabaena sp. (strain PCC 7120)
                        45        538  Helicobacter pylori J99 (Campylobacter pylori J99)
                        46        518  Vibrio vulnificus
                        47        504  Staphylococcus aureus (strain Mu50 / ATCC 700699)
                        48        503  Staphylococcus aureus (strain N315)
                        49        499  Zea mays (Maize)
                        50        495  Lactococcus lactis (subsp. lactis) (Streptococcus lactis)
                        51        487  Staphylococcus aureus (strain MW2)
                        52        486  Mycoplasma genitalium
                        53        467  Ralstonia solanacearum (Pseudomonas solanacearum)
                        54        464  Staphylococcus epidermidis
                        55        463  Listeria monocytogenes
                        56        459  Neisseria meningitidis (serogroup B)
                        57        457  Listeria innocua
                        58        457  Neisseria meningitidis (serogroup A)
                        59        449  Pseudomonas putida (strain KT2440)
                        60        448  Thermotoga maritima
                        61        447  Rhizobium loti (Mesorhizobium loti)
                        62        447  Agrobacterium tumefaciens (strain C58 / ATCC 33970)
                        63        443  Xanthomonas campestris (pv. campestris)
                        64        443  Clostridium acetobutylicum
                        65        438  Pseudomonas syringae (pv. tomato)
                        66        434  Caulobacter crescentus
                        67        424  Oryza sativa (Rice)
                        68        419  Deinococcus radiodurans
                        69        417  Chlamydia trachomatis
                        70        416  Streptococcus pneumoniae
                        71        414  Borrelia burgdorferi (Lyme disease spirochete)
                        72        412  Xylella fastidiosa
                        73        411  Canis familiaris (Dog)
                        74        407  Xanthomonas axonopodis (pv. citri)
                        75        406  Pyrococcus horikoshii
                        76        405  Chlamydia pneumoniae (Chlamydophila pneumoniae)
                        77        403  Rhizobium sp. (strain NGR234)
                        78        400  Buchnera aphidicola (subsp. Baizongia pistaciae)
                        79        400  Pyrococcus abyssi
                        80        400  Xylella fastidiosa (strain Temecula1 / ATCC 700964)
                        81        395  Chlamydia muridarum
                        82        382  Clostridium perfringens
                        83        377  Brucella melitensis
                        84        375  Brucella suis
                        85        374  Bradyrhizobium japonicum
                        86        371  Corynebacterium glutamicum (Brevibacterium flavum)
                        87        365  Halobacterium sp. (strain NRC-1 / ATCC 700922 / JCM 11081)
                        88        362  Campylobacter jejuni
                        89        361  Methanosarcina acetivorans
                        90        356  Methanosarcina mazei (Methanosarcina frisia)
                        91        355  Nicotiana tabacum (Common tobacco)
                        92        355  Pyrococcus furiosus
                        93        353  Sulfolobus solfataricus
                        94        353  Thermoanaerobacter tengcongensis
                        95        348  Streptococcus pyogenes
                        96        343  Rickettsia conorii
                        97        342  Ovis aries (Sheep)
                        98        330  Lactobacillus plantarum
                        99        330  Shewanella oneidensis
                        100        321  Aeropyrum pernix
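The statement in section 3 that the first twenty species account for 57715 sequences (39.3 % of the total) can be re-derived from the first twenty rows of the table. This is a short sketch using only figures quoted in these release notes; the variable names are illustrative.

```python
# Frequencies of the twenty most represented species, copied from table 3.2.
top_twenty = [
    10691, 7326, 4930, 4835, 3726, 2712, 2591, 2458, 2386, 1967,
    1773, 1772, 1647, 1438, 1406, 1393, 1284, 1210, 1090, 1080,
]
total_entries = 146_720  # total number of entries in release 43.0

top_twenty_total = sum(top_twenty)
share = top_twenty_total / total_entries * 100
print(top_twenty_total, f"{share:.1f}%")  # prints 57715 39.3%
```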
                        
                        
                        
                        3.3  Taxonomic distribution of the sequences
                        
                        Kingdom        sequences (% of the database)
                        Archaea            8393 (  6%)
                        Bacteria          62334 ( 42%)
                        Eukaryota         67392 ( 46%)
                        Viruses            8601 (  6%)
                        
                        
                        Within Eukaryota:
                        
                        Category            sequences (% of Eukaryota) (% of the complete database)
                        Human                  10691 ( 16%)           (  7%)
                        Other Mammalia         18197 ( 27%)           ( 12%)
                        Other Vertebrata        6287 (  9%)           (  4%)
                        Viridiplantae          10743 ( 16%)           (  7%)
                        Fungi                   9849 ( 15%)           (  7%)
                        Insecta                 3685 (  5%)           (  3%)
                        Nematoda                2692 (  4%)           (  2%)
                        Other                   5248 (  8%)           (  4%)
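The kingdom percentages in section 3.3 follow from dividing each count by the total number of entries. The sketch below assumes that the published figures are rounded to the nearest whole percent, which matches the table but is not stated in the notes.

```python
# Sequence counts per kingdom, copied from section 3.3.
kingdom_counts = {
    "Archaea": 8393,
    "Bacteria": 62334,
    "Eukaryota": 67392,
    "Viruses": 8601,
}

total = sum(kingdom_counts.values())

# Share of the database per kingdom, rounded to a whole percent (assumed rounding).
percentages = {name: round(count / total * 100) for name, count in kingdom_counts.items()}

for name, pct in percentages.items():
    print(f"{name:<10} {kingdom_counts[name]:>6} ({pct:>3}%)")
```

The four counts sum to the 146720 entries reported for the release, so every entry is assigned to exactly one of these categories.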
                        
                        
                        4.  SEQUENCE SIZE
                        
                        Repartition of the sequences by size (excluding fragments)
                        
                        From   To  Number             From   To   Number
                           1-  50    2672             1001-1100     1294
                          51- 100    9908             1101-1200      921
                         101- 150   14435             1201-1300      686
                         151- 200   13501             1301-1400      496
                         201- 250   14264             1401-1500      388
                         251- 300   12453             1501-1600      250
                         301- 350   12935             1601-1700      185
                         351- 400   11893             1701-1800      135
                         401- 450    9066             1801-1900      150
                         451- 500    7801             1901-2000      120
                         501- 550    5961             2001-2100       70
                         551- 600    3965             2101-2200      108
                         601- 650    3391             2201-2300      100
                         651- 700    2385             2301-2400       59
                         701- 750    2073             2401-2500       62
                         751- 800    1741             >2500          386
                         801- 850    1359
                         851- 900    1418
                         901- 950    1006
                         951-1000     862
                        
                        
                        The average sequence length in Swiss-Prot is 368 amino acids.
                        
                        The shortest sequence is   GWA_SEPOF (P83570):     2 amino acids.
                        The longest sequence is   SNE1_HUMAN (Q8NF91):  8797 amino acids.
                        
                        
                        5.  JOURNAL CITATIONS
                        
                        Note: the following citation statistics reflect the number of distinct
                        journal citations.
                        
                        Total number of journals cited in this release of Swiss-Prot: 1437
                        
                        
                        5.1 Table of the frequency of journal citations
                        
                        Journals cited 1x:  529
                                       2x:  181
                                       3x:  103
                                       4x:   66
                                       5x:   56
                                       6x:   35
                                       7x:   32
                                       8x:   26
                                       9x:   24
                                      10x:   15
                                  11- 20x:  110
                                  21- 50x:  110
                                  51-100x:   46
                                    >100x:  104
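
As a quick consistency check, the frequency classes in table 5.1 should add up to the total of 1437 journals stated above. A minimal Python sketch, with the counts copied by hand from the table (the variable names are illustrative only):

```python
# Counts copied from table 5.1 (citation-frequency class -> number of journals).
journal_classes = {
    "1x": 529, "2x": 181, "3x": 103, "4x": 66, "5x": 56,
    "6x": 35, "7x": 32, "8x": 26, "9x": 24, "10x": 15,
    "11-20x": 110, "21-50x": 110, "51-100x": 46, ">100x": 104,
}

total_journals = sum(journal_classes.values())
print(total_journals)  # 1437, the total stated above

# Share of journals that are cited only once, in percent
single_share = 100 * journal_classes["1x"] / total_journals
print(f"{single_share:.1f}%")  # 36.8%
```

The ">100x" class (104 journals) also corresponds to the number of rows in the list of most cited journals in 5.2.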
                        
                        
                        5.2  List of the most cited journals in Swiss-Prot
                        
                        Nb    Citations   Journal name
                        --    ---------   -------------------------------------------------------------
                        1        10139   Journal of Biological Chemistry
                        2         5380   Proceedings of the National Academy of Sciences of the U.S.A.
                        3         3835   Journal of Bacteriology
                        4         3693   Nucleic Acids Research
                        5         3568   Gene
                        6         2843   FEBS Letters
                        7         2789   Biochemical and Biophysical Research Communications
                        8         2573   Biochemistry
                        9         2543   European Journal of Biochemistry
                        10         2361   The EMBO Journal
                        11         2233   Nature
                        12         2157   Biochimica et Biophysica Acta
                        13         1949   Journal of Molecular Biology
                        14         1886   Genomics
                        15         1728   Cell
                        16         1700   Molecular and Cellular Biology
                        17         1353   Biochemical Journal
                        18         1293   Science
                        19         1153   Plant Molecular Biology
                        20         1147   Molecular Microbiology
                        21         1141   Molecular and General Genetics
                        22          887   Journal of Biochemistry
                        23          868   Virology
                        24          834   Human Molecular Genetics
                        25          788   Journal of Cell Biology
                        26          745   Nature Genetics
                        27          682   Genes and Development
                        28          657   Journal of Virology
                        29          641   The American Journal of Human Genetics
                        30          639   Plant Physiology
                        31          626   Human Mutation
                        32          621   Oncogene
                        33          568   Infection and Immunity
                        34          566   Journal of Immunology
                        35          551   Yeast
                        36          519   Journal of General Virology
                        37          517   Structure
                        38          505   Archives of Biochemistry and Biophysics
                        39          488   Microbiology
                        40          475   FEMS Microbiology Letters
                        41          475   Development
                        42          436   Nature Structural Biology
                        43          423   Genetics
                        44          416   Human Genetics
                        45          399   Current Genetics
                        46          383   Blood
                        47          367   Molecular and Biochemical Parasitology
                        48          345   Applied and Environmental Microbiology
                        49          336   Journal of Clinical Investigation
                        50          318   Mammalian Genome
                        51          316   Molecular Endocrinology
                        52          314   Developmental Biology
                        53          310   Protein Science
                        54          297   Immunogenetics
                        55          297   DNA and Cell Biology
                        56          293   Cancer Research
                        57          291   Journal of Molecular Evolution
                        58          279   Neuron
                        59          274   The Journal of Experimental Medicine
                        60          274   Molecular Biology of the Cell
                        61          271   Biological Chemistry Hoppe-Seyler
                        62          269   Mechanisms of Development
                        63          265   Acta Crystallographica, Section D
                        64          265   The Plant Cell
                        65          250   Endocrinology
                        66          246   Journal of Cell Science
                        67          239   DNA Sequence
                        68          234   The Plant Journal
                        69          232   Journal of General Microbiology
                        70          223   Journal of Neuroscience
                        71          222   Molecular Biology and Evolution
                        72          213   Hoppe-Seyler's Zeitschrift fur Physiologische Chemie
                        73          212   Journal of Neurochemistry
                        74          208   Brain Research. Molecular Brain Research
                        75          206   The Journal of Clinical Endocrinology and Metabolism
                        76          194   Cytogenetics and Cell Genetics
                        77          182   Toxicon
                        78          177   Comparative Biochemistry and Physiology
                        79          175   American Journal of Physiology
                        80          174   Bioscience, Biotechnology, and Biochemistry
                        81          167   Molecular Cell
                        82          163   Molecular Pharmacology
                        83          160   Antimicrobial Agents and Chemotherapy
                        84          158   Current Biology
                        85          156   DNA
                        86          144   Journal of Investigative Dermatology
                        87          142   Tissue Antigens
                        88          141   DNA Research
                        89          140   Proteins
                        90          136   Molecular Plant-Microbe Interactions
                        91          136   Biochimie
                        92          132   Peptides
                        93          132   Virus Research
                        94          132   Journal of Medical Genetics
                        95          129   Bioorganicheskaia Khimiia
                        96          125   American Journal of Medical Genetics
                        97          124   Genome Research
                        98          120   Hemoglobin
                        99          117   Molecular and Cellular Endocrinology
                        100          114   Agricultural and Biological Chemistry
                        101          108   Biology of Reproduction
                        102          107   Plant and Cell Physiology
                        103          105   European Journal of Immunology
                        104          102   Archives of Microbiology
                        
                        
                        6.  STATISTICS FOR SOME LINE TYPES
                        
                        The following table summarizes the total number of some Swiss-Prot lines,
                        as well as the number of entries with at least one such line, and the
                        frequency of the lines.
                        
                                                           Total    Number of  Average
                        Line type / subtype                number   entries    per entry
                        ---------------------------------  -------- ---------  ---------
                        
                        References (RL)                     283423              1.93
                        Journal                          250489    136613    1.71
                        Submitted to EMBL/GenBank/DDBJ    30227     25422    0.21
                        Unpublished observations            536       532   <0.01
                        Submitted to Swiss-Prot             527       525   <0.01
                        Plant Gene Register                 487       476   <0.01
                        Book citation                       465       453   <0.01
                        Thesis                              263       261   <0.01
                        Submitted to other databases        203       202   <0.01
                        Unpublished results                 127       125   <0.01
                        Patent                               97        96   <0.01
                        Worm Breeder's Gazette                2         2   <0.01
                        
                        Comments (CC)                       512278              3.49
                        SIMILARITY                       147106    127783    1.00
                        FUNCTION                          93944     92443    0.64
                        SUBCELLULAR LOCATION              69668     69668    0.47
                        CATALYTIC ACTIVITY                51230     48248    0.35
                        SUBUNIT                           44297     44297    0.30
                        PATHWAY                           24285     23209    0.17
                        COFACTOR                          17058     17058    0.12
                        TISSUE SPECIFICITY                16130     16130    0.11
                        PTM                                9012      8155    0.06
                        MISCELLANEOUS                      8738      8050    0.06
                        ALTERNATIVE PRODUCTS               5272      5272    0.04
                        DOMAIN                             4975      4457    0.03
                        CAUTION                            4387      4105    0.03
                        INDUCTION                          4067      4067    0.03
                        DEVELOPMENTAL STAGE                3868      3868    0.03
                        DISEASE                            2514      1876    0.02
                        ENZYME REGULATION                  2012      2012    0.01
                        DATABASE                           1294      1217    0.01
                        MASS SPECTROMETRY                  1213      1078    0.01
                        POLYMORPHISM                        454       444   <0.01
                        ALLERGEN                            335       335   <0.01
                        RNA EDITING                         295       295   <0.01
                        BIOTECHNOLOGY                        77        77   <0.01
                        PHARMACEUTICAL                       47        47   <0.01
                        
                        Features (FT)                       831689              5.67
                        DOMAIN                           116075     35669    0.79
                        TRANSMEM                          91827     20000    0.63
                        TURN                              62474      4662    0.43
                        STRAND                            57252      4163    0.39
                        CONFLICT                          55195     19373    0.38
                        METAL                             54792     13543    0.37
                        CARBOHYD                          50364     12429    0.34
                        DISULFID                          46118     12311    0.31
                        HELIX                             45117      4520    0.31
                        REPEAT                            31810      4634    0.22
                        ACT_SITE                          31418     19008    0.21
                        VARIANT                           27420      5089    0.19
                        CHAIN                             26007     21096    0.18
                        NP_BIND                           18909     13304    0.13
                        SIGNAL                            16306     16304    0.11
                        MOD_RES                           14982      8452    0.10
                        NON_TER                           10597      8092    0.07
                        SITE                              10154      6238    0.07
                        VARSPLIC                           9968      4490    0.07
                        BINDING                            9770      7652    0.07
                        ZN_FING                            9215      3288    0.06
                        MUTAGEN                            6587      1880    0.04
                        INIT_MET                           6135      6090    0.04
                        PROPEP                             5196      4409    0.04
                        DNA_BIND                           4618      4327    0.03
                        LIPID                              4175      2790    0.03
                        PEPTIDE                            2806      1101    0.02
                        TRANSIT                            2791      2766    0.02
                        CA_BIND                            1840       792    0.01
                        NON_CONS                            835       428    0.01
                        CROSSLNK                            458       360   <0.01
                        UNSURE                              315       131   <0.01
                        SE_CYS                              163       108   <0.01
                        
                        Cross-references (DR)              1336204              9.11
                        EMBL                             284537    140087    1.94
                        InterPro                         264209    129846    1.80
                        Pfam                             168429    124579    1.15
                        PROSITE                          128678     81216    0.88
                        PIR                               88842     81354    0.61
                        PRINTS                            47994     42345    0.33
                        GO                                47413     14722    0.32
                        SMART                             42918     32422    0.29
                        HAMAP                             40549     40436    0.28
                        TIGRFAMs                          40318     37407    0.27
                        HSSP                              38738     38738    0.26
                        ProDom                            36531     35107    0.25
                        PDB                               22244      6010    0.15
                        TIGR                              14632     14556    0.10
                        Genew                              9613      9565    0.07
                        MIM                                9433      7904    0.06
                        MGD                                6973      6952    0.05
                        SGD                                4973      4919    0.03
                        GermOnline                         4927      4876    0.03
                        EcoGene                            4227      4225    0.03
                        MEROPS                             3454      3339    0.02
                        WormPep                            2730      2439    0.02
                        SubtiList                          2667      2666    0.02
                        TRANSFAC                           2648      2373    0.02
                        FlyBase                            2520      2446    0.02
                        GeneDB_SPombe                      2399      2369    0.02
                        RGD                                2297      2297    0.02
                        TubercuList                        1421      1385    0.01
                        StyGene                            1362      1359    0.01
                        PIRSF                              1168      1168    0.01
                        SWISS-2DPAGE                       1075      1075    0.01
                        ListiList                           921       860    0.01
                        Leproma                             609       605   <0.01
                        GK                                  594       594   <0.01
                        Gramene                             556       552   <0.01
                        MaizeDB                             411       406   <0.01
                        HIV                                 370       354   <0.01
                        REBASE                              361       356   <0.01
                        ECO2DBASE                           351       299   <0.01
                        DictyBase                           321       319   <0.01
                        ZFIN                                260       260   <0.01
                        GlycoSuiteDB                        259       259   <0.01
                        PHCI-2DPAGE                         214       214   <0.01
                        SagaList                            205       204   <0.01
                        PhotoList                           175       175   <0.01
                        MypuList                            159       159   <0.01
                        Aarhus/Ghent-2DPAGE                 128        98   <0.01
                        Siena-2DPAGE                        103       103   <0.01
                        HSC-2DPAGE                           85        85   <0.01
                        PhosSite                             53        53   <0.01
                        COMPLUYEAST-2DPAGE                   50        50   <0.01
                        PMMA-2DPAGE                          48        48   <0.01
                        Maize-2DPAGE                         39        39   <0.01
                        ANU-2DPAGE                           13        13   <0.01
                        
                        
                        7.  MISCELLANEOUS STATISTICS
                        
                        Total number of distinct authors cited in Swiss-Prot: 180569
                        
                        Total number of entries encoded on a chloroplast: 3494
                        Total number of entries encoded on a mitochondrion: 2886
                        Total number of entries encoded on a cyanelle: 145
                        Total number of entries encoded on a plasmid: 2736
                        
                        Number of fragments: 8221
                        Number of additional sequences encoded on splice variants: 7776
                        
                     
We need your help

We welcome feedback from our users. We would especially appreciate your notifying us if you find that sequences belonging to your field of expertise are missing from the database. We also would like to be notified about annotations to be updated, if, for example, the function of a protein has been clarified or if new information about post-translational modifications has become available. To facilitate this feedback we offer, on the ExPASy WWW server, a form that allows the submission of updates and/or corrections to Swiss-Prot:

It is also possible, from any entry in Swiss-Prot displayed by the ExPASy server, to submit updates and/or corrections for that particular entry. Finally, you can also send your comments by electronic mail to the address:

Note that all update requests are assigned a unique identifier of the form UR-Xnnnn (example: UR-A0123). This identifier is used internally by the Swiss-Prot staff at SIB and EBI to track requests and is also used in e-mail exchanges with the persons who have submitted a request.
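
For anyone who tracks such requests in their own scripts, the identifier format can be checked mechanically. The sketch below assumes that "X" stands for a single uppercase letter and "nnnn" for exactly four digits, which is what the example UR-A0123 suggests but is not stated explicitly; the function name is hypothetical.

```python
import re

# Assumed pattern: "UR-", one uppercase letter, four digits (inferred from UR-A0123).
UPDATE_REQUEST_ID = re.compile(r"UR-[A-Z][0-9]{4}")


def is_update_request_id(text: str) -> bool:
    """Return True if text looks like an update request identifier."""
    return UPDATE_REQUEST_ID.fullmatch(text) is not None


print(is_update_request_id("UR-A0123"))  # True
print(is_update_request_id("UR-0123"))   # False
```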

Swiss-Prot release 42.0

Published October 1, 2003
  ------------------------------------------------------------------------
                                          Swiss-Prot Protein Knowledgebase
                                                             Release Notes
                                                 Release 42, October  2003
  ------------------------------------------------------------------------


                             Table of contents

1) Warning
2) Introduction
3) Status of the model organisms
4) We need your help
5) Some statistics


1) WARNING

Please note that the format of the release notes has changed. We now also
make updated documents available between major Swiss-Prot releases. The
distinct sections of this document have moved to the following sites:

    * Description of the changes made to Swiss-Prot since the last
      release: http://www.expasy.org/sprot/relnotes/sp_news.html. This
      new document contains all recent modifications in Swiss-Prot,
      including minor changes with no impact on the work of software
      developers. It therefore contains more information than is
      announced in the document 'sp_soon.html' (see below).

    * Forthcoming changes: all modifications that have an impact on
      the Swiss-Prot format are announced in the document:
      http://www.expasy.org/sprot/relnotes/sp_soon.html.

    * Status of the documentation files:
      http://www.expasy.org/sprot/userman.html#documentation

    * The ExPASy World-Wide Web server:
          o Explicit general and continuously updated documentation:
            http://www.expasy.org/doc/expasy.pdf
          o History of changes, improvements and new features:
            http://www.expasy.org/history.html
          o Swiss-Flash, a service that reports news of databases,
            software and service developments:
            http://www.expasy.org/swiss-flash/

    * TrEMBL - a supplement to Swiss-Prot:
      ftp://ftp.ebi.ac.uk/pub/databases/trembl/relnotes.txt

    * FTP access to Swiss-Prot and TrEMBL:
      http://www.expasy.org/sprot/userman.html#ftp and
      http://www.expasy.org/sprot/download.html

    * PROSITE release notes:
      http://www.expasy.org/prosite/psrelnot.html

    * Release statistics: http://www.expasy.org/sprot/relnotes/relstat.html

    * Relationships between Swiss-Prot and some biomolecular databases:
      http://www.expasy.org/sprot/userman.html#relship


2) INTRODUCTION

Release 42.0 of Swiss-Prot contains 135'850 sequence entries, comprising
50'046'799 amino acids abstracted from 109'694 references. This
represents an increase of 11% over release 41.0. The growth of the
database is summarized below.

      Release    Date   Number of   Number of
                         entries   amino acids
        2.0     09/86      3'939      900'163
        3.0     11/86      4'160      969'641
        4.0     04/87      4'387    1'036'010
        5.0     09/87      5'205    1'327'683
        6.0     01/88      6'102    1'653'982
        7.0     04/88      6'821    1'885'771
        8.0     08/88      7'724    2'224'465
        9.0     11/88      8'702    2'498'140
        10.0    03/89     10'008    2'952'613
        11.0    07/89     10'856    3'265'966
        12.0    10/89     12'305    3'797'482
        13.0    01/90     13'837    4'347'336
        14.0    04/90     15'409    4'914'264
        15.0    08/90     16'941    5'486'399
        16.0    11/90     18'364    5'986'949
        17.0    02/91     20'024    6'524'504
        18.0    05/91     20'772    6'792'034
        19.0    08/91     21'795    7'173'785
        20.0    11/91     22'654    7'500'130
        21.0    03/92     23'742    7'866'596
        22.0    05/92     25'044    8'375'696
        23.0    08/92     26'706    9'011'391
        24.0    12/92     28'154    9'545'427
        25.0    04/93     29'955   10'214'020
        26.0    07/93     31'808   10'875'091
        27.0    10/93     33'329   11'484'420
        28.0    02/94     36'000   12'496'420
        29.0    06/94     38'303   13'464'008
        30.0    10/94     40'292   14'147'368
        31.0    02/95     43'470   15'335'248
        32.0    11/95     49'340   17'385'503
        33.0    02/96     52'205   18'531'384
        34.0    10/96     59'021   21'210'389
        35.0    11/97     69'113   25'083'768
        36.0    07/98     74'019   26'840'295
        37.0    12/98     77'977   28'268'293
        38.0    07/99     80'000   29'085'965
        39.0    05/00     86'593   31'411'114
        40.0    10/01    101'602   37'315'215
        41.0    02/03    122'564   44'986'459
        42.0    10/03    135'850   50'046'799
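
The 11% figure can be reproduced from the last two rows of the table. A short Python check, with the numbers copied from the rows for releases 41.0 and 42.0 (the helper name is illustrative):

```python
def percent_increase(old: int, new: int) -> float:
    """Relative growth from old to new, in percent."""
    return 100 * (new - old) / old


entries_41, entries_42 = 122_564, 135_850
residues_41, residues_42 = 44_986_459, 50_046_799

print(round(percent_increase(entries_41, entries_42), 1))    # 10.8
print(round(percent_increase(residues_41, residues_42), 1))  # 11.2

# Average sequence length in release 42.0, in amino acids
print(round(residues_42 / entries_42))  # 368
```

Both the number of entries and the number of amino acids round to an increase of 11%.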


3) STATUS OF THE MODEL ORGANISMS

We have selected a number of organisms that are the target of genome
sequencing and/or mapping projects and for which we intend to:

    * be as complete as possible. All sequences available at a given
      time should be immediately included in Swiss-Prot. This also
      includes sequence corrections and updates;
    * provide a higher level of annotation;
    * provide cross-references to specialized database(s) that contain,
      among other data, some information about the genes that code for
      these proteins;
    * provide specific indexes and documents.

From our efforts to annotate human sequence entries as completely as
possible arose the HPI project <http://www.expasy.org/sprot/hpi/>, and
the bacterial model organisms became the focus of the HAMAP project
<http://www.expasy.org/sprot/hamap/>. Here is the current status of the
model organisms which are not covered by these two projects:


      Organism        Database           Index file      Number of
                      cross-references                   sequences
      ------------    ----------------   --------------  ---------
      A.thaliana      None yet           arath.txt           2'294
      C.albicans      None yet           calbican.txt          272
      C.elegans       Wormpep            celegans.txt        2'405
      D.discoideum    DictyDB            dicty.txt             317
      D.melanogaster  FlyBase            fly.txt             1'907
      M.musculus      MGD                mgdtosp.txt         6'890
      S.cerevisiae    SGD                yeast.txt           4'920
      S.pombe         GeneDB_SPombe      pombe.txt           2'267


4) WE NEED YOUR HELP

We welcome feedback from our users. We would especially appreciate your
notifying us if you find that sequences belonging to your field of
expertise are missing from the database. We also would like to be
notified about annotations to be updated, if, for example, the function
of a protein has been clarified or if new information about
post-translational modifications has become available. To facilitate
this feedback we offer, on the ExPASy WWW server, a form that allows the
submission of updates and/or corrections to Swiss-Prot:

      http://www.expasy.org/sprot/update.html

It is also possible, from any entry in Swiss-Prot displayed by the
ExPASy server, to submit updates and/or corrections for that particular
entry. Finally, you can also send your comments by electronic mail to
the address:

      swiss-prot@expasy.org <mailto:swiss-prot@expasy.org>

Note that all update requests are assigned a unique identifier of the
form UR-Xnnnn (example: UR-A0123). This identifier is used internally by
the Swiss-Prot staff at SIB and EBI to track requests and is also used
in e-mail exchanges with the persons who have submitted a request.


5) SOME STATISTICS

1.  AMINO ACID COMPOSITION

   1.1  Composition in percent for the complete database

   Ala (A) 7.76   Gln (Q) 3.92   Leu (L) 9.60   Ser (S) 6.94
   Arg (R) 5.25   Glu (E) 6.55   Lys (K) 5.94   Thr (T) 5.49
   Asn (N) 4.25   Gly (G) 6.90   Met (M) 2.37   Trp (W) 1.18
   Asp (D) 5.29   His (H) 2.27   Phe (F) 4.05   Tyr (Y) 3.11
   Cys (C) 1.57   Ile (I) 5.90   Pro (P) 4.87   Val (V) 6.67

   Asx (B) 0.000  Glx (Z) 0.000  Xaa (X) 0.01



   1.2  Classification of the amino acids by their frequency

   Leu, Ala, Ser, Gly, Val, Glu, Lys, Ile, Thr, Asp, Arg, Pro, Asn, Phe,
   Gln, Tyr, Met, His, Cys, Trp
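
The ordering in 1.2 follows directly from the percentages in 1.1. A minimal Python sketch that sorts the twenty standard residues by their tabulated frequency (values copied from the table above; B, Z and X are left out):

```python
# Composition in percent, copied from table 1.1.
composition = {
    "Ala": 7.76, "Gln": 3.92, "Leu": 9.60, "Ser": 6.94,
    "Arg": 5.25, "Glu": 6.55, "Lys": 5.94, "Thr": 5.49,
    "Asn": 4.25, "Gly": 6.90, "Met": 2.37, "Trp": 1.18,
    "Asp": 5.29, "His": 2.27, "Phe": 4.05, "Tyr": 3.11,
    "Cys": 1.57, "Ile": 5.90, "Pro": 4.87, "Val": 6.67,
}

# Most frequent residue first
ranking = sorted(composition, key=composition.get, reverse=True)
print(", ".join(ranking))

# The rounded percentages add up to slightly less than 100
print(round(sum(composition.values()), 2))  # 99.88
```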


2.  TAXONOMIC ORIGIN

   Total number of species represented in this release of Swiss-Prot: 8294

   The first twenty species represent 55464 sequences:  40.8 % of the total
   number of entries.
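
Both numbers in this sentence can be recomputed from the first twenty rows of table 2.2 and the entry count given in the introduction (135'850). A short Python sketch, with the frequencies copied by hand:

```python
# Frequencies of the twenty most represented species (table 2.2, rows 1-20).
top20 = [
    10159, 6890, 4920, 4832, 3592, 2653, 2405, 2294, 2267, 1907,
    1773, 1603, 1591, 1425, 1385, 1367, 1234, 1083, 1082, 1002,
]
total_entries = 135_850  # release 42.0, from the introduction

top20_sequences = sum(top20)
top20_share = 100 * top20_sequences / total_entries

print(top20_sequences)         # 55464
print(f"{top20_share:.1f} %")  # 40.8 %
```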


   2.1 Table of the frequency of occurrence of species

        Species represented 1x: 4024
                            2x: 1258
                            3x:  656
                            4x:  417
                            5x:  269
                            6x:  257
                            7x:  195
                            8x:  153
                            9x:  131
                           10x:   74
                       11- 20x:  342
                       21- 50x:  246
                       51-100x:   90
                         >100x:  182


   2.2  Table of the most represented species

  ------  ---------  --------------------------------------------
  Number  Frequency  Species
  ------  ---------  --------------------------------------------
       1      10159  Homo sapiens (Human)
       2       6890  Mus musculus (Mouse)
       3       4920  Saccharomyces cerevisiae (Baker's yeast)
       4       4832  Escherichia coli
       5       3592  Rattus norvegicus (Rat)
       6       2653  Bacillus subtilis
       7       2405  Caenorhabditis elegans
       8       2294  Arabidopsis thaliana (Mouse-ear cress)
       9       2267  Schizosaccharomyces pombe (Fission yeast)
      10       1907  Drosophila melanogaster (Fruit fly)
      11       1773  Haemophilus influenzae
      12       1603  Escherichia coli O157:H7
      13       1591  Methanococcus jannaschii
      14       1425  Bos taurus (Bovine)
      15       1385  Mycobacterium tuberculosis
      16       1367  Salmonella typhimurium
      17       1234  Escherichia coli O6
      18       1083  Gallus gallus (Chicken)
      19       1082  Shigella flexneri
      20       1002  Mycobacterium bovis
      21        934  Synechocystis sp. (strain PCC 6803)
      22        933  Salmonella typhi
      23        929  Pseudomonas aeruginosa
      24        916  Archaeoglobus fulgidus
      25        865  Xenopus laevis (African clawed frog)
      26        831  Sus scrofa (Pig)
      27        746  Rhizobium meliloti (Sinorhizobium meliloti)
      28        730  Aquifex aeolicus
      29        714  Oryctolagus cuniculus (Rabbit)
      30        710  Vibrio cholerae
      31        687  Mycoplasma pneumoniae
      32        650  Yersinia pestis
      33        615  Pasteurella multocida
      34        600  Treponema pallidum
      35        600  Mycobacterium leprae
      36        578  Streptomyces coelicolor
      37        572  Buchnera aphidicola (subsp. Acyrthosiphon pisum)
      38        560  Buchnera aphidicola (subsp. Schizaphis graminum)
      39        551  Helicobacter pylori (Campylobacter pylori)
      40        543  Bacillus halodurans
      41        541  Rickettsia prowazekii
      42        532  Helicobacter pylori J99 (Campylobacter pylori J99)
      43        522  Vibrio parahaemolyticus
      44        521  Methanobacterium thermoautotrophicum
      45        516  Anabaena sp. (strain PCC 7120)
      46        499  Zea mays (Maize)
      47        486  Mycoplasma genitalium
      48        473  Vibrio vulnificus
      49        463  Lactococcus lactis (subsp. lactis) (Streptococcus lactis)
      50        443  Neisseria meningitidis (serogroup B)
      51        440  Neisseria meningitidis (serogroup A)
      52        435  Thermotoga maritima
      53        432  Ralstonia solanacearum (Pseudomonas solanacearum)
      54        422  Oryza sativa (Rice)
      55        419  Rhizobium loti (Mesorhizobium loti)
      56        418  Agrobacterium tumefaciens (strain C58 / ATCC 33970)
      57        417  Listeria monocytogenes
      58        416  Caulobacter crescentus
      59        414  Chlamydia trachomatis
      60        411  Xanthomonas campestris (pv. campestris)
      61        411  Borrelia burgdorferi (Lyme disease spirochete)
      62        410  Listeria innocua
      63        407  Clostridium acetobutylicum
      64        403  Rhizobium sp. (strain NGR234)
      65        402  Canis familiaris (Dog)
      66        401  Chlamydia pneumoniae (Chlamydophila pneumoniae)
      67        398  Xylella fastidiosa
      68        392  Chlamydia muridarum
      69        391  Pyrococcus horikoshii
      70        389  Streptococcus pneumoniae
      71        387  Xylella fastidiosa (strain Temecula1 / ATCC 700964)
      72        384  Pyrococcus abyssi
      73        375  Xanthomonas axonopodis (pv. citri)
      74        374  Deinococcus radiodurans
      75        370  Staphylococcus aureus (strain N315)
      76        368  Staphylococcus aureus (strain Mu50 / ATCC 700699)
      77        353  Clostridium perfringens
      78        352  Nicotiana tabacum (Common tobacco)
      79        351  Campylobacter jejuni
      80        347  Staphylococcus aureus (strain MW2)
      81        347  Corynebacterium glutamicum (Brevibacterium flavum)
      82        346  Halobacterium sp. (strain NRC-1 / ATCC 700922 / JCM 11081)
      83        341  Ovis aries (Sheep)
      84        332  Sulfolobus solfataricus
      85        329  Brucella melitensis
      86        329  Rickettsia conorii
      87        325  Pyrococcus furiosus
      88        317  Dictyostelium discoideum (Slime mold)
      89        317  Streptococcus pyogenes
      90        315  Thermoanaerobacter tengcongensis
      91        314  Methanosarcina mazei (Methanosarcina frisia)
      92        307  Aeropyrum pernix
      93        306  Neurospora crassa
      94        305  Methanosarcina acetivorans
      95        291  Pisum sativum (Garden pea)
      96        285  Staphylococcus aureus
      97        276  Chlorobium tepidum
      98        272  Candida albicans (Yeast)
      99        270  Streptococcus pyogenes (serotype M18)
     100        269  Bradyrhizobium japonicum


   2.3  Taxonomic distribution of the sequences

   Kingdom        sequences (% of the database)
    Archaea            7773 (  6%)
    Bacteria          54879 ( 40%)
    Eukaryota         64641 ( 48%)
    Viruses            8557 (  6%)

   Within Eukaryota:

    Category            sequences (% of Eukaryota) (% of the complete database)
     Human                  10159 ( 16%)           (  7%)
     Other Mammalia         17276 ( 27%)           ( 13%)
     Other Vertebrata        6106 (  9%)           (  4%)
     Viridiplantae          10227 ( 16%)           (  8%)
     Fungi                   9648 ( 15%)           (  7%)
     Insecta                 3568 (  6%)           (  3%)
     Nematoda                2633 (  4%)           (  2%)
     Other                   5024 (  8%)           (  4%)
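
The kingdom percentages above can be reproduced from the raw counts. The
following is a minimal sketch in Python; it assumes that the database total
is simply the sum of the four kingdom counts, and the variable names are
illustrative only.

```python
# Sequence counts per kingdom, copied from the table above.
KINGDOM_COUNTS = {
    "Archaea": 7773,
    "Bacteria": 54879,
    "Eukaryota": 64641,
    "Viruses": 8557,
}

# Assumption: the four kingdoms together make up the complete database.
total = sum(KINGDOM_COUNTS.values())

# Share of each kingdom, rounded to a whole percent as in the table.
percentages = {
    kingdom: round(100 * count / total)
    for kingdom, count in KINGDOM_COUNTS.items()
}

for kingdom, count in KINGDOM_COUNTS.items():
    print(f"{kingdom:<10} {count:>6} ({percentages[kingdom]:>3}%)")
```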


3.  SEQUENCE SIZE

   Distribution of the sequences by size (excluding fragments)

               From   To  Number             From   To   Number
                  1-  50    2456             1001-1100     1205
                 51- 100    9273             1101-1200      849
                101- 150   13503             1201-1300      610
                151- 200   12518             1301-1400      413
                201- 250   12918             1401-1500      331
                251- 300   11315             1501-1600      236
                301- 350   11678             1601-1700      179
                351- 400   11033             1701-1800      132
                401- 450    8414             1801-1900      138
                451- 500    7190             1901-2000      120
                501- 550    5591             2001-2100       68
                551- 600    3670             2101-2200      103
                601- 650    3131             2201-2300      100
                651- 700    2227             2301-2400       58
                701- 750    1974             2401-2500       57
                751- 800    1649             >2500          362
                801- 850    1278
                851- 900    1280
                901- 950     901
                951-1000     794

   The average sequence length in Swiss-Prot is 368 amino acids.

   The shortest sequence is  LUXE_VIBFI (P24272):     3 amino acids.
   The longest sequence is   SNE1_HUMAN (Q8NF91):  8797 amino acids.


4.  JOURNAL CITATIONS

   Note: the following citation statistics reflect the number of distinct
         journal citations.

   Total number of journals cited in this release of Swiss-Prot: 1392


   4.1 Table of the frequency of journal citations

        Journals cited 1x:  511
                       2x:  186
                       3x:   97
                       4x:   66
                       5x:   49
                       6x:   35
                       7x:   31
                       8x:   27
                       9x:   20
                      10x:   14
                  11- 20x:  111
                  21- 50x:  104
                  51-100x:   40
                    >100x:  101


   4.2  List of the most cited journals in Swiss-Prot

   Nb    Citations   Journal name
   --    ---------   -------------------------------------------------------------
    1         9693   Journal of Biological Chemistry
    2         5221   Proceedings of the National Academy of Sciences of the U.S.A.
    3         3748   Journal of Bacteriology
    4         3667   Nucleic Acids Research
    5         3493   Gene
    6         2744   FEBS Letters
    7         2707   Biochemical and Biophysical Research Communications
    8         2489   Biochemistry
    9         2486   European Journal of Biochemistry
   10         2279   The EMBO Journal
   11         2143   Nature
   12         2093   Biochimica et Biophysica Acta
   13         1893   Journal of Molecular Biology
   14         1831   Genomics
   15         1668   Cell
   16         1633   Molecular and Cellular Biology
   17         1309   Biochemical Journal
   18         1232   Science
   19         1132   Molecular and General Genetics
   20         1131   Plant Molecular Biology
   21         1109   Molecular Microbiology
   22          874   Journal of Biochemistry
   23          851   Virology
   24          790   Human Molecular Genetics
   25          748   Journal of Cell Biology
   26          696   Nature Genetics
   27          651   Genes and Development
   28          628   Journal of Virology
   29          615   Plant Physiology
   30          604   Human Mutation
   31          597   Oncogene
   32          587   The American Journal of Human Genetics
   33          551   Infection and Immunity
   34          546   Journal of Immunology
   35          542   Yeast
   36          510   Journal of General Virology
   37          489   Archives of Biochemistry and Biophysics
   38          486   Structure
   39          466   FEMS Microbiology Letters
   40          458   Microbiology
   41          448   Development
   42          404   Nature Structural Biology
   43          399   Human Genetics
   44          397   Genetics
   45          391   Current Genetics
   46          362   Molecular and Biochemical Parasitology
   47          359   Blood
   48          335   Applied and Environmental Microbiology
   49          321   Journal of Clinical Investigation
   50          310   Molecular Endocrinology
   51          304   Mammalian Genome
   52          299   Developmental Biology
   53          297   Protein Science
   54          291   DNA and Cell Biology
   55          287   Immunogenetics
   56          286   Journal of Molecular Evolution
   57          271   Biological Chemistry Hoppe-Seyler
   58          270   Cancer Research
   59          263   Neuron
   60          263   Journal of Experimental Medicine
   61          254   Mechanisms of Development
   62          248   Molecular Biology of the Cell
   63          240   Acta Crystallographica, Section D
   64          238   The Plant Cell
   65          237   Endocrinology
   66          232   DNA Sequence
   67          231   Journal of General Microbiology
   68          228   Journal of Cell Science
   69          213   Hoppe-Seyler's Zeitschrift fur Physiologische Chemie
   70          211   The Plant Journal
   71          209   Molecular Biology and Evolution
   72          204   Brain Research. Molecular Brain Research
   73          202   Journal of Neuroscience
   74          202   Journal of Neurochemistry
   75          189   The Journal of Clinical Endocrinology and Metabolism
   76          177   Cytogenetics and Cell Genetics
   77          167   Comparative Biochemistry and Physiology
   78          163   Toxicon
   79          162   Bioscience, Biotechnology, and Biochemistry
   80          156   DNA
   81          153   Molecular Pharmacology
   82          151   American Journal of Physiology
   83          150   Antimicrobial Agents and Chemotherapy
   84          141   Current Biology
   85          140   Tissue Antigens
   86          135   Molecular Cell
   87          133   Biochimie
   88          132   Proteins
   89          132   Virus Research
   90          132   DNA Research
   91          130   Molecular Plant-Microbe Interactions
   92          127   Bioorganicheskaia Khimiia
   93          124   Journal of Investigative Dermatology
   94          124   Peptides
   95          120   Hemoglobin
   96          115   Molecular and Cellular Endocrinology
   97          115   Genome Research
   98          114   Agricultural and Biological Chemistry
   99          112   American Journal of Medical Genetics
  100          108   Journal of Medical Genetics
  101          102   European Journal of Immunology


5.  STATISTICS FOR SOME LINE TYPES

The following table summarizes the total number of some Swiss-Prot lines,
as well as the number of entries with at least one such line, and the
frequency of the lines.

                                   Total    Number of  Average
Line type / subtype                number   entries    per entry
---------------------------------  -------- ---------  ---------

References (RL)                     262333              1.93
   Journal                          228944    125635    1.69
   Submitted to EMBL/GenBank/DDBJ    30739     25552    0.23
   Unpublished observations            521       517   <0.01
   Submitted to Swiss-Prot             517       515   <0.01
   Plant Gene Register                 485       473   <0.01
   Book citation                       465       453   <0.01
   Thesis                              248       246   <0.01
   Submitted to other databases        198       197   <0.01
   Unpublished results                 122       120   <0.01
   Patent                               92        91   <0.01
   Worm Breeder's Gazette                2         2   <0.01

Comments (CC)                       463600              3.41
   SIMILARITY                       134388    116832    0.99
   FUNCTION                          85939     84536    0.63
   SUBCELLULAR LOCATION              63137     63137    0.46
   CATALYTIC ACTIVITY                45791     43051    0.34
   SUBUNIT                           39346     39346    0.29
   PATHWAY                           21282     20412    0.16
   TISSUE SPECIFICITY                15006     15006    0.11
   COFACTOR                          14929     14929    0.11
   MISCELLANEOUS                      8291      7646    0.06
   PTM                                8152      7401    0.06
   ALTERNATIVE PRODUCTS               4673      4673    0.03
   DOMAIN                             4127      3805    0.03
   CAUTION                            3947      3715    0.03
   INDUCTION                          3856      3856    0.03
   DEVELOPMENTAL STAGE                3667      3667    0.03
   DISEASE                            2401      1885    0.02
   ENZYME REGULATION                  1913      1913    0.01
   MASS SPECTROMETRY                  1021       918    0.01
   DATABASE                            884       812    0.01
   POLYMORPHISM                        435       426   <0.01
   ALLERGEN                            313       313   <0.01
   BIOTECHNOLOGY                        55        55   <0.01
   PHARMACEUTICAL                       47        47   <0.01

Features (FT)                       720924              5.31
   DOMAIN                           107190     32438    0.79
   TRANSMEM                          85992     18759    0.63
   CONFLICT                          52547     18442    0.39
   METAL                             49380     12298    0.36
   CARBOHYD                          48156     11914    0.35
   DISULFID                          43940     11664    0.32
   TURN                              39112      2952    0.29
   STRAND                            36250      2640    0.27
   ACT_SITE                          28670     17375    0.21
   HELIX                             27705      2842    0.20
   VARIANT                           26136      4877    0.19
   CHAIN                             25029     20292    0.18
   REPEAT                            24898      4114    0.18
   NP_BIND                           17064     12177    0.13
   SIGNAL                            15697     15695    0.12
   MOD_RES                           14364      8131    0.11
   NON_TER                           10462      7968    0.08
   BINDING                            8851      6875    0.07
   VARSPLIC                           8719      3948    0.06
   ZN_FING                            8643      3067    0.06
   SITE                               8594      5257    0.06
   INIT_MET                           5915      5873    0.04
   MUTAGEN                            5538      1665    0.04
   PROPEP                             4936      4178    0.04
   DNA_BIND                           4398      4139    0.03
   LIPID                              3812      2533    0.03
   PEPTIDE                            2714      1085    0.02
   TRANSIT                            2693      2670    0.02
   CA_BIND                            1799       764    0.01
   NON_CONS                            823       425    0.01
   CROSSLNK                            438       345   <0.01
   UNSURE                              304       128   <0.01
   SE_CYS                              155       101   <0.01

Cross-references (DR)              1236729              9.10
   EMBL                             260342    129417    1.92
   InterPro                         245720    120841    1.81
   Pfam                             156422    115928    1.15
   PROSITE                          117863     74642    0.87
   PIR                               86853     79511    0.64
   PRINTS                            45909     40532    0.34
   GO                                44199     13630    0.33
   SMART                             40856     30862    0.30
   HSSP                              38180     38180    0.28
   TIGRFAMs                          35999     33456    0.26
   ProDom                            34326     32956    0.25
   HAMAP                             32758     32659    0.24
   PDB                               19661      5456    0.14
   TIGR                              12981     12906    0.10
   Genew                              8961      8913    0.07
   MIM                                8833      7540    0.07
   MGD                                6547      6528    0.05
   SGD                                4964      4910    0.04
   GermOnline                         4924      4874    0.04
   EcoGene                            4228      4226    0.03
   MEROPS                             3434      3319    0.03
   TRANSFAC                           2629      2357    0.02
   WormPep                            2619      2383    0.02
   SubtiList                          2608      2607    0.02
   FlyBase                            2425      2351    0.02
   GeneDB_SPombe                      2282      2252    0.02
   TubercuList                        1414      1377    0.01
   StyGene                            1323      1320    0.01
   SWISS-2DPAGE                        949       948    0.01
   ListiList                           828       769    0.01
   PIRSF                               679       679   <0.01
   GK                                  671       671   <0.01
   Leproma                             604       600   <0.01
   Gramene                             415       413   <0.01
   MaizeDB                             411       406   <0.01
   HIV                                 370       354   <0.01
   REBASE                              358       353   <0.01
   ECO2DBASE                           351       299   <0.01
   DictyDb                             320       317   <0.01
   GlycoSuiteDB                        259       259   <0.01
   ZFIN                                245       245   <0.01
   PHCI-2DPAGE                         212       212   <0.01
   MypuList                            145       145   <0.01
   SagaList                            134       134   <0.01
   Aarhus/Ghent-2DPAGE                 128        98   <0.01
   Siena-2DPAGE                        103       103   <0.01
   HSC-2DPAGE                           85        85   <0.01
   PhosSite                             53        53   <0.01
   COMPLUYEAST-2DPAGE                   50        50   <0.01
   PMMA-2DPAGE                          47        47   <0.01
   Maize-2DPAGE                         39        39   <0.01
   ANU-2DPAGE                           13        13   <0.01


6.  MISCELLANEOUS STATISTICS

Total number of distinct authors cited in Swiss-Prot: 173952

Total number of entries encoded on a chloroplast: 3319
Total number of entries encoded on a mitochondrion: 2765
Total number of entries encoded on a cyanelle: 145
Total number of entries encoded on a plasmid: 2705

Number of fragments: 8096
Number of additional sequences encoded on splice variants: 6909


--End of document--

  

Swiss-Prot release 41.0

Published February 1, 2003
  ------------------------------------------------------------------------
                                          Swiss-Prot Protein Knowledgebase
                                                             Release Notes
                                                 Release 41, February 2003
  ------------------------------------------------------------------------

                             Table of contents

 1   Introduction
 2   Description of the changes made to Swiss-Prot since release 40
 3   Forthcoming changes
 4   Status of the documentation files
 5   The ExPASy World-Wide Web server
 6   TrEMBL - a supplement to Swiss-Prot
 7   FTP access to Swiss-Prot and TrEMBL
 8   ENZYME and PROSITE
 9   We need your help!
 A   Appendix A


                             1   Introduction

Release 41.0 of Swiss-Prot contains 122'564 sequence entries, comprising
44'986'459 amino acids abstracted from 103'486 references. This represents
an increase of 20% over release 40.0. The growth of the database is
summarized below.

      Release    Date   Number of   Number of
                         entries   amino acids
        2.0     09/86      3'939      900'163
        3.0     11/86      4'160      969'641
        4.0     04/87      4'387    1'036'010
        5.0     09/87      5'205    1'327'683
        6.0     01/88      6'102    1'653'982
        7.0     04/88      6'821    1'885'771
        8.0     08/88      7'724    2'224'465
        9.0     11/88      8'702    2'498'140
        10.0    03/89     10'008    2'952'613
        11.0    07/89     10'856    3'265'966
        12.0    10/89     12'305    3'797'482
        13.0    01/90     13'837    4'347'336
        14.0    04/90     15'409    4'914'264
        15.0    08/90     16'941    5'486'399
        16.0    11/90     18'364    5'986'949
        17.0    02/91     20'024    6'524'504
        18.0    05/91     20'772    6'792'034
        19.0    08/91     21'795    7'173'785
        20.0    11/91     22'654    7'500'130
        21.0    03/92     23'742    7'866'596
        22.0    05/92     25'044    8'375'696
        23.0    08/92     26'706    9'011'391
        24.0    12/92     28'154    9'545'427
        25.0    04/93     29'955   10'214'020
        26.0    07/93     31'808   10'875'091
        27.0    10/93     33'329   11'484'420
        28.0    02/94     36'000   12'496'420
        29.0    06/94     38'303   13'464'008
        30.0    10/94     40'292   14'147'368
        31.0    02/95     43'470   15'335'248
        32.0    11/95     49'340   17'385'503
        33.0    02/96     52'205   18'531'384
        34.0    10/96     59'021   21'210'389
        35.0    11/97     69'113   25'083'768
        36.0    07/98     74'019   26'840'295
        37.0    12/98     77'977   28'268'293
        38.0    07/99     80'000   29'085'965
        39.0    05/00     86'593   31'411'114
        40.0    10/01    101'602   37'315'215
        41.0    02/03    122'564   44'986'459
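
The growth between releases 40.0 and 41.0 can be checked from the last two
rows of the table. The sketch below is illustrative only; the helper name
percent_increase is not part of any Swiss-Prot tool.

```python
# Entry and amino-acid counts copied from the last two rows of the table.
RELEASE_40 = {"entries": 101_602, "amino_acids": 37_315_215}
RELEASE_41 = {"entries": 122_564, "amino_acids": 44_986_459}


def percent_increase(old: int, new: int) -> float:
    """Return the relative growth from old to new, in percent."""
    return (new - old) / old * 100.0


entry_growth = percent_increase(RELEASE_40["entries"], RELEASE_41["entries"])
residue_growth = percent_increase(
    RELEASE_40["amino_acids"], RELEASE_41["amino_acids"]
)

print(f"entries:     +{entry_growth:.1f}%")
print(f"amino acids: +{residue_growth:.1f}%")
```

Both figures come out at roughly 20%, in line with the increase quoted in
the introduction.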


    2   Description of the changes made to Swiss-Prot since release 40

     2.1   Sequences and annotations

21'133 sequences have been added since release 40, the sequence data of
3'251 existing entries has been updated and the annotations of 57'525
entries have been revised.


     2.2   The HPI project

The Human Proteomics Initiative (HPI) devotes a major effort to the
annotation of all known human sequences according to the quality standards
of Swiss-Prot. This means that, for each known protein, a wealth of
information is provided, which includes the description of its function,
its domain structure, subcellular location, post-translational
modifications (PTMs), variants, similarities to other proteins, etc. This
not only implies the annotation of newly detected proteins, but also the
integration of new research data into the existing entries by specialized
biologists, who are in close contact with experts all over the world.

There are currently 9'172 annotated human sequences in Swiss-Prot.
Up-to-date detailed statistics concerning the HPI project are available at:

     http://www.expasy.org/sprot/hpi/hpi_stat.html

At the same time, two further efforts have been stepped up: the description
of human diseases associated with deficiencies in a protein, and the
annotation of mammalian orthologs of human proteins at a level equivalent
to that of the cognate human sequences.

For all aspects of the HPI project, we would appreciate the help and
collaboration of the scientific community. Information concerning the human
proteome is highly critical to a large section of the life science
community. We therefore appeal to the user community to fully participate
in this initiative by providing all the necessary information to define and
to speed up the comprehensive annotation of the human proteome.

For a detailed description of the HPI project please consult:

     http://www.expasy.org/sprot/hpi/


     2.3   The HAMAP project

The first complete microbial genome sequence was that of the bacterium
Haemophilus influenzae, which became available in 1995. Since then, more
than 100 bacterial and archaeal genomes have been sequenced and many more
sequencing projects of pathogenic and nonpathogenic microbes are in
progress. To date, the publicly available microbial genomes encode more
than 230'000 different proteins.

In order to handle the large amount of "raw" data coming from microbial
genome sequencing, the High quality Automated Microbial Annotation of
Proteomes (HAMAP) project was initiated. The project aims to automatically
annotate a significant percentage of the protein sequences that originate
from microbial genome sequencing projects.

To maintain a high quality of annotation, specific tools are
developed to deal with two completely separate subsets of bacterial and
archaeal proteins: proteins that have no recognizable similarity to any
other microbial or non-microbial proteins ("ORFans") and proteins that are
part of well-defined families or subfamilies. This is done by using a rule
system that describes the level and extent of annotations that can be
assigned by similarity with a prototype manually annotated entry. The
result is a curated entry whose quality is identical to that produced
manually by an expert annotator.

Programs under development are designed to recognize protein peculiarities,
and only proteins which match the defined criteria are processed
automatically. Protein sequences which fail to fit into the rule system are
further analyzed by Swiss-Prot expert annotators.

For a detailed description of the HAMAP project and its current status
please consult:

     http://www.expasy.org/sprot/hamap/

and:

Gattiker A., Michoud K., Rivoire C., Auchincloss A.H., Coudert E., Lima T.,
Kersey P., Pagni M., Sigrist C.J.A., Lachaize C., Veuthey A.-L., Bairoch A.
Automatic annotation of microbial proteomes in Swiss-Prot.
Comput. Biol. Chem. 27:49-58(2003).


     2.4   What's happening with the model organisms?

We have selected a number of organisms that are the target of genome
sequencing and/or mapping projects and for which we intend to:

   * be as complete as possible. All sequences available at a given time
     should be immediately included in Swiss-Prot. This also includes
     sequence corrections and updates;
   * provide a higher level of annotation;
   * provide cross-references to specialized database(s) that contain,
     among other data, some information about the genes that code for these
     proteins;
   * provide specific indexes and documents.

From our efforts to annotate human sequence entries as completely as
possible arose the HPI project (see 2.2), and the bacterial model organisms
became the focus of the HAMAP project (see 2.3). Here is the current status
of the model organisms which are not covered by these two projects:

      Organism        Database           Index file      Number of
                      cross-references                   sequences
      ------------    ----------------   --------------  ---------
      A.thaliana      None yet           arath.txt           1'952
      C.albicans      None yet           calbican.txt          264
      C.elegans       Wormpep            celegans.txt        2'291
      D.discoideum    DictyDB            dicty.txt             316
      D.melanogaster  FlyBase            fly.txt             1'764
      M.musculus      MGD                mgdtosp.txt         6'169
      S.cerevisiae    SGD                yeast.txt           4'892
      S.pombe         GeneDB_SPombe      pombe.txt           2'116


     2.5   'Nucleomorph' added to the OrGanelle (OG) line

The OG (OrGanelle) line indicates from which genome a gene for a protein
originates. Until now, the defined terms in the OG line were 'Chloroplast',
'Cyanelle', 'Mitochondrion' and 'Plasmid'. The term 'Nucleomorph' has been
added; a nucleomorph is the residual nucleus of an algal endosymbiont that
resides inside its host cell.


     2.6   Progress in the conversion of Swiss-Prot to mixed-case
     characters

We are gradually converting Swiss-Prot entries from all 'UPPER CASE' to
'MiXeD CaSe'. With this release the RC (Reference Comment) line topic
STRAIN and the CC line topic 'CATALYTIC ACTIVITY' have been converted.

As described in section 3.2, the process of converting all of Swiss-Prot to
mixed case continues.


     2.7   Multiple RP lines

Starting with release 41, there can be more than one RP (Reference
Position) line per reference in a Swiss-Prot entry. The RP line describes
the extent of the work carried out by the authors of the reference, e.g.
the type of molecule that has been sequenced, protein characterization, PTM
characterization, protein structure analysis, variation detection, etc.

As the number of experimental results per publication has increased over
the years, a single RP line per reference no longer allowed us to record
all the information while maintaining a consistent format. We have
therefore decided to permit multiple RP lines.

Example:

RP   SEQUENCE FROM N.A., SEQUENCE OF 23-42 AND 351-365, AND
RP   CHARACTERIZATION.
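
A parser that previously expected one RP line per reference now has to
join consecutive RP lines. The following Python sketch shows one way of
doing this. It assumes the usual flat-file layout (a two-letter line code,
three blanks, then the data); the function name collect_rp and the RN and
RA lines in the sample are hypothetical and serve only as context.

```python
from typing import Iterable, List


def collect_rp(lines: Iterable[str]) -> List[str]:
    """Join consecutive RP lines into one string per reference.

    Assumes a two-letter line code followed by three blanks. Any non-RP
    line ends the current block of RP lines.
    """
    positions: List[str] = []
    current: List[str] = []
    for line in lines:
        if line.startswith("RP   "):
            current.append(line[5:].strip())
        elif current:
            positions.append(" ".join(current))
            current = []
    if current:
        positions.append(" ".join(current))
    return positions


example = [
    "RN   [1]",  # hypothetical context line
    "RP   SEQUENCE FROM N.A., SEQUENCE OF 23-42 AND 351-365, AND",
    "RP   CHARACTERIZATION.",
    "RA   (author line omitted)",  # hypothetical context line
]
print(collect_rp(example))
```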


     2.8   Changes concerning cross-references (DR line)

     2.8.1   Schizosaccharomyces pombe GeneDB database

We have added cross-references to the Schizosaccharomyces pombe GeneDB
database (available at http://www.genedb.org/genedb/pombe/index.jsp), which
contains all known and predicted S. pombe protein-coding genes, pseudogenes
and tRNAs. It is hosted by the Sanger Institute.

The identifiers of the appropriate DR line are:

 Data bank identifier: GeneDB_SPombe
 Primary identifier:   GeneDB's unique identifier for a S. pombe gene.
 Secondary identifier: None; a dash '-' is stored in that field.
 Example:              DR   GeneDB_SPombe; SPAC9E9.12c; -.
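
As an illustration, a DR line of this kind can be split into its data bank
identifier and the identifiers that follow it. This is a minimal Python
sketch, not an official parser: the names CrossRef and parse_dr are
hypothetical, and it assumes that fields are separated by semicolons, that
the line ends with a period, and that no identifier itself contains a
semicolon.

```python
from typing import List, NamedTuple


class CrossRef(NamedTuple):
    database: str
    identifiers: List[str]


def parse_dr(line: str) -> CrossRef:
    """Split one DR line into the data bank name and its identifiers."""
    if not line.startswith("DR   "):
        raise ValueError(f"not a DR line: {line!r}")
    body = line[5:].strip()
    if body.endswith("."):
        body = body[:-1]  # drop the terminating period
    # Assumption: identifiers never contain a semicolon themselves.
    fields = [field.strip() for field in body.split(";")]
    return CrossRef(database=fields[0], identifiers=fields[1:])


ref = parse_dr("DR   GeneDB_SPombe; SPAC9E9.12c; -.")
print(ref.database, ref.identifiers)
```

Because the identifiers are kept in a list, the same function also accepts
DR lines that carry a tertiary identifier.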


     2.8.2   Genew

We have added cross-references to the Human Gene Nomenclature Database
Genew (available at http://www.gene.ucl.ac.uk/nomenclature/searchgenes.pl),
which provides data for all human genes which have approved symbols. It is
managed by the HUGO Gene Nomenclature Committee (HGNC).

The identifiers of the appropriate DR line are:

 Data bank identifier: Genew
 Primary identifier:   HGNC's unique identifier for a human gene
 Secondary identifier: HGNC's approved gene symbol.
 Example:              DR   Genew; HGNC:5217; HSD3B1.


     2.8.3   Gramene

We have added cross-references to the Gramene database, a comparative
mapping resource for grains (available at http://www.gramene.org/). The
format of the explicit links is:

 Data bank identifier: Gramene
 Primary identifier:   Unique identifier for a protein, which is identical
                       to the Swiss-Prot primary AC number of that protein.
 Secondary identifier: None; a dash '-' is stored in that field.
 Example:              DR   Gramene; Q06967; -.


     2.8.4   HAMAP

We have added cross-references to the collection of orthologous microbial
protein families, generated manually by expert curators of the HAMAP
(High-quality Automated and Manual Annotation of microbial Proteomes)
project in the framework of the Swiss-Prot protein knowledgebase. The data
is accessible at http://www.expasy.org/sprot/hamap/families.html.

The identifiers of the appropriate DR line are:

 Data bank
 identifier:         HAMAP
 Primary identifier: HAMAP unique identifier for a microbial protein
                     family
 Secondary           The values are either '-', 'fused', 'atypical' or
 identifier:         'atypical/fused'. The value '-' is a placeholder for
                     an empty field; the 'fused' value indicates that the
                     family rule does not cover the entire protein; the
                     value 'atypical' points out that the protein is
                     divergent in sequence or has mutated functional
                     sites, and should not be included in family datasets.
                     The value 'atypical/fused' indicates that both
                     apply.
 Tertiary            Number of domains found in the protein, generally
 identifier:         '1', rarely '2' for the fusion of 2 identical
                     domains.
 Example:            DR   HAMAP; MF_00012; -; 1.
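
The secondary and tertiary identifiers of a HAMAP cross-reference can be
decoded as sketched below in Python. The names HamapRef and parse_hamap
are hypothetical, and the code assumes exactly four semicolon-separated
fields terminated by a period, as in the example above.

```python
from typing import NamedTuple


class HamapRef(NamedTuple):
    family: str
    atypical: bool
    fused: bool
    domains: int


def parse_hamap(line: str) -> HamapRef:
    """Decode a 'DR   HAMAP; ...' line into its three identifiers."""
    if not line.startswith("DR   "):
        raise ValueError(f"not a DR line: {line!r}")
    body = line[5:].strip().rstrip(".")
    # Assumption: exactly four fields (data bank plus three identifiers).
    database, family, nature, domains = [f.strip() for f in body.split(";")]
    if database != "HAMAP":
        raise ValueError(f"not a HAMAP cross-reference: {line!r}")
    # '-' is the placeholder for an empty field; otherwise split on '/'.
    flags = set() if nature == "-" else set(nature.split("/"))
    unknown = flags - {"atypical", "fused"}
    if unknown:
        raise ValueError(f"unexpected secondary identifier: {nature!r}")
    return HamapRef(family, "atypical" in flags, "fused" in flags, int(domains))


print(parse_hamap("DR   HAMAP; MF_00012; -; 1."))
```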


     2.8.5   Phosphorylation Site Database

We have added cross-references to the Phosphorylation Site Database,
PhosSite (available at http://vigen.biochem.vt.edu/xpd/xpd.htm), which
provides access to information from scientific literature concerning
prokaryotic proteins that undergo covalent phosphorylation on the hydroxyl
side chains of serine, threonine or tyrosine residues. The identifiers of
the appropriate DR line are:

 Data bank identifier: PhosSite
 Primary identifier:   Unique identifier for a phosphoprotein, which is
                       identical to the Swiss-Prot primary AC number of
                       that protein.
 Secondary identifier: None; a dash '-' is stored in that field.
 Example:              DR   PhosSite; P00955; -.


     2.8.6   TIGRFAMs

We have added cross-references to TIGRFAMs, a protein family database
available at http://www.tigr.org/TIGRFAMs/. The identifiers of the
appropriate DR line are:

 Data bank identifier: TIGRFAMs
 Primary identifier:   TIGRFAMs unique identifier for a protein family.
 Secondary identifier: TIGRFAMs entry name for a protein family.
 Tertiary identifier:  Number of hits found in the sequence.
 Example:              DR   TIGRFAMs; TIGR00630; uvra; 1.


     2.8.7   CarbBank

We have removed the Swiss-Prot cross-references to CarbBank.


     2.8.8   GCRDb

We have removed the Swiss-Prot cross-references to GCRDb.


     2.8.9   Mendel

We have removed the Swiss-Prot cross-references to Mendel.


     2.8.10   YEPD

We have removed the Swiss-Prot cross-references to the yeast
electrophoresis protein database (YEPD).


     2.9   Explicit links to dbSNP in FT VARIANT lines of human sequence
     entries

In human protein sequence entries we have introduced explicit links to the
Single Nucleotide Polymorphism database (dbSNP) from the feature
description of FT VARIANT keys. The format of such links is:

FT   VARIANT    from     to       description (IN dbSNP:accession_number).
FT                                /FTId=VAR_number.

Example:

FT   VARIANT      65     65       T -> I (IN dbSNP:1065419).
FT                                /FTId=VAR_012009.
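
The link can be picked out of a feature description with a regular
expression. The Python sketch below is a minimal example; it assumes that
the dbSNP accession number is purely numeric, as in the example above, and
that the description has already been joined into a single string. The name
'dbsnp_accessions' is hypothetical.

```python
import re

# Matches the explicit link '(IN dbSNP:accession_number)'.
DBSNP_LINK = re.compile(r"\(IN dbSNP:(\d+)\)")


def dbsnp_accessions(description):
    """Return all dbSNP accession numbers found in a VARIANT description."""
    return DBSNP_LINK.findall(description)


print(dbsnp_accessions("T -> I (IN dbSNP:1065419)."))
```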


     2.10   Feature key 'SIMILAR' became obsolete

The feature key 'SIMILAR' was used to describe the extent of a similarity
with another protein sequence. Nowadays, most domains with similarity to
other proteins are known regions described in domain and family databases,
which are annotated in Swiss-Prot with the feature key 'DOMAIN' or 'REPEAT'
and the comment (CC) line topic 'SIMILARITY'; thus the feature key
'SIMILAR' became obsolete and will not be used again.


     2.11   Version of SP in XML format

A distribution version of Swiss-Prot and TrEMBL in XML format is being
developed. The first draft of the XML specification was released for public
review on February 21, 2002.

For more information see http://www.ebi.ac.uk/swissprot/SP-ML/.

Please send comments and suggestions by electronic mail to sp-ml@ebi.ac.uk.


                          3   Forthcoming changes

 Please note that these are the last release notes in this format. In
 future, forthcoming changes and recent modifications will also be
 announced to users between major Swiss-Prot releases. The distinct
 sections of this document will move to the following sites:

    * 2. Description of the changes made to Swiss-Prot since the last
      release: http://www.expasy.org/sprot/relnotes/sp_news.html. This new
      document contains all recent modifications in Swiss-Prot including
      minor changes with no impact on the work of software developers.
      Thus this document contains more information than announced in the
      document 'sp_soon.html' (see below).
    * 3. Forthcoming changes:
      http://www.expasy.org/sprot/relnotes/sp_soon.html. All
      modifications, which have an impact on the Swiss-Prot format are
      announced in this document.
    * 4. Status of the documentation files:
      http://www.expasy.org/sprot/userman.html#documentation
    * 5. The ExPASy World-Wide Web server:
         o Explicit general and continuously updated documentation:
           http://www.expasy.org/doc/expasy.pdf
         o History of changes, improvements and new features:
           http://www.expasy.org/history.html
         o Swiss-Flash, a service that reports news of databases, software
           and service developments: http://www.expasy.org/swiss-flash/
    * 6. TrEMBL - a supplement to Swiss-Prot:
      ftp://ftp.ebi.ac.uk/pub/databases/trembl/relnotes.txt
    * 7. FTP access to Swiss-Prot and TrEMBL:
      http://www.expasy.org/sprot/userman.html#ftp and
      http://www.expasy.org/sprot/download.html
    * 8. ENZYME and PROSITE: Enzyme release notes (not yet) and
      http://www.expasy.org/prosite/psrelnot.html
    * Appendix A (Release statistics):
      http://www.expasy.org/sprot/relnotes/relstat.html
    * Appendix B (Relationships between Swiss-Prot and some biomolecular
      databases): http://www.expasy.org/sprot/userman.html#relship


     3.1   Extension of the entry name format

We endeavor to assign meaningful entry names that facilitate the
identification of the proteins and the species of origin. Currently the
entry name consists of up to ten uppercase alphanumeric characters.
Swiss-Prot uses a general purpose naming convention that can be symbolized
as X_Y, where X is a mnemonic code of at most 4 alphanumeric characters
representing the protein name, the '_' sign serves as a separator, and the
Y is a mnemonic species identification code of at most 5 alphanumeric
characters representing the biological source of the protein.

We are planning to elongate the mnemonic code for the protein name from up
to 4 characters to up to 5 characters. E.g. the mnemonic code for the
meiotic recombination protein rec10 is currently 'RE10'. After the
introduction of extended entry names it could be modified to the 5-letter
code 'REC10'.
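
Software that validates entry names may need to accept both the current
and the extended format. The Python sketch below shows one way to make the
maximum length of the protein mnemonic configurable. The function name
'split_entry_name' and its parameter are assumptions made for this example,
and the species code 'SCHPO' is used purely for illustration.

```python
import re


def split_entry_name(name, max_protein_len=4):
    """Split an entry name X_Y into (protein code, species code).

    max_protein_len is 4 for the current format and 5 for the planned one.
    """
    pattern = r"([A-Z0-9]{1,%d})_([A-Z0-9]{1,5})" % max_protein_len
    match = re.fullmatch(pattern, name)
    if match is None:
        raise ValueError("invalid entry name: %r" % name)
    return match.group(1), match.group(2)


print(split_entry_name("RE10_SCHPO"))
print(split_entry_name("REC10_SCHPO", max_protein_len=5))
```

Note that the extended format raises the maximum total length of an entry
name from 10 to 11 characters, which matters for programs that store the
name in a fixed-width field.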


     3.2   Continuation of the conversion of Swiss-Prot to mixed-case
     characters

We will continue to convert Swiss-Prot entries from all 'UPPER CASE' to
'MiXeD CaSe'. We are proceeding with the conversion of the CC (Comment)
lines and will start to convert the GN (Gene Name) lines to mixed case;
any other line type might also be affected.


     3.3   Reference Comment (RC) line topics may span lines

The RC (Reference Comment) lines store comments relevant to the reference
cited, currently in 5 distinct topics: PLASMID, SPECIES, STRAIN, TISSUE and
TRANSPOSON. It is not always possible to list all information within one
line. Therefore we will allow multiple RC lines, in which one topic may
span more than one line. Example:

RC   STRAIN=Various strains;

could become

RC   STRAIN=AZ.026, DC.005, GA.039, GA2181, IL.014, IN.018, KY.172, KY2.37,
RC   LA.013, MN.001, MNb027, MS.040, NY.016, OH.036, TN.173, TN2.38,
RC   UT.002, AL.012, AZ.180, MI.035, VA.015, and IL2.17;
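
Parsers that currently treat each RC line independently will need to join
continuation lines before splitting the topics. The Python sketch below
shows one possible approach; it assumes that every topic is terminated by a
semicolon and that no topic value itself contains a semicolon. The name
'parse_rc_lines' is hypothetical.

```python
def parse_rc_lines(lines):
    """Join consecutive RC lines and return a {topic: value} dictionary."""
    # Strip the 'RC   ' line code and join the pieces with single spaces.
    text = " ".join(
        line[5:].strip() for line in lines if line.startswith("RC   ")
    )
    topics = {}
    for chunk in text.split(";"):
        chunk = chunk.strip()
        if not chunk:
            continue
        topic, _, value = chunk.partition("=")
        topics[topic] = value
    return topics


rc = [
    "RC   STRAIN=AZ.026, DC.005, GA.039, GA2181, IL.014, IN.018, KY.172, KY2.37,",
    "RC   LA.013, MN.001, MNb027, MS.040, NY.016, OH.036, TN.173, TN2.38,",
    "RC   UT.002, AL.012, AZ.180, MI.035, VA.015, and IL2.17;",
]
print(parse_rc_lines(rc)["STRAIN"])
```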


     3.4   New format of comment line (CC) topics

We are continuing a major overhaul of various comment line topics. We would
like the majority of the information stored to be usable by computer
programs (while remaining human-readable). We are therefore standardizing
the format of the topics.


     3.4.1   ALTERNATIVE PRODUCTS

We are gradually restructuring the CC (comment) line topic ALTERNATIVE
PRODUCTS and introducing unique identifiers for each described isoform.
The qualifiers that will be introduced are described in the table below:

  Topic        Description

  Event        Biological process that results in the
               production of the alternative forms (Alternative
               promoter, Alternative splicing, Alternative
               initiation).

               Format: Event=controlled vocabulary;
               Example: Event=Alternative splicing;

  Named        Number of isoforms listed in the topics 'Name'
  isoforms     below the topic 'Event=Alternative splicing'.

               Format: Named isoforms=number;
               Example: Named isoforms=6;

  Comment      Any comments concerning one or more isoforms;
               optional; may be longer than 1 line.

               Format: Comment=free text;
               Example: Comment=Experimental confirmation may
                                be lacking for some isoforms;

  Name         A common name for an isoform used in the
               literature or assigned by Swiss-Prot (currently
               only available for spliced isoforms).

               Format: Name=common name;
               Example: Name=Alpha;

  Synonyms     Synonyms for an isoform as used in the
               literature; optional.

               Format: Synonyms=synonym_1[, synonym_n];
               Example: Synonyms=B, KL5;

  IsoId        Unique identifier for an isoform, consisting of
               the Swiss-Prot accession number, followed by a
               dash and an identifier for this isoform.

               Format: IsoId=acc#-isoform_number[,acc#-isoform_number];
               Example: IsoId=P05067-1;

  Sequence     Lists all FT VARSPLIC identifiers (VSP_#), which
               are needed to build the sequence for a specific
               isoform. If the accession number of the IsoId
               does not correspond to the accession number of
               the current entry, this topic contains the term
               'External'.

               Format: Sequence=VSP_#[,VSP_#]|Displayed|External|Not described;
               Example: Sequence=Displayed;
               Example: Sequence=VSP_000013, VSP_000014;

  Note         Notes concerning current isoform; optional;

               Format: Note=free text;
               Example: Note=Predicted;


In the case of 'Alternative initiation' the topic 'Event' can be followed
by a 'Comment' of free text. Format:

CC   -!- ALTERNATIVE PRODUCTS:
CC       Event=Alternative initiation;
CC         Comment=Optional free text with information on alternative
CC         initiation or the products retrieved from this event. In the
CC         case of alternative initiation there will be no other topics;

In the case of 'Alternative splicing' the topic 'Event' can be followed by
a 'Comment' of free text and a listing of all described isoforms. Format:

CC   -!- ALTERNATIVE PRODUCTS:
CC       Event=Alternative splicing;
CC         Comment=Optional free text with information on alternative
CC         splicing or the products retrieved from this event;
CC       Name=isoform_1; Synonyms=synonym_1[, synonym_n];
CC         IsoId=isoform_identifier_1[, isoform_identifier_n];
CC         Sequence=VSP_identifier_1 [, VSP_identifier_n];
CC         Note=Optional note concerning isoform_1;
CC       Name=isoform_n; Synonyms=synonym_1[, synonym_n];
CC         IsoId=isoform_identifier_1[, isoform_identifier_n];
CC         Sequence=VSP_identifier_1 [, VSP_identifier_n];
CC         Note=Optional note concerning isoform_n;

Example for new format of the CC lines and the corresponding FT lines for
an entry with alternative splicing:

...
CC   -!- ALTERNATIVE PRODUCTS:
CC       Event=Alternative splicing; Named isoforms=9;
CC         Comment=Additional isoforms seem to exist. APP695, APP751 and
CC         APP770 are the major isoforms. The L-isoforms are referred to as
CC         appicans. Experimental confirmation may be lacking for some
CC         isoforms;
CC       Name=APP770; Synonyms=Prea4 770;
CC         IsoId=P05067-1; Sequence=Displayed;
CC       Name=APP305;
CC         IsoId=P05067-2; Sequence=VSP_000005, VSP_000006;
CC       Name=L-APP677;
CC         IsoId=P05067-3; Sequence=VSP_000002, VSP_000004, VSP_000009;
CC       Name=APP695; Synonyms=Prea4 695;
CC         IsoId=P05067-4; Sequence=VSP_000002, VSP_000004;
CC       Name=L-APP696;
CC         IsoId=P05067-5; Sequence=VSP_000002, VSP_000003, VSP_000009;
CC       Name=APP714;
CC         IsoId=P05067-6; Sequence=VSP_000002, VSP_000003;
CC       Name=L-APP733;
CC         IsoId=P05067-7; Sequence=VSP_000007, VSP_000008, VSP_000009;
CC       Name=APP751; Synonyms=Prea4 751;
CC         IsoId=P05067-8; Sequence=VSP_000007, VSP_000008;
CC       Name=L-APP752;
CC         IsoId=P05067-9; Sequence=VSP_000009;
...
FT   VARSPLIC    289    289       E -> V (in isoform APP695, isoform
FT                                L-APP696, isoform L-APP677 and isoform
FT                                APP714).
FT                                /FTId=VSP_000002.
FT   VARSPLIC    290    345       Missing (in isoform L-APP696 and isoform
FT                                APP714).
FT                                /FTId=VSP_000003.
FT   VARSPLIC    290    364       Missing (in isoform APP695 and isoform
FT                                L-APP677).
FT                                /FTId=VSP_000004.
FT   VARSPLIC    290    305       VCSEQAETGPCRAMIS -> KWYKEVHSGQARWLML (in
FT                                isoform APP305).
FT                                /FTId=VSP_000005.
FT   VARSPLIC    306    770       Missing (in isoform APP305).
FT                                /FTId=VSP_000006.
FT   VARSPLIC    345    345       M -> I (in isoform L-APP733 and isoform
FT                                APP751).
FT                                /FTId=VSP_000007.
FT   VARSPLIC    346    364       Missing (in isoform L-APP733 and isoform
FT                                APP751).
FT                                /FTId=VSP_000008.
FT   VARSPLIC    637    654       Missing (in isoform L-APP677, isoform
FT                                L-APP696, isoform L-APP733 and isoform
FT                                L-APP752).
FT                                /FTId=VSP_000009.
...
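
To show how the structured topic can be read by a program, the following
Python sketch parses the CC lines of an ALTERNATIVE PRODUCTS block into a
header and a list of isoforms. It is a minimal example under stated
assumptions: every qualifier ends with a semicolon, free text contains no
semicolon, and the block passed in holds only this one topic. The function
name 'parse_alternative_products' and the extra key 'VSP' are hypothetical.

```python
def parse_alternative_products(cc_lines):
    """Return (header, isoforms) for one ALTERNATIVE PRODUCTS block."""
    marker = "-!- ALTERNATIVE PRODUCTS:"
    # Strip the 'CC' line code and join the lines with single spaces.
    text = " ".join(line[2:].strip() for line in cc_lines)
    if marker not in text:
        raise ValueError("no ALTERNATIVE PRODUCTS topic found")
    text = text.split(marker, 1)[1]

    header, isoforms = {}, []
    for chunk in text.split(";"):
        key, sep, value = chunk.strip().partition("=")
        if not sep:
            continue  # empty remainder after the last semicolon
        if key == "Name":
            isoforms.append({"Name": value})  # 'Name' opens a new isoform
        elif isoforms:
            isoforms[-1][key] = value
        else:
            header[key] = value  # Event, Named isoforms, Comment

    for isoform in isoforms:
        # Keep only real feature identifiers; 'Displayed', 'External' and
        # 'Not described' yield an empty list.
        parts = isoform.get("Sequence", "").split(",")
        isoform["VSP"] = [p.strip() for p in parts
                          if p.strip().startswith("VSP_")]
    return header, isoforms


cc = [
    "CC   -!- ALTERNATIVE PRODUCTS:",
    "CC       Event=Alternative splicing; Named isoforms=9;",
    "CC       Name=APP770; Synonyms=Prea4 770;",
    "CC         IsoId=P05067-1; Sequence=Displayed;",
    "CC       Name=APP305;",
    "CC         IsoId=P05067-2; Sequence=VSP_000005, VSP_000006;",
]
header, isoforms = parse_alternative_products(cc)
print(header["Event"], [i["IsoId"] for i in isoforms])
```

The identifiers collected under 'VSP' can then be matched against the
/FTId qualifiers of the FT VARSPLIC lines to find the sequence changes that
define each isoform.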


     3.4.2   PATHWAY

We are gradually structuring the comment line topic PATHWAY. To describe
the biochemical pathway in which the protein is involved, we use the
following format:

CC   -!- PATHWAY: biochemical pathway; nth step.[ Comment.]

Example:

CC   -!- PATHWAY: Coenzyme A (CoA) biosynthesis; first step.
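
A structured PATHWAY line can be decomposed with a single regular
expression. The Python sketch below assumes that the topic fits on one line
and that the pathway name contains no semicolon; the name 'parse_pathway'
is hypothetical.

```python
import re

PATHWAY = re.compile(
    r"^CC   -!- PATHWAY: (?P<pathway>[^;]+); (?P<step>[^.]+ step)\."
    r"(?: (?P<comment>.+))?$"
)


def parse_pathway(line):
    """Return pathway, step and optional comment, or None if no match."""
    match = PATHWAY.match(line)
    return match.groupdict() if match else None


print(parse_pathway("CC   -!- PATHWAY: Coenzyme A (CoA) biosynthesis; first step."))
```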


     3.4.3   COFACTOR

The comment line topic COFACTOR is gradually being modified to the
following format:

CC   -!- COFACTOR: cofactor1[, cofactor2 and cofactor3].[ Comment.]

Examples:

CC   -!- COFACTOR: Magnesium.
CC   -!- COFACTOR: Copper, Manganese and Nickel.
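
The list of cofactors can be recovered by splitting the first sentence on
', ' and ' and '. The Python sketch below is a simplified example: it
assumes a single-line topic and cofactor names that contain neither a
period nor the word 'and'. The name 'parse_cofactors' is hypothetical.

```python
import re


def parse_cofactors(line):
    """Return (list of cofactors, optional comment) for a COFACTOR line."""
    prefix = "CC   -!- COFACTOR: "
    if not line.startswith(prefix):
        raise ValueError("not a COFACTOR line")
    # The first period ends the cofactor list; the rest is the comment.
    first, _, comment = line[len(prefix):].partition(".")
    names = [name for name in re.split(r", | and ", first) if name]
    return names, comment.strip() or None


print(parse_cofactors("CC   -!- COFACTOR: Copper, Manganese and Nickel."))
```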


     3.5   Changes concerning cross-references (DR line)

We will add cross-references to the Gene Ontology (GO) database (available
at http://www.geneontology.org/), which provides controlled vocabularies
for the description of the molecular function, biological process and
cellular component of gene products.

The identifiers of the appropriate DR line are:

 Data bank identifier: GO
 Primary identifier:   GO's unique identifier for a GO term.
 Secondary identifier: A 1-letter abbreviation for one of the 3 ontology
                       aspects, separated from the GO term by a colon. If
                       the term is longer than 45 characters, the first 43
                       characters are indicated followed by 3 dots ('...').
                       The abbreviations for the 3 distinct aspects of the
                       ontology are P (biological Process), F (molecular
                       Function) and C (cellular Component).
 Tertiary identifier:  3-character GO evidence code.
 Example:              DR   GO; GO:0003677; F:DNA binding; TAS.
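
The Python sketch below illustrates how the three identifiers of a GO DR
line could be extracted, including the ontology aspect and the marker for a
truncated term. The function name 'parse_go_dr' and the keys of the
returned dictionary are assumptions made for this example.

```python
ASPECTS = {
    "P": "biological process",
    "F": "molecular function",
    "C": "cellular component",
}


def parse_go_dr(line):
    """Split a 'DR   GO; ...' line into id, aspect, term and evidence."""
    prefix = "DR   GO; "
    if not line.startswith(prefix):
        raise ValueError("not a GO cross-reference line")
    body = line[len(prefix):].rstrip()
    if body.endswith("."):
        body = body[:-1]  # drop only the terminating period
    go_id, rest = body.split("; ", 1)
    term_field, evidence = rest.rsplit("; ", 1)
    aspect_code, _, term = term_field.partition(":")
    return {
        "id": go_id,
        "aspect": ASPECTS[aspect_code],
        "term": term,
        "truncated": term.endswith("..."),  # long terms end with 3 dots
        "evidence": evidence,
    }


print(parse_go_dr("DR   GO; GO:0003677; F:DNA binding; TAS."))
```

When 'truncated' is true, the full term name has to be looked up in the GO
database by means of the primary identifier.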


     3.6   Modifications concerning the feature table (FT line)

We are investing a major effort in the annotation of posttranslational
modifications, which has an effect on various feature keys and feature
descriptions. Major format changes are described below.


     3.6.1   New feature key 'CROSSLNK'

The feature key 'CROSSLNK' will be introduced to describe bonds between
amino acids, which are formed posttranslationally within a peptide or
between peptides, such as isopeptidic bonds, carbon-carbon linkages,
carbon-nitrogen linkages and backbone condensations. It will also include
the description of thioether bonds and thiolester bonds and thus the feature
keys 'THIOETH' and 'THIOLEST' will be removed.

Note: Disulfide bonds occur so often in proteins, that we will keep the
special feature key 'DISULFID' to describe this kind of linkage.

Format:

FT   CROSSLNK    from     to      Description.


     3.6.2   Removal of the feature key 'THIOETH'

See section 3.6.1.


     3.6.3   Removal of the feature key 'THIOLEST'

See section 3.6.1.


                   4   Status of the documentation files

Swiss-Prot is distributed with a large number of documentation files. Some
of these files have been available for a long time (the user manual,
release notes, the various indexes for authors, citations, keywords, etc.),
but many have been created recently and we are continuously adding new
files, and updating and modifying existing files. Please note that the
header in many documentation files has changed. The following table lists
all the documents that are currently available.

See also section 7.3 for information on how to access updated versions of
all documents between major releases.

 userman.txt    User manual
 relnotes.txt   Release notes for the current release (41)
 shortdes.txt   Short description of entries in Swiss-Prot

 jourlist.txt   List of cited journals
 keywlist.txt   List of keywords
 plasmid.txt    List of plasmids
 speclist.txt   List of organism (species) identification codes
 tisslist.txt   List of tissues
 experts.txt    List of on-line experts for PROSITE and Swiss-Prot
 dbxref.txt     List of databases cross-referenced in Swiss-Prot
 submit.txt     Submission of sequence data to Swiss-Prot

 acindex.txt    Accession number index
 autindex.txt   Author index
 citindex.txt   Citation index
 keyindex.txt   Keyword index
 speindex.txt   Species index
 deleteac.txt   Deleted accession number index
 7tmrlist.txt   List of 7-transmembrane G-linked receptor entries
 aatrnasy.txt   List of aminoacyl-tRNA synthetases
 allergen.txt   Nomenclature and index of allergen sequences
 annbioch.txt   Swiss-Prot annotation: how is biochemical information
                assigned to sequence entries
 arath.txt      Index of Arabidopsis thaliana entries and their
                corresponding gene designations [see 2]
 bacsu.txt      Index of Bacillus subtilis strain 168 chromosomal entries
                and their corresponding SubtiList cross-references [see 1]
 bloodgrp.txt   Blood group antigen proteins
 bucai.txt      Index of Buchnera aphidicola (subsp. Acyrthosiphon pisum)
                entries [see 2]
 bucap.txt      Index of Buchnera aphidicola (subsp. Schizaphis graminum)
                entries [see 2]
 calbican.txt   Index of Candida albicans entries and their corresponding
                gene designations
 cdlist.txt     CD nomenclature for surface proteins of human leucocytes
 celegans.txt   Index of Caenorhabditis elegans entries and their
                corresponding gene designations and WormPep
                cross-references
 dicty.txt      Index of Dictyostelium discoideum entries and their
                corresponding gene designations and DictyDB
                cross-references
 ec2dtosp.txt   Index of Escherichia coli Gene-protein database
                (ECO2DBASE) entries referenced in Swiss-Prot
 ecoli.txt      Index of Escherichia coli strain K12 chromosomal entries
                and their corresponding EcoGene cross-references
 embltosp.txt   Index of EMBL Nucleotide Sequence Database entries
                referenced in Swiss-Prot
 extradom.txt   Nomenclature of extracellular domains
 fly.txt        Index of Drosophila entries and their corresponding
                FlyBase cross-references
 glycosid.txt   Classification of glycosyl hydrolase families and index of
                glycosyl hydrolase entries in Swiss-Prot
 haein.txt      Index of Haemophilus influenzae strain Rd chromosomal
                entries [see 1]
 helpy.txt      Index of Helicobacter pylori strain 26695 chromosomal
                entries [see 1]
 hoxlist.txt    Vertebrate homeotic Hox proteins: nomenclature and index
 humchr01.txt   Index of proteins encoded on human chromosome 1
 humchr02.txt   Index of proteins encoded on human chromosome 2
 humchr03.txt   Index of proteins encoded on human chromosome 3
 humchr04.txt   Index of proteins encoded on human chromosome 4
 humchr05.txt   Index of proteins encoded on human chromosome 5
 humchr06.txt   Index of proteins encoded on human chromosome 6
 humchr07.txt   Index of proteins encoded on human chromosome 7
 humchr08.txt   Index of proteins encoded on human chromosome 8
 humchr09.txt   Index of proteins encoded on human chromosome 9
 humchr10.txt   Index of proteins encoded on human chromosome 10
 humchr11.txt   Index of proteins encoded on human chromosome 11
 humchr12.txt   Index of proteins encoded on human chromosome 12
 humchr13.txt   Index of proteins encoded on human chromosome 13
 humchr14.txt   Index of proteins encoded on human chromosome 14
 humchr15.txt   Index of proteins encoded on human chromosome 15
 humchr16.txt   Index of proteins encoded on human chromosome 16
 humchr17.txt   Index of proteins encoded on human chromosome 17
 humchr18.txt   Index of proteins encoded on human chromosome 18
 humchr19.txt   Index of proteins encoded on human chromosome 19
 humchr20.txt   Index of proteins encoded on human chromosome 20
 humchr21.txt   Index of proteins encoded on human chromosome 21
 humchr22.txt   Index of proteins encoded on human chromosome 22
 humchrx.txt    Index of proteins encoded on human chromosome X
 humchry.txt    Index of proteins encoded on human chromosome Y
 humpvar.txt    Index of human proteins with sequence variants
 initfact.txt   List and index of translation initiation factors
 intein.txt     Index of intein-containing entries referenced in
                Swiss-Prot
 metallo.txt    Classification of metallothioneins and index of the
                entries in Swiss-Prot
 metja.txt      Index of Methanococcus jannaschii entries [see 1]
 mgdtosp.txt    Index of MGD entries referenced in Swiss-Prot
 mimtosp.txt    Index of MIM entries referenced in Swiss-Prot
 mycge.txt      Index of Mycoplasma genitalium strain G-37 chromosomal
                entries [see 1]
 mycpn.txt      Index of Mycoplasma pneumoniae strain M129 chromosomal
                entries [see 2]
 ngr234.txt     Table of predicted proteins in Rhizobium plasmid pNGR234a
 nomlist.txt    List of nomenclature related references for proteins
 pdbtosp.txt    Index of Protein Data Bank (PDB) entries referenced in
                Swiss-Prot
 peptidas.txt   Classification of peptidase families and index of
                peptidase entries in Swiss-Prot
 plastid.txt    List of chloroplast and cyanelle encoded proteins
 pombe.txt      Index of Schizosaccharomyces pombe entries and their
                corresponding gene designations
 restric.txt    List of restriction enzyme and methylase entries
 ribosomp.txt   Index of ribosomal proteins classified by families on the
                basis of sequence similarities
 ricpr.txt      Index of Rickettsia prowazekii strain Madrid E entries
                [see 1]
 salty.txt      Index of Salmonella typhimurium strain LT2 chromosomal
                entries and their corresponding StyGene cross-references
 syny3.txt      Index of Synechocystis sp. strain PCC 6803 entries [see 1]
 upflist.txt    List of UPF (Uncharacterized Protein Families) and index
                of members
 yeast.txt      Index of Saccharomyces cerevisiae entries in Swiss-Prot
                and their corresponding gene designations
 yeast1.txt     Yeast chromosome I entries
 yeast2.txt     Yeast chromosome II entries
 yeast3.txt     Yeast chromosome III entries
 yeast5.txt     Yeast chromosome V entries
 yeast6.txt     Yeast chromosome VI entries
 yeast7.txt     Yeast chromosome VII entries
 yeast8.txt     Yeast chromosome VIII entries
 yeast9.txt     Yeast chromosome IX entries
 yeast10.txt    Yeast chromosome X entries
 yeast11.txt    Yeast chromosome XI entries
 yeast13.txt    Yeast chromosome XIII entries
 yeast14.txt    Yeast chromosome XIV entries

Notes:

 1)  The filenames for indexes of microbe-specific entries have been
     renamed; the filename is now composed of the 5-letter code used for
     the species in the Swiss-Prot entry name and the extension 'txt'.
     This modification concerns the following files:

     'bacsu.txt' (formerly: 'subtilis.txt'), 'haein.txt' (formerly:
     'haeinflu.txt'), 'helpy.txt' (formerly: 'hpylori.txt'), 'metja.txt'
     (formerly: 'mjannasc.txt'), 'mycge.txt' (formerly: 'mgenital.txt'),
     'ricpr.txt' (formerly: 'rprowaze.txt'), 'syny3.txt' (formerly:
     'pcc6803.txt').

 2)  The files 'arath.txt', 'bucai.txt', 'bucap.txt' and 'mycpn.txt' are
     new documents introduced since release 40.

We have continued to include in some Swiss-Prot documentation files the
references to Web sites relevant to the subject under consideration. There
are now 89 documents that include such links.


      5   New features of the ExPASy World-Wide Web server related to
                                Swiss-Prot

Explicit general and continuously updated documentation about the ExPASy
server is available at http://www.expasy.org/doc/expasy.pdf.

ExPASy is constantly modified and improved. If you wish to be informed of
the changes made to the server, you can either:

   * Read the document 'History of changes, improvements and new features'
     which is available at the address: http://www.expasy.org/history.html
   * Subscribe to Swiss-Flash, a service that reports news of databases,
     software and service developments. By subscribing to this service, you
     will automatically get Swiss-Flash bulletins by electronic mail. To
     subscribe, use the address: http://www.expasy.org/swiss-flash/.

Among all the improvements and the new features introduced since the last
Swiss-Prot release, here are those that we believe are specifically useful
to Swiss-Prot users:

  1. The NiceProt view of Swiss-Prot has been further improved: access to
     documentation has been facilitated by adding "mouse-over" hypertext
     links from various sections in NiceProt to the corresponding
     information in the user manual. Those hypertext links, which give
     access to documentation rather than the data related to the protein
     entry, are visually different from the ordinary hyperlinks. While they
     are not immediately recognizable as such, the user can see that they
     are clickable by moving the mouse pointer over the section headings
     such as "References" or "Keywords". A short description of the linked
     information appears at the bottom of the web browser, and when
     clicked, a small additional window is opened with related information
     extracted from the user manual.

     Similarly, in the "Cross-references" section, the names of the
     databases to which an entry is cross-referenced are linked to the
     corresponding sections in the document dbxref.txt (List of databases
     cross-referenced in Swiss-Prot).

  2. Implicit links have been added to the resources AraC-XylS, Ensembl and
     ModBase. We have removed the implicit links to DOMO, which is no
     longer maintained.

     For more details on Swiss-Prot cross-references, implicit and explicit
     links, you can read:

     Gasteiger E., Jung E., Bairoch A.
     Swiss-Prot: connecting biological knowledge via a protein database.
     Curr. Issues Mol. Biol. 3:47-55(2001)

  3. A few improvements have been applied to the pages describing the Human
     Proteomics Initiative (HPI). For each human chromosome a link is
     provided to the corresponding index of Swiss-Prot entries, to relevant
     information in the EBI Proteome database, in Ensembl, in the Human
     Genome Resources at NCBI and in euGenes at Indiana University.

     The HPI status report has been modified to include, for each of the
     counted items (e.g. splice variants, variants, references) not only
     the absolute number, but also the maximal and average number of
     occurrences per entry, and the number of entries concerned by the
     counted item.


                  6   TrEMBL - a supplement to Swiss-Prot

The ongoing genome sequencing and mapping projects have dramatically
increased the number of protein sequences to be incorporated into
Swiss-Prot. Since we do not want to dilute the quality standards of
Swiss-Prot by incorporating sequences into the database without proper
sequence analysis and annotation, we cannot speed up the incorporation of
new incoming data indefinitely. But as we also want to make the sequences
available as quickly as possible, we introduced in 1995 a computer
annotated supplement to Swiss-Prot. This supplement consists of entries in
Swiss-Prot-like format derived from the translation of all coding sequences
(CDS) in the EMBL nucleotide sequence database, except those already
included in Swiss-Prot.

This supplement is named TrEMBL (Translation from EMBL). It can be
considered as a preliminary section of Swiss-Prot. This Swiss-Prot release
is supplemented by TrEMBL release 21.

TrEMBL is available by FTP from the EBI and ExPASy servers in the directory
'/databases/trembl'. It can be queried on WWW by the EBI and ExPASy SRS
servers. It is distributed with its own set of release notes.


                  7   FTP access to Swiss-Prot and TrEMBL

     7.1   Generalities

Swiss-Prot is available for download on the following anonymous FTP
servers:

 Organization Swiss Institute of Bioinformatics (SIB)
 Address      ftp.expasy.org, au.expasy.org, bo.expasy.org,
              ca.expasy.org, cn.expasy.org, kr.expasy.org,
              tw.expasy.org, us.expasy.org
 Directory    /databases/swiss-prot/

 Organization European Bioinformatics Institute (EBI)
 Address      ftp.ebi.ac.uk
 Directory    /pub/databases/swissprot/


     7.2   Non-redundant database

On the ExPASy and EBI FTP servers we distribute files that make up a
non-redundant and complete protein sequence database consisting of three
components:

1) Swiss-Prot
2) TrEMBL
3) New entries to be integrated later into TrEMBL (hereafter known as
   TrEMBL_New)

Every week three files are completely rebuilt. These files are named:
sprot.dat.gz, trembl.dat.gz and trembl_new.dat.gz. As indicated by their
'.gz' extension, these are gzip-compressed files which, when decompressed,
produce ASCII files in Swiss-Prot format.
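
The compressed files do not have to be decompressed on disk before use. As
a minimal sketch, the Python function below reads a '.dat.gz' file entry by
entry, relying on the '//' line that terminates every entry in the
Swiss-Prot format. The function name 'iter_entries' is hypothetical, and
the local file name in the commented usage line is only an example.

```python
import gzip


def iter_entries(path):
    """Yield each entry of a gzip-compressed flat file as a list of lines."""
    entry = []
    with gzip.open(path, "rt", encoding="ascii") as handle:
        for line in handle:
            line = line.rstrip("\n")
            if line == "//":  # terminator line: the entry is complete
                yield entry
                entry = []
            else:
                entry.append(line)


# Example usage, assuming sprot.dat.gz has been downloaded beforehand:
# for entry in iter_entries("sprot.dat.gz"):
#     print(entry[0])  # the ID line of each entry
```

Reading the file as a stream keeps memory usage low, since only one entry
is held in memory at a time.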

Three other files are also available (sprot.fas.gz, trembl.fas.gz and
trembl_new.fas.gz) which are compressed 'fasta' format sequence files
useful for building the databases used by FASTA, BLAST and other sequence
similarity search programs. Please do not use these files for any other
purpose, as you will lose all annotations by using this stripped-down
format.

The files for the non-redundant database are stored in the directory
'/databases/sp_tr_nrdb' on the ExPASy FTP server (ftp.expasy.org) and in
the directory '/pub/databases/sp_tr_nrdb' on the EBI FTP server
(ftp.ebi.ac.uk).

Additional notes:

   * The Swiss-Prot file continuously grows as new annotated sequences are
     added.

   * The TrEMBL file decreases in size as sequences are moved out of that
     section after being annotated and moved into Swiss-Prot. Four times a
     year a new release of TrEMBL is built at EBI; at this point the TrEMBL
     file increases in size as it then includes all of the new data (see
     next section) that has accumulated since the last release.

   * The TrEMBL_New file starts as a very small file and grows in size
     until a new release of TrEMBL is available.

   * Swiss-Prot and TrEMBL share the same system of accession numbers.
     Therefore you will not find any primary accession number duplicated
     between the two sections. A TrEMBL entry (and its associated accession
     number(s)) can either move to Swiss-Prot as a new entry or be merged
     with an existing Swiss-Prot entry. In the latter case, the accession
     number(s) of that TrEMBL entry are added to that of the Swiss-Prot
     entry.

   * TrEMBL_New does not have real accession numbers. However it was
     necessary to have an 'AC' line so as to be able to use it with
     different software products. This AC line contains a temporary
     identifier which consists of the protein_ID (protein sequence
     identifier) of the coding sequence in the parent nucleotide sequence.

   * TrEMBL_New is quite messy! You will of course find new sequence
     entries but you will also encounter sequences that are going to be
     used to update existing TrEMBL or Swiss-Prot entries. None of the
     "cleaning" steps that are applied to produce a TrEMBL release are run
     on TrEMBL_New nor are any of the computer-annotation software tools
     that are used to enhance the information content of TrEMBL. TrEMBL_New
     is provided only so that users can be sure not to miss any important
     new sequences when they run similarity searches.

   * While these three files allow you to build what we call a
     'non-redundant' database, this description is not strictly accurate.
     Without going into a long explanation, we can say that this is
     currently the best attempt at providing a complete selection of
     protein sequence entries while trying to eliminate redundancies. While
     Swiss-Prot is almost completely (99.994%) non-redundant, TrEMBL is far
     from being non-redundant, and the combination of Swiss-Prot + TrEMBL
     is even less so.

   * To describe to your users the version of the non-redundant database
     that you are providing them with, you should use a statement of the
     form:

          Swiss-Prot release 41.x of xx-yyy-2003;
          TrEMBL release 23.x of xx-yyy-2003;
          TrEMBL_New of xx-yyy-2003.
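
The note above on shared accession numbers can be checked on a local copy.
The Python sketch below assumes that accession numbers appear on 'AC' lines
as a semicolon-separated list (for example 'AC   P12345; Q99999;') with the
primary accession number first. Both function names are hypothetical. Note
that the temporary identifiers on TrEMBL_New 'AC' lines are not real
accession numbers, so this check is meaningful only for the Swiss-Prot and
TrEMBL files.

```python
def accession_numbers(entry_lines):
    """Return (primary, secondaries) taken from an entry's AC lines."""
    accessions = []
    for line in entry_lines:
        if line.startswith("AC   "):
            # Accession numbers are separated by semicolons.
            accessions.extend(a.strip() for a in line[5:].split(";") if a.strip())
    if not accessions:
        raise ValueError("entry has no AC line")
    return accessions[0], accessions[1:]


def find_duplicate_primaries(entries):
    """Return the primary accession numbers found in more than one entry."""
    seen, duplicates = set(), set()
    for lines in entries:
        primary, _ = accession_numbers(lines)
        if primary in seen:
            duplicates.add(primary)
        seen.add(primary)
    return duplicates
```

If the two sections are consistent, find_duplicate_primaries applied to
their combined entries returns an empty set.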


     7.3   Weekly updates of Swiss-Prot documents

Until now, the ExPASy FTP server only provided the Swiss-Prot documents and
indexes in their versions at the time of the last full release. All
documents are now updated with every weekly release of Swiss-Prot. They
are available for FTP download from the directory
/databases/swiss-prot/updated_doc/.


     7.4   Weekly updates of Swiss-Prot

Weekly updates of Swiss-Prot are available by anonymous FTP. Three files
are generated at each update:

 new_seq.dat Contains all the new entries since the last full
             release;

 upd_seq.dat Contains the entries for which the sequence data has
             been updated since the last release;

 upd_ann.dat Contains the entries for which one or more annotation
             fields have been updated since the last release.

Important notes

   * Although we try to follow a regular schedule, we do not promise to
     update these files every week. In most cases two weeks may elapse
     between two updates.
   * Instead of using the above files, you can, every week, download an
     updated copy of the Swiss-Prot database. This file is available in the
     directory containing the non-redundant database (see section 7.2).
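
One possible way to use the three update files is to overlay them on a
local copy of the last full release. The Python sketch below is an
assumption about how a site might do this, not a description of an official
procedure: it keys entries on their primary accession number, so it does
not handle entries that were deleted or merged between updates. The
function names are hypothetical.

```python
def primary_accession(entry_lines):
    """Return the first accession number on the entry's first AC line."""
    for line in entry_lines:
        if line.startswith("AC   "):
            return line[5:].split(";")[0].strip()
    raise ValueError("entry has no AC line")


def apply_updates(release_entries, *update_sets):
    """Overlay update entries on a full release, keyed by primary accession."""
    merged = {primary_accession(e): e for e in release_entries}
    for entries in update_sets:  # e.g. new_seq, upd_seq, upd_ann
        for entry in entries:
            merged[primary_accession(entry)] = entry  # add new or replace old
    return merged
```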


                          8   ENZYME and PROSITE

     8.1   The ENZYME nomenclature database

Release 30.0 of the ENZYME nomenclature database is distributed with
release 41 of Swiss-Prot. ENZYME release 30.0 contains information on
4'136 enzymes. In this release, we have added a significant number of new
entries and updated many existing ones.


     8.2   The PROSITE database

PROSITE now comes with its own release notes.


                          9   We need your help!

We welcome feedback from our users. We would especially appreciate your
notifying us if you find that sequences belonging to your field of
expertise are missing from the database. We also would like to be notified
about annotations to be updated, if, for example, the function of a protein
has been clarified or if new information about post-translational
modifications has become available. To facilitate this feedback we offer,
on the ExPASy WWW server, a form that allows the submission of updates
and/or corrections to Swiss-Prot:

     http://www.expasy.org/sprot/update.html

It is also possible, from any entry in Swiss-Prot displayed by the ExPASy
server, to submit updates and/or corrections for that particular entry.
Finally, you can also send your comments by electronic mail to the address:

     swiss-prot@expasy.org

Note that all update requests are assigned a unique identifier of the form
UR-Xnnnn (example: UR-A0123). This identifier is used internally by the
Swiss-Prot staff at SIB and EBI to track requests and is also used in
e-mail exchanges with the persons who have submitted a request.
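
For users who track their own requests, the identifier format can be
checked with a short Python sketch. It assumes that 'UR-Xnnnn' means one
uppercase letter followed by four digits, as in the example UR-A0123; the
source does not spell out the rule, so the pattern is a guess.

```python
import re

# Assumed reading of 'UR-Xnnnn': one uppercase letter, then four digits.
UR_PATTERN = re.compile(r"UR-[A-Z][0-9]{4}")


def is_update_request_id(text):
    """Return True if text looks like an update request identifier."""
    return UR_PATTERN.fullmatch(text) is not None
```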


                       APPENDIX A:   Some statistics

     A.1   Amino acid composition

     A.1.1   Composition in percent for the complete database

   Ala (A) 7.72   Gln (Q) 3.92   Leu (L) 9.56   Ser (S) 6.98
   Arg (R) 5.24   Glu (E) 6.54   Lys (K) 5.96   Thr (T) 5.51
   Asn (N) 4.28   Gly (G) 6.90   Met (M) 2.36   Trp (W) 1.18
   Asp (D) 5.27   His (H) 2.26   Phe (F) 4.06   Tyr (Y) 3.13
   Cys (C) 1.60   Ile (I) 5.88   Pro (P) 4.88   Val (V) 6.66

   Asx (B) 0.000  Glx (Z) 0.000  Xaa (X) 0.01


     A.1.2   Classification of the amino acids by their frequency

   Leu, Ala, Ser, Gly, Val, Glu, Lys, Ile, Thr, Asp, Arg, Pro, Asn, Phe,
   Gln, Tyr, Met, His, Cys, Trp
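
The classification above follows directly from the percentages in A.1.1.
As a small Python sketch, sorting the table by decreasing percentage
reproduces the same order (the variable names are illustrative):

```python
# Percentages copied from the table in A.1.1.
composition = {
    "Ala": 7.72, "Gln": 3.92, "Leu": 9.56, "Ser": 6.98,
    "Arg": 5.24, "Glu": 6.54, "Lys": 5.96, "Thr": 5.51,
    "Asn": 4.28, "Gly": 6.90, "Met": 2.36, "Trp": 1.18,
    "Asp": 5.27, "His": 2.26, "Phe": 4.06, "Tyr": 3.13,
    "Cys": 1.60, "Ile": 5.88, "Pro": 4.88, "Val": 6.66,
}

# Most frequent amino acid first.
ranking = sorted(composition, key=composition.get, reverse=True)
print(", ".join(ranking))
```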


     A.2   Taxonomic origin

Total number of species represented in this release of Swiss-Prot: 7'778
The first twenty species represent 51'656 sequences: 42.1% of the total
number of entries.


     A.2.1   Table of the frequency of occurrence of species

        Species represented 1x: 3679
                            2x: 1206
                            3x:  619
                            4x:  403
                            5x:  273
                            6x:  251
                            7x:  192
                            8x:  146
                            9x:  120
                           10x:   66
                       11- 20x:  331
                       21- 50x:  250
                       51-100x:   84
                         >100x:  158
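
A table of this kind can be derived from a count of entries per species.
The Python sketch below shows one way to do the bucketing, assuming a
mapping from species name to number of entries is already available; the
names BUCKETS and frequency_table are hypothetical.

```python
from collections import Counter

# (low, high, label) for each row of the table; high=None means open-ended.
BUCKETS = [(n, n, f"{n}x") for n in range(1, 11)] + [
    (11, 20, "11- 20x"),
    (21, 50, "21- 50x"),
    (51, 100, "51-100x"),
    (101, None, ">100x"),
]


def frequency_table(entries_per_species):
    """Count how many species fall into each occurrence bucket."""
    table = Counter()
    for count in entries_per_species.values():
        for low, high, label in BUCKETS:
            if count >= low and (high is None or count <= high):
                table[label] += 1
                break
    return [(label, table[label]) for _, _, label in BUCKETS]
```

As a consistency check, the fourteen rows published above sum to 7'778,
the total number of species given in A.2.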


     A.2.2   Table of the most represented species

  ------  ---------  --------------------------------------------
  Number  Frequency  Species
  ------  ---------  --------------------------------------------
       1       9172  Homo sapiens (Human)
       2       6169  Mus musculus (Mouse)
       3       4892  Saccharomyces cerevisiae (Baker's yeast)
       4       4832  Escherichia coli
       5       3442  Rattus norvegicus (Rat)
       6       2402  Bacillus subtilis
       7       2291  Caenorhabditis elegans
       8       2116  Schizosaccharomyces pombe (Fission yeast)
       9       1952  Arabidopsis thaliana (Mouse-ear cress)
      10       1773  Haemophilus influenzae
      11       1764  Drosophila melanogaster (Fruit fly)
      12       1529  Methanococcus jannaschii
      13       1485  Escherichia coli O157:H7
      14       1389  Bos taurus (Bovine)
      15       1371  Mycobacterium tuberculosis
      16       1240  Salmonella typhimurium
      17       1062  Gallus gallus (Chicken)
      18        942  Shigella flexneri
      19        919  Synechocystis sp. (strain PCC 6803)
      20        914  Escherichia coli O6
      21        876  Archaeoglobus fulgidus
      22        839  Pseudomonas aeruginosa
      23        838  Xenopus laevis (African clawed frog)
      24        822  Sus scrofa (Pig)
      25        771  Salmonella typhi
      26        716  Aquifex aeolicus
      27        704  Oryctolagus cuniculus (Rabbit)
      28        687  Mycoplasma pneumoniae
      29        670  Rhizobium meliloti (Sinorhizobium meliloti)
      30        609  Vibrio cholerae
      31        599  Treponema pallidum
      32        581  Mycobacterium leprae
      33        572  Buchnera aphidicola (subsp. Acyrthosiphon pisum)
      34        560  Buchnera aphidicola (subsp. Schizaphis graminum)
      35        536  Helicobacter pylori (Campylobacter pylori)
      36        535  Rickettsia prowazekii
      37        524  Yersinia pestis
      38        519  Helicobacter pylori J99 (Campylobacter pylori J99)
      39        519  Streptomyces coelicolor
      40        494  Bacillus halodurans
      41        491  Zea mays (Maize)
      42        491  Methanobacterium thermoautotrophicum
      43        486  Mycoplasma genitalium
      44        480  Pasteurella multocida
      45        454  Anabaena sp. (strain PCC 7120)
      46        432  Lactococcus lactis (subsp. lactis) (Streptococcus lactis)
      47        419  Thermotoga maritima
      48        416  Oryza sativa (Rice)
      49        405  Borrelia burgdorferi (Lyme disease spirochete)
      50        404  Chlamydia trachomatis
      51        403  Rhizobium sp. (strain NGR234)
      52        393  Canis familiaris (Dog)
      53        391  Chlamydia pneumoniae (Chlamydophila pneumoniae)
      54        390  Neisseria meningitidis (serogroup B)
      55        386  Neisseria meningitidis (serogroup A)
      56        381  Chlamydia muridarum
      57        366  Caulobacter crescentus
      58        365  Pyrococcus horikoshii
      59        359  Listeria monocytogenes
      60        359  Clostridium acetobutylicum
      61        357  Pyrococcus abyssi
      62        354  Ralstonia solanacearum (Pseudomonas solanacearum)
      63        352  Listeria innocua
      64        352  Rhizobium loti (Mesorhizobium loti)
      65        350  Streptococcus pneumoniae
      66        346  Agrobacterium tumefaciens (strain C58 / ATCC 33970)
      67        341  Nicotiana tabacum (Common tobacco)
      68        337  Xylella fastidiosa
      69        335  Deinococcus radiodurans
      70        332  Ovis aries (Sheep)
      71        326  Xanthomonas campestris (pv. campestris)
      72        325  Halobacterium sp. (strain NRC-1)
      73        320  Staphylococcus aureus (strain N315)
      74        320  Campylobacter jejuni
      75        317  Staphylococcus aureus (strain Mu50 / ATCC 700699)
      76        316  Dictyostelium discoideum (Slime mold)
      77        311  Clostridium perfringens
      78        299  Sulfolobus solfataricus
      79        297  Staphylococcus aureus (strain MW2)
      80        290  Corynebacterium glutamicum (Brevibacterium flavum)
      81        288  Pisum sativum (Garden pea)
      82        287  Xanthomonas axonopodis (pv. citri)
      83        285  Streptococcus pyogenes
      84        283  Aeropyrum pernix
      85        278  Pyrococcus furiosus
      86        278  Staphylococcus aureus
      87        269  Brucella melitensis
      88        268  Bacteriophage T4
      89        266  Neurospora crassa
      90        265  Triticum aestivum (Wheat)
      91        264  Candida albicans (Yeast)
      92        261  Rickettsia conorii
      93        258  Hordeum vulgare (Barley)
      94        254  Vaccinia virus (strain Copenhagen)
      95        251  Glycine max (Soybean)
      96        250  Lycopersicon esculentum (Tomato)
      97        248  Rhodobacter capsulatus (Rhodopseudomonas capsulata)
      98        247  Thermoanaerobacter tengcongensis
      99        246  Solanum tuberosum (Potato)
     100        244  Pseudomonas putida


     A.2.3   Taxonomic distribution of the sequences

   Kingdom       Sequences (% of the database)
    Archaea            7119 (  6%)
    Bacteria          46344 ( 38%)
    Eukaryota         60623 ( 49%)
    Viruses            8478 (  7%)

   Within Eukaryota:

    Category            sequences (% of Eukaryota) (% of the complete database)
     Human                   9172 ( 15%)           (  7%)
     Other Mammalia         16041 ( 26%)           ( 13%)
     Other Vertebrata        5806 ( 10%)           (  5%)
     Viridiplantae           9581 ( 16%)           (  8%)
     Fungi                   9337 ( 15%)           (  8%)
     Insecta                 3352 (  6%)           (  3%)
     Nematoda                2504 (  4%)           (  2%)
     Other                   4830 (  8%)           (  4%)
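
The percentages in this section are the sequence counts divided by the
relevant total and rounded to the nearest whole number. A minimal Python
sketch of the arithmetic for the kingdom table, which also recomputes the
share of the first twenty species quoted in A.2:

```python
# Sequence counts copied from the kingdom table above.
kingdoms = {"Archaea": 7119, "Bacteria": 46344, "Eukaryota": 60623, "Viruses": 8478}

total = sum(kingdoms.values())  # total number of entries in the release
shares = {name: round(100 * n / total) for name, n in kingdoms.items()}

# Share of the first twenty species (51'656 sequences, see A.2).
top_twenty_share = round(100 * 51656 / total, 1)

print(total, shares, top_twenty_share)
```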


     A.3   Sequence size

     A.3.1   Distribution of the sequences by size (excluding fragments)

               From   To  Number             From   To   Number
                  1-  50    2283             1001-1100     1127
                 51- 100    8420             1101-1200      796
                101- 150   12542             1201-1300      550
                151- 200   11267             1301-1400      379
                201- 250   11387             1401-1500      305
                251- 300   10019             1501-1600      213
                301- 350   10039             1601-1700      166
                351- 400    9804             1701-1800      118
                401- 450    7435             1801-1900      128
                451- 500    6547             1901-2000      106
                501- 550    5067             2001-2100       59
                551- 600    3400             2101-2200       96
                601- 650    2753             2201-2300       99
                651- 700    2015             2301-2400       57
                701- 750    1766             2401-2500       56
                751- 800    1474             >2500          326
                801- 850    1101
                851- 900    1142
                901- 950     817
                951-1000     704
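
The table uses bins of 50 residues up to 1000, bins of 100 residues from
1001 to 2500, and a single open-ended bin beyond that. A short Python
sketch of how a sequence length maps to its bin (the function name
size_bin is hypothetical):

```python
def size_bin(length):
    """Return the (low, high) bin of the table for a sequence length.

    high is None for the open-ended '>2500' bin.
    """
    if length < 1:
        raise ValueError("length must be positive")
    if length > 2500:
        return (2501, None)
    width = 50 if length <= 1000 else 100
    low = (length - 1) // width * width + 1
    return (low, low + width - 1)
```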


     A.3.2   Longest and shortest sequences

   The shortest sequence is  GRWM_HUMAN (P24272) :     3 amino acids.
   The longest sequence is   NEBU_HUMAN (P20929) :  6669 amino acids.


     A.4   Journal citations

Note: the following citation statistics reflect the number of distinct
journal citations.

Total number of journals cited in this release of Swiss-Prot: 1'316


     A.4.1   Table of the frequency of journal citations

        Journals cited 1x:  496
                       2x:  167
                       3x:   84
                       4x:   61
                       5x:   46
                       6x:   47
                       7x:   26
                       8x:   25
                       9x:   22
                      10x:   11
                  11- 20x:   98
                  21- 50x:   98
                  51-100x:   39
                    >100x:   96


     A.4.2   List of the most cited journals in Swiss-Prot

   Nb    Citations   Journal name
   --    ---------   -------------------------------------------------------------
    1         9138   Journal of Biological Chemistry
    2         5013   Proceedings of the National Academy of Sciences of the U.S.A.
    3         3631   Nucleic Acids Research
    4         3612   Journal of Bacteriology
    5         3381   Gene
    6         2663   FEBS Letters
    7         2598   Biochemical and Biophysical Research Communications
    8         2429   European Journal of Biochemistry
    9         2383   Biochemistry
   10         2171   The EMBO Journal
   11         2045   Nature
   12         2024   Biochimica et Biophysica Acta
   13         1821   Journal of Molecular Biology
   14         1752   Genomics
   15         1579   Cell
   16         1542   Molecular and Cellular Biology
   17         1243   Biochemical Journal
   18         1146   Science
   19         1123   Plant Molecular Biology
   20         1117   Molecular and General Genetics
   21         1068   Molecular Microbiology
   22          855   Journal of Biochemistry
   23          830   Virology
   24          748   Human Molecular Genetics
   25          693   Journal of Cell Biology
   26          645   Nature Genetics
   27          597   Journal of Virology
   28          588   Plant Physiology
   29          582   Human Mutation
   30          579   Genes and Development
   31          550   Oncogene
   32          538   The American Journal of Human Genetics
   33          530   Infection and Immunity
   34          529   Yeast
   35          516   Journal of Immunology
   36          494   Journal of General Virology
   37          469   Archives of Biochemistry and Biophysics
   38          454   Structure
   39          446   FEMS Microbiology Letters
   40          433   Microbiology
   41          394   Development
   42          379   Human Genetics
   43          376   Current Genetics
   44          376   Nature Structural Biology
   45          347   Genetics
   46          343   Molecular and Biochemical Parasitology
   47          335   Blood
   48          317   Applied and Environmental Microbiology
   49          313   Journal of Clinical Investigation
   50          299   Molecular Endocrinology
   51          283   DNA and Cell Biology
   52          282   Protein Science
   53          281   Journal of Molecular Evolution
   54          276   Developmental Biology
   55          276   Mammalian Genome
   56          271   Biological Chemistry Hoppe-Seyler
   57          251   Cancer Research
   58          248   Journal of Experimental Medicine
   59          246   Neuron
   60          241   Immunogenetics
   61          240   Mechanisms of Development
   62          229   Journal of General Microbiology
   63          228   Endocrinology
   64          221   DNA Sequence
   65          217   Acta Crystallographica, Section D
   66          213   Hoppe-Seyler's Zeitschrift fur Physiologische Chemie
   67          209   Molecular Biology of the Cell
   68          207   The Plant Cell
   69          203   Journal of Cell Science
   70          191   Molecular Biology and Evolution
   71          190   Brain Research. Molecular Brain Research
   72          187   The Plant Journal
   73          183   Journal of Neurochemistry
   74          180   Journal of Neuroscience
   75          160   Comparative Biochemistry and Physiology
   76          158   Cytogenetics and Cell Genetics
   77          156   DNA
   78          154   Bioscience, Biotechnology, and Biochemistry
   79          152   The Journal of Clinical Endocrinology and Metabolism
   80          145   Toxicon
   81          144   Molecular Pharmacology
   82          143   Antimicrobial Agents and Chemotherapy
   83          140   American Journal of Physiology
   84          131   Biochimie
   85          127   Bioorganicheskaia Khimiia
   86          125   Virus Research
   87          125   Proteins
   88          122   DNA Research
   89          121   Molecular Plant-Microbe Interactions
   90          119   Hemoglobin
   91          116   Peptides
   92          114   Agricultural and Biological Chemistry
   93          112   Current Biology
   94          111   Journal of Investigative Dermatology
   95          110   Molecular and Cellular Endocrinology
   96          106   Genome Research

     A.5   Statistics for some line types

The following table summarizes the total number of some Swiss-Prot lines,
as well as the number of entries with at least one such line, and the
frequency of the lines.

                                   Total    Number of  Average
Line type / subtype                number   entries    per entry
---------------------------------  -------- ---------  ---------

References (RL)                     232571              1.90
   Journal                          195556    111991    1.60
   Submitted to EMBL/GenBank/DDBJ    34500     27873    0.28
   Unpublished observations            536       532   <0.01
   Submitted to Swiss-Prot             464       462   <0.01
   Plant Gene Register                 463       453   <0.01
   Book citation                       460       450   <0.01
   Thesis                              190       188   <0.01
   Submitted to other databases        190       189   <0.01
   Unpublished results                 123       121   <0.01
   Patent                               87        86   <0.01
   Worm Breeder's Gazette                2         2   <0.01

Comments (CC)                       405433              3.31
   SIMILARITY                       117866    103489    0.96
   FUNCTION                          77092     75796    0.63
   SUBCELLULAR LOCATION              55038     55038    0.45
   CATALYTIC ACTIVITY                39528     37138    0.32
   SUBUNIT                           33846     33846    0.28
   PATHWAY                           17449     16966    0.14
   TISSUE SPECIFICITY                13626     13626    0.11
   COFACTOR                          12141     12141    0.10
   MISCELLANEOUS                      7816      7190    0.06
   PTM                                7140      6571    0.06
   ALTERNATIVE PRODUCTS               3946      3946    0.03
   INDUCTION                          3558      3558    0.03
   DOMAIN                             3535      3241    0.03
   DEVELOPMENTAL STAGE                3362      3362    0.03
   CAUTION                            3342      3172    0.03
   DISEASE                            2244      1868    0.02
   ENZYME REGULATION                  1753      1753    0.01
   MASS SPECTROMETRY                   893       810    0.01
   DATABASE                            818       751    0.01
   POLYMORPHISM                        343       334   <0.01
   BIOTECHNOLOGY                        50        50   <0.01
   PHARMACEUTICAL                       47        47   <0.01

Features (FT)                       655938              5.35
   DOMAIN                            95401     28727    0.78
   TRANSMEM                          77067     16988    0.63
   CONFLICT                          47337     16661    0.39
   CARBOHYD                          45507     11138    0.37
   DISULFID                          41846     10872    0.34
   TURN                              39177      2956    0.32
   METAL                             36827     10004    0.30
   STRAND                            36304      2644    0.30
   HELIX                             27742      2845    0.23
   ACT_SITE                          24322     15216    0.20
   CHAIN                             23456     19176    0.19
   VARIANT                           23307      4423    0.19
   REPEAT                            22336      3704    0.18
   NP_BIND                           15500     10893    0.13
   SIGNAL                            14828     14826    0.12
   MOD_RES                           13336      7528    0.11
   NON_TER                           10321      7875    0.08
   BINDING                            8145      6285    0.07
   ZN_FING                            7821      2770    0.06
   VARSPLIC                           6951      3249    0.06
   SITE                               6265      4319    0.05
   INIT_MET                           5574      5545    0.05
   PROPEP                             4686      4026    0.04
   MUTAGEN                            4273      1337    0.03
   DNA_BIND                           4193      3949    0.03
   CA_BIND                            4049      1149    0.03
   LIPID                              2946      2395    0.02
   TRANSIT                            2582      2562    0.02
   PEPTIDE                            2517      1001    0.02
   NON_CONS                            804       411    0.01
   UNSURE                              290       123   <0.01
   SE_CYS                              111        73   <0.01
   THIOETH                              94        32   <0.01
   THIOLEST                             23        23   <0.01

Cross-references (DR)               999237              8.15
   EMBL                             230657    116257    1.88
   InterPro                         195677    104236    1.60
   Pfam                             133012     99557    1.09
   PROSITE                          105218     66696    0.86
   PIR                               47040     35736    0.38
   PRINTS                            39413     34822    0.32
   SMART                             38729     29473    0.32
   HSSP                              38069     38069    0.31
   TIGRFAMs                          31394     29063    0.26
   ProDom                            30120     28820    0.25
   HAMAP                             23868     23778    0.19
   PDB                               11737      3547    0.10
   TIGR                              11065     11020    0.09
   MIM                                8171      7086    0.07
   Genew                              7836      7788    0.06
   MGD                                5820      5805    0.05
   SGD                                4936      4882    0.04
   EcoGene                            4228      4226    0.03
   MEROPS                             3316      3222    0.03
   TRANSFAC                           2464      2214    0.02
   WormPep                            2413      2239    0.02
   SubtiList                          2362      2361    0.02
   FlyBase                            2236      2173    0.02
   GeneDB_SPombe                      2131      2101    0.02
   TubercuList                        1400      1363    0.01
   StyGene                            1196      1193    0.01
   SWISS-2DPAGE                        810       809    0.01
   ListiList                           712       658    0.01
   Leproma                             585       581   <0.01
   Gramene                             411       411   <0.01
   MaizeDB                             405       401   <0.01
   HIV                                 370       354   <0.01
   REBASE                              358       353   <0.01
   ECO2DBASE                           351       299   <0.01
   DictyDb                             319       316   <0.01
   GlycoSuiteDB                        259       259   <0.01
   ZFIN                                225       225   <0.01
   PHCI-2DPAGE                         211       211   <0.01
   MypuList                            131       131   <0.01
   Aarhus/Ghent-2DPAGE                 128        98   <0.01
   Siena-2DPAGE                        104       104   <0.01
   HSC-2DPAGE                           85        85   <0.01
   PhosSite                             53        53   <0.01
   COMPLUYEAST-2DPAGE                   50        50   <0.01
   PMMA-2DPAGE                          47        47   <0.01
   Maize-2DPAGE                         39        39   <0.01
   SagaList                             25        25   <0.01
   ANU-2DPAGE                           15        15   <0.01
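
The three columns of this table can be computed per line type from a set of
entries. The Python sketch below is a simplified illustration: it counts
lines by their two-letter line code only and does not break them down by
subtype as the table does. The function name line_type_stats is
hypothetical, and entries are assumed to be lists of flat-file lines.

```python
def line_type_stats(entries, code):
    """Return (total lines, entries with at least one, average per entry)."""
    total = 0
    with_line = 0
    for lines in entries:
        # A line belongs to a type when it starts with the two-letter code.
        n = sum(1 for line in lines if line.startswith(code + " "))
        total += n
        if n:
            with_line += 1
    average = total / len(entries) if entries else 0.0
    return total, with_line, average
```

With the 122'564 entries implied by the taxonomic table in A.2.3, the
999'237 cross-reference lines give 999237 / 122564, or about 8.15 per
entry, in agreement with the table.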


     A.6   Miscellaneous statistics

Total number of distinct authors cited in Swiss-Prot: 164'410

Total number of chloroplast-encoded sequences: 3'131
Total number of mitochondrial-encoded sequences: 2'385
Total number of cyanelle-encoded sequences: 145
Total number of plasmid-encoded sequences: 2'624

Number of additional sequences encoded in splice variants : 5'661

--End of document--

  

Swiss-Prot release 40.0

Published October 1, 2001
  -------------------------------------------------------------------------
                                           SWISS-PROT Protein Knowledgebase
                                                              Release Notes
                                                   Release 40, October 2001
  -------------------------------------------------------------------------

                             Table of contents

 1   Introduction
 2   Description of the changes made to SWISS-PROT since release 38
 3   Forthcoming changes
 4   Status of the documentation files
 5   The ExPASy World-Wide Web server
 6   TrEMBL - a supplement to SWISS-PROT
 7   FTP access to SWISS-PROT and TrEMBL
 8   ENZYME and PROSITE
 9   We need your help!
 A   Appendix A

                             1   Introduction

Release 40.0 of SWISS-PROT contains 101'602 sequence entries, comprising
37'315'215 amino acids abstracted from 91'880 references. This represents
an increase of 18% over release 39. The growth of the data bank is
summarized below.

      Release     Date    Number of   Number of
                           entries   amino acids
        2.0      09/86      3'939       900'163
        3.0      11/86      4'160       969'641
        4.0      04/87      4'387     1'036'010
        5.0      09/87      5'205     1'327'683
        6.0      01/88      6'102     1'653'982
        7.0      04/88      6'821     1'885'771
        8.0      08/88      7'724     2'224'465
        9.0      11/88      8'702     2'498'140
        10.0     03/89     10'008     2'952'613
        11.0     07/89     10'856     3'265'966
        12.0     10/89     12'305     3'797'482
        13.0     01/90     13'837     4'347'336
        14.0     04/90     15'409     4'914'264
        15.0     08/90     16'941     5'486'399
        16.0     11/90     18'364     5'986'949
        17.0     02/91     20'024     6'524'504
        18.0     05/91     20'772     6'792'034
        19.0     08/91     21'795     7'173'785
        20.0     11/91     22'654     7'500'130
        21.0     03/92     23'742     7'866'596
        22.0     05/92     25'044     8'375'696
        23.0     08/92     26'706     9'011'391
        24.0     12/92     28'154     9'545'427
        25.0     04/93     29'955    10'214'020
        26.0     07/93     31'808    10'875'091
        27.0     10/93     33'329    11'484'420
        28.0     02/94     36'000    12'496'420
        29.0     06/94     38'303    13'464'008
        30.0     10/94     40'292    14'147'368
        31.0     02/95     43'470    15'335'248
        32.0     11/95     49'340    17'385'503
        33.0     02/96     52'205    18'531'384
        34.0     10/96     59'021    21'210'389
        35.0     11/97     69'113    25'083'768
        36.0     07/98     74'019    26'840'295
        37.0     12/98     77'977    28'268'293
        38.0     07/99     80'000    29'085'965
        39.0     05/00     86'593    31'411'114
        40.0     10/01    101'602    37'315'215
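
The growth between any two releases can be recomputed from this table. The
Python sketch below does so for releases 39.0 and 40.0, for both columns.
The source does not state which column or rounding the percentage quoted in
the introduction is based on, so the two results are shown side by side
without assuming either.

```python
def growth_percent(previous, current):
    """Relative growth from previous to current, in percent."""
    return 100.0 * (current - previous) / previous


# Figures for releases 39.0 and 40.0, copied from the table above.
entries_growth = growth_percent(86593, 101602)
residues_growth = growth_percent(31411114, 37315215)

print(f"entries: {entries_growth:.1f}%  amino acids: {residues_growth:.1f}%")
```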

    2   Description of the changes made to SWISS-PROT since release 38

The name of the database changed from 'SWISS-PROT protein sequence
database' to 'SWISS-PROT knowledgebase' to emphasize that SWISS-PROT
collects far more than just information on protein sequences and that it is
a central linking and linked database which connects the various findings
in the diverse fields of proteomics research.

We apologize that, due to technical problems, we never posted the release
notes of release 39. This document therefore describes not only the changes
that took place since release 39 but also those between releases 38 and
39.

     2.1   Sequences and annotations

15'184 sequences have been added since release 39, the sequence data of
2'908 existing entries has been updated and the annotations of 44'684
entries have been revised. With this release SWISS-PROT has passed the
symbolic mark of 100 thousand entries.

     2.2   The HPI project

The Human Proteomics Initiative (HPI) has been introduced to devote a major
effort to the annotation of all known human sequences according to the
quality standards of SWISS-PROT. This means that, for each known protein, a
wealth of information is provided, which includes the description of its
function, its domain structure, subcellular location, posttranslational
modifications, variants, similarities to other proteins, etc. This implies
not only the annotation of newly detected proteins, but also the
integration of new research data into the existing entries by specialized
biologists, who are in close contact with experts all over the world.

There are currently 7'471 annotated human sequences in SWISS-PROT. These
entries are associated with 19'922 literature references, 18'974
experimental or predicted PTMs, 1'697 splice variants and 12'061
polymorphisms (most of which are linked with disease states).

Simultaneously, two further efforts were intensified: the description of
human diseases associated with deficiency(ies) in the protein, and the
annotation of mammalian orthologs of human proteins at a level equivalent
to that of the cognate human sequences.

For all aspects of the HPI project, we would appreciate the help and
collaboration of the scientific community. Information concerning the human
proteome is highly critical to a large section of the life science
community. We therefore appeal to the user community to fully participate
in this initiative by providing all the necessary information to help and
to speed up the comprehensive annotation of the human proteome.

For a detailed description of the HPI project and its current status please
consult:

     http://www.expasy.org/sprot/hpi/

     2.3   The HAMAP project

The first complete microbial genomic sequence was that of the bacterium
Haemophilus influenzae, which became available in 1995. Since then more
than 50 bacterial and archaeal genomes have been sequenced and many more
sequencing projects of pathogenic as well as nonpathogenic microbes are in
progress. To date, the publicly available microbial genomes collectively
encode more than 100'000 different proteins.

In order to handle the large amount of "raw" data coming from the microbial
genomic sequencing, the High quality Automated Microbial Annotation of
Proteomes (HAMAP) project was initiated. The latter aims to automatically
annotate a significant percentage of proteins which originate from
microbial genome sequencing projects.

To maintain a high level of annotation quality, specific tools are
developed to deal with two completely separate subsets of bacterial and
archaeal proteins: proteins that have no recognizable similarity to any
other microbial or non-microbial proteins ("ORFans") and proteins that are
part of well-defined families or subfamilies. This is done by using a rule
system that describes the level and extent of annotations that can be
assigned by similarity with a prototype manually-annotated entry. The
result is a curated entry whose quality is identical to that produced
manually by an expert annotator.

The programs in development are designed to recognize protein
peculiarities, and only proteins which match the defined criteria will be
processed automatically. Protein sequences which fail to fit into that rule
system will be further analyzed by SWISS-PROT expert annotators.

For a detailed description of the HAMAP project and its current status
please consult:

     http://www.expasy.org/sprot/hamap/

     2.4   What's happening with the model organisms?

We have selected a number of organisms that are the target of genome
sequencing and/or mapping projects and for which we intend to:

   * be as complete as possible. All sequences available at a given time
     should be immediately included in SWISS-PROT. This also includes
     sequence corrections and updates;
   * provide a higher level of annotation;
   * provide cross-references to specialized database(s) that contain,
     among other data, some genetic information about the genes that code
     for these proteins;
   * provide specific indices or documents.

From our efforts to annotate human sequence entries as completely as possible
arose the HPI project (see 2.2), and the bacterial model organisms became
part of the HAMAP project (see 2.3). Here is the current status of the
model organisms which are not covered by these two projects:

      Organism        Database          Index file     Number of
                      cross-references                 sequences
      ------------    ----------------  -------------- ---------
      A.thaliana      None yet          In preparation     1'409
      C.albicans      None yet          CALBICAN.TXT         256
      C.elegans       Wormpep           CELEGANS.TXT       2'184
      D.discoideum    DictyDB           DICTY.TXT            311
      D.melanogaster  FlyBase           FLY.TXT            1'514
      M.musculus      MGD               MGDTOSP.TXT        4'816
      S.cerevisiae    SGD               YEAST.TXT          4'859
      S.pombe         None yet          POMBE.TXT          1'782

     2.5   Progress in the conversion of SWISS-PROT to mixed-case
           characters

We are gradually converting SWISS-PROT entries from all 'UPPER CASE' to
'MiXeD CaSe'. The line-types that have been converted between release 38
and 40 are: DE (DEscription), most RC (Reference Comment) topics (SPECIES,
TISSUE, PLASMID and TRANSPOSON) and DR (Database cross-Reference). The new
OX line (Organism cross-reference; see section 2.8) and the new CC topics
PHARMACEUTICAL and BIOTECHNOLOGY (described in section 2.11) have been
introduced in mixed case. The CC topic MASS SPECTROMETRY has been converted
to mixed case. As described in section 3.5, the process of converting all
of SWISS-PROT to mixed case continues.

     2.6   Extension of the accession number system

With the creation of the TrEMBL database and the rapid increase in the
amount of sequence data, we were faced with a problem of availability of
accession numbers. We used a system based on a one-letter prefix followed
by 5 digits. This system was also used by the nucleotide sequence databases
which had originally reserved for SWISS-PROT the prefix letters 'O', 'P'
and 'Q'. Having run out of space (due mainly to ESTs), the nucleotide
sequence databases have been forced to choose a new format, which became a
two-letter prefix followed by 6 digits.

We have now used up all possible numbers with 'O', 'P' and 'Q'. As we
believe that changing the format of the accession numbers to that used now
by the nucleotide database would have created havoc on the numerous
software packages using SWISS-PROT, we decided to keep a system of
accession numbers based on a 6-character code, but with the following
format extension:

  1       2     3          4          5          6
  [O,P,Q] [0-9] [A-Z, 0-9] [A-Z, 0-9] [A-Z, 0-9] [0-9]

What the above means is that we kept a 6-character code, but that in
positions 3, 4 and 5 of this code any combination of letters and numbers
can be present. This format allows a total of 14 million accession numbers
(compared with only 300'000 with the former system).

We only allow numbers in positions 2 and 6 so that the SWISS-PROT accession
numbers cannot be mistaken for gene names, acronyms, other types of
accession numbers or any kind of word!

Examples: P0A3S2, Q2ASD4, O13YX2, P9B123.
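
The extended format lends itself to a mechanical check. The following is a
minimal sketch in Python, not part of the SWISS-PROT distribution; the names
AC_PATTERN, is_valid_ac, ac_capacity and old_ac_capacity are illustrative
assumptions. It also reproduces the two capacity figures quoted above.

```python
import re

# Position 1: O, P or Q; positions 2 and 6: a digit;
# positions 3, 4 and 5: any uppercase letter or digit.
AC_PATTERN = re.compile(r"[OPQ][0-9][A-Z0-9]{3}[0-9]")


def is_valid_ac(ac: str) -> bool:
    """Return True if `ac` fits the extended 6-character format."""
    return AC_PATTERN.fullmatch(ac) is not None


def ac_capacity() -> int:
    """Number of accession numbers the extended format can express."""
    return 3 * 10 * 36 ** 3 * 10   # 13'996'800, i.e. about 14 million


def old_ac_capacity() -> int:
    """Number of accession numbers in the former letter + 5 digits system."""
    return 3 * 10 ** 5             # 300'000


examples = ["P0A3S2", "Q2ASD4", "O13YX2", "P9B123"]
print(all(is_valid_ac(ac) for ac in examples))
print(ac_capacity(), old_ac_capacity())
```

Accession numbers in the former format (for example P16070) also match the
pattern, since digits remain allowed in positions 3, 4 and 5.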

     2.7   Multiple AC lines

Starting from release 39, there can be more than one AC (ACcession) line
per SWISS-PROT entry. Strictly speaking this was not a format change and
the SWISS-PROT user's manual always indicated that there could be more than
one AC line per entry. Until recently, a single line was sufficient and the
majority of entries contained only a single accession number. But, in the
process of providing an optimally non-redundant database, we are merging
information from TrEMBL entries into SWISS-PROT entries. When we merge a
TrEMBL entry to a SWISS-PROT one, we add to the latter the accession
number(s) of the TrEMBL entry. The repetition of such a process sometimes
produces an accession number list which can no longer fit in a single AC
line. Therefore there are now some entries with two, three (as shown below)
or more AC lines.

AC   P16070; P22511; Q04858; Q13419; Q13957; Q13958; Q13959; Q13960;
AC   Q13961; Q13967; Q13968; Q13980; Q15861; Q16064; Q16065; Q16066;
AC   Q16208; Q16522;
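
Programs that read only the first AC line will miss accession numbers in
such entries. A minimal sketch of a reader that collects all of them is
given here; the function name parse_ac_lines is an assumption made for
illustration.

```python
def parse_ac_lines(lines):
    """Collect accession numbers from one or more AC lines, in order."""
    accessions = []
    for line in lines:
        if not line.startswith("AC"):
            continue
        # Drop the two-letter line code, then split the ';'-terminated list.
        for token in line[2:].split(";"):
            token = token.strip()
            if token:
                accessions.append(token)
    return accessions


entry = [
    "AC   P16070; P22511; Q04858; Q13419; Q13957; Q13958; Q13959; Q13960;",
    "AC   Q13961; Q13967; Q13968; Q13980; Q15861; Q16064; Q16065; Q16066;",
    "AC   Q16208; Q16522;",
]
accessions = parse_ac_lines(entry)
# The first accession number in the list is the primary one.
primary, secondary = accessions[0], accessions[1:]
print(primary, len(secondary))
```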

     2.8   Introduction of the new line type OX: Organism taxonomy
     cross-reference

The OX (Organism taXonomy cross-reference) line has been introduced to
indicate the identifier to a specific organism in a taxonomic database. The
number of taxonomic codes is identical to the number of species given in
the OS line. There can be more than one OX line in an entry and its format
is:

OX   Taxonomy-database_Qualifier=Taxonomic code[, Taxonomic code...];

There are cross-references to the taxonomic database of NCBI, which is
associated with the qualifier 'TaxID' and a one- to six-digit taxonomic
code.

Examples of its usage:

OX   NCBI_TaxID=10116;

OX   NCBI_TaxID=9606, 10090, 9913, 9823, 10141, 10029, 10030, 10116, 9986,
OX   9031, 8355, 7227, 7213, 7108, 7130;
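
Because the list of taxonomic codes may continue over several OX lines, a
reader has to join the lines before splitting the list. A minimal sketch,
assuming the lines of a single entry are passed in; the function name
parse_ox_lines is hypothetical.

```python
def parse_ox_lines(lines):
    """Return (qualifier, [taxonomic codes]) from one or more OX lines."""
    # Join the payload of all OX lines, then drop the terminating ';'.
    payload = " ".join(
        line[2:].strip() for line in lines if line.startswith("OX")
    ).rstrip(";")
    qualifier, _, codes = payload.partition("=")
    return qualifier, [int(code) for code in codes.split(",") if code.strip()]


single = ["OX   NCBI_TaxID=10116;"]
multi = [
    "OX   NCBI_TaxID=9606, 10090, 9913, 9823, 10141, 10029, 10030, 10116, 9986,",
    "OX   9031, 8355, 7227, 7213, 7108, 7130;",
]
print(parse_ox_lines(single))
print(len(parse_ox_lines(multi)[1]))
```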

     2.9   Changes concerning the RC line

We are gradually implementing controlled vocabularies for the different
types of RC tokens. To complement the tissue list (TISSLIST.TXT), we have
now added a plasmid list (PLASMID.TXT) and are in the process of creating a
strain list. Controlled vocabularies are part of the SWISS-PROT
documentation files that are all described in section 4.

     2.10   Changes concerning the RX line

The RX line format has changed; it now also provides identifiers for the
bibliographic database PubMed.

The old format was:

RX   MEDLINE; unique_identifier.

The new format is:

RX   BIBLIOGRAPHIC_DATABASE=IDENTIFIER[; BIBLIOGRAPHIC_DATABASE=IDENTIFIER...];

Example of RX lines:

RX   PubMed=9145897;
RX   MEDLINE=79012484; PubMed=358200;
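
Software that must read entries from both older and newer releases can
accept both RX formats. The following is a minimal sketch; the function
name parse_rx_line is an assumption, and the old-format line in the usage
example is constructed from the identifier shown above for illustration.

```python
def parse_rx_line(line):
    """Return {database: identifier} for an RX line in either format."""
    payload = line[2:].strip()
    if "=" not in payload:
        # Old format: "MEDLINE; unique_identifier."
        database, _, identifier = payload.partition(";")
        return {database.strip(): identifier.strip().rstrip(".")}
    # New format: "DATABASE=IDENTIFIER; DATABASE=IDENTIFIER;"
    refs = {}
    for item in payload.split(";"):
        item = item.strip()
        if item:
            database, _, identifier = item.partition("=")
            refs[database] = identifier
    return refs


print(parse_rx_line("RX   MEDLINE=79012484; PubMed=358200;"))
print(parse_rx_line("RX   MEDLINE; 79012484."))   # constructed old-style line
```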

     2.11   Introduction of two new CC line topics: BIOTECHNOLOGY and
     PHARMACEUTICAL

We have introduced two new 'topics' for the comments (CC) line type.

The topic 'BIOTECHNOLOGY' has been introduced to describe the use of a
specific protein in the biotechnological industry. This topic contains the
name(s) of the company(ies) that produce the protein or the genetically
manipulated organism as well as a short description of the biotechnological
function of the protein. The brand name(s), under which a protein is
available, is added, if applicable.

Examples of the usage:

CC   -!- BIOTECHNOLOGY: Introduced by genetic manipulation and
CC       expressed in improved ripening tomato by Monsanto. ACC is the
CC       immediate precursor of the phytohormone ethylene who is
CC       involved in the control of ripening. ACC deaminase reduces
CC       ethylene biosynthesis and thus extend the shelf life of fruits
CC       and vegetables.

CC   -!- BIOTECHNOLOGY: Used in the food industry for high temperature
CC       liquefaction of starch-containing mashes and in the detergent
CC       industry to remove starch. Sold under the name Termamyl by
CC       Novozymes.

The topic 'PHARMACEUTICAL' has been introduced to describe the use of a
specific protein as a pharmaceutical drug. The information provided by such
a topic will include the brand name(s) under which a protein is available,
the name(s) of the company(ies) that produce it as well as a short
description of the therapeutic usage of the protein. It should be noted
that any entries containing such a comment field will also be tagged with
the keyword 'Pharmaceutical'.

Examples of the usage:

CC   -!- PHARMACEUTICAL: Available under the names Avonex (Biogen),
CC       Betaseron (Berlex) and Rebif (Serono). Used in the treatment
CC       of multiple sclerosis (MS). Betaseron is a slightly modified
CC       form of IFNB1 with two residue substitutions.

CC   -!- PHARMACEUTICAL: Available under the name Proleukin (Chiron).
CC       Used in patients with renal cell carcinoma or metastatic
CC       melanoma.

     2.12   Cleaning up of comment line (CC) topics

We are continuing a major overhaul of various comment line topics. We would
like the majority of the information stored to be usable by computer
programs (while being human-readable). We are therefore standardizing the
format of the topics.

The two sub-formats of the topic ALTERNATIVE PRODUCTS are:

CC   -!- ALTERNATIVE PRODUCTS: <n> isoforms; <name> (shown here),
CC       <name>, <name> and <name>; are produced by alternative splicing.
CC       [Comment.]

CC   -!- ALTERNATIVE PRODUCTS: <n> isoforms; <name> (shown here),
CC       <name> and <name>; are produced by alternative
CC       initiation. [Comment.]

Examples:

CC   -!- ALTERNATIVE PRODUCTS: At least 5 isoforms; 1 (shown here), 2, 3, 4
CC       and 5; are produced by alternative splicing. They differ in their
CC       acetylcholine receptor clustering activity.

CC   -!- ALTERNATIVE PRODUCTS: 3 isoforms; TRAC-2 (shown here), TRAC-3 and
CC       TRAC-4; are produced by alternative initiation.

We are gradually cleaning up the comment line topic SIMILARITY. To describe
the similarity of the protein to a protein family, we use the following
subformat:

CC   -!- SIMILARITY: Belongs to the <family_name>[. <sub-family_name>].

Examples:

CC   -!- SIMILARITY: Belongs to the 14-3-3 family.

CC   -!- SIMILARITY: Belongs to the glucosamine/galactosamine-6-phosphate
CC       isomerase family. 6-phosphogluconolactonase subfamily.

To describe conserved domains within a protein sequence, we use the
subformat:

CC   -!- SIMILARITY: Contains n <domain_name>.

Examples:

CC   -!- SIMILARITY: Contains 10 HEAT repeats.
CC   -!- SIMILARITY: Contains 1 FKBP-type PPIase domain.
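
The two SIMILARITY subformats can be told apart with regular expressions.
This is a minimal sketch under the assumption that a comment follows one of
the two subformats exactly; the names FAMILY, DOMAIN, cc_text and
parse_similarity are illustrative and not part of any SWISS-PROT tool.

```python
import re

FAMILY = re.compile(
    r"Belongs to the (?P<family>.+? family)\."
    r"(?: (?P<subfamily>.+? subfamily)\.)?"
)
DOMAIN = re.compile(r"Contains (?P<count>\d+) (?P<domain>.+)\.")


def cc_text(lines):
    """Join the lines of one CC topic and drop the 'CC   -!- TOPIC: ' prefix."""
    text = " ".join(line[2:].strip() for line in lines)
    return text.split(": ", 1)[1]


def parse_similarity(text):
    """Return a dict for a recognized subformat, or None for free text."""
    match = FAMILY.fullmatch(text)
    if match:
        return {"family": match.group("family"),
                "subfamily": match.group("subfamily")}
    match = DOMAIN.fullmatch(text)
    if match:
        return {"count": int(match.group("count")),
                "domain": match.group("domain")}
    return None


two_lines = [
    "CC   -!- SIMILARITY: Belongs to the glucosamine/galactosamine-6-phosphate",
    "CC       isomerase family. 6-phosphogluconolactonase subfamily.",
]
print(parse_similarity(cc_text(two_lines)))
print(parse_similarity(cc_text(["CC   -!- SIMILARITY: Contains 10 HEAT repeats."])))
```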

     2.13   Changes concerning cross-references (DR line)

We have added cross-references from SWISS-PROT to the following databases:

     2.13.1   GlycoSuiteDB

GlycoSuiteDB, a database of glycan structures available at
http://www.glycosuite.com/ (see Cooper C.A., Harrison M.J., Wilkins M.R.
and Packer N.H.; Nucleic Acids Res. 29:332-335(2001)). The identifiers of
the appropriate DR line are:

 Data bank
 identifier:         GlycoSuiteDB
 Primary identifier: GlycoSuiteDB unique identifier for a glycoprotein,
                     which is identical to the SWISS-PROT primary AC
                     number of that protein.
 Secondary
 identifier:         None; a dash '-' is stored in that field.
 Example:            DR   GlycoSuiteDB; P05067; -.

     2.13.2   SMART

The Simple Modular Architecture Research Tool (SMART), a database of
functional sites available at http://smart.embl-heidelberg.de/ (see Schultz
J., Copley R.R., Doerks T., Ponting C.P. and Bork P.; Nucleic Acids Res.
28:231-234(2000)). The cross-references for this database are composed of
the following items:

 Data bank identifier: SMART
 Primary identifier:   SMART unique identifier for a domain.
 Secondary identifier: Abbreviation for the name of a domain or module.
 Fourth item:          Number of hits of the domain in the entry.
 Example:              DR   SMART; SM00370; LRR; 6.

     2.13.3   Leproma

The Mycobacterium leprae genome database Leproma, which is available at
http://genolist.pasteur.fr/Leproma/. The information is available in the DR
line:

 Data bank identifier: Leproma
 Primary identifier:   Leproma unique identifier for an ORF.
 Secondary identifier: None; a dash '-' is stored in that field.
 Example:              DR   Leproma; ML0485; -.

     2.13.4   MEROPS

MEROPS, the protease database available at http://www.merops.co.uk/ (see
Rawlings N.D. and Barrett A.J.; Nucleic Acids Res. 28:323-325(2000)). The
following information is available in the two qualifiers of the DR line:

 Data bank identifier: MEROPS
 Primary identifier:   The MEROPS unique identifier for a peptidase.
 Secondary identifier: None; a dash '-' is stored in that field.
 Example:              DR   MEROPS; M41.001; -.

     2.13.5   MypuList

The Mycoplasma pulmonis genome database MypuList, available at
http://genolist.pasteur.fr/MypuList/. The following information is
available in the two identifiers of the DR line:

 Data bank identifier: MypuList
 Primary identifier:   The MypuList unique identifier for an ORF.
 Secondary identifier: None; a dash '-' is stored in that field.
 Example:              DR   MypuList; MYPU_4900; -.

     2.13.6   ProDom

Cross-references to the ProDom protein domain database used to be provided
as implicit links; links are now also available as explicit links:

 Data bank identifier:  ProDom
 Primary identifier:    The ProDom unique identifier for a domain.
 Secondary identifier:  The ProDom entry name.
 Fourth item:           Number of hits of the domain in the entry.
 Example for an         DR   ProDom; PD000600; 14-3-3; 1.
 explicit link:

     2.13.7   ANU-2DPAGE

The Australian National University Two-Dimensional Polyacrylamide Gel
Electrophoresis Database (ANU-2DPAGE) is available at
http://semele.anu.edu.au/2d/2d.html (see Imin N., Kerim T., Weinman J.J.
and Rolfe B.G.; Proteomics 1:1149-1161(2001)). The following information is
available in the DR line:

 Data bank
 identifier:          ANU-2DPAGE
 Primary identifier:  ANU-2DPAGE unique identifier, which is identical to
                      the SWISS-PROT primary AC number of that protein.
 Secondary
 identifier:          None; a dash '-' is stored in that field.
 Example:             DR   ANU-2DPAGE; Q9XEA8; -.

     2.13.8   COMPLUYEAST-2DPAGE

The two-dimensional polyacrylamide gel electrophoresis database at the
Universidad Complutense de Madrid (COMPLUYEAST-2DPAGE) is available at
http://babbage.csc.ucm.es/2d/2d.html. The following information is
available in the DR line:

 Data bank
 identifier:        COMPLUYEAST-2DPAGE
 Primary            COMPLUYEAST-2DPAGE unique identifier, which is
 identifier:        identical to the SWISS-PROT primary AC number of that
                    protein.
 Secondary
 identifier:        None; a dash '-' is stored in that field.
 Example:           DR   COMPLUYEAST-2DPAGE; P43067; -.

     2.13.9   PHCI-2DPAGE

The Parasite Host Cell Interaction 2D-PAGE database (PHCI-2DPAGE) is
available at http://www.gram.au.dk/2d/2d.html. The cross-references for
this database are composed of the following items:

 Data bank
 identifier:          PHCI-2DPAGE
 Primary identifier:  PHCI-2DPAGE unique identifier, which is identical to
                      the SWISS-PROT primary AC number of that protein.
 Secondary
 identifier:          None; a dash '-' is stored in that field.
 Example:             DR   PHCI-2DPAGE; Q9Z6V3; -.

     2.13.10   PMMA-2DPAGE

The Purkyne Military Medical Academy 2D-PAGE database (PMMA-2DPAGE) is
available at http://www.pmma.pmfhk.cz/2d/2d.html. The identifiers of the
appropriate DR line are:

 Data bank
 identifier:          PMMA-2DPAGE
 Primary identifier:  PMMA-2DPAGE unique identifier, which is identical to
                      the SWISS-PROT primary AC number of that protein.
 Secondary
 identifier:          None; a dash '-' is stored in that field.
 Example:             DR   PMMA-2DPAGE; Q01995; -.

     2.13.11   Siena-2DPAGE

The 2D-PAGE database from the Department of Molecular Biology, University
of Siena, Italy, is available at http://www.bio-mol.unisi.it/2d/2d.html.
The components of the corresponding DR line are:

 Data bank
 identifier:         Siena-2DPAGE
 Primary identifier: Siena-2DPAGE unique identifier, which is identical to
                     the SWISS-PROT primary AC number of that protein.
 Secondary
 identifier:         None; a dash '-' is stored in that field.
 Example:            DR   Siena-2DPAGE; P01591; -.
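
All the DR lines described in section 2.13 share the same layout: a data
bank identifier, a primary identifier, a secondary identifier (or a dash)
and, for SMART and ProDom, a fourth item giving the number of hits. A
minimal sketch of a reader for this layout follows; the function name
parse_dr_line and the dictionary keys are assumptions chosen for
illustration.

```python
def parse_dr_line(line):
    """Split a DR line into its ';'-separated items."""
    # Drop the line code and the terminating '.', keeping dots inside
    # identifiers such as 'M41.001'.
    payload = line[2:].strip().rstrip(".")
    fields = [field.strip() for field in payload.split(";")]
    database, primary, secondary = fields[:3]
    return {
        "database": database,
        "primary": primary,
        # A dash means that no secondary identifier is stored.
        "secondary": None if secondary == "-" else secondary,
        # Fourth item (number of hits), present only for some data banks.
        "hits": int(fields[3]) if len(fields) > 3 else None,
    }


print(parse_dr_line("DR   SMART; SM00370; LRR; 6."))
print(parse_dr_line("DR   MEROPS; M41.001; -."))
```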

     2.14   Introduction of a new FT key: SE_CYS

Selenocysteine is the 21st 'natural' amino acid. It is now known to occur
in several prokaryotic and eukaryotic proteins. Its mRNA codon is UGA,
which usually serves as a stop codon. However, in the presence of a specific
downstream sequence forming a loop and of a specific translational
elongation factor, UGA is recognized as the site of selenocysteine
incorporation into proteins.

The joint nomenclature committee of the IUPAC/IUBMB (see
http://www.chem.qmw.ac.uk/iupac/jcbn/) officially recommended
(http://www.chem.qmw.ac.uk/iubmb/newsletter/1999/item3.html) a three-letter
and a one-letter symbol for selenocysteine, namely 'Sec' and 'U'.

Introducing a new one-letter code in the sequence records would have
disrupted most, if not all, sequence analysis software. We therefore decided
to change, in SWISS-PROT, the rules used to annotate the presence of
selenocysteine residues in sequence entries in the manner described below.

Selenocysteines were stored, in the sequence records, using the one-letter
symbol 'C' for cysteine and were indicated in the feature table (FT) by a
line of the type:

FT   BINDING       x      x       SELENIUM.

The one-letter code has not been changed (for the reason explained above),
but we introduced a specific feature key (SE_CYS) to indicate the presence
of a selenocysteine at a given sequence position. The above example has
therefore been changed to:

FT   SE_CYS       x       x
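
Software that can handle the one-letter symbol 'U' may use the SE_CYS
features to restore it. The following is a minimal sketch; the function
name mark_selenocysteines is an assumption, and the short sequence in the
usage example is made up for illustration and does not come from any entry.

```python
def mark_selenocysteines(sequence, ft_lines):
    """Return `sequence` with 'U' at every position given by an SE_CYS line."""
    residues = list(sequence)
    for line in ft_lines:
        fields = line.split()
        if len(fields) >= 4 and fields[0] == "FT" and fields[1] == "SE_CYS":
            start, end = int(fields[2]), int(fields[3])
            for position in range(start, end + 1):   # 1-based, inclusive
                if residues[position - 1] != "C":
                    raise ValueError(f"no 'C' stored at position {position}")
                residues[position - 1] = "U"
    return "".join(residues)


# Made-up sequence with a selenocysteine stored as 'C' at position 3.
print(mark_selenocysteines("MACGKC", ["FT   SE_CYS        3      3"]))
```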

We also want to remind users that the keyword 'Selenocysteine' continues
to be used to tag sequence entries that contain at least one such residue.

     2.15   Introduction of feature identifiers to the feature keys
     CARBOHYD and VARIANT

We have introduced unique and stable feature identifiers (FTId) which allow
links to be constructed directly from position-specific annotation in the
feature table to specialized protein-related databases. Examples are
databases specialized in certain types of posttranslational modifications
of proteins, or in mutations. The FTId is always the last component in the
feature description.

     2.15.1   Feature identifiers in FT VARIANT lines of human sequence
     entries

The feature identifiers in the FT VARIANT lines of human sequence entries
make it possible to refer to a sequence variation and serve as anchors for
specifically directed links. A federated single human mutation database
(HmutDB; http://www2.ebi.ac.uk/mutations/central/proposal.html) has been
proposed, and the complete set of all FT VARIANT lines has been indexed for
SRS at EBI (http://srs.ebi.ac.uk/), under the name SWISSCHANGE. The
database SWISSCHANGE can be queried by SWISS-PROT FTIds.

The format of FT VARIANT lines with feature identifiers is:

FT   VARIANT       x      x        Description.
FT                                 /FTId=VAR_number.

Example:

FT   VARIANT       3      3        A -> L.
FT                                 /FTId=VAR_000001.

     2.15.2   Feature identifiers in FT CARBOHYD lines

The same principle is used to further enhance the links to GlycoSuiteDB, an
annotated database of glycan structures (see section 2.13.1). So in
addition to the explicit global link in the DR line, we create unique feature
identifiers for each of the FT CARBOHYD lines, which will allow direct
access to the glycan structure.

The format of FT CARBOHYD lines with feature identifiers is:

FT   CARBOHYD      x        x       Description.
FT                                  /FTId=CAR_number.

Example:

FT   CARBOHYD    251      251       N-LINKED (GLCNAC...).
FT                                  /FTId=CAR_000070.
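
Since the FTId is stored on a continuation line, a reader has to attach it
to the feature that precedes it. A minimal sketch, assuming each feature
description fits on one line as in the examples of sections 2.15.1 and
2.15.2; the names FTID and parse_features are hypothetical.

```python
import re

FTID = re.compile(r"/FTId=([A-Z]+_\d+)\.")


def parse_features(ft_lines):
    """Return one dict per feature, with its FTId when one is present."""
    features = []
    for line in ft_lines:
        body = line[2:].strip()
        match = FTID.fullmatch(body)
        if match and features:
            # Continuation line: the identifier belongs to the last feature.
            features[-1]["ftid"] = match.group(1)
            continue
        key, start, end, *description = body.split(None, 3)
        features.append({
            "key": key,
            "start": int(start),
            "end": int(end),
            "description": description[0] if description else "",
            "ftid": None,
        })
    return features


lines = [
    "FT   VARIANT       3      3        A -> L.",
    "FT                                 /FTId=VAR_000001.",
    "FT   CARBOHYD    251      251       N-LINKED (GLCNAC...).",
    "FT                                  /FTId=CAR_000070.",
]
for feature in parse_features(lines):
    print(feature["key"], feature["start"], feature["ftid"])
```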

     2.16   Change in the syntax of the SQ line

The SQ (SeQuence header) line marks the beginning of the sequence data and
gives a quick summary of its content. The format of the SQ line was:

SQ   SEQUENCE  XXXX AA; XXXXXX MW;  XXXXXXXX CRC32;

The last information item in the SQ line was a 32-bit CRC (Cyclic
Redundancy Check) value which is computed from the sequence. As the number
of available sequences is increasing rapidly, there are now a few cases
where two sequences share the same CRC32 (but none that also share
the same molecular weight 'MW' or number of amino acids 'AA'). To address
this issue we replaced the 32-bit CRC value by a 64-bit CRC. The format of
the SQ line changed therefore to:

SQ   SEQUENCE  XXXX AA; XXXXXX MW;  XXXXXXXXXXXXXXXX CRC64;

Example:

SQ   SEQUENCE   233 AA;  25630 MW;  146A1B48A1475C86 CRC64;
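
The following minimal sketch reads both forms of the SQ line and gives a
rough idea of why a 32-bit value no longer suffices. The names SQ_LINE,
parse_sq_line and expected_collisions are assumptions. The estimate treats
checksums as uniformly random values, which is an idealization, and uses
the round figure of 100'000 entries mentioned in section 2.1.

```python
import re

SQ_LINE = re.compile(
    r"SQ\s+SEQUENCE\s+(\d+) AA;\s+(\d+) MW;\s+([0-9A-F]+) (CRC32|CRC64);"
)


def parse_sq_line(line):
    """Return length, molecular weight and checksum from an SQ line."""
    match = SQ_LINE.fullmatch(line.strip())
    if match is None:
        raise ValueError("not an SQ line")
    return {
        "length": int(match.group(1)),
        "mol_weight": int(match.group(2)),
        "checksum": match.group(3),
        "kind": match.group(4),
    }


def expected_collisions(n_sequences, bits):
    """Expected number of sequence pairs sharing a random `bits`-bit value."""
    pairs = n_sequences * (n_sequences - 1) // 2
    return pairs / 2 ** bits


print(parse_sq_line("SQ   SEQUENCE   233 AA;  25630 MW;  146A1B48A1475C86 CRC64;"))
print(expected_collisions(100_000, 32))   # on the order of one colliding pair
print(expected_collisions(100_000, 64))   # vanishingly small
```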


                          3   Forthcoming changes

     3.1   Version of SP in XML format

A distribution version of SWISS-PROT and TrEMBL in XML format is being
developed. The specifications of this new format will be described when it
is first implemented in TrEMBL.

     3.2   Extension of the entry name format

We endeavor to assign meaningful entry names that facilitate the
identification of the proteins and the species of origin concerning an
entry. Currently the entry name consists of up to ten uppercase
alphanumeric characters. SWISS-PROT uses a general purpose naming
convention that can be symbolized as X_Y, where X is a mnemonic code of at
most 4 alphanumeric characters representing the protein name, the '_' sign
serves as a separator, and the Y is a mnemonic species identification code
of at most 5 alphanumeric characters representing the biological source of
the protein.

We are planning to extend the mnemonic code for the protein name from up
to 4 characters to up to 5 characters. For example, the mnemonic code for the
meiotic recombination protein rec10 is currently 'RE10'. After the
introduction of extended entry names it could be modified to the 5-letter
code 'REC10'.
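
Programs that validate entry names will have to accept the longer protein
mnemonic. The following is a minimal sketch of both conventions; the names
CURRENT_NAME, EXTENDED_NAME and is_entry_name are assumptions, and the
species code SCHPO in the usage example is likewise assumed for
illustration.

```python
import re

# X_Y with X of at most 4 (current) or 5 (extended) uppercase alphanumeric
# characters and Y of at most 5.
CURRENT_NAME = re.compile(r"[A-Z0-9]{1,4}_[A-Z0-9]{1,5}")
EXTENDED_NAME = re.compile(r"[A-Z0-9]{1,5}_[A-Z0-9]{1,5}")


def is_entry_name(name, extended=False):
    """Check `name` against the current or the extended convention."""
    pattern = EXTENDED_NAME if extended else CURRENT_NAME
    return pattern.fullmatch(name) is not None


print(is_entry_name("GSA_ECOLI"))
print(is_entry_name("REC10_SCHPO"), is_entry_name("REC10_SCHPO", extended=True))
```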

     3.3   Multiple RP lines

Starting with release 41, there can be more than one RP (Reference
Position) line per reference in a SWISS-PROT entry. The RP line describes
the extent of the work carried out by the authors of the reference, e.g.
molecule type that has been sequenced, the characterization of the protein,
characterization of PTMs, analysis of the protein structure, detection of
variants, etc.

As the number of experimental results per publication has increased over the
years, a single RP line per reference has more and more often proved
insufficient to hold all the information in a consistent format. We have
therefore decided to allow multiple RP lines.

Example:

RP   SEQUENCE FROM N.A., PARTIAL SEQUENCE, AND CHARACTERIZATION.

could become

RP   SEQUENCE FROM N.A., SEQUENCE OF 23-42 AND 351-365, AND
RP   CHARACTERIZATION.

     3.4   Cleaning up of comment line (CC) topics

We are continuing a major overhaul of various comment line topics. We would
like the majority of the information stored to be usable by computer
programs (while being human-readable). We are therefore standardizing the
format of the topics.

We are gradually cleaning up the comment line topic PATHWAY. To describe
the biochemical pathway in which the protein is involved, we use the
following format:

CC   -!- PATHWAY: biochemical pathway; nth step[. Comment].

Example:

CC   -!- PATHWAY: Coenzyme A (CoA) biosynthesis; first step.

The comment line topic COFACTOR will be modified gradually to the following
format:

CC   -!- COFACTOR: cofactor1[, cofactor2 and cofactor3][. Comment].

Examples:

CC   -!- COFACTOR: Magnesium.
CC   -!- COFACTOR: Copper, Manganese, and Nickel.
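
Once standardized, both topics can be read by a program. A minimal sketch,
assuming the text after the topic name has already been extracted and that
names contain neither '. ' nor ' and '; the function names parse_pathway
and parse_cofactor are hypothetical.

```python
import re


def parse_pathway(text):
    """Split 'biochemical pathway; nth step[. Comment].' into its parts."""
    main, _, comment = text.rstrip(".").partition(". ")
    pathway, _, step = main.partition(";")
    return {
        "pathway": pathway.strip(),
        "step": step.strip() or None,
        "comment": comment or None,
    }


def parse_cofactor(text):
    """Return ([cofactor names], comment or None)."""
    main, _, comment = text.rstrip(".").partition(". ")
    # Names are separated by ',' and/or 'and'.
    names = [n for n in re.split(r",\s*(?:and\s+)?|\s+and\s+", main) if n]
    return names, (comment or None)


print(parse_pathway("Coenzyme A (CoA) biosynthesis; first step."))
print(parse_cofactor("Copper, Manganese, and Nickel."))
```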

     3.5   Continuation of the conversion of SWISS-PROT to mixed-case
     characters

We will continue to convert SWISS-PROT entries from all 'UPPER CASE' to
'MiXeD CaSe'. In release 41 we are planning to convert the GN (Gene Name)
line, the RC (Reference Comment) line topic STRAIN, and the CC (Comment)
line topics CATALYTIC ACTIVITY and PATHWAY.

Here is an example of what a SWISS-PROT entry will look like in release 41:

ID   GSA_ECOLI      STANDARD;      PRT;   426 AA.
AC   P23893; P78277;
DT   01-NOV-1991 (Rel. 20, Created)
DT   01-NOV-1997 (Rel. 35, Last sequence update)
DT   01-MAR-2002 (Rel. 41, Last annotation update)
DE   Glutamate-1-semialdehyde 2,1-aminomutase (EC 5.4.3.8) (GSA)
DE   (Glutamate-1-semialdehyde aminotransferase) (GSA-AT).
GN   hemL or gsa or popC or B0154.
OS   Escherichia coli.
OC   Bacteria; Proteobacteria; gamma subdivision; Enterobacteriaceae;
OC   Escherichia.
OX   NCBI_TaxID=562;
RN   [1]
RP   SEQUENCE FROM N.A.
RX   MEDLINE=91155920; PubMed=1900346;
RA   Grimm B., Bull A., Breu V.;
RT   "Structural genes of glutamate 1-semialdehyde aminotransferase for
RT   porphyrin synthesis in a cyanobacterium and Escherichia coli.";
RL   Mol. Gen. Genet. 225:1-10(1991).
RN   [2]
RP   SEQUENCE FROM N.A.
RC   STRAIN=K12 / W3110;
RX   MEDLINE=94261430; PubMed=8202364;
RA   Fujita N., Mori H., Yura T., Ishihama A.;
RT   "Systematic sequencing of the Escherichia coli genome: analysis of
RT   the 2.4-4.1 min (110,917-193,643 bp) region.";
RL   Nucleic Acids Res. 22:1637-1639(1994).
RN   [3]
RP   SEQUENCE FROM N.A.
RC   STRAIN=K12 / MG1655;
RX   MEDLINE=97426617; PubMed=9278503;
RA   Blattner F.R., Plunkett G. III, Bloch C.A., Perna N.T., Burland V.,
RA   Riley M., Collado-Vides J., Glasner J.D., Rode C.K., Mayhew G.F.,
RA   Gregor J., Davis N.W., Kirkpatrick H.A., Goeden M.A., Rose D.J.,
RA   Mau B., Shao Y.;
RT   "The complete genome sequence of Escherichia coli K-12.";
RL   Science 277:1453-1474(1997).
RN   [4]
RP   SEQUENCE FROM N.A.
RA   Schramm S., Duncan M., Allen E., Araujo R., Aparicio A., Chung E.,
RA   Davis K., Federspiel N., Hyman R., Kalman S., Komp C., Kurdi O.,
RA   Lashkari D., Lew H., Lin D., Namath A., Oefner P., Roberts D.,
RA   Davis R.W.;
RL   Submitted (SEP-1996) to the EMBL/GenBank/DDBJ databases.
RN   [5]
RP   CHARACTERIZATION.
RX   MEDLINE=91258321; PubMed=2045363;
RA   Ilag L.L., Jahn D., Eggertsson G., Soell D.;
RT   "The Escherichia coli hemL gene encodes glutamate 1-semialdehyde
RT   aminotransferase.";
RL   J. Bacteriol. 173:3408-3413(1991).
RN   [6]
RP   MUTAGENESIS OF LYS-265.
RX   MEDLINE=92353044; PubMed=1643048;
RA   Ilag L.L., Jahn D.;
RT   "Activity and spectroscopic properties of the Escherichia coli
RT   glutamate 1-semialdehyde aminotransferase and the putative active
RT   site mutant K265R.";
RL   Biochemistry 31:7143-7151(1992).
CC   -!- CATALYTIC ACTIVITY: (S)-4-amino-5-oxopentanoate =
CC       5-aminolevulinate.
CC   -!- COFACTOR: PYRIDOXAL PHOSPHATE.
CC   -!- PATHWAY: Porphyrin biosynthesis by the C5 pathway; second step.
CC   -!- SUBUNIT: HOMODIMER.
CC   -!- SUBCELLULAR LOCATION: CYTOPLASMIC (POTENTIAL).
CC   -!- SIMILARITY: BELONGS TO CLASS-III OF PYRIDOXAL-PHOSPHATE-DEPENDENT
CC       AMINOTRANSFERASES.
DR   EMBL; X53696; CAA37734.1; -.
DR   EMBL; D26562; CAB20274.1; -.
DR   EMBL; AE000125; AAC73265.1; -.
DR   EMBL; U70214; AAB08584.1; -.
DR   PIR; S13327; S13327.
DR   PIR; S45223; S45223.
DR   HSSP; P24630; 2GSA.
DR   EcoGene; EG10432; hemL.
DR   InterPro; IPR000954; Aminotran_3.
DR   Pfam; PF00202; aminotran_3; 1.
DR   PROSITE; PS00600; AA_TRANSFER_CLASS_3; 1.
KW   Porphyrin biosynthesis; Isomerase; Pyridoxal phosphate;
KW   Complete proteome.
FT   BINDING     265    265       PYRIDOXAL PHOSPHATE (PROBABLE).
FT   MUTAGEN     265    265       K->R: 2% OF WILD-TYPE ACTIVITY.
FT   CONFLICT      2      2       S -> R (IN REF. 1 AND 2).
FT   CONFLICT      9      9       S -> Q (IN REF. 1 AND 2).
SQ   SEQUENCE   426 AA;  45366 MW;  BED817E100468CF2 CRC64;
     MSKSENLYSA ARELIPGGVN SPVRAFTGVG GTPLFIEKAD GAYLYDVDGK AYIDYVGSWG
     PMVLGHNHPA IRNAVIEAAE RGLSFGAPTE MEVKMAQLVT ELVPTMDMVR MVNSGTEATM
     SAIRLARGFT GRDKIIKFEG CYHGHADCLL VKAGSGALTL GQPNSPGVPA DFAKYTLTCT
     YNDLASVRAA FEQYPQEIAC IIVEPVAGNM NCVPPLPEFL PGLRALCDEF GALLIIDEVM
     TGFRVALAGA QDYYGVVPDL TCLGKIIGGG MPVGAFGGRR DVMDALAPTG PVYQAGTLSG
     NPIAMAAGFA CLNEVAQPGV HETLDELTTR LAEGLLEAAE EAGIPLVVNH VGGMFGIFFT
     DAESVTCYQD VMACDVERFK RFFHMMLDEG VYLAPSAFEA GFMSVAHSME DINNTIDAAR
     RVFAKL
//
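
An entry such as the one above can be split into its line types with a few
lines of code. This is a minimal sketch and not a complete flat-file
parser; the function name split_entry is an assumption, and the sample is
abridged from the entry above, so the sequence it yields is incomplete.

```python
from collections import defaultdict


def split_entry(lines):
    """Group the lines of one entry by line code; return (groups, sequence)."""
    groups = defaultdict(list)
    residues = []
    for line in lines:
        if line.startswith("//"):        # entry terminator
            break
        if line.startswith("     "):     # sequence data: five leading blanks
            residues.append(line.replace(" ", ""))
        elif line.strip():
            groups[line[:2]].append(line[5:])
    return dict(groups), "".join(residues)


# Abridged sample: most lines, and all but the first and last sequence
# lines, are left out.
sample = [
    "ID   GSA_ECOLI      STANDARD;      PRT;   426 AA.",
    "AC   P23893; P78277;",
    "GN   hemL or gsa or popC or B0154.",
    "OX   NCBI_TaxID=562;",
    "DR   EcoGene; EG10432; hemL.",
    "DR   Pfam; PF00202; aminotran_3; 1.",
    "SQ   SEQUENCE   426 AA;  45366 MW;  BED817E100468CF2 CRC64;",
    "     MSKSENLYSA ARELIPGGVN SPVRAFTGVG GTPLFIEKAD GAYLYDVDGK AYIDYVGSWG",
    "     RVFAKL",
    "//",
]
groups, sequence = split_entry(sample)
print(groups["GN"], len(groups["DR"]))
```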


                   4   Status of the documentation files

SWISS-PROT is distributed with a large number of documentation files. Some
of these files have been available for a long time (the user manual,
release notes, the various indices for authors, citations, keywords, etc.),
but many have been created recently and we are continuously adding new
files, and updating and modifying existing files. Please note that the
header in many documentation files has changed. The following table lists all
the documents that are currently available.

See also section 7.3 for information on how to access updated versions of
all documents in-between major releases.

 USERMAN.TXT    User manual
 RELNOTES.TXT   Release notes for the current release (40)
 SHORTDES.TXT   Short description of entries in SWISS-PROT [see 1]

 JOURLIST.TXT   List of cited journals
 KEYWLIST.TXT   List of keywords
 PLASMID.TXT    List of plasmids [see 2]
 SPECLIST.TXT   List of organism (species) identification codes
 TISSLIST.TXT   List of tissues
 EXPERTS.TXT    List of on-line experts for PROSITE and SWISS-PROT
 DBXREF.TXT     List of databases cross-referenced in SWISS-PROT [see 2]
 SUBMIT.TXT     Submission of sequence data to SWISS-PROT

 ACINDEX.TXT    Accession number index
 AUTINDEX.TXT   Authors index
 CITINDEX.TXT   Citation index
 KEYINDEX.TXT   Keywords index
 SPEINDEX.TXT   Species index
 DELETEAC.TXT   Deleted accession number index

 7TMRLIST.TXT   List of 7-transmembrane G-linked receptor entries [see 1]
 AATRNASY.TXT   List of aminoacyl-tRNA synthetases
 ALLERGEN.TXT   Nomenclature and index of allergen sequences
 ANNBIOCH.TXT   SWISS-PROT annotation: how is biochemical information
                assigned to sequence entries
 BLOODGRP.TXT   Blood group antigen proteins
 CALBICAN.TXT   Index of Candida albicans entries and their corresponding
                gene designations
 CDLIST.TXT     CD nomenclature for surface proteins of human leucocytes
 CELEGANS.TXT   Index of Caenorhabditis elegans entries and their
                corresponding gene designations and WormPep
                cross-references
 DICTY.TXT      Index of Dictyostelium discoideum entries and their
                corresponding gene designations and DictyDB
                cross-references
 EC2DTOSP.TXT   Index of Escherichia coli Gene-protein database
                (ECO2DBASE) entries referenced in SWISS-PROT
 ECOLI.TXT      Index of Escherichia coli strain K12 chromosomal entries
                and their corresponding EcoGene cross-references
 EMBLTOSP.TXT   Index of EMBL Nucleotide Sequence Database entries
                referenced in SWISS-PROT
 EXTRADOM.TXT   Nomenclature of extracellular domains
 FLY.TXT        Index of Drosophila entries and their corresponding
                FlyBase cross-references
 GLYCOSID.TXT   Classification of glycosyl hydrolase families and index of
                glycosyl hydrolase entries in SWISS-PROT
 HAEINFLU.TXT   Index of Haemophilus influenzae strain Rd chromosomal
                entries
 HOXLIST.TXT    Vertebrate homeotic Hox proteins: nomenclature and index
 HPYLORI.TXT    Index of Helicobacter pylori strain 26695 chromosomal
                entries
 HUMCHR01.TXT   Index of proteins encoded on human chromosome 1 [see 2]
 HUMCHR02.TXT   Index of proteins encoded on human chromosome 2 [see 2]
 HUMCHR03.TXT   Index of proteins encoded on human chromosome 3 [see 2]
 HUMCHR04.TXT   Index of proteins encoded on human chromosome 4 [see 2]
 HUMCHR05.TXT   Index of proteins encoded on human chromosome 5 [see 2]
 HUMCHR06.TXT   Index of proteins encoded on human chromosome 6 [see 2]
 HUMCHR07.TXT   Index of proteins encoded on human chromosome 7 [see 2]
 HUMCHR08.TXT   Index of proteins encoded on human chromosome 8 [see 2]
 HUMCHR09.TXT   Index of proteins encoded on human chromosome 9 [see 2]
 HUMCHR10.TXT   Index of proteins encoded on human chromosome 10 [see 2]
 HUMCHR11.TXT   Index of proteins encoded on human chromosome 11 [see 2]
 HUMCHR12.TXT   Index of proteins encoded on human chromosome 12 [see 2]
 HUMCHR13.TXT   Index of proteins encoded on human chromosome 13
 HUMCHR14.TXT   Index of proteins encoded on human chromosome 14 [see 2]
 HUMCHR15.TXT   Index of proteins encoded on human chromosome 15 [see 2]
 HUMCHR16.TXT   Index of proteins encoded on human chromosome 16
 HUMCHR17.TXT   Index of proteins encoded on human chromosome 17
 HUMCHR18.TXT   Index of proteins encoded on human chromosome 18
 HUMCHR19.TXT   Index of proteins encoded on human chromosome 19
 HUMCHR20.TXT   Index of proteins encoded on human chromosome 20
 HUMCHR21.TXT   Index of proteins encoded on human chromosome 21
 HUMCHR22.TXT   Index of proteins encoded on human chromosome 22
 HUMCHRX.TXT    Index of proteins encoded on human chromosome X
 HUMCHRY.TXT    Index of proteins encoded on human chromosome Y
 HUMPVAR.TXT    Index of human proteins with sequence variants
 INITFACT.TXT   List and index of translation initiation factors
 INTEIN.TXT     Index of intein-containing entries referenced in
                SWISS-PROT [see 2]
 METALLO.TXT    Classification of metallothioneins and index of the
                entries in SWISS-PROT
 MGDTOSP.TXT    Index of MGD entries referenced in SWISS-PROT
 MGENITAL.TXT   Index of Mycoplasma genitalium strain G-37 chromosomal
                entries
 MIMTOSP.TXT    Index of MIM entries referenced in SWISS-PROT
 MJANNASC.TXT   Index of Methanococcus jannaschii entries
 NGR234.TXT     Table of predicted proteins in Rhizobium plasmid pNGR234a
 NOMLIST.TXT    List of nomenclature related references for proteins
 PCC6803.TXT    Index of Synechocystis strain PCC 6803 entries
 PDBTOSP.TXT    Index of Protein Data Bank (PDB) entries referenced in
                SWISS-PROT
 PEPTIDAS.TXT   Classification of peptidase families and index of
                peptidase entries in SWISS-PROT
 PLASTID.TXT    List of chloroplast and cyanelle encoded proteins
 POMBE.TXT      Index of Schizosaccharomyces pombe entries and their
                corresponding gene designations
 RESTRIC.TXT    List of restriction enzyme and methylase entries
 RIBOSOMP.TXT   Index of ribosomal proteins classified by families on the
                basis of sequence similarities
 RPROWAZE.TXT   Index of Rickettsia prowazekii strain Madrid E entries
                [see 2]
 SALTY.TXT      Index of Salmonella typhimurium strain LT2 chromosomal
                entries and their corresponding StyGene cross-references
 SUBTILIS.TXT   Index of Bacillus subtilis strain 168 chromosomal entries
                and their corresponding SubtiList cross-references
 UPFLIST.TXT    UPF (Uncharacterized Protein Families) list and index of
                members
 YEAST.TXT      Index of Saccharomyces cerevisiae entries in SWISS-PROT
                and their corresponding gene designations
 YEAST1.TXT     Yeast Chromosome I entries
 YEAST2.TXT     Yeast Chromosome II entries
 YEAST3.TXT     Yeast Chromosome III entries
 YEAST5.TXT     Yeast Chromosome V entries
 YEAST6.TXT     Yeast Chromosome VI entries
 YEAST7.TXT     Yeast Chromosome VII entries
 YEAST8.TXT     Yeast Chromosome VIII entries
 YEAST9.TXT     Yeast Chromosome IX entries
 YEAST10.TXT    Yeast Chromosome X entries
 YEAST11.TXT    Yeast Chromosome XI entries
 YEAST13.TXT    Yeast Chromosome XIII entries
 YEAST14.TXT    Yeast Chromosome XIV entries

Notes:

 1   The '7TMRLIST.TXT' and 'SHORTDES.TXT' files have been converted to
     mixed-case characters.
 2   The 'DBXREF.TXT', 'HUMCHR01.TXT', 'HUMCHR02.TXT', 'HUMCHR03.TXT',
     'HUMCHR04.TXT', 'HUMCHR05.TXT', 'HUMCHR06.TXT', 'HUMCHR07.TXT',
     'HUMCHR08.TXT', 'HUMCHR09.TXT', 'HUMCHR10.TXT', 'HUMCHR11.TXT',
     'HUMCHR12.TXT', 'HUMCHR14.TXT', 'HUMCHR15.TXT', 'INTEIN.TXT',
     'PLASMID.TXT', and 'RPROWAZE.TXT' files are new documents introduced
     since release 38.

We have continued to include in some SWISS-PROT documentation files the
references of Web sites relevant to the subject under consideration. There
are now 89 documents that include such links.


                   5   The ExPASy World-Wide Web server

     5.1   Background information

The most efficient and user-friendly way to browse interactively in
SWISS-PROT, PROSITE, ENZYME, SWISS-2DPAGE and other databases is to use the
World-Wide Web (WWW) molecular biology server ExPASy. The ExPASy server was
made available to the public in September 1993 and is reachable at the
following address:

     http://www.expasy.org/

The ExPASy WWW server allows access, using the user-friendly hypertext
model, to the SWISS-PROT/TrEMBL, PROSITE, ENZYME, SWISS-2DPAGE,
SWISS-3DIMAGE and CD40Lbase databases and, through any SWISS-PROT protein
sequence entry, to other databases such as EMBL, Eco2DBASE, EcoCyc,
EcoGene, FlyBase, GCRDb, GlycoSuiteDB, MaizeDB, OMIM, PDB, HSSP, Pfam,
ProDom, REBASE, SGD, SubtiList, TRANSFAC, YPD, ZFIN and Medline. ExPASy
also offers many tools for the analysis of protein sequences and 2D gels.

There are currently five mirror sites of ExPASy, i.e. exact copies of the
server. The ExPASy mirror sites are located in:

 Australia http://au.expasy.org/
           at the Australian Proteome Analysis Facility (APAF), Sydney
 Canada    http://ca.expasy.org/
           at the Canadian Bioinformatics Resource (CBR), Halifax
 China     http://cn.expasy.org/
           at the Center of Bioinformatics, Peking University, Beijing
 Korea     http://kr.expasy.org/
           at the Yonsei Proteome Research Center
 Taiwan    http://tw.expasy.org/
           at the National Health Research Institutes (NHRI), Taipei

Explicit general and continuously updated documentation about the ExPASy
server is available at http://www.expasy.org/doc/expasy.pdf.

     5.2   Swiss-Shop

We provide, on ExPASy, a service called Swiss-Shop
(http://www.expasy.org/swiss-shop/). Swiss-Shop is an automated sequence
alerting system which allows users to obtain, by email, new sequence
entries relevant to their field(s) of interest. Every week, the new
sequences entered in SWISS-PROT are automatically compared with all the
criteria that have been defined by the users. If a sequence corresponds to
the selection criteria defined by a user, that sequence is sent by
electronic mail. Various criteria can be combined:

   * By entering one or more words that should be present in the
     description line;
   * By entering one or more species name(s) or taxonomic division(s);
   * By entering one or more keywords;
   * By entering one or more author names;
   * By entering the accession number (or entry name) of a PROSITE pattern
     or a user-defined sequence pattern. In this case, all new SWISS-PROT
     entries matching this pattern will be reported;
   * By entering the accession number (or entry name) of an existing
     SWISS-PROT entry or by entering a 'private' sequence. In this case,
     all new SWISS-PROT entries similar to that sequence will be reported.

     5.3   What is new on ExPASy

ExPASy is constantly modified and improved. If you wish to be informed of
the changes made to the server you can either:

   * Read the document 'History of changes, improvements and new features'
     which is available at the address: http://www.expasy.org/history.html
   * Subscribe to Swiss-Flash, a service that reports news of databases,
     software and service developments. By subscribing to this service, you
     will automatically get Swiss-Flash bulletins by electronic mail. To
     subscribe, use the address: http://www.expasy.org/swiss-flash/

Among all the improvements and the new features introduced since the last
SWISS-PROT release, here are those that we believe are specifically useful
to SWISS-PROT users:

     1. A new and improved version of the NiceProt view of SWISS-PROT is
     available and offers the following new features: a link to a
     printer-friendly view of a SWISS-PROT entry, display of the length of
     certain features in the FT lines, and access to a new tool, the
     'Feature aligner', which allows users to select features for
     submission to the ClustalW multiple alignment program.

     2. SWISS-PROT release statistics are now available for every update of
     the database (http://www.expasy.org/sprot/relnotes/relstat.html).
     Among other parameters, statistics about database growth, average
     sequence lengths and amino acid composition, taxonomic origin, journal
     citations and database cross-references are presented, including some
     graphics.

     3. A new view is available within the SRS Sequence Retrieval System.
     It displays, for each protein corresponding to a user query, gene
     name(s) and organism (in addition to the parameters ID, AC,
     description and sequence length which are displayed by the default
     view "Short description"). This new view is entitled "Long
     description" and is available from the menu "Use view ..." in the SRS
     query form.

     4. The SIB Blast interface (accessible also via "Quick BLAST" or from
     the bottom of every SWISS-PROT/TrEMBL entry) now offers the
     possibility to restrict the similarity search by using taxonomic
     criteria. A "Taxonomic View" of the results can also be obtained via
     the BLAST result page. The user can also select a number of matching
     sequences and directly submit them to a ClustalW search, or retrieve
     and download the corresponding SWISS-PROT/TrEMBL entries. An
     alternative view of the results, NiceBlast, is available, which
     consists of an HTML table detailing complete descriptions of all
     matching proteins, including the full protein name, gene name,
     sequence length and organism.

     5. Explicit cross-references have been implemented between SWISS-PROT
     and BLOCKS, GlycoSuiteDB, InterPro, Leproma, MEROPS, MypuList, SMART,
     TubercuList, ANU-2DPAGE, PHCI-2DPAGE, PMMA-2DPAGE, COMPLUYEAST-2DPAGE,
     and Siena-2DPAGE. Implicit links have been added to the resources DIP,
     GeneCensus, GeneLynx, HUGE and NucleaRDB.

     6. A new tool has been added to the ExPASy suite of proteomics tools:
     FindPept (http://www.expasy.org/tools/findpept.html) can identify
     peptides that result from unspecific cleavage of proteins from their
     experimental masses, taking into account artefactual chemical
     modifications, post-translational modifications (PTM) and protease
     autolytic cleavage. This new tool has been closely integrated with the
     other proteomics tools on ExPASy, such as PeptIdent and FindMod.

     7. The Sulfinator (http://www.expasy.org/tools/sulfinator/) is a newly
     developed tool to predict tyrosine sulfation sites for a protein
     sequence, using four different Hidden Markov Models (HMM).

     8. Sequences of alternatively spliced isoforms of the same protein are
     documented in the feature table of that protein sequence record. In
     collaboration with the SWISS-PROT group at EBI, a program varsplic.pl
     has been written to generate additional records from SWISS-PROT and
     TrEMBL, one for each splice isoform of each protein. The resulting
     data sets for SWISS-PROT and TrEMBL are available on the ExPASy ftp
     server (ftp://ftp.expasy.org/databases/sp_tr_nrdb/), along with a more
     detailed description of the project and information on how to obtain a
     local copy of the varsplic.pl program.

     The additional isoform entries have been added to the
     SWISS-PROT/TrEMBL databases underlying the BLAST server at SIB
     Switzerland, ScanProsite, and PeptIdent. Gradually, all other tools on
     ExPASy will be modified to handle splice isoforms. The NiceProt view
     of SWISS-PROT/TrEMBL provides links from the isoform name in the
     feature table (example: Q01432) to a page displaying the sequence of
     the corresponding isoform.

     9. In the framework of the HAMAP project (see section 2.3), several
     new features and tools have been implemented on ExPASy:
        o The keyword "Complete Proteome" has been introduced to all
          SWISS-PROT/TrEMBL entries describing a protein which is thought
          to be expressed by an organism whose genome has been completely
          sequenced. This keyword is so far only used for microbial
          (bacterial and archaeal) proteins. A complete set of proteins
          from a microbial genome can therefore be obtained using this
          keyword across SWISS-PROT and TrEMBL.
        o We provide clean non-redundant SWISS-PROT/TrEMBL data sets for
          all completely sequenced microbial genomes. These files are
          available on the ExPASy ftp server in SWISS-PROT and Fasta format
          (ftp://ftp.expasy.org/databases/complete_proteomes/), and can
          also be used for similarity searches on the SIB Blast server
          ("microbial proteomes").
        o A Genomic Proximity Viewer is available for those microbial
          genomes where an ORF numbering system exists. For those
          organisms, it is possible to click on the ORF name in the
          SWISS-PROT/TrEMBL GN lines to obtain a list of proteins encoded
          by genes in proximity. The tool is also accessible from the HAMAP
          complete proteome pages of those organisms. Example: Borrelia
          burgdorferi,
          http://www.expasy.org/cgi-bin/genomeview.pl?bn=BORBU.

     10. A year ago we launched Protein Spotlight
     (http://www.expasy.org/spotlight/), a periodical review centered on a
     specific protein or group of proteins. It is published on a monthly
     basis. You can subscribe to receive each issue, free of charge, in
     HTML or PDF format.


                   6   TrEMBL - a supplement to SWISS-PROT

The ongoing genome sequencing and mapping projects have dramatically
increased the number of protein sequences to be incorporated into
SWISS-PROT. Since we do not want to dilute the quality standards of
SWISS-PROT by incorporating sequences into the database without proper
sequence analysis and annotation, we cannot speed up the incorporation of
new incoming data indefinitely. But as we also want to make the sequences
available as fast as possible, we have introduced with SWISS-PROT a
computer-annotated supplement. This supplement consists of entries in
SWISS-PROT-like format derived from the translation of all coding sequences
(CDS) in the EMBL nucleotide sequence database, except those already
included in SWISS-PROT.

This supplement is named TrEMBL (Translation from EMBL). It can be
considered as a preliminary section of SWISS-PROT. This SWISS-PROT release
is supplemented by TrEMBL release 18.

TrEMBL is available by FTP from the EBI and ExPASy servers in the directory
'databases/trembl'. It can be queried on WWW by the EBI and ExPASy SRS
servers. It is distributed with its own set of release notes.


                  7   FTP access to SWISS-PROT and TrEMBL

     7.1   Generalities

SWISS-PROT is available for download on the following anonymous FTP
servers:

 Organization Swiss Institute of Bioinformatics (SIB)
 Address      ftp.expasy.org, au.expasy.org/ftp/,
              ca.expasy.org/ftp/, cn.expasy.org/ftp/,
              kr.expasy.org/ftp/, tw.expasy.org/ftp/
 Directory    /databases/swiss-prot/


 Organization European Bioinformatics Institute (EBI)
 Address      ftp.ebi.ac.uk
 Directory    /pub/databases/swissprot/
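
As a minimal sketch of anonymous FTP retrieval, the Python fragment below
uses the standard ftplib module with the SIB host and directory listed
above. The names ftp_url and fetch are illustrative assumptions, not part
of any SWISS-PROT software, and the file to retrieve has to be chosen from
the server's own directory listing.

```python
from ftplib import FTP

# Host and directory as listed in section 7.1 (SIB server).
SIB_HOST = "ftp.expasy.org"
SIB_DIR = "/databases/swiss-prot/"


def ftp_url(host: str, directory: str, filename: str) -> str:
    """Build an ftp:// URL from host, directory and file name (illustrative helper)."""
    return "ftp://" + host + "/" + directory.strip("/") + "/" + filename


def fetch(host: str, directory: str, filename: str, dest: str) -> None:
    """Download one file by anonymous FTP and write it to `dest`."""
    with FTP(host) as ftp:
        ftp.login()                      # no arguments: anonymous login
        ftp.cwd(directory)
        with open(dest, "wb") as out:
            ftp.retrbinary("RETR " + filename, out.write)
```

A call such as fetch(SIB_HOST, SIB_DIR, name, name), where name is a file
present in that directory, would store a local copy under the same name.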


     7.2   Non-redundant database

On the ExPASy and EBI FTP servers we distribute files that make up a
non-redundant (see below) and complete protein sequence database
consisting of three components:

1) SWISS-PROT
2) TrEMBL
3) New entries to be later integrated into TrEMBL (hereafter known as
TrEMBL_New)

Every week three files are completely rebuilt. These files are named
sprot.dat.gz, trembl.dat.gz and trembl_new.dat.gz. As indicated by their
'.gz' extension, these are gzip-compressed files which, when decompressed,
will produce ASCII files in SWISS-PROT format.
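
A minimal Python sketch of reading such a file without first decompressing
it on disk is given here. It assumes that the file has been downloaded
locally and that every entry ends with a line containing only '//'; the
function name iter_entries is illustrative.

```python
import gzip
from typing import Iterator, List


def iter_entries(path: str) -> Iterator[List[str]]:
    """Yield each entry of a gzip-compressed flat file as a list of lines.

    Assumes every entry is terminated by a line containing only '//'.
    """
    entry: List[str] = []
    with gzip.open(path, "rt", encoding="ascii", errors="replace") as handle:
        for line in handle:
            line = line.rstrip("\r\n")
            if line == "//":
                yield entry          # entry complete, terminator not included
                entry = []
            else:
                entry.append(line)
```

For example, sum(1 for _ in iter_entries("sprot.dat.gz")) would count the
entries in a local copy of the SWISS-PROT file.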

Three other files are also available (sprot.fas.gz, trembl.fas.gz and
trembl_new.fas.gz) which are compressed 'fasta' format sequence files
useful for building the databases used by FASTA, BLAST and other sequence
similarity search programs. Please do not use these files for any other
purpose, as you will lose all annotations by using this very 'primitive'
format.

The files for the non-redundant database are stored in the directory
'/databases/sp_tr_nrdb' on the ExPASy FTP server (ftp.expasy.org) and in
the directory '/pub/databases/sp_tr_nrdb' on the EBI FTP server
(ftp.ebi.ac.uk).

Additional notes:

   * The SWISS-PROT file continuously grows as new annotated sequences are
     added.

   * The TrEMBL file decreases in size as sequences are moved out of that
     section after being annotated and moved into SWISS-PROT. Four times a
     year a new release of TrEMBL is built at EBI; at this point the TrEMBL
     file increases in size as it then includes all of the new data (see
     next section) that has accumulated since the last release.

   * The TrEMBL_New file starts as a very small file and grows in size
     until a new release of TrEMBL is available.

   * SWISS-PROT and TrEMBL share the same system of accession numbers.
     Therefore you will not find any primary accession number duplicated
     between the two sections. A TrEMBL entry (and its associated accession
     number(s)) can either move to SWISS-PROT as a new entry or be merged
     with an existing SWISS-PROT entry. In the latter case, the accession
     number(s) of that TrEMBL entry are added to those of the SWISS-PROT
     entry.

   * TrEMBL_New does not have real accession numbers. However it was
     necessary to have an 'AC' line so as to be able to use it with
     different software products. This AC line contains a temporary
     identifier which consists of the protein_ID (protein sequence
     identifier) of the coding sequence in the parent nucleotide sequence.

   * TrEMBL_New is quite messy! You will of course find new sequence
     entries but you will also encounter sequences that are going to be
     used to update existing TrEMBL or SWISS-PROT entries. None of the
     "cleaning" steps that are applied to produce a TrEMBL release are run
     on TrEMBL_New nor are any of the computer-annotation software tools
     that are used to enhance the information content of TrEMBL. TrEMBL_New
     is provided only so that users can be sure not to miss any important
     new sequences when they run similarity searches.

   * While these three files allow you to build what we call a
     'non-redundant' database, it must be noted that this is not completely
     accurate. Without going into a long explanation, we can say that this
     is currently the best attempt at providing a complete selection of
     protein sequence entries while trying to eliminate redundancies.
     Although SWISS-PROT is completely (well, 99.994%!) non-redundant,
     TrEMBL is far from being non-redundant and the combination of
     SWISS-PROT + TrEMBL is even less so.

   * To describe to your users the version of the non-redundant database
     that you are providing them with, you should use a statement of the
     form:

          SWISS-PROT release 40.0 of 17-Oct-2001;
          TrEMBL release 18.0 of 22-Oct-2001;
          TrEMBL_New of 22-Oct-2001.

     7.3   Weekly updates of SWISS-PROT documents

Until now, the ExPASy FTP server only provided the SWISS-PROT documents and
indexes as they stood at the time of the last full release. All documents
are now updated with every weekly release of SWISS-PROT. They are available
for FTP download from the directory /databases/swiss-prot/updated_doc/.

     7.4   Weekly updates of SWISS-PROT

Weekly updates of SWISS-PROT are available by anonymous FTP. Three files
are generated at each update:

 new_seq.dat Contains all the new entries since the last full
             release;

 upd_seq.dat Contains the entries for which the sequence data has
             been updated since the last release;

 upd_ann.dat Contains the entries for which one or more annotation
             fields have been updated since the last release.
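
One way to bring a local copy up to date with these three files can be
sketched in Python as follows. This is an assumption about how a user
might apply them, not a description of an official procedure: entries are
keyed by the first accession number on their AC line, entries from
new_seq.dat are added, and entries from upd_seq.dat and upd_ann.dat replace
the local versions. All function names are illustrative.

```python
from typing import Dict, Iterable, List

Entry = List[str]  # one entry = its lines, without the terminating '//'


def split_entries(text: str) -> List[Entry]:
    """Split flat-file text into entries, assuming '//' terminates each one."""
    entries: List[Entry] = []
    current: Entry = []
    for line in text.splitlines():
        if line == "//":
            entries.append(current)
            current = []
        else:
            current.append(line)
    return entries


def primary_accession(entry: Entry) -> str:
    """Return the first accession number on the first AC line."""
    for line in entry:
        if line.startswith("AC   "):
            return line[5:].split(";")[0].strip()
    raise ValueError("entry has no AC line")


def apply_updates(local: Dict[str, Entry],
                  *update_sets: Iterable[Entry]) -> Dict[str, Entry]:
    """Return a copy of `local` in which each update entry adds or replaces
    the entry with the same primary accession number."""
    merged = dict(local)
    for entries in update_sets:
        for entry in entries:
            merged[primary_accession(entry)] = entry
    return merged
```

The sketch does not handle entries that have been deleted or merged since
the last release; those cases would need the accession number indices
distributed with the database.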

Important notes

   * Although we try to follow a regular schedule, we do not promise to
     update these files every week. In most cases two weeks may elapse
     between two updates.
   * Instead of using the above files, you can, every week, download an
     updated copy of the SWISS-PROT database. This file is available in the
     directory containing the non-redundant database (see section 7.2).


                          8   ENZYME and PROSITE

     8.1   The ENZYME nomenclature database

Release 27.0 of the ENZYME nomenclature database is distributed with
release 40 of SWISS-PROT. ENZYME release 27.0 contains information on
3'870 enzymes. In this release, we have added a significant number of
new entries and also updated many existing entries.

     8.2   The PROSITE database

Release 17.0 of the PROSITE database will be available in a few weeks.
PROSITE will now come with its own set of release notes.

                          9   We need your help!

We welcome feedback from our users. We would especially appreciate it if
you notified us of sequences belonging to your field of expertise that
are missing from the database. We also would like to be notified about
annotations to be updated, if, for example, the function of a protein has
been clarified or if new information about post-translational modifications
has become available. To facilitate this feedback we offer, on the ExPASy
WWW server, a form that allows the submission of updates and/or corrections
to SWISS-PROT:

     http://www.expasy.org/sprot/sp_update_form.html

It is also possible, from any entry in SWISS-PROT displayed by the ExPASy
server, to submit updates and/or corrections for that particular entry.
Finally, you can also send your comments by electronic mail to the address:

     swiss-prot@expasy.org

Note that all update requests are assigned a unique identifier of the
form UR-Xnnnn (example: UR-A0123). This identifier is used internally by
the SWISS-PROT staff at SIB and EBI to track the fate of requests
and is also used in email exchanges with the persons who submitted
a request.
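
For readers who want to recognise such identifiers in their own mail
archives, a small Python sketch follows. It assumes, from the single
example given above, that 'X' stands for one uppercase letter and 'nnnn'
for exactly four digits; the function name is illustrative.

```python
import re

# Assumed reading of 'UR-Xnnnn': one uppercase letter followed by four digits.
UPDATE_REQUEST_ID = re.compile(r"UR-[A-Z][0-9]{4}")


def is_update_request_id(text: str) -> bool:
    """Return True if `text` is exactly of the assumed form UR-Xnnnn."""
    return UPDATE_REQUEST_ID.fullmatch(text) is not None
```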


                       APPENDIX A:   Some statistics

     A.1   Amino acid composition

     A.1.1   Composition in percent for the complete database

   Ala (A) 7.61   Gln (Q) 3.93   Leu (L) 9.53   Ser (S) 7.08
   Arg (R) 5.19   Glu (E) 6.47   Lys (K) 5.97   Thr (T) 5.58
   Asn (N) 4.36   Gly (G) 6.85   Met (M) 2.37   Trp (W) 1.21
   Asp (D) 5.25   His (H) 2.24   Phe (F) 4.10   Tyr (Y) 3.16
   Cys (C) 1.63   Ile (I) 5.85   Pro (P) 4.89   Val (V) 6.61

   Asx (B) 0.000  Glx (Z) 0.000  Xaa (X) 0.01

     A.1.2   Classification of the amino acids by their frequency

   Leu, Ala, Ser, Gly, Val, Glu, Lys, Ile, Thr, Asp, Arg, Pro, Asn, Phe,
   Gln, Tyr, Met, His, Cys, Trp
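
The ranking above follows directly from the percentages in A.1.1. A short
Python sketch that reproduces it, with the values copied from that table:

```python
# Percentages copied from table A.1.1.
COMPOSITION = {
    "Ala": 7.61, "Gln": 3.93, "Leu": 9.53, "Ser": 7.08,
    "Arg": 5.19, "Glu": 6.47, "Lys": 5.97, "Thr": 5.58,
    "Asn": 4.36, "Gly": 6.85, "Met": 2.37, "Trp": 1.21,
    "Asp": 5.25, "His": 2.24, "Phe": 4.10, "Tyr": 3.16,
    "Cys": 1.63, "Ile": 5.85, "Pro": 4.89, "Val": 6.61,
}

# Sort residue names by decreasing percentage.
ranking = sorted(COMPOSITION, key=COMPOSITION.get, reverse=True)
print(", ".join(ranking))
```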

     A.2   Taxonomic origin

Total number of species represented in this release of SWISS-PROT: 7'188
The first twenty species represent 45'181 sequences: 44.5 % of the total
number of entries.

     A.2.1   Table of the frequency of occurrence of species

        Species represented 1x: 3396
                            2x: 1086
                            3x:  589
                            4x:  366
                            5x:  267
                            6x:  251
                            7x:  169
                            8x:  137
                            9x:  125
                           10x:   61
                       11- 20x:  308
                       21- 50x:  231
                       51-100x:   78
                         >100x:  124

     A.2.2   Table of the most represented species

  ------  ---------  --------------------------------------------
  Number  Frequency  Species
  ------  ---------  --------------------------------------------
       1       7471  Homo sapiens (Human)
       2       4859  Saccharomyces cerevisiae (Baker's yeast)
       3       4816  Mus musculus (Mouse)
       4       4741  Escherichia coli
       5       3091  Rattus norvegicus (Rat)
       6       2260  Bacillus subtilis
       7       2184  Caenorhabditis elegans
       8       1782  Schizosaccharomyces pombe (Fission yeast)
       9       1769  Haemophilus influenzae
      10       1514  Drosophila melanogaster (Fruit fly)
      11       1472  Methanococcus jannaschii
      12       1409  Arabidopsis thaliana (Mouse-ear cress)
      13       1321  Mycobacterium tuberculosis
      14       1295  Bos taurus (Bovine)
      15       1004  Gallus gallus (Chicken)
      16        883  Synechocystis sp. (strain PCC 6803)
      17        872  Escherichia coli O157:H7
      18        846  Salmonella typhimurium
      19        798  Archaeoglobus fulgidus
      20        794  Xenopus laevis (African clawed frog)
      21        765  Sus scrofa (Pig)
      22        680  Aquifex aeolicus
      23        671  Oryctolagus cuniculus (Rabbit)
      24        662  Mycoplasma pneumoniae
      25        594  Pseudomonas aeruginosa
      26        588  Treponema pallidum
      27        557  Buchnera aphidicola (subsp. Acyrthosiphon pisum)
      28        523  Rickettsia prowazekii
      29        522  Helicobacter pylori (Campylobacter pylori)
      30        505  Helicobacter pylori J99 (Campylobacter pylori J99)
      31        503  Mycobacterium leprae
      32        486  Mycoplasma genitalium
      33        481  Zea mays (Maize)
      34        450  Methanobacterium thermoautotrophicum
      35        403  Rhizobium sp. (strain NGR234)
      36        395  Borrelia burgdorferi (Lyme disease spirochete)
      37        390  Oryza sativa (Rice)
      38        387  Chlamydia trachomatis
      39        375  Thermotoga maritima
      40        374  Streptomyces coelicolor
      41        371  Chlamydia pneumoniae (Chlamydophila pneumoniae)
      42        368  Canis familiaris (Dog)
      43        364  Chlamydia muridarum
      44        356  Rhizobium meliloti (Sinorhizobium meliloti)
      45        353  Vibrio cholerae
      46        333  Nicotiana tabacum (Common tobacco)
      47        323  Pasteurella multocida
      48        322  Ovis aries (Sheep)
      49        320  Pyrococcus horikoshii
      50        311  Dictyostelium discoideum (Slime mold)
      51        301  Lactococcus lactis (subsp. lactis) (Streptococcus lactis)
      52        284  Pyrococcus abyssi
      53        276  Pisum sativum (Garden pea)
      54        272  Bacteriophage T4
      55        260  Staphylococcus aureus
      56        256  Candida albicans (Yeast)
      57        255  Neurospora crassa
      58        254  Vaccinia virus (strain Copenhagen)
      59        247  Triticum aestivum (Wheat)
      60        247  Bacillus halodurans
      61        244  Glycine max (Soybean)
      62        243  Hordeum vulgare (Barley)
      63        242  Aeropyrum pernix
      64        241  Rhodobacter capsulatus (Rhodopseudomonas capsulata)
      65        231  Pseudomonas putida
      66        227  Lycopersicon esculentum (Tomato)
      67        221  Cavia porcellus (Guinea pig)
      68        220  Porphyra purpurea
      69        219  Solanum tuberosum (Potato)
      70        214  Spinacia oleracea (Spinach)
      71        214  Klebsiella pneumoniae
      72        213  Bacillus stearothermophilus
      73        210  Neisseria meningitidis (serogroup B)
      74        204  Neisseria meningitidis (serogroup A)
      75        193  Human cytomegalovirus (strain AD169)
      76        188  Campylobacter jejuni
      77        187  Vaccinia virus (strain WR)
      78        183  Deinococcus radiodurans
      79        180  Agrobacterium tumefaciens
      80        179  Sulfolobus solfataricus
      81        179  Brachydanio rerio (Zebrafish) (Zebra danio)
      82        173  Equus caballus (Horse)
      83        171  Mesocricetus auratus (Golden hamster)
      84        171  Chlamydomonas reinhardtii
      85        170  Thermoplasma acidophilum
      86        168  Emericella nidulans (Aspergillus nidulans)
      87        158  Halobacterium sp. (strain NRC-1)
      88        154  Autographa californica nuclear polyhedrosis virus (AcMNPV)
      89        153  Cyanidium caldarium
      90        152  Thermus aquaticus (subsp. thermophilus)
      91        151  Marchantia polymorpha (Liverwort)
      92        151  Cyanophora paradoxa
      93        149  Xylella fastidiosa
      94        148  Fowlpox virus (FPV)
      95        148  Guillardia theta (Cryptomonas phi)
      96        147  Synechococcus sp. (strain PCC 7942) (Anacystis nidulans R2)
      97        147  Variola virus
      98        143  Caulobacter crescentus
      99        142  Ureaplasma parvum (Ureaplasma urealyticum biotype 1)
     100        142  Kluyveromyces lactis (Yeast)

     A.2.3   Taxonomic distribution of the sequences

   Kingdom       Sequences (% of the database)
   Archaea            5032 (  5%)
   Bacteria          34782 ( 34%)
   Eukaryota         53357 ( 53%)
   Viruses            8431 (  8%)
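
The figures of section A.2 can be cross-checked with a few lines of
Python. The sketch assumes that the four kingdom counts add up to the total
number of entries in the database; all numbers are copied from sections
A.2, A.2.1 and A.2.3.

```python
# Figures copied from sections A.2, A.2.1 and A.2.3.
SPECIES_TOTAL = 7188            # species represented in this release
SPECIES_PER_BIN = [3396, 1086, 589, 366, 267, 251, 169, 137, 125, 61,
                   308, 231, 78, 124]   # 1x ... 10x, 11-20x, 21-50x, 51-100x, >100x
TOP_TWENTY_SEQUENCES = 45181    # sequences from the twenty most represented species
KINGDOMS = {"Archaea": 5032, "Bacteria": 34782, "Eukaryota": 53357, "Viruses": 8431}

# The histogram of A.2.1 should account for every species.
species_in_bins = sum(SPECIES_PER_BIN)

# Assumption: the four kingdom counts add up to the total number of entries.
total_sequences = sum(KINGDOMS.values())

# Percentages rounded as in the tables.
kingdom_percent = {name: round(100 * count / total_sequences)
                   for name, count in KINGDOMS.items()}
top_twenty_percent = round(100 * TOP_TWENTY_SEQUENCES / total_sequences, 1)

print(species_in_bins == SPECIES_TOTAL)
print(kingdom_percent)
print(top_twenty_percent)
```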

     A.3   Sequence size

     A.3.1   Distribution of the sequences by size (excluding fragments)

               From   To  Number             From   To   Number
                  1-  50    1950             1001-1100      915
                 51- 100    7099             1101-1200      708
                101- 150   10484             1201-1300      471
                151- 200    9010             1301-1400      318
                201- 250    8978             1401-1500      268
                251- 300    8130             1501-1600      172
                301- 350    7894             1601-1700      150
                351- 400    7945             1701-1800      105
                401- 450    5869             1801-1900      116
                451- 500    5485             1901-2000       87
                501- 550    4190             2001-2100       47
                551- 600    2852             2101-2200       87
                601- 650    2249             2201-2300       89
                651- 700    1651             2301-2400       50
                701- 750    1457             2401-2500       48
                751- 800    1240             >2500          273
                801- 850     985
                851- 900     965
                901- 950     700
                951-1000     593

     A.3.2   Longest and shortest sequences

   The shortest sequence is  GRWM_HUMAN (P24272) :     3 amino acids.
   The longest sequence is   NEBU_HUMAN (P20929) :  6669 amino acids.

     A.4   Journal citations

Note: the following citation statistics reflect the number of distinct
journal citations.

Total number of journals cited in this release of SWISS-PROT: 1'190

     A.4.1   Table of the frequency of journal citations

        Journals cited 1x:  443
                       2x:  157
                       3x:   87
                       4x:   58
                       5x:   51
                       6x:   27
                       7x:   24
                       8x:   19
                       9x:   21
                      10x:   11
                  11- 20x:   83
                  21- 50x:   88
                  51-100x:   31
                    >100x:   90
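
As a consistency check, the frequency classes above should add up to the total number of journals cited (1'190). The sketch below assumes every journal falls into exactly one class; the dictionary name `journal_frequency` is illustrative.

```python
# Number of journals per citation-frequency class, copied from the table above.
journal_frequency = {
    "1x": 443, "2x": 157, "3x": 87, "4x": 58, "5x": 51,
    "6x": 27, "7x": 24, "8x": 19, "9x": 21, "10x": 11,
    "11-20x": 83, "21-50x": 88, "51-100x": 31, ">100x": 90,
}

# Assumption: the classes are mutually exclusive and cover all cited journals.
total_journals = sum(journal_frequency.values())

# Fraction of journals that are cited exactly once.
cited_once_share = journal_frequency["1x"] / total_journals

print(f"Total journals: {total_journals}")
print(f"Cited once:     {cited_once_share:.1%}")
```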

     A.4.2   List of the most cited journals in SWISS-PROT

   Nb    Citations   Journal name
   --    ---------   -------------------------------------------------------------
    1         8033   Journal of Biological Chemistry
    2         4615   Proceedings of the National Academy of Sciences of the U.S.A.
    3         3554   Nucleic Acids Research
    4         3295   Journal of Bacteriology
    5         3144   Gene
    6         2492   FEBS Letters
    7         2293   Biochemical and Biophysical Research Communications
    8         2255   European Journal of Biochemistry
    9         2144   Biochemistry
   10         1998   The EMBO Journal
   11         1894   Nature
   12         1833   Biochimica et Biophysica Acta
   13         1682   Journal of Molecular Biology
   14         1503   Genomics
   15         1477   Cell
   16         1434   Molecular and Cellular Biology
   17         1096   Biochemical Journal
   18         1085   Molecular and General Genetics
   19         1078   Plant Molecular Biology
   20         1024   Science
   21          982   Molecular Microbiology
   22          814   Virology
   23          808   Journal of Biochemistry
   24          637   Human Molecular Genetics
   25          592   Journal of Cell Biology
   26          573   Journal of Virology
   27          525   Human Mutation
   28          520   Plant Physiology
   29          518   Genes and Development
   30          510   Yeast
   31          505   Nature Genetics
   32          494   Oncogene
   33          486   Journal of General Virology
   34          477   Infection and Immunity
   35          461   Journal of Immunology
   36          441   The American Journal of Human Genetics
   37          424   Structure
   38          420   Archives of Biochemistry and Biophysics
   39          391   FEMS Microbiology Letters
   40          366   Microbiology
   41          358   Current Genetics
   42          346   Development
   43          333   Nature Structural Biology
   44          331   Molecular and Biochemical Parasitology
   45          320   Human Genetics
   46          293   Genetics
   47          280   Molecular Endocrinology
   48          277   Journal of Clinical Investigation
   49          270   Biological Chemistry Hoppe-Seyler
   50          267   Applied and Environmental Microbiology
   51          265   Blood
   52          263   Journal of Molecular Evolution
   53          253   Protein Science
   54          249   DNA and Cell Biology
   55          243   Developmental Biology
   56          229   Journal of General Microbiology
   57          224   Journal of Experimental Medicine
   58          213   Neuron
   59          213   Hoppe-Seyler's Zeitschrift fur Physiologische Chemie
   60          211   Cancer Research
   61          210   Immunogenetics
   62          208   Mammalian Genome
   63          197   Endocrinology
   64          182   Mechanisms of Development
   65          180   DNA Sequence
   66          170   Acta Crystallographica, Section D
   67          164   The Plant Cell
   68          161   Brain Research. Molecular Brain Research
   69          159   Journal of Neurochemistry
   70          158   Molecular Biology and Evolution
   71          156   DNA
   72          155   Molecular Biology of the Cell
   73          147   The Plant Journal
   74          146   Journal of Cell Science
   75          145   Journal of Neuroscience
   76          135   Comparative Biochemistry and Physiology
   77          133   Bioscience, Biotechnology, and Biochemistry
   78          130   Antimicrobial Agents and Chemotherapy
   79          125   Biochimie
   80          123   Virus Research
   81          122   Bioorganicheskaia Khimiia
   82          120   Molecular Pharmacology
   83          117   Hemoglobin
   84          116   The Journal of Clinical Endocrinology and Metabolism
   85          113   Agricultural and Biological Chemistry
   86          112   Cytogenetics and Cell Genetics
   87          112   American Journal of Physiology
   88          110   Molecular Plant-Microbe Interactions
   89          105   Proteins
   90          102   Peptides
   91          100   DNA Research

     A.5   Statistics for some line types

The following table summarizes the total number of some SWISS-PROT lines,
as well as the number of entries with at least one such line, and the
frequency of the lines.

                                   Total    Number of  Average
Line type / subtype                number   entries    per entry
---------------------------------  -------- ---------  ---------

References (RL)                     182326              1.79
   Journal                          152419     89829    1.50
   Submitted to EMBL/GenBank/DDBJ    27607     24142    0.27
   Unpublished observations            500       496   <0.01
   Book citation                       438       428   <0.01
   Submitted to SWISS-PROT             437       435   <0.01
   Plant Gene Register                 385       378   <0.01
   Submitted to other databases        185       183   <0.01
   Thesis                              160       159   <0.01
   Unpublished results                 114       112   <0.01
   Patent                               79        77   <0.01
   Worm Breeder's Gazette                2         2   <0.01

Comments (CC)                       309232              3.04
   SIMILARITY                        91246     81758    0.90
   FUNCTION                          61984     61049    0.61
   SUBCELLULAR LOCATION              42010     42010    0.41
   CATALYTIC ACTIVITY                27896     26508    0.27
   SUBUNIT                           25865     25864    0.25
   PATHWAY                           11464     11431    0.11
   TISSUE SPECIFICITY                10070     10070    0.10
   COFACTOR                           7811      7811    0.08
   MISCELLANEOUS                      6942      6352    0.07
   PTM                                5829      5447    0.06
   INDUCTION                          2971      2971    0.03
   DEVELOPMENTAL STAGE                2811      2811    0.03
   ALTERNATIVE PRODUCTS               2755      2754    0.03
   DOMAIN                             2658      2471    0.03
   CAUTION                            2169      2099    0.02
   DISEASE                            1865      1620    0.02
   ENZYME REGULATION                  1473      1473    0.01
   MASS SPECTROMETRY                   548       506    0.01
   DATABASE                            503       465   <0.01
   POLYMORPHISM                        295       287   <0.01
   PHARMACEUTICAL                       38        38   <0.01
   BIOTECHNOLOGY                        29        29   <0.01

Features (FT)                       471213              4.64
   DOMAIN                            76115     22381    0.75
   TRANSMEM                          64913     14473    0.64
   CARBOHYD                          40298      9840    0.40
   CONFLICT                          36638     12924    0.36
   DISULFID                          34856      9355    0.34
   METAL                             27931      6801    0.27
   CHAIN                             20956     16975    0.21
   VARIANT                           18980      3544    0.19
   ACT_SITE                          18495     11839    0.18
   REPEAT                            17543      3013    0.17
   SIGNAL                            12976     12975    0.13
   NP_BIND                           12514      8916    0.12
   MOD_RES                           11665      6503    0.11
   NON_TER                           10234      7849    0.10
   BINDING                            7710      6160    0.08
   TURN                               7330       633    0.07
   STRAND                             7077       562    0.07
   ZN_FING                            5911      2061    0.06
   INIT_MET                           4892      4868    0.05
   HELIX                              4644       587    0.05
   VARSPLIC                           4211      2068    0.04
   SITE                               4151      3019    0.04
   PROPEP                             3842      3488    0.04
   DNA_BIND                           3796      3589    0.04
   MUTAGEN                            2797       963    0.03
   LIPID                              2684      2174    0.03
   TRANSIT                            2300      2284    0.02
   PEPTIDE                            2202       830    0.02
   CA_BIND                            2106       840    0.02
   NON_CONS                            732       387    0.01
   UNSURE                              255       117   <0.01
   SIMILAR                             242       203   <0.01
   SE_CYS                              104        64   <0.01
   THIOETH                              90        31   <0.01
   THIOLEST                             23        23   <0.01

Cross-references (DR)               718458              7.07
   EMBL                             179318     95610    1.76
   InterPro                         128566     81051    1.27
   Pfam                             101086     77741    0.99
   PROSITE                           83189     53484    0.82
   PIR                               47057     35789    0.46
   HSSP                              33548     33548    0.33
   PRINTS                            30494     27899    0.30
   SMART                             30434     22855    0.30
   ProDom                            16772     16337    0.17
   PDB                               10380      3124    0.10
   TIGR                               9378      9343    0.09
   MIM                                6755      6024    0.07
   SGD                                4903      4849    0.05
   MGD                                4408      4397    0.04
   EcoGene                            4134      4132    0.04
   Mendel                             3041      2942    0.03
   MEROPS                             2348      2260    0.02
   SubtiList                          2234      2233    0.02
   WormPep                            2071      2034    0.02
   FlyBase                            1936      1883    0.02
   GCRDb                              1661       972    0.02
   TRANSFAC                           1612      1494    0.02
   TubercuList                        1350      1313    0.01
   StyGene                             799       798    0.01
   SWISS-2DPAGE                        746       745    0.01
   Leproma                             501       497   <0.01
   MaizeDB                             402       398   <0.01
   HIV                                 370       354   <0.01
   REBASE                              352       347   <0.01
   ECO2DBASE                           351       299   <0.01
   DictyDb                             313       310   <0.01
   GlycoSuiteDB                        249       249   <0.01
   ZFIN                                154       154   <0.01
   YEPD                                129       120   <0.01
   Aarhus/Ghent-2DPAGE                 128        98   <0.01
   PHCI-2DPAGE                         128       128   <0.01
   Siena-2DPAGE                        104       104   <0.01
   HSC-2DPAGE                           85        85   <0.01
   COMPLUYEAST-2DPAGE                   50        50   <0.01
   CarbBank                             41        21   <0.01
   Maize-2DPAGE                         39        39   <0.01
   PMMA-2DPAGE                          26        26   <0.01
   MypuList                             21        21   <0.01
   ANU-2DPAGE                           13        13   <0.01
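
The "Average per entry" column can be approximated with the sketch below. It rests on two assumptions that the notes do not spell out: that the divisor is the total number of entries in the release, taken here as the sum of the four rows of the kingdom table in A.2.3, and that averages which would round to 0.00 are printed as "<0.01". The names `ENTRY_COUNT` and `average_per_entry` are illustrative.

```python
# Assumption: database size = sum of the kingdom counts in section A.2.3.
ENTRY_COUNT = 5032 + 34782 + 53357 + 8431


def average_per_entry(total_lines: int, entry_count: int = ENTRY_COUNT) -> str:
    """Format the mean number of lines per entry as shown in the table."""
    average = total_lines / entry_count
    # Assumption: values that would round to 0.00 are shown as "<0.01".
    if average < 0.005:
        return "<0.01"
    return f"{average:.2f}"


# Totals copied from the table above.
line_totals = {
    "References (RL)": 182326,
    "Comments (CC)": 309232,
    "Features (FT)": 471213,
    "Cross-references (DR)": 718458,
    "Unpublished observations": 500,
    "MASS SPECTROMETRY": 548,
}

for line_type, total_lines in line_totals.items():
    print(f"{line_type:<26} {total_lines:>7}  {average_per_entry(total_lines):>5}")
```

Note that the divisor is the whole database, not the "Number of entries" column, which only counts entries carrying at least one line of that type.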

     A.6   Miscellaneous statistics

Total number of distinct authors cited in SWISS-PROT: 146'936

Total number of entries encoded on a chloroplast : 2'609
Total number of entries encoded on a mitochondrion : 2'262
Total number of entries encoded on a cyanelle : 145
Total number of entries encoded on a plasmid : 2'344

Number of additional sequences encoded on splice variants : 3'505

--End of document--
  

Swiss-Prot release 39.0

Published May 1, 2000

                   SWISS-PROT RELEASE 39.0 RELEASE NOTES

The release notes for release 39 were never finished, but some of the
statistics were computed.


1.  INTRODUCTION

Release 39.0  of SWISS-PROT  contains 86'593  sequence entries,  comprising
31'411'114 amino  acids abstracted  from 68'779 references. This represents
an increase  of 8%  over release  38.  The  growth  of  the  data  bank  is
summarized below.

 Release      Date           Number of       Number of amino
                               entries                 acids
    2.0       09/86               3939               900 163
    3.0       11/86               4160               969 641
    4.0       04/87               4387             1 036 010
    5.0       09/87               5205             1 327 683
    6.0       01/88               6102             1 653 982
    7.0       04/88               6821             1 885 771
    8.0       08/88               7724             2 224 465
    9.0       11/88               8702             2 498 140
   10.0       03/89              10008             2 952 613
   11.0       07/89              10856             3 265 966
   12.0       10/89              12305             3 797 482
   13.0       01/90              13837             4 347 336
   14.0       04/90              15409             4 914 264
   15.0       08/90              16941             5 486 399
   16.0       11/90              18364             5 986 949
   17.0       02/91              20024             6 524 504
   18.0       05/91              20772             6 792 034
   19.0       08/91              21795             7 173 785
   20.0       11/91              22654             7 500 130
   21.0       03/92              23742             7 866 596
   22.0       05/92              25044             8 375 696
   23.0       08/92              26706             9 011 391
   24.0       12/92              28154             9 545 427
   25.0       04/93              29955            10 214 020
   26.0       07/93              31808            10 875 091
   27.0       10/93              33329            11 484 420
   28.0       02/94              36000            12 496 420
   29.0       06/94              38303            13 464 008
   30.0       10/94              40292            14 147 368
   31.0       02/95              43470            15 335 248
   32.0       11/95              49340            17 385 503
   33.0       02/96              52205            18 531 384
   34.0       10/96              59021            21 210 389
   35.0       11/97              69113            25 083 768
   36.0       07/98              74019            26 840 295
   37.0       12/98              77977            28 268 293
   38.0       07/99              80000            29 085 965
   39.0       05/00              86593            31 411 114
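
The growth figure quoted in the introduction can be checked against the last rows of the table. This is a minimal sketch, assuming the percentage refers to the number of entries rather than the number of amino acids; `percent_increase` is an illustrative helper.

```python
# (release, number of entries) for the three most recent rows of the table above.
recent_releases = [("37.0", 77977), ("38.0", 80000), ("39.0", 86593)]


def percent_increase(previous: int, current: int) -> float:
    """Relative growth from one release to the next, in percent."""
    return 100 * (current - previous) / previous


# Compare each release with the one before it.
for (old_name, old_count), (new_name, new_count) in zip(recent_releases, recent_releases[1:]):
    growth = percent_increase(old_count, new_count)
    print(f"{old_name} -> {new_name}: {growth:.1f}% (rounded: {round(growth)}%)")
```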


2.  DESCRIPTION OF THE CHANGES MADE TO SWISS-PROT SINCE RELEASE 38

2.1  Sequences and annotations

6'674 sequences have been added since release 38, the sequence data of 1026
existing entries  has been  updated and  the annotations  of 14'348 entries
have been revised.


   A.1  Amino acid composition

        A.1.1  Composition in percent for the complete data bank

   Ala (A) 7.62   Gln (Q) 3.93   Leu (L) 9.47   Ser (S) 7.08
   Arg (R) 5.18   Glu (E) 6.40   Lys (K) 5.95   Thr (T) 5.64
   Asn (N) 4.40   Gly (G) 6.87   Met (M) 2.37   Trp (W) 1.22
   Asp (D) 5.25   His (H) 2.24   Phe (F) 4.10   Tyr (Y) 3.17
   Cys (C) 1.63   Ile (I) 5.84   Pro (P) 4.90   Val (V) 6.61

   Asx (B) 0.001  Glx (Z) 0.001  Xaa (X) 0.01


        A.1.2  Classification of the amino acids by their frequency

   Leu, Ala, Ser, Gly, Val, Glu, Lys, Ile, Thr, Asp, Arg, Pro, Asn, Phe,
   Gln, Tyr, Met, His, Cys, Trp



   A.2  Repartition of the sequences by their organism of origin

   Total number of species represented in this release of SWISS-PROT: 6783

   The first twenty species represent 39720 sequences: 45.9 % of the total
   number of entries.


   A.2.1 Table of the frequency of occurrence of species

        Species represented 1x: 3215
                            2x: 1029
                            3x:  546
                            4x:  353
                            5x:  252
                            6x:  233
                            7x:  166
                            8x:  132
                            9x:  113
                           10x:   61
                       11- 20x:  293
                       21- 50x:  219
                       51-100x:   70
                         >100x:  101


   A.2.2  Table of the most represented species

  ------  ---------  --------------------------------------------
  Number  Frequency  Species
  ------  ---------  --------------------------------------------
       1       5908  Homo sapiens (Human)
       2       4827  Saccharomyces cerevisiae (Baker's yeast)
       3       4572  Escherichia coli
       4       3828  Mus musculus (Mouse)
       5       2738  Rattus norvegicus (Rat)
       6       2196  Bacillus subtilis
       7       2066  Caenorhabditis elegans
       8       1734  Haemophilus influenzae
       9       1477  Schizosaccharomyces pombe (Fission yeast)
      10       1364  Methanococcus jannaschii
      11       1206  Mycobacterium tuberculosis
      12       1178  Bos taurus (Bovine)
      13       1121  Drosophila melanogaster (Fruit fly)
      14        938  Arabidopsis thaliana (Mouse-ear cress)
      15        933  Gallus gallus (Chicken)
      16        792  Synechocystis sp. (strain PCC 6803)
      17        774  Salmonella typhimurium
      18        751  Xenopus laevis (African clawed frog)
      19        696  Sus scrofa (Pig)
      20        621  Oryctolagus cuniculus (Rabbit)
      21        501  Mycoplasma pneumoniae
      22        484  Mycoplasma genitalium
      23        457  Rickettsia prowazekii
      24        453  Zea mays (Maize)
      25        441  Helicobacter pylori (Campylobacter pylori)
      26        424  Aquifex aeolicus
      27        409  Pseudomonas aeruginosa
      28        403  Rhizobium sp. (strain NGR234)
      29        374  Methanobacterium thermoautotrophicum
      30        354  Treponema pallidum
      31        352  Borrelia burgdorferi (Lyme disease spirochete)
      32        347  Rhizobium meliloti (Sinorhizobium meliloti)
      33        343  Mycobacterium leprae
      34        342  Oryza sativa (Rice)
      35        331  Canis familiaris (Dog)
      36        330  Archaeoglobus fulgidus
      37        304  Nicotiana tabacum (Common tobacco)
      38        295  Ovis aries (Sheep)
                295  Dictyostelium discoideum (Slime mold)
      40        272  Bacteriophage T4
      41        266  Pisum sativum (Garden pea)
      42        257  Thermotoga maritima
                257  Streptomyces coelicolor
      44        253  Vaccinia virus (strain Copenhagen)
      45        250  Chlamydia trachomatis
      46        244  Staphylococcus aureus
      47        240  Glycine max (Soybean)
      48        230  Neurospora crassa
                230  Hordeum vulgare (Barley)
      50        227  Rhodobacter capsulatus (Rhodopseudomonas capsulata)
                227  Candida albicans (Yeast)
      52        220  Porphyra purpurea
      53        219  Pseudomonas putida
      54        216  Pyrococcus horikoshii
                216  Lycopersicon esculentum (Tomato)
                216  Helicobacter pylori J99 (Campylobacter pylori J99)
      57        212  Chlamydia pneumoniae (Chlamydophila pneumoniae)
      58        211  Triticum aestivum (Wheat)
      59        208  Solanum tuberosum (Potato)
      60        207  Klebsiella pneumoniae
      61        205  Bacillus stearothermophilus
      62        193  Human cytomegalovirus (strain AD169)
      63        186  Vaccinia virus (strain WR)
      64        185  Cavia porcellus (Guinea pig)
      65        178  Agrobacterium tumefaciens
      66        174  Spinacia oleracea (Spinach)
      67        162  Equus caballus (Horse)
      68        160  Chlamydomonas reinhardtii
      69        159  Emericella nidulans (Aspergillus nidulans)
      70        155  Mesocricetus auratus (Golden hamster)
      71        154  Autographa californica nuclear polyhedrosis virus
      72        151  Marchantia polymorpha (Liverwort)
      73        148  Guillardia theta (Cryptomonas phi)
      74        147  Cyanophora paradoxa
      75        146  Variola virus
      76        145  Brachydanio rerio (Zebrafish) (Zebra danio)
      77        141  Aeropyrum pernix
      78        139  Thermus aquaticus (subsp. thermophilus)
                139  Odontella sinensis
                139  Lactococcus lactis (subsp. lactis) (Streptococcus lactis)
      81        136  Synechococcus sp. (strain PCC 7942)
      82        134  Orgyia pseudotsugata multicapsid polyhedrosis virus
      83        133  Kluyveromyces lactis (Yeast)
      84        129  Trypanosoma brucei brucei
      85        125  Oncorhynchus mykiss (Rainbow trout) (Salmo gairdneri)
                125  Bradyrhizobium japonicum
                125  Alcaligenes eutrophus
      88        123  Streptococcus pneumoniae
      89        120  Rhodobacter sphaeroides (Rhodopseudomonas sphaeroides)
                120  Anabaena sp. (strain PCC 7120)
      91        118  Yersinia enterocolitica
      92        116  Bombyx mori (Silk moth)
      93        113  Neisseria gonorrhoeae
      94        110  Felis silvestris catus (Cat)
                110  Brassica napus (Rape)
      96        108  Vibrio cholerae
      97        105  Serratia marcescens
                105  Macaca mulatta (Rhesus macaque)
      99        104  Buchnera aphidicola (subsp. Schizaphis graminum)
     100        102  Shigella flexneri
     101        101  Cricetulus griseus (Chinese hamster)
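
The "first twenty species" figure quoted at the start of A.2 can be reproduced from the first twenty rows of the table above. The sketch assumes the denominator is the 86'593 entries stated in the introduction to release 39.0; the variable names are illustrative.

```python
# Frequencies of the twenty most represented species, copied from the table above.
top_twenty = [
    5908, 4827, 4572, 3828, 2738, 2196, 2066, 1734, 1477, 1364,
    1206, 1178, 1121, 938, 933, 792, 774, 751, 696, 621,
]

# Assumption: total entry count as given in the release 39.0 introduction.
TOTAL_ENTRIES = 86593

top_twenty_sequences = sum(top_twenty)
top_twenty_share = 100 * top_twenty_sequences / TOTAL_ENTRIES

print(f"{top_twenty_sequences} sequences: {top_twenty_share:.1f} % of all entries")
```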


   A.3  Repartition of the sequences by size

               From   To  Number             From   To   Number
                  1-  50    3407             1001-1100      775
                 51- 100    7142             1101-1200      598
                101- 150   10455             1201-1300      399
                151- 200    8372             1301-1400      270
                201- 250    7895             1401-1500      218
                251- 300    7290             1501-1600      145
                301- 350    6898             1601-1700      123
                351- 400    6938             1701-1800       93
                401- 450    5261             1801-1900       97
                451- 500    4946             1901-2000       70
                501- 550    3704             2001-2100       39
                551- 600    2500             2101-2200       83
                601- 650    1945             2201-2300       81
                651- 700    1451             2301-2400       40
                701- 750    1234             2401-2500       42
                751- 800    1047             >2500          236
                801- 850     832
                851- 900     862
                901- 950     600
                951-1000     505


   A.5  Statistics for journal citations


   Total number of journals cited in this release of SWISS-PROT: 1067


   A.5.1 Table of the frequency of journal citations

        Journals cited 1x: 400
                       2x: 138
                       3x:  83
                       4x:  57
                       5x:  36
                       6x:  24
                       7x:  19
                       8x:  19
                       9x:  16
                      10x:  10
                  11- 20x:  82
                  21- 50x:  71
                  51-100x:  27
                    >100x:  85


   A.5.2  List of the most cited journals in SWISS-PROT

   Nb    Citations   Journal abbreviation
   --    ---------   ----------------------------------
    1    7029        J. Biol. Chem.
    2    4194        Proc. Natl. Acad. Sci. U.S.A.
    3    3462        Nucleic Acids Res.
    4    3053        J. Bacteriol.
    5    2863        Gene
    6    2250        FEBS Lett.
    7    2122        Eur. J. Biochem.
    8    2018        Biochem. Biophys. Res. Commun.
    9    1985        Biochemistry
   10    1837        EMBO J.
   11    1736        Nature
   12    1634        Biochim. Biophys. Acta
   13    1539        J. Mol. Biol.
   14    1371        Cell
   15    1298        Mol. Cell. Biol.
   16    1160        Genomics
   17    1035        Mol. Gen. Genet.
   18    1021        Plant Mol. Biol.
   19     995        Biochem. J.
   20     913        Science
   21     889        Mol. Microbiol.
   22     797        Virology
   23     744        J. Biochem.
   24     546        J. Virol.
   25     514        J. Cell Biol.
   26     492        Yeast
   27     486        Hum. Mol. Genet.
   28     484        Plant Physiol.
   29     474        J. Gen. Virol.
   30     452        Genes Dev.
   31     440        Hum. Mutat.
   32     400        Infect. Immun.
   33     393        J. Immunol.
   34     387        Oncogene
   35     384        Structure
   36     381        Arch. Biochem. Biophys.
   37     348        Curr. Genet.
   38     347        Am. J. Hum. Genet.
   39     346        Nat. Genet.
   40     346        FEMS Microbiol. Lett.
   41     323        Mol. Biochem. Parasitol.
   42     314        Microbiology
   43     290        Development
   44     277        Nat. Struct. Biol.
   45     268        Biol. Chem. Hoppe-Seyler
   46     262        Hum. Genet.
   47     257        Mol. Endocrinol.
   48     256        J. Clin. Invest.
   49     247        Genetics
   50     246        J. Mol. Evol.
   51     236        Appl. Environ. Microbiol.
   52     224        J. Gen. Microbiol.
   53     223        DNA Cell Biol.
   54     222        Protein Sci.
   55     218        Blood
   56     213        Hoppe-Seyler's Z. Physiol. Chem.
   57     199        J. Exp. Med.
          199        Dev. Biol.
   59     196        Neuron
   60     174        Immunogenetics
   61     172        Endocrinology
   62     164        DNA Seq.
   63     152        DNA
   64     150        Cancer Res.
   65     148        Plant Cell
   66     135        Acta Crystallogr. D
   67     133        Plant J.
          133        Mech. Dev.
   69     131        Mol. Biol. Evol.
          131        Mamm. Genome
   71     128        Brain Res. Mol. Brain Res.
   72     125        J. Neurochem.
   73     124        Comp. Biochem. Physiol.
   74     121        Biochimie
   75     119        Biosci. Biotechnol. Biochem.
   76     118        Bioorg. Khim.
   77     117        J. Neurosci.
   78     116        Virus Res.
          116        Hemoglobin
   80     114        Antimicrob. Agents Chemother.
   81     112        Agric. Biol. Chem.
   82     108        Mol. Biol. Cell
          108        J. Cell Sci.
   84     106        Mol. Plant Microbe Interact.
          106        Mol. Pharmacol.
  

Swiss-Prot release 38.0

Published July 1, 1999

                   SWISS-PROT RELEASE 38.0 RELEASE NOTES


1.  INTRODUCTION

Release 38.0  of SWISS-PROT  contains 80'000  sequence entries,  comprising
29'085'965 amino  acids abstracted  from 64'965 references. This represents
an increase  of 3%  over release  37.  The  growth  of  the  data  bank  is
summarized below.

 Release      Date           Number of       Number of amino
                               entries                 acids
    2.0       09/86               3939               900 163
    3.0       11/86               4160               969 641
    4.0       04/87               4387             1 036 010
    5.0       09/87               5205             1 327 683
    6.0       01/88               6102             1 653 982
    7.0       04/88               6821             1 885 771
    8.0       08/88               7724             2 224 465
    9.0       11/88               8702             2 498 140
   10.0       03/89              10008             2 952 613
   11.0       07/89              10856             3 265 966
   12.0       10/89              12305             3 797 482
   13.0       01/90              13837             4 347 336
   14.0       04/90              15409             4 914 264
   15.0       08/90              16941             5 486 399
   16.0       11/90              18364             5 986 949
   17.0       02/91              20024             6 524 504
   18.0       05/91              20772             6 792 034
   19.0       08/91              21795             7 173 785
   20.0       11/91              22654             7 500 130
   21.0       03/92              23742             7 866 596
   22.0       05/92              25044             8 375 696
   23.0       08/92              26706             9 011 391
   24.0       12/92              28154             9 545 427
   25.0       04/93              29955            10 214 020
   26.0       07/93              31808            10 875 091
   27.0       10/93              33329            11 484 420
   28.0       02/94              36000            12 496 420
   29.0       06/94              38303            13 464 008
   30.0       10/94              40292            14 147 368
   31.0       02/95              43470            15 335 248
   32.0       11/95              49340            17 385 503
   33.0       02/96              52205            18 531 384
   34.0       10/96              59021            21 210 389
   35.0       11/97              69113            25 083 768
   36.0       07/98              74019            26 840 295
   37.0       12/98              77977            28 268 293
   38.0       07/99              80000            29 085 965



2.  DESCRIPTION OF THE CHANGES MADE TO SWISS-PROT SINCE RELEASE 37

2.1  Sequences and annotations

2'106 sequences  have been added since release 37, the sequence data of 400
existing entries  has been  updated and  the annotations  of 12'576 entries
have been revised.


2.2  What's happening with the model organisms

We have  selected a  number of  organisms that  are the  target  of  genome
sequencing and/or mapping projects and for which we intend to:

o  Be as  complete as possible.  All sequences  available at a given time
   should  be  immediately  included  in  SWISS-PROT.  This also includes
   sequence corrections and updates;
o  Provide a higher level of annotation;
o  Provide  cross-references  to  specialized  database(s) that  contain,
   among other  data,  some genetic information about the genes that code
   for these proteins;
o  Provide specific indices or documents.

Here is the current status of the model organisms in SWISS-PROT:

 Organism        Database            Index file       Number of
                 cross-referenced                     sequences
 --------------  ----------------    --------------   ---------
 A.thaliana      None yet            In preparation         821
 B.subtilis      SubtiList           SUBTILIS.TXT          2069
 C.albicans      None yet            CALBICAN.TXT           221
 C.elegans       Wormpep             CELEGANS.TXT          2202
 D.discoideum    DictyDB             DICTY.TXT              292
 D.melanogaster  FlyBase             FLY.TXT               1088
 E.coli          EcoGene             ECOLI.TXT             4516
 H.influenzae    HiDB (TIGR)         HAEINFLU.TXT          1698
 H.sapiens       MIM                 MIMTOSP.TXT           5406
 H.pylori        HpDB (TIGR)         HPYLORI.TXT            382
 M.genitalium    MgDB (TIGR)         MGENITAL.TXT           469
 M.musculus      MGD                 MGDTOSP.TXT           3549
 M.jannaschii    MjDB (TIGR)         MJANNASC.TXT          1312
 M.tuberculosis  None yet            None yet               928
 S.cerevisiae    SGD                 YEAST.TXT             4811
 S.typhimurium   StyGene             SALTY.TXT              727
 S.pombe         None yet            POMBE.TXT             1438
 S.solfataricus  None yet            None yet                86
 --------------  ----------------    --------------   ---------

Collectively the  entries from the above model organisms represent 38.5% of
all SWISS-PROT entries.

We plan  to finish as quickly as possible the annotation of the Escherichia
coli,  Haemophilus   influenzae,   Methanococcus   jannaschii   and   yeast
(S.cerevisiae) sequence entries which are not yet part of SWISS-PROT.

Please also  see the  description of  the Human  Proteomics  Initiative  in
section 10 of these release notes.


2.3  First steps in the conversion of SWISS-PROT to mixed-case characters

We are gradually converting SWISS-PROT entries from all UPPER CASE to MiXeD
CaSe. The  line-types that  have been  converted between  release 37 and 38
are: DT  (DaTe), OS  (Organism Species),  OC (Organism  Classification), OG
(OrGanelle), RL  (Reference Location)  and KW  (KeyWord). The RT (Reference
Title) lines  were already  introduced in  mixed-case  at  release  37.  As
described in  section 3.1,  the process  of converting all of SWISS-PROT to
mixed case is continuing.


2.4  Small change  in the  format of  RL lines  for submissions  to the DNA
     databases

Along with  the conversion  of the  RL to mixed-case (see 2.3) we have also
made a  small change  to the  format of RL lines for submissions to the DNA
databases. What used to be:

RL   SUBMITTED (MMM-YEAR) TO EMBL/GENBANK/DDBJ DATA BANKS.

is now:

RL   Submitted (MMM-YEAR) to the EMBL/GenBank/DDBJ databases.

This change  was made  to follow  more closely  the format used by the EMBL
nucleotide sequence database.
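
As an illustration only, the rewrite described above can be sketched in
Python. The function name convert_submission_rl is hypothetical, and the
sketch assumes that the date token always consists of three letters, a dash
and four digits (for instance JUN-1994) and is carried over unchanged.

```python
import re

# Old upper-case form of a submission RL line, as quoted in section 2.4.
# Assumption: the date token looks like "JUN-1994".
OLD_RL = re.compile(
    r"^RL   SUBMITTED \(([A-Z]{3}-\d{4})\) TO EMBL/GENBANK/DDBJ DATA BANKS\.$"
)


def convert_submission_rl(line: str) -> str:
    """Return the release 38 form of an old-style submission RL line.

    Lines that do not match the old pattern are returned unchanged.
    """
    match = OLD_RL.match(line)
    if match is None:
        return line
    date = match.group(1)
    return f"RL   Submitted ({date}) to the EMBL/GenBank/DDBJ databases."
```

Journal-type RL lines do not match the pattern and pass through untouched.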


2.5  Introduction of a new CC line-type topic: MISCELLANEOUS

We have introduced in this release a new 'topic' for the comments (CC) line
type: MISCELLANEOUS.  This topic  is used  for all  comments which  do  not
belong to  any other  already defined  topic. This means that starting with
the current release all comments are now assigned to a topic. Example, what
was previously:

CC   -!- BINDS TO BACITRACIN.

is now:

CC   -!- MISCELLANEOUS: BINDS TO BACITRACIN.


2.6  Cleaning up of the SIMILARITY comment line (CC) topic

We are  continuing a  major overhaul of the SIMILARITY topic. We would like
the majority  of the  information stored  in this  topic to  be  usable  by
computer  programs   (while  being   human-readable).  We   are   therefore
standardizing the  format of this topic using two different subformats. One
to describe to which family a protein belongs:

CC   -!-  SIMILARITY: BELONGS TO THE <Name1> FAMILY [OF <Name2>].
CC        [<Name3> SUBFAMILY.]

Examples:

CC   -!-  SIMILARITY: BELONGS TO THE 14-3-3 FAMILY.
CC   -!-  SIMILARITY: BELONGS TO THE 6-PHOSPHOGLUCONATE DEHYDROGENASE
CC        FAMILY.
CC   -!-  SIMILARITY: BELONGS TO THE AAA FAMILY OF ATPASES.
CC   -!-  SIMILARITY: BELONGS TO THE IRON/ASCORBATE-DEPENDENT FAMILY OF
CC        OXIDOREDUCTASES.
CC   -!-  SIMILARITY: BELONGS TO THE ANTP FAMILY OF HOMEOBOX PROTEINS.
CC        "DEFORMED" SUBFAMILY.
CC   -!-  SIMILARITY: BELONGS TO THE KINESIN-LIKE PROTEIN FAMILY.
CC        KINESIN SUBFAMILY.

And one to describe which domains are found in a given protein:

CC   -!-  SIMILARITY: CONTAINS n <Name> [DOMAIN|REPEAT][S].

Examples:

CC   -!-  SIMILARITY: CONTAINS 1 FHA DOMAIN.
CC   -!-  SIMILARITY: CONTAINS 45 EGF-LIKE DOMAINS.
CC   -!-  SIMILARITY: CONTAINS 2 SH3 DOMAINS.
CC   -!-  SIMILARITY: CONTAINS 2 SUSHI (SCR) REPEATS.

We have  already updated many entries in this and the previous releases and
plan to complete this change for the next release.
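
To show how the two subformats lend themselves to automatic processing,
here is a minimal Python sketch. The names parse_similarity, FAMILY_RE and
CONTENT_RE are hypothetical. The sketch assumes that the 'CC   -!-
SIMILARITY: ' prefix has been removed and that continuation lines have been
joined with single spaces; entries not yet converted to the standard format
simply yield None.

```python
import re

# "BELONGS TO THE <Name1> FAMILY [OF <Name2>]. [<Name3> SUBFAMILY.]"
FAMILY_RE = re.compile(
    r"^BELONGS TO THE (?P<family>.+?) FAMILY"
    r"(?: OF (?P<group>.+?))?\."
    r"(?: (?P<subfamily>.+?) SUBFAMILY\.)?$"
)

# "CONTAINS n <Name> [DOMAIN|REPEAT][S]."
CONTENT_RE = re.compile(
    r"^CONTAINS (?P<count>\d+) (?P<name>.+?) (?P<kind>DOMAIN|REPEAT)S?\.$"
)


def parse_similarity(text: str):
    """Parse the text of one SIMILARITY comment into a dictionary.

    Returns None when the text follows neither of the two subformats.
    """
    match = FAMILY_RE.match(text)
    if match:
        return {"type": "family", **match.groupdict()}
    match = CONTENT_RE.match(text)
    if match:
        return {
            "type": "content",
            "count": int(match["count"]),
            "name": match["name"],
            "kind": match["kind"],
        }
    return None
```

For the ANTP example above, the sketch reports the family ANTP, the group
HOMEOBOX PROTEINS and the subfamily "DEFORMED" (quotes included).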


2.7  Changes concerning cross-references (DR line)

We have added cross-references from SWISS-PROT to the Zebrafish Information
Network (ZFIN)  database available  at http://zfish.uoregon.edu/ZFIN/ (see:
Westerfield M.,  Doerry E.,  Kirkpatrick A.E.  and Douglas S.A.; Meth. Cell
Biol. 60:339-355(1999)).  These cross-references  are  present  in  the  DR
lines:

Data bank identifier: ZFIN
Primary identifier  : The ZFIN identifiers for a given gene.
Secondary identifier: The gene designation
Example             : DR   ZFIN; ZDB-GENE-980526-290; hoxa1.

We have  started to  add cross-references  from SWISS-PROT  to the CarbBank
Complex        Carbohydrate         Structure        Database        (CCSD)
(http://128.192.9.29/carbbank/). These  cross-references are present in the
DR lines:

Data bank identifier: CARBBANK
Primary identifier  : The CarbBank identifier for a given carbohydrate
                      structure.
Secondary identifier: A dash (-).
Example             : DR   CARBBANK; CCSD:27494; -.

In this  release, we have also updated all the DR lines pointing to the MIM
and Pfam databases.


2.8  Switching from  pID to  protein_ID  in  cross-references  to  the  DNA
     sequence databases

The DNA  sequence  databases  (EMBL/GenBank/DDBJ)  recently  changed  their
referencing system  for CDS (CoDing Sequence). They used to associate every
CDS in  the database  with what  was called  a pID. The pID was a string of
variable length  composed of  a letter  (D, E  or G)  followed by  a number
(example: E345673).  Whenever the  protein sequence  coded by  a CDS  would
change due  to a  sequence or annotation revision, a new pID was attributed
to that  CDS. This system made it difficult to track changes. pIDs have
therefore been replaced by what is now called 'protein_ID' (protein sequence
IDentifier). The  protein_ID consists of a stable ID portion (8 characters:
3 letters  followed by  5 numbers)  plus a  version number  after a decimal
point (example:  AAA03208.1). The  version number  only  changes  when  the
protein sequence  coded by  the CDS  changes, while the stable part remains
unchanged.

In release  38, we have converted the cross-references to EMBL/GenBank/DDBJ
to use  the protein_ID  instead of  the pID  as the secondary identifier in
these DR lines. Example, what was previously:

DR   EMBL; Z75208; E1165324; -.

is now:

DR   EMBL; Z75208; CAA99603.1; -.

For a  number of  technical reasons,  732 pIDs are still referenced in
release 38; they will gradually be replaced by the corresponding protein_IDs
for release 39.
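
Since both kinds of identifier coexist in release 38, software reading DR
lines may need to tell them apart. The following Python sketch does so from
the two descriptions given above (a letter D, E or G followed by a number
for a pID; three letters, five digits, a decimal point and a version number
for a protein_ID). The function name classify_embl_dr is hypothetical.

```python
import re

PID_RE = re.compile(r"^[DEG]\d+$")                 # e.g. E1165324
PROTEIN_ID_RE = re.compile(r"^[A-Z]{3}\d{5}\.\d+$")  # e.g. CAA99603.1


def classify_embl_dr(line: str):
    """Return (accession, secondary identifier, kind) for a 'DR   EMBL' line.

    kind is 'protein_ID', 'pID' or 'other'.
    """
    # Drop the 'DR   ' line code and the final period, then split on ';'.
    fields = [field.strip() for field in line[5:].rstrip(".").split(";")]
    if len(fields) < 3 or fields[0] != "EMBL":
        raise ValueError(f"not an EMBL cross-reference: {line!r}")
    accession, secondary = fields[1], fields[2]
    if PROTEIN_ID_RE.match(secondary):
        kind = "protein_ID"
    elif PID_RE.match(secondary):
        kind = "pID"
    else:
        kind = "other"
    return accession, secondary, kind
```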


2.9  Introduction of a unique identifier in the VARIANT feature description
     of human sequence entries

We have  introduced in  release 38  a unique  identifier  for  all  VARIANT
feature keys  in human  sequence entries.  This change  is the  first  step
toward providing  a unique  identifier to  all SWISS-PROT  features.  Human
sequence variants  were chosen  as a  prototype for this improvement. It is
now possible  to directly  link specific  sequence variants to the relevant
entries in disease mutation databases as well as to provide these databases
with a method to implement reciprocal links.

The unique  identifier is  of the  form /FTId=VAR_nnnnnn  and is added as
the last  part of the description field of 'VARIANT' feature keys. Example,
what was previously:

FT   VARIANT       6      6       E -> V (IN S; SICKLE CELL ANEMIA).
FT   VARIANT      11     11       V -> D (IN WINDSOR; O2 AFFINITY UP;
FT                                UNSTABLE).

is now:

FT   VARIANT       6      6       E -> V (IN S; SICKLE CELL ANEMIA).
FT                                /FTId=VAR_002863.
FT   VARIANT      11     11       V -> D (IN WINDSOR; O2 AFFINITY UP;
FT                                UNSTABLE).
FT                                /FTId=VAR_002873.
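
As a rough sketch of how a program might collect these identifiers, the
Python function below associates each /FTId value with the positions of the
VARIANT feature it belongs to. The name variant_ids is hypothetical, and
the sketch assumes plain integer positions and the column layout shown in
the example above.

```python
import re

# A feature line that starts a new feature: key, start, end, description.
KEY_RE = re.compile(r"^FT   (\S+)\s+(\d+)\s+(\d+)\s*(.*)$")
# The identifier as it appears in the description field.
FTID_RE = re.compile(r"/FTId=(VAR_\d{6})")


def variant_ids(ft_lines):
    """Map each /FTId value to the (start, end) of its VARIANT feature."""
    result = {}
    current = None  # (key, start, end) of the feature being read
    for line in ft_lines:
        key_match = KEY_RE.match(line)
        if key_match:
            key, start, end, _description = key_match.groups()
            current = (key, int(start), int(end))
            continue
        # Continuation line: look for an identifier of the current feature.
        id_match = FTID_RE.search(line)
        if id_match and current is not None and current[0] == "VARIANT":
            result[id_match.group(1)] = (current[1], current[2])
    return result
```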



3.  FORTHCOMING CHANGES

3.1  Continuation of the conversion of SWISS-PROT to mixed-case characters

We will continue to convert SWISS-PROT entries from all UPPER CASE to MiXeD
CaSe. In  release 39  we are  planning to convert the RA (Reference Author)
and RC  (Reference Comment)  line types.  We will  also  convert  the  gene
designations in  the DR  (Database cross-Reference) lines for MGD, EcoGene,
StyGene, SubtiList and DictyDb to mixed case.

Further lines will be converted in release 40.

Here is an example of what a SWISS-PROT entry will look like in release 39:

ID   HXC4_MOUSE     STANDARD;      PRT;   264 AA.
AC   Q08624;
DT   01-OCT-1994 (Rel. 30, Created)
DT   01-OCT-1994 (Rel. 30, Last sequence update)
DT   15-DEC-1999 (Rel. 39, Last annotation update)
DE   HOMEOBOX PROTEIN HOX-C4 (HOX-3.5).
GN   HOXC4 OR HOXC-4 OR HOX-3.5.
OS   Mus musculus (Mouse).
OC   Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Mammalia;
OC   Eutheria; Rodentia; Sciurognathi; Muridae; Murinae; Mus.
RN   [1]
RP   SEQUENCE FROM N.A.
RC   STRAIN=Balb/C; TISSUE=Liver;
RX   MEDLINE; 93288004.
RA   Goto J., Miyabayashi T., Wakamatsu Y., Takahashi N., Muramatsu M.;
RT   "Organization and expression of mouse Hox3 cluster genes.";
RL   Mol. Gen. Genet. 239:41-48(1993).
RN   [2]
RP   SEQUENCE FROM N.A.
RC   TISSUE=Embryo;
RX   MEDLINE; 93161956.
RA   Geada A.M.C., Gaunt S.J., Azzawi M., Shimeld S.M., Pearce J.,
RA   Sharpe P.T.;
RT   "Sequence and embryonic expression of the murine Hox-3.5 gene.";
RL   Development 116:497-506(1992).
RN   [3]
RP   SEQUENCE OF 177-201 FROM N.A.
RC   STRAIN=C57BL/6; TISSUE=Spleen;
RX   MEDLINE; 92073357.
RA   Murtha M.T., Leckman J.F., Ruddle F.H.;
RT   "Detection of homeobox genes in development and evolution.";
RL   Proc. Natl. Acad. Sci. U.S.A. 88:10711-10715(1991).
CC   -!- FUNCTION: SEQUENCE-SPECIFIC TRANSCRIPTION FACTOR WHICH IS PART OF
CC       A DEVELOPMENTAL REGULATORY SYSTEM THAT PROVIDES CELLS WITH
CC       SPECIFIC POSITIONAL IDENTITIES ON THE ANTERIOR-POSTERIOR AXIS.
CC   -!- SUBCELLULAR LOCATION: NUCLEAR.
CC   -!- SIMILARITY: BELONGS TO THE ANTP FAMILY OF HOMEOBOX PROTEINS.
CC       "DEFORMED" SUBFAMILY.
DR   EMBL; D11328; BAA01947.1; -.
DR   EMBL; S62287; AAB27153.1; -.
DR   EMBL; X69019; CAA48784.1; -.
DR   EMBL; M81660; AAA63313.1; -.
DR   PIR; S35219; S35219.
DR   HSSP; P02833; 1SAN.
DR   MGD; MGI:96195; Hoxc4.
DR   PFAM; PF00046; homeobox; 1.
DR   PROSITE; PS00027; HOMEOBOX_1; 1.
DR   PROSITE; PS00032; ANTENNAPEDIA; 1.
DR   PROSITE; PS50071; HOMEOBOX_2; 1.
KW   Homeobox; DNA-binding; Developmental protein; Nuclear protein;
KW   Transcription regulation.
FT   DOMAIN       54     60       POLY-PRO.
FT   DOMAIN      135    140       ANTP-TYPE HEXAPEPTIDE (BY SIMILARITY).
FT   DNA_BIND    156    215       HOMEOBOX (BY SIMILARITY).
FT   DOMAIN      183    186       POLY-ARG.
FT   CONFLICT     80     80       A -> G (IN REF. 2).
FT   CONFLICT     96     96       P -> S (IN REF. 2).
SQ   SEQUENCE   264 AA;  29865 MW;  611C069F CRC32;
     MIMSSYLMDS NYIDPKFPPC EEYSQNSYIP EHSPEYYGRT RESGFQHHHQ ELYPPPPPRP
     SYPERQYSCT SLQGPGNSRA HGPAQAGHHH PEKSQPLCEP APLSGTSASP SPAPPACSQP
     APDHPSSAAS KQPIVYPWMK KIHVSTVNPN YNGGEPKRSR TAYTRQQVLE LEKEFHYNRY
     LTRRRRIEIA HSLCLSERQI KIWFQNRRMK WKKDHRLPNT KVRSAPPAGA APSTLSAATP
     GTSEDHSQSA TPPEQQRAED ITRL
//


3.2  Extension of the accession number system

With the  creation of  the TrEMBL  database (see  section 6)  and the rapid
increase in  the amount  of sequence  data, we  are faced with a problem of
availability of  accession numbers.  Currently we  use a  system based on a
one-letter prefix  followed by  5 digits.  This system was also used by the
nucleotide sequence  databases which had originally reserved for SWISS-PROT
the prefix letters 'O', 'P' and 'Q'. The nucleotide databases, having run out
of space  (due mainly  to ESTs),  have been  forced to  start using  a new
format based on a two-letter prefix followed by 6 digits.

We have now used up all possible numbers with 'O', 'P' and 'Q'. As we believe
that changing  the format  of the accession numbers to that used now by the
nucleotide databases would wreak havoc on  the numerous software packages
using SWISS-PROT,  we have  decided to  keep a  system of accession numbers
based on a six-character code, but with the following format extension:

    1        2       3          4            5            6
    [O,P,Q]  [0-9]  [A-Z, 0-9]  [A-Z, 0-9]   [A-Z, 0-9]   [0-9]

What the above means is that we will keep a six-character code, but that in
positions 3,  4 and  5 of  this code any combination of letters and numbers
can be  present. This format allows a total of 14 million accession numbers
(up from 300'000 with the current system).

We only allow digits in positions 2 and 6 so that SWISS-PROT accession
numbers  cannot be  mistaken for  gene names,  acronyms,  other  types  of
accession numbers or ordinary words.

Examples: P0A3S2, Q2ASD4, O13YX2, P9B123
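
The extended format and the figures quoted above can be checked with a few
lines of Python. This is only a sketch: the names ACCESSION_RE,
is_valid_accession, extended_capacity and current_capacity are
hypothetical.

```python
import re

# Position:                  1     2      3-5          6
ACCESSION_RE = re.compile(r"[OPQ][0-9][A-Z0-9]{3}[0-9]")


def is_valid_accession(accession: str) -> bool:
    """True if the string follows the extended six-character format."""
    return ACCESSION_RE.fullmatch(accession) is not None


def extended_capacity() -> int:
    """Number of distinct accession numbers in the extended format."""
    # 3 prefix letters, a digit, three positions with 36 symbols, a digit.
    return 3 * 10 * 36 ** 3 * 10


def current_capacity() -> int:
    """Number of accession numbers with one prefix letter and 5 digits."""
    return 3 * 10 ** 5
```

extended_capacity() evaluates to 13'996'800, i.e. the 14 million mentioned
above, and current_capacity() to 300'000. Accession numbers in the current
format (such as Q08624) remain valid in the extended format.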


3.3  Introduction of a new FT key: SE_CYS

Selenocysteine is  the 21st natural amino acid. It is now known to occur in
several dozen  proteins. Its  mRNA codon  is UGA, which usually serves as a
stop codon;  in the presence of a specific downstream sequence forming a
loop and  of a specific translational elongation factor, it is instead
recognized as the site of selenocysteine incorporation into proteins.

Very recently  the joint  nomenclature committee  of the  IUPAC/IUBMB  (see
http://www.chem.qmw.ac.uk/iupac/jcbn/)      officially     recommended
(http://www.chem.qmw.ac.uk/iubmb/newsletter/1999/item3.html) a three-letter
and a one-letter symbol for selenocysteine, namely Sec and U.

We recognize that introducing a new one-letter code in the sequence records
would disrupt  most, if  not all,  sequence analysis software. We therefore
decided to  change, in  SWISS-PROT, the rules used to annotate the presence
of selenocysteine  residues in  sequence entries  in the  manner  described
below.

Currently selenocysteines  are stored,  in the  sequence records, using the
one-letter symbol  C for  cysteine and  are indicated  in the feature table
(FT) by a line of the type:

FT   BINDING       x      x       SELENIUM.

The one-letter  code will  not be changed (for the reason explained above),
but we  will introduce  a specific  feature key  (SE_CYS) to  indicate  the
presence of  a selenocysteine  at a  given  sequence  position.  The  above
example will therefore be changed to:

FT   SE_CYS        x      x

We also  want to  remind users  that the keyword Selenocysteine is and will
continue to  be used to tag sequence entries that contain at least one such
residue.
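
Programs that do understand the one-letter symbol U could rebuild the true
sequence from the feature table. The Python sketch below illustrates the
idea; the function names are hypothetical, and it assumes that SE_CYS lines
carry plain integer positions, as in the example above.

```python
def selenocysteine_positions(ft_lines):
    """Return the 1-based positions annotated with the SE_CYS feature key."""
    positions = []
    for line in ft_lines:
        fields = line.split()
        # Expected fields: 'FT', 'SE_CYS', start position, end position.
        if len(fields) >= 4 and fields[0] == "FT" and fields[1] == "SE_CYS":
            positions.append(int(fields[2]))
    return positions


def with_selenocysteine(sequence: str, positions) -> str:
    """Replace the C stored at each SE_CYS position by U."""
    residues = list(sequence)
    for position in positions:
        if residues[position - 1] != "C":
            raise ValueError(f"no cysteine stored at position {position}")
        residues[position - 1] = "U"
    return "".join(residues)
```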


3.4  Introduction of a new CC line-type topic: PHARMACEUTICAL

We will  introduce in  the next release a new 'topic' for the comments (CC)
line type:  PHARMACEUTICAL. This  topic will describe the use of a specific
protein as  a pharmaceutical drug. The information provided by such a topic
will include  the brand  name(s) under  which a  protein is  available, the
name(s) of  the company(ies) that produce it as well as a short description
of the therapeutic usage of the protein.

Examples:

CC   -!- PHARMACEUTICAL: Available under the names Avonex (Biogen),
CC       Betaseron (Berlex) and Rebif (Serono). Used in the treatment
CC       of multiple sclerosis (MS). Betaseron is a slightly modified
CC       form of IFNB1 with two residue substitutions.

CC   -!- PHARMACEUTICAL: Available under the name Proleukin (Chiron).
CC       Used in patients with renal cell carcinoma or metastatic
CC       melanoma.

It should be noted that any entries containing such a comment field will
also be tagged with the keyword Pharmaceutical.


3.5  Multiple AC lines

Starting with  release 39,  there can  be more than one AC (ACcession) line
per SWISS-PROT entry. Strictly speaking this is not a format change and the
user manual  of SWISS-PROT  has always indicated that there could be more than
one AC line per entry. Until recently, a single line was sufficient and the
majority of  entries contained  only a single accession number. But, in the
process of  providing an  optimally non-redundant  database we  are merging
information from  TrEMBL entries  into SWISS-PROT  entries. When we merge a
TrEMBL entry  to a  SWISS-PROT one,  we add  to that  SWISS-PROT entry  the
accession number(s)  of the  TrEMBL entry. The repetition of such a process
sometimes produces  an accession  number list  which can no longer fit in a
single AC  line. Therefore  there will  now be some entries with two, three
(as shown below) or more AC lines.

AC   P16070; P22511; Q04858; Q13419; Q13957; Q13958; Q13959; Q13960;
AC   Q13961; Q13967; Q13968; Q13980; Q15861; Q16064; Q16065; Q16066;
AC   Q16208; Q16522;
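
Software that so far read only the first AC line will need to collect the
accession numbers from all of them. A minimal Python sketch follows; the
function name accession_numbers is hypothetical.

```python
def accession_numbers(entry_lines):
    """Return all accession numbers of an entry, in order of appearance.

    Every line starting with the 'AC' line code contributes to the list.
    """
    accessions = []
    for line in entry_lines:
        if line.startswith("AC   "):
            for token in line[5:].split(";"):
                token = token.strip()
                if token:  # skip the empty piece after the last ';'
                    accessions.append(token)
    return accessions
```

Applied to the three AC lines shown above, the function returns a list of
18 accession numbers.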


3.6  Change in the syntax of the SQ line

The SQ  (SeQuence header) line marks the beginning of the sequence data and
gives a  quick summary  of its  content. The  format  of  the  SQ  line  is
currently:

SQ   SEQUENCE  XXXX AA; XXXXXX MW;  XXXXXXXX CRC32;

The last information item in the SQ line is a 32-bit CRC (Cyclic Redundancy
Check) value  which is  computed  from  the  sequence.  As  the  number  of
available sequences  is increasing rapidly, there are now a few cases where
two sequences share the same CRC32 (although none of these also share the
same molecular weight MW or number of amino acids AA). To address this issue
we will, starting with the next release, replace the 32-bit CRC value by a
64-bit CRC. The format of the SQ line will therefore be changed to:

SQ   SEQUENCE  XXXX AA; XXXXXX MW;  XXXXXXXXXXXXXXXX CRC64;

Example:

SQ   SEQUENCE   233 AA;  25630 MW;  146A1B48A1475C86 CRC64;
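
During the transition, a parser may want to accept both forms of the SQ
line. The Python sketch below does this; the names SQ_RE and parse_sq are
hypothetical, and the sketch only reads the checksum (written as 8 or 16
hexadecimal digits) without recomputing it.

```python
import re

SQ_RE = re.compile(
    r"^SQ   SEQUENCE\s+(\d+) AA;\s+(\d+) MW;\s+([0-9A-F]+) (CRC32|CRC64);$"
)


def parse_sq(line: str) -> dict:
    """Parse an SQ line in either the CRC32 or the CRC64 form."""
    match = SQ_RE.match(line)
    if match is None:
        raise ValueError(f"not a valid SQ line: {line!r}")
    length, weight, checksum, kind = match.groups()
    # A 32-bit value takes 8 hexadecimal digits, a 64-bit value takes 16.
    expected_digits = 8 if kind == "CRC32" else 16
    if len(checksum) != expected_digits:
        raise ValueError(f"{kind} checksum should have {expected_digits} digits")
    return {
        "length": int(length),
        "mw": int(weight),
        "checksum": checksum,
        "kind": kind,
    }
```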



4.  STATUS OF THE DOCUMENTATION FILES

SWISS-PROT is  distributed with a large number of documentation files. Some
of these  files have  been available  for a  long time  (the  user  manual,
release notes, the various indices for authors, citations, keywords, etc.),
but many  have been  created recently  and we  are continuously  adding new
files. The  following table  lists all  the documents  that  are  currently
available.

 USERMAN.TXT    User manual
 RELNOTES.TXT   Release notes for current release (38)
 OLDRLNOT.TXT   Release notes for previous release (37)
 SHORTDES.TXT   Short description of entries in SWISS-PROT
 JOURLIST.TXT   List of abbreviations for journals cited
 KEYWLIST.TXT   List of keywords in use
 SPECLIST.TXT   List of organism identification codes
 TISSLIST.TXT   List of tissues [See 1]
 EXPERTS.TXT    List of on-line experts for PROSITE and SWISS-PROT
 SUBMIT.TXT     Submission of sequence data to SWISS-PROT

 ACINDEX.TXT    Accession number index
 AUTINDEX.TXT   Author index
 CITINDEX.TXT   Citation index
 KEYINDEX.TXT   Keyword index
 SPEINDEX.TXT   Species index
 DELETEAC.TXT   Deleted accession number index

 7TMRLIST.TXT   List of 7-transmembrane G-linked receptor entries
 AATRNASY.TXT   List of aminoacyl-tRNA synthetases
 ALLERGEN.TXT   Nomenclature and index of allergen sequences
 ANNBIOCH.TXT   SWISS-PROT annotation:  how is biochemical information
                assigned to sequence entries [See 2]
 BLOODGRP.TXT   List of blood group antigen proteins
 CALBICAN.TXT   Index   of  Candida  albicans  entries   and  their
                corresponding gene designations
 CDLIST.TXT     CD  nomenclature  for  surface  proteins  of  human
                leucocytes
 CELEGANS.TXT   Index  of Caenorhabditis elegans entries  and their
                corresponding gene Wormpep cross-references
 DICTY.TXT      Index   of  Dictyostelium  discoideum  entries  and
                their  corresponding gene designations  and DictyDb
                cross-references
 EC2DTOSP.TXT   Index  of  Escherichia coli  Gene-protein  database
                entries referenced in SWISS-PROT
 ECOLI.TXT      Index  of Escherichia coli K12  chromosomal entries
                and their corresponding EcoGene cross-references
 EMBLTOSP.TXT   Index  of   EMBL  Database  entries  referenced  in
                SWISS-PROT
 EXTRADOM.TXT   Nomenclature of extracellular domains
 FLY.TXT        Index  of  Drosophila  entries and  FlyBase  cross-
                references
 GLYCOSID.TXT   Classification  of glycosyl hydrolase  families and
                index of glycosyl hydrolase entries
 HAEINFLU.TXT   Index  of  Haemophilus  influenzae  RD  chromosomal
                entries
 HOXLIST.TXT    Vertebrate  homeotic Hox proteins: nomenclature and
                index
 HPYLORI.TXT    Index   of   Helicobacter   pylori   strain   26695
                chromosomal entries
 HUMCHR16.TXT   Index of protein  sequence entries encoded on human
                chromosome 16 [See 2]
 HUMCHR17.TXT   Index of protein  sequence entries encoded on human
                chromosome 17
 HUMCHR18.TXT   Index of protein  sequence entries encoded on human
                chromosome 18
 HUMCHR19.TXT   Index of protein  sequence entries encoded on human
                chromosome 19
 HUMCHR20.TXT   Index of protein  sequence entries encoded on human
                chromosome 20
 HUMCHR21.TXT   Index of protein  sequence entries encoded on human
                chromosome 21
 HUMCHR22.TXT   Index of protein  sequence entries encoded on human
                chromosome 22
 HUMCHRX.TXT    Index of protein  sequence entries encoded on human
                chromosome X
 HUMCHRY.TXT    Index of protein  sequence entries encoded on human
                chromosome Y
 HUMPVAR.TXT    Index of human proteins with sequence variants
 INITFACT.TXT   List and index of translation initiation factors
 MIMTOSP.TXT    Index of MIM entries referenced in SWISS-PROT
 METALLO.TXT    Classification  of  metallothioneins and  index  of
                entries in SWISS-PROT
 MGDTOSP.TXT    Index of MGD entries referenced in SWISS-PROT
 MGENITAL.TXT   Index  of Mycoplasma genitalium chromosomal entries
 MJANNASC.TXT   Index of Methanococcus jannaschii entries
 NGR234.TXT     Table  of   putative  genes  in  Rhizobium  plasmid
                pNGR234a
 NOMLIST.TXT    List   of  nomenclature   related  references   for
                proteins
 PCC6803.TXT    Index of Synechocystis strain PCC 6803 entries
 PDBTOSP.TXT    Index  of X-ray  crystallography Protein Data  Bank
                (PDB) entries referenced in SWISS-PROT
 PEPTIDAS.TXT   Classification  of peptidase families and  index of
                peptidase entries
 PLASTID.TXT    List of chloroplast and cyanelle encoded proteins
 POMBE.TXT      Index   of  Schizosaccharomyces  pombe  entries  in
                SWISS-PROT    and    their    corresponding    gene
                designations
 RESTRIC.TXT    List of restriction enzyme and methylase entries
 RIBOSOMP.TXT   Index of  ribosomal proteins classified by families
                on the basis of sequence similarities
 SALTY.TXT      Index  of  Salmonella typhimurium  LT2  chromosomal
                entries  and  their  corresponding  StyGene  cross-
                references
 SUBTILIS.TXT   Index of  Bacillus subtilis 168 chromosomal entries
                and their corresponding SubtiList cross-references
 UPFLIST.TXT    UPF  (Uncharacterized  Protein Families)  list  and
                index of members
 YEAST.TXT      Index   of  Saccharomyces  cerevisiae  entries  and
                their corresponding gene designations
 YEAST1.TXT     Yeast Chromosome I entries
 YEAST2.TXT     Yeast Chromosome II entries
 YEAST3.TXT     Yeast Chromosome III entries
 YEAST5.TXT     Yeast Chromosome V entries
 YEAST6.TXT     Yeast Chromosome VI entries
 YEAST7.TXT     Yeast Chromosome VII entries
 YEAST8.TXT     Yeast Chromosome VIII entries
 YEAST9.TXT     Yeast Chromosome IX entries
 YEAST10.TXT    Yeast Chromosome X entries
 YEAST11.TXT    Yeast Chromosome XI entries
 YEAST13.TXT    Yeast Chromosome XIII entries
 YEAST14.TXT    Yeast Chromosome XIV entries

 1. The tissue  list  (tisslist.txt)  has  been  converted  to  mixed-case
    characters;
 2. The annbioch.txt  and humchr16.txt  files are new documents introduced
    in this release.

We have  continued to  include in  some SWISS-PROT  documentation files the
references of  Web sites relevant to the subject under consideration. There
are now 42 documents that include such links.



5.  THE EXPASY WORLD-WIDE WEB SERVER

5.1  Background information

The most  efficient and user-friendly way to browse interactively in SWISS-
PROT, PROSITE,  ENZYME, SWISS-2DPAGE  and other  databases is  to  use  the
World-Wide Web (WWW) molecular biology server ExPASy. The ExPASy server was
made available  to the  public in  September 1993  and is  reachable at the
following address:

                           http://www.expasy.ch/

The ExPASy  WWW server  allows access,  using the  user-friendly  hypertext
model, to  the SWISS-PROT, PROSITE, ENZYME, SWISS-2DPAGE, SWISS-3DIMAGE and
CD40Lbase databases. And, through any SWISS-PROT protein sequence entry, to
other databases  such as  EMBL, Eco2DBASE, EcoCyc, EcoGene, FlyBase, GCRDb,
MaizeDB,  Mendel,   OMIM,   PDB,   HSSP,   Pfam,   ProDom,   REBASE,   SGD,
SubtiList/NRSub, TRANSFAC,  YPD, ZFIN  and Medline. ExPASy also offers many
tools for the analysis of protein sequences and 2D gels.


5.2  Swiss-Shop

We    provide,     on    ExPASy,     a    service     called     Swiss-Shop
(http://www.expasy.ch/swiss-shop/). Swiss-Shop  is  an  automated  sequence
alerting system  which allows  users to  obtain,  by  email,  new  sequence
entries relevant  to their  field(s) of  interest. Various  criteria can be
combined:

-  By  entering  one  or  more  words  that  should  be  present  in  the
   description line;
-  By entering one or more species name(s) or taxonomic division(s);
-  By entering one or more keywords;
-  By entering one or more author names;
-  By entering  the accession number (or entry name) of a PROSITE pattern
   or a user-defined sequence pattern;
-  By entering the accession number (or entry name) of an existing SWISS-
   PROT entry or by entering a private sequence.

Every week,  the new  sequences entered  in  SWISS-PROT  are  automatically
compared with  all the  criteria that  have been defined by the users. If a
sequence corresponds  to the  selection criteria  defined by  a user,  that
sequence is sent by electronic mail.


5.3  What is new on ExPASy

ExPASy is  constantly modified  and improved. If you wish to be informed of
the changes made to the server, you can either:

-  Read the  document History  of changes,  improvements and new features
   which is available at the address: http://www.expasy.ch/history.html
-  Subscribe to  Swiss-Flash, a  service that  reports news of databases,
   software and service developments. By subscribing to this service, you
   will automatically  get Swiss-Flash  bulletins by  electronic mail. To
   subscribe, use the address: http://www.expasy.ch/swiss-flash/

Among all  the improvements and the new features introduced during the last
three months,  here are  those that  we believe  are specifically useful to
SWISS-PROT users:

1. We have switched our default view of SWISS-PROT entries to that provided
   by the  NiceProt tool. NiceProt offers  a user-friendly  tabular view of
   SWISS-PROT  entries.  Access  to   the  original  SWISS-PROT  format  is
   maintained and  is directly available from the NiceProt view. Tools with
   similar functionalities have been  developed to  display the  ENZYME and
   PROSITE databases (see sections 8.1 and 8.2).
2. We have  revised  the  ExPASy file  and directory structure, in order to
   have the  vast amount  of data  that has accumulated on the server since
   September 1993  available in a more structured manner, and to facilitate
   replication on our mirror sites. This has caused certain changes in html
   links, and you should update your bookmarks and links accordingly. If in
   doubt, please refer to the document 'How to create html links to ExPASy'
   (http://www.expasy.ch/expasy_urls.html). At the same  time  we  wish  to
   reiterate our  announcement of the  ExPASy  mirror  sites  in  Australia
   (http://expasy.proteome.org.au/) and Taiwan (http://expasy.nhri.org.tw/).
   For your own convenience,  please use  the mirror  site closest  to you.
   Please also make sure  to update all bookmarks or links that use the old
   domain expasy.hcuge.ch,  which was replaced by  www.expasy.ch  in  March
   1997! The 'expasy.hcuge.ch' address might be disabled in the near future.
3. WWW  links  have  been  implemented  between  SWISS-PROT  and  CarbBank,
   EcoGene and ZFIN.



6.  TREMBL - A SUPPLEMENT TO SWISS-PROT

The ongoing  genome  sequencing  and  mapping  projects  have  dramatically
increased the  number of  protein sequences  to be incorporated into SWISS-
PROT. Since we do not want to dilute the quality standards of SWISS-PROT by
incorporating sequences  into the database without proper sequence analysis
and annotation,  we cannot  speed up the incorporation of new incoming data
indefinitely. But  as we  also want to make the sequences available as fast
as possible,  we have  introduced  with  SWISS-PROT  a  computer  annotated
supplement. This  supplement consists  of entries in SWISS-PROT-like format
derived from  the translation  of all  coding sequences  (CDS) in  the EMBL
nucleotide sequence database, except those already included in SWISS-PROT.

This supplement  is  named  TrEMBL  (Translation  from  EMBL).  It  can  be
considered as  a preliminary section of SWISS-PROT. This SWISS-PROT release
is supplemented by TrEMBL release 11. TrEMBL is split into two main
sections, SP-TrEMBL and REM-TrEMBL:

SP-TrEMBL (SWISS-PROT  TrEMBL) contains the entries (199'794 in release 11)
which should  be incorporated into SWISS-PROT. SWISS-PROT accession numbers
have been assigned for all SP-TrEMBL entries.

REM-TrEMBL (REMaining  TrEMBL) contains  the entries (45'967 in release 11)
that we  do not  want to  include in  SWISS-PROT for  a variety  of reasons
(synthetic sequences,  pseudogenes, translations  of incorrect open reading
frames,  fragments  with  fewer than  eight  amino  acids,  patent-derived
sequences, immunoglobulins and T-cell receptors, etc.).

TrEMBL is available by FTP from the EBI and ExPASy servers in the directory
'databases/trembl'. It  can be  queried on  WWW by  the EBI  and ExPASy  SRS
servers. It  is also  searchable on  the FASTA, BIC_SW and BLAST servers of
the EBI.



7.  FTP ACCESS TO SWISS-PROT AND TREMBL

7.1  Generalities

SWISS-PROT is  available  for  download  on  the  following  anonymous  FTP
servers:

Organization   Swiss Institute of Bioinformatics (SIB)
Address        ftp.expasy.ch
Directory      /databases/swiss-prot/

Organization   European Bioinformatics Institute (EBI)
Address        ftp.ebi.ac.uk
Directory      /pub/databases/swissprot/


7.2  Weekly updates of SWISS-PROT

Weekly updates  of SWISS-PROT  are available  by anonymous FTP. Three files
are generated at each update:

new_seq.dat    Contains all the new entries since the last full release;
upd_seq.dat    Contains the  entries for  which the  sequence data has been
               updated since the last release;
upd_ann.dat    Contains the entries for which one or more annotation fields
               have been updated since the last release.

Important notes

o Although we try to follow a regular schedule, we do not promise to update
  these files  every week.  In most  cases two weeks may elapse between two
  updates.
o Instead of  using the  above files,  you can,  every  week,  download  an
  updated copy  of the  SWISS-PROT database.  This file is available in the
  directory containing the non-redundant database (see next section).
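
A minimal sketch of how the three weekly files could be overlaid on a local
copy of the last full release is given below. It assumes that the update
files use the normal SWISS-PROT flat-file layout (each entry ends with a
"//" line and carries an AC line whose first accession number is the primary
one) and that entries are keyed by primary accession number. The function
names are hypothetical, and entries that were deleted or merged since the
release are not handled.

```python
def split_entries(text):
    """Split flat-file text into entries; each entry ends with a '//' line."""
    entries, current = [], []
    for line in text.splitlines():
        current.append(line)
        if line.strip() == "//":
            entries.append("\n".join(current) + "\n")
            current = []
    return entries


def primary_accession(entry):
    """Return the first accession number found on the first AC line."""
    for line in entry.splitlines():
        if line.startswith("AC "):
            return line[2:].split(";")[0].strip()
    raise ValueError("entry has no AC line")


def apply_updates(release, *update_texts):
    """Overlay update files on a {primary accession: entry} dictionary.

    An entry found in an update file replaces the entry with the same
    primary accession number, or is added if it is new.
    """
    merged = dict(release)
    for text in update_texts:
        for entry in split_entries(text):
            merged[primary_accession(entry)] = entry
    return merged
```

The contents of new_seq.dat, upd_seq.dat and upd_ann.dat would be passed as
the update_texts arguments, in any order, since each entry should appear in
only one of the three files.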


7.3  Non-redundant database

More than  a year  ago, we  started to distribute on the ExPASy and EBI FTP
servers, files  that make  up a  non-redundant (see  further) and  complete
protein sequence database consisting of three components:

1) SWISS-PROT
2) TrEMBL
3) New  entries to  be later  integrated into  TrEMBL (hereafter  known  as
TrEMBL_New)

Every week  three files  are completely  rebuilt. These  files  are  named:
sprot.dat.Z, trembl.dat.Z  and trembl_new.dat.Z.  As indicated  by their .Z
extension these  are Unix  compress format  files which, when decompressed,
will produce ASCII files in SWISS-PROT format.

Three  other  files  are  also  available  (sprot.fas.Z,  trembl.fas.Z  and
trembl_new.fas.Z) which  are compressed  fasta format sequence files useful
for building  the  databases  used  by  FASTA,  BLAST  and  other  sequence
similarity search  programs. Please  do not  use these  files for any other
purpose, as  you will  lose all  annotations by  using this  very primitive
format.
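
To show why annotations are lost, here is a minimal Python sketch of a
reader for decompressed fasta format files: each record is reduced to one
header line starting with ">" and the residues that follow it. The function
name read_fasta is hypothetical, and the exact layout of the header lines
in the distributed files is not assumed.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of text lines."""
    header, chunks = None, []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif line:
            chunks.append(line.strip())
    if header is not None:
        yield header, "".join(chunks)
```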

The files  for the  non-redundant database  are  stored  in  the  directory
/databases/sp_tr_nrdb on  the ExPASy  FTP server (ftp.expasy.ch) and in the
directory /pub/databases/sp_tr_nrdb on the EBI FTP server (ftp.ebi.ac.uk).

Additional notes

o The SWISS-PROT  file continuously  grows as  new annotated  sequences are
  added.

o The TrEMBL file decreases in size as sequences are annotated and moved
  out of that section into SWISS-PROT. Four times a year a new release of
  TrEMBL is built at EBI; at this point the TrEMBL file increases in size,
  as it then includes all of the new data (see next section) that has
  accumulated since the last release.

o The TrEMBL_New file starts as a very small file and grows in size until a
  new release of TrEMBL is available.

o SWISS-PROT and  TrEMBL  share  the  same  system  of  accession  numbers.
  Therefore you  will not  find any  primary  accession  number  duplicated
  between the  two sections.  A TrEMBL  entry (and its associated accession
  number(s)) can either move to SWISS-PROT as a new entry or be merged with
  an existing SWISS-PROT entry. In the latter case, the accession number(s)
  of that TrEMBL entry are added to those of the SWISS-PROT entry.

o TrEMBL_New does not have real accession numbers. However it was necessary
  to have  an AC  line so  as to  be able to use it with different software
  products. This  AC line contains a temporary identifier which consists of
  the protein_ID  (protein sequence  identifier) of  the coding sequence in
  the parent nucleotide sequence.

o TrEMBL_New is  quite messy!  You will of course find new sequence entries
  but you will also encounter sequences that are going to be used to update
  existing TrEMBL  or SWISS-PROT entries. None of the "cleaning" steps that
  are applied to produce a TrEMBL release are run on TrEMBL_New nor are any
  of the  computer-annotation software  tools that  are used to enhance the
  information content  of TrEMBL. TrEMBL_New is provided only so that users
  can be  sure not  to miss  any important  new  sequences  when  they  run
  similarity searches.

o While these three files allow you to build what we call a non-redundant
  database, it must be noted that this description is not completely
  accurate. Without going into a long explanation, we can say that this is
  currently the best attempt at providing a complete selection of protein
  sequence entries while trying to eliminate redundancies. While SWISS-PROT
  is completely (well, 99.994%!) non-redundant, TrEMBL is far from being
  non-redundant, and the combination of SWISS-PROT and TrEMBL is even less
  so.

o To describe  to your users the version of the non-redundant database that
  you are providing them with, you should use a statement of the form:

     SWISS-PROT release 38 and updates until <current_date>;
     TrEMBL  release  11  minus  data  integrated  into  SWISS-PROT  as  of
     <current_date>;
     New preliminary TrEMBL entries created since release 11 of TrEMBL
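
Sites that rebuild the non-redundant database every week may prefer to
generate the statement above automatically. The following Python sketch
fills in the template; the function name nrdb_statement and the date format
used in the usage example are hypothetical.

```python
def nrdb_statement(sprot_release, trembl_release, current_date):
    """Return the three-line description of a non-redundant database build."""
    return (
        f"SWISS-PROT release {sprot_release} and updates until "
        f"{current_date};\n"
        f"TrEMBL release {trembl_release} minus data integrated into "
        f"SWISS-PROT as of {current_date};\n"
        f"New preliminary TrEMBL entries created since release "
        f"{trembl_release} of TrEMBL"
    )
```

For example, nrdb_statement(38, 11, "01-Aug-1999") produces the statement
for this release with an illustrative build date.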



8.  ENZYME AND PROSITE

8.1  The ENZYME nomenclature database

Release 25.0  of the  ENZYME  nomenclature  database  is  distributed  with
release 38 of SWISS-PROT. ENZYME release 25.0 contains information relative
to 3704  enzymes. In  this release,  we have  added a significant number of
synonyms (AN lines) to a number of entries.

The WWW  version of  ENZYME on  ExPASy now  provides a  more  user-friendly
tabular view of enzyme entries through a new tool called NiceZyme. NiceZyme
also provides  direct links,  through  Medline,  to  literature  references
relevant to a specific enzyme. You can use this tool to link to any ENZYME
entry by using the following type of URL:

              http://www.expasy.ch/cgi-bin/nicezyme.pl?a.b.c.d

(where a.b.c.d is any valid enzyme EC number; example: 1.2.1.1).
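
A link of this type can be built programmatically, as in the following
Python sketch. The function name nicezyme_url is hypothetical, and the check
only tests that the argument has the shape of four dot-separated numbers; it
does not verify that the EC number exists in ENZYME.

```python
import re

NICEZYME = "http://www.expasy.ch/cgi-bin/nicezyme.pl"

# Four dot-separated numbers such as 1.2.1.1 (assumed shape of an EC number).
EC_PATTERN = re.compile(r"\d+\.\d+\.\d+\.\d+")


def nicezyme_url(ec_number):
    """Return the NiceZyme URL for an EC number given as 'a.b.c.d'."""
    if not EC_PATTERN.fullmatch(ec_number):
        raise ValueError("not an a.b.c.d EC number: " + repr(ec_number))
    return NICEZYME + "?" + ec_number
```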

Please also  note that  the URL  of the  top page  of ENZYME  has moved to:
http://www.expasy.ch/enzyme/


8.2  The PROSITE database

Release 16.0  of the  PROSITE database  is distributed  with release  38 of
SWISS-PROT. This  release of  PROSITE contains  1034 documentation  entries
that describe  1'374 different patterns, rules and profiles/matrices. Since
release 15.0, 20 entries have been added and 180 entries have been updated.

The WWW version of PROSITE on ExPASy now provides a more user-friendly
tabular view of PROSITE entries through a new tool called NiceSite. You can
use this tool to link to any PROSITE entry by using the following types of
URL:

             http://www.expasy.ch/cgi-bin/nicesite.pl?PSxxxxx

(where PSxxxxx is any valid PROSITE pattern or matrix entry) and

            http://www.expasy.ch/cgi-bin/nicedoc.pl?PDOCxxxxx

(where PDOCxxxxx is any valid PROSITE document entry).
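
Because the two kinds of PROSITE identifier are served by different
scripts, a program that builds such links has to choose the script from the
identifier prefix. The Python sketch below does so under the assumption
that "xxxxx" stands for five digits; the function name prosite_url is
hypothetical.

```python
import re

CGI_BASE = "http://www.expasy.ch/cgi-bin/"


def prosite_url(identifier):
    """Return the NiceSite or NiceDoc URL matching a PROSITE identifier."""
    if re.fullmatch(r"PS\d{5}", identifier):
        # Pattern or matrix entry.
        return CGI_BASE + "nicesite.pl?" + identifier
    if re.fullmatch(r"PDOC\d{5}", identifier):
        # Documentation entry.
        return CGI_BASE + "nicedoc.pl?" + identifier
    raise ValueError("not a PSxxxxx or PDOCxxxxx identifier: " + repr(identifier))
```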

Please also  note that  the URL  of the  top page  of PROSITE has moved to:
http://www.expasy.ch/prosite/



9.  WE NEED YOUR HELP!

We welcome feedback from our users. We would especially appreciate it if you
notify us when sequences belonging to your field of expertise are missing
from the database. We would also like to be notified about
annotations to  be updated,  if, for example, the function of a protein has
been clarified or if new information about post-translational modifications
has become  available. To  facilitate this feedback we offer, on the ExPASy
WWW server, a form that allows the submission of updates and/or corrections
to SWISS-PROT:

              http://www.expasy.ch/sprot/sp_update_form.html

It is  also possible,  from any entry in SWISS-PROT displayed by the ExPASy
server, to  submit updates  and/or corrections  for that  particular entry.
Finally, you can also send your comments by electronic mail to the address:

                           swiss-prot@expasy.ch

Note that  since January  1999, all  update requests  are assigned a unique
identifier of  the form  UR-Xnnnn (example:  UR-A0123). This  identifier is
used internally by the SWISS-PROT staff at SIB and EBI to track the
progress of requests and is also used in email exchanges with the persons
who submitted a request.
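
If you need to recognize such identifiers in your own correspondence
archive, a simple pattern match is enough. The Python sketch below assumes
that "X" stands for a single uppercase letter and "nnnn" for four digits,
as in the example UR-A0123; the names are hypothetical.

```python
import re

# UR-, one uppercase letter, four digits (assumed reading of UR-Xnnnn).
UR_PATTERN = re.compile(r"UR-[A-Z]\d{4}")


def is_update_request_id(text):
    """Return True if text has the form of an update request identifier."""
    return UR_PATTERN.fullmatch(text) is not None
```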



10.  JULY 1999 ANNOUNCEMENT: THE HUMAN PROTEOMICS INITIATIVE

In a  few months the combined efforts of a number of sequencing centers and
companies will  produce a first draft of the human genome sequence. Such an
endeavor is  only a  very preliminary  step in  the understanding  of human
biological processes. The first pitfall to overcome is the detection of all
coding regions  on the  genomic sequence.  Current algorithms,  while being
very powerful,  are not  capable of detecting with certainty all exons, are
not well  equipped to  distinguish different splice variants and are unable
to detect small proteins (which are numerous and crucial to many biological
processes). Even when all potential coding regions have been predicted, the
user community will have at its disposal the sequences of 80000 to
100000 naked proteins. We call these proteins naked because genomic
information does  not allow  the efficient  prediction  of  all  the  post-
translational modifications (PTM) of which the majority of proteins are the
target. Proteins,  once synthesized  on the  ribosomes, are  subject  to  a
multitude  of   modification  steps.   The  complexity  due  to  all  these
modifications is compounded by the high level of diversity that alternative
splicing can produce at the level of sequence. Thus the number of different
protein molecules  expressed by  the human  genome is  probably closer to a
million than  to  the  hundred  thousand  generally  considered  by  genome
scientists. Another factor of complexity to take into account is the amount
of polymorphism  at  the  protein  sequence  level.  While  some  of  these
polymorphisms are  linked to disease states, most are not, yet have in many
cases a direct or indirect effect on the activities of the proteins.

We therefore  are initiating  a major  project to  annotate all known human
sequences according  to the  quality standards  of SWISS-PROT.  This  means
providing, for  each known  protein, a  wealth of information that includes
the  description   of  its  function,  its  domain  structure,  subcellular
location, post-translational modifications, variants, similarities to other
proteins, etc.  There are currently slightly more than 5400 annotated human
sequences in  SWISS-PROT. These  entries are  associated with  about  14500
literature references;  16000 experimental  or predicted  PTMs, 800  splice
variants and  8000 polymorphisms  (most of  which are  linked with  disease
states). We  will use  the current information as the ground basis for what
we call the Human Proteomics Initiative (HPI).

The HPI  project contains  a number  of sub-components,  which are  briefly
described below:

- Annotation of  all known  human proteins.  In the course of the next nine
  months (from  July 1999 to end of March 2000) the human protein sequences
  that are  not yet  in SWISS-PROT  will be  fully annotated.  We will also
  review and  complete the  annotation of  the human sequences currently in
  SWISS-PROT. At the end of this nine-month period we expect to be complete
  and up-to-date  and to  hereafter keep up with the appearance of new data
  relevant to human proteins.
- Annotation of  mammalian orthologs  of human  proteins. We will make sure
  that for  any human  proteins,  existing  orthologs  in  other  mammalian
  species will  also be  annotated at  a level  equivalent to  that of  the
  cognate human sequences.
- Annotation of  all known  human polymorphisms  at  the  protein  sequence
  level. As  mentioned above,  SWISS-PROT already  holds information  on  a
  sizeable amount  of such  polymorphisms, and it will significantly expand
  its effort  to store  and annotate  all small  variations at  the protein
  level.
- Annotation  of   all  known  post-translational  modifications  in  human
  proteins. During  the next  nine months  a major  effort will  be made to
  supplement the  already quite  comprehensive description  of known  post-
  translational modifications  in  human  proteins  currently  provided  in
  SWISS-PROT.
- Tight links  to structural  information. SWISS-PROT  is tightly linked to
  the PDB/RCSB  3D-structure database  and already  includes many  features
  useful to  structural biologists.  These  tight  links  will  be  further
  expanded by  providing homology-derived models for all human proteins for
  which such an approach is scientifically relevant.

For all aspects of the HPI project, we would appreciate the help and
collaboration of the scientific community. Information concerning the human
proteome is  highly critical  to  a  large  section  of  the  life  science
community. We  therefore appeal  to the user community to fully participate
in this  initiative by  providing all the necessary information to help and
to speed up the comprehensive annotation of the human proteome.

The HPI project has two different time-related aspects: one is a
nine-month "marathon" to catch up with the current state of research; the
other is a long-term commitment to keep such a project alive for as long as
it is necessary. For a detailed description of the HPI project and its
current status please consult:

                      http://www.expasy.ch/sprot/hpi/



11.  JULY 1998 ANNOUNCEMENT: NEW SWISS-PROT FUNDING SCHEME

It has become obvious in recent years that the tremendous increase in data
flow has  created a  requirement for resources which cannot be addressed in
full by  public funding.  This is  causing databases  to  fall  behind  the
research. We believe that the only solution to the resource shortfall is to
ask commercial users to participate by paying a license fee. No fee is or
will be charged to academic users, nor will any restriction be imposed on
their use or reuse of the data. Both SWISS-PROT and PROSITE are affected
by these changes; ENZYME is not.

A document  fully describing  what will  be the  impact of  this change for
SWISS-PROT is  available with  the SWISS-PROT  distribution  files  on  FTP
(sp_info.txt). You  can also  access the document as well as other relevant
ones from:

                      http://www.expasy.ch/announce/
 http://www.ebi.ac.uk/swissprot/Information/Announcement/announcement.html

If you do not have the time to read this document, the most important take-
home message is that these changes do not have any impact on the way SWISS-
PROT or  PROSITE are  accessed or  redistributed. Academic  users  are  not
affected by  these changes.  Industrial end-users  are  also  not  directly
affected as  long as  their employer  pays the  license fee. The same holds
true for bioinformatics companies. Academic software or database developers
as well  as providers  of database distribution services are only minimally
affected by  these changes. We hope to be able to keep the spirit of SWISS-
PROT and  PROSITE alive  and  at  the  same  time  ensure  their  long-term
financial survival.  We sincerely  hope and  believe that  in the  next two
years the  only change  that will  matter will be the increase in scope and
timeliness of the databases.


  ========================================================================


                         APPENDIX A: SOME STATISTICS


   A.1  Amino acid composition

        A.1.1  Composition in percent for the complete data bank

   Ala (A) 7.58   Gln (Q) 3.97   Leu (L) 9.43   Ser (S) 7.13
   Arg (R) 5.16   Glu (E) 6.36   Lys (K) 5.94   Thr (T) 5.67
   Asn (N) 4.44   Gly (G) 6.84   Met (M) 2.37   Trp (W) 1.24
   Asp (D) 5.27   His (H) 2.24   Phe (F) 4.10   Tyr (Y) 3.19
   Cys (C) 1.66   Ile (I) 5.81   Pro (P) 4.92   Val (V) 6.58

   Asx (B) 0.001  Glx (Z) 0.001  Xaa (X) 0.01


        A.1.2  Classification of the amino acids by their frequency

   Leu, Ala, Ser, Gly, Val, Glu, Lys, Ile, Thr, Asp, Arg, Pro, Asn, Phe,
   Gln, Tyr, Met, His, Cys, Trp
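
   The classification in A.1.2 follows directly from the percentages in
   A.1.1. As a small worked check, the Python sketch below sorts the twenty
   standard amino acids by decreasing percentage; the names COMPOSITION and
   by_frequency are hypothetical.

```python
# Percentages copied from table A.1.1 (Asx, Glx and Xaa are left out).
COMPOSITION = {
    "Ala": 7.58, "Gln": 3.97, "Leu": 9.43, "Ser": 7.13,
    "Arg": 5.16, "Glu": 6.36, "Lys": 5.94, "Thr": 5.67,
    "Asn": 4.44, "Gly": 6.84, "Met": 2.37, "Trp": 1.24,
    "Asp": 5.27, "His": 2.24, "Phe": 4.10, "Tyr": 3.19,
    "Cys": 1.66, "Ile": 5.81, "Pro": 4.92, "Val": 6.58,
}


def by_frequency(composition):
    """Return the amino acid codes ordered from most to least frequent."""
    return sorted(composition, key=composition.get, reverse=True)
```

   Calling by_frequency(COMPOSITION) returns the codes in the same order as
   the list in A.1.2, from Leu to Trp.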



   A.2  Distribution of the sequences by their organism of origin

   Total number of species represented in this release of SWISS-PROT: 6580

   The first twenty species represent 37741 sequences: 47.2 % of the total
   number of entries.


   A.2.1 Table of the frequency of occurrence of species

        Species represented 1x: 3122
                            2x: 1013
                            3x:  509
                            4x:  363
                            5x:  243
                            6x:  225
                            7x:  154
                            8x:  127
                            9x:  105
                           10x:   62
                       11- 20x:  304
                       21- 50x:  191
                       51-100x:   73
                         >100x:   89
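
   The counts in table A.2.1 can be checked against the total number of
   species given in A.2. The Python sketch below adds up the classes and
   computes the share of species represented by a single entry; the names
   are hypothetical.

```python
# (class label, number of species) copied from table A.2.1.
SPECIES_CLASSES = [
    ("1x", 3122), ("2x", 1013), ("3x", 509), ("4x", 363), ("5x", 243),
    ("6x", 225), ("7x", 154), ("8x", 127), ("9x", 105), ("10x", 62),
    ("11-20x", 304), ("21-50x", 191), ("51-100x", 73), (">100x", 89),
]


def total_species(classes):
    """Sum the number of species over all frequency classes."""
    return sum(count for _, count in classes)


def percent_of_total(label, classes):
    """Return the share of one class in percent, rounded to one decimal."""
    counts = dict(classes)
    return round(100.0 * counts[label] / total_species(classes), 1)
```

   The classes add up to 6580, the total given in A.2, and species
   represented by only one entry account for 47.4 % of all species.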


   A.2.2  Table of the most represented species

  ------  ---------  --------------------------------------------
  Number  Frequency  Species
  ------  ---------  --------------------------------------------
       1       5406  Homo sapiens (Human)
       2       4811  Saccharomyces cerevisiae (Baker's yeast)
       3       4516  Escherichia coli
       4       3549  Mus musculus (Mouse)
       5       2630  Rattus norvegicus (Rat)
       6       2069  Bacillus subtilis
       7       2002  Caenorhabditis elegans
       8       1698  Haemophilus influenzae
       9       1438  Schizosaccharomyces pombe (Fission yeast)
      10       1313  Methanococcus jannaschii
      11       1149  Bos taurus (Bovine)
      12       1088  Drosophila melanogaster (Fruit fly)
      13        928  Mycobacterium tuberculosis
      14        894  Gallus gallus (Chicken)
      15        821  Arabidopsis thaliana (Mouse-ear cress)
      16        729  Xenopus laevis (African clawed frog)
      17        727  Salmonella typhimurium
      18        699  Synechocystis sp. (strain PCC 6803)
      19        670  Sus scrofa (Pig)
      20        604  Oryctolagus cuniculus (Rabbit)
      21        490  Mycoplasma pneumoniae
      22        469  Mycoplasma genitalium
      23        446  Zea mays (Maize)
      24        403  Rhizobium sp. (strain NGR234)
      25        382  Helicobacter pylori (Campylobacter pylori)
      26        368  Pseudomonas aeruginosa
      27        337  Oryza sativa (Rice)
      28        308  Canis familiaris (Dog)
      29        296  Nicotiana tabacum (Common tobacco)
      30        292  Dictyostelium discoideum (Slime mold)
      31        277  Treponema pallidum
      32        272  Bacteriophage T4
      33        269  Ovis aries (Sheep)
                269  Mycobacterium leprae
      35        266  Borrelia burgdorferi (Lyme disease spirochete)
      36        263  Pisum sativum (Garden pea)
      37        255  Methanobacterium thermoautotrophicum
      38        253  Vaccinia virus (strain Copenhagen)
      39        239  Glycine max (Soybean)
      40        228  Staphylococcus aureus
      41        227  Neurospora crassa
      42        226  Hordeum vulgare (Barley)
      43        221  Candida albicans (Yeast)
      44        219  Porphyra purpurea
      45        216  Archaeoglobus fulgidus
      46        211  Lycopersicon esculentum (Tomato)
      47        209  Triticum aestivum (Wheat)
      48        205  Solanum tuberosum (Potato)
      49        204  Rhodobacter capsulatus (Rhodopseudomonas capsulata)
      50        199  Klebsiella pneumoniae
      51        196  Pseudomonas putida
      52        193  Human cytomegalovirus (strain AD169)
      53        192  Bacillus stearothermophilus
      54        186  Vaccinia virus (strain WR)
      55        172  Cavia porcellus (Guinea pig)
      56        170  Agrobacterium tumefaciens
      57        169  Spinacia oleracea (Spinach)
      58        159  Chlamydomonas reinhardtii
      59        158  Rhizobium meliloti
      60        154  Autographa californica nuclear polyhedrosis virus
      61        153  Emericella nidulans (Aspergillus nidulans)
      62        152  Mesocricetus auratus (Golden hamster)
      63        151  Marchantia polymorpha (Liverwort)
      64        150  Streptomyces coelicolor
                150  Equus caballus (Horse)
      66        148  Guillardia theta (Cryptomonas phi)
      67        147  Cyanophora paradoxa
      68        146  Variola virus
      69        142  Lactococcus lactis (subsp. lactis) (Streptococcus lactis)
      70        139  Odontella sinensis
      71        134  Orgyia pseudotsugata multicapsid polyhedrosis virus
      72        133  Kluyveromyces lactis (Yeast)
      73        128  Brachydanio rerio (Zebrafish) (Zebra danio)
      74        127  Trypanosoma brucei brucei
                127  Synechococcus sp. (strain PCC 7942)
      76        126  Thermus aquaticus (subsp. thermophilus)
      77        120  Alcaligenes eutrophus
                118  Anabaena sp. (strain PCC 7120)
      79        116  Bombyx mori (Silk moth)
      80        115  Bradyrhizobium japonicum
      81        113  Yersinia enterocolitica
      82        112  Oncorhynchus mykiss (Rainbow trout) (Salmo gairdneri)
      83        111  Aquifex aeolicus
                108  Streptococcus pneumoniae
      85        107  Brassica napus (Rape)
      86        104  Neisseria gonorrhoeae
      87        103  Macaca mulatta (Rhesus macaque)
                103  Felis silvestris catus (Cat)
      89        102  Rhodobacter sphaeroides (Rhodopseudomonas sphaeroides)



   A.3  Distribution of the sequences by size

               From   To  Number             From   To   Number
                  1-  50    3213             1001-1100      722
                 51- 100    6704             1101-1200      553
                101- 150    9719             1201-1300      377
                151- 200    7640             1301-1400      251
                201- 250    7202             1401-1500      210
                251- 300    6703             1501-1600      133
                301- 350    6294             1601-1700      117
                351- 400    6438             1701-1800       89
                401- 450    4831             1801-1900       94
                451- 500    4566             1901-2000       65
                501- 550    3444             2001-2100       37
                551- 600    2308             2101-2200       80
                601- 650    1801             2201-2300       75
                651- 700    1326             2301-2400       40
                701- 750    1159             2401-2500       42
                751- 800     956             >2500          232
                801- 850     762
                851- 900     798
                901- 950     552
                951-1000     467
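
   Assuming that every entry falls into exactly one size class, the table
   above also gives the total number of sequences, which in turn allows the
   percentage quoted in A.2 for the first twenty species (37741 sequences)
   to be rechecked. The Python sketch below does both; the names are
   hypothetical.

```python
# (first length, last length, number of sequences) copied from table A.3;
# None as the last length means "no upper bound" (the >2500 class).
SIZE_BINS = [
    (1, 50, 3213), (51, 100, 6704), (101, 150, 9719), (151, 200, 7640),
    (201, 250, 7202), (251, 300, 6703), (301, 350, 6294), (351, 400, 6438),
    (401, 450, 4831), (451, 500, 4566), (501, 550, 3444), (551, 600, 2308),
    (601, 650, 1801), (651, 700, 1326), (701, 750, 1159), (751, 800, 956),
    (801, 850, 762), (851, 900, 798), (901, 950, 552), (951, 1000, 467),
    (1001, 1100, 722), (1101, 1200, 553), (1201, 1300, 377),
    (1301, 1400, 251), (1401, 1500, 210), (1501, 1600, 133),
    (1601, 1700, 117), (1701, 1800, 89), (1801, 1900, 94),
    (1901, 2000, 65), (2001, 2100, 37), (2101, 2200, 80),
    (2201, 2300, 75), (2301, 2400, 40), (2401, 2500, 42),
    (2501, None, 232),
]

TOP20_SEQUENCES = 37741  # figure quoted in section A.2


def total_sequences(bins):
    """Sum the number of sequences over all size classes."""
    return sum(count for _, _, count in bins)


def percent(part, whole):
    """Return part/whole in percent, rounded to one decimal."""
    return round(100.0 * part / whole, 1)
```

   The size classes add up to 80000 sequences, and
   percent(TOP20_SEQUENCES, 80000) gives 47.2, in agreement with A.2.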



   A.4  Longest sequences

   The longest sequences (>=4000 residues) are listed here:

                              BACA_BACLI  5255
                              HTS1_COCCA  5217
                              MUC2_HUMAN  5179
                              FAT_DROME   5147
                              RYNR_RABIT  5037
                              RYNR_PIG    5035
                              RYNR_HUMAN  5032
                              RYNC_RABIT  4969
                              LRP_CAEEL   4753
                              DYHC_DICDI  4725
                              PLEC_RAT    4687
                              LRP2_RAT    4660
                              LRP2_HUMAN  4655
                              DYHC_RAT    4644
                              DYHC_DROME  4639
                              DYHC_CAEEL  4568
                              DYHB_CHLRE  4568
                              APB_HUMAN   4563
                              APOA_HUMAN  4548
                              LRP1_HUMAN  4544
                              LRP1_CHICK  4543
                              DYHC_PARTE  4540
                              RRPA_CVMJH  4488
                              DYHG_CHLRE  4485
                              DYHC_ANTCR  4466
                              DYHC_TRIGR  4466
                              GRSB_BACBR  4451
                              PKSK_BACSU  4447
                              PKSL_BACSU  4427
                              PGBM_HUMAN  4393
                              YP73_CAEEL  4385
                              DYHC_NEUCR  4367
                              DYHC_FUSSO  4349
                              DYHC_EMENI  4344
                              PKD1_HUMAN  4303
                              DYHC_SCHPO  4196
                              DYHC_YEAST  4092
                              RRPA_CVH22  4085
                              RRPL_DUGBV  4036


   A.5  Statistics for journal citations


   Total number of journals cited in this release of SWISS-PROT: 1011


   A.5.1 Table of the frequency of journal citations

        Journals cited 1x: 381
                       2x: 130
                       3x:  84
                       4x:  46
                       5x:  39
                       6x:  23
                       7x:  15
                       8x:  15
                       9x:  14
                      10x:  14
                  11- 20x:  75
                  21- 50x:  71
                  51-100x:  24
                    >100x:  80


   A.5.2  List of the most cited journals in SWISS-PROT

   Nb    Citations   Journal abbreviation
   --    ---------   ----------------------------------
    1    6683        J. Biol. Chem.
    2    4031        Proc. Natl. Acad. Sci. U.S.A.
    3    3434        Nucleic Acids Res.
    4    2868        J. Bacteriol.
    5    2714        Gene
    6    2162        FEBS Lett.
    7    2046        Eur. J. Biochem.
    8    1915        Biochem. Biophys. Res. Commun.
    9    1888        Biochemistry
   10    1788        EMBO J.
   11    1684        Nature
   12    1542        Biochim. Biophys. Acta
   13    1462        J. Mol. Biol.
   14    1321        Cell
   15    1240        Mol. Cell. Biol.
   16    1042        Genomics
   17     999        Mol. Gen. Genet.
   18     987        Plant Mol. Biol.
   19     956        Biochem. J.
   20     867        Science
   21     828        Mol. Microbiol.
   22     786        Virology
   23     714        J. Biochem.
   24     534        J. Virol.
   25     487        Yeast
   26     485        J. Cell Biol.
   27     465        Plant Physiol.
   28     465        J. Gen. Virol.
   29     437        Hum. Mol. Genet.
   30     427        Genes Dev.
   31     398        Hum. Mutat.
   32     371        J. Immunol.
   33     367        Arch. Biochem. Biophys.
   34     348        Infect. Immun.
   35     346        Oncogene
   36     336        Structure
   37     329        Curr. Genet.
   38     311        Mol. Biochem. Parasitol.
   39     307        FEMS Microbiol. Lett.
   40     307        Am. J. Hum. Genet.
   41     301        Nat. Genet.
   42     267        Development
   43     265        Biol. Chem. Hoppe-Seyler
   44     256        Microbiology
   45     252        J. Clin. Invest.
   46     250        Mol. Endocrinol.
   47     249        Nat. Struct. Biol.
   48     234        J. Mol. Evol.
   49     233        Hum. Genet.
   50     231        Genetics
   51     222        J. Gen. Microbiol.
   52     213        Hoppe-Seyler's Z. Physiol. Chem.
   53     206        DNA Cell Biol.
   54     204        Appl. Environ. Microbiol.
   55     196        Protein Sci.
   56     193        J. Exp. Med.
   57     193        Blood
   58     189        Dev. Biol.
   59     184        Neuron
   60     164        Immunogenetics
   61     152        DNA Seq.
   62     152        DNA
   63     151        Endocrinology
   64     140        Plant Cell
   65     132        Cancer Res.
   66     125        Plant J.
   67     119        Mol. Biol. Evol.
   68     118        Brain Res. Mol. Brain Res.
   69     117        Mech. Dev.
   70     117        J. Neurochem.
   71     117        Biochimie
   72     116        Hemoglobin
   73     116        Bioorg. Khim.
   74     115        Acta Crystallogr. D
   75     113        Comp. Biochem. Physiol.
   76     111        Virus Res.
   77     110        Agric. Biol. Chem.
   78     106        Mamm. Genome
   79     106        J. Neurosci.
   80     103        Biosci. Biotechnol. Biochem.

  ========================================================================


   APPENDIX B: RELATIONSHIPS BETWEEN SWISS-PROT AND SOME BIOMOLECULAR
               DATABASES

   The current  status of  the relationships (cross-references) between
   SWISS-PROT and some biomolecular databases is shown in the following
   schematic:


                         ***********************
                         *  EMBL Nucleotide    *
                         *  Sequence Database  *
                         *       [EBI]         *
                         ***********************
                           ^ ^ ^  ^  ^ ^ ^ ^ ^
******************         | | |  I  | | | | |         **********************
* FlyBase        * <-------+ | |  I  | | | | +-------> * MGD [Mouse]        *
******************         | | |  I  | | | | |         **********************
                           | | |  I  | | | | |
******************         | | |  I  | | | | |         **********************
* SubtiList      * <---------+ |  I  | | | +---------> * GCRDb [7TM recep.] *
* [B.subtilis]   *         | | |  I  | | | | |         **********************
******************         | | |  I  | | | | |
                           | | |  I  | | | | |         **********************
******************         | | |  I  | | +-----------> * EcoGene [E.coli]   *
* Mendel [Plant] * <-----+ | | |  I  | | | | |         **********************
******************       | | | |  I  | | | | |
                         | | | |  I  | | | | |         **********************
******************       | | | |  I  +---------------> * SGD [Yeast]        *
* MaizeDb        * <-----------+  I  | | | | |         **********************
* [Zea mays]     *       | | | |  I  | | | | |
******************       | | | |  I  | | | | |         **********************
                         | | | |  I  | +-------------> * DictyDB [D.disco.] *
******************       | | | |  I  | | | | |         **********************
* WormPep        *       | | | |  I  | | | | |
* [C.elegans]    * <---+ | | | |  I  | | | | |         **********************
******************     | | | | |  I  | | | | | +-----> * ENZYME [Nomencl.]  *
                       | | | | |  I  | | | | | |       **********************
******************     | v v v v  v  v v v v v v           v
* REBASE         *     *************************       **********************
* [Restriction   * <-- *   SWISS-PROT          * ----> * OMIM [Human]       *
*  enzymes]      *     *   Protein Sequence    *       **********************
******************     *   Data Bank           *
                       *************************       **********************
******************      ^ ^ ^ ^ ^ ^ ^ | ^ ^ ^          * ECO2DBASE     [2D] *
* StyGene        *      | | | | | | | | | | +--------> **********************
* [S.typhimurium]* <----+ | | | | | | | | |
******************        | | | | | | | | |            **********************
                          | | | | | | | | +----------> * Maize-2DPAGE  [2D] *
******************        | | | | | | | |              **********************
* TRANSFAC       * <------+ | | | | | | |
******************          | | | | | | |              **********************
                            | | | | | | +------------> * SWISS-2DPAGE  [2D] *
******************          | | | | | |                **********************
* Harefield [2D] * <--------+ | | | | |
******************            | | | | |                **********************
                              | | | | +--------------> * Aarhus/Ghent  [2D] *
******************            | | | |                  **********************
* PROSITE        *            | | | |
* [Patterns and  * <----------+ | | +----------------> **********************
* profiles]      *              | |                    * YEPD [Yeast]  [2D] *
******************              | +----------------+   **********************
             |                  v                  |
             |          ***********************    +-> **********************
             +--------> * PDB [3D structures] * <----- * HSSP [3D similar.] *
                        ***********************        **********************

  =End=of=SWISS-PROT=release=38=notes=====================================
  

Swiss-Prot release 37.0

Published December 1, 1998
                  SWISS-PROT RELEASE 37.0 RELEASE NOTES

!! Important: do not forget to read section 10 of these release notes. It
contains an important announcement relevant to SWISS-PROT and PROSITE !!



                           1.  INTRODUCTION


Release 37.0  of SWISS-PROT  contains 77'977 sequence entries, comprising
28'268'293 amino acids abstracted from 62'513 references. This represents
an increase  of 5.3%  over release  36. The  growth of  the data  bank is
summarized below.

 Release      Date           Number of       Number of amino
                               entries                 acids
    2.0       09/86               3939               900 163
    3.0       11/86               4160               969 641
    4.0       04/87               4387             1 036 010
    5.0       09/87               5205             1 327 683
    6.0       01/88               6102             1 653 982
    7.0       04/88               6821             1 885 771
    8.0       08/88               7724             2 224 465
    9.0       11/88               8702             2 498 140
   10.0       03/89              10008             2 952 613
   11.0       07/89              10856             3 265 966
   12.0       10/89              12305             3 797 482
   13.0       01/90              13837             4 347 336
   14.0       04/90              15409             4 914 264
   15.0       08/90              16941             5 486 399
   16.0       11/90              18364             5 986 949
   17.0       02/91              20024             6 524 504
   18.0       05/91              20772             6 792 034
   19.0       08/91              21795             7 173 785
   20.0       11/91              22654             7 500 130
   21.0       03/92              23742             7 866 596
   22.0       05/92              25044             8 375 696
   23.0       08/92              26706             9 011 391
   24.0       12/92              28154             9 545 427
   25.0       04/93              29955            10 214 020
   26.0       07/93              31808            10 875 091
   27.0       10/93              33329            11 484 420
   28.0       02/94              36000            12 496 420
   29.0       06/94              38303            13 464 008
   30.0       10/94              40292            14 147 368
   31.0       02/95              43470            15 335 248
   32.0       11/95              49340            17 385 503
   33.0       02/96              52205            18 531 384
   34.0       10/96              59021            21 210 389
   35.0       11/97              69113            25 083 768
   36.0       07/98              74019            26 840 295
   37.0       12/98              77977            28 268 293
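
As a quick check of the growth figure quoted above, the following Python
sketch recomputes the increase from the last two rows of the table. The
variable and function names are illustrative only and are not part of any
SWISS-PROT tool.

```python
# Entry and amino-acid counts copied from the last two rows of the table above.
releases = {
    "36.0": {"entries": 74019, "amino_acids": 26_840_295},
    "37.0": {"entries": 77977, "amino_acids": 28_268_293},
}


def percent_increase(old: int, new: int) -> float:
    """Return the relative increase from old to new, in percent."""
    return (new - old) / old * 100.0


entry_growth = percent_increase(
    releases["36.0"]["entries"], releases["37.0"]["entries"]
)
residue_growth = percent_increase(
    releases["36.0"]["amino_acids"], releases["37.0"]["amino_acids"]
)
print(f"entries: +{entry_growth:.1f}%  amino acids: +{residue_growth:.1f}%")
```

Both the entry count and the amino-acid count round to the same 5.3%.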



     2.  DESCRIPTION OF THE CHANGES MADE TO SWISS-PROT SINCE RELEASE 36


2.1  Sequences and annotations

3'988 sequences  have been  added since  release 36, the sequence data of
667 existing  entries has  been updated  and the  annotations  of  12'047
entries have been revised.


2.2  What's happening with the model organisms

We have  selected a  number of  organisms that  are the  target of genome
sequencing and/or mapping projects and for which we intend to:

o  Be as  complete as possible.  All sequences  available at a given time
   should  be  immediately  included  in  SWISS-PROT.  This also includes
   sequence corrections and updates;
o  Provide a higher level of annotation;
o  Provide  cross-references  to  specialized  database(s) that  contain,
   among other  data,  some genetic information about the genes that code
   for these proteins;
o  Provide specific indices or documents.

Here is the current status of the model organisms in SWISS-PROT:

 Organism        Database            Index file       Number of
                 cross-referenced                     sequences
 --------------  ----------------    --------------   ---------
 A.thaliana      None yet            In preparation         792
 B.subtilis      SubtiList           SUBTILIS.TXT          2046
 C.albicans      None yet            CALBICAN.TXT           194
 C.elegans       Wormpep             CELEGANS.TXT          1956
 D.discoideum    DictyDB             DICTY.TXT              285
 D.melanogaster  FlyBase             FLY.TXT               1064
 E.coli          EcoGene             ECOLI.TXT             4476
 H.influenzae    HiDB (TIGR)         HAEINFLU.TXT          1701
 H.sapiens       MIM                 MIMTOSP.TXT           5146
 H.pylori        HpDB (TIGR)         HPYLORI.TXT            367
 M.genitalium    MgDB (TIGR)         MGENITAL.TXT           470
 M.musculus      MGD                 MGDTOSP.TXT           3387
 M.jannaschii    MjDB (TIGR)         MJANNASC.TXT          1307
 M.tuberculosis  None yet            None yet               918
 S.cerevisiae    SGD                 YEAST.TXT             4806
 S.typhimurium   StyGene             SALTY.TXT              723
 S.pombe         None yet            POMBE.TXT             1406
 S.solfataricus  None yet            None yet                84

We  plan  to  finish  as  quickly  as  possible  the  annotation  of  the
Escherichia coli,  Haemophilus influenzae,  Methanococcus jannaschii  and
yeast (S.cerevisiae)  sequence entries  which are  not yet part of SWISS-
PROT.


2.3  Switch to the NCBI taxonomy

To contribute  to the standardization of the taxonomies used in molecular
sequence databases  we have changed our taxonomy with release 37. We have
switched  to   the  NCBI   taxonomy,  which   is  already   used  by  the
DDBJ/EMBL/GenBank   nucleotide    sequence   databases.   The   taxonomic
classification maintained  at the  NCBI  is  available  from  the  server
http://www.ncbi.nlm.nih.gov/Taxonomy.

This modification affects the OC (Organism Classification) lines. However
it has  no impact  on the  format of that line-type, only on its content.
For example, the OC lines for Homo sapiens (human) used to be:

OC   EUKARYOTA; METAZOA; CHORDATA; VERTEBRATA; TETRAPODA; MAMMALIA;
OC   EUTHERIA; PRIMATES.

and are now:

OC   EUKARYOTA; METAZOA; CHORDATA; VERTEBRATA; MAMMALIA; EUTHERIA;
OC   PRIMATES; CATARRHINI; HOMINIDAE; HOMO.

The switch  to  the  new  taxonomy  indirectly  brings  along  additional
changes. Most of these changes are subtle, yet they may have an impact on
some users  and some  specific usage of SWISS-PROT. We will describe here
some of these changes.

The NCBI taxonomy is much more detailed than that formerly used by SWISS-
PROT. The  number of  nodes listed in the OC lines is therefore generally
larger. For example, the taxonomic lineage for Pisum sativum (garden pea)
used to be:

OC   EUKARYOTA; PLANTA; EMBRYOPHYTA; ANGIOSPERMAE; DICOTYLEDONEAE;
OC   FABALES; FABACEAE.

It is now:

OC   EUKARYOTA; VIRIDIPLANTAE; STREPTOPHYTA; EMBRYOPHYTA; TRACHEOPHYTA;
OC   EUPHYLLOPHYTES; SPERMATOPHYTA; MAGNOLIOPHYTA; EUDICOTYLEDONS;
OC   ROSIDAE; FABALES; FABACEAE; PAPILIONOIDEAE; PISUM.
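
Because only the content of the OC lines changes, a parser that splits on
semicolons keeps working; it simply returns more nodes. The sketch below
is a minimal illustration in Python. The function name 'parse_oc_lines'
is hypothetical, and the code assumes that nodes are separated by ';' and
that the last OC line ends with a period, as in the examples above.

```python
def parse_oc_lines(lines):
    """Join the OC lines of one entry and return the lineage as a list."""
    # Drop the two-letter line code and merge continuation lines.
    text = " ".join(line[2:].strip() for line in lines if line.startswith("OC"))
    # The last OC line is terminated by a period.
    text = text.rstrip(".")
    return [node.strip() for node in text.split(";") if node.strip()]


pea = [
    "OC   EUKARYOTA; VIRIDIPLANTAE; STREPTOPHYTA; EMBRYOPHYTA; TRACHEOPHYTA;",
    "OC   EUPHYLLOPHYTES; SPERMATOPHYTA; MAGNOLIOPHYTA; EUDICOTYLEDONS;",
    "OC   ROSIDAE; FABALES; FABACEAE; PAPILIONOIDEAE; PISUM.",
]
lineage = parse_oc_lines(pea)
print(len(lineage), lineage[0], lineage[-1])
```

Software that assumed a fixed maximum number of nodes, or a fixed position
for a given rank, is the kind of usage most likely to need adjusting.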

The names  of the  taxonomic kingdoms  at the  root of the NCBI taxonomic
tree differ from the old SWISS-PROT taxonomy in the following manner:

        NCBI        Old SWISS-PROT
        ----------  --------------
        Archaea     Archaebacteria
        Bacteria    Prokaryota
        Eukaryota   Eukaryota
        Viruses     Viridae

This is important for users selecting a subset of the database based on a
particular taxonomic kingdom.
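
Scripts that select entries by the first OC node could translate the old
kingdom names with a small lookup table. The following Python sketch only
encodes the four pairs listed above; the names 'OLD_TO_NCBI_KINGDOM' and
'to_ncbi_kingdom' are illustrative and not part of any distributed tool.

```python
# Old SWISS-PROT kingdom name -> NCBI kingdom name (from the table above).
# Upper case is used because OC lines are upper case in release 37.
OLD_TO_NCBI_KINGDOM = {
    "ARCHAEBACTERIA": "ARCHAEA",
    "PROKARYOTA": "BACTERIA",
    "EUKARYOTA": "EUKARYOTA",
    "VIRIDAE": "VIRUSES",
}


def to_ncbi_kingdom(name: str) -> str:
    """Translate an old kingdom name; names already in NCBI form pass through."""
    key = name.strip().upper()
    return OLD_TO_NCBI_KINGDOM.get(key, key)


print(to_ncbi_kingdom("Prokaryota"))
```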

We  also   changed  the   names  of   the  corresponding   files  in  the
special_selection section  of the anonymous FTP server (see section 7.1).
The files:

archaebacteria.seq.xxxxxx
eukaryota     .seq.xxxxxx
prokaryota    .seq.xxxxxx
viridae       .seq.xxxxxx

(where 'xxxxxx' is the date the file was created) are now renamed:

archaea       .seq.xxxxxx
eukaryota     .seq.xxxxxx
bacteria      .seq.xxxxxx
viruses       .seq.xxxxxx

The format and content of the 'speclist.txt' documentation file (see
section 4) have changed. It no longer contains the section that used to
list the taxonomic nodes as it would now be too cumbersome to be included
in such a document. The SWISS-PROT taxonomic node code is replaced by the
NCBI  taxonomy  ID  (TaxID).  As  the  NCBI  code  does  not  convey  any
information per  se on  which taxonomic  kingdom a species belongs to, we
have followed each organism code by a letter that indicates the taxonomic
kingdom a species belongs to. It can be one of the following:

'A' for archaea (=archaebacteria);
'B' for bacteria (=prokaryota or eubacteria);
'E' for eukaryota;
'V' for viruses and phages (=viridae).

Example:

DROME E 007227: N=Drosophila melanogaster
                C=Fruit fly

On the ExPASy WWW version (http://www.expasy.ch/cgi-bin/speclist) of this
document, the  NCBI TaxID  is an active link to the NCBI server, querying
the Taxonomic database on the lineage of the selected organism.
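
The entry layout shown in the example can be read with a short regular
expression. The Python sketch below is based only on that single example,
so the exact column layout of the real file is an assumption; the names
'parse_speclist_entry' and 'KINGDOMS' are hypothetical.

```python
import re

# Kingdom letters as described above.
KINGDOMS = {
    "A": "archaea",
    "B": "bacteria",
    "E": "eukaryota",
    "V": "viruses and phages",
}

# First line of an entry: organism code, kingdom letter, TaxID, official name.
HEADER = re.compile(
    r"^(?P<code>\S+)\s+(?P<kingdom>[ABEV])\s+(?P<taxid>\d+):\s+N=(?P<name>.+)$"
)


def parse_speclist_entry(lines):
    """Parse one species entry (first line plus optional 'C=' lines)."""
    match = HEADER.match(lines[0].rstrip())
    if match is None:
        raise ValueError(f"not a species entry: {lines[0]!r}")
    entry = {
        "code": match.group("code"),
        "kingdom": KINGDOMS[match.group("kingdom")],
        "taxid": int(match.group("taxid")),  # leading zeros are dropped
        "name": match.group("name"),
        "common_name": None,
    }
    for line in lines[1:]:
        field = line.strip()
        if field.startswith("C="):
            entry["common_name"] = field[2:]
    return entry


fly = parse_speclist_entry([
    "DROME E 007227: N=Drosophila melanogaster",
    "                C=Fruit fly",
])
print(fly)
```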

While in the process of mapping the old SWISS-PROT taxonomy to that of
NCBI, we corrected more than 100 misspellings in species names. We also
updated many names to newer and more appropriate designations (but kept
the previous names as synonyms).


2.4  Introduction of the Reference Title (RT) line-type

In release 37 we have introduced a new line type, the RT (Reference
Title) line. This optional line is placed between the RA and RL lines.
The RT line gives the title of the paper (or other work) cited, as
exactly as possible given the limitations of the computer character set.
The format of the RT line is:

RT   "TITLE";

An example of the use of RT lines is shown below:

RT   "Sequence analysis of the genome of the unicellular cyanobacterium
RT   Synechocystis sp. strain PCC6803. I. Sequence features in the 1 Mb
RT   region from map positions 64% to 92% of the genome.";

It should be noted that:

o The form  used is  that which  would be  used in a citation rather than
  that displayed  at the  top of the published paper. For instance, where
  journals capitalize major title words this is not preserved;
o The text of a title ends  with either a period '.', a question mark '?'
  or an exclamation mark '!';
o Double  quotation  marks '"' are  not present in the text of the title;
  they are replaced by single quotation marks;
o Titles of articles published in a language other than English have been
  translated into English;
o Greek letters are spelled out (alpha, beta, etc.).
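
The rules above make the title easy to extract by program. The following
Python sketch joins the RT lines of one reference, removes the enclosing
double quotes and the final semicolon, and checks the terminating
character. The function name 'parse_rt_lines' is hypothetical; joining
continuation lines with a single space is an assumption.

```python
def parse_rt_lines(lines):
    """Return the title stored in the RT lines of one reference."""
    # Drop the line code and join continuation lines with a single space.
    text = " ".join(line[2:].strip() for line in lines if line.startswith("RT"))
    if not (text.startswith('"') and text.endswith('";')):
        raise ValueError("RT text must have the form \"TITLE\";")
    title = text[1:-2]
    # A title ends with a period, a question mark or an exclamation mark.
    if title[-1:] not in (".", "?", "!"):
        raise ValueError("unexpected end of title")
    return title


title = parse_rt_lines([
    'RT   "Sequence analysis of the genome of the unicellular cyanobacterium',
    "RT   Synechocystis sp. strain PCC6803. I. Sequence features in the 1 Mb",
    'RT   region from map positions 64% to 92% of the genome.";',
])
print(title)
```

Since double quotation marks never occur inside a title, stripping the
first and last quote is sufficient and no escaping logic is needed.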

The RT lines were introduced in journal, book and patent references as
well as in some other types of references (Plant Gene Register, Worm
Breeder's Gazette). They have not yet been systematically introduced for
unpublished submissions. The RT lines were introduced using the following
sources of information:

o  For all references linked to Medline, the titles were automatically
   extracted from the relevant Medline abstracts;
o  The EMBL DNA sequence database was then automatically scanned to
   retrieve additional titles;
o  We then searched for the remaining missing titles in a variety of
   on-line resources:
   -  The LITDB bibliographic database from the Protein Research
      Foundation in Japan;
   -  The AGRICOLA bibliographic database from NAL;
   -  The Web sites of various journals;
   -  The Korean journals abstract database;
   -  The PDB 3D-structure database;
   -  The MIM database;
   -  The Plant Gene Register;
   -  The NCBI Entrez protein search tool;
   -  The European Patent Office patent database;
o  About 200 titles were typed in by going to various libraries in Geneva
   to find the relevant papers;
o  Finally, some authors, editors or publishers were contacted by email.
   We want to thank all those who responded and sent us titles that
   would otherwise have been very difficult to find.

Currently, out of more than 62'000 references, we lack the title for
fewer than 50 (this corresponds to a coverage of more than 99.9%).

The RT  line has been introduced in mixed-case, instead of the ALL UPPER-
CASE format used elsewhere in SWISS-PROT. As you will see in section 3.1,
we plan to gradually convert all of SWISS-PROT to mixed-case.


2.5  Changes affecting the accession numbers

With the  creation of  the TrEMBL  database (see section 6) and the rapid
increase in  the amount  of sequence data, we are faced with a problem of
availability of  accession numbers.  Currently we use a system based on a
one-letter prefix  followed by 5 digits. This system was also used by the
nucleotide sequence databases, which had originally reserved the prefix
letters 'P' and 'Q' for SWISS-PROT. The nucleotide databases, having run
out of space (due mainly to ESTs), have been forced to start using a new
format based on a two-letter prefix followed by 6 digits.

We have used up all possible numbers with 'P' and 'Q' and the only letter
prefix which  was not  used by  the nucleotide  database is  'O'.  As  we
believe that  changing the  format of  the accession numbers to that used
now by  the nucleotide  database  would  create  havoc  on  the  numerous
software packages  using SWISS-PROT,  we have decided to keep a system of
accession numbers  based on  a six-character code, but with the following
changes:

o We  have  started  using  'O'.  This  extra  letter  should  allow  the
  continuation of  the present  format (1  prefix letter  + 5 digits) for
  approximately one year.
o When we have finished using up 'O', we will introduce a system
  based on the following format:

    1        2       3          4            5            6
    [O,P,Q]  [0-9]  [A-Z, 0-9]  [A-Z, 0-9]   [A-Z, 0-9]   [0-9]

What the  above means is that we will keep a six-character code, but that
in positions  3, 4  and 5  of this  code any  combination of  letters and
numbers can  be present.  This  format  allows  a  total  of  14  million
accession numbers (up from 300'000 with the current system).

We only allow numbers in positions 2 and 6 so that the SWISS-PROT
accession numbers cannot be mistaken for gene names, acronyms, other
types of accession numbers or any other kind of word!

Examples: P0A3S2, Q2ASD4, O13YX2, P9B123
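
The format described above translates directly into a regular expression,
and the capacity figures can be checked by multiplying the number of
choices per position. The Python sketch below is an illustration only;
the names 'OLD_FORMAT', 'NEW_FORMAT' and 'is_valid_accession' are
hypothetical, and upper-case letters are assumed.

```python
import re

# Current format: one prefix letter followed by 5 digits.
OLD_FORMAT = re.compile(r"[OPQ][0-9]{5}")
# Extended format: letters or digits allowed in positions 3, 4 and 5.
NEW_FORMAT = re.compile(r"[OPQ][0-9][A-Z0-9]{3}[0-9]")


def is_valid_accession(accession: str) -> bool:
    """True if the string fits the extended six-character format."""
    return NEW_FORMAT.fullmatch(accession) is not None


# Number of distinct accession numbers each format can hold.
old_capacity = 3 * 10**5               # 3 letters x 5 digits
new_capacity = 3 * 10 * 36**3 * 10     # 3 x digit x 3 alphanumerics x digit

for accession in ("P0A3S2", "Q2ASD4", "O13YX2", "P9B123", "P14236"):
    print(accession, is_valid_accession(accession))
print(old_capacity, new_capacity)
```

Every accession number in the current format also matches the extended
pattern, so software updated to accept the extended format continues to
accept existing numbers.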


2.6  Changes concerning the reference location line (RL)

The (IN)  prefix is  mainly used  for book  citations. We  have  slightly
changed the  format of  these book  citations so  that the  format is now
similar to  that used  by the  EMBL nucleotide sequence database. The new
format is:

RL   (IN) EDITOR_1 I.[, EDITOR_2 I., EDITOR_X I.] (EDS.);
RL   BOOK-NAME, PP.[VOL:]FIRST-LAST, PUBLISHER, CITY (YEAR).

So, what was before:

RL   (IN) TRENDS IN QSAR AND MOLECULAR MODELING 92, WERMUTH C.G., ED.,
RL   PP.485-486, ESCOM, LEIDEN, (1993).

is now:

RL   (IN) WERMUTH C.G. (EDS.);
RL   TRENDS IN QSAR AND MOLECULAR MODELLING 92, PP.485-486, ESCOM
RL   SCIENCE PUBLISHERS, LEIDEN (1993).

Since release 36, the (IN) prefix has also been used for citations to the
electronic Plant Gene Register. In release 37 it can additionally be used
for references to the Worm Breeder's Gazette (see
http://elegans.swmed.edu/wli/). Example:

RL   (IN) WORM BREEDER'S GAZETTE 15(3):34(1998).


2.7  Cleaning up of the SIMILARITY comment line (CC) topic

We are continuing a major overhaul of the SIMILARITY topic. We would like
the majority  of the  information stored  in this  topic to  be usable by
computer  programs   (while  being   human-readable).  We  are  therefore
standardizing the  format of  this topic  using two different subformats.
One to describe to which family a protein belongs:

CC   -!-  SIMILARITY: BELONGS TO THE <Name1> FAMILY [OF <Name2>].
CC        [<Name3> SUBFAMILY.]

Examples:

CC   -!-  SIMILARITY: BELONGS TO THE 14-3-3 FAMILY.
CC   -!-  SIMILARITY: BELONGS TO THE 6-PHOSPHOGLUCONATE DEHYDROGENASE
CC        FAMILY.
CC   -!-  SIMILARITY: BELONGS TO THE AAA FAMILY OF ATPASES.
CC   -!-  SIMILARITY: BELONGS TO THE IRON/ASCORBATE-DEPENDENT FAMILY OF
CC        OXIDOREDUCTASES.
CC   -!-  SIMILARITY: BELONGS TO THE ANTP FAMILY OF HOMEOBOX PROTEINS.
CC        "DEFORMED" SUBFAMILY.
CC   -!-  SIMILARITY: BELONGS TO THE KINESIN-LIKE PROTEIN FAMILY. KINESIN
CC        SUBFAMILY.

And one to describe which domains are found in a given protein:

CC   -!-  SIMILARITY: CONTAINS n <Name> [DOMAIN|REPEAT][S].

Examples:

CC   -!-  SIMILARITY: CONTAINS 1 FHA DOMAIN.
CC   -!-  SIMILARITY: CONTAINS 45 EGF-LIKE DOMAINS.
CC   -!-  SIMILARITY: CONTAINS 2 SH3 DOMAINS.
CC   -!-  SIMILARITY: CONTAINS 2 SUSHI (SCR) REPEATS.

We have  already updated  many entries  in this and the previous releases
and plan to complete this change for the next release.
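
To illustrate how the two subformats can be used by programs, the Python
sketch below parses the text that follows 'SIMILARITY: ' once the
continuation lines have been joined. The patterns are derived only from
the templates and examples above; the function name 'parse_similarity'
and the dictionary keys are hypothetical.

```python
import re

# BELONGS TO THE <Name1> FAMILY [OF <Name2>]. [<Name3> SUBFAMILY.]
FAMILY = re.compile(
    r"BELONGS TO THE (?P<family>.+?) FAMILY"
    r"(?: OF (?P<group>.+?))?\."
    r"(?: (?P<subfamily>.+?) SUBFAMILY\.)?"
)
# CONTAINS n <Name> [DOMAIN|REPEAT][S].
CONTAINS = re.compile(
    r"CONTAINS (?P<count>\d+) (?P<name>.+?) (?P<kind>DOMAIN|REPEAT)S?\."
)


def parse_similarity(text):
    """Parse one standardized SIMILARITY comment; return None otherwise."""
    match = FAMILY.fullmatch(text)
    if match:
        return {"type": "family", **match.groupdict()}
    match = CONTAINS.fullmatch(text)
    if match:
        return {
            "type": "contains",
            "count": int(match.group("count")),
            "name": match.group("name"),
            "kind": match.group("kind"),
        }
    return None  # free text that has not been standardized yet


print(parse_similarity("BELONGS TO THE KINESIN-LIKE PROTEIN FAMILY. KINESIN SUBFAMILY."))
print(parse_similarity("CONTAINS 2 SUSHI (SCR) REPEATS."))
```

Comments that do not follow either subformat yield None, which is to be
expected until the conversion of this topic is complete.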


2.8  Changes concerning cross-references (DR line)

We have added cross-references from SWISS-PROT to the Pfam protein domain
database  (see   http://www.sanger.ac.uk/Pfam/;  reference:  Bateman  A.,
Birney E., Durbin R., Eddy S.R., Finn R.D. and Sonnhammer E.L.L.; Nucleic
Acids Res.  27:260-262(1999)). These  cross-references are present in the
DR lines. The specific format for cross-references to the Pfam database
is almost identical to that used for the PROSITE database:

DR   PFAM; ACCESSION_NUMBER; ENTRY_NAME; STATUS.

Where 'ACCESSION_NUMBER' stands for the accession number of the Pfam
HMM-profile entry, 'ENTRY_NAME' is the name of the entry and 'STATUS' is
either 'n' or 'PARTIAL'. 'n' is the number of hits of the profile in
that particular protein sequence. The 'PARTIAL' status indicates that the
profile did not detect the sequence because that sequence is not complete
and lacks the region on which the profile is based. The difference
between the cross-references to Pfam and those to PROSITE is that the
PROSITE DR lines make use of two additional 'STATUS' values: 'FALSE_NEG'
and 'UNKNOWN'.

Examples of Pfam cross-references:

DR   PFAM; PF00017; SH2; 1.
DR   PFAM; PF00008; EGF; 8.
DR   PFAM; PF00595; PDZ; PARTIAL.
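
A Pfam cross-reference can be split into its fields as shown in the
following Python sketch. It follows only the format given above; the
function name 'parse_pfam_dr' and the returned keys are hypothetical.

```python
def parse_pfam_dr(line):
    """Split a 'DR   PFAM; ...' line into accession, entry name and status."""
    if not line.startswith("DR"):
        raise ValueError("not a DR line")
    # Remove the line code and the final period, then split on ';'.
    fields = [f.strip() for f in line[2:].strip().rstrip(".").split(";")]
    if len(fields) != 4 or fields[0] != "PFAM":
        raise ValueError("not a Pfam cross-reference")
    _, accession, entry_name, status = fields
    partial = status == "PARTIAL"
    return {
        "accession": accession,
        "entry_name": entry_name,
        "partial": partial,
        # 'n' is the number of hits; it is undefined for PARTIAL.
        "hits": None if partial else int(status),
    }


for dr in ("DR   PFAM; PF00017; SH2; 1.",
           "DR   PFAM; PF00008; EGF; 8.",
           "DR   PFAM; PF00595; PDZ; PARTIAL."):
    print(parse_pfam_dr(dr))
```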


In this  release, we  have also  updated all the DR lines pointing to the
HSSP, Mendel and TRANSFAC databases.




                        3. PLANNED CHANGES


3.1  Conversion of SWISS-PROT to mixed-case characters

We are  happy to  announce that we will gradually start the conversion of
SWISS-PROT entries from all 'UPPER CASE' to 'MiXeD CaSe'. The first line-
type that  follows the  new format  is the  newly introduced RT line (see
section 2.4). In release 38 we are planning to convert the following line
types:

                        DT, OS, OG, OC, RL and KW

Further lines will be converted in release 39, and this process will
probably be completed by January 1, 2000. We can't enter the third
millennium with a carry-over from the time of punched tapes and
teletypes!

Here is  an example  of what a SWISS-PROT entry will look like in release
38:

ID   PETG_CYAPA     STANDARD;      PRT;    37 AA.
AC   P14236;
DT   01-JAN-1990 (Rel. 13, Created)
DT   01-JAN-1990 (Rel. 13, Last sequence update)
DT   01-NOV-1997 (Rel. 35, Last annotation update)
DE   CYTOCHROME B6-F COMPLEX SUBUNIT 5.
GN   PETG.
OS   Cyanophora paradoxa.
OG   Cyanelle.
OC   Eukaryota; Glaucocystophyceae; Cyanophoraceae; Cyanophora.
RN   [1]
RP   SEQUENCE FROM N.A.
RC   STRAIN=LB555 / PRINGSHEIM;
RX   MEDLINE; 90098772.
RA   STIREWALT V.L., BRYANT D.A.;
RT   "Molecular cloning and nucleotide sequence of the petG gene of the
RT   cyanelle genome of Cyanophora paradoxa.";
RL   Nucleic Acids Res. 17:10095-10095(1989).
RN   [2]
RP   SEQUENCE FROM N.A.
RC   STRAIN=LB555 / PRINGSHEIM;
RA   STIREWALT V.L., MICHALOWSKI C.B., LUFFELHARDT W., BOHNERT H.J.,
RA   BRYANT D.A.;
RL   Submitted (JUL-1995) to the EMBL/GenBank/DDBJ databases.
CC   -!- FUNCTION: THE CYTOCHROME B6-F COMPLEX FUNCTIONS IN THE LINEAR
CC       CROSS-MEMBRANE TRANSPORT OF ELECTRONS BETWEEN PHOTOSYSTEM II AND
CC       I, AS WELL AS IN CYCLIC ELECTRON FLOW AROUND PHOTOSYSTEM I.
CC       PETG IS REQUIRED FOR EITHER THE STABILITY OR ASSEMBLY OF THE
CC       CYTOCHROME B6-F COMPLEX.
CC   -!- SUBCELLULAR LOCATION: THYLAKOID MEMBRANE-ASSOCIATED.
CC   -!- SIMILARITY: BELONGS TO THE PETG FAMILY.
DR   EMBL; X16974; G12549; -.
DR   EMBL; U30821; G1016164; -.
DR   PIR; S06916; S06916.
DR   MENDEL; 7879; CYApa;petG;1.
KW   Electron transport; Respiratory chain; Cyanelle;
KW   Thylakoid membrane; Transmembrane.
FT   DOMAIN        1      4       LUMENAL (POTENTIAL).
FT   TRANSMEM      5     25       POTENTIAL.
FT   DOMAIN       26     37       STROMAL (POTENTIAL).
SQ   SEQUENCE   37 AA;  4139 MW;  265A8973 CRC32;
     MVEPLLSGIV LGLIPVTLIG LFVAAYLQYR RGNQFEF
//
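
Programs that read entries by their two-letter line codes should not be
affected by the case conversion, since only the content of the lines
changes. The Python sketch below groups the lines of an abridged copy of
the entry above by line code and collects the sequence. The function
name 'read_entry' is hypothetical, and the offsets (line code in columns
1-2, data from column 6) are taken from the example.

```python
from collections import defaultdict

SAMPLE = """\
ID   PETG_CYAPA     STANDARD;      PRT;    37 AA.
AC   P14236;
OS   Cyanophora paradoxa.
OC   Eukaryota; Glaucocystophyceae; Cyanophoraceae; Cyanophora.
KW   Electron transport; Respiratory chain; Cyanelle;
KW   Thylakoid membrane; Transmembrane.
SQ   SEQUENCE   37 AA;  4139 MW;  265A8973 CRC32;
     MVEPLLSGIV LGLIPVTLIG LFVAAYLQYR RGNQFEF
//
"""


def read_entry(text):
    """Return (lines grouped by line code, sequence) for a single entry."""
    fields = defaultdict(list)
    sequence = []
    for line in text.splitlines():
        if line.startswith("//"):      # end-of-entry marker
            break
        if line.startswith("  "):      # sequence data has a blank line code
            sequence.append(line.replace(" ", ""))
        elif line.strip():
            fields[line[:2]].append(line[5:].rstrip())
    return dict(fields), "".join(sequence)


fields, sequence = read_entry(SAMPLE)
print(fields["OS"], len(sequence))
```

Code that compares field content against upper-case strings (for example
organism names) is where adjustments are most likely to be needed.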


3.2  Extension of the accession number system

As already explained in detail under 2.5, we will extend the accession
number system when we have used up the 'O' series of accession numbers.
This is anticipated for early 1999.


3.3  Introduction of a new CC line-type topic: MISCELLANEOUS

We will introduce in the next release a new 'topic' for the comments (CC)
line-type:  'MISCELLANEOUS'.  This  topic  will  be used for all comments
which  do not  belong to any other already defined topic. What this means
is that,  starting with release 38, all comment lines will be assigned to
a topic. Example:

CC   -!- BINDS TO BACITRACIN.

will become:

CC   -!- MISCELLANEOUS: BINDS TO BACITRACIN.
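
The effect of this change on a single comment can be sketched in Python
as follows. The list of topics is an illustrative subset limited to those
that appear in these notes, not the complete list of CC topics, and the
function name 'add_default_topic' is hypothetical. Re-wrapping of lines
that become too long is not handled.

```python
import re

# Illustrative subset of topics; the real list is longer.
KNOWN_TOPICS = ("FUNCTION", "SUBCELLULAR LOCATION", "SIMILARITY", "MISCELLANEOUS")

# First line of a comment block: 'CC   -!- ' followed by the text.
COMMENT_START = re.compile(r"^(CC   -!-\s+)(.*)$")


def add_default_topic(line):
    """Prefix a topic-less comment with 'MISCELLANEOUS: '."""
    match = COMMENT_START.match(line)
    if match is None:
        return line  # continuation line or other line type
    prefix, text = match.groups()
    if any(text.startswith(topic + ":") for topic in KNOWN_TOPICS):
        return line  # already assigned to a topic
    return f"{prefix}MISCELLANEOUS: {text}"


print(add_default_topic("CC   -!- BINDS TO BACITRACIN."))
```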


3.4  Introduction  of   a  unique   identifier  in  the  VARIANT  feature
     description of human sequence entries

We plan  to introduce  in release  38 a unique identifier for all VARIANT
feature keys  in human  sequence entries.  This change  is the first step
toward providing  a unique  identifier to  all SWISS-PROT features. Human
sequence  variants   were  chosen   as  a   prototype  for  this  planned
improvement. It  will be  possible, as  soon as  these identifiers become
available, to  directly link  specific sequence  variants to the relevant
entries in  disease mutation  databases  as  well  as  to  provide  these
databases with a method to implement reciprocal links.

The unique identifier will take the form '/FTId=VAR_nnnnnn' and will be
added as the last part of the description field of a 'VARIANT' feature
key. Examples:

FT   VARIANT       6      6       E -> V (IN S; SICKLE CELL ANEMIA).
FT   VARIANT      11     11       V -> D (IN WINDSOR; O2 AFFINITY UP;
FT                                UNSTABLE).

will become:

FT   VARIANT       6      6       E -> V (IN S; SICKLE CELL ANEMIA);
FT                                /FTId=VAR_000001.
FT   VARIANT      11     11       V -> D (IN WINDSOR; O2 AFFINITY UP;
FT                                UNSTABLE); /FTId=VAR_000234.
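
Once the identifiers are available, they can be separated from the rest
of the description with a simple pattern. The Python sketch below works
on a description whose continuation lines have already been joined with
a single space; the function name 'split_variant_description' is
hypothetical and the pattern is based only on the planned form shown
above.

```python
import re

# '/FTId=VAR_' followed by six digits, at the end of the description.
FTID = re.compile(r"/FTId=(VAR_\d{6})\.?\s*$")


def split_variant_description(description):
    """Return (description without identifier, identifier or None)."""
    match = FTID.search(description)
    if match is None:
        return description, None
    # Drop the identifier and the ';' that separates it from the text.
    text = description[:match.start()].rstrip().rstrip(";")
    return text, match.group(1)


print(split_variant_description(
    "E -> V (IN S; SICKLE CELL ANEMIA); /FTId=VAR_000001."
))
```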


3.5  Small change  in the  format of  RL lines for submissions to the DNA
     databases

Along with  the conversion of the RL to mixed-case (see 3.1) we will also
make a  small change to the format of RL lines for submissions to the DNA
databases. What is now:

RL   SUBMITTED (MMM-YEAR) TO EMBL/GENBANK/DDBJ DATA BANKS.

will be changed to:

RL   Submitted (MMM-YEAR) to the EMBL/GenBank/DDBJ databases.

Such a change is made so as to follow more closely the format used by the
EMBL nucleotide sequence database.
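
Software that recognizes submission references by their exact wording
will need to accept the new form. The Python sketch below rewrites an
old-style line into the new one; it is an illustration of the change
described above, not the conversion procedure actually used, and the
function name 'convert_submission_rl' is hypothetical.

```python
import re

# Old wording, with a three-letter month and a four-digit year.
OLD_SUBMISSION = re.compile(
    r"^RL   SUBMITTED \((?P<month>[A-Z]{3})-(?P<year>\d{4})\) "
    r"TO EMBL/GENBANK/DDBJ DATA BANKS\.$"
)


def convert_submission_rl(line):
    """Rewrite an old-style submission RL line; leave other lines as they are."""
    match = OLD_SUBMISSION.match(line)
    if match is None:
        return line
    month, year = match.group("month"), match.group("year")
    # The month abbreviation stays in upper case, as in the entry in 3.1.
    return f"RL   Submitted ({month}-{year}) to the EMBL/GenBank/DDBJ databases."


print(convert_submission_rl(
    "RL   SUBMITTED (JUL-1995) TO EMBL/GENBANK/DDBJ DATA BANKS."
))
```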




                  4. STATUS OF THE DOCUMENTATION FILES


SWISS-PROT is  distributed with  a large  number of  documentation files.
Some of these files have been available for a long time (the user manual,
release notes,  the various  indices for  authors,  citations,  keywords,
etc.), but many have been created recently and we are continuously adding
new files. The following table lists all the documents that are currently
available.

 USERMAN.TXT    User manual
 RELNOTES.TXT   Release notes for current release (37)
 OLDRLNOT.TXT   Release notes for previous release (36)
 SHORTDES.TXT   Short description of entries in SWISS-PROT
 JOURLIST.TXT   List of abbreviations for journals cited [see 1]
 KEYWLIST.TXT   List of keywords in use [see 2]
 SPECLIST.TXT   List of organism identification codes [see 3]
 TISSLIST.TXT   List of tissues
 EXPERTS.TXT    List of on-line experts for PROSITE and SWISS-PROT
 SUBMIT.TXT     Submission of sequence data to SWISS-PROT

 ACINDEX.TXT    Accession number index
 AUTINDEX.TXT   Author index
 CITINDEX.TXT   Citation index
 KEYINDEX.TXT   Keyword index
 SPEINDEX.TXT   Species index
 DELETEAC.TXT   Deleted accession number index

 7TMRLIST.TXT   List of 7-transmembrane G-linked receptor entries
 AATRNASY.TXT   List of aminoacyl-tRNA synthetases
 ALLERGEN.TXT   Nomenclature and index of allergen sequences
 BLOODGRP.TXT   List of blood group antigen proteins
 CALBICAN.TXT   Index   of  Candida  albicans  entries   and  their
                corresponding gene designations
 CDLIST.TXT     CD  nomenclature  for  surface  proteins  of  human
                leucocytes
 CELEGANS.TXT   Index  of Caenorhabditis elegans entries  and their
                corresponding gene Wormpep cross-references
 DICTY.TXT      Index   of  Dictyostelium  discoideum  entries  and
                their  corresponding gene designations  and DictyDB
                cross-references
 EC2DTOSP.TXT   Index  of  Escherichia coli  Gene-protein  database
                entries referenced in SWISS-PROT
 ECOLI.TXT      Index  of Escherichia coli K12  chromosomal entries
                and their corresponding EcoGene cross-references
 EMBLTOSP.TXT   Index  of   EMBL  Database  entries  referenced  in
                SWISS-PROT
 EXTRADOM.TXT   Nomenclature of extracellular domains
 FLY.TXT        Index  of  Drosophila  entries and  FlyBase  cross-
                references
 GLYCOSID.TXT   Classification  of glycosyl hydrolase  families and
                index of glycosyl hydrolase entries
 HAEINFLU.TXT   Index  of  Haemophilus  influenzae  RD  chromosomal
                entries
 HOXLIST.TXT    Vertebrate  homeotic Hox proteins: nomenclature and
                index
 HPYLORI.TXT    Index   of   Helicobacter   pylori   strain   26695
                chromosomal entries
 HUMCHR17.TXT   Index of protein  sequence entries encoded on human
                chromosome 17
 HUMCHR18.TXT   Index of protein  sequence entries encoded on human
                chromosome 18
 HUMCHR19.TXT   Index of protein  sequence entries encoded on human
                chromosome 19
 HUMCHR20.TXT   Index of protein  sequence entries encoded on human
                chromosome 20
 HUMCHR21.TXT   Index of protein  sequence entries encoded on human
                chromosome 21
 HUMCHR22.TXT   Index of protein  sequence entries encoded on human
                chromosome 22
 HUMCHRX.TXT    Index of protein  sequence entries encoded on human
                chromosome X
 HUMCHRY.TXT    Index of protein  sequence entries encoded on human
                chromosome Y
 HUMPVAR.TXT    Index of human proteins with sequence variants
 INITFACT.TXT   List and index of translation initiation factors
 MIMTOSP.TXT    Index of MIM entries referenced in SWISS-PROT
 METALLO.TXT    Classification  of  metallothioneins and  index  of
                entries in SWISS-PROT
 MGDTOSP.TXT    Index of MGD entries referenced in SWISS-PROT
 MGENITAL.TXT   Index  of Mycoplasma genitalium chromosomal entries
 MJANNASC.TXT   Index of Methanococcus jannaschii entries
 NGR234.TXT     Table  of   putative  genes  in  Rhizobium  plasmid
                pNGR234a
 NOMLIST.TXT    List   of  nomenclature   related  references   for
                proteins
 PCC6803.TXT    Index of Synechocystis strain PCC 6803 entries
 PDBTOSP.TXT    Index  of X-ray  crystallography Protein Data  Bank
                (PDB) entries referenced in SWISS-PROT
 PEPTIDAS.TXT   Classification  of peptidase families and  index of
                peptidase entries
 PLASTID.TXT    List of chloroplast and cyanelle encoded proteins
 POMBE.TXT      Index   of  Schizosaccharomyces  pombe  entries  in
                SWISS-PROT    and    their    corresponding    gene
                designations
 RESTRIC.TXT    List of restriction enzyme and methylase entries
 RIBOSOMP.TXT   Index of  ribosomal proteins classified by families
                on the basis of sequence similarities
 SALTY.TXT      Index  of  Salmonella typhimurium  LT2  chromosomal
                entries  and  their  corresponding  StyGene  cross-
                references
 SUBTILIS.TXT   Index of  Bacillus subtilis 168 chromosomal entries
                and their corresponding SubtiList cross-references
 UPFLIST.TXT    UPF  (Uncharacterized  Protein Families)  list  and
                index of members
 YEAST.TXT      Index   of  Saccharomyces  cerevisiae  entries  and
                their corresponding gene designations
 YEAST1.TXT     Yeast Chromosome I entries
 YEAST2.TXT     Yeast Chromosome II entries
 YEAST3.TXT     Yeast Chromosome III entries
 YEAST5.TXT     Yeast Chromosome V entries
 YEAST6.TXT     Yeast Chromosome VI entries
 YEAST7.TXT     Yeast Chromosome VII entries
 YEAST8.TXT     Yeast Chromosome VIII entries
 YEAST9.TXT     Yeast Chromosome IX entries
 YEAST10.TXT    Yeast Chromosome X entries
 YEAST11.TXT    Yeast Chromosome XI entries
 YEAST13.TXT    Yeast Chromosome XIII entries
 YEAST14.TXT    Yeast Chromosome XIV entries

Notes:

[1]  The journal list ('jourlist.txt') has been extensively updated. This
     document now  lists for  each journal  the name  of  its  publisher.
     Journal subtitles,  when they  are available,  have also been added.
     This file  can now  be considered as a mini-database on life science
     journals. It  lists 1073  journals and  contains more  than 800  Web
     links. Example of an entry in the journal list:

     Abbrev: Allergy
     Title : Allergy
             [European Journal of Allergy and Clinical Immunology]
     ISSN  : 0105-4538
     CODEN : LLRGDY
     Publis: Munksgaard
     Note  : Replaces Acta Allergol., starts with vol. 33 in 1978.
     Server: http://www.munksgaard.dk/allergy/

[2]  The keyword  list ('keywlist.txt') has been converted to  mixed-case
     characters.
[3]  The species  list ('speclist.txt') has been extensively  updated due
     to the  switch  to  the NCBI taxonomy (see section 2.3); it also has
     been converted to mixed-case characters.

We have  continued to  include in  some  SWISS-PROT  document  files  the
references of  Web sites  relevant to  the subject  under  consideration.
There are now 40 documents that include such links.




                  5. THE EXPASY WORLD-WIDE WEB SERVER


5.1  Background information

The most  efficient and  user-friendly way  to  browse  interactively  in
SWISS-PROT, PROSITE,  ENZYME, SWISS-2DPAGE  and other databases is to use
the World-Wide  Web (WWW)  molecular biology  server ExPASy.  The  ExPASy
server was  made available  to  the  public  in  September  1993  and  is
reachable at the following address:

                          http://www.expasy.ch/

The ExPASy WWW server allows access, using the user-friendly hypertext
model, to the SWISS-PROT, PROSITE, ENZYME, SWISS-2DPAGE, SWISS-3DIMAGE
and CD40Lbase databases and, through any SWISS-PROT protein sequence
entry, to other databases such as EMBL, Eco2DBASE, EcoCyc, FlyBase,
GCRDb, MaizeDB, Mendel, OMIM, PDB, HSSP, Pfam, ProDom, REBASE, SGD,
SubtiList/NRSub, TRANSFAC, YPD and Medline. ExPASy also offers many tools
for the analysis of protein sequences and 2D gels.


5.2  Swiss-Shop

We provide,  on ExPASy,  a service  called Swiss-Shop.  Swiss-Shop is  an
automated sequence  alerting system  which allows  users  to  obtain,  by
email, new  sequence entries  relevant to  their  field(s)  of  interest.
Various criteria can be combined:

o    By entering  one or  more  words  that  should  be  present  in  the
     description line;
o    By entering one or more species name(s) or taxonomic division(s);
o    By entering one or more keywords;
o    By entering one or more author names;
o    By entering  the accession  number (or  entry  name)  of  a  PROSITE
     pattern or a user-defined sequence pattern;
o    By entering  the accession  number (or  entry name)  of an  existing
     SWISS-PROT entry or by entering a private sequence.

Every week,  the new  sequences entered  in SWISS-PROT  are automatically
compared with  all the criteria that have been defined by the users. If a
sequence corresponds  to the  selection criteria  defined by a user, that
sequence is sent by electronic mail.
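
The weekly comparison described above can be pictured with the following
minimal Python sketch. It is not the Swiss-Shop implementation: the Entry
and Criteria classes and the matches() function are hypothetical names, the
pattern and sequence-similarity criteria are left out, and we assume that
all criteria filled in by a user must hold at the same time.

```python
from dataclasses import dataclass, field


@dataclass
class Entry:
    """Hypothetical, simplified view of a new SWISS-PROT entry."""
    description: str
    species: str
    keywords: set = field(default_factory=set)
    authors: set = field(default_factory=set)


@dataclass
class Criteria:
    """Hypothetical user profile; an empty field means 'no restriction'."""
    description_words: set = field(default_factory=set)
    species: set = field(default_factory=set)
    keywords: set = field(default_factory=set)
    authors: set = field(default_factory=set)


def matches(entry, crit):
    """Return True if the entry satisfies every non-empty criterion.

    Assumption: combined criteria are joined with a logical AND.
    """
    desc = entry.description.lower()
    if crit.description_words and not all(
        word.lower() in desc for word in crit.description_words
    ):
        return False
    if crit.species and entry.species not in crit.species:
        return False
    if crit.keywords and not (crit.keywords & entry.keywords):
        return False
    if crit.authors and not (crit.authors & entry.authors):
        return False
    return True
```

In such a scheme, every new entry of the week would be tested against each
stored profile, and the entry would be mailed to the users whose profile
matches.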


5.3  What is new on ExPASy

ExPASy is constantly modified and improved. If you wish to be informed of
the changes made to the server you can either:

o    Read the  document History of changes, improvements and new features
     which is available at the address:

                 http://www.expasy.ch/www/history.html

o    Subscribe to Swiss-Flash, a service that reports news of database,
     software and service developments. By subscribing to this service,
     you will automatically get Swiss-Flash bulletins by electronic mail.
     To subscribe use the address:

              http://www.expasy.ch/www/swiss-flash.html

Among all  the improvements  and the  new features  introduced during the
last  six   months,  there  are  at  least  three  that  we  believe  are
specifically useful to SWISS-PROT users:

o NiceProt is a tool that provides a user-friendly tabular view of SWISS-
  PROT entries. The 'NiceProt View of SWISS-PROT' is accessible from the
  top and bottom of each SWISS-PROT entry on ExPASy. You can use this
  tool to link to any SWISS-PROT entry by using the following style of
  URL: http://www.expasy.ch/cgi-bin/niceprot.pl?P01585 (where the last
  part of the URL is a valid primary accession number).

o The  SWISS-PROT/TrEMBL  full  text  search  tool has been improved. The
  databases are now  indexed  using  the Glimpse search engine, wildcards
  can be used in query strings, more fields (line types) are  indexed and
  response times are much shorter than before. See:

             http://www.expasy.ch/cgi-bin/sprot-search-ful

o Users who wish to save and retrieve all SWISS-PROT entries  originating
  from  a species can do this via the SWISS-PROT 'speclist.txt' document.
  By clicking on any of the species codes and specifying a file name, one
  can save  all corresponding  entries to  a file  that can be  retrieved
  from the anonymous ExPASy FTP server.
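
The NiceProt URL style described above can be generated programmatically.
The following Python sketch is only an illustration: the function name
niceprot_url is hypothetical, and the accession number check assumes the
six-character pattern (O, P or Q, a digit, three letters or digits, a
digit) followed by the accession numbers quoted in this document.

```python
import re

NICEPROT_BASE = "http://www.expasy.ch/cgi-bin/niceprot.pl"

# Assumption: primary accession numbers look like P01585 or Q64408.
ACCESSION_RE = re.compile(r"[OPQ][0-9][A-Z0-9]{3}[0-9]")


def niceprot_url(accession):
    """Build a NiceProt link for a primary accession number."""
    ac = accession.strip().upper()
    if not ACCESSION_RE.fullmatch(ac):
        raise ValueError(f"not a plausible accession number: {accession!r}")
    return f"{NICEPROT_BASE}?{ac}"


print(niceprot_url("P01585"))
```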




                  6. TREMBL - A SUPPLEMENT TO SWISS-PROT


The ongoing  genome sequencing  and mapping  projects  have  dramatically
increased the  number of protein sequences to be incorporated into SWISS-
PROT. Since  we do not want to dilute the quality standards of SWISS-PROT
by incorporating  sequences into  the database  without  proper  sequence
analysis and  annotation, we  cannot speed  up the  incorporation of  new
incoming data indefinitely. But as we also want to make the sequences
available as quickly as possible, we have introduced a computer-annotated
supplement to SWISS-PROT. This supplement consists of entries in
SWISS-PROT-like  format  derived  from  the  translation  of  all  coding
sequences (CDS)  in the  EMBL nucleotide  sequence database, except those
already included in SWISS-PROT.

We name  this supplement  TrEMBL  (Translation  from  EMBL).  It  can  be
considered as  a  preliminary  section  of  SWISS-PROT.  This  SWISS-PROT
release is supplemented by TrEMBL release 8. TrEMBL is split into two
main sections, SP-TrEMBL and REM-TrEMBL:

SP-TrEMBL (SWISS-PROT TrEMBL) contains the entries (180'763 in release 8)
which  should  be  incorporated  into  SWISS-PROT.  SWISS-PROT  accession
numbers have been assigned for all SP-TrEMBL entries.

REM-TrEMBL (REMaining TrEMBL) contains the entries (43'780 in release 8)
that we do not want to include in SWISS-PROT for a variety of reasons
(synthetic sequences, pseudogenes, translations of incorrect open reading
frames, fragments with fewer than eight amino acids, patent-derived
sequences, immunoglobulins and T-cell receptors, etc.).

TrEMBL is available by FTP from the EBI and ExPASy servers in the
directory 'databases/trembl'. It can be queried on the WWW by the EBI and
ExPASy SRS servers. It is also searchable on the FASTA, BIC-SW and BLAST
servers of the EBI.




                7.  FTP ANONYMOUS ACCESS TO SWISS-PROT


7.1  Generalities

SWISS-PROT is  available for  download on  the  following  anonymous  FTP
servers:

Organization   Swiss Institute of Bioinformatics (SIB)
Address        ftp.expasy.ch
Directory      /databases/swiss-prot/

Organization   European Bioinformatics Institute (EBI)
Address        ftp.ebi.ac.uk
Directory      /pub/databases/swissprot/

We have  reorganized the  directory on the ExPASy FTP server where SWISS-
PROT is stored. The new organization is shown below.

+--swiss-prot-+
              |
              |--release             The files for the current release of
              |                      SWISS-PROT
              |
              |--release_compressed  The files of the compressed version
              |                      (*.Z) of the current release of SWISS-
              |                      PROT
              |
              |--special_selections  Files storing SWISS-PROT entries either
              |                      from a specific taxonomic subset or
              |                      linked to a specific database
              |
              |--sw_old_releases     The compressed 'tar' (archive) files
              |                      of previous releases of SWISS-PROT
              |
              |--updates             The files of the cumulative weekly
              |                      updates
              |
              +--updates_compressed  The files of the compressed version
                                     (*.Z) of the cumulative weekly updates


The main differences from the previous release are:

o The  SWISS-PROT  release  files  are  now  in  a  subdirectory  (swiss-
  prot/release) instead of the main directory which is now devoid of data
  files.
o A  new   subdirectory  (swiss-prot/sw_old_releases)   was  created.  It
  contains Unix  compressed 'tar' (archive) files of previous releases of
  SWISS-PROT.  Each   release  is   stored  in   a  file  with  the  name
  sprotNN.tar.Z where  NN is a release number. Such a file stores all the
  documentation (*.txt)  files and  the data  file (sprotNN.dat)  of  the
  corresponding SWISS-PROT  release. The  release notes  are renamed from
  release.txt to  release.NN. We  have decided  to provide these files to
  answer two  kinds of  requests. The  main one originates from users who
  want to  compare sequence analysis algorithms by benchmarking them on a
  specific release  of the  database so  as to compare their results with
  those of a competing program. The second type of request originates
  from legal departments of biotech companies that often want to be able
  to check the state of knowledge on a particular sequence at a given
  point in time.
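
The naming scheme of the archived releases can be summarized with a short
Python sketch. The function name old_release_files is hypothetical, and we
assume that NN is written exactly as the release number is given, without
any padding.

```python
def old_release_files(release):
    """Return the file names tied to an archived SWISS-PROT release.

    Assumption: NN is the release number written as given (e.g. 36).
    """
    nn = str(release)
    return {
        # Compressed archive in swiss-prot/sw_old_releases
        "archive": f"sprot{nn}.tar.Z",
        # Data file stored inside the archive
        "data": f"sprot{nn}.dat",
        # Release notes, renamed from release.txt
        "notes": f"release.{nn}",
    }


print(old_release_files(36))
```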


7.2  Weekly updates of SWISS-PROT

Weekly updates  of SWISS-PROT are available by anonymous FTP. Three files
are generated at each update:

new_seq.dat    Contains all the new entries since the last full release;
upd_seq.dat    Contains the  entries for which the sequence data has been
               updated since the last release;
upd_ann.dat    Contains the  entries for  which one  or  more  annotation
               fields have been updated since the last release.

!! Important notes !!

o Although we  try to  follow a  regular schedule,  we do  not promise to
  update these  files every  week. In  most cases  two weeks  may  elapse
  between two updates.
o Instead of  using the  above files,  you can,  every week,  download an
  updated copy  of the SWISS-PROT database. This file is available in the
  directory containing the non-redundant database (see next section).


7.3  Non-redundant database

About a year ago, we started to distribute, on the ExPASy and EBI FTP
servers, files that make up a non-redundant (see below) and complete
protein sequence database consisting of three components:

1) SWISS-PROT
2) TrEMBL
3) New  entries to  be later  integrated into  TrEMBL (hereafter known as
   TrEMBL_New)

Every week  three files  are completely  rebuilt. These  files are named:
sprot.dat.Z, trembl.dat.Z  and trembl_new.dat.Z. As indicated by their .Z
extension these  are Unix compress format files which, when decompressed,
will produce ASCII files in SWISS-PROT format.

Three other files are also available (sprot.fas.Z, trembl.fas.Z and
trembl_new.fas.Z). These are compressed FASTA-format sequence files useful
for building the databases used by FASTA, BLAST and other sequence
similarity search programs. Please do not use these files for any other
purpose, as you will lose all annotations by using this very primitive
format.

The files  for the  non-redundant database  are stored  in the  directory
/databases/sp_tr_nrdb on the ExPASy FTP server (ftp.expasy.ch) and in the
directory   /pub/databases/sp_tr_nrdb    on   the    EBI    FTP    server
(ftp.ebi.ac.uk).
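
As an illustration, the locations of the six files can be assembled from
the server names and directories given above. This Python sketch only
builds the addresses; the function name nrdb_urls is hypothetical and the
'ftp://' URL notation is our assumption.

```python
# Host and directory of the non-redundant database on each server.
SERVERS = {
    "ExPASy": ("ftp.expasy.ch", "/databases/sp_tr_nrdb"),
    "EBI": ("ftp.ebi.ac.uk", "/pub/databases/sp_tr_nrdb"),
}

# SWISS-PROT, TrEMBL and TrEMBL_New, in that order.
COMPONENTS = ("sprot", "trembl", "trembl_new")


def nrdb_urls(server, fmt="dat"):
    """List the three weekly files for one server.

    fmt is 'dat' (SWISS-PROT format) or 'fas' (FASTA format).
    """
    if fmt not in ("dat", "fas"):
        raise ValueError("fmt must be 'dat' or 'fas'")
    host, directory = SERVERS[server]
    return [f"ftp://{host}{directory}/{name}.{fmt}.Z" for name in COMPONENTS]


for url in nrdb_urls("ExPASy"):
    print(url)
```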

Additional notes

o The SWISS-PROT  file continuously  grows as new annotated sequences are
  added.

o The TrEMBL file decreases in size as sequences are moved out of that
  section after being annotated and moved into SWISS-PROT. Four times a
  year a new release of TrEMBL is built at EBI; at this point the TrEMBL
  file increases in size as it then includes all of the new data (see
  next section) that has accumulated since the last release.

o The TrEMBL_New file starts as a very small file and grows in size until
  a new release of TrEMBL is available.

o SWISS-PROT and  TrEMBL share  the same  system  of  accession  numbers.
  Therefore you  will not  find any  primary accession  number duplicated
  between the  two sections. A TrEMBL entry (and its associated accession
  number(s)) can either move to SWISS-PROT as new entry or be merged with
  an existing  SWISS-PROT  entry.  In  the  latter  case,  the  accession
  number(s) of  that TrEMBL  entry are  added to  that of  the SWISS-PROT
  entry.

o TrEMBL_New does  not  have  real  accession  numbers.  However  it  was
  necessary to  have an AC line so as to be able to use it with different
  software products.  This AC  line contains a temporary identifier which
  consists of  the pID (protein identifier) of the coding sequence in the
  parent nucleotide sequence.

o While these three files allow you to build what we call a non-redundant
  database, this description is not strictly accurate. Without going into
  a long explanation, we can say that this is currently the best attempt
  at providing a complete selection of protein sequence entries while
  trying to eliminate redundancies. Although SWISS-PROT is almost
  completely (99.994%) non-redundant, TrEMBL is far from being
  non-redundant and the combination of SWISS-PROT + TrEMBL is even less
  so.

o To describe  to your  users the  version of  the non-redundant database
  that you  are providing  them with,  you should  use a statement of the
  form:

     SWISS-PROT release 37 and updates until <current_date>;
     TrEMBL release  8  minus  data  integrated  into  SWISS-PROT  as  of
     <current_date>;
     New preliminary TrEMBL entries created since release 8 of TrEMBL
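
A statement of this form can be produced automatically each time the files
are refreshed. The Python sketch below is one possible way to do it; the
function name nrdb_version_statement is hypothetical and the ISO date
notation is our choice, not a requirement.

```python
from datetime import date


def nrdb_version_statement(sprot_release, trembl_release, current_date):
    """Format the three-line description of a non-redundant database build."""
    day = current_date.isoformat()
    return "\n".join([
        f"SWISS-PROT release {sprot_release} and updates until {day};",
        f"TrEMBL release {trembl_release} minus data integrated into "
        f"SWISS-PROT as of {day};",
        f"New preliminary TrEMBL entries created since release "
        f"{trembl_release} of TrEMBL",
    ])


print(nrdb_version_statement(37, 8, date.today()))
```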




                         8.  ENZYME and PROSITE


8.1  The ENZYME data bank

Release 24.0  of the  ENZYME data  bank is distributed with release 37 of
SWISS-PROT. ENZYME  release 24.0  contains information  relative to  3704
enzymes. It  differs from  the previous release (23 of July 1998) in that
we have  converted the  CA (Catalytic Activity) and DI (DIsease) lines to
mixed-case characters.  The conversion  of the  ENZYME database  from ALL
UPPER-CASE to mixed-case is therefore completed.

For example, what was previously:

ID   1.14.15.4
DE   Steroid 11-beta-monooxygenase.
AN   Steroid 11-beta-hydroxylase.
AN   Steroid 11-beta/18-hydroxylase.
AN   Cytochrome p450 XIB1.
CA   A STEROID + REDUCED ADRENAL FERREDOXIN + O(2) = AN 11-BETA-
CA   HYDROXYSTEROID + OXIDIZED ADRENAL FERREDOXIN + H(2)O.
CF   Heme-thiolate.
CC   -!- Also hydroxylates steroids at the 18-position, and converts
CC       18-hydroxycorticosterone into aldosterone.
DI   ADRENAL HYPERPLASIA IV; MIM:202010.
PR   PROSITE; PDOC00081;
DR   P15150, CPN1_BOVIN;  Q64408, CPN1_CAVPO;  P15538, CPN1_HUMAN;
DR   P97720, CPN1_MESAU;  Q29527, CPN1_PAPHA;  Q29552, CPN1_PIG  ;
DR   Q92104, CPN1_RANCA;  P15393, CPN1_RAT  ;  P51663, CPN1_SHEEP;
DR   P19099, CPN2_HUMAN;  Q64658, CPN2_MESAU;  P15539, CPN2_MOUSE;
DR   P30099, CPN2_RAT  ;  P30100, CPN3_RAT  ;
//

is now:

ID   1.14.15.4
DE   Steroid 11-beta-monooxygenase.
AN   Steroid 11-beta-hydroxylase.
AN   Steroid 11-beta/18-hydroxylase.
AN   Cytochrome p450 XIB1.
CA   A steroid + reduced adrenal ferredoxin + O(2) = an 11-beta-
CA   hydroxysteroid + oxidized adrenal ferredoxin + H(2)O.
CF   Heme-thiolate.
CC   -!- Also hydroxylates steroids at the 18-position, and converts
CC       18-hydroxycorticosterone into aldosterone.
DI   Adrenal hyperplasia IV; MIM:202010.
PR   PROSITE; PDOC00081;
DR   P15150, CPN1_BOVIN;  Q64408, CPN1_CAVPO;  P15538, CPN1_HUMAN;
DR   P97720, CPN1_MESAU;  Q29527, CPN1_PAPHA;  Q29552, CPN1_PIG  ;
DR   Q92104, CPN1_RANCA;  P15393, CPN1_RAT  ;  P51663, CPN1_SHEEP;
DR   P19099, CPN2_HUMAN;  Q64658, CPN2_MESAU;  P15539, CPN2_MOUSE;
DR   P30099, CPN2_RAT  ;  P30100, CPN3_RAT  ;
//

In this  release, we  have also updated and added a significant number of
DI (Disease) lines and added synonyms (AN lines) to a number of entries.
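
For users who process ENZYME files with their own programs, the following
Python sketch shows one way to read an entry such as the one above. It
relies only on what the example shows (a two-letter line code, data
starting in column 6, and '//' as terminator). The function names are
hypothetical, and we assume that a CA line ending with a hyphen continues
a hyphenated name on the next line.

```python
def parse_enzyme_entry(text):
    """Group the lines of one ENZYME entry by their two-letter code."""
    fields = {}
    for line in text.splitlines():
        if line.startswith("//"):
            break  # end of entry
        if len(line) < 6:
            continue  # skip blank or malformed lines
        code, data = line[:2], line[5:].rstrip()
        fields.setdefault(code, []).append(data)
    return fields


def catalytic_activity(fields):
    """Join the CA lines of an entry into a single string."""
    joined = ""
    for part in fields.get("CA", []):
        if joined.endswith("-"):
            # Assumption: the hyphen belongs to the name split across lines
            # (e.g. '11-beta-' followed by 'hydroxysteroid').
            joined += part
        elif joined:
            joined += " " + part
        else:
            joined = part
    return joined
```

Since the CA and DI lines are now in mixed case, programs that compared
these lines against upper-case strings may need to be adapted.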

The WWW  version of  ENZYME on  ExPASy now  includes links  to the BRENDA
database of enzymes. See:

  http://www.uni-koeln.de/math-nat-fak/biochemie/ds/dsbren_e.htm


8.2  The PROSITE data bank

Release 15.0  of the  PROSITE data bank is distributed with release 36 of
SWISS-PROT. This  release of  PROSITE contains 1014 documentation entries
that describe 1'352 different patterns, rules and profiles/matrices.



                       9. WE NEED YOUR HELP !


We welcome feedback from our users. We would especially appreciate it if
you notified us when you find that sequences belonging to your field of
expertise are missing from the database. We would also like to be
notified about annotations to be updated, if, for example, the function
of a protein has been clarified or if new information about post-
translational modifications has become available. To facilitate this
feedback we offer, on the ExPASy WWW server, a form that allows the
submission of updates and/or corrections to SWISS-PROT:

             http://www.expasy.ch/sprot/sp_update_form.html

It is also possible, from any entry in SWISS-PROT displayed by the ExPASy
server, to  submit updates  and/or corrections for that particular entry.
Finally, you  can also  send your  comments by  electronic  mail  to  the
address:

                         swiss-prot@expasy.ch

Note that from January 1999, all update requests will be assigned a
unique identifier of the form 'UR-Xnnnn' (example: UR-A0123). This
identifier will be used internally by the SWISS-PROT staff at SIB and EBI
to track the progress of requests and will also be used in email
exchanges with the persons who submitted a request.
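
If you wish to recognize such identifiers in your own correspondence
tools, a simple pattern is sufficient. The Python sketch below assumes,
from the example UR-A0123, that 'X' stands for one upper-case letter and
'nnnn' for four digits; the function name is hypothetical.

```python
import re

# Assumption: 'X' is one upper-case letter and 'nnnn' is four digits.
UPDATE_REQUEST_RE = re.compile(r"UR-[A-Z][0-9]{4}")


def is_update_request_id(text):
    """Return True if text looks like an update request identifier."""
    return UPDATE_REQUEST_RE.fullmatch(text) is not None


print(is_update_request_id("UR-A0123"))
```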



                    10. IMPORTANT ANNOUNCEMENT


It has become obvious in recent years that the tremendous increase in
data flow has created a requirement for resources which cannot be
addressed in full by public funding. This is causing databases to fall
behind the research. We believe that the only solution to the resource
shortfall is to ask commercial users to participate by paying a license
fee. No fee will be charged to academic users, nor will any restriction
be imposed on their use or reuse of the data. Both SWISS-PROT and PROSITE
are affected by these changes; ENZYME is not.

A document fully describing the impact of this change on SWISS-PROT is
available with the SWISS-PROT distribution files on FTP (sp_info.txt).
You can also access the document as well as other relevant ones from:

                     http://www.expasy.ch/announce/
                     http://www.ebi.ac.uk/news.html

If you  do not  have the  time to  read this document, the most important
take-home message is that these changes should not have any impact on the
way SWISS-PROT  or PROSITE  are accessed or redistributed. Academic users
will not be affected by these changes. Industrial end-users will also not
directly be  affected as long as their employer pays the license fee. The
same holds  true  for  bioinformatics  companies.  Academic  software  or
database  developers  as  well  as  providers  of  database  distribution
services will  be only minimally affected by these changes. We hope to be
able to  keep the  spirit of SWISS-PROT and PROSITE alive and at the same
time ensure  their long-term  financial survival.  We sincerely  hope and
believe that  in the next two years the only change that will matter will
be the increase in scope and timeliness of the databases.


----------------------------------------------------------------------------
SWISS-PROT is copyright.  It is produced through a collaboration between the
Swiss Institute  of  Bioinformatics   and the EMBL Outstation - the European
Bioinformatics Institute. There are no restrictions on its use by non-profit
institutions as long as its  content is in no way modified. Usage by and for
commercial entities requires a license agreement.  For information about the
licensing  scheme  see: http://www.isb-sib.ch/announce/ or send  an email to
license@isb-sib.ch.
----------------------------------------------------------------------------

   ========================================================================


                         APPENDIX A: SOME STATISTICS


   A.1  Amino acid composition

        A.1.1  Composition in percent for the complete data bank

   Ala (A) 7.58   Gln (Q) 3.97   Leu (L) 9.42   Ser (S) 7.12
   Arg (R) 5.16   Glu (E) 6.37   Lys (K) 5.95   Thr (T) 5.67
   Asn (N) 4.45   Gly (G) 6.84   Met (M) 2.37   Trp (W) 1.23
   Asp (D) 5.28   His (H) 2.24   Phe (F) 4.09   Tyr (Y) 3.18
   Cys (C) 1.66   Ile (I) 5.81   Pro (P) 4.90   Val (V) 6.58

   Asx (B) 0.001  Glx (Z) 0.001  Xaa (X) 0.01


        A.1.2  Classification of the amino acids by their frequency

   Leu, Ala, Ser, Gly, Val, Glu, Lys, Ile, Thr, Asp, Arg, Pro, Asn, Phe,
   Gln, Tyr, Met, His, Cys, Trp
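
   The classification in A.1.2 follows directly from the percentages in
   A.1.1. As a small illustration (the variable names are ours), the
   following Python sketch sorts the twenty standard amino acids by
   decreasing frequency:

```python
# Percentages copied from table A.1.1 (standard amino acids only).
COMPOSITION = {
    "Ala": 7.58, "Arg": 5.16, "Asn": 4.45, "Asp": 5.28, "Cys": 1.66,
    "Gln": 3.97, "Glu": 6.37, "Gly": 6.84, "His": 2.24, "Ile": 5.81,
    "Leu": 9.42, "Lys": 5.95, "Met": 2.37, "Phe": 4.09, "Pro": 4.90,
    "Ser": 7.12, "Thr": 5.67, "Trp": 1.23, "Tyr": 3.18, "Val": 6.58,
}

# Sort the three-letter codes from the most to the least frequent.
ranking = sorted(COMPOSITION, key=COMPOSITION.get, reverse=True)
print(", ".join(ranking))
```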



   A.2  Distribution of the sequences by their organism of origin

   Total number of species represented in this release of SWISS-PROT: 6307

   The first twenty species represent 36880 sequences: 47.3 % of the total
   number of entries.


   A.2.1 Table of the frequency of occurrence of species

        Species represented 1x: 2929
                            2x:  984
                            3x:  503
                            4x:  340
                            5x:  244
                            6x:  216
                            7x:  161
                            8x:  116
                            9x:  107
                           10x:   65
                       11- 20x:  297
                       21- 50x:  185
                       51-100x:   74
                         >100x:   86
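
   The classes of table A.2.1 add up to the total number of species given
   above. A short Python sketch (variable names are ours) makes the check
   explicit:

```python
# Number of species per class, copied from table A.2.1.
SPECIES_BY_OCCURRENCE = {
    "1x": 2929, "2x": 984, "3x": 503, "4x": 340, "5x": 244,
    "6x": 216, "7x": 161, "8x": 116, "9x": 107, "10x": 65,
    "11-20x": 297, "21-50x": 185, "51-100x": 74, ">100x": 86,
}

total_species = sum(SPECIES_BY_OCCURRENCE.values())
print(total_species)  # 6307, the total number of species in this release
```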


   A.2.2  Table of the most represented species

    Number   Frequency     Species
         1        5146     Human
         2        4806     Baker's yeast (Saccharomyces cerevisiae)
         3        4476     Escherichia coli
         4        3387     Mouse
         5        2550     Rat
         6        2046     Bacillus subtilis
         7        1956     Caenorhabditis elegans
         8        1701     Haemophilus influenzae
         9        1406     Fission yeast (Schizosaccharomyces pombe)
        10        1307     Methanococcus jannaschii
        11        1126     Bovine
        12        1064     Fruit fly (Drosophila melanogaster)
        13         918     Mycobacterium tuberculosis
        14         862     Chicken
        15         792     Arabidopsis thaliana (Mouse-ear cress)
        16         723     Salmonella typhimurium
        17         711     African clawed frog (Xenopus laevis)
        18         670     Synechocystis sp. (strain PCC 6803)
        19         651     Pig
        20         582     Rabbit
        21         490     Mycoplasma pneumoniae
        22         470     Mycoplasma genitalium
        23         428     Maize
        24         403     Rhizobium sp. (strain NGR234)
        25         367     Helicobacter pylori
        26         363     Pseudomonas aeruginosa
        27         332     Rice
        28         296     Dog
        29         295     Tobacco
        30         285     Slime mold (Dictyostelium discoideum)
        31         274     Treponema pallidum
        32         272     Bacteriophage T4
        33         268     Sheep
        34         262     Mycobacterium leprae
        35         260     Borrelia burgdorferi
        36         256     Pea
        37         253     Vaccinia virus (strain Copenhagen)
        38         235     Methanobacterium thermoautotrophicum
                   235     Soybean
        40         224     Neurospora crassa
        41         222     Staphylococcus aureus
        42         221     Barley
        43         219     Porphyra purpurea
        44         209     Wheat
        45         203     Tomato
        46         201     Rhodobacter capsulatus
        47         199     Potato
        48         198     Klebsiella pneumoniae
        49         194     Candida albicans
        50         193     Human cytomegalovirus (strain AD169)
        51         192     Bacillus stearothermophilus
        52         189     Archaeoglobus fulgidus
                   189     Pseudomonas putida
        54         186     Vaccinia virus (strain WR)
        55         170     Agrobacterium tumefaciens
        56         169     Spinach
        57         166     Guinea pig
        58         159     Chlamydomonas reinhardtii
        59         158     Rhizobium meliloti
        60         154     Autographa californica nuclear polyhedrosis virus
        61         150     Aspergillus nidulans
                   150     Marchantia polymorpha (Liverwort)
        63         148     Streptomyces coelicolor
                   148     Guillardia theta (Cryptomonas phi)
        65         147     Cyanophora paradoxa
        66         146     Variola virus
        67         144     Golden hamster
        68         143     Horse
        69         140     Lactococcus lactis (subsp. lactis)
        70         139     Odontella sinensis
        71         134     Orgyia pseudotsugata multicapsid polyhedrosis virus
        72         132     Kluyveromyces lactis
        73         127     Trypanosoma brucei brucei
        74         126     Synechococcus sp. (strain PCC 7942)
        75         125     Thermus aquaticus (subsp. thermophilus)
        76         120     Alcaligenes eutrophus
        77         115     Bombyx mori (Silk moth)
                   115     Anabaena sp. (strain PCC 7120)
        79         114     Bradyrhizobium japonicum
        80         109     Yersinia enterocolitica
        81         107     Streptococcus pneumoniae
        82         105     Brachydanio rerio (Zebrafish)
        83         104     Oncorhynchus mykiss (Rainbow trout)
                   104     Brassica napus (Rape)
        85         102     Rhodobacter sphaeroides
        86         101     Cat



   A.3  Distribution of the sequences by size

               From   To  Number             From   To   Number
                  1-  50    3186             1001-1100      708
                 51- 100    6584             1101-1200      537
                101- 150    9506             1201-1300      365
                151- 200    7467             1301-1400      246
                201- 250    7006             1401-1500      202
                251- 300    6508             1501-1600      127
                301- 350    6115             1601-1700      115
                351- 400    6164             1701-1800       86
                401- 450    4707             1801-1900       93
                451- 500    4450             1901-2000       62
                501- 550    3351             2001-2100       34
                551- 600    2258             2101-2200       68
                601- 650    1768             2201-2300       70
                651- 700    1292             2301-2400       35
                701- 750    1146             2401-2500       41
                751- 800     941             >2500          222
                801- 850     740
                851- 900     781
                901- 950     536
                951-1000     460
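
   Summing the size classes above gives the total number of entries, which
   in turn allows the share of the first twenty species quoted in A.2
   (36880 sequences) to be recomputed. A small Python sketch, with
   variable names of our own choosing:

```python
# Counts copied from table A.3: classes 1-50 up to 951-1000 ...
COUNTS_UP_TO_1000 = [
    3186, 6584, 9506, 7467, 7006, 6508, 6115, 6164, 4707, 4450,
    3351, 2258, 1768, 1292, 1146, 941, 740, 781, 536, 460,
]
# ... and classes 1001-1100 up to 2401-2500, followed by >2500.
COUNTS_ABOVE_1000 = [
    708, 537, 365, 246, 202, 127, 115, 86,
    93, 62, 34, 68, 70, 35, 41, 222,
]

FIRST_TWENTY_SPECIES = 36880  # figure quoted in section A.2

total_entries = sum(COUNTS_UP_TO_1000) + sum(COUNTS_ABOVE_1000)
share = round(100.0 * FIRST_TWENTY_SPECIES / total_entries, 1)
print(total_entries, share)  # 77977 47.3
```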



   A.4  Longest sequences

   The longest sequences (>=4000 residues) are listed here:

                              HTS1_COCCA  5217
                              MUC2_HUMAN  5179
                              FAT_DROME   5147
                              RYNR_RABIT  5037
                              RYNR_PIG    5035
                              RYNR_HUMAN  5032
                              RYNC_RABIT  4969
                              LRP_CAEEL   4753
                              DYHC_DICDI  4725
                              PLEC_RAT    4687
                              LRP2_RAT    4660
                              LRP2_HUMAN  4655
                              DYHC_RAT    4644
                              DYHC_DROME  4639
                              DYHC_CAEEL  4568
                              DYHB_CHLRE  4568
                              APB_HUMAN   4563
                              APOA_HUMAN  4548
                              LRP1_HUMAN  4544
                              LRP1_CHICK  4543
                              DYHC_PARTE  4540
                              RRPA_CVMJH  4488
                              DYHG_CHLRE  4485
                              DYHC_ANTCR  4466
                              DYHC_TRIGR  4466
                              GRSB_BACBR  4451
                              PKSK_BACSU  4447
                              PKSL_BACSU  4427
                              PGBM_HUMAN  4393
                              YP73_CAEEL  4385
                              DYHC_NEUCR  4367
                              DYHC_NECHA  4349
                              DYHC_EMENI  4344
                              PKD1_HUMAN  4303
                              DYHC_SCHPO  4196
                              DYHC_YEAST  4092
                              RRPA_CVH22  4085
                              RRPL_DUGBV  4036


   A.5  Statistics for journal citations


   Total number of journals cited in this release of SWISS-PROT: 955


   A.5.1 Table of the frequency of journal citations

        Journals cited 1x: 351
                       2x: 130
                       3x:  79
                       4x:  43
                       5x:  33
                       6x:  26
                       7x:  15
                       8x:  17
                       9x:  15
                      10x:  12
                  11- 20x:  66
                  21- 50x:  67
                  51-100x:  25
                    >100x:  76


   A.5.2  List of the most cited journals in SWISS-PROT

   Nb    Citations       Journal abbreviation
   --    ---------       ----------------------------------
    1    6476            J. Biol. Chem.
    2    3931            Proc. Natl. Acad. Sci. U.S.A.
    3    3418            Nucleic Acids Res.
    4    2815            J. Bacteriol.
    5    2606            Gene
    6    2119            FEBS Lett.
    7    1994            Eur. J. Biochem.
    8    1843            Biochem. Biophys. Res. Commun.
    9    1811            Biochemistry
   10    1751            EMBO J.
   11    1650            Nature
   12    1484            Biochim. Biophys. Acta
   13    1398            J. Mol. Biol.
   14    1264            Cell
   15    1214            Mol. Cell. Biol.
   16     981            Mol. Gen. Genet.
   17     973            Plant Mol. Biol.
   18     941            Genomics
   19     922            Biochem. J.
   20     833            Science
   21     811            Mol. Microbiol.
   22     778            Virology
   23     702            J. Biochem.
   24     525            J. Virol.
   25     482            Yeast
   26     472            J. Cell Biol.
   27     464            J. Gen. Virol.
   28     452            Plant Physiol.
   29     431            Hum. Mutat.
   30     419            Genes Dev.
   31     402            Hum. Mol. Genet.
   32     355            J. Immunol.
   33     344            Arch. Biochem. Biophys.
   34     339            Infect. Immun.
   35     324            Curr. Genet.
   36     322            Oncogene
   37     309            Mol. Biochem. Parasitol.
   38     295            FEMS Microbiol. Lett.
   39     291            Structure
   40     269            Am. J. Hum. Genet.
   41     264            Biol. Chem. Hoppe-Seyler
   42     261            Nat. Genet.
   43     250            Development
   44     244            J. Clin. Invest.
   45     238            Mol. Endocrinol.
   46     238            Microbiology
   47     225            J. Mol. Evol.
   48     220            J. Gen. Microbiol.
   49     220            Genetics
   50     219            Nat. Struct. Biol.
   51     213            Hoppe-Seyler's Z. Physiol. Chem.
   52     200            DNA Cell Biol.
   53     199            Hum. Genet.
   54     196            Appl. Environ. Microbiol.
   55     186            J. Exp. Med.
   56     183            Blood
   57     182            Dev. Biol.
   58     176            Protein Sci.
   59     175            Neuron
   60     154            Immunogenetics
   61     152            DNA
   62     146            Endocrinology
   63     146            DNA Seq.
   64     136            Plant Cell
   65     122            Cancer Res.
   66     116            Plant J.
   67     116            Hemoglobin
   68     115            Bioorg. Khim.
   69     115            Biochimie
   70     112            Mol. Biol. Evol.
   71     112            J. Neurochem.
   72     109            Virus Res.
   73     109            Agric. Biol. Chem.
   74     107            Comp. Biochem. Physiol.
   75     106            Brain Res. Mol. Brain Res.
   76     101            Mech. Dev.


   ========================================================================


   APPENDIX B: RELATIONSHIPS BETWEEN SWISS-PROT AND SOME BIOMOLECULAR
               DATABASES

   The current  status of  the relationships (cross-references) between
   SWISS-PROT and some biomolecular databases is shown in the following
   schematic:


                         ***********************
                         *  EMBL Nucleotide    *
                         *  Sequence Database  *
                         *       [EBI]         *
                         ***********************
                           ^ ^ ^  ^  ^ ^ ^ ^ ^
******************         | | |  I  | | | | |         **********************
* FlyBase        * <-------+ | |  I  | | | | +-------> * MGD [Mouse]        *
******************         | | |  I  | | | | |         **********************
                           | | |  I  | | | | |
******************         | | |  I  | | | | |         **********************
* SubtiList      * <---------+ |  I  | | | +---------> * GCRDb [7TM recep.] *
* [B.subtilis]   *         | | |  I  | | | | |         **********************
******************         | | |  I  | | | | |
                           | | |  I  | | | | |         **********************
******************         | | |  I  | | +-----------> * EcoGene [E.coli]   *
* Mendel [Plant] * <-----+ | | |  I  | | | | |         **********************
******************       | | | |  I  | | | | |
                         | | | |  I  | | | | |         **********************
******************       | | | |  I  +---------------> * SGD [Yeast]        *
* MaizeDb        * <-----------+  I  | | | | |         **********************
* [Zea mays]     *       | | | |  I  | | | | |
******************       | | | |  I  | | | | |         **********************
                         | | | |  I  | +-------------> * DictyDB [D.disco.] *
******************       | | | |  I  | | | | |         **********************
* WormPep        *       | | | |  I  | | | | |
* [C.elegans]    * <---+ | | | |  I  | | | | |         **********************
******************     | | | | |  I  | | | | | +-----> * ENZYME [Nomencl.]  *
                       | | | | |  I  | | | | | |       **********************
******************     | v v v v  v  v v v v v v           v
* REBASE         *     *************************       **********************
* [Restriction   * <-- *   SWISS-PROT          * ----> * OMIM [Human]       *
*  enzymes]      *     *   Protein Sequence    *       **********************
******************     *   Data Bank           *
                       *************************       **********************
******************      ^ ^ ^ ^ ^ ^ ^ | ^ ^ ^          * ECO2DBASE     [2D] *
* StyGene        *      | | | | | | | | | | +--------> **********************
* [S.Typhimurium]* <----+ | | | | | | | | |
******************        | | | | | | | | |            **********************
                          | | | | | | | | +----------> * Maize-2DPAGE  [2D] *
******************        | | | | | | | |              **********************
* TRANSFAC       * <------+ | | | | | | |
******************          | | | | | | |              **********************
                            | | | | | | +------------> * SWISS-2DPAGE  [2D] *
******************          | | | | | |                **********************
* Harefield [2D] * <--------+ | | | | |
******************            | | | | |                **********************
                              | | | | +--------------> * Aarhus/Ghent  [2D] *
******************            | | | |                  **********************
* PROSITE        *            | | | |
* [Patterns and  * <----------+ | | +----------------> **********************
* profiles]      *              | |                    * YEPD [Yeast]  [2D] *
******************              | +----------------+   **********************
             |                  v                  |
             |          ***********************    +-> **********************
             +--------> * PDB [3D structures] * <----- * HSSP [3D similar.] *
                        ***********************        **********************

   =End=of=SWISS-PROT=release=37=notes=====================================

  

Swiss-Prot release 36.0

Published July 1, 1998
                   SWISS-PROT RELEASE 36.0 RELEASE NOTES

 !! Important: do not forget to read section 11 of these release notes. It
 contains an important announcement relevant to SWISS-PROT and PROSITE !!


                   1.  INTRODUCTION


 Release 36.0 of  SWISS-PROT contains 74'019  sequence entries,  comprising
 26'840'295 amino acids abstracted from 59'911 references.  This represents
 an increase  of  7% over  release  35. The  growth  of the  data  bank  is
 summarized below.

 Release      Date           Number of       Number of amino
                               entries                 acids
    2.0       09/86               3939               900 163
    3.0       11/86               4160               969 641
    4.0       04/87               4387             1 036 010
    5.0       09/87               5205             1 327 683
    6.0       01/88               6102             1 653 982
    7.0       04/88               6821             1 885 771
    8.0       08/88               7724             2 224 465
    9.0       11/88               8702             2 498 140
   10.0       03/89              10008             2 952 613
   11.0       07/89              10856             3 265 966
   12.0       10/89              12305             3 797 482
   13.0       01/90              13837             4 347 336
   14.0       04/90              15409             4 914 264
   15.0       08/90              16941             5 486 399
   16.0       11/90              18364             5 986 949
   17.0       02/91              20024             6 524 504
   18.0       05/91              20772             6 792 034
   19.0       08/91              21795             7 173 785
   20.0       11/91              22654             7 500 130
   21.0       03/92              23742             7 866 596
   22.0       05/92              25044             8 375 696
   23.0       08/92              26706             9 011 391
   24.0       12/92              28154             9 545 427
   25.0       04/93              29955            10 214 020
   26.0       07/93              31808            10 875 091
   27.0       10/93              33329            11 484 420
   28.0       02/94              36000            12 496 420
   29.0       06/94              38303            13 464 008
   30.0       10/94              40292            14 147 368
   31.0       02/95              43470            15 335 248
   32.0       11/95              49340            17 385 503
   33.0       02/96              52205            18 531 384
   34.0       10/96              59021            21 210 389
   35.0       11/97              69113            25 083 768
   36.0       07/98              74019            26 840 295
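
 The 7% growth figure quoted above can be checked against the last two rows
 of the table. The following is a minimal sketch; the variable and function
 names are illustrative only and are not part of any SWISS-PROT tool.

```python
# Figures copied from the rows for releases 35.0 and 36.0 in the table above.
release_35 = {"entries": 69113, "amino_acids": 25083768}
release_36 = {"entries": 74019, "amino_acids": 26840295}


def percent_increase(old, new):
    """Return the relative growth from old to new, in percent."""
    return (new - old) / old * 100.0


entry_growth = percent_increase(release_35["entries"], release_36["entries"])
residue_growth = percent_increase(
    release_35["amino_acids"], release_36["amino_acids"]
)

print(f"entries:     +{entry_growth:.1f}%")
print(f"amino acids: +{residue_growth:.1f}%")
```

 Both the entry count and the amino acid count grow by roughly 7%, which is
 consistent with the statement in the introduction.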



     2.  DESCRIPTION OF THE CHANGES MADE TO SWISS-PROT SINCE RELEASE 35

 2.1  Sequences and annotations

 4'976 sequences have been added since release 35, the sequence data of 712
 existing entries has  been updated  and the annotations  of 9'954  entries
 have been revised.


 2.2  What's happening with the model organisms

 We have  selected a  number of  organisms that  are the  target of  genome
 sequencing and/or mapping projects and for which we intend to:

 . Be as  complete as possible.  All sequences  available at  a given  time
   should  be  immediately  included  in  SWISS-PROT.  This  also  includes
   sequence corrections and updates;
 . Provide a higher level of annotation;
 . Provide cross-references to specialized database(s) that  contain, among
   other data, some genetic information about the genes that code for these
   proteins;
 . Provide specific indices or documents.

 What was done since the last release or in preparation for the next
 release concerning model organisms:

 - We have  continued  our  effort  in  catching up  with  the  backlog  of
   sequences from other model organisms.  In particular we added about  350
   entries from human and  from E.coli, 300 from  mouse, 250 from  S.pombe,
   200 from M.jannaschii, 150 from C.elegans, 100 from B.subtilis, H.pylori
   and from M.tuberculosis.

 - We  plan to  finish  as  quickly  as  possible  the  annotation  of  the
   Escherichia coli and Haemophilus  influenzae sequence entries which  are
   not yet part of SWISS-PROT.

 Here is the current status of the model organisms in SWISS-PROT:

 Organism        Database            Index file       Number of
                 cross-referenced                     sequences
 --------------  ----------------    --------------   ---------
 A.thaliana      None yet            In preparation         719
 B.subtilis      SubtiList           SUBTILIS.TXT          1970
 C.albicans      None yet            CALBICAN.TXT           192
 C.elegans       Wormpep             CELEGANS.TXT          1887
 D.discoideum    DictyDB             DICTY.TXT              280
 D.melanogaster  FlyBase             FLY.TXT               1042
 E.coli          EcoGene             ECOLI.TXT             4416
 H.influenzae    HiDB (TIGR)         HAEINFLU.TXT          1693
 H.sapiens       MIM                 MIMTOSP.TXT           4980
 H.pylori        HpDB (TIGR)         HPYLORI.TXT            334
 M.genitalium    MgDB (TIGR)         MGENITAL.TXT           470
 M.musculus      MGD                 MGDTOSP.TXT           3253
 M.jannaschii    MjDB (TIGR)         MJANNASC.TXT          1283
 M.tuberculosis  None yet            None yet               873
 S.cerevisiae    SGD                 YEAST.TXT             4787
 S.typhimurium   StyGene             SALTY.TXT              706
 S.pombe         None yet            POMBE.TXT             1315
 S.solfataricus  None yet            None yet                72

 Collectively the entries from the above model organisms represent 40.9% of
 all SWISS-PROT entries.
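
 As a quick check of the percentage above, the sequence counts from the
 table can be summed and compared with the total number of entries in
 release 36 (74'019, from the introduction). This is only an illustrative
 sketch; the dictionary below is a hand-made copy of the table.

```python
# Number of sequences per model organism, copied from the table above.
model_organism_counts = {
    "A.thaliana": 719, "B.subtilis": 1970, "C.albicans": 192,
    "C.elegans": 1887, "D.discoideum": 280, "D.melanogaster": 1042,
    "E.coli": 4416, "H.influenzae": 1693, "H.sapiens": 4980,
    "H.pylori": 334, "M.genitalium": 470, "M.musculus": 3253,
    "M.jannaschii": 1283, "M.tuberculosis": 873, "S.cerevisiae": 4787,
    "S.typhimurium": 706, "S.pombe": 1315, "S.solfataricus": 72,
}

total_entries = 74019  # all SWISS-PROT entries in release 36

model_total = sum(model_organism_counts.values())
model_share = model_total / total_entries * 100.0

print(f"{model_total} model organism entries = {model_share:.1f}% of the total")
```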


 2.3  Changes affecting the accession numbers

 With the creation  of the TrEMBL  database (see section  6) and the  rapid
 increase in the amount of  sequence data, we are  faced with a problem  of
 availability of accession numbers.  Currently we use a  system based on  a
 one-letter prefix followed by 5  digits. This system was also used  by the
 nucleotide sequence databases which had originally reserved for SWISS-PROT
 the prefix letters 'P' and 'Q'. The nucleotide databases, having run out of
 space (due mainly to ESTs), have been forced to start using a new format
 based on a two-letter prefix followed by 6 digits.

 We have used up all possible numbers with 'P' and 'Q' and  the only letter
 prefix which was not used by the nucleotide database is 'O'. As we believe
 that changing the format of the accession numbers to that used  now by the
 nucleotide database would create havoc  on the numerous software  packages
 using SWISS-PROT, we have  decided to keep a  system of accession  numbers
 based on a six-character code, but with the following changes:

 1)   We have  started  using 'O'.  This  extra letter  should  allow the
 continuation of  the present  format (1  prefix letter  + 5  digits) for
 approximately one year.
 2)   Once we have used up 'O', we will introduce a system based on the
 following format:

      1        2       3          4            5            6
     [O,P,Q]  [0-9]  [A-Z, 0-9]  [A-Z, 0-9]   [A-Z, 0-9]   [0-9]

 What the above means is that  we will keep a six-character code,  but that
 in positions  3, 4  and 5  of this  code any  combination  of letters  and
 numbers can be present. This format allows a total of 14 million accession
 numbers (up from 300'000 with the current system).

 We only allow numbers in positions 2 and 6 so that the SWISS-PROT
 accession numbers cannot be mistaken for gene names, acronyms, other
 types of accession numbers or ordinary words.

 Examples: P0A3S2, Q2ASD4, O13YX2, P9B123
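
 The extended format can be expressed as a regular expression, and the
 capacity figures quoted above follow directly from the number of symbols
 allowed at each position. The sketch below is an illustration only; the
 names ACCESSION_RE and is_valid_accession are hypothetical and do not
 belong to any SWISS-PROT software.

```python
import re

# Extended six-character format described in section 2.3:
#   position 1:    O, P or Q
#   position 2:    a digit
#   positions 3-5: a letter or a digit
#   position 6:    a digit
ACCESSION_RE = re.compile(r"[OPQ][0-9][A-Z0-9]{3}[0-9]")


def is_valid_accession(text):
    """Return True if text matches the extended accession number format."""
    return ACCESSION_RE.fullmatch(text) is not None


# Capacity: multiply the number of choices at each of the six positions.
extended_capacity = 3 * 10 * 36**3 * 10   # about 14 million
current_capacity = 3 * 10**5              # one letter + 5 digits

for accession in ["P0A3S2", "Q2ASD4", "O13YX2", "P9B123"]:
    print(accession, is_valid_accession(accession))

print(extended_capacity, current_capacity)
```

 Accession numbers in the present format (one prefix letter followed by 5
 digits) also match this pattern, so existing numbers remain valid.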


 2.4  Changes concerning the reference location line (RL)

 The (IN) prefix  used for books  is now  also used for  references to  the
 electronic Plant Gene Register (See http://www.tarweed.com/pgr/). Example:

 RL   (IN) PLANT GENE REGISTER PGR98-023.


 2.5  Cleaning up of the SIMILARITY comment line (CC) topic

 We started a major overhaul  of the "SIMILARITY" topic. We would  like the
 majority of the information stored in this topic to be  usable by computer
 programs (while being human-readable). We are therefore  standardizing the
 format of this topic using two different subformats. One to describe the
 family to which a protein belongs:

 CC   - !-  SIMILARITY: BELONGS TO THE {Name1} FAMILY [OF {Name2}].
 CC         [{Name3} SUBFAMILY.]

 Examples:

 CC   - !-  SIMILARITY: BELONGS TO THE 14-3-3 FAMILY.
 CC   - !-  SIMILARITY: BELONGS TO THE 6-PHOSPHOGLUCONATE DEHYDROGENASE
 CC         FAMILY.
 CC   - !-  SIMILARITY: BELONGS TO THE AAA FAMILY OF ATPASES.
 CC   - !-  SIMILARITY: BELONGS TO THE IRON/ASCORBATE-DEPENDENT FAMILY OF
 CC         OXIDOREDUCTASES.
 CC   - !-  SIMILARITY: BELONGS TO THE ANTP FAMILY OF HOMEOBOX PROTEINS.
 CC         "DEFORMED" SUBFAMILY.
 CC   - !-  SIMILARITY: BELONGS TO THE KINESIN-LIKE PROTEIN FAMILY. KINESIN
 CC         SUBFAMILY.

 And one to describe which domains are found in a given protein:

 CC   - !-  SIMILARITY: CONTAINS n {Name} [DOMAIN|REPEAT][S].

 Examples:

 CC   - !-  SIMILARITY: CONTAINS 1 FHA DOMAIN.
 CC   - !-  SIMILARITY: CONTAINS 45 EGF-LIKE DOMAINS.
 CC   - !-  SIMILARITY: CONTAINS 2 SH3 DOMAINS.
 CC   - !-  SIMILARITY: CONTAINS 2 SUSHI (SCR) REPEATS.

 We have already updated many entries in this release and plan to continue
 to do so for the next release.
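
 To illustrate how a program could make use of the two SIMILARITY
 subformats described in this section, here is a minimal parsing sketch. It
 assumes that the text following "SIMILARITY:" has already been extracted
 from the CC lines and that continuation lines have been joined. The
 function name and the returned dictionary keys are hypothetical.

```python
import re

# "BELONGS TO THE {Name1} FAMILY [OF {Name2}]. [{Name3} SUBFAMILY.]"
FAMILY_RE = re.compile(
    r"BELONGS TO THE (?P<family>.+?) FAMILY"
    r"(?: OF (?P<group>.+?))?\."
    r"(?: (?P<subfamily>.+?) SUBFAMILY\.)?"
)

# "CONTAINS n {Name} [DOMAIN|REPEAT][S]."
CONTAINS_RE = re.compile(
    r"CONTAINS (?P<count>\d+) (?P<name>.+?) (?P<kind>DOMAIN|REPEAT)S?\."
)


def parse_similarity(text):
    """Parse the text of a SIMILARITY comment; return None if unrecognized."""
    text = " ".join(text.split())  # collapse runs of spaces and line breaks
    match = FAMILY_RE.fullmatch(text)
    if match:
        return {"type": "family", **match.groupdict()}
    match = CONTAINS_RE.fullmatch(text)
    if match:
        return {
            "type": "contains",
            "count": int(match.group("count")),
            "name": match.group("name"),
            "kind": match.group("kind"),
        }
    return None


print(parse_similarity("BELONGS TO THE AAA FAMILY OF ATPASES."))
print(parse_similarity('BELONGS TO THE ANTP FAMILY OF HOMEOBOX PROTEINS. '
                       '"DEFORMED" SUBFAMILY.'))
print(parse_similarity("CONTAINS 2 SUSHI (SCR) REPEATS."))
```

 Comments that have not yet been converted to one of the two subformats
 simply return None, so a program can fall back to treating them as free
 text.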


 2.6  Changes concerning cross-references (DR line)

 We have added cross-references from  SWISS-PROT to the Mendel database,  a
 plant gene  nomenclature  database  from the  Commission  for  Plant  Gene
 Nomenclature (CPGN). These cross-references are present in the DR lines:

 Data bank identifier:  MENDEL
 Primary identifier  :  The Mendel accession number for a gene  in a  given
                        species.
 Secondary identifier:  Composed of the acronym of  the species  (generally
                        the same five-letter  code as that defined and used
                        by SWISS-PROT in the entry name), the gene name and
                        a number.
 Example:               DR   MENDEL; 294; Amahy;psbA;1.
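
 The structure of the MENDEL cross-reference can be illustrated with a
 short parsing sketch. It relies only on the layout shown in the example
 above; the function name and the dictionary keys are hypothetical, and the
 code assumes that the gene name itself does not contain a semicolon.

```python
def parse_mendel_dr(line):
    """Split a 'DR   MENDEL; ...' line into its documented components."""
    if not line.startswith("DR   "):
        raise ValueError("not a DR line")
    body = line[5:].strip().rstrip(".")
    # Data bank identifier; primary identifier; secondary identifier
    databank, primary, secondary = [p.strip() for p in body.split(";", 2)]
    if databank != "MENDEL":
        raise ValueError("not a MENDEL cross-reference")
    # The secondary identifier is: species acronym; gene name; number
    species, gene, number = [p.strip() for p in secondary.split(";")]
    return {
        "accession": primary,
        "species": species,
        "gene": gene,
        "number": int(number),
    }


print(parse_mendel_dr("DR   MENDEL; 294; Amahy;psbA;1."))
```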



                  3. PLANNED CHANGES

 3.1  Extension of the accession number system

 As already explained in detail under 2.3, we will extend the accession
 number system once we have used up the 'O' series of accession numbers.
 We expect this to happen in October 1998.


 3.2  Switch to the NCBI taxonomy

 To standardize the taxonomies used by different databases, we will change
 our taxonomy with release 37. We will switch to the NCBI taxonomy, which
 is already used as the common taxonomy by the DDBJ/EMBL/GenBank
 nucleotide sequence databases.


 3.3  Introduction of RT lines

 With release 37  we will introduce  a new  line type,  the RT (Reference
 Title) line. This  optional line will  be placed  between the  RA and RL
 line. The  RT line  gives the  title  of the  paper (or  other  work) as
 exactly as possible given the limitations of the computer character set.
 The form which will  be used is that  which would be used  in a citation
 rather than displayed at  the top of the  published paper. For instance,
 where journals capitalize major  title words this is  not preserved. The
 title is enclosed  in double quotes,  and may be  continued over several
 lines as necessary.  The title lines  are terminated by  a semicolon. An
 example of the use of RT lines is shown below:

 RT   "Sequence analysis of the genome of the unicellular cyanobacterium
 RT   Synechocystis sp. strain PCC6803. I. Sequence features in the 1 Mb
 RT   region from map positions 64% to 92% of the genome.";
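
 A program reading the planned RT lines has to join the continuation lines
 and remove the enclosing double quotes and the terminating semicolon. The
 following minimal sketch follows the description above; the function name
 is hypothetical, and joining the lines with a single space is an
 assumption.

```python
def parse_rt_lines(lines):
    """Join consecutive RT lines and return the bare title string."""
    parts = []
    for line in lines:
        if not line.startswith("RT"):
            raise ValueError("not an RT line: " + line)
        parts.append(line[2:].strip())  # drop the line code and padding
    text = " ".join(parts)
    # The title is enclosed in double quotes and terminated by a semicolon.
    if not (text.startswith('"') and text.endswith('";')):
        raise ValueError("malformed RT block")
    return text[1:-2]


rt_block = [
    'RT   "Sequence analysis of the genome of the unicellular cyanobacterium',
    "RT   Synechocystis sp. strain PCC6803. I. Sequence features in the 1 Mb",
    'RT   region from map positions 64% to 92% of the genome.";',
]
title = parse_rt_lines(rt_block)
print(title)
```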



                  4. STATUS OF THE DOCUMENTATION FILES

 SWISS-PROT is distributed with a large number of documentation files. Some
 of these  files have  been available  for a  long time  (the user  manual,
 release notes,  the  various  indices for  authors,  citations,  keywords,
 etc.), but many have been created recently and we are  continuously adding
 new files. Since release 35,  we have added three new document  files. The
 following table lists all the documents that are currently available.

 USERMAN.TXT    User manual
 RELNOTES.TXT   Release notes
 OLDRLNOT.TXT   Release notes for previous release [1,2]
 SHORTDES.TXT   Short description of entries in SWISS-PROT
 JOURLIST.TXT   List of abbreviations for journals cited [3]
 KEYWLIST.TXT   List of keywords in use
 SPECLIST.TXT   List of organism identification codes
 TISSLIST.TXT   List of tissues [4]
 EXPERTS.TXT    List of on-line experts for PROSITE and SWISS-PROT
 SUBMIT.TXT     Submission of sequence data to SWISS-PROT

 ACINDEX.TXT    Accession number index
 AUTINDEX.TXT   Author index
 CITINDEX.TXT   Citation index
 KEYINDEX.TXT   Keyword index
 SPEINDEX.TXT   Species index
 DELETEAC.TXT   Deleted accession number index

 7TMRLIST.TXT   List of 7-transmembrane G-linked receptors entries
 AATRNASY.TXT   List of aminoacyl-tRNA synthetases
 ALLERGEN.TXT   Nomenclature and index of allergen sequences
 BLOODGRP.TXT   List of blood group antigen proteins
 CALBICAN.TXT   Index   of  Candida  albicans  entries   and  their
                corresponding gene designations
 CDLIST.TXT     CD  nomenclature  for  surface  proteins  of  human
                leucocytes
 CELEGANS.TXT   Index  of Caenorhabditis elegans entries  and their
                corresponding gene Wormpep cross-references
 DICTY.TXT      Index   of  Dictyostelium  discoideum  entries  and
                their  corresponding gene designations  and DictyDb
                cross-references
 EC2DTOSP.TXT   Index  of  Escherichia coli  Gene-protein  database
                entries referenced in SWISS-PROT
 ECOLI.TXT      Index  of Escherichia coli K12  chromosomal entries
                and their corresponding EcoGene cross-references
 EMBLTOSP.TXT   Index  of   EMBL  Database  entries  referenced  in
                SWISS-PROT
 EXTRADOM.TXT   Nomenclature of extracellular domains
 FLY.TXT        Index  of  Drosophila  entries and  FlyBase  cross-
                references
 GLYCOSID.TXT   Classification  of glycosyl hydrolase  families and
                index of glycosyl hydrolase entries
 HAEINFLU.TXT   Index  of  Haemophilus  influenzae  RD  chromosomal
                entries
 HOXLIST.TXT    Vertebrate  homeotic Hox proteins: nomenclature and
                index
 HPYLORI.TXT    Index   of   Helicobacter   pylori   strain   26695
                chromosomal entries
 HUMCHR17.TXT   Index of protein  sequence entries encoded on human
                chromosome 17 [1]
 HUMCHR18.TXT   Index of protein  sequence entries encoded on human
                chromosome 18
 HUMCHR19.TXT   Index of protein  sequence entries encoded on human
                chromosome 19
 HUMCHR20.TXT   Index of protein  sequence entries encoded on human
                chromosome 20
 HUMCHR21.TXT   Index of protein  sequence entries encoded on human
                chromosome 21
 HUMCHR22.TXT   Index of protein  sequence entries encoded on human
                chromosome 22
 HUMCHRX.TXT    Index of protein  sequence entries encoded on human
                chromosome X
 HUMCHRY.TXT    Index of protein  sequence entries encoded on human
                chromosome Y
 HUMPVAR.TXT    Index of human proteins with sequence variants [1]
 INITFACT.TXT   List and index of translation initiation factors
 MIMTOSP.TXT    Index of MIM entries referenced in SWISS-PROT
 METALLO.TXT    Classification  of  metallothioneins and  index  of
                entries in SWISS-PROT
 MGDTOSP.TXT    Index of MGD entries referenced in SWISS-PROT
 MGENITAL.TXT   Index  of Mycoplasma genitalium chromosomal entries
 MJANNASC.TXT   Index of Methanococcus jannaschii entries
 NGR234.TXT     Table  of   putative  genes  in  Rhizobium  plasmid
                pNGR234a
 NOMLIST.TXT    List   of  nomenclature   related  references   for
                proteins
 PCC6803.TXT    Index of Synechocystis strain PCC 6803 entries
 PDBTOSP.TXT    Index  of X-ray  crystallography Protein Data  Bank
                (PDB) entries referenced in SWISS-PROT
 PEPTIDAS.TXT   Classification  of peptidase families and  index of
                peptidase entries
 PLASTID.TXT    List of chloroplast and cyanelle encoded proteins
 POMBE.TXT      Index   of  Schizosaccharomyces  pombe  entries  in
                SWISS-PROT    and    their    corresponding    gene
                designations
 RESTRIC.TXT    List of restriction enzyme and methylase entries
 RIBOSOMP.TXT   Index of  ribosomal proteins classified by families
                on the basis of sequence similarities
 SALTY.TXT      Index  of  Salmonella typhimurium  LT2  chromosomal
                entries  and  their  corresponding  StyGene  cross-
                references
 SUBTILIS.TXT   Index of  Bacillus subtilis 168 chromosomal entries
                and their corresponding SubtiList cross-references
 UPFLIST.TXT    UPF  (Uncharacterized  Protein Families)  list  and
                index of members
 YEAST.TXT      Index   of  Saccharomyces  cerevisiae  entries  and
                their corresponding gene designations
 YEAST1.TXT     Yeast Chromosome I entries
 YEAST2.TXT     Yeast Chromosome II entries
 YEAST3.TXT     Yeast Chromosome III entries
 YEAST5.TXT     Yeast Chromosome V entries
 YEAST6.TXT     Yeast Chromosome VI entries
 YEAST7.TXT     Yeast Chromosome VII entries
 YEAST8.TXT     Yeast Chromosome VIII entries
 YEAST9.TXT     Yeast Chromosome IX entries
 YEAST10.TXT    Yeast Chromosome X entries
 YEAST11.TXT    Yeast Chromosome XI entries
 YEAST13.TXT    Yeast Chromosome XIII entries
 YEAST14.TXT    Yeast Chromosome XIV entries

 Notes:

 1    New in release 36.
 2    We apologize for not having included the corresponding release notes
      with release 35. We are therefore including them with this release.
      As we believe that it may be useful to always distribute the release
      notes of the previous release, we will do so from now on, in a file
      named "OLDRLNOT.TXT".
 3    Has been extensively updated and contains Web links to more  than 640
      journals.
 4    Has been  extensively  updated and  now  includes synonyms  for  many
      tissues.

 We have  continued  to  include  in some  SWISS-PROT  document  files  the
 references of Web sites relevant to the subject under consideration. There
 are now 24 documents that include such links.



                  5. THE EXPASY WORLD-WIDE WEB SERVER

 5.1  Background information

 The most efficient and user-friendly way to browse interactively in
 SWISS-PROT, PROSITE, ENZYME, SWISS-2DPAGE and other databases is to use
 the World-Wide Web (WWW) molecular biology server ExPASy. The ExPASy
 server was made available to the public in September 1993. It is
 reachable at the following address:

                              http://www.expasy.ch/

 The ExPASy WWW server  allows access, using  the user-friendly hypertext
 model, to the  SWISS-PROT, PROSITE,  ENZYME, SWISS-2DPAGE, SWISS-3DIMAGE
 and CD40Lbase  databases and,  through any  SWISS-PROT  protein sequence
 entry, to  other databases  such  as EMBL,  Eco2DBASE,  EcoCyc, FlyBase,
 GCRDb, MaizeDB, SubtiList/NRSub,  OMIM, PDB, HSSP,  ProDom, REBASE, SGD,
 YEPD and  Medline. ExPASy  also offers  many tools  for the  analysis of
 protein sequences and 2D gels.


 5.2  SWISS-SHOP

 We provide, on ExPASy, a service called SWISS-SHOP. SWISS-SHOP allows
 any user of SWISS-PROT to indicate which proteins he/she is interested
 in. This can be done using various criteria that can be combined:

 -    By entering  one  or  more words  that  should  be  present  in the
      description line;
 -    By entering one or more species name(s) or taxonomic division(s);
 -    By entering one or more keywords;
 -    By entering one or more author names;
 -    By entering  the  accession number  (or  entry name)  of  a PROSITE
      pattern or a user-defined sequence pattern;
 -    By entering the  accession number  (or entry  name) of  an existing
      SWISS-PROT entry or by entering a "private" sequence.

 Every week, the  new sequences  entered in SWISS-PROT  are automatically
 compared with all the criteria that have been defined by the users. If a
 sequence corresponds to the  selection criteria defined by  a user, that
 sequence is sent by electronic mail.


 5.3  What is new on ExPASy

 ExPASy is constantly modified and improved. If you wish to be informed
 of the changes made to the server you can either:

 -    Read  the  document  "History  of  changes,  improvements  and  new
      features" which is available at the address:

              http://www.expasy.ch/www/history.html

 -    Subscribe to SWISS-Flash, a service that reports news of databases,
      software and services developments. By subscribing to this service,
      you will  automatically  get  SWISS-Flash  bulletins  by electronic
      mail. To subscribe use the address:

              http://www.expasy.ch/www/swiss-flash.html



                  6. TREMBL - A SUPPLEMENT TO SWISS-PROT

 The ongoing  genome  sequencing  and mapping  projects  have  dramatically
 increased the number of protein  sequences to be incorporated into  SWISS-
 PROT. Since we do not  want to dilute the quality standards  of SWISS-PROT
 by incorporating  sequences  into  the database  without  proper  sequence
 analysis and  annotation, we  cannot  speed up  the incorporation  of  new
 incoming data  indefinitely. But as  we also  want to  make the  sequences
 available as  fast  as possible,  we  have introduced  with  SWISS-PROT  a
 computer annotated  supplement. This  supplement  consists of  entries  in
 SWISS-PROT-like  format  derived  from  the  translation  of   all  coding
 sequences (CDS)  in the EMBL  nucleotide sequence  database, except  those
 already included in SWISS-PROT.

 We name this supplement TrEMBL (Translation from EMBL). It can be
 considered as a preliminary section of SWISS-PROT. This SWISS-PROT release
 is supplemented by TrEMBL release 6. TrEMBL is split into two main
 sections, SP-TrEMBL and REM-TrEMBL:

 - SP-TrEMBL (SWISS-PROT TrEMBL) contains the entries (150'329 in release
   6) which should be incorporated into SWISS-PROT. SWISS-PROT accession
   numbers have been assigned for all SP-TrEMBL entries.

 - REM-TrEMBL (REMaining TrEMBL) contains the entries (27'428 in release
   6) that we do not want to include in SWISS-PROT for a variety of
   reasons (synthetic sequences, pseudogenes, translations of incorrect
   open reading frames, fragments with less than eight amino acids,
   patent-derived sequences, immunoglobulins and T-cell receptors, etc.)

 TrEMBL is  available by FTP  from the  EBI server  (ftp.ebi.ac.uk) in  the
 directory '/pub/databases/trembl'. It can be queried on WWW by the EBI SRS
 server (http://www.ebi.ac.uk/). It is also available on the SWISS-PROT CD-
 ROM and is searchable on the FASTA, BIC and BLAST servers of the EBI.


                  7. WEEKLY UPDATES OF SWISS-PROT

 Weekly updates of SWISS-PROT are available by anonymous FTP. Three files
 are updated at each update:

 new_seq.dat    Contains all the new entries since the last full release;
 upd_seq.dat    Contains the entries for which the sequence data has been
                updated since the last release;
 upd_ann.dat    Contains  the entries  for which  one or  more annotation
                fields have been updated since the last release.

 Currently these  files  are  available on  the  following  anonymous FTP
 servers:

 Organization   Swiss Institute of Bioinformatics (SIB)
 Address        ftp.expasy.ch
 Directory      /databases/swiss-prot/updates

 Organization   European Bioinformatics Institute (EBI)
 Address        ftp.ebi.ac.uk
 Directory      /pub/databases/swissprot/new

 !! Important notes !!

 - Although we try to follow a regular schedule, we do not promise to
   update these files every week. In some cases two weeks will elapse
   between two updates.
 - Due to  the current mechanism used to build  a release the entries  that
   are provided in these updates are not guaranteed to be error free.
 - Instead  of using  the above  files, you  can, every  week, download  an
   updated copy of the  SWISS-PROT database. This file is available  in the
   directory containing the non-redundant database (see next section).



                  8. NON-REDUNDANT DATABASE

 A few  months ago, we  started to  distribute on  the ExPASy  and EBI  FTP
 servers, files that  make up  a non-redundant (see  further) and  complete
 protein sequence database consisting of three components:

 1) SWISS-PROT
 2) TrEMBL
 3) New  entries to be  later integrated  into TrEMBL  (hereafter known  as
    TrEMBL_New)

 Every week  three files  are completely  rebuilt. These  files are  named:
 sprot.dat.Z, trembl.dat.Z and trembl_new.dat.Z. As indicated by their ".Z"
 extension these are Unix "compress" format files which, when decompressed,
 will produce ASCII files in SWISS-PROT format.

 Three other files are also available (sprot.fas.Z, trembl.fas.Z and
 trembl_new.fas.Z), which are compressed "fasta" format sequence files
 useful for building the databases used by FASTA, BLAST and other sequence
 similarity search programs. Please do not use these files for other
 purposes as you lose all annotations by using this very primitive format.

 The files for the non-redundant database are stored in the directory
 "/databases/sp_tr_nrdb" on the ExPASy FTP server (ftp.expasy.ch) and in
 the directory "/pub/databases/sp_tr_nrdb" on the EBI FTP server
 (ftp.ebi.ac.uk).

 Additional notes

 - The SWISS-PROT  file continuously grows as  new annotated sequences  are
   added.
 - The TrEMBL file decreases in size as sequences are moved out of that
   section after being annotated and moved into SWISS-PROT. Four times a
   year a new release of TrEMBL is built at EBI; at this point the TrEMBL
   file increases in size as it then includes all of the new data (see
   next section) that has accumulated since the last release.
 - The TrEMBL_New file starts as a very small file and grows  in size until
   a new release of TrEMBL is available.

 - SWISS-PROT  and  TrEMBL share  the  same system  of  accession  numbers.
   Therefore  you will  not find  any primary  accession number  duplicated
   between the two  sections. A TrEMBL entry (and its  associated accession
   number(s)) can either move to SWISS-PROT as new entry or  be merged with
   an  existing  SWISS-PROT  entry.  In  the  latter  case,  the  accession
   number(s)  of that  TrEMBL entry  are added  to that  of the  SWISS-PROT
   entry.
 - TrEMBL_New  does  not  have  real  accession  numbers.  However  it  was
   necessary  to  have an  "AC"  line so  as  to be  able  to use  it  with
   different  software   products.  This  AC  line  contains  a   temporary
   identifier which consists of the pID (protein identifier)  of the coding
   sequence in the parent nucleotide sequence.

 - While these three files allow you to build what we call a
   "non-redundant" database, it must be noted that this description is
   not entirely accurate. Without going into a long explanation, we can
   say that this is currently the best attempt at providing a complete
   selection of protein sequence entries while trying to eliminate
   redundancies. While SWISS-PROT is completely (well 99.994% !)
   non-redundant, TrEMBL is far from being non-redundant and the
   combination of SWISS-PROT + TrEMBL is even less so.

 - To describe to your users the version of the non-redundant database
   that you are providing to them, you should use a statement of the form:

      SWISS-PROT release 36 and updates until {current_date};
      TrEMBL  release  6  minus  data  integrated  into  SWISS-PROT  as  of
      {current_date};
      New preliminary TrEMBL entries created since release 6 of TrEMBL
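
 Sites that rebuild the non-redundant database every week may wish to
 generate this statement automatically. The sketch below is one possible
 way to do so; the function name is hypothetical, and the ISO date format
 (YYYY-MM-DD) is an assumption, since no date format is prescribed here.

```python
from datetime import date


def nrdb_version_statement(sprot_release, trembl_release, current_date):
    """Fill in the version statement template for a given build date."""
    day = current_date.isoformat()
    return (
        f"SWISS-PROT release {sprot_release} and updates until {day};\n"
        f"TrEMBL release {trembl_release} minus data integrated into "
        f"SWISS-PROT as of {day};\n"
        f"New preliminary TrEMBL entries created since release "
        f"{trembl_release} of TrEMBL"
    )


statement = nrdb_version_statement(36, 6, date(1998, 7, 15))
print(statement)
```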


                  9.  ENZYME and PROSITE

 9.1  The ENZYME data bank

 Release 23.0 of the ENZYME data bank is distributed with release 36 of
 SWISS-PROT. ENZYME release 23.0 contains information on 3704 enzymes. It
 also differs from the previous release (release 22 of November 1997) in
 that the "DE" (Description), "AN" (Alternative Names), "CF" (Cofactor)
 and "CC" (Comments) lines are now in mixed-case characters instead of
 being all in UPPER case.

 Example, what was before:

 ID   1.4.4.2
 DE   GLYCINE DEHYDROGENASE (DECARBOXYLATING).
 AN   GLYCINE DECARBOXYLASE.
 AN   GLYCINE CLEAVAGE SYSTEM P-PROTEIN.
 CA   GLYCINE + LIPOYLPROTEIN = S-AMINOMETHYLDIHYDROLIPOYLPROTEIN + CO(2).
 CF   PYRIDOXAL-PHOSPHATE.
 CC   -!- LIPOAMIDE CAN ALSO ACT AS ACCEPTOR.
 CC   -!- A COMPONENT, WITH EC 2.1.2.10, OF THE GLYCINE CLEAVAGE SYSTEM,
 CC       PREVIOUSLY KNOWN AS GLYCINE SYNTHASE.
 DI   NONKETOTIC HYPERGLYCINEMIA TYPE II; MIM:238310.
 DR   P54376, GCS1_BACSU;  P54377, GCS2_BACSU;  P49361, GCSA_FLAPR;
 DR   P49362, GCSB_FLAPR;  P15505, GCSP_CHICK;  P33195, GCSP_ECOLI;
 DR   O49850, GCSP_FLAAN;  O49852, GCSP_FLATR;  P23378, GCSP_HUMAN;
 DR   Q50601, GCSP_MYCTU;  P26969, GCSP_PEA  ;  Q09785, GCSP_SCHPO;
 DR   O49954, GCSP_SOLTU;  P49095, GCSP_YEAST;
 //

 is now:

 ID   1.4.4.2
 DE   Glycine dehydrogenase (decarboxylating).
 AN   Glycine decarboxylase.
 AN   Glycine cleavage system P-protein.
 CA   GLYCINE + LIPOYLPROTEIN = S-AMINOMETHYLDIHYDROLIPOYLPROTEIN + CO(2).
 CF   Pyridoxal-phosphate.
 CC   -!- Lipoamide can also act as acceptor.
 CC   -!- A component, with EC 2.1.2.10, of the glycine cleavage system,
 CC       previously known as glycine synthase.
 DI   NONKETOTIC HYPERGLYCINEMIA TYPE II; MIM:238310.
 DR   P54376, GCS1_BACSU;  P54377, GCS2_BACSU;  P49361, GCSA_FLAPR;
 DR   P49362, GCSB_FLAPR;  P15505, GCSP_CHICK;  P33195, GCSP_ECOLI;
 DR   O49850, GCSP_FLAAN;  O49852, GCSP_FLATR;  P23378, GCSP_HUMAN;
 DR   Q50601, GCSP_MYCTU;  P26969, GCSP_PEA  ;  Q09785, GCSP_SCHPO;
 DR   O49954, GCSP_SOLTU;  P49095, GCSP_YEAST;
 //

 We plan to convert the  "CA" (Catalytic Activity) lines to mixed-case  for
 the next release.
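
 Software that reads ENZYME should not depend on the case of these lines.
 The following is a minimal sketch, not an official parser, of how an entry
 in the format shown above could be grouped by line code and how its "DR"
 lines could be split into SWISS-PROT accession number and entry name
 pairs. The function names are hypothetical and the sample is an
 abbreviated copy of the entry above.

```python
from collections import defaultdict

# Abbreviated copy of the mixed-case example entry.
ENTRY = """\
ID   1.4.4.2
DE   Glycine dehydrogenase (decarboxylating).
AN   Glycine decarboxylase.
AN   Glycine cleavage system P-protein.
CF   Pyridoxal-phosphate.
DR   P54376, GCS1_BACSU;  P54377, GCS2_BACSU;  P49361, GCSA_FLAPR;
DR   Q50601, GCSP_MYCTU;  P26969, GCSP_PEA  ;  Q09785, GCSP_SCHPO;
//
"""


def parse_enzyme_entry(text):
    """Group the lines of one ENZYME entry by their two-letter line code."""
    fields = defaultdict(list)
    for line in text.splitlines():
        if line.startswith("//"):  # entry terminator
            break
        # Columns 1-2 hold the line code, the data starts at column 6.
        code, value = line[:2], line[5:].strip()
        fields[code].append(value)
    return dict(fields)


def swissprot_refs(fields):
    """Split DR lines into (accession number, entry name) pairs."""
    pairs = []
    for line in fields.get("DR", []):
        for item in line.split(";"):
            if item.strip():  # skip the empty piece after the last ";"
                accession, name = item.split(",")
                pairs.append((accession.strip(), name.strip()))
    return pairs


entry = parse_enzyme_entry(ENTRY)
print(entry["DE"])
print(swissprot_refs(entry))
```

 Stripping each name copes with padded entry names such as "GCSP_PEA  ".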


 9.2  The PROSITE data bank

 Release 15.0 of the  PROSITE data bank is  distributed with release 36  of
 SWISS-PROT. This release  of PROSITE contains  1014 documentation  entries
 that describe 1'352 different patterns, rules and profiles/matrices.



                  10. WE NEED YOUR HELP !

 We welcome feedback from our users. We would especially appreciate it if
 you notified us when sequences belonging to your field of expertise are
 missing from the data bank. We would also like to be notified about
 annotations that need updating, for example if the function of a protein
 has been clarified or if new post-translational information has become
 available. To facilitate such feedback, we offer on the ExPASy WWW server
 a form that allows the submission of updates and/or corrections to
 SWISS-PROT:

               http://www.expasy.ch/sprot/sp_update_form.html

 It is also  possible, from  any entry  in  SWISS-PROT displayed  by the
 ExPASy server, to submit updates and/or  corrections for that particular
 entry. Finally, you  can also send  your comments by  electronic mail to
 the address:

                            swiss-prot@expasy.ch



                  11. IMPORTANT ANNOUNCEMENT

 It has become obvious in recent years that the tremendous increase in data
 flow has created a requirement for resources which cannot be met in full
 by public funding. This is causing databases to fall behind the research.
 We believe that the only solution to the resource shortfall is to ask
 commercial users to participate by paying a license fee. No fee will be
 charged to academic users, nor will any restriction be imposed on their
 use or reuse of the data. Both SWISS-PROT and PROSITE are affected by
 these changes, while ENZYME is not.

 A document fully  describing what will  be the impact  of this change  for
 SWISS-PROT is  available with  the SWISS-PROT  distribution  files on  FTP
 (SP_98.TXT). You can also  access the document as  well as other  relevant
 ones from:

                       http://www.expasy.ch/announce/
                       http://www.ebi.ac.uk/news.html

 If you do  not have the  time to  read this document,  the most  important
 take-home message is that these changes should not have any  impact on the
 way SWISS-PROT or  PROSITE are accessed  or redistributed. Academic  users
 will not be affected by these changes. Industrial end-users will  also not
 directly be affected as long  as their employer pays the license  fee. The
 same  holds  true  for  bioinformatics  companies.  Academic  software  or
 database developers as well as providers of database distribution services
 will be only minimally affected  by these changes. We  hope to be able  to
 keep the  spirit of  SWISS-PROT and  PROSITE alive  and at  the same  time
 ensure their long-term financial survival.  We sincerely hope and  believe
 that in the next two  years the only change that  will matter will be  the
 increase in scope and timeliness of the databases.

 Finally, it should be noted that release 36 of SWISS-PROT and release 15
 of PROSITE are not affected by these changes. There are no restrictions
 on their use or distribution.

   ========================================================================


                         APPENDIX A: SOME STATISTICS


   A.1  Amino acid composition

        A.1.1  Composition in percent for the complete data bank

   Ala (A) 7.58   Gln (Q) 3.99   Leu (L) 9.42   Ser (S) 7.15
   Arg (R) 5.14   Glu (E) 6.35   Lys (K) 5.93   Thr (T) 5.69
   Asn (N) 4.47   Gly (G) 6.83   Met (M) 2.37   Trp (W) 1.24
   Asp (D) 5.28   His (H) 2.24   Phe (F) 4.08   Tyr (Y) 3.18
   Cys (C) 1.67   Ile (I) 5.80   Pro (P) 4.91   Val (V) 6.56

   Asx (B) 0.001  Glx (Z) 0.001  Xaa (X) 0.01


        A.1.2  Classification of the amino acids by their frequency

   Leu, Ala, Ser, Gly, Val, Glu, Lys, Ile, Thr, Asp, Arg, Pro, Asn, Phe,
   Gln, Tyr, Met, His, Cys, Trp
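
   The classification above can be reproduced by sorting the percentages
   of table A.1.1 in descending order. The following is a minimal sketch;
   the variable names are chosen for illustration only.

```python
# Percentages copied from table A.1.1 (the 20 standard amino acids only).
COMPOSITION = {
    "Ala": 7.58, "Gln": 3.99, "Leu": 9.42, "Ser": 7.15,
    "Arg": 5.14, "Glu": 6.35, "Lys": 5.93, "Thr": 5.69,
    "Asn": 4.47, "Gly": 6.83, "Met": 2.37, "Trp": 1.24,
    "Asp": 5.28, "His": 2.24, "Phe": 4.08, "Tyr": 3.18,
    "Cys": 1.67, "Ile": 5.80, "Pro": 4.91, "Val": 6.56,
}

# Sort the three-letter codes from most to least frequent.
ranking = sorted(COMPOSITION, key=COMPOSITION.get, reverse=True)
print(", ".join(ranking))
```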



   A.2  Distribution of the sequences by their organism of origin

   Total number of species represented in this release of SWISS-PROT: 6002

   The first twenty species represent 35826 sequences: 48.4 % of the total
   number of entries.


   A.2.1 Table of the frequency of occurrence of species

        Species represented 1x: 2754
                            2x:  951
                            3x:  479
                            4x:  332
                            5x:  238
                            6x:  212
                            7x:  159
                            8x:   99
                            9x:  102
                           10x:   73
                       11- 20x:  277
                       21- 50x:  176
                       51-100x:   72
                         >100x:   78


   A.2.2  Table of the most represented species

    Number   Frequency          Species
         1        4980          Human
         2        4787          Baker's yeast (Saccharomyces cerevisiae)
         3        4416          Escherichia coli
         4        3253          Mouse
         5        2491          Rat
         6        1970          Bacillus subtilis
         7        1887          Caenorhabditis elegans
         8        1693          Haemophilus influenzae
         9        1315          Fission yeast (Schizosaccharomyces pombe)
        10        1283          Methanococcus jannaschii
        11        1088          Bovine
        12        1042          Fruit fly (Drosophila melanogaster)
        13         873          Mycobacterium tuberculosis
        14         840          Chicken
        15         719          Arabidopsis thaliana (Mouse-ear cress)
        16         706          Salmonella typhimurium
        17         697          African clawed frog (Xenopus laevis)
        18         616          Synechocystis sp. (strain PCC 6803)
        19         607          Pig
        20         563          Rabbit
        21         489          Mycoplasma pneumoniae
        22         470          Mycoplasma genitalium
        23         406          Maize
        24         403          Rhizobium sp. (strain NGR234)
        25         345          Pseudomonas aeruginosa
        26         334          Helicobacter pylori
        27         304          Rice
        28         284          Dog
        29         280          Slime mold (Dictyostelium discoideum)
        30         278          Tobacco
        31         273          Bacteriophage T4
        32         253          Vaccinia virus (strain Copenhagen)
        33         250          Mycobacterium leprae
        34         244          Sheep
        35         240          Pea
        36         219          Porphyra purpurea
        37         215          Barley
        38         212          Staphylococcus aureus
        39         209          Neurospora crassa
        40         208          Soybean
        41         205          Wheat
        42         195          Tomato
        43         193          Rhodobacter capsulatus
                   193          Human cytomegalovirus (strain AD169)
        45         192          Candida albicans
                   192          Potato
        47         191          Klebsiella pneumoniae
        48         190          Methanobacterium thermoautotrophicum
        49         185          Bacillus stearothermophilus
        50         184          Vaccinia virus (strain WR)
        51         178          Pseudomonas putida
        52         164          Agrobacterium tumefaciens
        53         160          Spinach
                   160          Guinea pig
        55         158          Chlamydomonas reinhardtii
        56         157          Rhizobium meliloti
        57         154          Autographa californica nuclear polyhedrosis virus
        58         150          Marchantia polymorpha (Liverwort)
        59         146          Variola virus
                   146          Cyanophora paradoxa
        61         145          Aspergillus nidulans
        62         139          Odontella sinensis
        63         136          Streptomyces coelicolor
                   136          Golden hamster
                   136          Lactococcus lactis (subsp. lactis)
        66         134          Orgyia pseudotsugata multicapsid polyhedrosis virus
        67         130          Horse
        68         127          Kluyveromyces lactis
        69         125          Thermus aquaticus (subsp. thermophilus)
        70         124          Trypanosoma brucei brucei
        71         122          Synechococcus sp. (strain PCC 7942)
        72         114          Anabaena sp. (strain PCC 7120)
        73         113          Bradyrhizobium japonicum
        74         111          Alcaligenes eutrophus
        75         110          Bombyx mori (Silk moth)
        76         107          Archaeoglobus fulgidus
        77         105          Yersinia enterocolitica
        78         101          Brassica napus (Rape)



   A.3  Distribution of the sequences by size

               From   To  Number             From   To   Number
                  1-  50    3048             1001-1100      667
                 51- 100    6272             1101-1200      511
                101- 150    9004             1201-1300      348
                151- 200    7032             1301-1400      233
                201- 250    6626             1401-1500      193
                251- 300    6172             1501-1600      119
                301- 350    5852             1601-1700      112
                351- 400    5882             1701-1800       86
                401- 450    4500             1801-1900       91
                451- 500    4176             1901-2000       58
                501- 550    3138             2001-2100       33
                551- 600    2191             2101-2200       68
                601- 650    1688             2201-2300       67
                651- 700    1221             2301-2400       35
                701- 750    1095             2401-2500       41
                751- 800     891             >2500          207
                801- 850     685
                851- 900     736
                901- 950     509
                951-1000     432



   A.4  Longest sequences

   The longest sequences (>=4000 residues) are listed here:

                               HTS1_COCCA  5217
                               MUC2_HUMAN  5179
                               FAT_DROME   5147
                               RYNR_RABIT  5037
                               RYNR_PIG    5035
                               RYNR_HUMAN  5032
                               RYNC_RABIT  4969
                               LRP_CAEEL   4753
                               DYHC_DICDI  4725
                               PLEC_RAT    4687
                               LRP2_RAT    4660
                               DYHC_RAT    4644
                               DYHC_DROME  4639
                               DYHC_CAEEL  4568
                               DYHB_CHLRE  4568
                               APB_HUMAN   4563
                               APOA_HUMAN  4548
                               LRP1_HUMAN  4544
                               LRP1_CHICK  4543
                               DYHC_PARTE  4540
                               RRPA_CVMJH  4488
                               DYHG_CHLRE  4485
                               DYHC_ANTCR  4466
                               DYHC_TRIGR  4466
                               GRSB_BACBR  4451
                               PKSK_BACSU  4447
                               PKSL_BACSU  4427
                               PGBM_HUMAN  4393
                               YP73_CAEEL  4385
                               DYHC_NEUCR  4367
                               DYHC_NECHA  4349
                               DYHC_EMENI  4344
                               PKD1_HUMAN  4303
                               DYHC_SCHPO  4196
                               DYHC_YEAST  4092
                               RRPA_CVH22  4085


   A.5  Statistics for journal citations


   Total number of journals cited in this release of SWISS-PROT: 913


   A.5.1 Table of the frequency of journal citations

        Journals cited 1x: 339
                       2x: 124
                       3x:  70
                       4x:  39
                       5x:  37
                       6x:  23
                       7x:  17
                       8x:  15
                       9x:  14
                      10x:  10
                  11- 20x:  63
                  21- 50x:  65
                  51-100x:  24
                    >100x:  73


   A.5.2  List of the most cited journals in SWISS-PROT

   Nb    Citations       Journal abbreviation
   --    ---------       ----------------------------------
    1    6303            J. BIOL. CHEM.
    2    3814            PROC. NATL. ACAD. SCI. U.S.A.
    3    3384            NUCLEIC ACIDS RES.
    4    2714            J. BACTERIOL.
    5    2498            GENE
    6    2058            FEBS LETT.
    7    1932            EUR. J. BIOCHEM.
    8    1780            BIOCHEM. BIOPHYS. RES. COMMUN.
    9    1732            BIOCHEMISTRY
   10    1713            EMBO J.
   11    1617            NATURE
   12    1438            BIOCHIM. BIOPHYS. ACTA
   13    1339            J. MOL. BIOL.
   14    1228            CELL
   15    1184            MOL. CELL. BIOL.
   16     953            MOL. GEN. GENET.
   17     929            PLANT MOL. BIOL.
   18     888            BIOCHEM. J.
   19     873            GENOMICS
   20     808            SCIENCE
   21     768            MOL. MICROBIOL.
   22     764            VIROLOGY
   23     682            J. BIOCHEM.
   24     515            J. VIROL.
   25     464            YEAST
   26     461            J. CELL BIOL.
   27     445            J. GEN. VIROL.
   28     417            PLANT PHYSIOL.
   29     407            GENES DEV.
   30     376            HUM. MOL. GENET.
   31     346            J. IMMUNOL.
   32     342            HUM. MUTAT.
   33     323            ARCH. BIOCHEM. BIOPHYS.
   34     319            CURR. GENET.
   35     312            ONCOGENE
   36     312            INFECT. IMMUN.
   37     305            MOL. BIOCHEM. PARASITOL.
   38     270            FEMS MICROBIOL. LETT.
   39     264            BIOL. CHEM. HOPPE-SEYLER
   40     261            STRUCTURE
   41     254            AM. J. HUM. GENET.
   42     247            NAT. GENET.
   43     239            DEVELOPMENT
   44     237            MOL. ENDOCRINOL.
   45     234            J. CLIN. INVEST.
   46     218            J. MOL. EVOL.
   47     218            J. GEN. MICROBIOL.
   48     213            HOPPE-SEYLER'S Z. PHYSIOL. CHEM.
   49     204            MICROBIOLOGY
   50     202            GENETICS
   51     191            HUM. GENET.
   52     188            NAT. STRUCT. BIOL.
   53     186            DNA CELL BIOL.
   54     182            J. EXP. MED.
   55     181            BLOOD
   56     175            DEV. BIOL.
   57     174            APPL. ENVIRON. MICROBIOL.
   58     172            NEURON
   59     157            PROTEIN SCI.
   60     153            DNA
   61     145            IMMUNOGENETICS
   62     137            ENDOCRINOLOGY
   63     136            DNA SEQ.
   64     125            PLANT CELL
   65     115            HEMOGLOBIN
   66     113            CANCER RES.
   67     113            BIOCHIMIE
   68     109            J. NEUROCHEM.
   69     109            BIOORG. KHIM.
   70     108            MOL. BIOL. EVOL.
   71     107            AGRIC. BIOL. CHEM.
   72     106            BRAIN RES. MOL. BRAIN RES.
   73     105            PLANT J.


   ========================================================================


   APPENDIX B: RELATIONSHIPS BETWEEN SWISS-PROT AND SOME BIOMOLECULAR
               DATABASES

   The current  status of  the relationships (cross-references) between
   SWISS-PROT and some biomolecular databases is shown in the following
   schematic:


                         ***********************
                         *  EMBL Nucleotide    *
                         *  Sequence Database  *
                         *       [EBI]         *
                         ***********************
                           ^ ^ ^  ^  ^ ^ ^ ^ ^
******************         | | |  I  | | | | |         **********************
* FlyBase        * <-------+ | |  I  | | | | +-------> * MGD [Mouse]        *
******************         | | |  I  | | | | |         **********************
                           | | |  I  | | | | |
******************         | | |  I  | | | | |         **********************
* SubtiList      * <---------+ |  I  | | | +---------> * GCRDb [7TM recep.] *
* [B.subtilis]   *         | | |  I  | | | | |         **********************
******************         | | |  I  | | | | |
                           | | |  I  | | | | |         **********************
******************         | | |  I  | | +-----------> * EcoGene [E.coli]   *
* Mendel [Plant] * <-----+ | | |  I  | | | | |         **********************
******************       | | | |  I  | | | | |
                         | | | |  I  | | | | |         **********************
******************       | | | |  I  +---------------> * SGD [Yeast]        *
* MaizeDb        * <-----------+  I  | | | | |         **********************
* [Zea mays]     *       | | | |  I  | | | | |
******************       | | | |  I  | | | | |         **********************
                         | | | |  I  | +-------------> * DictyDB [D.disco.] *
******************       | | | |  I  | | | | |         **********************
* WormPep        *       | | | |  I  | | | | |
* [C.elegans]    * <---+ | | | |  I  | | | | |         **********************
******************     | | | | |  I  | | | | | +-----> * ENZYME [Nomencl.]  *
                       | | | | |  I  | | | | | |       **********************
******************     | v v v v  v  v v v v v v           v
* REBASE         *     *************************       **********************
* [Restriction   * <-- *   SWISS-PROT          * ----> * OMIM [Human]       *
*  enzymes]      *     *   Protein Sequence    *       **********************
******************     *   Data Bank           *
                       *************************       **********************
******************      ^ ^ ^ ^ ^ ^ ^ | ^ ^ ^          * ECO2DBASE     [2D] *
* StyGene        *      | | | | | | | | | | +--------> **********************
* [S.Typhimurium]* <----+ | | | | | | | | |
******************        | | | | | | | | |            **********************
                          | | | | | | | | +----------> * Maize-2DPAGE  [2D] *
******************        | | | | | | | |              **********************
* Transfac       * <------+ | | | | | | |
******************          | | | | | | |              **********************
                            | | | | | | +------------> * SWISS-2DPAGE  [2D] *
******************          | | | | | |                **********************
* Harefield [2D] * <--------+ | | | | |
******************            | | | | |                **********************
                              | | | | +--------------> * Aarhus/Ghent  [2D] *
******************            | | | |                  **********************
* PROSITE        *            | | | |
* [Patterns and  * <----------+ | | +----------------> **********************
* profiles]      *              | |                    * YEPD [Yeast]  [2D] *
******************              | +----------------+   **********************
             |                  v                  |
             |          ***********************    +-> **********************
             +--------> * PDB [3D structures] * <----- * HSSP [3D similar.] *
                        ***********************        **********************

   =End=of=SWISS-PROT=release=36=notes=====================================
  

Swiss-Prot release 35.0

Published November 1, 1997
                   SWISS-PROT RELEASE 35.0 RELEASE NOTES

                   1.  INTRODUCTION

 Release 35.0 of SWISS-PROT contains  69'113 sequence entries, comprising
 25'083'768  amino   acids  abstracted   from  59'101   references.  This
 represents an increase of 18.3% over release  34. The growth of the data
 bank is summarized below.

 Release      Date           Number of       Number of amino
                               entries                 acids
    2.0       09/86               3939               900 163
    3.0       11/86               4160               969 641
    4.0       04/87               4387             1 036 010
    5.0       09/87               5205             1 327 683
    6.0       01/88               6102             1 653 982
    7.0       04/88               6821             1 885 771
    8.0       08/88               7724             2 224 465
    9.0       11/88               8702             2 498 140
   10.0       03/89              10008             2 952 613
   11.0       07/89              10856             3 265 966
   12.0       10/89              12305             3 797 482
   13.0       01/90              13837             4 347 336
   14.0       04/90              15409             4 914 264
   15.0       08/90              16941             5 486 399
   16.0       11/90              18364             5 986 949
   17.0       02/91              20024             6 524 504
   18.0       05/91              20772             6 792 034
   19.0       08/91              21795             7 173 785
   20.0       11/91              22654             7 500 130
   21.0       03/92              23742             7 866 596
   22.0       05/92              25044             8 375 696
   23.0       08/92              26706             9 011 391
   24.0       12/92              28154             9 545 427
   25.0       04/93              29955            10 214 020
   26.0       07/93              31808            10 875 091
   27.0       10/93              33329            11 484 420
   28.0       02/94              36000            12 496 420
   29.0       06/94              38303            13 464 008
   30.0       10/94              40292            14 147 368
   31.0       02/95              43470            15 335 248
   32.0       11/95              49340            17 385 503
   33.0       02/96              52205            18 531 384
   34.0       10/96              59021            21 210 389
   35.0       11/97              69113            25 083 768
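
 The growth between releases 34 and 35 can be checked from the last two
 rows of the table. In this minimal sketch (variable and function names
 are chosen for illustration), the 18.3% quoted in the introduction
 matches the growth in amino acids, while the number of entries grows by
 about 17.1%.

```python
# Last two rows of the growth table.
RELEASE_34 = {"entries": 59021, "amino_acids": 21_210_389}
RELEASE_35 = {"entries": 69113, "amino_acids": 25_083_768}


def percent_increase(old, new):
    """Relative growth from old to new, in percent."""
    return (new - old) / old * 100


growth = {
    key: percent_increase(RELEASE_34[key], RELEASE_35[key])
    for key in RELEASE_34
}
for key, value in growth.items():
    print(f"{key}: {value:.1f}%")
```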



     2.  DESCRIPTION OF THE CHANGES MADE TO SWISS-PROT SINCE RELEASE 34

 2.1  Sequences and annotations

 10'189 sequences have been added since  release 34, the sequence data of
 1654 existing entries  has been  updated and  the annotations  of 15'683
 entries have been revised.


 2.2  What's happening with the model organisms

 We have selected  a number of  organisms that  are the  target of genome
 sequencing and/or mapping projects and for which we intend to:

 . Be as complete  as possible. All  sequences available at  a given time
   should be  immediately  included  in  SWISS-PROT.  This also  includes
   sequence corrections and updates;
 . Provide a higher level of annotation;
 . Provide cross-references  to  specialized  database(s)  that  contain,
   among other data, some  genetic information about the  genes that code
   for these proteins;
 . Provide specific indices or documents.

 What was done  since the  last release  or in  preparation for  the next
 release concerning model organisms:

 . We have added Methanococcus jannaschii, Helicobacter pylori and
   Synechocystis PCC 6803 to the list of model organisms. The genomes of
   these organisms have been completely sequenced and we plan to annotate
   them fully in SWISS-PROT. Specific documents have been added (see
   section 4) for each of these organisms.

 . We have also added mouse (Mus musculus) as a model organism. A
   significant effort has been made to add new mouse sequences (542 have
   been added since the last release); we have added links to MGD (the
   Mouse Genome Database; see section 2.5.2) and we have also created a
   specific document (MGDTOSP.TXT) that lists the cross-references
   between MGD and SWISS-PROT.

 . We have  continued  our effort  in  catching up  with  the  backlog of
   sequences from  other  model  organisms. In  particular  we  added 410
   entries from  yeast,  644  from  human,  89  from  S.pombe,  527  from
   C.elegans, 95 from A.thaliana and 92 from D.melanogaster.

 . We have added  in SWISS-PROT all  the sequences  from yeast chromosome
   XIII. We plan  to integrate data  from the  remaining chromosomes (IV,
   XII, XV and XVI) very soon  so as to have a  complete set of annotated
   yeast sequences.

 . We have finished the annotation of all Mycoplasma genitalium entries.

 . We plan  to  finish  as quickly  as  possible  the  annotation  of the
   Escherichia coli and Haemophilus influenzae sequence entries which are
   not yet part of SWISS-PROT.

 Here is the current status of the model organisms in SWISS-PROT:

 Organism        Database            Index file       Number of
                 cross-referenced                     sequences
 --------------  ----------------    --------------   ---------
 A.thaliana      None yet            In preparation         658
 B.subtilis      SubtiList           SUBTILIS.TXT          1882
 C.albicans      None yet            CALBICAN.TXT           167
 C.elegans       Wormpep             CELEGANS.TXT          1735
 D.discoideum    DictyDB             DICTY.TXT              272
 D.melanogaster  FlyBase             FLY.TXT               1002
 E.coli          EcoGene             ECOLI.TXT             4098
 H.influenzae    HiDB (TIGR)         HAEINFLU.TXT          1687
 H.sapiens       MIM                 MIMTOSP.TXT           4644
 H.pylori        HpDB (TIGR)         HPYLORI.TXT            257
 M.genitalium    MgDB (TIGR)         MGENITAL.TXT           470
 M.musculus      MGD                 MGDTOSP.TXT           2971
 M.jannaschii    MjDB (TIGR)         MJANNASC.TXT          1064
 M.tuberculosis  None yet            None yet               796
 S.cerevisiae    SGD                 YEAST.TXT             4750
 S.typhimurium   StyGene             SALTY.TXT              680
 S.pombe         None yet            POMBE.TXT             1045
 S.solfataricus  None yet            None yet                42


 Collectively the entries from the above  model organisms represent 35.4%
 of all SWISS-PROT entries.


 2.3  Changes affecting the accession numbers

 With the creation of the  TrEMBL database (see section  6) and the rapid
 increase in the amount of sequence data, we  are faced with a problem of
 availability of accession numbers. Currently we  use a system based on a
 one-letter prefix followed by 5 digits. This system was also used by the
 nucleotide sequence databases, which had originally reserved the prefix
 letters 'P' and 'Q' for SWISS-PROT. The nucleotide databases, having run
 out of space (due mainly to ESTs), have been forced to start using a new
 format based on a two-letter prefix followed by 6 digits.

 We have  used up  all possible  numbers with  'P' and  'Q' and  the only
 letter prefix which was not  used by the nucleotide  database is 'O'. As
 we believe that  changing the  format of the  accession numbers  to that
 used now by the  nucleotide database would create  havoc on the numerous
 software packages using SWISS-PROT, we have  decided to keep a system of
 accession numbers based on a six-character  code, but with the following
 changes:

 1)   We have  started  using 'O'.  This  extra letter  should  allow the
 continuation of  the present  format (1  prefix letter  + 5  digits) for
 approximately one year.
 2)   When we have finished using up 'O', we will introduce a system
 based on the following format:

      1        2       3          4            5            6
     [O,P,Q]  [0-9]  [A-Z, 0-9]  [A-Z, 0-9]   [A-Z, 0-9]   [0-9]

 What the above means is that we will keep a six-character code, but that
 in positions 3,  4 and  5 of this  code any  combination of  letters and
 numbers can  be  present.  This format  allows  a  total  of  14 million
 accession numbers (up from 300'000 with the current system).

 We only allow digits in positions 2 and 6 so that SWISS-PROT accession
 numbers cannot be mistaken for gene names, acronyms, other types of
 accession numbers or any other kind of word!

 Examples: P0A3S2, Q2ASD4, O13YX2, P9B123
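
 The format above maps directly onto a regular expression, and the
 capacity figures follow from multiplying the number of choices at each
 position. The following is a minimal sketch; the pattern and function
 names are illustrative and are not part of any SWISS-PROT software.

```python
import re

# Current format: one prefix letter (O, P or Q) followed by five digits.
OLD_FORMAT = re.compile(r"[OPQ][0-9]{5}")
# Planned format: positions 3 to 5 may hold any uppercase letter or digit.
NEW_FORMAT = re.compile(r"[OPQ][0-9][A-Z0-9]{3}[0-9]")


def is_valid_accession(code):
    """True if code is a six-character accession number in the planned format."""
    return NEW_FORMAT.fullmatch(code) is not None


# Capacity: multiply the number of choices available at each position.
old_capacity = 3 * 10**5               # 300'000
new_capacity = 3 * 10 * 36**3 * 10     # 13'996'800, about 14 million

# The last code has a letter in position 2 and is therefore rejected.
for code in ["P0A3S2", "Q2ASD4", "O13YX2", "P9B123", "P12345", "PA1234"]:
    print(code, is_valid_accession(code))
print(old_capacity, new_capacity)
```

 Accession numbers in the current format remain valid in the planned
 format, since digits are still allowed in positions 3 to 5.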


 2.4  Introduction of a new CC line-type topic (DATABASE)

 There are an  increasing number of  databases that cater  for a specific
 protein or  for  a very limited  number  of proteins.  Most  of these
 databases are mutation databases, reporting defects  linked to a genetic
 disease. We want  to add cross-references  to these  databases when they
 are available  electronically, either  by WWW  or  by FTP.  We therefore
 added in this release, a new comments (CC) line-type 'topic': "DATABASE"
 whose syntax is the following:

  CC   -!- DATABASE: NAME=Text[; NOTE=Text][; WWW="Address"][;
          FTP="Address"].

 Where
 `NAME' is the name of the database;
 `NOTE' (optional) is a free text note;
 `WWW'  (optional) is the WWW address (URL) of the database;
 `FTP'  (optional) is the  anonymous FTP address (including the directory
        name) where the database file(s) are stored.

 Examples of its usage:

 CC   -!- DATABASE: NAME=CD40Lbase;
 CC       NOTE=European CD40L defect database (mutation db);
 CC       WWW="http://www.expasy.ch/www/cd40lbase.html";
 CC       FTP="ftp://www.expasy.ch/databases/cd40lbase".

 CC   -!- DATABASE: NAME=PROW; NOTE=CD guide CD80 entry;
 CC       WWW="http://www.ncbi.nlm.nih.gov/prow/cd/cd80.htm".

 Please note that this topic along  with some forms of  the DR lines (see
 next section)  are  the first  occurrence  in SWISS-PROT  of  lower case
 characters (yes, we plan to go to mixed cases soon!).

 It is also, currently, the only part of SWISS-PROT where lines longer
 than 75 characters can be found, as we do not reformat long URLs or FTP
 addresses.


 2.5  Changes concerning cross-references (DR line)

 2.5.1  TIGR

 We have added cross-references  from SWISS-PROT to the  TIGR database, a
 collection  of  genomic  databases  for  microbes,  plants  and  animals
 maintained by The  Institute for  Genomic Research (TIGR)  in Rockville,
 Maryland, USA. These cross-references are present in the DR lines:

 Data bank identifier : TIGR
 Primary identifier   : The genome Open Reading Frame (ORF) code
 Secondary identifier : Not defined, a dash ("-") is stored in that field
 Example              : DR   TIGR; HP1563; -.


 2.5.2  MGD

 We have  added  cross-references  from SWISS-PROT  to  the  Mouse Genome
 Database (MGD),  maintained by  The  Jackson Laboratory  in  Bar Harbor,
 Maine, USA. These cross-references are present in the DR lines:

 Data bank identifier : MGD
 Primary identifier   : The accession number
 Secondary identifier : The gene designation
 Example              : DR   MGD; MGI:109323; HTR2B.
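
 The TIGR and MGD cross-references share the same three-field layout. A
 minimal Python sketch of how such a DR line could be split is given
 below; the names CrossReference and parse_dr_line are hypothetical, and
 the choice to represent the dash as None is ours.

```python
from typing import NamedTuple, Optional


class CrossReference(NamedTuple):
    database: str             # data bank identifier, e.g. "TIGR"
    primary: str              # primary identifier
    secondary: Optional[str]  # None when the field holds a dash


def parse_dr_line(line):
    """Split a three-field DR line into its identifiers."""
    line = line.strip()
    if not line.startswith("DR"):
        raise ValueError("not a DR line: " + line)
    body = line[2:].strip()
    if body.endswith("."):
        body = body[:-1]  # DR lines are terminated by a period
    fields = [field.strip() for field in body.split(";")]
    if len(fields) != 3:
        raise ValueError("expected three fields: " + line)
    database, primary, secondary = fields
    return CrossReference(database, primary,
                          None if secondary == "-" else secondary)


if __name__ == "__main__":
    print(parse_dr_line("DR   TIGR; HP1563; -."))
    print(parse_dr_line("DR   MGD; MGI:109323; HTR2B."))
```

 Lines with a fourth field, such as the PROSITE cross-references
 described in 2.5.4, are deliberately rejected by this sketch.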


 2.5.3  LISTA

 We have  removed  the  cross-references  from  SWISS-PROT  to the  LISTA
 database which is no longer maintained  and which has been superseded by
 the SGD database to which SWISS-PROT is fully cross-referenced.


 2.5.4  PROSITE

 The format for cross-references to the PROSITE protein domain and family
 database used to be:

 DR   PROSITE; ACCESSION_NUMBER; ENTRY_NAME.

 It has been changed to:

 DR   PROSITE; ACCESSION_NUMBER; ENTRY_NAME; STATUS.

 Where "ACCESSION_NUMBER" stands for the accession number of the PROSITE
 pattern or profile entry; "ENTRY_NAME" is the name of the entry and
 "STATUS" is one of the following:

 n
 FALSE_NEG
 PARTIAL
 UNKNOWN_n

 Where "n" is the number of hits of the pattern or profile in that
 particular protein sequence. The "FALSE_NEG" status indicates that while
 the pattern or profile did not detect the protein sequence, it is a
 member of that particular family or domain. The "PARTIAL" status
 indicates that the pattern or profile did not detect the sequence
 because that sequence is not complete and lacks the region on which the
 pattern/profile is based. Finally, the "UNKNOWN" status indicates
 uncertainty as to whether the sequence is a member of the family or
 domain described by the pattern/profile.

 Example of PROSITE cross-references:

 DR   PROSITE; PS00107; PROTEIN_KINASE_ATP; 1.
 DR   PROSITE; PS00028; ZINC_FINGER_C2H2; 6.
 DR   PROSITE; PS00237; G_PROTEIN_RECEPTOR; FALSE_NEG.
 DR   PROSITE; PS01128; SHIKIMATE_KINASE; PARTIAL.
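
 To make the four status forms concrete, here is a small Python sketch
 that reads the new four-field PROSITE line. The function names and the
 label "HITS" (used for a plain number of hits) are hypothetical; only
 the status values themselves come from the description above.

```python
import re


def interpret_status(status):
    """Return (kind, hits) for a PROSITE status field.

    kind is "HITS", "UNKNOWN", "FALSE_NEG" or "PARTIAL"; hits is an
    integer for the first two kinds and None otherwise.
    """
    if status in ("FALSE_NEG", "PARTIAL"):
        return status, None
    match = re.fullmatch(r"(UNKNOWN_)?([0-9]+)", status)
    if match is None:
        raise ValueError("unrecognised status: " + status)
    kind = "UNKNOWN" if match.group(1) else "HITS"
    return kind, int(match.group(2))


def parse_prosite_dr(line):
    """Parse a 'DR   PROSITE; ...' line in the new four-field format."""
    line = line.strip()
    if not line.startswith("DR"):
        raise ValueError("not a DR line: " + line)
    body = line[2:].strip()
    if body.endswith("."):
        body = body[:-1]
    fields = [field.strip() for field in body.split(";")]
    if len(fields) != 4 or fields[0] != "PROSITE":
        raise ValueError("not a four-field PROSITE line: " + line)
    kind, hits = interpret_status(fields[3])
    return {"accession": fields[1], "entry_name": fields[2],
            "status": kind, "hits": hits}


if __name__ == "__main__":
    print(parse_prosite_dr("DR   PROSITE; PS00028; ZINC_FINGER_C2H2; 6."))
    print(parse_prosite_dr("DR   PROSITE; PS01128; SHIKIMATE_KINASE; PARTIAL."))
```

 A line in the old three-field format raises an error here, which may be
 useful when checking that a file has been converted.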


 2.5.5  REBASE

 Two small changes  have been made  to the syntax  of cross-references to
 the REBASE database:

 - REBASE has recently changed its accession numbers to add an additional
   digit (an extra leading zero).
 - We are  now using  mixed case  characters in the  secondary identifier
   (the name  of the restriction system)  so as to  represent exactly the
   information as stored in REBASE.

 Example:

 DR   REBASE; RB0005; ECORI.

 has been changed to:

 DR   REBASE; RB00005; EcoRI.



                  3. PLANNED CHANGES

 3.1  Extension of the accession number system

 As already explained in detail under 2.3, we will extend the accession
 number system once we have used up the 'O' series of accession numbers.
 We anticipate that this will happen in October 1998.


 3.2  Switch to the NCBI taxonomy

 To standardize the taxonomies used by different databases, we will
 change our taxonomy with release 37. We will switch to the NCBI
 taxonomy, which is already used as the common taxonomy by the
 DDBJ/EMBL/GenBank nucleotide sequence databases.


 3.3  Introduction of RT lines

 With release 37 we will introduce a new line type, the RT (Reference
 Title) line. This optional line will be placed between the RA and RL
 lines. The RT line gives the title of the paper (or other work) as
 exactly as possible given the limitations of the computer character set.
 The form which will  be used is that  which would be used  in a citation
 rather than displayed at  the top of the  published paper. For instance,
 where journals capitalize major  title words this is  not preserved. The
 title is enclosed  in double quotes,  and may be  continued over several
 lines as necessary.  The title lines  are terminated by  a semicolon. An
 example of the use of RT lines is shown below:

 RT   "Sequence analysis of the genome of the unicellular cyanobacterium
 RT   Synechocystis sp. strain PCC6803. I. Sequence features in the 1 Mb
 RT   region from map positions 64% to 92% of the genome.";
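
 As a rough illustration of the rules above (double quotes around the
 title, continuation over several lines, a terminating semicolon), the
 Python sketch below joins RT lines back into a single title. The
 function name join_rt_lines is hypothetical, and joining continuation
 lines with a single space is an assumption.

```python
def join_rt_lines(lines):
    """Rebuild a reference title from its RT line(s)."""
    parts = []
    for line in lines:
        line = line.strip()
        if not line.startswith("RT"):
            raise ValueError("not an RT line: " + line)
        parts.append(line[2:].strip())  # drop the two-letter line code
    text = " ".join(parts)  # assumption: one space between lines
    # The title is enclosed in double quotes and ends with a semicolon.
    if not (text.startswith('"') and text.endswith('";')):
        raise ValueError("title is not quoted and semicolon-terminated")
    return text[1:-2]


if __name__ == "__main__":
    example = [
        'RT   "Sequence analysis of the genome of the unicellular cyanobacterium',
        "RT   Synechocystis sp. strain PCC6803. I. Sequence features in the 1 Mb",
        'RT   region from map positions 64% to 92% of the genome.";',
    ]
    print(join_rt_lines(example))
```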



                  4. STATUS OF THE DOCUMENTATION FILES

 SWISS-PROT is distributed  with a  large number of  documentation files.
 Some of  these files  have  been available  for  a long  time  (the user
 manual, release  notes,  the  various  indices  for authors,  citations,
 keywords, etc.),  but  many  have  been  created  recently  and  we  are
 continuously adding new  files. Since release  34, we have  added 15 new
 document files.  The following table lists  all  the documents  that are
 currently available.

 USERMAN.TXT    User manual
 RELNOTES.TXT   Release notes
 SHORTDES.TXT   Short description of entries in SWISS-PROT
 JOURLIST.TXT   List of abbreviations for journals cited
 KEYWLIST.TXT   List of keywords in use
 SPECLIST.TXT   List of organism identification codes
 TISSLIST.TXT   List of tissues
 EXPERTS.TXT    List of on-line experts for PROSITE and SWISS-PROT
 SUBMIT.TXT     Submission of sequence data to SWISS-PROT

 ACINDEX.TXT    Accession number index
 AUTINDEX.TXT   Author index
 CITINDEX.TXT   Citation index
 KEYINDEX.TXT   Keyword index
 SPEINDEX.TXT   Species index
 DELETEAC.TXT   Deleted accession number index [1]

 7TMRLIST.TXT   List of 7-transmembrane G-linked receptor entries
 AATRNASY.TXT   List of aminoacyl-tRNA synthetases
 ALLERGEN.TXT   Nomenclature and index of allergen sequences
 BLOODGRP.TXT   List of blood group antigen proteins [1]
 CALBICAN.TXT   Index   of  Candida  albicans  entries   and  their
                corresponding gene designations
 CDLIST.TXT     CD  nomenclature  for  surface  proteins  of  human
                leucocytes
 CELEGANS.TXT   Index  of Caenorhabditis elegans entries  and their
                corresponding gene Wormpep cross-references
 DICTY.TXT      Index   of  Dictyostelium  discoideum  entries  and
                their  corresponding gene designations  and DictyDb
                cross-references
 EC2DTOSP.TXT   Index  of  Escherichia coli  Gene-protein  database
                entries referenced in SWISS-PROT
 ECOLI.TXT      Index  of Escherichia coli K12  chromosomal entries
                and their corresponding EcoGene cross-references
 EMBLTOSP.TXT   Index  of   EMBL  Database  entries  referenced  in
                SWISS-PROT
 EXTRADOM.TXT   Nomenclature of extracellular domains
 FLY.TXT        Index  of  Drosophila  entries and  FlyBase  cross-
                references [1]
 GLYCOSID.TXT   Classification  of glycosyl hydrolase  families and
                index of glycosyl hydrolase entries
 HAEINFLU.TXT   Index  of  Haemophilus  influenzae  RD  chromosomal
                entries
 HOXLIST.TXT    Vertebrate  homeotic Hox proteins: nomenclature and
                index
 HPYLORI.TXT    Index   of   Helicobacter   pylori   strain   26695
                chromosomal entries [1]
 HUMCHR18.TXT   Index of protein  sequence entries encoded on human
                chromosome 18 [1]
 HUMCHR19.TXT   Index of protein  sequence entries encoded on human
                chromosome 19 [1]
 HUMCHR20.TXT   Index of protein  sequence entries encoded on human
                chromosome 20
 HUMCHR21.TXT   Index of protein  sequence entries encoded on human
                chromosome 21
 HUMCHR22.TXT   Index of protein  sequence entries encoded on human
                chromosome 22
 HUMCHRX.TXT    Index of protein  sequence entries encoded on human
                chromosome X
 HUMCHRY.TXT    Index of protein  sequence entries encoded on human
                chromosome Y
 INITFACT.TXT   List and index of translation initiation factors [1]
 MIMTOSP.TXT    Index of MIM entries referenced in SWISS-PROT
 METALLO.TXT    Classification  of  metallothioneins and  index  of
                entries in SWISS-PROT [1]
 MGDTOSP.TXT    Index of MGD entries referenced in SWISS-PROT [1]
 MGENITAL.TXT   Index  of Mycoplasma genitalium chromosomal entries
                [1]
 MJANNASC.TXT   Index of Methanococcus jannaschii entries [1]
 NGR234.TXT     Table  of   putative  genes  in  Rhizobium  plasmid
                pNGR234a [1]
 NOMLIST.TXT    List   of  nomenclature   related  references   for
                proteins
 PCC6803.TXT    Index of Synechocystis strain PCC 6803 entries [1]
 PDBTOSP.TXT    Index  of X-ray  crystallography Protein Data  Bank
                (PDB) entries referenced in SWISS-PROT
 PEPTIDAS.TXT   Classification  of peptidase families and  index of
                peptidase entries
 PLASTID.TXT    List of chloroplast and cyanelle encoded proteins
 POMBE.TXT      Index   of  Schizosaccharomyces  pombe  entries  in
                SWISS-PROT    and    their    corresponding    gene
                designations
 RESTRIC.TXT    List of restriction enzyme and methylase entries
 RIBOSOMP.TXT   Index of  ribosomal proteins classified by families
                on the basis of sequence similarities
 SALTY.TXT      Index  of  Salmonella typhimurium  LT2  chromosomal
                entries  and  their  corresponding  StyGene  cross-
                references
 SUBTILIS.TXT   Index of  Bacillus subtilis 168 chromosomal entries
                and their corresponding SubtiList cross-references
 UPFLIST.TXT    UPF  (Uncharacterized  Protein Families)  list  and
                index of members [1]
 YEAST.TXT      Index   of  Saccharomyces  cerevisiae  entries  and
                their corresponding gene designations
 YEAST1.TXT     Yeast Chromosome I entries
 YEAST2.TXT     Yeast Chromosome II entries
 YEAST3.TXT     Yeast Chromosome III entries
 YEAST5.TXT     Yeast Chromosome V entries
 YEAST6.TXT     Yeast Chromosome VI entries
 YEAST7.TXT     Yeast Chromosome VII entries
 YEAST8.TXT     Yeast Chromosome VIII entries
 YEAST9.TXT     Yeast Chromosome IX entries
 YEAST10.TXT    Yeast Chromosome X entries
 YEAST11.TXT    Yeast Chromosome XI entries
 YEAST13.TXT    Yeast Chromosome XIII entries [1]
 YEAST14.TXT    Yeast Chromosome XIV entries

 Notes:
 [1]  New in release 35.

 We have  continued  to include  in  some SWISS-PROT  document  files the
 references of  World-Wide  Web  sites  relevant  to  the  subject  under
 consideration. There are now 12 documents that include such links.



                  5. THE EXPASY WORLD-WIDE WEB SERVER

 5.1  Background information

 The most efficient and user-friendly way to browse interactively in
 SWISS-PROT, PROSITE, ENZYME, SWISS-2DPAGE and other databases is to use
 the World-Wide Web (WWW) molecular biology server ExPASy. The ExPASy
 server was made available to the public in September 1993. It is
 reachable at the following address:

                              http://www.expasy.ch/

 The ExPASy WWW server  allows access, using  the user-friendly hypertext
 model, to the  SWISS-PROT, PROSITE,  ENZYME, SWISS-2DPAGE, SWISS-3DIMAGE
 and CD40Lbase  databases and,  through any  SWISS-PROT  protein sequence
 entry, to  other databases  such  as EMBL,  Eco2DBASE,  EcoCyc, FlyBase,
 GCRDb, MaizeDB, SubtiList/NRSub,  OMIM, PDB, HSSP,  ProDom, REBASE, SGD,
 YEPD and  Medline. ExPASy  also offers  many tools  for the  analysis of
 protein sequences and 2D gels.


 5.2  SWISS-SHOP

 We provide, on ExPASy, a service called SWISS-SHOP. SWISS-SHOP allows
 any user of SWISS-PROT to indicate which proteins he/she is interested
 in. This can be done using various criteria that can be combined:

 -    By entering  one  or  more words  that  should  be  present  in the
      description line;
 -    By entering one or more species name(s) or taxonomic division(s);
 -    By entering one or more keywords;
 -    By entering one or more author names;
 -    By entering  the  accession number  (or  entry name)  of  a PROSITE
      pattern or a user-defined sequence pattern;
 -    By entering the  accession number  (or entry  name) of  an existing
      SWISS-PROT entry or by entering a "private" sequence.

 Every week, the  new sequences  entered in SWISS-PROT  are automatically
 compared with all the criteria that have been defined by the users. If a
 sequence corresponds to the  selection criteria defined by  a user, that
 sequence is sent by electronic mail.


 5.3  What is new on ExPASy

 ExPASy is constantly modified and improved. If you wish to be informed
 of the changes made to the server, you can either:

 -    Read  the  document  "History  of  changes,  improvements  and  new
      features" which is available at the address:

              http://www.expasy.ch/www/history.html

 -    Subscribe to SWISS-Flash, a service that reports news of database,
      software and service developments. By subscribing to this service,
      you will  automatically  get  SWISS-Flash  bulletins  by electronic
      mail. To subscribe use the address:

              http://www.expasy.ch/www/swiss-flash.html



                  6. TREMBL - A SUPPLEMENT TO SWISS-PROT

 The ongoing  genome sequencing  and mapping  projects  have dramatically
 increased the number of protein sequences to be incorporated into SWISS-
 PROT. Since we do not want to dilute the quality standards of SWISS-PROT
 by incorporating  sequences into  the database  without  proper sequence
 analysis and annotation,  we cannot  speed up  the incorporation  of new
 incoming data indefinitely.  But as we  also want to  make the sequences
 available as  fast as  possible, we  have  introduced with  SWISS-PROT a
 computer annotated supplement.  This supplement  consists of  entries in
 SWISS-PROT-like format  derived  from  the  translation  of  all  coding
 sequences (CDS) in the  EMBL nucleotide sequence  database, except those
 already included in SWISS-PROT.

 We name  this  supplement  TrEMBL (Translation  from  EMBL).  It  can be
 considered as  a  preliminary  section  of  SWISS-PROT. This  SWISS-PROT
 release is supplemented by TrEMBL release 5. TrEMBL is split into two
 main sections, SP-TrEMBL and REM-TrEMBL:

 - SP-TrEMBL (SWISS-PROT TrEMBL) contains the entries (140'555 in release
   5) which should  be incorporated into SWISS-PROT. SWISS-PROT accession
   numbers have been assigned for all SP-TrEMBL entries.

 - REM-TrEMBL (REMaining TrEMBL) contains the entries (25'806 in release
   5) that we do not want to include in SWISS-PROT for a variety of
   reasons (synthetic sequences, pseudogenes, translations of incorrect
   open reading frames, fragments with fewer than eight amino acids,
   patent-derived sequences, immunoglobulins and T-cell receptors, etc.).

 TrEMBL is available  by FTP from  the EBI server  (ftp.ebi.ac.uk) in the
 directory '/pub/databases/trembl'. It can  be queried on WWW  by the EBI
 SRS server (http://www.ebi.ac.uk/). It  is also available  on the SWISS-
 PROT CD-ROM and is searchable on the  FASTA, BIC_SW and BLAST servers of
 the EBI.



                  7. WEEKLY UPDATES OF SWISS-PROT

 Weekly updates of SWISS-PROT are available by anonymous FTP. Three files
 are updated at each update:

 new_seq.dat    Contains all the new entries since the last full release;
 upd_seq.dat    Contains the entries for which the sequence data has been
                updated since the last release;
 upd_ann.dat    Contains  the entries  for which  one or  more annotation
                fields have been updated since the last release.

 Currently these  files  are  available on  the  following  anonymous FTP
 servers:

 Organization   ExPASy (Geneva University Expert Protein Analysis System)
 Address        expasy.hcuge.ch  (or 129.195.254.61)
 Directory      /databases/swiss-prot/updates

 Organization   European Bioinformatics Institute (EBI)
 Address        ftp.ebi.ac.uk (or 193.62.196.6)
 Directory      /pub/databases/swissprot/new


 !!! Important notes !!!

 - Although we try to follow a regular schedule, we do not promise to
   update these files every week. In some cases two weeks will elapse
   between two updates.
 - Due to the current mechanism used to build a release, the entries that
   are provided in these updates are not guaranteed to be error free.



                  8.  ENZYME and PROSITE

 8.1  The ENZYME data bank

 Release 22.0 of the ENZYME  data bank is distributed  with release 35 of
 SWISS-PROT. ENZYME release  22.0 contains  information relative  to 3651
 enzymes.


 8.2  The PROSITE data bank

 Release 14.0 of the PROSITE data bank  is distributed with release 35 of
 SWISS-PROT. This release  of PROSITE contains  997 documentation entries
 that describe  1'335  different patterns,  rules  and profiles/matrices.
 Release 14.0  is  the  first completely  new  release  of  PROSITE since
 November 1995. Since  that date we  have added 114  entries and modified
 566 entries. The long time that elapsed between this release of PROSITE
 and the last one is partially due to a complete rewriting of the
 software tools that maintain the database and allow it to be
 bi-directionally linked to SWISS-PROT. Thanks to those changes, we will
 now be able to produce PROSITE releases at each release of SWISS-PROT
 and also to offer on the ExPASy server frequent updates of the database.



                  9. WE NEED YOUR HELP !

 We welcome feedback from our users. We would especially appreciate it
 if you would notify us when you find that sequences belonging to your
 field of expertise are missing from the data bank. We would also like to
 be notified about annotations to be updated if, for example, the
 function of a protein has been clarified or if new post-translational
 information has become available. To facilitate such feedback, we offer
 on the ExPASy WWW server a form that allows the submission of updates
 and/or corrections to SWISS-PROT:

               http://www.expasy.ch/sprot/sp_update_form.html

 It is also possible, from any entry in SWISS-PROT displayed by the
 ExPASy server, to submit updates and/or  corrections for that particular
 entry. Finally, you  can also send  your comments by  electronic mail to
 the address:

                            swiss-prot@expasy.ch


 ========================================================================


                         APPENDIX A: SOME STATISTICS


   A.1  Amino acid composition

        A.1.1  Composition in percent for the complete data bank

   Ala (A) 7.57   Gln (Q) 4.00   Leu (L) 9.39   Ser (S) 7.15
   Arg (R) 5.15   Glu (E) 6.34   Lys (K) 5.95   Thr (T) 5.70
   Asn (N) 4.50   Gly (G) 6.83   Met (M) 2.35   Trp (W) 1.24
   Asp (D) 5.29   His (H) 2.23   Phe (F) 4.08   Tyr (Y) 3.18
   Cys (C) 1.68   Ile (I) 5.78   Pro (P) 4.91   Val (V) 6.55

   Asx (B) 0.001  Glx (Z) 0.001  Xaa (X) 0.01


        A.1.2  Classification of the amino acids by their frequency

   Leu, Ala, Ser, Gly, Val, Glu, Lys, Ile, Thr, Asp, Arg, Pro, Asn, Phe,
   Gln, Tyr, Met, His, Cys, Trp
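
   The ranking in A.1.2 follows directly from the percentages in A.1.1.
   The Python sketch below shows the calculation; the names COMPOSITION,
   composition_percent and rank_by_frequency are hypothetical, and the
   one-letter input sequences in the usage example are toy data, not
   SWISS-PROT entries.

```python
from collections import Counter

# Percentages copied from table A.1.1 (three-letter codes).
COMPOSITION = {
    "Ala": 7.57, "Gln": 4.00, "Leu": 9.39, "Ser": 7.15,
    "Arg": 5.15, "Glu": 6.34, "Lys": 5.95, "Thr": 5.70,
    "Asn": 4.50, "Gly": 6.83, "Met": 2.35, "Trp": 1.24,
    "Asp": 5.29, "His": 2.23, "Phe": 4.08, "Tyr": 3.18,
    "Cys": 1.68, "Ile": 5.78, "Pro": 4.91, "Val": 6.55,
}


def composition_percent(sequences):
    """Percentage of each residue (one-letter code) over all sequences."""
    counts = Counter()
    for sequence in sequences:
        counts.update(sequence)
    total = sum(counts.values())
    return {residue: 100.0 * n / total for residue, n in counts.items()}


def rank_by_frequency(composition):
    """Residues ordered from most to least frequent."""
    return sorted(composition, key=composition.get, reverse=True)


if __name__ == "__main__":
    print(", ".join(rank_by_frequency(COMPOSITION)))
    print(composition_percent(["AAL", "L"]))  # toy input
```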



   A.2  Distribution of the sequences by their organism of origin

   Total number of species represented in this release of SWISS-PROT: 5713

   The first twenty species represent 34020 sequences: 49.2 % of the total
   number of entries.


   A.2.1 Table of the frequency of occurrence of species

        Species represented 1x: 2609
                            2x:  891
                            3x:  480
                            4x:  321
                            5x:  225
                            6x:  209
                            7x:  148
                            8x:   94
                            9x:  113
                           10x:   58
                       11- 20x:  261
                       21- 50x:  165
                       51-100x:   64
                         >100x:   75
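
   A table of this kind can be derived from the number of entries per
   species. The Python sketch below shows the binning and checks that the
   published classes add up to the 5713 species quoted in A.2. The names
   frequency_class, frequency_table and PUBLISHED are hypothetical.

```python
from collections import Counter

# Counts copied from table A.2.1.
PUBLISHED = {
    "1x": 2609, "2x": 891, "3x": 480, "4x": 321, "5x": 225,
    "6x": 209, "7x": 148, "8x": 94, "9x": 113, "10x": 58,
    "11-20x": 261, "21-50x": 165, "51-100x": 64, ">100x": 75,
}


def frequency_class(entries):
    """Return the table class for a species with this many entries."""
    if entries < 1:
        raise ValueError("a represented species has at least one entry")
    if entries <= 10:
        return "%dx" % entries
    if entries <= 20:
        return "11-20x"
    if entries <= 50:
        return "21-50x"
    if entries <= 100:
        return "51-100x"
    return ">100x"


def frequency_table(entries_per_species):
    """Count how many species fall into each class."""
    return Counter(frequency_class(n) for n in entries_per_species)


if __name__ == "__main__":
    print(sum(PUBLISHED.values()))            # total number of species
    print(frequency_table([4750, 15, 1, 1]))  # toy input
```

   The 75 species in the ">100x" class appear to correspond to the 75
   species listed in A.2.2.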


   A.2.2  Table of the most represented species

    Number   Frequency          Species
         1        4750          Baker's yeast (Saccharomyces cerevisiae)
         2        4644          Human
         3        4098          Escherichia coli
         4        2971          Mouse
         5        2398          Rat
         6        1882          Bacillus subtilis
         7        1735          Caenorhabditis elegans
         8        1687          Haemophilus influenzae
         9        1064          Methanococcus jannaschii
        10        1047          Bovine
        11        1045          Fission yeast (Schizosaccharomyces pombe)
        12        1002          Fruit fly (Drosophila melanogaster)
        13         799          Chicken
        14         786          Mycobacterium tuberculosis
        15         680          Salmonella typhimurium
        16         658          Arabidopsis thaliana (Mouse-ear cress)
        17         648          African clawed frog (Xenopus laevis)
        18         551          Pig
        19         541          Rabbit
        20         494          Synechocystis sp. (strain PCC 6803)
        21         489          Mycoplasma pneumoniae
        22         470          Mycoplasma genitalium
        23         403          Rhizobium sp. (strain NGR234)
        24         398          Maize
        25         340          Pseudomonas aeruginosa
        26         292          Rice
        27         273          Bacteriophage T4
        28         272          Slime mold (Dictyostelium discoideum)
        29         257          Helicobacter pylori
        30         256          Tobacco
        31         253          Vaccinia virus (strain Copenhagen)
        32         248          Dog
        33         231          Pea
        34         223          Sheep
        35         219          Porphyra purpurea
        36         209          Barley
        37         203          Neurospora crassa
        38         199          Wheat
                   199          Staphylococcus aureus
        40         196          Mycobacterium leprae
        41         193          Human cytomegalovirus (strain AD169)
        42         192          Soybean
        43         190          Klebsiella pneumoniae
        44         184          Vaccinia virus (strain WR)
        45         183          Rhodobacter capsulatus
                   183          Pseudomonas putida
        47         180          Bacillus stearothermophilus
        48         175          Potato
        49         174          Tomato
        50         167          Candida albicans
        51         162          Agrobacterium tumefaciens
        52         156          Spinach
        53         154          Rhizobium meliloti
                   154          Autographa californica nuclear polyhedrosis virus
        55         151          Chlamydomonas reinhardtii
        56         150          Marchantia polymorpha (Liverwort)
        57         149          Guinea pig
        58         146          Variola virus
        59         145          Cyanophora paradoxa
        60         139          Odontella sinensis
        61         138          Aspergillus nidulans
        62         134          Orgyia pseudotsugata multicapsid polyhedrosis virus
        63         132          Lactococcus lactis (subsp. lactis)
        64         131          Streptomyces coelicolor
        65         122          Thermus aquaticus (subsp. thermophilus)
        66         120          Horse
        67         116          Golden hamster
        68         113          Trypanosoma brucei brucei
                   113          Anabaena sp. (strain PCC 7120)
                   113          Synechococcus sp. (strain PCC 7942)
        71         108          Kluyveromyces lactis
        72         107          Bombyx mori (Silk moth)
        73         105          Bradyrhizobium japonicum
                   105          Alcaligenes eutrophus
        75         102          Yersinia enterocolitica



   A.3  Distribution of the sequences by size

               From   To  Number             From   To   Number
                  1-  50    2882             1001-1100      627
                 51- 100    5886             1101-1200      484
                101- 150    8453             1201-1300      339
                151- 200    6661             1301-1400      226
                201- 250    6184             1401-1500      186
                251- 300    5742             1501-1600      115
                301- 350    5369             1601-1700      102
                351- 400    5392             1701-1800       79
                401- 450    4149             1801-1900       86
                451- 500    3905             1901-2000       52
                501- 550    2927             2001-2100       30
                551- 600    2053             2101-2200       67
                601- 650    1560             2201-2300       64
                651- 700    1159             2301-2400       32
                701- 750    1032             2401-2500       39
                751- 800     831             >2500          203
                801- 850     652
                851- 900     685
                901- 950     464
                951-1000     396
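
   The size classes above are 50 residues wide up to 1000 residues, 100
   residues wide from 1001 to 2500, and open-ended beyond that. A minimal
   Python sketch of this binning follows; the names size_class and
   size_table are hypothetical, and the lengths in the usage example are
   toy values.

```python
from collections import Counter


def size_class(length):
    """Return the table class (e.g. '51-100') for a sequence length."""
    if length < 1:
        raise ValueError("sequence length must be at least 1")
    if length > 2500:
        return ">2500"
    width = 50 if length <= 1000 else 100
    low = ((length - 1) // width) * width + 1
    return "%d-%d" % (low, low + width - 1)


def size_table(lengths):
    """Count how many sequences fall into each size class."""
    return Counter(size_class(n) for n in lengths)


if __name__ == "__main__":
    print(size_class(5217))                    # a sequence longer than 2500
    print(size_table([12, 48, 51, 1000, 1001]))  # toy input
```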



   A.4  Longest sequences

   The longest sequences (>=4000 residues) are listed here:

                               HTS1_COCCA  5217
                               MUC2_HUMAN  5179
                               FAT_DROME   5147
                               RYNR_RABIT  5037
                               RYNR_PIG    5035
                               RYNR_HUMAN  5032
                               RYNC_RABIT  4969
                               LRP_CAEEL   4753
                               DYHC_DICDI  4725
                               PLEC_RAT    4687
                               LRP2_RAT    4660
                               DYHC_RAT    4644
                               DYHC_DROME  4639
                               DYHC_CAEEL  4568
                               DYHB_CHLRE  4568
                               APB_HUMAN   4563
                               APOA_HUMAN  4548
                               LRP1_HUMAN  4544
                               LRP1_CHICK  4543
                               DYHC_PARTE  4540
                               RRPA_CVMJH  4488
                               DYHG_CHLRE  4485
                               DYHC_ANTCR  4466
                               DYHC_TRIGR  4466
                               GRSB_BACBR  4451
                               PKSK_BACSU  4447
                               PKSL_BACSU  4427
                               PGBM_HUMAN  4393
                               YP73_CAEEL  4385
                               DYHC_NEUCR  4367
                               DYHC_NECHA  4349
                               DYHC_EMENI  4344
                               PKD1_HUMAN  4303
                               DYHC_YEAST  4092
                               RRPA_CVH22  4085


   A.5  Statistics for journal citations


   Total number of journals cited in this release of SWISS-PROT: 861


   A.5.1 Table of the frequency of journal citations

        Journals cited 1x: 326
                       2x: 117
                       3x:  61
                       4x:  39
                       5x:  30
                       6x:  23
                       7x:  14
                       8x:  13
                       9x:  10
                      10x:  12
                  11- 20x:  66
                  21- 50x:  58
                  51-100x:  23
                    >100x:  69


   A.5.2  List of the most cited journals in SWISS-PROT

   Citations       Journal abbreviation
   ---------       ----------------------------------
   6038            J. BIOL. CHEM.
   3672            PROC. NATL. ACAD. SCI. U.S.A.
   3356            NUCLEIC ACIDS RES.
   2604            J. BACTERIOL.
   2352            GENE
   1992            FEBS LETT.
   1853            EUR. J. BIOCHEM.
   1693            BIOCHEM. BIOPHYS. RES. COMMUN.
   1651            EMBO J.
   1596            BIOCHEMISTRY
   1540            NATURE
   1367            BIOCHIM. BIOPHYS. ACTA
   1244            J. MOL. BIOL.
   1177            CELL
   1137            MOL. CELL. BIOL.
    920            MOL. GEN. GENET.
    899            PLANT MOL. BIOL.
    850            BIOCHEM. J.
    764            SCIENCE
    750            VIROLOGY
    748            GENOMICS
    731            MOL. MICROBIOL.
    661            J. BIOCHEM.
    502            J. VIROL.
    444            J. CELL BIOL.
    439            YEAST
    435            J. GEN. VIROL.
    418            PLANT PHYSIOL.
    381            GENES DEV.
    333            HUM. MOL. GENET.
    323            J. IMMUNOL.
    313            CURR. GENET.
    305            ARCH. BIOCHEM. BIOPHYS.
    303            INFECT. IMMUN.
    287            ONCOGENE
    287            MOL. BIOCHEM. PARASITOL.
    262            BIOL. CHEM. HOPPE-SEYLER
    248            FEMS MICROBIOL. LETT.
    230            MOL. ENDOCRINOL.
    230            HUM. MUTAT.
    220            J. CLIN. INVEST.
    220            AM. J. HUM. GENET.
    219            NAT. GENET.
    219            DEVELOPMENT
    216            J. GEN. MICROBIOL.
    213            HOPPE-SEYLER'S Z. PHYSIOL. CHEM.
    194            J. MOL. EVOL.
    185            GENETICS
    180            STRUCTURE
    178            MICROBIOLOGY
    177            BLOOD
    172            HUM. GENET.
    169            DNA CELL BIOL.
    168            J. EXP. MED.
    163            APPL. ENVIRON. MICROBIOL.
    158            DEV. BIOL.
    156            NEURON
    152            DNA
    136            IMMUNOGENETICS
    124            ENDOCRINOLOGY
    123            DNA SEQ.
    122            PLANT CELL
    115            NAT. STRUCT. BIOL.
    109            HEMOGLOBIN
    108            PROTEIN SCI.
    108            BIOCHIMIE
    106            AGRIC. BIOL. CHEM.
    105            BIOORG. KHIM.
    101            CANCER RES.


 ===========================================================================


   APPENDIX B: RELATIONSHIPS BETWEEN SWISS-PROT AND SOME BIOMOLECULAR
               DATABASES

   The current  status of  the relationships (cross-references) between
   SWISS-PROT and some biomolecular databases is shown in the following
   schematic:


                         ***********************
                         *  EMBL Nucleotide    *
                         *  Sequence Database  *
                         *       [EBI]         *
                         ***********************
                           ^ ^ ^  ^  ^ ^ ^ ^ ^
******************         | | |  I  | | | | |         **********************
* FlyBase        * <-------+ | |  I  | | | | +-------> * MGD [Mouse]        *
******************         | | |  I  | | | | |         **********************
                           | | |  I  | | | | |
******************         | | |  I  | | | | |         **********************
* SubtiList      * <---------+ |  I  | | | +---------> * GCRDb [7TM recep.] *
* [B.subtilis]   *         | | |  I  | | | | |         **********************
******************         | | |  I  | | | | |
                           | | |  I  | | | | |         **********************
******************         | | |  I  | | +-----------> * EcoGene [E.coli]   *
* Mendel [Plant] * <-----+ | | |  I  | | | | |         **********************
******************       | | | |  I  | | | | |
                         | | | |  I  | | | | |         **********************
******************       | | | |  I  +---------------> * SGD [Yeast]        *
* MaizeDb        * <-----------+  I  | | | | |         **********************
* [Zea mays]     *       | | | |  I  | | | | |
******************       | | | |  I  | | | | |         **********************
                         | | | |  I  | +-------------> * DictyDB [D.disco.] *
******************       | | | |  I  | | | | |         **********************
* WormPep        *       | | | |  I  | | | | |
* [C.elegans]    * <---+ | | | |  I  | | | | |         **********************
******************     | | | | |  I  | | | | | +-----> * ENZYME [Nomencl.]  *
                       | | | | |  I  | | | | | |       **********************
******************     | v v v v  v  v v v v v v           v
* REBASE         *     *************************       **********************
* [Restriction   * <-- *   SWISS-PROT          * ----> * OMIM [Human]       *
*  enzymes]      *     *   Protein Sequence    *       **********************
******************     *   Data Bank           *
                       *************************       **********************
******************      ^ ^ ^ ^ ^ ^ ^ | ^ ^ ^          * ECO2DBASE     [2D] *
* StyGene        *      | | | | | | | | | | +--------> **********************
* [S.Typhimurium]* <----+ | | | | | | | | |
******************        | | | | | | | | |            **********************
                          | | | | | | | | +----------> * Maize-2DPAGE  [2D] *
******************        | | | | | | | |              **********************
* Transfac       * <------+ | | | | | | |
******************          | | | | | | |              **********************
                            | | | | | | +------------> * SWISS-2DPAGE  [2D] *
******************          | | | | | |                **********************
* Harefield [2D] * <--------+ | | | | |
******************            | | | | |                **********************
                              | | | | +--------------> * Aarhus/Ghent  [2D] *
******************            | | | |                  **********************
* PROSITE        *            | | | |
* [Patterns and  * <----------+ | | +----------------> **********************
* profiles]      *              | |                    * YEPD [Yeast]  [2D] *
******************              | +----------------+   **********************
             |                  v                  |
             |          ***********************    +-> **********************
             +--------> * PDB [3D structures] * <----- * HSSP [3D similar.] *
                        ***********************        **********************

 ===End=of=SWISS-PROT=release=35=notes=====================================
  

Swiss-Prot release 34.0

Published October 1, 1996

                    SWISS-PROT RELEASE 34.0 RELEASE NOTES


                               1.  INTRODUCTION

   Release 34.0  of SWISS-PROT contains 59'021 sequence entries, comprising
   21'210'389  amino   acids  abstracted   from  50'052   references.  This
   represents an  increase of 14.5% over release 33. The growth of the data
   bank is summarized below.

   Release    Date   Number of entries     Nb of amino acids

   2.0        09/86               3939               900 163
   3.0        11/86               4160               969 641
   4.0        04/87               4387             1 036 010
   5.0        09/87               5205             1 327 683
   6.0        01/88               6102             1 653 982
   7.0        04/88               6821             1 885 771
   8.0        08/88               7724             2 224 465
   9.0        11/88               8702             2 498 140
   10.0       03/89              10008             2 952 613
   11.0       07/89              10856             3 265 966
   12.0       10/89              12305             3 797 482
   13.0       01/90              13837             4 347 336
   14.0       04/90              15409             4 914 264
   15.0       08/90              16941             5 486 399
   16.0       11/90              18364             5 986 949
   17.0       02/91              20024             6 524 504
   18.0       05/91              20772             6 792 034
   19.0       08/91              21795             7 173 785
   20.0       11/91              22654             7 500 130
   21.0       03/92              23742             7 866 596
   22.0       05/92              25044             8 375 696
   23.0       08/92              26706             9 011 391
   24.0       12/92              28154             9 545 427
   25.0       04/93              29955            10 214 020
   26.0       07/93              31808            10 875 091
   27.0       10/93              33329            11 484 420
   28.0       02/94              36000            12 496 420
   29.0       06/94              38303            13 464 008
   30.0       10/94              40292            14 147 368
   31.0       02/95              43470            15 335 248
   32.0       11/95              49340            17 385 503
   33.0       02/96              52205            18 531 384
   34.0       10/96              59021            21 210 389
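
   As a quick check on the last two rows of the table, the growth between
   releases 33 and 34 can be recomputed with a short Python sketch. On
   these figures, the 14.5% quoted above matches the growth in amino
   acids, while the number of entries grew by about 13.1%. The names
   RELEASES and percent_growth are illustrative only.

```python
# Figures copied from the growth table (releases 33.0 and 34.0).
RELEASES = {
    "33.0": {"entries": 52205, "amino_acids": 18531384},
    "34.0": {"entries": 59021, "amino_acids": 21210389},
}


def percent_growth(old: int, new: int) -> float:
    """Return the relative increase from old to new, in percent."""
    return (new - old) / old * 100.0


entry_growth = percent_growth(
    RELEASES["33.0"]["entries"], RELEASES["34.0"]["entries"]
)
residue_growth = percent_growth(
    RELEASES["33.0"]["amino_acids"], RELEASES["34.0"]["amino_acids"]
)
print(f"entries: {entry_growth:.1f}%  amino acids: {residue_growth:.1f}%")
```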



      2.  DESCRIPTION OF THE CHANGES MADE TO SWISS-PROT SINCE RELEASE 33

   2.1  Sequences and annotations

   6'892 sequences  have been  added since release 33, the sequence data of
   1118 existing  entries has  been updated  and the  annotations of 10'629
   entries have been revised.

   2.2  What's happening with the model organisms

   We have  selected a  number of  organisms that  are the target of genome
   sequencing and/or mapping projects and for which we intend to:

   -  Be as  complete as  possible. All sequences available at a given time
      should be  immediately included  in SWISS-PROT.  This  also  includes
      sequence corrections and updates;
   -  Provide a higher level of annotation;
   -  Provide cross-references  to specialized  database(s)  that  contain,
      among other  data, some genetic information about the genes that code
      for these proteins;
   -  Provide specific indices or documents.

   What was  done since  the last  release or  in preparation  for the next
   release concerning model organisms:

   -  We have  added  Mycobacterium  tuberculosis  to  the  list  of  model
      organisms. The  genome  of  this  important  pathogenic bacterium  is
      currently being  sequenced at the Sanger Genome Center in Hinxton. We
      have already annotated 474 putative proteins from M.tuberculosis.

   -  We have  continued our  effort in  catching up  with the  backlog  of
      sequences from eukaryotic model organisms. In particular we added 687
      entries from  yeast, 525  from human,  316  from  S.pombe,  202  from
      C.elegans, 62 from A.thaliana and 92 from Drosophila.

   -  We have  added to SWISS-PROT all the sequences from yeast chromosomes
      VII and XIV. We plan to integrate data from the remaining chromosomes
      (IV, XII, XIII, XV and XVI) very soon so as to have a complete set of
      annotated yeast sequences.

   -  We plan  to finish,  for the  next release,  the annotation  of  the
      Haemophilus influenzae  and Mycoplasma  genitalium  sequence  entries
      which are not yet part of SWISS-PROT.


   Here is the current status of the model organisms:


   Organism         Database               Index file       Number of
                    cross-referenced                        sequences
   --------------   ---------------------  --------------   ---------
   A.thaliana       None yet               In preparation         562
   B.subtilis       SubtiList              SUBTILIS.TXT          1783
   C.albicans       None yet               CALBICAN.TXT           124
   C.elegans        WormPep                CELEGANS.TXT          1208
   D.discoideum     DictyDB                DICTY.TXT              265
   D.melanogaster   FlyBase                In preparation         910
   E.coli           EcoGene                ECOLI.TXT             3606
   H.influenzae     None yet               HAEINFLU.TXT          1591
   H.sapiens        MIM                    MIMTOSP.TXT           4000
   M.genitalium     None yet               In preparation         425
   M.tuberculosis   None yet               None yet               474
   S.cerevisiae     LISTA/SGD              YEAST.TXT             4340
   S.typhimurium    StyGene                SALTY.TXT              617
   S.pombe          None yet               POMBE.TXT              956
   S.solfataricus   None yet               None yet                42


   Collectively the  entries from the above model organisms represent 35.4%
   of all SWISS-PROT entries.



   2.3  Change in the GN line

   Starting with  release 34,  we allow  more than  a single  GN line to be
   present in  an entry.  This small change was rendered necessary to allow
   the representation  of all  gene names for a number of protein sequences
   encoded by a multiplicity of genes or for genes with many synonyms.

   Examples:

   GN   (MSP-31 OR R05F9.13) AND (MSP-40 OR C33F10.9) AND (MSP-142 OR
   GN   K05F1.2) AND C34F11.4 AND F58A6.8 AND K07F5.1 AND ZK1248.6.

   GN   (RPL44A OR RPL44 OR SCL41A OR RPL41A OR YNL162W OR N1722) AND
   GN   (RPL44B OR RPL44 OR SCL41B OR RPL41B OR MAK18 OR YHR141C).
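
   For software that reads SWISS-PROT entries, the practical effect is
   that all GN lines of an entry have to be joined before the gene names
   are interpreted. A minimal Python sketch, assuming the syntax seen in
   the two examples above (groups joined by AND, synonyms joined by OR,
   a final period); parse_gn_lines is a hypothetical helper, not part of
   any SWISS-PROT software:

```python
def parse_gn_lines(lines):
    """Parse the GN lines of one entry into a list of synonym groups.

    Each group corresponds to one gene; the names inside a group are
    synonyms. Assumes gene names never contain ' AND ' or ' OR '.
    """
    # Drop the two-character line code and join continuation lines.
    text = " ".join(line[2:].strip() for line in lines if line.startswith("GN"))
    text = text.rstrip(".")
    groups = []
    for group in text.split(" AND "):
        group = group.strip()
        # Parentheses only delimit a group of synonyms.
        if group.startswith("(") and group.endswith(")"):
            group = group[1:-1]
        groups.append([name.strip() for name in group.split(" OR ")])
    return groups


example = [
    "GN   (MSP-31 OR R05F9.13) AND (MSP-40 OR C33F10.9) AND (MSP-142 OR",
    "GN   K05F1.2) AND C34F11.4 AND F58A6.8 AND K07F5.1 AND ZK1248.6.",
]
for synonyms in parse_gn_lines(example):
    print(synonyms)
```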


   2.4  Changes concerning cross-references (DR line)

   We have  added cross-references  from SWISS-PROT  to the Maize genome 2D
   Electrophoresis database.  These cross-references  are present in the DR
   lines:

   Data bank identifier:  MAIZE-2DPAGE
   Primary identifier:    The protein spot unique identifier [1]
   Secondary identifier:  The tissue of origin [2]
   Example:               MAIZE-2DPAGE; P80607; COLEOPTILE.


   [1]  The Maize-2DPAGE database uses SWISS-PROT primary accession numbers
      as the  alphanumeric designation  of spots  that are linked to SWISS-
      PROT entries.
   [2]  Currently only `COLEOPTILE' is used.


   Small changes  have been  made to  the syntax of cross-references to the
   MIM and REBASE databases:

   o  In DR  lines pointing  to MIM, the secondary identifier which used to
      be the  release number  of that  database has  been replaced by a '-'
      (dash). This change  became necessary because MIM is  now updated on
      a daily basis and there are no longer release numbers for this
      database.

   o  REBASE  has  recently  introduced  accession  numbers.  We  therefore
      changed the  format of  DR lines  pointing to  this database. The new
      REBASE accession  numbers are  used as  primary identifiers  and  the
      names of the restriction systems as secondary identifiers.

   Examples:

   DR   MIM; 249900; -.
   DR   REBASE; RB0005; ECORI.



                              3.  PLANNED CHANGES

   3.1  Accession numbers

   With the  creation of  the TREMBL database (see section 6) and the rapid
   increase in  the amount of sequence data, we are faced with a problem of
   availability of  accession numbers. Currently we use a system based on a
   one-letter prefix followed by 5 digits. This system was also used by the
   nucleotide sequence  databases which  had originally reserved for SWISS-
   PROT the prefix letters 'P' and 'Q'. The nucleotide databases, having
   run out of space (due mainly to ESTs), have been forced to start using a
   new format based on a two-letter prefix followed by 6 digits.

   We will  soon have used up all possible numbers with 'P' and 'Q' and the
   only letter prefix which was not used by the nucleotide database is 'O'.
   As we  believe that changing the format of the accession numbers to that
   used now by the nucleotide database would create havoc for the numerous
   software packages  using SWISS-PROT, we have decided to keep a system of
   accession numbers  based on a six-character code, but with the following
   planned changes:

   1) As soon  as we  have used  up all  'P' and 'Q' numbers, we will start
      using 'O'.  This extra  letter should  allow the  continuation of the
      present format (1 prefix letter + 5 digits) for at least a year.

   2) When we have finished using up 'O', we will introduce a system
      based on the following format:

       1        2       3          4            5            6
       [O,P,Q]  [0-9]  [A-Z, 0-9]  [A-Z, 0-9]   [A-Z, 0-9]   [0-9]

      What the  above means is that we  will keep a six-character code, but
      that in  positions 3, 4 and 5 of this code any combination of letters
      and numbers  can be present. This format allows a total of 14 million
      accession numbers (up from 300'000 with the current system).

      We only  allow digits  in positions  2 and  6 so that the SWISS-PROT
      accession numbers  cannot be mistaken for gene names, acronyms, other
      types of accession numbers or any other kind of word!

      Examples: P0A3S2, Q2ASD4, O13YX2, P9B123
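
      The planned format can be expressed as a regular expression, and
      the figures quoted above follow from counting the choices at each
      position. The Python sketch below is only an illustration; the
      names NEW_FORMAT, is_valid_accession and capacity are not part of
      any SWISS-PROT software.

```python
import re

# Position 1: O, P or Q; position 2: digit; positions 3-5: letter or
# digit; position 6: digit.
NEW_FORMAT = re.compile(r"[OPQ][0-9][A-Z0-9]{3}[0-9]")


def is_valid_accession(code: str) -> bool:
    """Return True if code matches the planned six-character format."""
    return NEW_FORMAT.fullmatch(code) is not None


def capacity() -> int:
    """Count the accession numbers the planned format can represent."""
    return 3 * 10 * 36 ** 3 * 10


for code in ("P0A3S2", "Q2ASD4", "O13YX2", "P9B123", "P12345", "PA1234"):
    print(code, is_valid_accession(code))
print(capacity())  # 13996800, i.e. about 14 million
print(3 * 10 ** 5)  # 300000 with one of three letters plus 5 digits
```

      Accession numbers in the present format (for example P12345)
      remain valid under the new pattern, since digits are allowed in
      positions 3 to 5.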



   3.2  Introduction of a new CC line-type topic (DATABASE)

   There are  an increasing  number of databases that cater for a specific
   protein or for a very limited number of proteins. Most of these
   databases are  mutation databases, reporting defects linked to a genetic
   disease. We  want to  add cross-references  to these databases when they
   are available  electronically, either by WWW or by FTP. We are therefore
   adding, in  the next  release, a  new comments  (CC) line-type  'topic':
   "DATABASE" whose syntax will be the following:

   CC   -!- DATABASE: NAME=Text[; NOTE=Text][; WWW="Address"][;
            FTP="Address"].

   Where:

   NAME is the name of the database;
   NOTE is an optional free text note;
   WWW  is the WWW address (URL) of the database;
   FTP  is the  anonymous FTP  address (including the directory name) where
        the database file(s) are stored.


   Examples of its usage:

   CC   -!- DATABASE: NAME=CD40LBASE;
   CC       WWW="HTTP://www.expasy.ch/www/cd40lbase.html";
   CC       FTP="www.expasy.ch/databases/cd40lbase".

   CC   -!- DATABASE: NAME=HAEMB; NOTE=HAEMOPHILIA B DATABASE;
   CC       FTP="ftp.ebi.ac.uk/pub/databases/haemb/".

   Please note  that this  is the  first part  of SWISS-PROT to allow lower
   case characters (yes, we plan to go to mixed cases soon !).
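
   A reader for the planned DATABASE topic could look like the following
   Python sketch. It assumes the syntax given above and that neither
   ';' nor '=' occurs inside a value; parse_database_comment is a
   hypothetical helper name.

```python
def parse_database_comment(lines):
    """Parse the CC lines of one DATABASE topic into a dictionary."""
    # Drop the 'CC' line code and join continuation lines.
    text = " ".join(line[2:].strip() for line in lines)
    prefix = "-!- DATABASE:"
    if not text.startswith(prefix):
        raise ValueError("not a DATABASE comment")
    body = text[len(prefix):].strip()
    # Remove the period that terminates the comment.
    if body.endswith("."):
        body = body[:-1]
    fields = {}
    for item in body.split(";"):
        item = item.strip()
        if not item:
            continue
        key, _, value = item.partition("=")
        # WWW and FTP addresses are quoted; NAME and NOTE are not.
        fields[key.strip()] = value.strip().strip('"')
    return fields


cd40l = parse_database_comment([
    "CC   -!- DATABASE: NAME=CD40LBASE;",
    'CC       WWW="HTTP://www.expasy.ch/www/cd40lbase.html";',
    'CC       FTP="www.expasy.ch/databases/cd40lbase".',
])
print(cd40l)
```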



                    4.  STATUS OF THE DOCUMENTATION FILES

   SWISS-PROT is  distributed with  a large  number of documentation files.
   Some of  these files  have been  available for  a long  time  (the  user
   manual, release  notes, the  various  indices  for  authors,  citations,
   keywords, etc.),  but  many  have  been  created  recently  and  we  are
   continuously adding  new files.  Since release  33, we  have added 6 new
   document files.  The following  table lists all the documents that are
   either currently  available or  that we  plan to  add in  the  next  few
   months.

   USERMAN .TXT   User manual
   RELNOTES.TXT   Release notes
   SHORTDES.TXT   Short description of entries in SWISS-PROT

   JOURLIST.TXT   List of abbreviations for journals cited
   KEYWLIST.TXT   List of keywords in use
   SPECLIST.TXT   List of organism identification codes
   TISSLIST.TXT   List of tissues (in RC line) [1]
   EXPERTS .TXT   List of on-line experts for PROSITE and SWISS-PROT
   SUBMIT  .TXT   Submission of sequence data to SWISS-PROT

   ACINDEX .TXT   Accession number index
   AUTINDEX.TXT   Author index
   CITINDEX.TXT   Citation index
   KEYINDEX.TXT   Keyword index
   SPEINDEX.TXT   Species index

   7TMRLIST.TXT   List of 7-transmembrane G-linked receptors entries
   AATRNASY.TXT   List of aminoacyl-tRNA synthetases
   ALLERGEN.TXT   Nomenclature and index of allergen sequences
   CALBICAN.TXT   Index of Candida albicans entries and their corresponding
                  gene designations
   CDLIST  .TXT   CD nomenclature for surface proteins of human leucocytes
   CELEGANS.TXT   Index  of   Caenorhabditis  elegans   entries  and  their
                  corresponding gene designations and WormPep cross-
                  references
   DICTY   .TXT   Index  of  Dictyostelium  discoideum  entries  and  their
                  corresponding gene designations and DictyDB cross-
                  references
   EC2DTOSP.TXT   Index of  Escherichia coli  Gene-protein database entries
                  referenced in SWISS-PROT
   ECOLI   .TXT   Index of  Escherichia coli  K12 chromosomal  entries  and
                  their corresponding EcoGene cross-references
   EMBLTOSP.TXT   Index of EMBL Database entries referenced in SWISS-PROT
   EXTRADOM.TXT   Nomenclature of extracellular domains
   GLYCOSID.TXT   Classification of  glycosyl hydrolases families and index
                  of glycosyl hydrolase entries
   HAEINFLU.TXT   Index of Haemophilus influenzae RD chromosomal entries
   HOXLIST .TXT   Vertebrate homeotic Hox proteins: nomenclature and index
   HUMCHR20.TXT   Index  of  protein  sequence  entries  encoded  on  human
                  chromosome 20 [1]
   HUMCHR21.TXT   Index  of  protein  sequence  entries  encoded  on  human
                  chromosome 21
   HUMCHR22.TXT   Index  of  protein  sequence  entries  encoded  on  human
                  chromosome 22
   HUMCHRX .TXT   Index  of  protein  sequence  entries  encoded  on  human
                  chromosome X [1]
   HUMCHRY .TXT   Index  of  protein  sequence  entries  encoded  on  human
                  chromosome Y
   MIMTOSP .TXT   Index of MIM entries referenced in SWISS-PROT
   MYGENIT .TXT   Index of Mycoplasma genitalium chromosomal entries [2]
   NOMLIST .TXT   List of nomenclature related references for proteins
   PDBTOSP .TXT   Index of  X-ray crystallography  Protein Data  Bank (PDB)
                  entries referenced in SWISS-PROT
   PEPTIDAS.TXT   Classification  of   peptidase  families   and  index  of
                  peptidase entries
   PLASTID .TXT   List of chloroplast and cyanelle encoded proteins
   POMBE   .TXT   Index of  Schizosaccharomyces pombe entries in SWISS-PROT
                  and their corresponding gene designations
   RESTRIC .TXT   List of restriction enzyme and methylase entries
   RIBOSOMP.TXT   Index of ribosomal proteins classified by families on the
                  basis of sequence similarities [1]
   SALTY   .TXT   Index of  Salmonella typhimurium  LT2 chromosomal entries
                  and their corresponding StyGene cross-references
   SUBTILIS.TXT   Index of  Bacillus subtilis  168 chromosomal  entries and
                  their corresponding SubtiList cross-references
   YEAST   .TXT   Index  of  Saccharomyces  cerevisiae  entries  and  their
                  corresponding gene designations
   YEAST1  .TXT   Yeast Chromosome I entries
   YEAST2  .TXT   Yeast Chromosome II entries
   YEAST3  .TXT   Yeast Chromosome III entries
   YEAST5  .TXT   Yeast Chromosome V entries
   YEAST6  .TXT   Yeast Chromosome VI entries
   YEAST7  .TXT   Yeast Chromosome VII entries [1]
   YEAST8  .TXT   Yeast Chromosome VIII entries
   YEAST9  .TXT   Yeast Chromosome IX entries
   YEAST10 .TXT   Yeast Chromosome X entries
   YEAST11 .TXT   Yeast Chromosome XI entries
   YEAST13 .TXT   Yeast Chromosome XIII entries [2]
   YEAST14 .TXT   Yeast Chromosome XIV entries [1]

   Notes:

   [1]  New in release 34.
   [2]  Will be available starting with release 35 of February 1997.


   We have  continued to  include in  some SWISS-PROT  document  files  the
   references of  World-Wide  Web  sites  relevant  to  the  subject  under
   consideration. There are now 12 documents that include such links.



                     5.  THE EXPASY WORLD-WIDE WEB SERVER

   5.1  Background information

   The most  efficient and  user-friendly way  to browse  interactively  in
   SWISS-PROT, PROSITE, ENZYME, SWISS-2DPAGE and other databases is to use
   the World-Wide  Web (WWW)  molecular biology  server ExPASy.  WWW  is  a
   global information  retrieval system  merging the  power  of  world-wide
   networks, hypertext  and multimedia.  Through hypertext  links, it gives
   access to  documents and  information available  on thousands of servers
   around the  world. To  access a  WWW server  one needs  a  WWW  browser.
   Currently, the  most popular  browser  is  Netscape  Navigator(TM)  from
   Netscape Communications Corp. (available from ftp.netscape.com). Using a
   WWW browser, one has access to all the hypertext documents stored on the
   ExPASy server as well as many other WWW servers.

   The ExPASy server was made available to the public in September 1993. In
   October 1996  a cumulative  total of 8 million connections was attained.
   It may  be accessed  through its  Uniform Resource  Locator (URL  -  the
   addressing system defined in WWW), which is:

        http://www.expasy.ch/

   The ExPASy  WWW server  allows access, using the user-friendly hypertext
   model, to  the SWISS-PROT,  PROSITE, ENZYME, SWISS-2DPAGE, SWISS-3DIMAGE
   and CD40Lbase  databases and,  through any  SWISS-PROT protein  sequence
   entry, to  other databases  such as  EMBL, Eco2DBASE,  EcoCyc,  FlyBase,
   GCRDb, MaizeDB,  SubtiList/NRSub, OMIM,  PDB, HSSP, ProDom, REBASE, SGD,
   YEPD and  Medline. Using  a browser  which is able to display images one
   can also  remotely access  2D gel  image data from SWISS-2DPAGE. ExPASy
   also offers  many tools  for the  analysis of  protein sequences  and 2D
   gels.

   For more  information on  the  ExPASy  WWW  server,  you  can  read  the
   following article:

      Appel R.D., Bairoch A., Hochstrasser D.F.
      A new  generation of  information retrieval tools for biologists: the
      example of the ExPASy WWW server.
      Trends Biochem. Sci. 19:258-260(1994).

   Or you can contact Dr. Ron Appel:

      Email: ron.appel@dim.hcuge.ch


   5.2  SWISS-SHOP

   Thanks to the work of Manuel Peitsch from the Geneva Glaxo Institute for
   Molecular Biology,  we can  provide, on ExPASy, a  service called SWISS-
   SHOP. SWISS-SHOP  allows  any  user  of  SWISS-PROT  to  indicate  which
   proteins he/she  is interested  in.  This  can  be  done  using  various
   criteria that can be combined:

   -  By entering  one  or  more  words  that  should  be  present  in  the
      description line;
   -  By entering one or more species name(s) or taxonomic division(s);
   -  By entering one or more keywords;
   -  By entering one or more author names;
   -  By entering the accession number (or entry name) of a PROSITE pattern
      or a user-defined sequence pattern;
   -  By entering  the accession  number (or  entry name)  of  an  existing
      SWISS-PROT entry or by entering a "private" sequence.

   Every week,  the new  sequences entered  in SWISS-PROT are automatically
   compared with all the criteria that have been defined by the users. If a
   sequence corresponds  to the  selection criteria defined by a user, that
   sequence is sent by electronic mail.


   5.3  What is new on ExPASy

   Since  the   last  release,  there  has  been  a  large  number  of  new
   developments on the ExPASy WWW server. Here are some highlights of these
   changes:

   -  CD40Lbase, The  European CD40L  Defect Database  prepared  by  Manuel
      Peitsch, has  been made accessible through the ExPASy WWW server. The
      purpose of  CD40Lbase is  to collect  clinical and  molecular data on
      CD40 ligand defects leading to X-linked Hyper-IgM syndrome.

   -  Two new tools are available from the "Tools" page:

      PeptideMass: this  program is  designed to  calculate the theoretical
      masses of peptides generated by the chemical or enzymatic cleavage of
      proteins,  to   assist  in   the  interpretation   of  peptide   mass
      fingerprinting and  peptide mapping  experiments.  When  proteins  of
      interest are  specified from  SWISS-PROT, the  program considers  all
      annotations for that protein in the database, and uses these in order
      to generate  the correct peptide masses and warn users about peptides
      that are  not likely  to  be  found  when  undertaking  peptide  mass
      fingerprinting. Many  protein post-translational  modifications which
      affect the masses of peptides can thus be taken into consideration.

      TagIdent: this is a protein identification tool which improves on and
      supersedes the tool previously known as 'GuessProt'. The user can now
      identify  proteins  from  2-D  gels  by  giving  protein  pI  and  MW
      estimates, a  species or  organism classification  of interest, and a
      short sequence  tag of  up to  6 amino acids. This tag can be derived
      from the  N-terminus, the  C-terminus or  from internal peptides of a
      protein. The  results are  now sent  to the  user by e-mail, allowing
      many searches to be done at the same time.

   -  In PROSITE  and ENZYME,  we have  added the  possibility to  save all
      referenced SWISS-PROT entries to a user-defined file on our anonymous
      FTP server "outgoing" directory.

   -  At the  end of  each page displaying a SWISS-PROT entry we have added
      links to  some of our sequence analysis tools so as to allow users to
      directly submit the displayed sequence to these tools.

   -  An email  option has  been added to the tool ScanProsite. If you want
      to scan  a pattern  against SWISS-PROT,  you now  have the  option of
      having the  results of  your query sent by email, which should avoid
      the previously frequent timeout problems and is particularly useful
      for complex patterns.

   -  WWW links  have  been  implemented  between  SWISS-PROT  entries  and
      nucleotide entries from DDBJ, the DNA Data Bank of Japan (in addition
      to the  existing links  to EMBL  at EBI and GenBank at NCBI). We have
      also added  direct WWW  links to:  SubtiList, the  Bacillus  subtilis
      genomic database (http://www.pasteur.fr/Bio/SubtiList.html); YPD, the
      Yeast Protein  Database (http://quest7.proteome.com/YPDhome.html) and
      ECO2DBASE,     the      Escherichia     coli      2DPAGE     database
      (http://pcsf.brcf.med.umich.edu/eco2dbase).

   -  Links have  been established  from most  feature (FT) lines of SWISS-
      PROT entries  to  pages  that  highlight the subsequence in question,
      both in 1- and in 3-letter amino acid codes.

   -  2D Hunt,  a database  created and  continuously updated by the Marvin
      (http://www.hon.ch/MedHunt/Marvin.html) robot, contains sites related
      to electrophoresis and  more specifically  to 2-D electrophoresis. It
      is accessible from the SWISS-2DPAGE top page of ExPASy.

   -  We have  continued to build a list of biomolecular servers; this list
      is available on the ExPASy top page or directly from:

                http://www.expasy.ch/www/amos_www_links.html

   -  Many other changes have been made to all parts of the server.
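
   To illustrate the kind of calculation described for PeptideMass in the
   list above, here is a minimal Python sketch of a tryptic digest with
   monoisotopic peptide masses. It is not the PeptideMass implementation:
   it ignores SWISS-PROT annotations and post-translational
   modifications, allows no missed cleavages, and assumes the usual
   trypsin rule (cleavage after K or R unless followed by P). The residue
   masses are standard monoisotopic values, not figures from these
   release notes.

```python
# Monoisotopic residue masses in daltons (standard values).
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056  # one water per free peptide (H on N-term, OH on C-term)


def tryptic_peptides(sequence):
    """Cleave after K or R, except when the next residue is P."""
    peptides, start = [], 0
    for i, residue in enumerate(sequence):
        next_residue = sequence[i + 1] if i + 1 < len(sequence) else ""
        if residue in "KR" and next_residue != "P":
            peptides.append(sequence[start:i + 1])
            start = i + 1
    if start < len(sequence):
        peptides.append(sequence[start:])
    return peptides


def peptide_mass(peptide):
    """Return the monoisotopic mass of an unmodified peptide."""
    return sum(RESIDUE_MASS[residue] for residue in peptide) + WATER


for peptide in tryptic_peptides("MKWVTFISLLRPGAK"):
    print(peptide, round(peptide_mass(peptide), 4))
```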


                   6.  TREMBL - A SUPPLEMENT TO SWISS-PROT

   The ongoing  genome sequencing  and mapping  projects have  dramatically
   increased the number of protein sequences to be incorporated into SWISS-
   PROT. Since we do not want to dilute the quality standards of SWISS-PROT
   by incorporating  sequences into  the database  without proper  sequence
   analysis and  annotation, we  cannot speed  up the  incorporation of new
   incoming data  indefinitely. But  as we  also want to make the sequences
   available as  fast as  possible, we  have introduced  with SWISS-PROT a
   computer-annotated supplement.  This supplement  consists of entries in
   SWISS-PROT-like format  derived  from  the  translation  of  all  coding
   sequences (CDS)  in the  EMBL nucleotide sequence database, except those
   already  included   in  SWISS-PROT.   We  name  this  supplement  TREMBL
   (TRanslation from  EMBL). It  can be considered as a preliminary section
   of SWISS-PROT. TREMBL is split into two main sections, SP-TREMBL and
   REM-TREMBL:

   SP-TREMBL (SWISS-PROT TREMBL) contains the entries (86'040) which should
   be incorporated  into SWISS-PROT. SWISS-PROT accession numbers have been
   assigned for all SP-TREMBL entries.

   REM-TREMBL (REMaining  TREMBL) contains  the entries (19'255) that we do
   not want  to include  in SWISS-PROT  for a variety of reasons (synthetic
   sequences, pseudogenes,  translations of  incorrect open reading frames,
   fragments with  fewer than eight amino acids, patent-derived sequences,
   immunoglobulins and T-cell receptors, etc.).

   TREMBL is  available by  FTP from  the EBI server (ftp.ebi.ac.uk) in the
   directory '/pub/databases/trembl'.  It can  be queried on WWW by the EBI
   SRS server (http://www.ebi.ac.uk/srs/srsc). It  is also available on the
   SWISS-PROT CD-ROM and is searchable on the FASTA and BLITZ email servers
   of the EBI.



                       7.  WEEKLY UPDATES OF SWISS-PROT

   Weekly updates of SWISS-PROT are available by anonymous FTP. Three files
   are updated at each update:

   new_seq.dat    Contains all the new entries since the last full release;
   upd_seq.dat    Contains the entries for which the sequence data has been
                  updated since the last release;
   upd_ann.dat    Contains the  entries for  which one  or more  annotation
                  fields have been updated since the last release.

   Currently these  files are  available on  the  following  anonymous  ftp
   servers:

   Organization   ExPASy (Geneva University Expert Protein Analysis System)
   Address        www.expasy.ch
   Directory      /databases/swiss-prot/updates

   Organization   National Center for Biotechnology Information (NCBI)
   Address        ncbi.nlm.nih.gov
   Directory      /repository/swiss-prot/updates

   Organization   European Bioinformatics Institute (EBI)
   Address        ftp.ebi.ac.uk
   Directory      /pub/databases/swissprot/new

   Organization   Bioinformatics Unit, Weizmann Institute of Science (WIS)
   Address        bioinformatics.weizmann.ac.il
   Directory      /pub/databases/swiss-prot/updates


   !!! Important notes !!!

   Although we  try to  follow a  regular schedule,  we do  not promise  to
   update these  files every  week. In some cases two weeks will elapse
   between two updates.

   Due to  the current mechanism used to build a release, the entries that
   are provided in these updates are not guaranteed to be error free.



                            8.  ENZYME AND PROSITE

   8.1  The ENZYME data bank

   Release 21.0  of the  ENZYME data bank is distributed with release 34 of
   SWISS-PROT. ENZYME  release 21.0  contains information  relative to 3646
   enzymes.

   8.2  The PROSITE data bank

   Release 13.2  of the PROSITE data bank is distributed with release 34 of
   SWISS-PROT. This  release of  PROSITE contains 889 documentation entries
   that describe  1'167 different  patterns, rules  and  profiles/matrices.
   Release 13.2  does not  really represent a new release; the only changes
   between releases  13.0 and  13.2 are  updates to  the pointers  to  the
   SWISS-PROT entries whose names were modified between releases 32 and
   34. The  next release of PROSITE (14.0) will be distributed with release
   35 of SWISS-PROT.



                           9.  WE NEED YOUR HELP !

   We welcome feedback from our users. We would especially appreciate being
   notified if  you find  that sequences  belonging to  your field of
   expertise are  missing from  the data  bank. We  also would  like to  be
   notified about  annotations to be updated, if, for example, the function
   of a protein has been clarified or if new post-translational information
   has become available.


   ========================================================================


                         APPENDIX A: SOME STATISTICS


   A.1  Amino acid composition

        A.1.1  Composition in percent for the complete data bank

   Ala (A) 7.55   Gln (Q) 4.02   Leu (L) 9.33   Ser (S) 7.22
   Arg (R) 5.15   Glu (E) 6.32   Lys (K) 5.93   Thr (T) 5.74
   Asn (N) 4.52   Gly (G) 6.84   Met (M) 2.35   Trp (W) 1.25
   Asp (D) 5.30   His (H) 2.24   Phe (F) 4.07   Tyr (Y) 3.19
   Cys (C) 1.69   Ile (I) 5.72   Pro (P) 4.92   Val (V) 6.52

   Asx (B) 0.001  Glx (Z) 0.001  Xaa (X) 0.01


        A.1.2  Classification of the amino acids by their frequency

   Leu, Ala, Ser, Gly, Val, Glu, Lys, Thr, Ile, Asp, Arg, Pro, Asn, Phe,
   Gln, Tyr, Met, His, Cys, Trp
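
   The ranking in A.1.2 follows directly from the percentages in A.1.1.
   As a minimal sketch (Python, with the table values typed in by hand and
   illustrative variable names), sorting by decreasing percentage
   reproduces the list above:

```python
# Percent composition copied from table A.1.1 (the 20 standard amino acids).
composition = {
    "Ala": 7.55, "Gln": 4.02, "Leu": 9.33, "Ser": 7.22,
    "Arg": 5.15, "Glu": 6.32, "Lys": 5.93, "Thr": 5.74,
    "Asn": 4.52, "Gly": 6.84, "Met": 2.35, "Trp": 1.25,
    "Asp": 5.30, "His": 2.24, "Phe": 4.07, "Tyr": 3.19,
    "Cys": 1.69, "Ile": 5.72, "Pro": 4.92, "Val": 6.52,
}

# Sort the residue names by decreasing frequency.
ranking = sorted(composition, key=composition.get, reverse=True)
print(", ".join(ranking))
```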



   A.2  Distribution of the sequences by their organism of origin

   Total number of species represented in this release of SWISS-PROT: 5389

   The first twenty species represent 28511 sequences: 48.3 % of the total
   number of entries.


   A.2.1 Table of the frequency of occurrence of species

        Species represented 1x: 2447
                            2x:  844
                            3x:  477
                            4x:  299
                            5x:  220
                            6x:  199
                            7x:  131
                            8x:   98
                            9x:  116
                           10x:   51
                       11- 20x:  229
                       21- 50x:  162
                       51-100x:   54
                         >100x:   62
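
   As a quick consistency check, the classes of table A.2.1 should add up
   to the total number of species quoted in A.2. A minimal Python sketch,
   with the values copied by hand from the table (the dictionary keys are
   only labels chosen for this illustration):

```python
# Number of species in each occurrence class, copied from table A.2.1.
species_per_class = {
    "1x": 2447, "2x": 844, "3x": 477, "4x": 299, "5x": 220,
    "6x": 199, "7x": 131, "8x": 98, "9x": 116, "10x": 51,
    "11-20x": 229, "21-50x": 162, "51-100x": 54, ">100x": 62,
}

total_species = sum(species_per_class.values())
print(total_species)  # 5389, the total given in A.2
```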


   A.2.2  Table of the most represented species

    Number   Frequency          Species
         1        4340          Baker's yeast (Saccharomyces cerevisiae)
         2        4000          Human
         3        3606          Escherichia coli
         4        2429          Mouse
         5        2121          Rat
         6        1783          Bacillus subtilis
         7        1591          Haemophilus influenzae
         8        1208          Caenorhabditis elegans
         9         956          Fission yeast (Schizosaccharomyces pombe)
        10         910          Fruit fly (Drosophila melanogaster)
        11         899          Bovine
        12         709          Chicken
        13         617          Salmonella typhimurium
        14         582          African clawed frog (Xenopus laevis)
        15         562          Arabidopsis thaliana (Mouse-ear cress)
        16         502          Rabbit
        17         474          Mycobacterium tuberculosis
        18         446          Pig
        19         425          Mycoplasma genitalium
        20         381          Maize
        21         276          Rice
        22         275          Bacteriophage T4
        23         265          Slime mold (Dictyostelium discoideum)
        24         262          Pseudomonas aeruginosa
        25         253          Vaccinia virus (strain Copenhagen)
        26         229          Tobacco
        27         219          Porphyra purpurea
        28         217          Pea
        29         207          Dog
        30         193          Wheat
                   193          Human cytomegalovirus (strain AD169)
        32         190          Barley
        33         186          Staphylococcus aureus
                   186          Soybean
        35         184          Vaccinia virus (strain WR)
        36         183          Sheep
        37         173          Neurospora crassa
        38         172          Pseudomonas putida
        39         171          Rhodobacter capsulatus
        40         169          Mycobacterium leprae
        41         161          Potato
        42         157          Synechocystis sp. (strain PCC 6803)
                   157          Klebsiella pneumoniae
        44         154          Tomato
                   154          Bacillus stearothermophilus
                   154          Autographa californica nuclear polyhedrosis virus
        47         150          Marchantia polymorpha (Liverwort)
        48         148          Spinach
        49         146          Variola virus
        50         142          Cyanophora paradoxa
        51         139          Agrobacterium tumefaciens
        52         138          Odontella sinensis
        53         137          Rhizobium meliloti
        54         127          Lactococcus lactis (subsp. lactis)
        55         125          Chlamydomonas reinhardtii
        56         124          Candida albicans
        57         121          Guinea pig
        58         116          Streptomyces coelicolor
        59         109          Aspergillus nidulans
        60         108          Horse
        61         107          Trypanosoma brucei brucei
        62         101          Anabaena sp. (strain PCC 7120)



   A.3  Distribution of the sequences by size

               From   To  Number             From   To   Number
                  1-  50    2831             1001-1100      534
                 51- 100    5243             1101-1200      405
                101- 150    7359             1201-1300      290
                151- 200    5678             1301-1400      186
                201- 250    5207             1401-1500      165
                251- 300    4745             1501-1600       99
                301- 350    4445             1601-1700       84
                351- 400    4533             1701-1800       70
                401- 450    3420             1801-1900       80
                451- 500    3320             1901-2000       47
                501- 550    2455             2001-2100       30
                551- 600    1735             2101-2200       53
                601- 650    1292             2201-2300       63
                651- 700     971             2301-2400       27
                701- 750     877             2401-2500       34
                751- 800     721             >2500          176
                801- 850     544
                851- 900     570
                901- 950     391
                951-1000     341



   A.4  Longest sequences

   The longest sequences (>=4000 residues) are listed here:

                               HTS1_COCCA  5217
                               FAT_DROME   5147
                               RYNR_RABIT  5037
                               RYNR_PIG    5035
                               RYNR_HUMAN  5032
                               RYNC_RABIT  4969
                               LRP_CAEEL   4753
                               DYHC_DICDI  4725
                               PLEC_RAT    4687
                               LRP2_RAT    4660
                               DYHC_RAT    4644
                               DYHC_DROME  4639
                               APB_HUMAN   4563
                               APOA_HUMAN  4548
                               LRP1_HUMAN  4544
                               LRP1_CHICK  4543
                               RRPA_CVMJH  4488
                               DYHC_ANTCR  4466
                               DYHC_TRIGR  4466
                               GRSB_BACBR  4451
                               PKSK_BACSU  4447
                               PKSL_BACSU  4427
                               PGBM_HUMAN  4393
                               YP73_CAEEL  4385
                               DYHC_NEUCR  4367
                               DYHC_EMENI  4344
                               PKD1_HUMAN  4303
                               DYHC_YEAST  4092
                               RRPA_CVH22  4085


   A.5  Statistics for journal citations


   Total number of journals cited in this release of SWISS-PROT: 776


   A.5.1 Table of the frequency of journal citations

        Journals cited 1x: 295
                       2x:  97
                       3x:  64
                       4x:  31
                       5x:  29
                       6x:  23
                       7x:   9
                       8x:   8
                       9x:  13
                      10x:  10
                  11- 20x:  68
                  21- 50x:  42
                  51-100x:  23
                    >100x:  64


   A.5.2  List of the most cited journals in SWISS-PROT

   Citations          Journal abbreviation
   ---------          ----------------------------------
   5458               J. BIOL. CHEM.
   3394               PROC. NATL. ACAD. SCI. U.S.A.
   3266               NUCLEIC ACIDS RES.
   2322               J. BACTERIOL.
   2059               GENE
   1825               FEBS LETT.
   1713               EUR. J. BIOCHEM.
   1540               EMBO J.
   1526               BIOCHEM. BIOPHYS. RES. COMMUN.
   1425               NATURE
   1384               BIOCHEMISTRY
   1235               BIOCHIM. BIOPHYS. ACTA
   1090               J. MOL. BIOL.
   1069               CELL
   1043               MOL. CELL. BIOL.
    860               MOL. GEN. GENET.
    834               PLANT MOL. BIOL.
    768               BIOCHEM. J.
    736               VIROLOGY
    677               SCIENCE
    645               MOL. MICROBIOL.
    613               J. BIOCHEM.
    535               GENOMICS
    486               J. VIROL.
    423               J. GEN. VIROL.
    378               J. CELL BIOL.
    370               PLANT PHYSIOL.
    349               YEAST
    341               GENES DEV.
    288               CURR. GENET.
    286               HUM. MOL. GENET.
    282               J. IMMUNOL.
    267               ARCH. BIOCHEM. BIOPHYS.
    259               BIOL. CHEM. HOPPE-SEYLER
    256               INFECT. IMMUN.
    252               MOL. BIOCHEM. PARASITOL.
    231               ONCOGENE
    214               MOL. ENDOCRINOL.
    213               HOPPE-SEYLER'S Z. PHYSIOL. CHEM.
    208               J. GEN. MICROBIOL.
    201               AM. J. HUM. GENET.
    198               FEMS MICROBIOL. LETT.
    195               J. CLIN. INVEST.
    179               DEVELOPMENT
    165               NAT. GENET.
    164               J. MOL. EVOL.
    160               GENETICS
    151               DNA
    150               HUM. MUTAT.
    148               J. EXP. MED.
    143               BLOOD
    140               DNA CELL BIOL.
    138               HUM. GENET.
    129               NEURON
    128               DEV. BIOL.
    123               APPL. ENVIRON. MICROBIOL.
    114               PLANT CELL
    109               IMMUNOGENETICS
    109               HEMOGLOBIN
    105               AGRIC. BIOL. CHEM.
    103               DNA SEQ.
    101               BIOCHIMIE
    101               BIOORG. KHIM.
    101               ENDOCRINOLOGY

   ========================================================================

           APPENDIX B: RELATIONSHIPS BETWEEN BIOMOLECULAR DATABASES

   The current  status of the relationships (cross-references) between some
   biomolecular databases is shown in the following schematic:


                         ***********************
******************       *  EMBL Nucleotide    *       **********************
* EPD [Euk.Prom] * <---> *  Sequence Database  * <---- * ECDC [E.coli map]  *
******************       *       [EBI]         *       **********************
                         ***********************
                          ^  ^ ^  ^  ^ ^ ^  ^
******************        |  | |  I  | | |  |
* FlyBase        * <------+  | |  I  | | |  |          **********************
* [D.melanogas.] *        |  | |  I  | | |  +--------> * GCRDb [7TM recep.] *
******************        |  | |  I  | | |  |          **********************
                          |  | |  I  | | |  |
******************        |  | |  I  | | |  |          **********************
* SubtiList      * <---------+ |  I  | | +-----------> * EcoGene [E.coli]   *
* [B.subtilis]   *        |  | |  I  | | |  |          **********************
******************        |  | |  I  | | |  |
                          |  | |  I  | | |  |          **********************
******************        |  | |  I  +---------------> * SGD [Yeast]        *
* MaizeDb        * <-----------+  I  | | |  |          **********************
* [Zea mays]     *        |  | |  I  | | |  |
******************        |  | |  I  | | |  |          **********************
                          |  | |  I  | +-------------> * DictyDB [D.disco.] *
******************        |  | |  I  | | |  |          **********************
* WormPep        *        |  | |  I  | | |  |
* [C.elegans]    * <----+ |  | |  I  | | |  |          **********************
******************      | |  | |  I  | | |  | +------  * ENZYME [Nomencl.]  *
                        | |  | |  I  | | |  | |        **********************
******************      | v  v v  v  v v v  v v            v
* REBASE         *      ***********************        **********************
* [Restriction   * <--- *  SWISS-PROT         * -----> * OMIM [Human]       *
*  enzymes]      *      *  Protein Sequence   *        **********************
******************      *  Data Bank          *
                        ***********************        **********************
******************      ^ ^ ^ ^ ^ ^ ^ | ^ ^ ^          * ECO2DBASE     [2D] *
* StyGene        *      | | | | | | | | | | +--------> **********************
* [S.Typhimurium]* <----+ | | | | | | | | |
******************        | | | | | | | | |            **********************
                          | | | | | | | | +----------> * Maize-2DPAGE  [2D] *
******************        | | | | | | | |              **********************
* Transfac       * <------+ | | | | | | |
******************          | | | | | | |              **********************
                            | | | | | | +------------> * SWISS-2DPAGE  [2D] *
******************          | | | | | |                **********************
* Harefield [2D] * <--------+ | | | | |
******************            | | | | |                **********************
                              | | | | +--------------> * Aarhus/Ghent  [2D] *
******************            | | | |                  **********************
* PROSITE        *            | | | |
* [Patterns and  * <----------+ | | +----------------> **********************
* profiles]      *              | |                    * YEPD [Yeast]  [2D] *
******************              | +----------------+   **********************
             |                  v                  |
             |          ***********************    +-> **********************
             +--------> * PDB [3D structures] * <----- * HSSP [3D similar.] *
                        ***********************        **********************

   =End=of=SWISS-PROT=release=34=notes=====================================
  

Swiss-Prot release 33.0

Published February 1, 1996



                    SWISS-PROT RELEASE 33.0 RELEASE NOTES


                               1. INTRODUCTION

   1.1  Evolution

   Release 33.0  of SWISS-PROT contains 52'205 sequence entries, comprising
   18'531'384  amino   acids  abstracted   from  45'351   references.  This
   represents an  increase of  6.5% over release 32. The growth of the data
   bank is summarized below.

   Release    Date   Number of entries     Nb of amino acids

   2.0        09/86               3939               900 163
   3.0        11/86               4160               969 641
   4.0        04/87               4387             1 036 010
   5.0        09/87               5205             1 327 683
   6.0        01/88               6102             1 653 982
   7.0        04/88               6821             1 885 771
   8.0        08/88               7724             2 224 465
   9.0        11/88               8702             2 498 140
   10.0       03/89              10008             2 952 613
   11.0       07/89              10856             3 265 966
   12.0       10/89              12305             3 797 482
   13.0       01/90              13837             4 347 336
   14.0       04/90              15409             4 914 264
   15.0       08/90              16941             5 486 399
   16.0       11/90              18364             5 986 949
   17.0       02/91              20024             6 524 504
   18.0       05/91              20772             6 792 034
   19.0       08/91              21795             7 173 785
   20.0       11/91              22654             7 500 130
   21.0       03/92              23742             7 866 596
   22.0       05/92              25044             8 375 696
   23.0       08/92              26706             9 011 391
   24.0       12/92              28154             9 545 427
   25.0       04/93              29955            10 214 020
   26.0       07/93              31808            10 875 091
   27.0       10/93              33329            11 484 420
   28.0       02/94              36000            12 496 420
   29.0       06/94              38303            13 464 008
   30.0       10/94              40292            14 147 368
   31.0       02/95              43470            15 335 248
   32.0       11/95              49340            17 385 503
   33.0       02/96              52205            18 531 384
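
   The growth between two releases can be recomputed from the rows of the
   table. The sketch below (Python; the helper name percent_growth is an
   illustration, not part of any distributed tool) does so for the last
   two rows, once for the number of entries and once for the number of
   amino acids. The two values differ, so a rounded percentage quoted in
   the text may depend on which column it was computed from.

```python
# Last two rows of the growth table in section 1.1.
releases = {
    "32.0": {"entries": 49340, "residues": 17_385_503},
    "33.0": {"entries": 52205, "residues": 18_531_384},
}

def percent_growth(old: int, new: int) -> float:
    """Relative increase from old to new, in percent."""
    return 100.0 * (new - old) / old

entry_growth = percent_growth(releases["32.0"]["entries"],
                              releases["33.0"]["entries"])
residue_growth = percent_growth(releases["32.0"]["residues"],
                                releases["33.0"]["residues"])
print(f"entries: +{entry_growth:.1f}%  amino acids: +{residue_growth:.1f}%")
```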



      2. DESCRIPTION OF THE CHANGES MADE TO SWISS-PROT SINCE RELEASE 32


   2.1  Sequences and annotations

   2'910 sequences  have been  added since release 32, the sequence data of
   1085 existing  entries has  been updated  and the  annotations of  6'340
   entries have been revised.

   Major annotation and sequence updates have been made in preparation for
   the changes that will take place in release 33 (see section 3.1 of these
   notes).


   2.2  What's happening with the model organisms

   We have  selected a  number of  organisms that  are the target of genome
   sequencing and/or mapping projects and for which we intend to:

   -  Be as  complete as  possible. All sequences available at a given time
      should be  immediately included  in SWISS-PROT.  This  also  includes
      sequence corrections and updates;
   -  Provide a higher level of annotation;
   -  Provide cross-references  to specialized  database(s)  that  contain,
      among other  data, some genetic information about the genes that code
      for these proteins;
   -  Provide specific indices or documents.

   What was  done since  the last  release or  in preparation  for the next
   release concerning model organisms:

   -  We have  added Mycoplasma  genitalium to the list of model organisms.
      It is the second bacterial genome to be completely sequenced. We have
      already annotated  344 of  the 470  putative proteins encoded by this
      small genome.

   -  We have  started a  major effort  in catching  up with the backlog of
      sequences from eukaryotic model organisms. In particular we added 262
      entries from  yeast, 194  from  human,  180  from  S.pombe,  82  from
      C.elegans, 68 from A.thaliana and 50 from Drosophila.

   -  We have added to SWISS-PROT all the sequences from yeast chromosome
      X. We plan to integrate data from chromosome XIII very soon.


   Here is the current status of the model organisms:

   Organism         Database               Index file       Number of
                    cross-referenced                        sequences
   --------------   ---------------------  --------------   ---------
   A.thaliana       None yet               In preparation        500
   B.subtilis       SubtiList              SUBTILIS.TXT         1389
   C.albicans       None yet               CALBICAN.TXT          106
   C.elegans        WormPep                CELEGANS.TXT         1006
   D.discoideum     DictyDB                DICTY.TXT             213
   D.melanogaster   FlyBase                In preparation        818
   E.coli           EcoGene                ECOLI.TXT            3471
   H.influenzae     None yet               HAEINFLU.TXT         1577
   H.sapiens        MIM                    MIMTOSP.TXT          3475
   M.genitalium     None yet               In preparation        344
   S.cerevisiae     LISTA/SGD              YEAST.TXT            3653
   S.typhimurium    StyGene                SALTY.TXT             603
   S.pombe          None yet               POMBE.TXT             640
   S.solfataricus   None yet               None yet               61



   2.3  Major changes to the cross-references to EMBL

   In this release, the format of the DR (Database cross-Reference) lines
   pointing to EMBL Nucleotide Sequence Database entries has been changed
   from:

   DR   EMBL; ACCESSION_NUMBER; ENTRY_NAME.

   to:

   DR   EMBL; ACCESSION_NUMBER; PID; STATUS_IDENTIFIER.

   Where 'PID'  stands for  the "Protein  IDentification" number.  It is  a
   number that  you  find  in  EMBL  and  GenBank  in  a  qualifier  called
   "/db_xref" which  is tagged  to every  CDS in  the nucleotide  database.
   Example:

   FT   CDS             54..1382
   FT                   /note="ribulose-1,5-bisphosphate carboxylase/
   FT                   oxygenase activase precursor"
   FT                   /db_xref="PID:g1006835"

   When an EMBL database CDS exists as a sequence report in SWISS-PROT, the
   DR lines of the corresponding SWISS-PROT entry have been updated by
   citing the PID as secondary identifier. In all cases where a PID has
   been integrated into SWISS-PROT, a "/db_xref" qualifier citing the
   corresponding SWISS-PROT entry has been added to the EMBL database CDS
   labeled with this PID. Example:

   FT   CDS             14556..15696
   FT                   /gene="cytochrome b"
   FT                   /codon_start=1
   FT                   /product="apoprotein"
   FT                   /db_xref="PID:g463170"
   FT                   /db_xref="SWISS-PROT:P12778"
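
   As an illustration of how such qualifiers can be read by software, the
   following Python sketch collects the "/db_xref" values of a CDS feature.
   It assumes that each qualifier sits on a single FT line in the form
   shown above; the function name db_xrefs is hypothetical.

```python
import re

# Feature-table lines modeled on the example above.
ft_lines = '''\
FT   CDS             14556..15696
FT                   /gene="cytochrome b"
FT                   /codon_start=1
FT                   /product="apoprotein"
FT                   /db_xref="PID:g463170"
FT                   /db_xref="SWISS-PROT:P12778"
'''.splitlines()

# /db_xref="DATABASE:IDENTIFIER"
DB_XREF = re.compile(r'/db_xref="([^":]+):([^"]+)"')

def db_xrefs(lines):
    """Return {database: identifier} for every /db_xref qualifier found."""
    found = {}
    for line in lines:
        match = DB_XREF.search(line)
        if match:
            found[match.group(1)] = match.group(2)
    return found

xrefs = db_xrefs(ft_lines)
print(xrefs)
```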

   This approach  enables us  to point  precisely from  a given  SWISS-PROT
   entry to one of potentially many CDS in the corresponding EMBL entry and
   vice versa.  This change  also allows  the development of software tools
   that automatically retrieve the part of a nucleotide sequence entry that
   codes for a specific protein. This is especially useful in the context
   of the World-Wide Web, as it will render obsolete the current situation
   where, for example, one needs to retrieve the complete sequence of a
   yeast chromosome when one wants the nucleotide sequence coding for a
   specific protein encoded on that chromosome.

   An additional important principle of the PID system is that whenever a
   change to the nucleotide entry, or to the annotations of that entry,
   produces a modification in the translated protein sequence, the PID
   number corresponding to the modified CDS is replaced by a completely
   new number. The old number will be kept in a special field tagged to
   the CDS. The exact syntax of this field is under discussion at the
   international nucleotide databases.

   The  new   cross-referencing   system   will   allow   a   much   closer
   interconnection between  SWISS-PROT  and  the  international  nucleotide
   sequence databases.  For example, it will allow us to automatically take
   into account  sequence updates  made to  the nucleotide entry when these
   updates have an impact on the derived protein sequence(s).

   It should also be noted that the "PID" numbers in the context of GenBank
   replace the  "NCBI gi" numbering system which was present in the "/note"
   qualifier. The "gi" identifiers for the nucleic acid sequences have been
   replaced by "NID" (nucleic acid identifier) numbers.

   The 'STATUS_IDENTIFIER'  provides  information  about  the  relationship
   between the  sequence in  the  SWISS-PROT  entry  and  the  CDS  in  the
   corresponding EMBL entry.

   a) In  most cases  the translation  of the  EMBL nucleotide sequence CDS
   results in  the same  sequence as  shown in the corresponding SWISS-PROT
   entry or  the differences  are mentioned  in the SWISS-PROT feature (FT)
   lines as  CONFLICT, VARIANT  or VARSPLIC  and in  the RP lines. In these
   cases the status identifier shows a dash ("-").

   Example:

   DR   EMBL; Y00312; G63880; -.

   b) In some cases the translation of the EMBL nucleotide sequence CDS
   results in a sequence different from the sequence shown in the
   corresponding SWISS-PROT entry, and the differences are either not
   mentioned in the SWISS-PROT feature (FT) lines as CONFLICT, VARIANT or
   VARSPLIC and in the RP lines, or simply do not meet the criteria for
   such situations.

   1) If the  difference is  due to a different start of the sequence (e.g.
      SWISS-PROT believes  that the  start of  the sequence  is upstream or
      downstream of  the site annotated as the start of the sequence in the
      EMBL database),  the status  identifier shows the comment "ALT_INIT".
      Example:

        DR   EMBL; L29151; G466334; ALT_INIT.

   2) If the  difference is  due to a different termination of the sequence
      (e.g. SWISS-PROT  believes that  the termination  of the  sequence is
      upstream or  downstream of  the site  annotated as  the  end  of  the
      sequence in  the EMBL  database), the  status  identifier  shows  the
      comment "ALT_TERM". Example:

        DR   EMBL; L20562; G398099; ALT_TERM.

   3) If the  difference is  due to  frameshifts in  the EMBL sequence, the
      status identifier shows the comment "ALT_FRAME". Example:

        DR   EMBL; M95935; G146416; ALT_FRAME.

   4) If the difference is not due to the cases mentioned above (e.g. wrong
      intron-exon boundaries  given in  the EMBL  entry) or to a mixture of
      the cases  mentioned above,  the status  identifier shows the comment
      "ALT_SEQ". Example:

        DR   EMBL; X79206; G809602; ALT_SEQ.

   c) In some cases the nucleotide sequence of a complete CDS is divided
   into exons present in different EMBL entries. We point to the
   exon-containing EMBL entries by citing the PID as secondary identifier
   and adding the comment "JOINED" in the status identifier. These EMBL
   entries do not contain a CDS feature; they contain exons joined to a
   CDS feature which is labeled with the given PID.

   Example:

   DR   EMBL; M63397; G177196; -.
   DR   EMBL; M63395; G177196; JOINED.
   DR   EMBL; M63396; G177196; JOINED.

   In the  above example  the SWISS-PROT  sequence is  derived from the CDS
   labeled with  the PID G177196. This CDS feature can be found in the EMBL
   entry M63397.  Exons belonging  to this  CDS are  not only found in EMBL
   entry M63397, but also in the EMBL entries M63395 and M63396.
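
   The relationship described above can be recovered from the DR lines
   alone. The Python sketch below groups EMBL cross-references by PID and
   separates the entry carrying the CDS feature from the entries that only
   contribute exons. The function name and the layout of the result are
   choices made for this illustration.

```python
from collections import defaultdict

dr_lines = [
    "DR   EMBL; M63397; G177196; -.",
    "DR   EMBL; M63395; G177196; JOINED.",
    "DR   EMBL; M63396; G177196; JOINED.",
]

def group_by_pid(lines):
    """Map each PID to the entry holding the CDS and the exon-only entries."""
    groups = defaultdict(lambda: {"cds_entry": None, "exon_entries": []})
    for line in lines:
        # Drop the 'DR' line code and the final period, then split the fields.
        fields = [f.strip() for f in line[5:].strip().rstrip(".").split(";")]
        if len(fields) != 4 or fields[0] != "EMBL":
            continue  # not a new-style EMBL cross-reference
        _, accession, pid, status = fields
        if status == "JOINED":
            groups[pid]["exon_entries"].append(accession)
        else:
            groups[pid]["cds_entry"] = accession
    return dict(groups)

groups = group_by_pid(dr_lines)
print(groups)
```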

   d) In some cases there is no CDS feature key annotating a protein
   translation in an EMBL entry and thus no PID for that CDS. It is
   therefore not possible for us to point to a PID as a secondary
   identifier. In these cases we point to the relevant EMBL entries by
   placing a dash ("-") in the position of the missing PID and
   "NOT_ANNOTATED_CDS" in the status identifier.

   Example:

   DR   EMBL; J04126; -; NOT_ANNOTATED_CDS.
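
   To summarize cases a) to d), here is a minimal Python sketch of a parser
   for the new DR line format. It is not an official tool: the names
   EmblXref, STATUS_MEANING and parse_embl_dr are hypothetical, and the
   short descriptions in the table paraphrase the text above.

```python
from typing import NamedTuple, Optional

# Status identifiers described in cases a) to d) above.
STATUS_MEANING = {
    "-": "translation matches, or differences are documented in FT/RP lines",
    "ALT_INIT": "different start of the sequence",
    "ALT_TERM": "different termination of the sequence",
    "ALT_FRAME": "frameshift(s) in the EMBL sequence",
    "ALT_SEQ": "other difference, or a mixture of the above",
    "JOINED": "entry holds exons joined to a CDS in another entry",
    "NOT_ANNOTATED_CDS": "no CDS feature, hence no PID",
}

class EmblXref(NamedTuple):
    accession: str
    pid: Optional[str]  # None when the PID position holds a dash
    status: str

def parse_embl_dr(line: str) -> EmblXref:
    """Parse one 'DR   EMBL; accession; PID; status.' line."""
    if not line.startswith("DR   "):
        raise ValueError(f"not a DR line: {line!r}")
    fields = [f.strip() for f in line[5:].strip().rstrip(".").split(";")]
    if len(fields) != 4 or fields[0] != "EMBL":
        raise ValueError(f"not a new-style EMBL cross-reference: {line!r}")
    _, accession, pid, status = fields
    if status not in STATUS_MEANING:
        raise ValueError(f"unknown status identifier: {status!r}")
    return EmblXref(accession, None if pid == "-" else pid, status)

for example in (
    "DR   EMBL; Y00312; G63880; -.",
    "DR   EMBL; L29151; G466334; ALT_INIT.",
    "DR   EMBL; J04126; -; NOT_ANNOTATED_CDS.",
):
    xref = parse_embl_dr(example)
    print(xref.accession, xref.pid, STATUS_MEANING[xref.status])
```

   Rejecting unknown status identifiers, rather than passing them through,
   is a design choice of this sketch; a more tolerant reader may be
   preferable if the list of identifiers grows in later releases.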



   2.4  New cross-references

   We have added cross-references from SWISS-PROT to the Harefield Hospital
   2D gel protein databases prepared under the supervision of Mike Dunn
   (see Corbett J.M., Wheeler C.H., Baker C.S., Yacoub M.H. and Dunn M.J.;
   Electrophoresis 15:1459-1465(1994)). These cross-references are present
   in the DR lines:

   Data bank identifier: HSC-2DPAGE
   Primary identifier:   The protein spot unique identifier [1]
   Secondary identifier: The species of origin [2]
   Example:              HSC-2DPAGE; P47985; HUMAN.

   [1] The Harefield 2D databases use SWISS-PROT primary accession numbers
       as the alphanumeric designation of spots that are linked to
       SWISS-PROT entries.
   [2] Currently only 'HUMAN' is used, but 'RAT' and 'DOG' will be added
       in the next release.


   2.5  Introduction of a new CC line-type topic (MASS SPECTROMETRY)

   We have  introduced a  new 'topic' for the comments (CC) line-type: MASS
   SPECTROMETRY. This topic is used to report the exact molecular weight of
   a protein  or part  of a  protein as  determined by  mass  spectrometric
   methods. The syntax of this new topic is:

   CC   -!- MASS SPECTROMETRY: MW=XXX[; MW_ERR=XX]; METHOD=XX[;
            RANGE=XX-XX].

   Where

   -  "MW=XXX" is the determined molecular weight (MW);
   -  "MW_ERR=XX" (optional) is the accuracy or error range of the MW
      measurement;
   -  "METHOD=XX" is the mass spectrometric method;
   -  "RANGE=XX-XX" (optional) is used to indicate what part of the protein
      sequence entry corresponds to the molecular weight. If this qualifier
      is not  present, the  MW value  corresponds to the full length of the
      protein sequence.

   Examples of its usage:

   CC   -!- MASS SPECTROMETRY: MW=13423.3; METHOD=ELECTROSPRAY.
   CC   -!- MASS SPECTROMETRY: MW=71890; MW_ERR=7; METHOD=ELECTROSPRAY.
   CC   -!- MASS SPECTROMETRY: MW=8597.5; METHOD=ELECTROSPRAY;
   CC       RANGE=40-119.
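
   A minimal Python sketch of how this topic could be parsed, including a
   comment that continues on a second CC line as in the last example. The
   function name parse_mass_spec and the keys of the returned dictionary
   are assumptions made for this illustration, and the syntax may evolve
   as noted below.

```python
PREFIX = "-!- MASS SPECTROMETRY:"

def parse_mass_spec(cc_lines):
    """Parse one MASS SPECTROMETRY comment (one or more CC lines)."""
    # Strip the 'CC' line code and join continuation lines into one string.
    text = " ".join(line[2:].strip() for line in cc_lines)
    if not text.startswith(PREFIX):
        raise ValueError("not a MASS SPECTROMETRY comment")
    # Remove only the final period; decimal points inside values are kept.
    body = text[len(PREFIX):].strip().rstrip(".")
    fields = dict(
        item.strip().split("=", 1) for item in body.split(";") if item.strip()
    )
    result = {"mw": float(fields["MW"]), "method": fields["METHOD"]}
    if "MW_ERR" in fields:  # optional qualifier
        result["mw_err"] = float(fields["MW_ERR"])
    if "RANGE" in fields:   # optional qualifier; absent means full length
        start, end = fields["RANGE"].split("-")
        result["range"] = (int(start), int(end))
    return result

print(parse_mass_spec([
    "CC   -!- MASS SPECTROMETRY: MW=8597.5; METHOD=ELECTROSPRAY;",
    "CC       RANGE=40-119.",
]))
```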

   It should  be noted  that the  syntax of this topic may evolve in future
   releases as  we  expect  feedback  from  groups  using  MS  for  protein
   identification on  2D gels,  MW determination  and  characterization  of
   post-translational modifications.



   2.6  Change in the syntax of the SQ line

   The SQ  (SeQuence header)  line marks the beginning of the sequence data
   and gives a quick summary of its content. The format of the SQ line used
   to be:

   SQ   SEQUENCE  XXXX AA; XXXXX MW;  XXXXX CN;

   The line contains the length of the sequence in amino acids (AA)
   followed by the molecular weight (MW) rounded to the nearest mass unit
   (Dalton) and a checking number (CN) as shown in the example:

   SQ   SEQUENCE 104 AA; 11530 MW; 54319 CN;

   Starting with this release, we have replaced the checking number (CN) by
   a 32-bit CRC (Cyclic Redundancy Check) value. The new syntax is:

   SQ   SEQUENCE  XXXX AA; XXXXX MW;  XXXXXXXX CRC32;

   Example:

   SQ   SEQUENCE   104 AA;  11530 MW;  7A70363C CRC32;
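
   Software that reads SQ lines may need to accept both forms during the
   transition. The Python sketch below does so with one regular
   expression. It assumes that the CRC32 value is written in hexadecimal
   (as the example suggests) and that the old checking number is decimal;
   it does not recompute the CRC, since the algorithm is not described in
   these notes. The name parse_sq is hypothetical.

```python
import re

# Matches the old form (checking number, CN) and the new form (CRC32).
SQ_LINE = re.compile(
    r"^SQ\s+SEQUENCE\s+(?P<length>\d+)\s+AA;\s+(?P<mw>\d+)\s+MW;\s+"
    r"(?P<check>[0-9A-Fa-f]+)\s+(?P<kind>CN|CRC32);\s*$"
)

def parse_sq(line):
    """Return length, molecular weight and check value of an SQ line."""
    match = SQ_LINE.match(line)
    if match is None:
        raise ValueError(f"not a recognized SQ line: {line!r}")
    kind = match["kind"]
    check = match["check"]
    return {
        "length": int(match["length"]),
        "mw": int(match["mw"]),
        "check_kind": kind,
        # Assumption: CRC32 is hexadecimal, the old CN is decimal.
        "check_value": int(check, 16) if kind == "CRC32" else int(check),
    }

print(parse_sq("SQ   SEQUENCE   104 AA;  11530 MW;  7A70363C CRC32;"))
print(parse_sq("SQ   SEQUENCE 104 AA; 11530 MW; 54319 CN;"))
```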



   2.7  Status of the documentation files

   SWISS-PROT is  distributed with  a large  number of documentation files.
   Some of  these files  have been  available for  a long  time  (the  user
   manual, release  notes, the  various  indices  for  authors,  citations,
   keywords, etc.),  but  many  have  been  created  recently  and  we  are
   continuously adding  new files.  Since release  32, we  have added 2 new
   document files. The following table lists all the documents that are
   either currently available or that we plan to add in the next few
   months.

   USERMAN .TXT   User manual
   RELNOTES.TXT   Release notes
   SHORTDES.TXT   Short description of entries in SWISS-PROT

   JOURLIST.TXT   List of abbreviations for journals cited
   KEYWLIST.TXT   List of keywords in use
   SPECLIST.TXT   List of organism identification codes
   EXPERTS .TXT   List of on-line experts for PROSITE and SWISS-PROT
   SUBMIT  .TXT   Submission of sequence data to the SWISS-PROT data bank

   ACINDEX .TXT   Accession number index
   AUTINDEX.TXT   Author index
   CITINDEX.TXT   Citation index
   KEYINDEX.TXT   Keyword index
   SPEINDEX.TXT   Species index
   7TMRLIST.TXT   List of 7-transmembrane G-linked receptors entries
   AATRNASY.TXT   List of aminoacyl-tRNA synthetases
   ALLERGEN.TXT   Nomenclature and index of allergen sequences
   CALBICAN.TXT   Index of Candida albicans entries and their corresponding
                  gene designations
   CDLIST  .TXT   CD nomenclature for surface proteins of human leucocytes
   CELEGANS.TXT   Index of Caenorhabditis elegans entries and their
                  corresponding gene designations and WormPep
                  cross-references
   DICTY   .TXT   Index of Dictyostelium discoideum entries and their
                  corresponding gene designations and DictyDB
                  cross-references
   EC2DTOSP.TXT   Index of  Escherichia coli  Gene-protein database entries
                  referenced in SWISS-PROT
   ECOLI   .TXT   Index of  Escherichia coli  K12 chromosomal  entries  and
                  their corresponding EcoGene cross-reference
   EMBLTOSP.TXT   Index of  EMBL Database  entries referenced in SWISS-PROT
                  [3]
   EXTRADOM.TXT   Nomenclature of extracellular domains
   GLYCOSYL.TXT   Classification of  glycosyl hydrolases families and index
                  of glycosyl hydrolase entries [1]
   HAEINFLU.TXT   Index of Haemophilus influenzae RD chromosomal entries
   HOXLIST .TXT   Vertebrate homeotic Hox proteins: nomenclature and index
   HUMCHR21.TXT   Index  of  protein  sequence  entries  encoded  on  human
                  chromosome 21
   HUMCHR22.TXT   Index  of  protein  sequence  entries  encoded  on  human
                  chromosome 22
   HUMCHRY .TXT   Index  of  protein  sequence  entries  encoded  on  human
                  chromosome Y
   MIMTOSP .TXT   Index of MIM entries referenced in SWISS-PROT
   MYGENIT .TXT   Index of Mycoplasma genitalium chromosomal entries [2]
   NOMLIST .TXT   List of nomenclature related references for proteins
   PDBTOSP .TXT   Index of Brookhaven PDB entries referenced in SWISS-PROT
   PEPTIDAS.TXT   Classification  of   peptidase  families   and  index  of
                  peptidases entries
   PLASTID .TXT   List of chloroplast and cyanelle encoded proteins
   POMBE   .TXT   Index of  Schizosaccharomyces pombe entries in SWISS-PROT
                  and their corresponding gene designations
   RESTRIC .TXT   List of restriction enzymes and methylases entries
   RIBOSOMP.TXT   Index of ribosomal proteins classified by families on the
                  basis of sequence similarities [2]
   SALTY   .TXT   Index of  Salmonella typhimurium  LT2 chromosomal entries
                  and their corresponding StyGene cross-references
   SUBTILIS.TXT   Index of  Bacillus subtilis  168 chromosomal  entries and
                  their corresponding SubtiList cross-references
   YEAST   .TXT   Index  of  Saccharomyces  cerevisiae  entries  and  their
                  corresponding gene designations
   YEAST1  .TXT   Yeast Chromosome I entries
   YEAST2  .TXT   Yeast Chromosome II entries
   YEAST3  .TXT   Yeast Chromosome III entries
   YEAST5  .TXT   Yeast Chromosome V entries
   YEAST6  .TXT   Yeast Chromosome VI entries
   YEAST8  .TXT   Yeast Chromosome VIII entries
   YEAST9  .TXT   Yeast Chromosome IX entries
   YEAST10 .TXT   Yeast Chromosome X entries [1]
   YEAST11 .TXT   Yeast Chromosome XI entries
   YEAST13 .TXT   Yeast Chromosome XIII entries [2]

   Notes:

   [1]  New in release 33.
   [2]  Will be available starting with release 34 of October 1996.
   [3]  The format of that file was completely changed to take into account
        the new format of cross-references to  EMBL that includes the "PID"
        (see section 2.3).


   We have continued to include, in some SWISS-PROT document files,
   references to World-Wide Web sites relevant to the subject under
   consideration. There are now 11 documents that include such links.



   2.8  The ExPASy World-Wide Web server


        2.8.1  Background information

   The most efficient and user-friendly way to browse interactively in
   SWISS-PROT, PROSITE, ENZYME, SWISS-2DPAGE and other databases is to use
   the World-Wide Web (WWW) molecular biology server ExPASy. WWW is a
   global information  retrieval system  merging the  power  of  world-wide
   networks, hypertext  and multimedia.  Through hypertext  links, it gives
   access to  documents and  information available  on thousands of servers
   around the  world. To  access a  WWW server  one needs  a  WWW  browser.
   Currently, the  most popular  browser  is  Netscape  Navigator(TM)  from
   Netscape Communications Corp. (available from ftp.netscape.com). Using a
   WWW browser, one has access to all the hypertext documents stored on the
   ExPASy server as well as many other WWW servers.

   The ExPASy server was made available to the public in September 1993. In
   February 1996 a cumulative total of 4 million connections was attained.
   It may  be accessed  through its  Uniform Resource  Locator (URL  -  the
   addressing system defined in WWW), which is:

        http://expasy.hcuge.ch/

   The ExPASy  WWW server  allows access, using the user-friendly hypertext
   model, to  the SWISS-PROT,  PROSITE,  ENZYME,  SWISS-2DPAGE  and  SWISS-
   3DIMAGE databases and, through any SWISS-PROT protein sequence entry, to
   other databases  such as  EMBL, EcoCyc,  FlyBase, GCRDb, LISTA, MaizeDB,
   SubtiList, OMIM, PDB, HSSP, ProDom, REBASE, SGD, YEPD and Medline. Using
   a browser  which is  able to display images one can also remotely access
   2D gel image data from SWISS-2DPAGE. ExPASy also offers many tools for
   the analysis of protein sequences and 2D gels.

   For more  information on  the  ExPASy  WWW  server,  you  can  read  the
   following article:

      Appel R.D., Bairoch A., Hochstrasser D.F.
      A new  generation of  information retrieval tools for biologists: the
      example of the ExPASy WWW server.
      Trends Biochem. Sci. 19:258-260(1994).

   Or you can contact Dr. Ron Appel:

      Email: ron.appel@dim.hcuge.ch
      Fax: +41-22-372 61 98



        2.8.2  SWISS-SHOP

   Thanks to the work of Manuel Peitsch from the Geneva Glaxo Institute for
   Molecular Biology, we can provide, on ExPASy, a service called
   SWISS-SHOP. SWISS-SHOP allows any user of SWISS-PROT to indicate which
   proteins he/she is interested in. This can be done using various
   criteria that can be combined:

   -  By entering  one  or  more  words  that  should  be  present  in  the
      description line;
   -  By entering one or more species name(s) or taxonomic division(s);
   -  By entering one or more keywords;
   -  By entering one or more author names;
   -  By entering the accession number (or entry name) of a PROSITE pattern
      or a user-defined sequence pattern;
   -  By entering  the accession  number (or  entry name)  of  an  existing
      SWISS-PROT entry or by entering a "private" sequence.

   Every week,  the new  sequences entered  in SWISS-PROT are automatically
   compared with all the criteria that have been defined by the users. If a
   sequence corresponds  to the  selection criteria defined by a user, that
   sequence is sent by electronic mail.
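
   The weekly comparison described above can be pictured as a filter over
   the newly added entries. The Python sketch below is only an
   illustration: the Entry and Criteria classes, their field names, and
   the rule that all criteria filled in by a user must hold at once are
   assumptions made here, not a description of how SWISS-SHOP is actually
   implemented. Sequence-pattern and sequence-similarity criteria are left
   out.

```python
from dataclasses import dataclass, field


@dataclass
class Entry:
    """Hypothetical, simplified view of a SWISS-PROT entry."""
    name: str
    description: str
    species: str
    keywords: set = field(default_factory=set)
    authors: set = field(default_factory=set)


@dataclass
class Criteria:
    """Hypothetical user request; criteria left empty are ignored."""
    email: str
    description_words: set = field(default_factory=set)
    species: set = field(default_factory=set)
    keywords: set = field(default_factory=set)
    authors: set = field(default_factory=set)


def matches(entry, crit):
    """Return True if the entry satisfies every criterion the user filled in.

    Assumption: criteria are combined with a logical AND, and within one
    criterion any one of the listed values is sufficient.
    """
    words = set(entry.description.lower().split())
    wanted = {w.lower() for w in crit.description_words}
    if wanted and not (wanted & words):
        return False
    if crit.species and entry.species not in crit.species:
        return False
    if crit.keywords and not (crit.keywords & entry.keywords):
        return False
    if crit.authors and not (crit.authors & entry.authors):
        return False
    return True


def weekly_run(new_entries, requests):
    """Map each user's e-mail address to the names of the entries to send."""
    outbox = {}
    for crit in requests:
        hits = [e.name for e in new_entries if matches(e, crit)]
        if hits:
            outbox[crit.email] = hits
    return outbox
```

   Keeping each request as plain data means the same weekly_run call can
   be repeated unchanged for every batch of new entries.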


        2.8.3  What is new on ExPASy

   Since  the   last  release,  there  has  been  a  large  number  of  new
   developments on the ExPASy WWW server. Here are some highlights of these
   changes:

   -  ProtScale is a new tool that computes and displays the profile
      produced by an amino acid scale on a selected protein, either taken
      from SWISS-PROT or entered by the user. 50 scales are provided,
      including 'classics' such as the Kyte and Doolittle hydrophobicity
      scale.

   -  We have added a new tool, SIM, which computes a user-defined number
      of best non-intersecting alignments between two sequences. The
      results of the alignment can be viewed graphically using the LALNVIEW
      program developed by Laurent Duret (duret@dim.hcuge.ch), which can be
      downloaded directly from ExPASy and is available for PCs under
      MS-Windows, Macs and UNIX.

   -  We have recently started to create a list of biomolecular servers for
      our own use. This list is available from the ExPASy top page or
      directly from:

                http://expasy.hcuge.ch/www/amos_www_links.html

   -  WWW links  have been  implemented between some SWISS-PROT entries and
      HSC-2DPAGE (see section 2.4).

   -  Many other changes have been made to all parts of the server.
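
   As an illustration of the first item in the list above, the kind of
   profile produced by an amino acid scale can be sketched in Python as a
   sliding-window average. The hydropathy values are those published by
   Kyte and Doolittle; the function name, the default window of 9 residues
   and the use of an unweighted mean are assumptions of this sketch and
   are not taken from ProtScale itself.

```python
# Kyte and Doolittle hydropathy values, one per standard amino acid.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def scale_profile(sequence, scale=KYTE_DOOLITTLE, window=9):
    """Return (centre_position, mean_score) pairs, one per full window.

    Positions are 1-based. The window size and the unweighted mean are
    assumptions of this sketch, not documented ProtScale behaviour.
    """
    if window < 1 or window % 2 == 0:
        raise ValueError("window must be a positive odd number")
    scores = [scale[residue] for residue in sequence.upper()]
    half = window // 2
    profile = []
    for start in range(len(scores) - window + 1):
        mean = sum(scores[start:start + window]) / window
        profile.append((start + half + 1, round(mean, 3)))
    return profile
```

   A sequence shorter than the window yields an empty profile, and any
   other scale can be passed in as a dictionary keyed by one-letter code.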



   2.9  Weekly updates of SWISS-PROT

   Weekly updates of SWISS-PROT are available by anonymous FTP. Three files
   are updated at each update:

   new_seq.dat    Contains all the new entries since the last full release;
   upd_seq.dat    Contains the entries for which the sequence data has been
                  updated since the last release;
   upd_ann.dat    Contains the  entries for  which one  or more  annotation
                  fields have been updated since the last release.
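
   One way to use these three files is to overlay them on a local copy of
   the last full release. The Python sketch below relies on two features
   of the flat-file format: every entry ends with a '//' line, and the AC
   line carries the accession numbers. The function names and the simple
   rule "replace the entry with the same primary accession number, or add
   it if absent" are assumptions made for illustration.

```python
def split_entries(text):
    """Split flat-file text into entries; each entry ends with a '//' line.

    Trailing lines that are not closed by '//' are discarded.
    """
    entries, current = [], []
    for line in text.splitlines():
        current.append(line)
        if line.strip() == "//":
            entries.append("\n".join(current))
            current = []
    return entries


def primary_accession(entry):
    """Return the first accession number on the entry's first AC line."""
    for line in entry.splitlines():
        if line.startswith("AC   "):
            return line[5:].split(";")[0].strip()
    raise ValueError("entry has no AC line")


def apply_updates(release, *update_texts):
    """Return a copy of `release` (accession -> entry) with updates applied.

    Assumption: an entry found in an update file replaces any entry with
    the same primary accession number, or is added if none exists.
    """
    merged = dict(release)
    for text in update_texts:
        for entry in split_entries(text):
            merged[primary_accession(entry)] = entry
    return merged
```

   Because each file is cumulative since the last full release, the
   overlay can always start again from the unmodified release copy.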

   Currently these  files are  available on  the  following  anonymous  ftp
   servers:

   Organization   ExPASy (Geneva University Expert Protein Analysis System)
   Address        expasy.hcuge.ch  (or 129.195.254.61)
   Directory      /databases/swiss-prot/updates

   Organization   National Center for Biotechnology Information (NCBI)
   Address        ncbi.nlm.nih.gov (or 130.14.20.1)
   Directory      /repository/swiss-prot/updates

   Organization   European Bioinformatics Institute (EBI)
   Address        ftp.ebi.ac.uk (or 193.62.196.6)
   Directory      /pub/databases/swissprot/new

   Organization   Bioinformatics Unit, Weizmann Institute of Science (WIS)
   Address        bioinformatics.weizmann.ac.il (or 132.76.55.12)
   Directory      /pub/databases/swiss-prot/updates


   !!! Important notes !!!

   Although we try to follow a regular schedule, we do not promise to
   update these files every week. In some cases two weeks will elapse
   between two updates.

   Because of the mechanism currently used to build a release, the entries
   provided in these updates are not guaranteed to be error-free.



                      3.0  IMPORTANT FORTHCOMING CHANGE

   3.1  TREMBL - a supplement to SWISS-PROT

   The ongoing  genome sequencing  and mapping  projects have  dramatically
   increased the number of protein sequences to be incorporated into SWISS-
   PROT. Since we do not want to dilute the quality standards of SWISS-PROT
   by incorporating  sequences  into  SWISS-PROT  without  proper  sequence
   analysis and  annotation, we  cannot speed  up the  incorporation of new
   incoming data  indefinitely. But  as we  also want to make the sequences
   available as fast as possible, we will introduce with SWISS-PROT a
   computer-annotated supplement to SWISS-PROT. This supplement consists of
   entries in  SWISS-PROT-like format  derived from  the translation of all
   coding sequences  (CDS) in the EMBL nucleotide sequence database, except
   the CDS already included in SWISS-PROT.

   We name  this supplement  TREMBL  (TRanslation  from  EMBL),  since  the
   translation tools  used to  create the translations of the CDS are based
   on the  program  'trembl'  written  by  Thure  Etzold  at  the  EMBL  in
   Heidelberg.

   We will translate all CDSs in the EMBL Nucleotide Sequence Database
   into TREMBL preentries. The preentries already present as sequence
   reports in SWISS-PROT will be excluded from TREMBL. The remaining
   entries will then be merged automatically whenever possible to reduce
   redundancy in TREMBL.

   We will split TREMBL into two main sections, SP-TREMBL and REM-TREMBL:

   SP-TREMBL (SWISS-PROT TREMBL) will contain the entries which should be
   incorporated into SWISS-PROT. SP-TREMBL will be partially redundant
   with SWISS-PROT, since approximately half of the SP-TREMBL entries
   will only be additional sequence reports of proteins already in
   SWISS-PROT. We will try to merge these sequence reports as fast as
   possible with the existing SWISS-PROT entries for these proteins, so as
   to make SWISS-PROT and TREMBL completely nonredundant.

   REM-TREMBL (REMaining  TREMBL) will  contain the  entries that we do not
   want to  include in  SWISS-PROT. This  section will be organized in four
   subsections:

   1) Most REM-TREMBL entries will be immunoglobulins and T-cell receptors.
      We stopped entering immunoglobulins and T-cell receptors into
      SWISS-PROT, because we only want to keep the germ line gene derived
      translations of these proteins in SWISS-PROT and not all known
      somatically recombined variants of these proteins. We expect more
      than 10'000 immunoglobulins and T-cell receptors in TREMBL. We
      would like to create a specialized database dealing with these
      sequences as a further supplement to SWISS-PROT and keep only a
      representative cross-section of these proteins in SWISS-PROT.

   2) Another category of data which will not be included in SWISS-PROT are
      synthetic sequences.  Again, we do not want to leave these entries in
      TREMBL.  Ideally   one  should   build  a  specialized  database  for
      artificial sequences as a further supplement to SWISS-PROT.

   3) A third  subsection consists  of fragments with less than seven amino
      acids.

   4) The last subsection consists of CDS translations where we have strong
      evidence to believe that these CDS are not coding for real proteins.
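
   The partition described above can be summarized as a small decision
   procedure. In the Python sketch below, the Translation record, its flag
   names and the order in which the tests are applied are all assumptions
   made for illustration; these notes do not specify how each case is
   actually decided.

```python
from dataclasses import dataclass

MIN_LENGTH = 7  # translations with fewer than seven amino acids go to subsection 3


@dataclass
class Translation:
    """Hypothetical record for one translated CDS; every field is an assumption."""
    sequence: str
    already_in_swissprot: bool = False
    is_ig_or_tcr: bool = False
    is_synthetic: bool = False
    doubtful_coding: bool = False


def classify(t):
    """Return (section, subsection), or None if the CDS is excluded from TREMBL."""
    if t.already_in_swissprot:
        return None                    # already a sequence report in SWISS-PROT
    if t.is_ig_or_tcr:
        return ("REM-TREMBL", 1)       # immunoglobulins and T-cell receptors
    if t.is_synthetic:
        return ("REM-TREMBL", 2)       # synthetic sequences
    if len(t.sequence) < MIN_LENGTH:
        return ("REM-TREMBL", 3)       # very short fragments
    if t.doubtful_coding:
        return ("REM-TREMBL", 4)       # CDS probably not coding for a real protein
    return ("SP-TREMBL", None)         # candidate for incorporation into SWISS-PROT
```

   Everything that is neither excluded nor assigned to one of the four
   REM-TREMBL subsections falls through to SP-TREMBL.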

   The first  full release of TREMBL will be distributed with release 34 of
   SWISS-PROT. However, we are making available, with release 33, a beta
   release so that users and software developers can send us feedback about
   this new supplement to SWISS-PROT.



                            4. ENZYME AND PROSITE


   4.1  The ENZYME data bank

   Release 20.0  of the  ENZYME data bank is distributed with release 33 of
   SWISS-PROT. ENZYME  release 20.0  contains information  relative to 3601
   enzymes.


   4.2  The PROSITE data bank

   Release 13.1  of the PROSITE data bank is distributed with release 33 of
   SWISS-PROT. This  release of  PROSITE contains 889 documentation entries
   that describe  1'167 different  patterns, rules  and  profiles/matrices.
   Release 13.1  does not  really represent a new release; the only changes
   between releases 13.0 and 13.1 are updates of the pointers to the
   SWISS-PROT entries whose names have been modified between releases 32
   and 33. The next release of PROSITE (14.0) will be distributed with
   release 35 of SWISS-PROT.


                             WE NEED YOUR HELP !

   We welcome feedback from our users. We would especially appreciate being
   notified if you find that sequences belonging to your field of
   expertise are missing from the data bank. We would also like to be
   notified about annotations to be updated, if, for example, the function
   of a protein has been clarified or if new post-translational information
   has become available.


   ========================================================================


                         APPENDIX A: SOME STATISTICS



   A.1  Amino acid composition

        A.1.1  Composition in percent for the complete data bank

   Ala (A) 7.54   Gln (Q) 4.02   Leu (L) 9.31   Ser (S) 7.19
   Arg (R) 5.15   Glu (E) 6.31   Lys (K) 5.94   Thr (T) 5.76
   Asn (N) 4.54   Gly (G) 6.86   Met (M) 2.36   Trp (W) 1.26
   Asp (D) 5.29   His (H) 2.23   Phe (F) 4.06   Tyr (Y) 3.21
   Cys (C) 1.70   Ile (I) 5.72   Pro (P) 4.91   Val (V) 6.52

   Asx (B) 0.001  Glx (Z) 0.001  Xaa (X) 0.02


        A.1.2  Classification of the amino acids by their frequency

   Leu, Ala, Ser, Gly, Val, Glu, Lys, Thr, Ile, Asp, Arg, Pro, Asn, Phe,
   Gln, Tyr, Met, His, Cys, Trp
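
   The ranking in A.1.2 follows directly from the percentages in A.1.1.
   The Python sketch below is an illustration only: the function names are
   chosen here, and composition_percent assumes that sequences are
   supplied as plain strings of one-letter codes.

```python
from collections import Counter


def composition_percent(sequences):
    """Return {residue: percent of all residues}, rounded to two decimals."""
    counts = Counter()
    for seq in sequences:
        counts.update(seq.upper())
    total = sum(counts.values())
    if total == 0:
        return {}
    return {aa: round(100.0 * n / total, 2) for aa, n in counts.items()}


def rank_by_frequency(percentages):
    """Return the residues ordered from most to least frequent."""
    return sorted(percentages, key=percentages.get, reverse=True)


# Percentages copied from table A.1.1 (the 20 standard amino acids only).
TABLE_A11 = {
    "Ala": 7.54, "Gln": 4.02, "Leu": 9.31, "Ser": 7.19,
    "Arg": 5.15, "Glu": 6.31, "Lys": 5.94, "Thr": 5.76,
    "Asn": 4.54, "Gly": 6.86, "Met": 2.36, "Trp": 1.26,
    "Asp": 5.29, "His": 2.23, "Phe": 4.06, "Tyr": 3.21,
    "Cys": 1.70, "Ile": 5.72, "Pro": 4.91, "Val": 6.52,
}

if __name__ == "__main__":
    # Prints the residues in the same order as the list in A.1.2.
    print(", ".join(rank_by_frequency(TABLE_A11)))
```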



   A.2  Distribution of the sequences by their organism of origin


   Total number of species represented in this release of SWISS-PROT: 5020


   A.2.1 Table of the frequency of occurrence of species

        Species represented 1x: 2250
                            2x:  808
                            3x:  446
                            4x:  285
                            5x:  209
                            6x:  189
                            7x:  129
                            8x:   96
                            9x:  105
                           10x:   44
                       11- 20x:  204
                       21- 50x:  154
                       51-100x:   42
                         >100x:   59



   A.2.2  Table of the most represented species

    Number   Frequency          Species
         1        3653          Baker's yeast (Saccharomyces cerevisiae)
         2        3475          Human
         3        3471          Escherichia coli
         4        2137          Mouse
         5        1866          Rat
         6        1577          Haemophilus influenzae
         7        1389          Bacillus subtilis
         8        1006          Caenorhabditis elegans
         9         833          Bovine
        10         818          Fruit fly (Drosophila melanogaster)
        11         642          Chicken
        12         640          Fission yeast (Schizosaccharomyces pombe)
        13         603          Salmonella typhimurium
        14         508          African clawed frog (Xenopus laevis)
        15         500          Arabidopsis thaliana (Mouse-ear cress)
        16         469          Rabbit
        17         397          Pig
        18         344          Mycoplasma genitalium
        19         326          Maize
        20         275          Bacteriophage T4
        21         256          Rice
        22         253          Vaccinia virus (strain Copenhagen)
        23         240          Pseudomonas aeruginosa
        24         214          Slime mold (Dictyostelium discoideum)
        25         213          Tobacco
        26         203          Pea
        27         193          Human cytomegalovirus (strain AD169)
        28         187          Wheat
        29         184          Vaccinia virus (strain WR)
        30         176          Soybean
        31         175          Barley
        32         171          Staphylococcus aureus
                   171          Dog
        34         165          Pseudomonas putida
                   165          Neurospora crassa
        36         159          Sheep
        37         158          Rhodobacter capsulatus
        38         154          Autographa californica nuclear polyhedrosis virus
        39         150          Marchantia polymorpha (Liverwort)
                   150          Klebsiella pneumoniae
        41         146          Variola virus
                   146          Bacillus stearothermophilus
        43         142          Spinach
                   142          Cyanophora paradoxa
        45         141          Potato
        46         139          Tomato
        47         130          Rhizobium meliloti
        48         123          Odontella sinensis
        49         122          Mycobacterium leprae
        50         119          Lactococcus lactis (subsp. lactis)
        51         117          Agrobacterium tumefaciens
        52         112          Synechocystis sp. (strain PCC 6803)
        53         108          Chlamydomonas reinhardtii
        54         106          Candida albicans
        55         105          Guinea pig
        56         104          Streptomyces coelicolor
                   104          Horse
        58         101          Trypanosoma brucei brucei
                   101          Aspergillus nidulans




   A.3  Distribution of the sequences by size

               From   To  Number             From   To   Number
                  1-  50    2706             1001-1100      471
                 51- 100    4851             1101-1200      340
                101- 150    6660             1201-1300      258
                151- 200    5047             1301-1400      169
                201- 250    4552             1401-1500      146
                251- 300    4075             1501-1600       88
                301- 350    3857             1601-1700       68
                351- 400    3897             1701-1800       63
                401- 450    2963             1801-1900       69
                451- 500    2974             1901-2000       41
                501- 550    2141             2001-2100       24
                551- 600    1521             2101-2200       53
                601- 650    1120             2201-2300       56
                651- 700     824             2301-2400       24
                701- 750     761             2401-2500       31
                751- 800     607             >2500          156
                801- 850     477
                851- 900     481
                901- 950     345
                951-1000     289
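
   The size classes used in this table are 50 residues wide up to 1000,
   100 residues wide up to 2500, and open-ended beyond that. A minimal
   Python sketch of the binning follows; the function names are chosen
   here for illustration and the input is assumed to be a plain list of
   sequence lengths.

```python
from collections import Counter


def size_class(length):
    """Return the label of the size class that contains `length`."""
    if length < 1:
        raise ValueError("sequence length must be at least 1")
    if length > 2500:
        return ">2500"
    width = 50 if length <= 1000 else 100
    low = ((length - 1) // width) * width + 1
    return f"{low}-{low + width - 1}"


def size_histogram(lengths):
    """Count how many sequence lengths fall into each size class."""
    return Counter(size_class(n) for n in lengths)
```

   For instance, size_class(104) gives "101-150", and every sequence
   listed in A.4 falls into the ">2500" class.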



   A.4  Longest sequences

   The longest sequences (>=4000 residues) are listed here:

                               HTS1_COCCA  5217
                               FAT_DROME   5147
                               RYNR_RABIT  5037
                               RYNR_PIG    5035
                               RYNR_HUMAN  5032
                               RYNC_RABIT  4969
                               DYHC_DICDI  4725
                               DYHC_RAT    4644
                               DYHC_DROME  4639
                               APB_HUMAN   4563
                               APOA_HUMAN  4548
                               RRPA_CVMJH  4488
                               DYHC_ANTCR  4466
                               DYHC_TRIGR  4466
                               GRSB_BACBR  4451
                               PKSK_BACSU  4447
                               PKSL_BACSU  4427
                               YP73_CAEEL  4385
                               DYHC_NEUCR  4367
                               DYHC_EMENI  4344
                               PLEC_RAT    4140
                               DYHC_YEAST  4092
                               RRPA_CVH22  4085



   A.5  Statistics for journal citations


   Total number of journals cited in this release of SWISS-PROT: 710


   A.5.1 Table of the frequency of journal citations

        Journals cited 1x: 275
                       2x:  99
                       3x:  43
                       4x:  28
                       5x:  28
                       6x:  14
                       7x:  10
                       8x:  13
                       9x:  13
                      10x:  10
                  11- 20x:  54
                  21- 50x:  45
                  51-100x:  21
                    >100x:  57


   A.5.2  List of the most cited journals in SWISS-PROT

   Citations          Journal abbreviation
   ---------          ----------------------------------
   5010               J. BIOL. CHEM.
   3191               NUCLEIC ACIDS RES.
   3152               PROC. NATL. ACAD. SCI. U.S.A.
   2136               J. BACTERIOL.
   1828               GENE
   1706               FEBS LETT.
   1584               EUR. J. BIOCHEM.
   1436               EMBO J.
   1392               BIOCHEM. BIOPHYS. RES. COMMUN.
   1359               NATURE
   1300               BIOCHEMISTRY
   1092               BIOCHIM. BIOPHYS. ACTA
   1023               J. MOL. BIOL.
    996               CELL
    956               MOL. CELL. BIOL.
    811               MOL. GEN. GENET.
    756               PLANT MOL. BIOL.
    713               VIROLOGY
    708               BIOCHEM. J.
    636               SCIENCE
    585               MOL. MICROBIOL.
    575               J. BIOCHEM.
    458               J. VIROL.
    407               J. GEN. VIROL.
    367               GENOMICS
    335               J. CELL BIOL.
    299               GENES DEV.
    291               PLANT PHYSIOL.
    286               YEAST
    266               CURR. GENET.
    255               J. IMMUNOL.
    255               BIOL. CHEM. HOPPE-SEYLER
    240               ARCH. BIOCHEM. BIOPHYS.
    233               INFECT. IMMUN.
    221               MOL. BIOCHEM. PARASITOL.
    213               HOPPE-SEYLER'S Z. PHYSIOL. CHEM.
    204               HUM. MOL. GENET.
    202               J. GEN. MICROBIOL.
    193               MOL. ENDOCRINOL.
    182               ONCOGENE
    177               J. CLIN. INVEST.
    169               FEMS MICROBIOL. LETT.
    167               AM. J. HUM. GENET.
    149               DNA
    140               J. EXP. MED.
    140               GENETICS
    137               J. MOL. EVOL.
    134               DEVELOPMENT
    123               BLOOD
    120               HUM. MUTAT.
    117               HUM. GENET.
    116               NEURON
    114               DNA CELL BIOL.
    110               NAT. GENET.
    110               APPL. ENVIRON. MICROBIOL.
    109               HEMOGLOBIN
    104               AGRIC. BIOL. CHEM.


   ========================================================================

           APPENDIX B: RELATIONSHIPS BETWEEN BIOMOLECULAR DATABASES

   The current  status of the relationships (cross-references) between some
   biomolecular databases is shown in the following schematic:

                         ***********************
******************       *  EMBL Nucleotide    *       **********************
* EPD [Euk.Prom] * <---> *  Sequence Database  * <---- * ECDC [E.coli map]  *
******************       *       [EBI]         *       **********************
                         ***********************
                          ^  ^ ^  ^  ^ ^ ^  ^
******************        |  | |  I  | | |  |
* FlyBase        * <------+  | |  I  | | |  |          **********************
* [D.melanogas.] *        |  | |  I  | | |  +--------> * GCRDb [7TM recep.] *
******************        |  | |  I  | | |  |          **********************
                          |  | |  I  | | |  |
******************        |  | |  I  | | |  |          **********************
* SubtiList      * <---------+ |  I  | | +-----------> * EcoGene [E.coli]   *
* [B.subtilis]   *        |  | |  I  | | |  |          **********************
******************        |  | |  I  | | |  |
                          |  | |  I  | | |  |          **********************
******************        |  | |  I  +---------------> * LISTA [Yeast]      *
* MaizeDb        * <-----------+  I  | | |  |          **********************
* [Zea mays]     *        |  | |  I  | | |  |
******************        |  | |  I  | | |  |          **********************
                          |  | |  I  | +-------------> * SGD [Yeast]        *
******************        |  | |  I  | | |  |          **********************
* WormPep        *        |  | |  I  | | |  |
* [C.elegans]    * <----+ |  | |  I  | | |  |          **********************
******************      | |  | |  I  | | |  | +------> * DictyDB [D.disco.] *
                        | |  | |  I  | | |  | |        **********************
******************      | v  v v  v  v v v  v v
* REBASE         *      ***********************        **********************
* [Restriction   * <--- *  SWISS-PROT         * <----- * ENZYME [Nomencl.]  *
*  enzymes]      *      *  Protein Sequence   *        **********************
******************      *  Data Bank          *            v
                        ***********************        **********************
******************      ^ ^ ^ ^ ^ ^ ^ | ^ ^ |          * OMIM [Human]       *
* StyGene        *      | | | | | | | | | | +--------> **********************
* [S.Typhimurium]* <----+ | | | | | | | | |
******************        | | | | | | | | |            **********************
                          | | | | | | | | +----------> * ECO2DBASE     [2D] *
******************        | | | | | | | |              **********************
* Transfac       * <------+ | | | | | | |
******************          | | | | | | |              **********************
                            | | | | | | +------------> * SWISS-2DPAGE  [2D] *
******************          | | | | | |                **********************
* Harefield [2D] * <--------+ | | | | |
******************            | | | | |                **********************
                              | | | | +--------------> * Aarhus/Ghent  [2D] *
******************            | | | |                  **********************
* PROSITE        *            | | | |
* [Patterns and  * <----------+ | | +----------------> **********************
* profiles]      *              | |                    * YEPD [Yeast]  [2D] *
******************              | +----------------+   **********************
             |                  v                  |
             |          ***********************    +-> **********************
             +--------> * PDB [3D structures] * <----- * HSSP [3D similar.] *
                        ***********************        **********************


   =End=of=SWISS-PROT=release=33=notes=====================================
  

Swiss-Prot release 32.0

Published November 1, 1995

                    SWISS-PROT RELEASE 32.0 RELEASE NOTES


                               1. INTRODUCTION

   1.1  Evolution

   Release 32.0  of SWISS-PROT contains 49'340 sequence entries, comprising
   17'385'503  amino   acids  abstracted   from  43'056   references.  This
   represents an  increase of  11.3% over  release 31. The recent growth of
   the data bank is summarized below.

   Release    Date   Number of entries     Nb of amino acids

   2.0        09/86               3939               900 163
   3.0        11/86               4160               969 641
   4.0        04/87               4387             1 036 010
   5.0        09/87               5205             1 327 683
   6.0        01/88               6102             1 653 982
   7.0        04/88               6821             1 885 771
   8.0        08/88               7724             2 224 465
   9.0        11/88               8702             2 498 140
   10.0       03/89              10008             2 952 613
   11.0       07/89              10856             3 265 966
   12.0       10/89              12305             3 797 482
   13.0       01/90              13837             4 347 336
   14.0       04/90              15409             4 914 264
   15.0       08/90              16941             5 486 399
   16.0       11/90              18364             5 986 949
   17.0       02/91              20024             6 524 504
   18.0       05/91              20772             6 792 034
   19.0       08/91              21795             7 173 785
   20.0       11/91              22654             7 500 130
   21.0       03/92              23742             7 866 596
   22.0       05/92              25044             8 375 696
   23.0       08/92              26706             9 011 391
   24.0       12/92              28154             9 545 427
   25.0       04/93              29955            10 214 020
   26.0       07/93              31808            10 875 091
   27.0       10/93              33329            11 484 420
   28.0       02/94              36000            12 496 420
   29.0       06/94              38303            13 464 008
   30.0       10/94              40292            14 147 368
   31.0       02/95              43470            15 335 248
   32.0       11/95              49340            17 385 503
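
   Growth between releases can be recomputed from the table above. The
   Python sketch below copies the last four rows and expresses growth as a
   percentage of the previous release; that definition and the function
   names are assumptions of the sketch, so the result need not coincide
   with the percentage quoted in the text, whose basis is not stated.

```python
# (release, date, number of entries, number of amino acids), from the table above
RELEASES = [
    ("29.0", "06/94", 38303, 13464008),
    ("30.0", "10/94", 40292, 14147368),
    ("31.0", "02/95", 43470, 15335248),
    ("32.0", "11/95", 49340, 17385503),
]


def growth(previous, current):
    """Percentage increase from `previous` to `current`."""
    return 100.0 * (current - previous) / previous


def growth_table(releases):
    """Return (release, entry growth %, amino acid growth %) per release."""
    rows = []
    for prev, cur in zip(releases, releases[1:]):
        rows.append((
            cur[0],
            round(growth(prev[2], cur[2]), 1),
            round(growth(prev[3], cur[3]), 1),
        ))
    return rows


if __name__ == "__main__":
    for release, entries_pct, residues_pct in growth_table(RELEASES):
        print(f"{release}: entries +{entries_pct}%, amino acids +{residues_pct}%")
```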


      2. DESCRIPTION OF THE CHANGES MADE TO SWISS-PROT SINCE RELEASE 31

   2.1  Sequences and annotations

   5'959 sequences  have been  added since release 31, the sequence data of
   921 existing  entries has  been updated  and the  annotations of  10'691
   entries have been revised.

   Major annotation and sequence updates have been made in preparation for
   the changes that will take place in release 33 (see section 3.1 of these
   notes).



   2.2  What's happening with the model organisms

   We have  selected a  number of  organisms that  are the target of genome
   sequencing and/or mapping projects and for which we intend to:

   -  Be as  complete as  possible. All sequences available at a given time
      should be  immediately included  in SWISS-PROT.  This  also  includes
      sequence corrections and updates;
   -  Provide a higher level of annotation;
   -  Provide cross-references  to specialized  database(s)  that  contain,
      among other  data, some genetic information about the genes that code
      for these proteins;
   -  Provide specific indices or documents.

   What was  done since  the last  release or  in preparation  for the next
   release concerning model organisms:

   -  We have added two species to the list of model organisms:

      Haemophilus influenzae. Haemophilus influenzae is the first bacterium
      whose genome has been completely sequenced. Its 1'830 Kb sequence was
      recently determined (Science 269:496-512(1995)) by a team from The
      Institute for Genomic Research (TIGR). This bacterial genome codes for
      an estimated 1'740 protein sequences. We have already annotated and
      incorporated about 85% of this data into SWISS-PROT. The remainder
      will be made available in the following weeks.

      Candida albicans. We have added Candida albicans to the list of model
      organisms because  of the  extensive work  being done by Stew Scherer
      and colleagues at the Department of Microbiology of the University of
      Minnesota to  organize data  from that fungal organism. Their data is
      available from a WWW server:

                    http://alces.med.umn.edu/Candida.html

      We currently have in SWISS-PROT all the publicly available C.albicans
      protein sequences.

   -  We have  started a  major effort  in catching  up with the backlog of
      sequences from  Arabidopsis thaliana.  About 150  entries  have  been
      added since release 31. This effort will be continued and expanded in
      the next months.

   -  We have added to SWISS-PROT all the sequences from yeast chromosome
      VI. All the data from yeast chromosome X is also in preparation and
      will be available in a few days with the first weekly update of
      SWISS-PROT. Yeast sequence entries are now cross-referenced to both
      LISTA and SGD (see section 2.3). We plan to work on chromosome XII
      and XIII entries very soon.

   -  We are  regularly adding  data coming  from the  S.pombe chromosome I
      sequencing project.  About  180  S.pombe  entries  were  added  since
      release 31.

   -  Although we  added 234 entries from C.elegans, we have not yet caught
      up with  the backlog  of  sequence  data  produced  from  the  genome
      sequencing project of that organism. We hope to be able to clear up a
      significant part of that backlog for release 33.

   -  We are  almost up  to date  concerning Bacillus subtilis (306 entries
      added),  Escherichia   coli  (317   entries  added)   and  Salmonella
      typhimurium (35 entries added).

   -  A big effort is still needed to take care of human (214 entries
      added) and Drosophila (57 entries added) sequences.

   -  We plan  to add Mycoplasma genitalium (the second bacterial genome to
      be completely sequenced) as a model organism in release 33.

   Here is the current status of the model organisms:

   Organism         Database               Index file       Number of
                    cross-referenced                        sequences
   --------------   ---------------------  --------------   ---------
   A.thaliana       None yet               In preparation         432
   B.subtilis       SubtiList              SUBTILIS.TXT          1389
   C.albicans       None yet               CALBICAN.TXT           100
   C.elegans        WormPep                CELEGANS.TXT           924
   D.discoideum     DictyDB                DICTY.TXT              213
   D.melanogaster   FlyBase                In preparation         768
   E.coli           EcoGene                ECOLI.TXT             3468
   H.influenzae     None yet               HAEINFLU.TXT          1575
   H.sapiens        MIM                    MIMTOSP.TXT           3281
   S.cerevisiae     LISTA/SGD              YEAST.TXT             3391
   S.typhimurium    StyGene                SALTY.TXT              603
   S.pombe          None yet               POMBE.TXT              460
   S.solfataricus   None yet               None yet                61


   2.3  Changes in the DR line and other news about cross-references

   We have  added cross-references  from SWISS-PROT  to  the  Saccharomyces
   Genome Database  (SGD) (previously  known as SacchDB) prepared under the
   supervision of  Michael Cherry  at the  Stanford University  School of
   Medicine. These cross-references are present in the DR lines:

   Data bank identifier: SGD
   Primary identifier:   Unique identifier attributed by  SGD to  the  gene
                         coding for the protein
   Secondary identifier: The gene designation (name)
   Example:              DR   SGD; L0000008; AAR2.



   We very recently started to receive pre-release protein 3D-structure
   entries directly from PDB. Thanks to this new development, we will be
   able to keep the cross-references between SWISS-PROT and PDB up to date.
   Currently there  are 920 SWISS-PROT entries that are cross-referenced to
   PDB, but  we need  to catch up with a small backlog corresponding to the
   significant increase  in the  number of  PDB entries  in  the  last  six
   months. We  plan to be in synchronization with PDB starting with release
   33.

   There are  currently 174'439  DR lines in SWISS-PROT, an average of 3.53
   cross-references per entry.


   2.4  Replacement of RM line by RX line

   In this  release, the RM (Reference Medline) line has been replaced by a
   more 'generic'  line called  RX (Reference cross-references). The format
   of that line is:

   RX   BIBLIOGRAPHIC_DATABASE_NAME; IDENTIFIER.

   As of this release, the only "bibliographic_database_name" that is used
   is "MEDLINE" and the associated "identifier" is the eight-digit Medline
   Unique Identifier (UID). However, it is 'rumored' that additional
   bibliographic databases are interested in being linked to the sequence
   databases.

   Example:

   RM   91002678

   has been changed to:

   RX   MEDLINE; 91002678.

   There are currently 64'668 Medline cross-references (RX) in SWISS-PROT.
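
   As a rough illustration of the change described above, the following
   Python sketch rewrites an RM line into the new RX form and splits an RX
   line into its two fields. The function names are hypothetical and are
   not part of any SWISS-PROT software; the sketch assumes that the UID is
   always eight digits, as stated above.

```python
import re

# Hypothetical helpers, for illustration only.
# An old-style line looks like "RM   91002678" (eight-digit Medline UID).
_RM_LINE = re.compile(r"^RM\s+(\d{8})\s*$")


def rm_to_rx(line: str) -> str:
    """Return the RX equivalent of an RM line; other lines pass through."""
    match = _RM_LINE.match(line)
    if match is None:
        return line
    return f"RX   MEDLINE; {match.group(1)}."


def parse_rx(line: str) -> tuple[str, str]:
    """Split an RX line into (bibliographic database name, identifier)."""
    if not line.startswith("RX   "):
        raise ValueError(f"not an RX line: {line!r}")
    # Drop the line code, the trailing period, then split on "; ".
    database, identifier = line[5:].rstrip().rstrip(".").split("; ", 1)
    return database, identifier
```

   Software that reads both old and new releases could apply such a
   conversion to every line and then handle only the RX format.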



   2.5  Status of the documentation files

   SWISS-PROT is  distributed with  a large  number of documentation files.
   Some of  these files  have been  available for  a long  time  (the  user
   manual, release  notes, the  various  indices  for  authors,  citations,
   keywords, etc.),  but  many  have  been  created  recently  and  we  are
   continuously adding  new files.  Since release  31, we  have added 8 new
   document files.  The following  table lists  all the  documents that are
   either currently  available or  that we  plan to  add in  the  next  few
   months.

   USERMAN .TXT   User manual
   RELNOTES.TXT   Release notes
   SHORTDES.TXT   Short description of entries in SWISS-PROT
   JOURLIST.TXT   List of abbreviations for journals cited
   KEYWLIST.TXT   List of keywords in use
   SPECLIST.TXT   List of organism identification codes
   EXPERTS .TXT   List of on-line experts for PROSITE and SWISS-PROT
   SUBMIT  .TXT   Submission of sequence data to the SWISS-PROT data bank [1]

   ACINDEX .TXT   Accession number index
   AUTINDEX.TXT   Author index
   CITINDEX.TXT   Citation index
   KEYINDEX.TXT   Keyword index
   SPEINDEX.TXT   Species index

   7TMRLIST.TXT   List of 7-transmembrane G-linked receptors entries
   AATRNASY.TXT   List of aminoacyl-tRNA synthetases [1]
   ALLERGEN.TXT   Nomenclature and index of allergen sequences [1]
   CALBICAN.TXT   Index of Candida albicans entries and their corresponding
                  gene designations [1]
   CDLIST  .TXT   CD nomenclature for surface proteins of human leucocytes
   CELEGANS.TXT   Index of Caenorhabditis elegans entries and their
                  corresponding gene designations and WormPep cross-
                  references
   DICTY   .TXT   Index of Dictyostelium discoideum entries and their
                  corresponding gene designations and DictyDB cross-
                  references
   EC2DTOSP.TXT   Index of Escherichia coli Gene-protein database entries
                  referenced in SWISS-PROT
   ECOLI   .TXT   Index of Escherichia coli K12 chromosomal entries and
                  their corresponding EcoGene cross-reference
   EMBLTOSP.TXT   Index of EMBL Database entries referenced in SWISS-PROT
   EXTRADOM.TXT   Nomenclature of extracellular domains
   GLYCOSYL.TXT   Index of glycosyl hydrolases classified by families on the
                  basis of sequence similarities [2]
   HAEINFLU.TXT   Index of Haemophilus influenzae RD chromosomal entries [1]
   HOXLIST .TXT   Vertebrate homeotic Hox proteins: nomenclature and index
   HUMCHR21.TXT   Index of protein sequence entries encoded on human
                  chromosome 21
   HUMCHR22.TXT   Index of protein sequence entries encoded on human
                  chromosome 22 [1]
   HUMCHRY .TXT   Index of protein sequence entries encoded on human
                  chromosome Y
   MIMTOSP .TXT   Index of MIM entries referenced in SWISS-PROT
   NOMLIST .TXT   List of nomenclature related references for proteins
   PDBTOSP .TXT   Index of Brookhaven PDB entries referenced in SWISS-PROT
   PEPTIDAS.TXT   Classification of peptidase families and index of peptidases
                  entries [1]
   PLASTID .TXT   List of chloroplast and cyanelle encoded proteins
   POMBE   .TXT   Index of Schizosaccharomyces pombe entries in SWISS-PROT
                  and their corresponding gene designations
   RESTRIC .TXT   List of restriction enzymes and methylases entries
   RIBOSOMP.TXT   Index of ribosomal proteins classified by families on the
                  basis of sequence similarities [2]
   SALTY   .TXT   Index of Salmonella typhimurium  LT2 chromosomal entries
                  and their corresponding StyGene cross-references
   SUBTILIS.TXT   Index of Bacillus subtilis 168 chromosomal entries and
                  their corresponding SubtiList cross-references
   YEAST   .TXT   Index of Saccharomyces cerevisiae entries and their
                  corresponding gene designations [3]
   YEAST1  .TXT   Yeast Chromosome I entries
   YEAST2  .TXT   Yeast Chromosome II entries
   YEAST3  .TXT   Yeast Chromosome III entries
   YEAST5  .TXT   Yeast Chromosome V entries
   YEAST6  .TXT   Yeast Chromosome VI entries [1]
   YEAST8  .TXT   Yeast Chromosome VIII entries
   YEAST9  .TXT   Yeast Chromosome IX entries
   YEAST10 .TXT   Yeast Chromosome X entries [2]
   YEAST11 .TXT   Yeast Chromosome XI entries

   Notes:

   [1]  New in release 32.
   [2]  Will be available starting with release 33 in February 1996.
   [3]  The format of that file was changed to add cross-references to SGD.


   We have also started to include in SWISS-PROT document files listings of
   World-Wide Web sites relevant to the subject under consideration. For
   example, in the "POMBE.TXT" file, you will find the following lines:

   More  information   on  Schizosaccharomyces,  its  genome,  biology  and
   genetics, is available from the following WWW pages:

   NIH : http://www.nih.gov/sigs/yeast/fission.html
   Salk: http://flosun.salk.edu/users/forsburg/lab.html
   UCL : http://t-chappell.mcbl.ucl.ac.uk/


   2.6  The ExPASy World-Wide Web server

        2.6.1  Background information

   The most  efficient and  user-friendly way  to browse  interactively  in
   SWISS-PROT, PROSITE, ENZYME, SWISS-2DPAGE and other databases is to use
   the World-Wide  Web (WWW)  molecular biology  server ExPASy.  WWW  is  a
   global information  retrieval system  merging the  power  of  world-wide
   networks, hypertext  and multimedia.  Through hypertext  links, it gives
   access to  documents and  information available  on thousands of servers
   around the  world. To  access a  WWW server  one needs  a  WWW  browser.
   Popular  browsers   available  for   most  computer   platforms  include
   Mosaic(TM),  developed   at  the   National  Center  for  Supercomputing
   Applications (NCSA) of the University of Illinois at Urbana-Champaign (it
   may be obtained by anonymous ftp from ftp.ncsa.uiuc.edu) and Netscape
   Navigator(TM)  from   Netscape  Communications   Corp.  (available  from
   ftp.netscape.com). Using  a WWW  browser, one  has  access  to  all  the
   hypertext documents  stored on  the ExPASy  server as well as many other
   WWW servers.



   The ExPASy server was made available to the public in September 1993. In
   November 1995 a cumulative total of 3 million connections was attained.
   It may  be accessed  through its  Uniform Resource  Locator (URL  -  the
   addressing system defined in WWW), which is:

        http://expasy.hcuge.ch/

   The ExPASy  WWW server  allows access, using the user-friendly hypertext
   model, to  the SWISS-PROT,  PROSITE,  ENZYME,  SWISS-2DPAGE  and  SWISS-
   3DIMAGE databases and, through any SWISS-PROT protein sequence entry, to
   other databases  such as  EMBL, EcoCyc,  FlyBase, GCRDb, LISTA, MaizeDB,
   SubtiList, OMIM, PDB, HSSP, ProDom, REBASE, SGD, YEPD and Medline. Using
   a browser which is able to display images, one can also remotely access
   2D gel image data from SWISS-2DPAGE. ExPASy also offers many tools for
   the analysis of protein sequences and 2D gels.

   For more  information on  the  ExPASy  WWW  server,  you  can  read  the
   following article:

      Appel R.D., Bairoch A., Hochstrasser D.F.
      A new  generation of  information retrieval tools for biologists: the
      example of the ExPASy WWW server.
      Trends Biochem. Sci. 19:258-260(1994).

   Or you can contact Dr. Ron Appel:

      Email: ron.appel@dim.hcuge.ch
      Fax: +41-22-372 61 98


        2.6.2  SWISS-SHOP

   Thanks to the work of Manuel Peitsch from the Geneva Glaxo Institute for
   Molecular Biology, we can provide, on ExPASy, a service called
   SWISS-SHOP. SWISS-SHOP allows any user of SWISS-PROT to indicate which
   proteins he/she is interested in. This can be done using various
   criteria that can be combined:

   -  By entering  one  or  more  words  that  should  be  present  in  the
      description line;
   -  By entering one or more species name(s) or taxonomic division(s);
   -  By entering one or more keywords;
   -  By entering one or more author names;
   -  By entering the accession number (or entry name) of a PROSITE pattern
      or a user-defined sequence pattern;
   -  By entering  the accession  number (or  entry name)  of  an  existing
      SWISS-PROT entry or by entering a "private" sequence.

   Every week,  the new  sequences entered  in SWISS-PROT are automatically
   compared with all the criteria that have been defined by the users. If a
   sequence corresponds  to the  selection criteria defined by a user, that
   sequence is sent by electronic mail.
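
   The weekly comparison described above can be pictured with the
   following Python sketch. It is only an illustration and not the actual
   SWISS-SHOP implementation: the class and field names are hypothetical,
   the criteria are assumed to be combined with a logical AND, and the
   pattern-based and sequence-based criteria are left out.

```python
from dataclasses import dataclass, field


@dataclass
class Entry:
    """Minimal stand-in for a SWISS-PROT entry (field names are assumptions)."""
    accession: str
    description: str
    species: str
    keywords: set = field(default_factory=set)
    authors: set = field(default_factory=set)


@dataclass
class Criteria:
    """One user's registered interests; empty fields are ignored."""
    description_words: set = field(default_factory=set)
    species: set = field(default_factory=set)
    keywords: set = field(default_factory=set)
    authors: set = field(default_factory=set)


def matches(entry: Entry, wanted: Criteria) -> bool:
    """True if the entry satisfies every non-empty criterion (assumed AND)."""
    description = entry.description.lower()
    if wanted.description_words and not all(
        word.lower() in description for word in wanted.description_words
    ):
        return False
    if wanted.species and entry.species not in wanted.species:
        return False
    if wanted.keywords and not (wanted.keywords & entry.keywords):
        return False
    if wanted.authors and not (wanted.authors & entry.authors):
        return False
    return True


def weekly_notifications(new_entries, subscriptions):
    """Map each user's e-mail address to the accessions to send this week."""
    return {
        email: [e.accession for e in new_entries if matches(e, wanted)]
        for email, wanted in subscriptions.items()
    }
```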





        2.6.3  What is new on ExPASy

   Since the last release, there have been a large number of new
   developments on the ExPASy WWW server. Here are some highlights of these
   changes:

   -  A new option has been introduced that allows you to search SWISS-PROT,
      PROSITE and SeqAnalRef by citation. When you call this option, you
      are prompted to enter the name of a journal and optionally a volume
      number and/or a year. The program is written in such a way that you
      can enter either the full name of a journal or its official
      abbreviation (as listed in the JOURLIST.TXT file), with or without
      periods. It is also able to recognize special abbreviations such as
      JBC, NAR, PNAS, etc. So, for example, you can enter any of:

      Journal of Biological Chemistry
      J. Biol. Chem.
      J Biol Chem
      JBC

      If you  do not  enter a  valid journal  name or abbreviation, it will
      show you the list of those that could potentially match your input.


   -  We have  improved the  options that allow you to search in SWISS-PROT
      by 'description' or by 'full text':

        If your search criteria return a list that contains more than two
        entries, you now have the option to save these SWISS-PROT
        entries into a file which is stored (for up to a week) on the
        ExPASy anonymous FTP server. Thus it is now possible for users to
        create custom subsets of the database and to download them to their
        computer.

        If your search criteria do not return any entry, you can, if you
        believe that the sequence(s) that you are looking for are currently
        missing in SWISS-PROT, send a message to the SWISS-PROT team so
        that they can take steps to ensure that these sequence(s) are added
        to the database.

   -  The Journal  of Biological  Chemistry (JBC)  has a  WWW server  where
      abstracts and  full text of articles are made available. We are happy
      to announce  the implementation  of what  we believe  to be the first
      direct link  in a  sequence database between a reference and the full
      text version  of a cited article. Recent JBC references in SWISS-PROT
      and PROSITE  are directly  linked to the corresponding entry point in
      the JBC server.

   -  ProtParam is a new tool which we have implemented and that allows the
      computation of  various physical  and chemical parameters for a given
      protein stored  in SWISS-PROT  or for  a user  entered sequence.  The
      computed parameters  include the  molecular weight,  theoretical  pI,
      amino acid  composition, extinction coefficient, estimated half-life,
      instability index and aliphatic index.




   -  RandSeq is  a new  tool which generates random protein sequences. You
      can choose the length of the sequence to be created as well as choose
      between four  different options  for the composition of the generated
      sequence: equal  composition for all amino acids; use the composition
      of  a   specific  sequence   from  SWISS-PROT;   average  amino  acid
      composition (computed from SWISS-PROT); user specified composition in
      percent.

   -  WWW links have been implemented between SWISS-PROT yeast entries and
      SGD (see section 2.3), as well as between Escherichia coli K12
      chromosomal entries and the EcoCyc database, the Encyclopedia of E.
      coli Genes and Metabolism.

   -  Most SWISS-PROT  documents are  now directly  linked to  relevant WWW
      servers or specific documents (see section 2.5).

   -  Many other changes have been made to all parts of the server.
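
   To give an idea of what a tool such as RandSeq (see the list above)
   does, here is a minimal Python sketch of random protein sequence
   generation with a chosen composition. It is not the ExPASy code: the
   function names are hypothetical, and an "average SWISS-PROT
   composition" would have to be supplied by the caller, as no such table
   is given in these notes.

```python
import random
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard one-letter codes


def composition_of(sequence: str) -> dict:
    """Fractional amino acid composition of an existing sequence."""
    counts = Counter(c for c in sequence.upper() if c in AMINO_ACIDS)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sequence contains no standard amino acids")
    return {aa: counts[aa] / total for aa in AMINO_ACIDS}


def random_sequence(length: int, composition=None, rng=None) -> str:
    """Generate a random sequence of the given length.

    composition maps one-letter codes to relative weights (fractions or
    percentages both work); None means equal weight for all amino acids.
    """
    rng = rng or random.Random()
    if composition is None:
        weights = [1.0] * len(AMINO_ACIDS)
    else:
        weights = [composition.get(aa, 0.0) for aa in AMINO_ACIDS]
    return "".join(rng.choices(AMINO_ACIDS, weights=weights, k=length))
```

   Passing the result of composition_of() for an existing sequence covers
   the "composition of a specific sequence" option, while a dictionary of
   percentages covers the user-specified option.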


   2.7  Weekly updates of SWISS-PROT

   Weekly updates of SWISS-PROT are available by anonymous FTP. Three files
   are updated at each update:

   new_seq.dat    Contains all the new entries since the last full release;
   upd_seq.dat    Contains the entries for which the sequence data has been
                  updated since the last release;
   upd_ann.dat    Contains the  entries for  which one  or more  annotation
                  fields have been updated since the last release.

   Currently these  files are  available on  the  following  anonymous  ftp
   servers:

   Organization   ExPASy (Geneva University Expert Protein Analysis System)
   Address        expasy.hcuge.ch  (or 129.195.254.61)
   Directory      /databases/swiss-prot/updates

   Organization   National Center for Biotechnology Information (NCBI)
   Address        ncbi.nlm.nih.gov (or 130.14.20.1)
   Directory      /repository/swiss-prot/updates

   Organization   European Bioinformatics Institute (EBI)
   Address        ftp.ebi.ac.uk (or 193.62.196.6)
   Directory      /pub/databases/swissprot/new

   Organization   Bioinformatics Unit, Weizmann Institute of Science (WIS)
   Address        bioinformatics.weizmann.ac.il (or 132.76.55.12)
   Directory      /pub/databases/swiss-prot/updates

   !!! Important notes !!!

   Although we try to follow a regular schedule, we do not promise to
   update these files every week. In some cases two weeks will elapse
   between two updates.



   Due to the current mechanism used to build a release, the entries that
   are provided in these updates are not guaranteed to be error-free.



                      3. IMPORTANT FORTHCOMING CHANGES

   3.1  Major changes to the cross-references to EMBL

   In the  next release,  the format  of the  DR (Database cross-Reference)
   lines pointing  to EMBL  Nucleotide Sequence  Database entries  will  be
   changed from:

   DR   EMBL; ACCESSION_NUMBER; ENTRY_NAME.

   to:

   DR   EMBL; ACCESSION_NUMBER; PID; STATUS_IDENTIFIER.

   Where 'PID' stands for the "Protein IDentification" number. It is a
   number that you will find from EMBL release 45 onwards (and GenBank
   release 94.0 onwards) in a qualifier called "/db_xref" which is attached
   to every CDS in the nucleotide database. Example:

   FT   CDS            54..1382
   FT                  /note="ribulose-1,5-bisphosphate carboxylase/
   FT                  oxygenase activase precursor"
   FT                  /db_xref="PID:g1006835"

   When an EMBL database CDS exists as a sequence report in SWISS-PROT, the
   SWISS-PROT DR  lines of  the  corresponding  SWISS-PROT  entry  will  be
   updated by  citing the PID as secondary identifier. In all cases where a
   PID will  have been  integrated into  SWISS-PROT, a "/db_xref" qualifier
   citing the  corresponding SWISS-PROT  entry will  be added  to the  EMBL
   database CDS labeled with this PID. Example:

   FT   CDS             14556..15696
   FT                   /gene="cytochrome b"
   FT                   /codon_start=1
   FT                   /product="apoprotein"
   FT                   /db_xref="PID:g463170"
   FT                   /db_xref="SWISS-PROT:P12778"

   This approach  enables us  to point  precisely from  a given  SWISS-PROT
   entry to one of potentially many CDS in the corresponding EMBL entry and
   vice versa.  This change  will allow  the development  of software tools
   that automatically retrieve the part of a nucleotide sequence entry that
   codes for  a specific  protein. This  will be  especially useful  in the
   context of  World-Wide Web  as  it  will  render  obsolete  the  current
   situation where,  for  example,  one  needs  to  retrieve  the  complete
   sequence of  a yeast  chromosome when  one wants the nucleotide sequence
   coding for a specific protein encoded on that chromosome.
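
   A software tool of the kind mentioned above could work roughly as in
   the following Python sketch, which looks up the CDS labeled with a
   given PID in a list of feature table (FT) lines and cuts the
   corresponding region out of the nucleotide sequence. The function names
   are hypothetical and the sketch makes a strong simplifying assumption:
   only plain "start..end" locations are handled, so CDS features
   described with operators such as join() or complement() are ignored.

```python
import re

# Simple CDS location of the form "FT   CDS            54..1382".
_CDS = re.compile(r"^FT   CDS\s+(\d+)\.\.(\d+)\s*$")
# Qualifier of the form FT   /db_xref="PID:g1006835".
_XREF = re.compile(r'^FT\s+/db_xref="([^:"]+):([^"]+)"\s*$')


def cds_by_pid(feature_lines):
    """Map each PID to the (start, end) of its CDS (1-based, inclusive)."""
    locations = {}
    current = None  # location of the CDS currently being read, if any
    for line in feature_lines:
        cds = _CDS.match(line)
        if cds:
            current = (int(cds.group(1)), int(cds.group(2)))
            continue
        if line.startswith("FT   ") and not line.startswith("FT    "):
            # Another feature key (or a CDS with a complex location).
            current = None
            continue
        xref = _XREF.match(line)
        if xref and current and xref.group(1) == "PID":
            locations[xref.group(2)] = current
    return locations


def extract_cds(nucleotides: str, location) -> str:
    """Return the part of the nucleotide sequence covered by the location."""
    start, end = location
    return nucleotides[start - 1:end]
```

   With the first example shown above, cds_by_pid() would associate the
   PID "g1006835" with the location (54, 1382).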

   This major change has been in preparation for the last six months; it
   is one of the reasons that release 32 was delayed so long. In the course
   of cross-referencing at the level of the "PID", we had to manually check
   thousands of problem cases. This led to many sequence and annotation
   updates.

   An additional important principle of the PID system is that whenever a
   change made to the nucleotide entry, or to the annotations of that
   entry, produces a modification in the translated protein sequence, the
   PID number corresponding to the modified CDS is replaced by a completely
   new number. The old number will be kept in a special field tagged to the
   CDS. The exact syntax of this field is under discussion among the
   international nucleotide databases.

   The  new   cross-referencing   system   will   allow   a   much   closer
   interconnection between  SWISS-PROT  and  the  international  nucleotide
   sequence databases.  For example, it will allow us to automatically take
   into account  sequence updates  made to  the nucleotide entry when these
   updates have an impact on the derived protein sequence(s).

   It should also be noted that the "PID" numbers in the context of GenBank
   replace the  "NCBI gi" numbering system which was present in the "/note"
   qualifier. The "gi" identifiers for the nucleic acid sequences have been
   replaced by "NID" (nucleic acid identifier) numbers.

   The 'STATUS_IDENTIFIER'  provides  information  about  the  relationship
   between the  sequence in  the  SWISS-PROT  entry  and  the  CDS  in  the
   corresponding EMBL entry.

   a) In  most cases  the translation  of the  EMBL nucleotide sequence CDS
   results in  the same  sequence as  shown in the corresponding SWISS-PROT
   entry or  the differences  are mentioned  in the SWISS-PROT feature (FT)
   lines as  CONFLICT, VARIANT  or VARSPLIC  and in  the RP lines. In these
   cases the status identifier shows a dash ("-").

   Example:

   DR   EMBL; Y00312; G63880; -.

   b) In some cases the translation of the EMBL nucleotide sequence CDS
   results in a sequence different from the sequence shown in the
   corresponding SWISS-PROT entry, and the differences are either not
   mentioned in the SWISS-PROT feature (FT) lines as CONFLICT, VARIANT or
   VARSPLIC and in the RP lines, or simply do not meet the criteria for
   such situations.

   1) If the  difference is  due to a different start of the sequence (e.g.
      SWISS-PROT believes  that the  start of  the sequence  is upstream or
      downstream of  the site annotated as the start of the sequence in the
      EMBL database),  the status  identifier shows the comment "ALT_INIT".
      Example:

        DR   EMBL; L29151; G466334; ALT_INIT.






   2) If the  difference is  due to a different termination of the sequence
      (e.g. SWISS-PROT  believes that  the termination  of the  sequence is
      upstream or  downstream of  the site  annotated as  the  end  of  the
      sequence in  the EMBL  database), the  status  identifier  shows  the
      comment "ALT_TERM". Example:

        DR   EMBL; L20562; G398099; ALT_TERM.


   3) If the  difference is  due to  frameshifts in  the EMBL sequence, the
      status identifier shows the comment "ALT_FRAME". Example:

        DR   EMBL; M95935; G146416; ALT_FRAME.


   4) If the difference is not due to the cases mentioned above (e.g. wrong
      intron-exon boundaries  given in  the EMBL  entry) or to a mixture of
      the cases  mentioned above,  the status  identifier shows the comment
      "ALT_SEQ". Example:

        DR   EMBL; X79206; G809602; ALT_SEQ.

   c) In some cases the nucleotide sequence of a complete CDS is divided
   into exons present in different EMBL entries. We point to the
   exon-containing EMBL entries by citing the PID as secondary identifier
   and adding the comment "JOINED" in the status identifier. These EMBL
   entries do not contain a CDS feature; they contain exons joined to a CDS
   feature which is labeled with the given PID.

   Example:

   DR   EMBL; M63397; G177196; -.
   DR   EMBL; M63395; G177196; JOINED.
   DR   EMBL; M63396; G177196; JOINED.

   In the  above example  the SWISS-PROT  sequence is  derived from the CDS
   labeled with  the PID G177196. This CDS feature can be found in the EMBL
   entry M63397.  Exons belonging  to this  CDS are  not only found in EMBL
   entry M63397, but also in the EMBL entries M63395 and M63396.

   d) In some cases there is no CDS feature key annotating a protein
   translation in an EMBL entry and thus no PID for that CDS. It is
   therefore not possible for us to point to a PID as a secondary
   identifier. In these cases we point to the relevant EMBL entries by
   including a dash ("-") in the position of the missing PID and
   "NOT_ANNOTATED_CDS" in the status identifier.

   Example:

   DR   EMBL; J04126; -; NOT_ANNOTATED_CDS.
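
   For developers who need to adapt their software, the following Python
   sketch shows one possible way to read the new DR EMBL lines. The names
   used are hypothetical; the list of status identifiers is taken from the
   cases a) to d) described above.

```python
from typing import NamedTuple, Optional

# Status identifiers listed in cases a) to d) of this section.
STATUS_VALUES = {
    "-", "ALT_INIT", "ALT_TERM", "ALT_FRAME", "ALT_SEQ",
    "JOINED", "NOT_ANNOTATED_CDS",
}


class EmblXref(NamedTuple):
    accession: str
    pid: Optional[str]  # None when the EMBL entry has no annotated CDS
    status: str


def parse_embl_dr(line: str) -> EmblXref:
    """Parse 'DR   EMBL; ACCESSION_NUMBER; PID; STATUS_IDENTIFIER.'"""
    if not line.startswith("DR   EMBL;"):
        raise ValueError(f"not a DR EMBL line: {line!r}")
    # Drop the line code and the trailing period, then split the fields.
    fields = [f.strip() for f in line[5:].rstrip().rstrip(".").split(";")]
    if len(fields) != 4:
        raise ValueError(f"expected 4 fields, found {len(fields)}: {line!r}")
    _, accession, pid, status = fields
    if status not in STATUS_VALUES:
        raise ValueError(f"unknown status identifier: {status!r}")
    return EmblXref(accession, None if pid == "-" else pid, status)
```

   Grouping the parsed lines by PID then gives, for each CDS, the EMBL
   entry that carries the CDS feature together with the entries marked
   "JOINED".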








   3.2  TREMBL - a supplement to SWISS-PROT

   The ongoing genome sequencing and mapping projects have dramatically
   increased the number of protein sequences to be incorporated into
   SWISS-PROT. Since we do not want to dilute the quality standards of
   SWISS-PROT by incorporating sequences without proper sequence analysis
   and annotation, we cannot speed up the incorporation of new incoming
   data indefinitely. But as we also want to make the sequences available
   as fast as possible, we will introduce a computer-annotated supplement
   to SWISS-PROT. This supplement consists of entries in SWISS-PROT-like
   format derived from the translation of all coding sequences (CDS) in
   the EMBL nucleotide sequence database, except the CDS already included
   in SWISS-PROT.

   We name  this supplement  TREMBL  (TRanslation  from  EMBL),  since  the
   translation tools  used to  create the translations of the CDS are based
   on the  program  'trembl'  written  by  Thure  Etzold  at  the  EMBL  in
   Heidelberg.

   We will translate all CDS in the EMBL Nucleotide Sequence Database
   into TREMBL preentries. The preentries already present as sequence
   reports in SWISS-PROT will be excluded from TREMBL. Then the remaining
   entries will be automatically merged whenever possible to reduce
   redundancy in TREMBL. This step will lead to approximately 90'000 TREMBL
   entries, which will supplement SWISS-PROT.

   We will split TREMBL into two main sections: SP-TREMBL and REM-TREMBL.

   SP-TREMBL (SWISS-PROT TREMBL) will contain the entries (about 75'000)
   which should be incorporated into SWISS-PROT. SP-TREMBL will be
   partially redundant with SWISS-PROT, since approximately 40'000 of
   these SP-TREMBL entries will only be additional sequence reports of
   proteins already in SWISS-PROT. We will try to merge these sequence
   reports as fast as possible with the already existing SWISS-PROT entries
   for these proteins, so as to make SWISS-PROT and TREMBL completely
   nonredundant.

   REM-TREMBL (REMaining  TREMBL) will  contain the  entries (about 15'000)
   that we  do not  want to  include in  SWISS-PROT. This  section will  be
   organized in four subsections:

   1) Most REM-TREMBL entries will be immunoglobulins and T-cell receptors.
      We stopped entering immunoglobulins and T-cell receptors into
      SWISS-PROT, because we only want to keep the germ line gene derived
      translations of these proteins in SWISS-PROT and not all known
      somatically recombined variants of these proteins. We are expecting
      more than 10'000 immunoglobulins and T-cell receptors in TREMBL. We
      would like to create a specialized database dealing with these
      sequences as a further supplement to SWISS-PROT and keep only a
      representative cross-section of these proteins in SWISS-PROT.

   2) Another category of data which will not be included in SWISS-PROT are
      synthetic sequences.  Again, we do not want to leave these entries in
      TREMBL.  Ideally   one  should   build  a  specialized  database  for
      artificial sequences as a further supplement to SWISS-PROT.



   3) A third subsection consists of fragments with fewer than seven amino
      acids.

   4) The last subsection consists of CDS translations where we have strong
      evidence that these CDS do not code for real proteins.

   The first  full release of TREMBL will be distributed with release 34 of
   SWISS-PROT. However  we will  make available,  with release  33, a  beta
   release so that users and software developers can send us feedback about
   this new supplement to SWISS-PROT.
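
   The planned split of TREMBL into SP-TREMBL and the four REM-TREMBL
   subsections can be pictured with the short Python sketch below. It is
   purely illustrative: the flags passed to the function are hypothetical
   (these notes do not say how each category will be recognized), and the
   order in which the tests are applied is an assumption.

```python
from enum import Enum


class Section(Enum):
    SP_TREMBL = "SP-TREMBL"
    REM_IMMUNO = "REM-TREMBL: immunoglobulins and T-cell receptors"
    REM_SYNTHETIC = "REM-TREMBL: synthetic sequences"
    REM_SHORT = "REM-TREMBL: fragments with fewer than seven amino acids"
    REM_NOT_REAL = "REM-TREMBL: CDS unlikely to code for a real protein"


MIN_RESIDUES = 7  # fragments shorter than this go to REM-TREMBL


def classify(sequence: str, *, is_fragment: bool = False,
             immunoglobulin_or_tcr: bool = False, synthetic: bool = False,
             doubtful_cds: bool = False) -> Section:
    """Assign a translated CDS to a TREMBL section (hypothetical flags)."""
    if immunoglobulin_or_tcr:
        return Section.REM_IMMUNO
    if synthetic:
        return Section.REM_SYNTHETIC
    if is_fragment and len(sequence) < MIN_RESIDUES:
        return Section.REM_SHORT
    if doubtful_cds:
        return Section.REM_NOT_REAL
    # Everything else is a candidate for incorporation into SWISS-PROT.
    return Section.SP_TREMBL
```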



   3.3  Introduction of a new CC line-type topic (MASS SPECTROMETRY)

   We will  introduce in  the next  release a  new 'topic' for the comments
   (CC) line-type: MASS SPECTROMETRY. This topic will be used to report the
   exact molecular  weight of  a protein or part of a protein as determined
   by mass spectrometric methods. The syntax of this new topic will be:

   CC   -!- MASS SPECTROMETRY: MW=XX[; MW_ERR=XX]; METHOD=XX[; RANGE=XX-XX].

   Where:

   -  "MW=XX" is the determined molecular weight (MW);
   -  "MW_ERR=XX" (optional)  is the  accuracy or  error range  of  the  MW
      measurement;
   -  "METHOD=XX" is the mass spectrometric method: "ELECTROSPRAY" is used
      for electrospray ionization (ESI) and "MALDI" is used for
      matrix-assisted laser desorption/ionization;
   -  "RANGE=XX-XX" (optional) is used to indicate what part of the protein
      sequence entry corresponds to the molecular weight. If this qualifier
      is not  present, the  MW value  corresponds to the full length of the
      protein sequence.

   Examples of its usage:

   CC   -!- MASS SPECTROMETRY: MW=13423.3; METHOD=ELECTROSPRAY.
   CC   -!- MASS SPECTROMETRY: MW=71890; MW_ERR=7; METHOD=ELECTROSPRAY.
   CC   -!- MASS SPECTROMETRY: MW=8597.5; METHOD=ELECTROSPRAY; RANGE=40-119.

   It should  be noted  that the  syntax of this topic may evolve in future
   releases as  we expect  feedback from groups using mass spectrometry for
   protein identification on 2D gels, MW determination and characterization
   of post-translational modifications.
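
   As an illustration of the syntax described above, the following Python
   sketch splits a MASS SPECTROMETRY comment line into its qualifiers. It
   is not part of the SWISS-PROT distribution: the function name and the
   keys of the returned dictionary are hypothetical, and the sketch
   assumes the topic fits on a single CC line and that the mandatory MW
   and METHOD qualifiers are present.

```python
import re

# Hypothetical helper: matches the topic prefix and captures the qualifier list,
# dropping the final period that terminates the comment.
_MS_PREFIX = re.compile(r"^CC\s+-!-\s+MASS SPECTROMETRY:\s*(.*?)\.?\s*$")


def parse_mass_spectrometry(line):
    """Return the qualifiers of one MASS SPECTROMETRY line, or None if the
    line carries a different CC topic."""
    match = _MS_PREFIX.match(line)
    if match is None:
        return None

    # Qualifiers are separated by ';' and written as KEY=VALUE.
    fields = {}
    for item in match.group(1).split(";"):
        item = item.strip()
        if not item:
            continue
        key, _, value = item.partition("=")
        fields[key.strip()] = value.strip()

    result = {
        "mw": float(fields["MW"]),
        # MW_ERR and RANGE are optional according to the syntax above.
        "mw_err": float(fields["MW_ERR"]) if "MW_ERR" in fields else None,
        "method": fields["METHOD"],
        "range": None,
    }
    if "RANGE" in fields:
        start, end = fields["RANGE"].split("-")
        result["range"] = (int(start), int(end))
    return result
```

   When "range" is None, the MW value applies to the full length of the
   sequence, as stated above. Since the syntax of this topic may evolve,
   any such parser should be treated as provisional.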



   3.4  Change in the syntax of the SQ line

   The SQ  (SeQuence header)  line marks the beginning of the sequence data
   and gives  a quick  summary of its content. The format of the SQ line is
   currently:

   SQ   SEQUENCE  XXXX AA; XXXXX MW;  XXXXX CN;



   The line contains the length of the sequence in amino acids (AA)
   followed by the molecular weight (MW) rounded to the nearest mass unit
   (Dalton) and a checking number (CN) as shown in the example:

   SQ   SEQUENCE 104 AA; 11530 MW; 54319 CN;

   Starting with the next release, we will replace the checking number (CN)
   by a 32-bit CRC (Cyclic Redundancy Check) value. The new syntax will be:

   SQ   SEQUENCE  XXXX AA; XXXXX MW;  XXXXXXXX CRC32;

   Example:

   SQ   SEQUENCE   104 AA;  11530 MW;  7A70363C CRC32;
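
   Software that reads the SQ line will have to accept both forms during
   the transition. The Python sketch below shows one possible way to do
   so; the function name and dictionary keys are hypothetical. It only
   parses the line and does not attempt to recompute the CRC value, since
   the details of the CRC algorithm are not given here.

```python
import re

# Accepts the current form (decimal checking number, "CN") and the new form
# (eight hexadecimal digits, "CRC32"). Spacing between fields may vary.
_SQ_LINE = re.compile(
    r"^SQ\s+SEQUENCE\s+(\d+)\s+AA;\s+(\d+)\s+MW;\s+([0-9A-Fa-f]+)\s+(CN|CRC32);\s*$"
)


def parse_sq_line(line):
    """Return length, molecular weight and check value of an SQ line."""
    match = _SQ_LINE.match(line)
    if match is None:
        raise ValueError("not a recognised SQ line: %r" % line)
    length, weight, check, kind = match.groups()
    return {
        "length": int(length),
        "mw": int(weight),
        "check_type": kind,
        # The CRC32 value is hexadecimal; the old checking number is decimal.
        "check_value": int(check, 16) if kind == "CRC32" else int(check),
    }
```

   Applied to the two example lines above, the sketch reports the same
   length (104) and molecular weight (11530) and differs only in the type
   and value of the check field.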




                            4. ENZYME AND PROSITE


   4.1  The ENZYME data bank

        4.1.1  Content of the release

   Release 19.0  of the  ENZYME data bank is distributed with release 32 of
   SWISS-PROT. ENZYME  release 19.0  contains information  relative to 3601
   enzymes. We  have updated the data bank with new information released by
   the Nomenclature Committee of IUBMB.


        4.1.2  Improvements in the ENZYME section of the ExPASy WWW server

   On ExPASy, the display of ENZYME entries has been completely redesigned
   to make it more readable. One of the changes is that each compound
   listed in a reaction is presented on a separate line. Example:

   UDP-GLUCOSE + 2 NAD(+) + H(2)O = UDP-GLUCURONATE + 2 NADH.

   is now shown as:

         UDP-GLUCOSE
      +  2 NAD(+)
      +  H(2)O
     >=< UDP-GLUCURONATE
      +  2 NADH.
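
   The transformation can be approximated by the following Python sketch.
   It is not the code used on the ExPASy server: the function name is
   hypothetical and the column widths are an assumption. It relies on the
   compounds being separated by " + " and the two sides of the reaction
   by " = ", each surrounded by spaces, so that charges such as "(+)" are
   left untouched.

```python
def format_reaction(reaction):
    """Put each compound of an ENZYME reaction on its own line."""
    left, separator, right = reaction.partition(" = ")
    if not separator:
        raise ValueError("reaction has no ' = ' separator: %r" % reaction)

    lines = []
    for side_index, side in enumerate((left, right)):
        for compound_index, compound in enumerate(side.split(" + ")):
            if compound_index > 0:
                prefix = " +  "   # further compound on the same side
            elif side_index == 1:
                prefix = ">=< "   # first compound of the right-hand side
            else:
                prefix = "    "   # first compound of the left-hand side
            lines.append(prefix + compound.strip())
    return "\n".join(lines)


print(format_reaction(
    "UDP-GLUCOSE + 2 NAD(+) + H(2)O = UDP-GLUCURONATE + 2 NADH."
))
```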


   Links have been added to the Klotho database of metabolic compounds
   maintained by Toni Kazic at the Institute for Biomedical Computing at
   Washington University in St. Louis.






   4.2  The PROSITE data bank

        4.2.1  Statistics for release 13

   Release 13.0  of the PROSITE data bank is distributed with release 32 of
   SWISS-PROT. This  release of  PROSITE contains 889 documentation entries
   that describe  1'167 different  patterns, rules  and  profiles/matrices.
   Since the  last full  release (12.0  of June  1994)  we  added  104  new
   documentation entries  and updated  499 entries.  Therefore 68%  of  all
   PROSITE entries are either new or updated.

   Out of a total of 49'340 entries in SWISS-PROT, 24'137 are cross-
   referenced in PROSITE (excluding the false positives). This accounts
   for approximately 49% of the sequences in SWISS-PROT.
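
   The two percentages quoted in this section follow directly from the
   counts given above. A minimal Python sketch of the arithmetic (the
   helper name is hypothetical):

```python
def percent(part, whole):
    """Return part as a percentage of whole."""
    return 100.0 * part / whole


# New (104) plus updated (499) documentation entries out of 889.
new_or_updated = percent(104 + 499, 889)

# SWISS-PROT entries cross-referenced in PROSITE out of 49'340.
cross_referenced = percent(24137, 49340)

print("new or updated:   %.1f%%" % new_or_updated)
print("cross-referenced: %.1f%%" % cross_referenced)
```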

        4.2.2  List of the new entries in release 13

      C1q domain signature
      Death domain profile
      Forkhead-associated (FHA) domain profile
      PH domain profile
      Src homology 2 (SH2) domain profile
      Src homology 3 (SH3) domain profile
      WW/rsp5/WWP domain signature and profile
      S-layer homology domain signature
      Prokaryotic dksA/traR C4-type zinc finger
      Copper-fist domain
      Bacterial regulatory proteins, iclR family signature
      Bacterial regulatory proteins, marR family signature
      Bacterial regulatory proteins, tetR family signature
      Sigma-70 factors ECF subfamily signature
      Ribosomal protein L10 signature
      Ribosomal protein L24 signature
      Ribosomal protein L31 signature
      Ribosomal protein L7Ae signature
      Ribosomal protein L13e signature
      Ribosomal protein L18e signature
      Ribosomal protein L24e signature
      Ribosomal protein L27e signature
      Ribosomal protein L31e signature
      Ribosomal protein L34e signatures
      Ribosomal protein L35Ae signature
      Ribosomal protein L37e signature
      Ribosomal protein S6 signature
      Homoserine dehydrogenase signature
      Aspartate-semialdehyde dehydrogenase signature
      Pyridoxamine 5'-phosphate oxidase signature
      Respiratory-chain NADH dehydrogenase 20 Kd subunit signature
      Respiratory-chain NADH dehydrogenase 24 Kd subunit signature
      NNMT/PNMT/TEMT family of methyltransferases signature
      Ribosomal RNA adenine dimethylases signature
      Squalene and phytoene synthases signatures
      ROK family signature
      Casein kinase II regulatory subunit signature
      Shikimate kinase signature
      Prokaryotic diacylglycerol kinase signature
      Acetate and butyrate kinases family signatures
      RNA polymerases H / 23 Kd subunits signature
      RNA polymerases L / 13 to 16 Kd subunits signature
      RNA polymerases N / 8 Kd subunits signature
      RNA polymerases RPB6 / 6 Kd subunits signature
      Lipolytic enzymes "G-D-S-L" family, serine active site
      Class A bacterial acid phosphatases signature
      Phosphatidylinositol-specific phospholipase C profiles
      DNA/RNA non-specific endonucleases active site
      Thermonuclease family signature
      Chitinases family 18 signature
      Glycosyl hydrolases family 45 active site
      ATP-dependent serine proteases, lon family, serine active site
      Interleukin-1 beta converting enzyme family active sites
      Hydroxymethylglutaryl-coenzyme A lyase active site
      DNA photolyases class 2 signatures
      Adenylate cyclases class-I signatures
      Ribulose-phosphate 3-epimerase family signatures
      PpiC-type peptidyl-prolyl cis-trans isomerase signature
      Glucosamine/galactosamine-6-phosphate isomerases signature
      Terpene synthases signature
      SAICAR synthetase signatures
      NAD-dependent DNA ligase signatures
      Transposases, IS30 family, signature
      Molybdenum cofactor biosynthesis proteins signatures
      Radical activating enzymes signature
      Electron transfer flavoprotein beta-subunit signature
      Heavy-metal-associated domain
      Bacterial extracellular solute-binding proteins, family 1 signature
      Bacterial extracellular solute-binding proteins, family 3 signature
      Bacterial extracellular solute-binding proteins, family 5 signature
      Sulfate transporters signature
      Xanthine/uracil permeases family signature
      OmpA-like domain
      GPR1/FUN34/yaaH family signature
      FtsZ protein signatures
      Kinesin light chain repeat
      Bacterial microcompartiments proteins signature
      Flagella transport protein fliP family signatures
      Macrophage migration inhibitory factor family signature
      Scorpion short toxins signature
      GrpE protein signature
      Bacterial type II secretion system protein C signature
      Bacterial type II secretion system protein N signature
      Protein secE/sec61-gamma signature
      Fimbrial biogenesis outer membrane usher protein signature
      Apoptosis regulator proteins, Bcl-2 family signature
      GTP-binding nuclear protein ran signature
      Elongation factor Ts signatures
      Translation initiation factor SUI1 signature
      Calponin family repeat
      CAP protein signatures
      Hydrogenases expression/synthesis hupF/hypC family signature
      NOL1/NOP2/fmu family signature
      Hypothetical SUA5/yciO/yrdC family signature
      Hypothetical YBL055c/yjjV family signatures
      Hypothetical YBR002c family signature
      Hypothetical YBR177c/yheT family signature
      Hypothetical YER057c/yjgF family signature
      Hypothetical YKL151c/yjeF family signatures
      Hypothetical hesB/yadR/yfhF family signature
      Hypothetical yabO/yceC/yfiI family signature
      Hypothetical yciL/yejD/yjbC family signature
      Hypothetical yedF/yeeD/yhhP family signature
      Hypothetical yhdG/yjbN/yohI family signature



        4.2.3  Status of profiles in PROSITE

   This is  the second  release of PROSITE to include weight matrices (also
   known as  profiles). The last release included only two profile entries;
   this release includes 16 profiles. Seven of these profiles are described
   by documentation entries that are linked to both a signature pattern and
   a profile.

   As a profile is, in general, much more sensitive than a pattern, you
   should use the profile whenever you have access to the necessary
   software tools.

   Many new profiles are being prepared and will be progressively added to
   PROSITE. We also plan to upgrade some unsatisfactory pattern entries to
   profiles.

        4.2.4  Software to make use of the profiles

   A set of two programs (for Unix systems) has been developed by Philipp
   Bucher to make use of the PROSITE profile entries:

   pfscan    scans a single sequence for the occurrences of several
             PROSITE profile entries.
   pfsearch  searches a sequence database for occurrences of a single
             PROSITE profile entry.

   These programs  are  available  from  the  ISREC  anonymous  ftp  server
   "ulrec3.unil.ch"; the files are located in the directory "/pub/pftools".

   From the WWW, you can use "ProfileScan", an ISREC service that allows
   you to scan a sequence against the profile entries in PROSITE; the URL
   for this service is:

               http://ulrec3.unil.ch/software/profilescan.html

   A link to this tool is also provided by the ExPASy WWW server.




        4.2.5  Changes in the format of the PROSITE.DAT file

   In the  NR line  (Numerical  Results)  we  changed  the  format  of  the
   "/FALSE_NEG" qualifier and added a new qualifier, "/PARTIAL".

   The syntax  of the  "/FALSE_NEG" qualifier  which reports  the number of
   known  missed   sequences  used  to  be:  "/FALSE_NEG=x(y);"  where  `x'
   represented the  number of  hits and  `y' the  number of  sequences;  we
   simplified this  syntax to  "/FALSE_NEG=y;"  where  `y'  represents  the
   number of sequences.

   The new  qualifier "/PARTIAL"  is used to indicate the number of partial
   sequences which  belong to  the set  in consideration, but which are not
   hit by  the pattern  or profile  because  they  are  partial  (fragment)
   sequences. Its  syntax is  "/PARTIAL=y;" where `y' represents the number
   of sequences.


   Example of a complete block of NR lines:

   NR   /RELEASE=32,49340;
   NR   /TOTAL=123(56); /POSITIVE=115(51); /UNKNOWN=5(2); /FALSE_POS=3(3);
   NR   /FALSE_NEG=3; /PARTIAL=2;


   In the above example, the scan for the pattern (or profile) was done on
   release 32 of SWISS-PROT, which contains 49'340 sequence entries. The
   pattern (or profile) was found 123 times in 56 different sequences
   (/TOTAL). Out of those 123 `hits', 115 were produced by 51 sequences
   that belong to the set under consideration (/POSITIVE), 5 hits were
   produced by two sequences which could possibly belong to the set
   (/UNKNOWN) and 3 hits were produced by 3 other sequences (/FALSE_POS).
   That particular pattern missed 3 sequences (/FALSE_NEG) and there were
   two partial sequences that belong to the set under consideration but
   which do not include the region that contains that pattern (or profile)
   (/PARTIAL).
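
   For software developers, the following Python sketch shows one way to
   read a block of NR lines in the new format. The function name and the
   structure of the returned dictionary are hypothetical. Qualifiers of
   the form "x(y)" are returned as hits and sequences, while "/FALSE_NEG"
   and "/PARTIAL" carry a sequence count only.

```python
import re

# One qualifier looks like "/NAME=value;".
_QUALIFIER = re.compile(r"/([A-Z_]+)=([^;]+);")
# Values of the form "x(y)": x hits in y sequences.
_HITS_AND_SEQUENCES = re.compile(r"^(\d+)\((\d+)\)$")


def parse_nr_block(lines):
    """Collect the qualifiers of consecutive NR lines into a dictionary."""
    result = {}
    for line in lines:
        if not line.startswith("NR"):
            continue
        for name, value in _QUALIFIER.findall(line):
            value = value.strip()
            pair = _HITS_AND_SEQUENCES.match(value)
            if name == "RELEASE":
                # "/RELEASE=32,49340;": release number, number of entries.
                release, entries = value.split(",")
                result[name] = {"release": release, "entries": int(entries)}
            elif pair:
                result[name] = {
                    "hits": int(pair.group(1)),
                    "sequences": int(pair.group(2)),
                }
            else:
                result[name] = {"sequences": int(value)}
    return result


block = [
    "NR   /RELEASE=32,49340;",
    "NR   /TOTAL=123(56); /POSITIVE=115(51); /UNKNOWN=5(2); /FALSE_POS=3(3);",
    "NR   /FALSE_NEG=3; /PARTIAL=2;",
]
stats = parse_nr_block(block)
```

   In the example block, the hits and sequences reported for /POSITIVE,
   /UNKNOWN and /FALSE_POS add up to the values reported for /TOTAL
   (115 + 5 + 3 = 123 hits and 51 + 2 + 3 = 56 sequences).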



        4.2.6  New feature in the PROSITE.DOC file

   Starting with release 13, we added a new form of references in the
   PROSITE documentation file (PROSITE.DOC). These references are of the
   form "[En]", where "n" is a number. These references are used to point
   to electronic documents available on the World-Wide Web. Example:


   {BEGIN}
   ********************************
   * AAA-protein family signature *
   ********************************

   A large  family of  ATPases has  been described  [1 to  5,E1] whose  key
   feature is that they  share  a conserved region of about 220 amino acids
   that contains an ATP-binding site. This  family  is now called AAA,  for
   'A'TPases 'A'ssociated

   ..Lots of lines deleted..

   [ 5] Confalonieri F., Duguet M.
        BioEssays 17:639-650(1995).
   [E1] http://yeamob.pci.chemie.uni-tuebingen.de/AAA/Description.html
   {END}

   It is of course possible, on the ExPASy WWW server, when displaying a
   PROSITE documentation entry, to directly access these electronic
   references. While this change seems minor, we consider it the first
   step in the establishment of an on-line decentralized encyclopedia for
   protein families.
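
   Programs that process PROSITE.DOC can pick up the new references with
   a simple pattern match. The Python sketch below is one possible
   approach; the function name is hypothetical, and it assumes that each
   electronic reference occupies a single line of the form "[En] URL" in
   the reference list of an entry.

```python
import re

# A reference definition: optional indentation, "[En]", then the URL.
_E_REFERENCE = re.compile(r"^\s*\[E(\d+)\]\s+(\S+)\s*$", re.MULTILINE)


def electronic_references(entry_text):
    """Map each electronic reference label (e.g. "E1") to its URL."""
    return {
        "E" + number: url
        for number, url in _E_REFERENCE.findall(entry_text)
    }


entry = """\
   [ 5] Confalonieri F., Duguet M.
        BioEssays 17:639-650(1995).
   [E1] http://yeamob.pci.chemie.uni-tuebingen.de/AAA/Description.html
   {END}
"""
references = electronic_references(entry)
```

   Literature references such as "[ 5]" are ignored by the pattern, so
   existing code that handles them does not need to change.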



                             WE NEED YOUR HELP !

   We welcome feedback from our users. We would especially appreciate
   being notified if you find that sequences belonging to your field of
   expertise are missing from the data bank. We would also like to be
   notified about annotations to be updated, if, for example, the function
   of a protein has been clarified or if new post-translational information
   has become available.


                         APPENDIX A: SOME STATISTICS



   A.1  Amino acid composition

        A.1.1  Composition in percent for the complete data bank

   Ala (A) 7.58   Gln (Q) 4.03   Leu (L) 9.29   Ser (S) 7.17
   Arg (R) 5.17   Glu (E) 6.31   Lys (K) 5.91   Thr (T) 5.76
   Asn (N) 4.52   Gly (G) 6.88   Met (M) 2.36   Trp (W) 1.26
   Asp (D) 5.30   His (H) 2.23   Phe (F) 4.04   Tyr (Y) 3.21
   Cys (C) 1.70   Ile (I) 5.70   Pro (P) 4.92   Val (V) 6.52

   Asx (B) 0.001  Glx (Z) 0.001  Xaa (X) 0.02


        A.1.2  Classification of the amino acids by their frequency

   Leu, Ala, Ser, Gly, Val, Glu, Lys, Thr, Ile, Asp, Arg, Pro, Asn, Phe,
   Gln, Tyr, Met, His, Cys, Trp
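
   The classification above is simply the composition table of A.1.1
   sorted by decreasing percentage. A minimal Python sketch, with the
   values copied from A.1.1 and hypothetical names, reproduces it:

```python
# Composition in percent for the 20 standard amino acids (section A.1.1).
COMPOSITION = {
    "Ala": 7.58, "Gln": 4.03, "Leu": 9.29, "Ser": 7.17,
    "Arg": 5.17, "Glu": 6.31, "Lys": 5.91, "Thr": 5.76,
    "Asn": 4.52, "Gly": 6.88, "Met": 2.36, "Trp": 1.26,
    "Asp": 5.30, "His": 2.23, "Phe": 4.04, "Tyr": 3.21,
    "Cys": 1.70, "Ile": 5.70, "Pro": 4.92, "Val": 6.52,
}


def rank_by_frequency(composition):
    """Return the residue names ordered from most to least frequent."""
    return sorted(composition, key=composition.get, reverse=True)


print(", ".join(rank_by_frequency(COMPOSITION)))
```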



   A.2  Distribution of the sequences by their organism of origin

   Total number of species represented in this release of SWISS-PROT: 4921

        A.2.1 Table of the frequency of occurrence of species

        Species represented 1x: 2231
                            2x:  776
                            3x:  441
                            4x:  272
                            5x:  200
                            6x:  198
                            7x:  117
                            8x:   95
                            9x:  103
                           10x:   50
                       11- 20x:  194
                       21- 50x:  147
                       51-100x:   49
                         >100x:   48
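
   As a consistency check, the counts in this table add up to the total
   number of species given above (4921). A minimal Python sketch with the
   values copied from the table (the names are hypothetical):

```python
# Number of species by number of occurrences (section A.2.1).
SPECIES_BY_OCCURRENCE = {
    "1x": 2231, "2x": 776, "3x": 441, "4x": 272, "5x": 200,
    "6x": 198, "7x": 117, "8x": 95, "9x": 103, "10x": 50,
    "11-20x": 194, "21-50x": 147, "51-100x": 49, ">100x": 48,
}
TOTAL_SPECIES = 4921


def table_total(table):
    """Sum the species counts over all rows of the table."""
    return sum(table.values())


print(table_total(SPECIES_BY_OCCURRENCE) == TOTAL_SPECIES)
```

   Species represented by a single sequence account for roughly 45% of
   all species in this release.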













        A.2.2  Table of the most represented species

    Number   Frequency          Species
         1        3468          Escherichia coli
         2        3391          Baker's yeast (Saccharomyces cerevisiae)
         3        3281          Human
         4        1978          Mouse
         5        1773          Rat
         6        1575          Haemophilus influenzae
         7        1389          Bacillus subtilis
         8         924          Caenorhabditis elegans
         9         800          Bovine
        10         768          Fruit fly (Drosophila melanogaster)
        11         605          Chicken
        12         603          Salmonella typhimurium
        13         479          African clawed frog (Xenopus laevis)
        14         460          Fission yeast (Schizosaccharomyces pombe)
        15         432          Rabbit
                   432          Arabidopsis thaliana (Mouse-ear cress)
        17         376          Pig
        18         282          Maize
        19         275          Bacteriophage T4
        20         251          Vaccinia virus (strain Copenhagen)
        21         236          Rice
        22         232          Pseudomonas aeruginosa
        23         213          Slime mold (Dictyostelium discoideum)
        24         205          Tobacco
        25         193          Human cytomegalovirus (strain AD169)
        26         190          Pea
        27         183          Vaccinia virus (strain WR)
                   183          Wheat
        29         173          Barley
        30         165          Staphylococcus aureus
        31         161          Soybean
        32         160          Pseudomonas putida
                   160          Dog
        34         157          Rhodobacter capsulatus
        35         155          Neurospora crassa
        36         154          Autographa californica nuclear polyhedrosis virus
        37         150          Marchantia polymorpha (Liverwort)
        38         148          Sheep
                   148          Klebsiella pneumoniae
        40         146          Variola virus
                   146          Bacillus stearothermophilus
        42         138          Spinach
        43         130          Tomato
        44         124          Potato
        45         122          Rhizobium meliloti
                   122          Mycobacterium leprae
        47         117          Lactococcus lactis (subsp. lactis)
        48         116          Agrobacterium tumefaciens
        49         100          Candida albicans
                   100          Chlamydomonas reinhardtii
                   100          Streptomyces coelicolor



   A.3  Distribution of the sequences by size

               From   To  Number             From   To   Number
                  1-  50    2622             1001-1100      445
                 51- 100    4679             1101-1200      318
                101- 150    6342             1201-1300      239
                151- 200    4810             1301-1400      151
                201- 250    4339             1401-1500      134
                251- 300    3837             1501-1600       83
                301- 350    3650             1601-1700       61
                351- 400    3624             1701-1800       60
                401- 450    2762             1801-1900       64
                451- 500    2777             1901-2000       40
                501- 550    1982             2001-2100       23
                551- 600    1412             2101-2200       51
                601- 650    1036             2201-2300       56
                651- 700     782             2301-2400       23
                701- 750     713             2401-2500       30
                751- 800     568             >2500          145
                801- 850     431
                851- 900     457
                901- 950     322
                951-1000     272


   A.4  Longest sequences

   The longest sequences (>=4000 residues) are listed here:

                               HTS1_COCCA  5217
                               FAT_DROME   5147
                               RYNR_RABIT  5037
                               RYNR_PIG    5035
                               RYNR_HUMAN  5032
                               RYNC_RABIT  4969
                               DYHC_DICDI  4725
                               DYHC_RAT    4644
                               DYHC_DROME  4639
                               APB_HUMAN   4563
                               APOA_HUMAN  4548
                               RRPA_CVMJH  4488
                               DYHC_TRIGR  4466
                               DYHC_ANTCR  4466
                               GRSB_BACBR  4451
                               PKSK_BACSU  4447
                               PKSL_BACSU  4427
                               YP73_CAEEL  4385
                               DYHC_NEUCR  4367
                               DYHC_EMENI  4344
                               PLEC_RAT    4140
                               DYHC_YEAST  4092
                               RRPA_CVH22  4085





   A.5  List of the most cited journals in SWISS-PROT

   Citations            Journal abbreviation

   4793                 J. BIOL. CHEM.
   3162                 NUCLEIC ACIDS RES.
   3037                 PROC. NATL. ACAD. SCI. U.S.A.
   2087                 J. BACTERIOL.
   1706                 GENE
   1644                 FEBS LETT.
   1535                 EUR. J. BIOCHEM.
   1394                 EMBO J.
   1323                 NATURE
   1304                 BIOCHEM. BIOPHYS. RES. COMMUN.
   1235                 BIOCHEMISTRY
   1023                 BIOCHIM. BIOPHYS. ACTA
    973                 J. MOL. BIOL.
    956                 CELL
    923                 MOL. CELL. BIOL.
    786                 MOL. GEN. GENET.
    716                 PLANT MOL. BIOL.
    705                 VIROLOGY
    684                 BIOCHEM. J.
    610                 SCIENCE
    570                 MOL. MICROBIOL.
    551                 J. BIOCHEM.
    452                 J. VIROL.
    404                 J. GEN. VIROL.
    316                 J. CELL BIOL.
    304                 GENOMICS
    287                 GENES DEV.
    258                 YEAST
    253                 BIOL. CHEM. HOPPE-SEYLER
    250                 CURR. GENET.
    233                 PLANT PHYSIOL.
    232                 ARCH. BIOCHEM. BIOPHYS.
    229                 J. IMMUNOL.
    223                 INFECT. IMMUN.
    213                 HOPPE-SEYLER'S Z. PHYSIOL. CHEM.
    212                 MOL. BIOCHEM. PARASITOL.
    197                 J. GEN. MICROBIOL.
    179                 MOL. ENDOCRINOL.
    175                 HUM. MOL. GENET.
    169                 J. CLIN. INVEST.
    167                 ONCOGENE
    156                 FEMS MICROBIOL. LETT.
    151                 AM. J. HUM. GENET.
    145                 DNA
    136                 J. EXP. MED.
    129                 J. MOL. EVOL.
    129                 GENETICS
    115                 BLOOD
    112                 DEVELOPMENT
    108                 NEURON
    108                 HEMOGLOBIN
    102                 AGRIC. BIOL. CHEM.


           APPENDIX B: RELATIONSHIPS BETWEEN BIOMOLECULAR DATABASES

   The current  status of the relationships (cross-references) between some
   biomolecular databases is shown in the following schematic:

                         ***********************
******************       *  EMBL Nucleotide    *       **********************
* EPD [Euk.Prom] * <---> *  Sequence Database  * <---- * ECDC [E.coli map]  *
******************       *       [EBI]         *       **********************
                         ***********************
                          ^  ^ ^  ^  ^ ^ ^  ^
******************        |  | |  I  | | |  |
* FlyBase        * <------+  | |  I  | | |  |          **********************
* [D.melanogas.] *        |  | |  I  | | |  +--------> * GCRDb [7TM recep.] *
******************        |  | |  I  | | |  |          **********************
                          |  | |  I  | | |  |
******************        |  | |  I  | | |  |          **********************
* SubtiList      * <---------+ |  I  | | +-----------> * EcoGene [E.coli]   *
* [B.subtilis]   *        |  | |  I  | | |  |          **********************
******************        |  | |  I  | | |  |
                          |  | |  I  | | |  |          **********************
******************        |  | |  I  +---------------> * LISTA [Yeast]      *
* MaizeDb        * <-----------+  I  | | |  |          **********************
* [Zea mays]     *        |  | |  I  | | |  |
******************        |  | |  I  | | |  |          **********************
                          |  | |  I  | +-------------> * SGD [Yeast]        *
******************        |  | |  I  | | |  |          **********************
* WormPep        *        |  | |  I  | | |  |
* [C.elegans]    * <----+ |  | |  I  | | |  |          **********************
******************      | |  | |  I  | | |  | +------> * DictyDB [D.disco.] *
                        | |  | |  I  | | |  | |        **********************
******************      | v  v v  v  v v v  v v
* REBASE         *      ***********************        **********************
* [Restriction   * <--- *  SWISS-PROT         * <----- * ENZYME [Nomencl.]  *
*  enzymes]      *      *  Protein Sequence   *        **********************
******************      *  Data Bank          *            v
                        ***********************        **********************
******************      ^ ^ ^  ^ ^  ^ | ^ ^ |          * OMIM [Human]       *
* StyGene        *      | | |  | |  | | | | +--------> **********************
* [S.Typhimurium]* <----+ | |  | |  | | | |
******************        | |  | |  | | | |            **********************
                          | |  | |  | | | +----------> * ECO2DBASE     [2D] *
******************        | |  | |  | | |              **********************
* Transfac       * <------+ |  | |  | | |
******************          |  | |  | | |              **********************
                            |  | |  | | +------------> * SWISS-2DPAGE  [2D] *
******************          |  | |  | |                **********************
* PROSITE        * <--------+  | |  | |
* [Patterns and  *             | |  | |                **********************
* profiles]      *             | |  | +--------------> * Aarhus/Ghent  [2D] *
******************             | |  |                  **********************
             |                 | |  |
             |                 | |  +----------------> **********************
             |                 | |                     * YEPD [Yeast]  [2D] *
             |                 | +-----------------+   **********************
             |                 v                   |
             |          ***********************    +-> **********************
             +--------> * PDB [3D structures] * <----- * HSSP [3D similar.] *
                        ***********************        **********************

Swiss-Prot release 31.0

Published February 1, 1995

                    SWISS-PROT RELEASE 31.0 RELEASE NOTES


                               1. INTRODUCTION

   1.1  Evolution

   Release 31.0  of SWISS-PROT  contains 43470 sequence entries, comprising
   15'335'248 amino acids abstracted from 39750 references. This represents
   an increase  of 8.3% over release 30. The recent growth of the data bank
   is summarized below.

   Release    Date   Number of entries     Nb of amino acids

   2.0        09/86               3939               900 163
   3.0        11/86               4160               969 641
   4.0        04/87               4387             1 036 010
   5.0        09/87               5205             1 327 683
   6.0        01/88               6102             1 653 982
   7.0        04/88               6821             1 885 771
   8.0        08/88               7724             2 224 465
   9.0        11/88               8702             2 498 140
   10.0       03/89              10008             2 952 613
   11.0       07/89              10856             3 265 966
   12.0       10/89              12305             3 797 482
   13.0       01/90              13837             4 347 336
   14.0       04/90              15409             4 914 264
   15.0       08/90              16941             5 486 399
   16.0       11/90              18364             5 986 949
   17.0       02/91              20024             6 524 504
   18.0       05/91              20772             6 792 034
   19.0       08/91              21795             7 173 785
   20.0       11/91              22654             7 500 130
   21.0       03/92              23742             7 866 596
   22.0       05/92              25044             8 375 696
   23.0       08/92              26706             9 011 391
   24.0       12/92              28154             9 545 427
   25.0       04/93              29955            10 214 020
   26.0       07/93              31808            10 875 091
   27.0       10/93              33329            11 484 420
   28.0       02/94              36000            12 496 420
   29.0       06/94              38303            13 464 008
   30.0       10/94              40292            14 147 368
   31.0       02/95              43470            15 335 248


      2. DESCRIPTION OF THE CHANGES MADE TO SWISS-PROT SINCE RELEASE 30

   2.1  Sequences and annotations

   About 3243 sequences have been added since release 30, the sequence data
   of 493  existing entries  has been  updated and  the annotations of 6729
   entries have been revised.






   2.2  What's happening with the model organisms

   As we announced in the last four releases, we have selected a number of
   organisms that are the target of genome sequencing and/or mapping
   projects and for which we intend to:

   -  Be as  complete as  possible. All sequences available at a given time
      should be  immediately included  in SWISS-PROT.  This  also  includes
      sequence corrections and updates;
   -  Provide a higher level of annotation;
   -  Provide cross-references  to specialized  database(s)  that  contain,
      among other  data, some genetic information about the genes that code
      for these proteins;
   -  Provide specific indices or documents.

   What was  done since  the last  release or  in preparation  for the next
   release:

   -  We have added four species to the list of model organisms:

        Salmonella typhimurium. We added all the missing publicly available
        protein sequences from this species and we cross-referenced the
        Salmonella sequences to the StyGene database from Ken Rudd (see
        section 2.3.2). A new documentation file (SALTY.TXT) is available
        (see section 2.4) that lists all the Salmonella typhimurium entries
        linked to StyGene.

        Schizosaccharomyces pombe (Fission yeast). For this organism we are
        looking for a genomic database to which the relevant SWISS-PROT
        entries can be linked. In this release we have added a significant
        number of S.pombe sequences. A new documentation file (POMBE.TXT)
        is available (see section 2.4) that lists all the
        Schizosaccharomyces pombe entries and their gene designation(s).

        Arabidopsis thaliana  (Mouse-ear cress). As of this release we have
        not yet  cross-linked the entries originating from this organism to
        a specific  genomic database  nor have  we yet  made a  significant
        effort to  enter all the Arabidopsis entries in SWISS-PROT. Such an
        effort is planned for the next release.

        Sulfolobus solfataricus. This archaebacterium is the target of a
        genomic project carried out at Dalhousie University in Halifax,
        Canada. It is expected that data originating from this project will
        be available very soon. In preparation for this development we have
        already entered in SWISS-PROT all the publicly available
        S.solfataricus sequences.

   -  SWISS-PROT is now cross-referenced to the Bacillus subtilis 168
      SubtiList database designed by Ivan Moszer of the Pasteur Institute
      in Paris (see section 2.3.3). We now have in SWISS-PROT 100% of the
      publicly available protein sequences referenced in SubtiList. A new
      documentation file (SUBTILIS.TXT) is available (see section 2.4) that
      lists all the Bacillus subtilis entries linked to SubtiList.




   -  SWISS-PROT is  now  cross-referenced  to  the  yeast  LISTA  database
      designed by  Patrick Linder  of the University of Geneva (see section
      2.3.1). We have worked together to ensure that the gene name used in
      LISTA always corresponds to the first one (if more than one gene name
      exists) listed in the relevant SWISS-PROT entry. All proteins
      referenced in LISTA are now in SWISS-PROT. We also made a major
      effort to integrate in this release all the new sequence data from
      the complete sequences of chromosomes I, V, VIII and IX. New
      documentation files are available for each of these chromosomes (see
      section 2.4).


   Here is the current status of the model organisms:

   Organism        Database                    Index file       Number of
                   cross-referenced                             sequences
   --------------  --------------------------  --------------   ---------
   A.thaliana      None yet                    In preparation         271
   B.subtilis      SubtiList                   SUBTILIS.TXT          1083
   C.elegans       WormPep                     CELEGANS.TXT           690
   D.discoideum    DictyDB                     DICTY.TXT              199
   D.melanogaster  FlyBase                     In preparation         711
   E.coli          EcoGene                     ECOLI.TXT             3151
   H.sapiens       MIM                         MIMTOSP.TXT           3067
   S.cerevisiae    LISTA                       YEAST.TXT             3144
   S.typhimurium   StyGene                     SALTY.TXT              568
   S.pombe         None yet                    POMBE.TXT              279
   S.solfataricus  None yet                    None yet                58


        Other relevant information

   We apologize for running behind with new sequencing data from the C.
   elegans genomic project. Starting with the next release, we will have
   implemented a mechanism to ensure that data from the genome project is
   directly fed into the SWISS-PROT annotation 'pipeline'.

   With the  help of the group of Elizabeth Kutter from the Evergreen State
   College, we  have revised  and added  protein sequence  entries from the
   completed genome  of phage  T4. The next release will incorporate all of
   the data  from the complete genome of the Autographa californica nuclear
   polyhedrosis virus.


   2.3  Changes in the DR line

   In this  release, we have added cross-references from SWISS-PROT to five
   additional databases.  These cross-references  are  present  in  the  DR
   lines.

        2.3.1  LISTA

   The LISTA database of budding yeast (Saccharomyces cerevisiae) genes
   coding for proteins prepared under the supervision of Patrick Linder



<PAGE>



   at the University of Geneva (See: Doelz R., Mosse M.-O., Slonimski P.P.,
   Bairoch A., and Linder P.; Nucleic Acids Res. (1994), 22:3459-3461).


   Data bank identifier:    LISTA
   Primary identifier  :    Unique identifier  attributed by  LISTA to  the
                            gene coding for the protein
   Secondary identifier:    The gene designation (name)
   Example             :    DR   LISTA; SC00018; ACT1.


        2.3.2  StyGene

   The  StyGene   section  of   the  StySeq/StyMap   integrated  Salmonella
   typhimurium LT2  database, both  prepared by  Ken Rudd  at the  National
   Center for Biotechnology Information (NCBI).

   Data bank identifier:    STYGENE
   Primary identifier  :    Unique identifier  attributed by StyGene to the
                            gene coding for the protein
   Secondary identifier:    The gene designation (name)
   Example             :    DR   STYGENE; SG10312; PROV.


        2.3.3  SubtiList

   The SubtiList  relational database  for the Bacillus subtilis 168 genome
   prepared under the supervision of Ivan Moszer at the Pasteur Institute
   (See Moszer I., Glaser P., and Danchin A.; Microbiol. (1995), In press).

   Data bank identifier:    SUBTILIST
   Primary identifier  :    Unique identifier  attributed by  SubtiList  to
                            the gene coding for the protein
   Secondary identifier:    The gene designation (name)
   Example             :    DR   SUBTILIST; BG10774; OPPD.


        2.3.4  HSSP

   The database  of Homology-derived Secondary Structure of Proteins (HSSP)
   prepared under the supervision of Chris Sander at the EMBL (See:
   Sander C., and Schneider R.; Nucleic Acids Res. (1993), 21:3105-3109).

   Data bank identifier:    HSSP
   Primary identifier  :    Accession number  of a  SWISS-PROT entry cross-
                            referenced to  a PDB  entry  whose structure is
                            expected to be  similar to that of the entry in
                            which the HSSP cross-reference is present
   Secondary identifier:    Entry name of the PDB structure related to that
                            of the entry in  which the HSSP cross-reference
                            is present
   Example             :    DR   HSSP; P00438; 1DOB.





<PAGE>



        2.3.5  Transfac

   The  transcription   factor  database   (Transfac)  developed  by  Edgar
   Wingender   and    Rainer   Knueppel    from   the   Gesellschaft   fuer
   Biotechnologische Forschung mbH in Braunschweig.

   Data bank identifier:    TRANSFAC
   Primary identifier  :    Unique identifier  (accession  number)  of  the
                            Transfac entry
   Secondary identifier:    None; a dash '-' is stored in that field
   Example             :    DR   TRANSFAC; T00141; -.
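
   The five DR line examples in sections 2.3.1 to 2.3.5 share one layout:
   the line code, a data bank identifier, a primary identifier and a
   secondary identifier, separated by semicolons and closed by a period.
   The following is a minimal Python sketch of how such a line could be
   split into its fields. The names CrossReference and parse_dr_line are
   illustrative assumptions and are not part of any SWISS-PROT software.

```python
from typing import NamedTuple, Optional


class CrossReference(NamedTuple):
    """Fields of one DR line (hypothetical container, not an official type)."""
    database: str
    primary: str
    secondary: Optional[str]  # None when the field holds a dash


def parse_dr_line(line: str) -> CrossReference:
    """Split a DR line of the form 'DR   DB; PRIMARY; SECONDARY.'."""
    if not line.startswith("DR   "):
        raise ValueError(f"not a DR line: {line!r}")
    body = line[5:].strip()
    if not body.endswith("."):
        raise ValueError(f"DR line must end with a period: {line!r}")
    # Drop the closing period, then split on the field separator.
    fields = [part.strip() for part in body[:-1].split(";")]
    if len(fields) != 3:
        raise ValueError(f"expected 3 fields, got {len(fields)}: {line!r}")
    database, primary, secondary = fields
    # Transfac stores a dash when there is no secondary identifier.
    return CrossReference(database, primary, None if secondary == "-" else secondary)


if __name__ == "__main__":
    for example in (
        "DR   LISTA; SC00018; ACT1.",
        "DR   HSSP; P00438; 1DOB.",
        "DR   TRANSFAC; T00141; -.",
    ):
        print(parse_dr_line(example))
```

   Mapping the Transfac dash to None is a design choice of this sketch; a
   program that writes DR lines back out would have to restore the dash.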


   2.4  Status of the documentation files

   SWISS-PROT is  distributed with  a large  number of documentation files.
   Some of  these files  have been  available for  a long  time  (the  user
   manual, release  notes, the  various  indices  for  authors,  citations,
   keywords, etc.),  but  many  have  been  created  recently  and  we  are
   continuously adding new files. The following table lists all the
   documents that are currently available or that will be added in the next
   few months.

   USERMAN .TXT   User manual
   RELNOTES.TXT   Release notes
   SHORTDES.TXT   Short description of entries in SWISS-PROT

   JOURLIST.TXT   List of abbreviations for journals cited
   KEYWLIST.TXT   List of keywords in use
   SPECLIST.TXT   List of organism identification codes
   EXPERTS .TXT   List of on-line experts for PROSITE and SWISS-PROT

   ACINDEX .TXT   Accession number index
   AUTINDEX.TXT   Author index
   CITINDEX.TXT   Citation index
   KEYINDEX.TXT   Keyword index
   SPEINDEX.TXT   Species index

   7TMRLIST.TXT   List of 7-transmembrane G-linked receptors entries
   CDLIST  .TXT   CD nomenclature for surface proteins of human leucocytes
   CELEGANS.TXT   Index  of   Caenorhabditis  elegans   entries  and  their
                  corresponding  gene    designations  and  WormPep  cross-
                  references
   DICTY   .TXT   Index  of  Dictyostelium  discoideum  entries  and  their
                  corresponding   gene  designations  and   DictyDB  cross-
                  references
   EC2DTOSP.TXT   Index of  Escherichia coli  Gene-protein database entries
                  referenced in SWISS-PROT
   ECOLI   .TXT   Index of  Escherichia coli  K12 chromosomal  entries  and
                  their corresponding EcoGene cross-reference
   EMBLTOSP.TXT   Index of EMBL Database entries referenced in SWISS-PROT
   EXTRADOM.TXT   Nomenclature of extracellular domains





<PAGE>



   GLYCOSYL.TXT   Index of  glycosyl hydrolases  classified by  families on
                  the basis of sequence similarities [2]
   HOXLIST .TXT   Vertebrate homeobox proteins: nomenclature and index
   HUMCHR21.TXT   Index  of  protein  sequence  entries  encoded  on  human
                  chromosome 21 [1]
   HUMCHRY .TXT   Index  of  protein  sequence  entries  encoded  on  human
                  chromosome Y [1]
   MIMTOSP .TXT   Index of MIM entries referenced in SWISS-PROT
   NOMLIST .TXT   List of nomenclature related references for proteins
   PDBTOSP .TXT   Index of Brookhaven PDB entries referenced in SWISS-PROT
   PLASTID .TXT   List of chloroplast and cyanelle encoded proteins
   POMBE   .TXT   Index of  Schizosaccharomyces pombe entries in SWISS-PROT
                  and their corresponding gene designations [1]
   RESTRIC .TXT   List of restriction enzymes and methylases entries
   RIBOSOMP.TXT   Index of ribosomal proteins classified by families on the
                  basis of sequence similarities [2]
   SALTY   .TXT   Index of  Salmonella typhimurium  LT2 chromosomal entries
                  and their corresponding StyGene cross-references [1]
   SUBTILIS.TXT   Index of  Bacillus subtilis  168 chromosomal  entries and
                  their corresponding SubtiList cross-references [1]
   YEAST   .TXT   Index  of  Saccharomyces  cerevisiae  entries  and  their
                  corresponding gene designations
   YEAST1  .TXT   Yeast Chromosome I entries [1]
   YEAST2  .TXT   Yeast Chromosome II entries
   YEAST3  .TXT   Yeast Chromosome III entries
   YEAST5  .TXT   Yeast Chromosome V entries [1]
   YEAST8  .TXT   Yeast Chromosome VIII entries [1]
   YEAST9  .TXT   Yeast Chromosome IX entries [1]
   YEAST11 .TXT   Yeast Chromosome XI entries

   Notes:

   [1]  New in release 31.
   [2]  Will be available starting with release 32 in June 1995.


   2.5  The ExPASy World-Wide Web server

        2.5.1 Background information

   The World-Wide Web (WWW), which originated at CERN, is a powerful global
   information  system   merging  networked   information   retrieval   and
   hypertext. It  gives access, using hypertext links, to the documents and
   information contained  in all the existing WWW servers around the world,
   as well  as to  the data  obtainable through other information retrieval
   systems like WAIS, Gopher, X500, etc. To access a WWW server, one has to
   run on a local computer a client program (a WWW browser), which displays
   hypertext documents.  The user  can then either request a keyword search
   or jump  to another  document by following a hypertext link. WWW has the
   outstanding advantage  of extending  the hypertext  model to  the  whole
   world (by allowing hypertext jumps to documents anywhere on the internet
   network) and of being device and user-interface independent (browsers
   exist for  a variety  of computers  and user-interfaces,  including Unix




<PAGE>



   workstations running X Windows, Macintoshes and PCs with Microsoft
   Windows).

   The ExPASy  WWW server  allows access, using the user-friendly hypertext
   model, to  the SWISS-PROT,  PROSITE,  ENZYME,  SWISS-2DPAGE  and  SWISS-
   3DIMAGE databases and, through any SWISS-PROT protein sequence entry, to
   other  databases   such  as   EMBL,  REBASE,  FlyBase,  GCRDb,  MaizeDB,
   SubtiList, OMIM, PDB, HSSP, YEPD and Medline. Using a browser that is
   able to display images, one can also remotely access 2D gel image data
   from SWISS-2DPAGE.

   A WWW  server can  be accessed  on  the  internet  through  its  Uniform
   Resource Locator  (URL), the addressing system defined by the WWW model.
   The URL for the ExPASy WWW server is:

                           http://expasy.hcuge.ch/
   or
                            http://129.195.254.61/

   To access a WWW server, you need to run a browser (or client) program on
   your local computer. Browsers exist for a variety of machines and may be
   obtained by  anonymous ftp. ExPASy can be used with any WWW browser, but
   we recommend  either NCSA  Mosaic or  Netscape Navigator.  Both are very
   flexible and powerful browsers with a graphical user interface; they are
   available for Unix boxes using X11/Motif, for Apple Macintoshes and for
   Microsoft Windows. You can get them from various FTP sites, for example:

      ftp.ncsa.uiuc.edu (for Mosaic)
      ftp1.netscape.com (for Netscape)

   For more  information on  the  ExPASy  WWW  server,  you  can  read  the
   following article:

      Appel R.D., Bairoch A., Hochstrasser D.F.
      A new  generation of  information retrieval tools for biologists: the
      example of the ExPASy WWW server.
      Trends Biochem. Sci. 19:258-260(1994).

   Or you can contact Dr. Ron Appel:

      Email: appel@cih.hcuge.ch
      Fax: +41-22-372 61 98


        2.5.2 SWISS-SHOP

   Thanks to the work of Manuel Peitsch from the Geneva Glaxo Institute for
   Molecular Biology,  we can  provide, on ExPASy, an important new service
   called SWISS-SHOP. SWISS-SHOP allows any user of SWISS-PROT to indicate
   which proteins he/she is interested in. This can be done using various
   criteria that can be combined:

   -  By entering  one  or  more  words  that  should  be  present  in  the
      description line;



<PAGE>



   -  By entering one or more species name(s) or taxonomic division(s);
   -  By entering one or more keywords;
   -  By entering one or more author names;
   -  By entering the accession number (or entry name) of a PROSITE pattern
      or a user-defined sequence pattern;
   -  By entering  the accession  number (or  entry name)  of  an  existing
      SWISS-PROT entry or by entering a "private" sequence.

   Every week,  the new  sequences entered  in SWISS-PROT are automatically
   compared with all the criteria that have been defined by the users. If a
   sequence corresponds  to the  selection criteria defined by a user, that
   sequence is sent by electronic mail.
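
   The weekly comparison described above can be pictured with the short
   Python sketch below. It is only an illustration under stated
   assumptions: the Entry and Criteria classes, their fields, and the rule
   that all filled-in criteria must be satisfied together are hypothetical,
   and the pattern and sequence similarity criteria are left out. It does
   not describe how SWISS-SHOP is actually implemented.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass
class Entry:
    """Minimal stand-in for a new SWISS-PROT entry (hypothetical fields)."""
    name: str
    description: str
    species: str
    keywords: Set[str] = field(default_factory=set)
    authors: Set[str] = field(default_factory=set)


@dataclass
class Criteria:
    """Selection criteria registered by one user (hypothetical structure)."""
    email: str
    description_words: Set[str] = field(default_factory=set)
    species: Set[str] = field(default_factory=set)
    keywords: Set[str] = field(default_factory=set)
    authors: Set[str] = field(default_factory=set)


def matches(entry: Entry, criteria: Criteria) -> bool:
    """True if the entry satisfies every criterion the user filled in."""
    # Naive whitespace tokenisation; punctuation is not stripped.
    description = {word.lower() for word in entry.description.split()}
    wanted_words = {word.lower() for word in criteria.description_words}
    # Pair each criterion with whether the entry satisfies it.
    supplied = [
        (wanted_words, bool(wanted_words & description)),
        (criteria.species, entry.species in criteria.species),
        (criteria.keywords, bool(criteria.keywords & entry.keywords)),
        (criteria.authors, bool(criteria.authors & entry.authors)),
    ]
    hits = [hit for wanted, hit in supplied if wanted]
    # An empty set of criteria matches nothing rather than everything.
    return bool(hits) and all(hits)


def weekly_notifications(new_entries: List[Entry],
                         subscriptions: List[Criteria]) -> Dict[str, List[str]]:
    """Map each user's e-mail address to the names of matching new entries."""
    outbox: Dict[str, List[str]] = {}
    for criteria in subscriptions:
        names = [entry.name for entry in new_entries if matches(entry, criteria)]
        if names:
            outbox.setdefault(criteria.email, []).extend(names)
    return outbox
```

   Treating the combined criteria as a logical AND is an assumption of the
   sketch; the text above only states that the criteria can be combined.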


        2.5.3 What else is new on ExPASy

   Since the last release, there have been a number of new developments on
   the ExPASy WWW server. Here are some highlights of these changes:

   -  Access to  the ENZYME data bank has been fully implemented in ExPASy.
      Many different  access options  are allowed.  Hypertext links between
      ENZYME and  SWISS-PROT, PROSITE, MIM and the Japanese Ligand database
      are available.

   -  WWW  links   have  been  implemented  between  SWISS-PROT  and  HSSP,
      SubtiList, and YEPD.

   -  Cross-references from SWISS-PROT to FlyBase now use the links to the
      new server for that database (at Harvard).

   -  The  page   giving  access  to  the  SWISS-PROT  documents  has  been
      completely updated.  A new  page allows  access to  all  of  the  old
      release notes.


   2.6  Important forthcoming change

   In the next release, the RM (Reference Medline) line will be replaced by
   a more 'generic' line called RX (Reference cross-references). The format
   of that line will be:

   RX   bibliographic_database_name; identifier.

   As of the next release, the only "bibliographic_database_name" that will
   be used will be "MEDLINE", and the associated "identifier" is the
   eight-digit Medline Unique Identifier (UID). But it is 'rumored' that
   additional bibliographic databases are interested in being linked to the
   sequence databases.

   Example:

   RM   91002678





<PAGE>



   will be changed to:

   RX   MEDLINE; 91002678.
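
   Programs that read SWISS-PROT entries may want to accept both line
   types during the transition. The Python sketch below shows one possible
   way to rewrite an RM line in the announced RX format, based only on the
   example above. The function name rm_to_rx is an illustrative assumption.

```python
import re

# An RM line carries the line code followed by the eight-digit Medline UID.
_RM_PATTERN = re.compile(r"^RM   (\d{8})\s*$")


def rm_to_rx(line: str) -> str:
    """Rewrite 'RM   <uid>' as 'RX   MEDLINE; <uid>.'; leave other lines as-is."""
    match = _RM_PATTERN.match(line)
    if match is None:
        return line
    return f"RX   MEDLINE; {match.group(1)}."


if __name__ == "__main__":
    print(rm_to_rx("RM   91002678"))
```

   Lines that do not match the RM pattern are returned unchanged, so the
   function can be applied to every line of an entry.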


   2.7  Weekly updates of SWISS-PROT

   Weekly updates of SWISS-PROT are available by anonymous FTP. Three files
   are updated at each update:

   new_seq.dat    Contains all the new entries since the last full release.
   upd_seq.dat    Contains the entries for which the sequence data has been
                  updated since the last release.
   upd_ann.dat    Contains the entries for which one or more annotation
                  fields have been updated since the last release.

   Currently these  files are  available on  the  following  anonymous  ftp
   servers:

   Organization   ExPASy (Geneva University Expert Protein Analysis System)
   Address        expasy.hcuge.ch  (or 129.195.254.61)
   Directory      /databases/swiss-prot/updates

   Organization   National Center for Biotechnology Information (NCBI)
   Address        ncbi.nlm.nih.gov (or 130.14.20.1)
   Directory      /repository/swiss-prot/updates

   Organization   European Bioinformatics Institute (EBI)
   Address        ftp.ebi.ac.uk (or 193.62.196.6)
   Directory      /pub/databases/swissprot/new

   !!! Important notes !!!

   Although we try to follow a regular schedule, we do not promise to
   update these files every week. In some cases two weeks will elapse
   between two updates.

   Due to the current mechanism used to build a release, the entries that
   are provided in these updates are not guaranteed to be error-free. Also,
   for the same reason, new entries do not contain an OC (Organism
   Classification) line.



                            3. ENZYME AND PROSITE

   3.1  The ENZYME data bank

   Release 18.0  of the  ENZYME data bank is distributed with release 31 of
   SWISS-PROT. ENZYME  release 18.0  contains information  relative to 3546
   enzymes.





<PAGE>



   In this release we have added cross-references from the ENZYME data bank
   to the PROSITE data bank document entries that mention specific types of
   enzymes. These cross-references are present in a new line type (PR)
   whose format is:

   PR   PROSITE; PSITE_DOC_AC_NB;

   where 'PSITE_DOC_AC_NB'  is a  PROSITE document  entry accession number.
   Example:

   PR   PROSITE; PDOC00065;
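
   As a rough illustration, the PROSITE document accession number can be
   extracted from a PR line with a few lines of Python. The function name
   parse_pr_line is an assumption made for this sketch, which accepts the
   line with or without the trailing semicolon shown in the example.

```python
_PR_PREFIX = "PR   PROSITE;"


def parse_pr_line(line: str) -> str:
    """Return the PROSITE document accession number held in a PR line."""
    if not line.startswith(_PR_PREFIX):
        raise ValueError(f"not a PR line: {line!r}")
    # Remove the prefix, surrounding blanks and an optional closing semicolon.
    accession = line[len(_PR_PREFIX):].strip().rstrip(";").strip()
    if not accession:
        raise ValueError(f"missing accession number: {line!r}")
    return accession


if __name__ == "__main__":
    print(parse_pr_line("PR   PROSITE; PDOC00065;"))
```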


   3.2  The PROSITE data bank

   Release 12.2 of the PROSITE data bank is distributed with release 31 of
   SWISS-PROT. Release 12.2 contains 785 documentation chapters that
   describe 1029 different patterns, rules and profiles/matrices. Release
   12.2 does not really represent a new release; the only changes between
   releases 12.1 and 12.2 are updates to the pointers to the SWISS-PROT
   entries whose names have been modified between releases 30 and 31. The
   next release of PROSITE (13.0) will be distributed with release 32 of
   SWISS-PROT.




                             WE NEED YOUR HELP !

   We welcome feedback from our users. We would especially appreciate it if
   you notified us when you find that sequences belonging to your field of
   expertise are missing from the data bank. We would also like to be
   notified about annotations to be updated if, for example, the function
   of a protein has been clarified or if new post-translational information
   has become available.



<PAGE>



                         APPENDIX A: SOME STATISTICS



   A.1  Amino acid composition

        A.1.1  Composition in percent for the complete data bank

   Ala (A) 7.57   Gln (Q) 4.02   Leu (L) 9.24   Ser (S) 7.19
   Arg (R) 5.21   Glu (E) 6.29   Lys (K) 5.88   Thr (T) 5.80
   Asn (N) 4.50   Gly (G) 6.91   Met (M) 2.36   Trp (W) 1.28
   Asp (D) 5.30   His (H) 2.24   Phe (F) 4.02   Tyr (Y) 3.21
   Cys (C) 1.73   Ile (I) 5.64   Pro (P) 4.98   Val (V) 6.51

   Asx (B) 0.002  Glx (Z) 0.002  Xaa (X) 0.02


        A.1.2  Classification of the amino acids by their frequency

   Leu, Ala, Ser, Gly, Val, Glu, Lys, Thr, Ile, Asp, Arg, Pro, Asn, Gln,
   Phe, Tyr, Met, His, Cys, Trp
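
   Statistics of this kind can be recomputed from any set of sequences. The
   Python sketch below counts residues over a collection of sequences in
   one-letter code, converts the counts to percentages and ranks the
   residues by decreasing frequency. The function names are illustrative
   assumptions, and the sketch is not the program used to produce the
   figures above.

```python
from collections import Counter
from typing import Dict, Iterable, List


def composition_percent(sequences: Iterable[str]) -> Dict[str, float]:
    """Percentage of each one-letter residue code over all sequences."""
    counts: Counter = Counter()
    for sequence in sequences:
        counts.update(sequence.upper())
    total = sum(counts.values())
    if total == 0:
        return {}
    return {residue: 100.0 * n / total for residue, n in counts.items()}


def rank_by_frequency(composition: Dict[str, float]) -> List[str]:
    """Residues sorted from most to least frequent (ties: alphabetical)."""
    return sorted(composition, key=lambda residue: (-composition[residue], residue))


if __name__ == "__main__":
    toy = composition_percent(["LLLA", "LS"])  # toy data, not real entries
    print(rank_by_frequency(toy))
```

   The alphabetical tie-break is a choice of this sketch; residues with
   equal percentages in the table above may be listed in another order.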



   A.2  Distribution of the sequences by their organism of origin

   Total number of species represented in this release of SWISS-PROT: 4714

        A.2.1 Table of the frequency of occurrence of species

        Species represented 1x: 2139
                            2x:  756
                            3x:  426
                            4x:  264
                            5x:  191
                            6x:  191
                            7x:  117
                            8x:   80
                            9x:  102
                           10x:   46
                       11- 20x:  185
                       21- 50x:  128
                       51-100x:   43
                         >100x:   45
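
   A table of this form can be derived from the list of source species, one
   item per entry. The Python sketch below first counts the entries per
   species and then counts how many species fall into each occurrence
   class. The function names and the unpadded class labels are assumptions
   made for illustration.

```python
from collections import Counter
from typing import Dict, Iterable


def occurrence_bin(n: int) -> str:
    """Label of the occurrence class for a species with n entries."""
    if n < 1:
        raise ValueError("a represented species has at least one entry")
    if n <= 10:
        return f"{n}x"
    if n <= 20:
        return "11-20x"
    if n <= 50:
        return "21-50x"
    if n <= 100:
        return "51-100x"
    return ">100x"


def occurrence_table(species_per_entry: Iterable[str]) -> Dict[str, int]:
    """Number of species in each occurrence class."""
    entries_per_species = Counter(species_per_entry)
    return dict(Counter(occurrence_bin(n) for n in entries_per_species.values()))
```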



<PAGE>



        A.2.2  Table of the most represented species

    Number   Frequency          Species
         1        3151          Escherichia coli
         2        3144          Baker's yeast (Saccharomyces cerevisiae)
         3        3067          Human
         4        1843          Mouse
         5        1683          Rat
         6        1083          Bacillus subtilis
         7         735          Bovine
         8         711          Fruit fly (Drosophila melanogaster)
         9         690          Caenorhabditis elegans
        10         573          Chicken
        11         568          Salmonella typhimurium
        12         449          African clawed frog (Xenopus laevis)
        13         403          Rabbit
        14         357          Pig
        15         279          Fission yeast (Schizosaccharomyces pombe)
        16         275          Bacteriophage T4
        17         271          Arabidopsis thaliana (Mouse-ear cress)
        18         258          Maize
        19         251          Vaccinia virus (strain Copenhagen)
        20         219          Rice
        21         214          Pseudomonas aeruginosa
        22         199          Slime mold (Dictyostelium discoideum)
        23         195          Tobacco
        24         193          Human cytomegalovirus (strain AD169)
        25         183          Vaccinia virus (strain WR)
        26         180          Pea
        27         173          Wheat
        28         168          Barley
        29         154          Staphylococcus aureus
        30         153          Dog
        31         151          Marchantia polymorpha (Liverwort)
                   151          Neurospora crassa
        33         147          Soybean
        34         146          Variola virus
        35         144          Pseudomonas putida
                   144          Rhodobacter capsulatus
        37         141          Sheep
        38         135          Spinach
        39         131          Klebsiella pneumoniae
        40         120          Bacillus stearothermophilus
        41         117          Tomato
        42         116          Agrobacterium tumefaciens
        43         111          Potato
        44         107          Rhizobium meliloti
        45         102          Lactococcus lactis (subsp. lactis)



<PAGE>



   A.3  Distribution of the sequences by size

               From   To  Number             From   To   Number
                  1-  50    2381             1001-1100      404
                 51- 100    4198             1101-1200      292
                101- 150    5781             1201-1300      214
                151- 200    4200             1301-1400      140
                201- 250    3732             1401-1500      127
                251- 300    3303             1501-1600       70
                301- 350    3099             1601-1700       57
                351- 400    3136             1701-1800       52
                401- 450    2365             1801-1900       58
                451- 500    2477             1901-2000       38
                501- 550    1775             2001-2100       21
                551- 600    1247             2101-2200       51
                601- 650     898             2201-2300       55
                651- 700     682             2301-2400       22
                701- 750     629             2401-2500       26
                751- 800     497             >2500          132
                801- 850     375
                851- 900     405
                901- 950     281
                951-1000     250
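
   The size classes above are 50 residues wide up to 1000 residues, 100
   residues wide up to 2500 residues, and open-ended beyond that. The
   Python sketch below assigns a sequence length to its class using the
   same boundaries. The function names and the unpadded labels are
   assumptions made for illustration.

```python
from collections import Counter
from typing import Dict, Iterable


def size_bin(length: int) -> str:
    """Label of the size class containing a sequence of the given length."""
    if length < 1:
        raise ValueError("sequence length must be at least 1")
    if length > 2500:
        return ">2500"
    width = 50 if length <= 1000 else 100
    # Classes start at 1, 51, 101, ... so shift by one before dividing.
    low = ((length - 1) // width) * width + 1
    return f"{low}-{low + width - 1}"


def size_histogram(lengths: Iterable[int]) -> Dict[str, int]:
    """Number of sequences in each size class."""
    return dict(Counter(size_bin(length) for length in lengths))
```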


   A.4  Longest sequences

   The longest sequences (>=4000 residues) are listed here:

                               HTS1_COCCA  5217
                               FAT_DROME   5147
                               RYNR_RABIT  5037
                               RYNR_HUMAN  5032
                               RYNC_RABIT  4969
                               DYHC_DICDI  4725
                               DYHC_RAT    4644
                               DYHC_DROME  4639
                               APB_HUMAN   4563
                               APOA_HUMAN  4548
                               RRPA_CVMJH  4488
                               DYHC_TRIGR  4466
                               DYHC_ANTCR  4466
                               GRSB_BACBR  4451
                               PKSK_BACSU  4447
                               PKSL_BACSU  4427
                               PLEC_RAT    4140
                               DYHC_YEAST  4092
                               RRPA_CVH22  4085




<PAGE>



   A.5  List of the most cited journals in SWISS-PROT

   Citations            Journal abbreviation

   4517                 J. BIOL. CHEM.
   3097                 NUCLEIC ACIDS RES.
   2886                 PROC. NATL. ACAD. SCI. U.S.A.
   1869                 J. BACTERIOL.
   1571                 FEBS LETT.
   1543                 GENE
   1454                 EUR. J. BIOCHEM.
   1313                 EMBO J.
   1263                 NATURE
   1215                 BIOCHEM. BIOPHYS. RES. COMMUN.
   1183                 BIOCHEMISTRY
    933                 J. MOL. BIOL.
    931                 BIOCHIM. BIOPHYS. ACTA
    914                 CELL
    860                 MOL. CELL. BIOL.
    738                 MOL. GEN. GENET.
    689                 VIROLOGY
    650                 BIOCHEM. J.
    647                 PLANT MOL. BIOL.
    566                 SCIENCE
    539                 J. BIOCHEM.
    509                 MOL. MICROBIOL.
    443                 J. VIROL.
    392                 J. GEN. VIROL.
    284                 J. CELL BIOL.
    264                 GENOMICS
    253                 GENES DEV.
    249                 BIOL. CHEM. HOPPE-SEYLER
    229                 CURR. GENET.
    218                 ARCH. BIOCHEM. BIOPHYS.
    214                 YEAST
    213                 HOPPE-SEYLER'S Z. PHYSIOL. CHEM.
    212                 J. IMMUNOL.
    190                 MOL. BIOCHEM. PARASITOL.
    184                 J. GEN. MICROBIOL.
    173                 MOL. ENDOCRINOL.
    164                 INFECT. IMMUN.
    156                 J. CLIN. INVEST.
    149                 ONCOGENE
    143                 PLANT PHYSIOL.
    142                 DNA
    137                 FEMS MICROBIOL. LETT.
    132                 HUM. MOL. GENET.
    129                 J. EXP. MED.
    120                 AM. J. HUM. GENET.
    117                 J. MOL. EVOL.
    111                 GENETICS
    102                 BLOOD
    101                 AGRIC. BIOL. CHEM.




<PAGE>


           APPENDIX B: RELATIONSHIPS BETWEEN BIOMOLECULAR DATABASES

   The current  status of the relationships (cross-references) between some
   biomolecular databases is shown in the following schematic:

                         ***********************
******************       *  EMBL Nucleotide    *       **********************
* EPD [Euk.Prom] * <---> *  Sequence Data      * <---- * ECD [E. coli map]  *
******************       *  Library      [EBI] *       **********************
                         ***********************
                          ^  ^ ^  ^  ^  ^  ^
******************        |  | |  I  |  |  |
* FlyBase        * <------+  | |  I  |  |  |           **********************
******************        |  | |  I  |  |  +---------> * GCRDb [7TM recep.] *
                          |  | |  I  |  |  |           **********************
******************        |  | |  I  |  |  |
* SubtiList      * <---------+ |  I  |  |  |           **********************
******************        |  | |  I  |  +------------> * EcoGene [E.coli]   *
                          |  | |  I  |  |  |           **********************
******************        |  | |  I  |  |  |
* MaizeDb        * <-----------+  I  |  |  |           **********************
******************        |  | |  I  +---------------> * LISTA (Yeast)      *
                          |  | |  I  |  |  |           **********************
******************        |  | |  I  |  |  |
* WormPep        *        |  | |  I  |  |  |           **********************
* [C.elegans]    * <----+ |  | |  I  |  |  |  +------> * DictyDB [D.disco.] *
******************      | |  | |  I  |  |  |  |        **********************
                        | |  | |  I  |  |  |  |
******************      | v  v v  v  v  v  v  v        **********************
* REBASE         *      ***********************        * ENZYME [Nomencl.]  *
* [Restriction   * <--- *  SWISS-PROT         * <----- **********************
*  enzymes]      *      *  Protein Sequence   *            |
******************      *  Data Bank          *            v
                        ***********************        **********************
******************      ^ ^ ^  ^ ^  ^ | ^ ^ |          * OMIM   [Diseases]  *
* StyGene        *      | | |  | |  | | | | +--------> **********************
* [S.Typhimurium]* <----+ | |  | |  | | | |
******************        | |  | |  | | | |            **********************
                          | |  | |  | | | +----------> * ECO2DBASE     [2D] *
******************        | |  | |  | | |              **********************
* Transfac       * <------+ |  | |  | | |
******************          |  | |  | | |              **********************
                            |  | |  | | +------------> * SWISS-2DPAGE  [2D] *
******************          |  | |  | |                **********************
* PROSITE        * <--------+  | |  | |
* [Patterns]     *             | |  | |                **********************
******************             | |  | +--------------> * Aarhus/Ghent  [2D] *
             |                 | |  |                  **********************
             |                 | |  |
             |                 | |  +----------------> **********************
             |                 | |                     * YEPD (Yeast)  [2D] *
             |                 | +-----------------+   **********************
             |                 v                   |
             |          ***********************    +-> **********************
             +--------> * PDB [3D structures] * <----- * HSSP (3D simil.)   *
                        ***********************        **********************


<PAGE>
  

Swiss-Prot release 30.0

Published October 1, 1994

                    SWISS-PROT RELEASE 30.0 RELEASE NOTES


                               1. INTRODUCTION

   1.1  Evolution

   Release 30.0  of SWISS-PROT  contains 40292 sequence entries, comprising
   14'147'368 amino acids abstracted from 37887 references. This represents
   an increase  of 5.1% over release 29. The recent growth of the data bank
   is summarized below.

   Release    Date   Number of entries     Nb of amino acids

   2.0        09/86               3939               900 163
   3.0        11/86               4160               969 641
   4.0        04/87               4387             1 036 010
   5.0        09/87               5205             1 327 683
   6.0        01/88               6102             1 653 982
   7.0        04/88               6821             1 885 771
   8.0        08/88               7724             2 224 465
   9.0        11/88               8702             2 498 140
   10.0       03/89              10008             2 952 613
   11.0       07/89              10856             3 265 966
   12.0       10/89              12305             3 797 482
   13.0       01/90              13837             4 347 336
   14.0       04/90              15409             4 914 264
   15.0       08/90              16941             5 486 399
   16.0       11/90              18364             5 986 949
   17.0       02/91              20024             6 524 504
   18.0       05/91              20772             6 792 034
   19.0       08/91              21795             7 173 785
   20.0       11/91              22654             7 500 130
   21.0       03/92              23742             7 866 596
   22.0       05/92              25044             8 375 696
   23.0       08/92              26706             9 011 391
   24.0       12/92              28154             9 545 427
   25.0       04/93              29955            10 214 020
   26.0       07/93              31808            10 875 091
   27.0       10/93              33329            11 484 420
   28.0       02/94              36000            12 496 420
   29.0       06/94              38303            13 464 008
   30.0       10/94              40292            14 147 368
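
   As a quick check of the growth figure quoted above, the following
   minimal Python sketch recomputes the increase between releases 29.0 and
   30.0 from the table. The function name is illustrative and not part of
   any SWISS-PROT tool. The quoted 5.1% matches the growth in amino acids;
   the number of entries grew slightly more.

```python
def percent_increase(old: int, new: int) -> float:
    """Return the growth from old to new as a percentage of old."""
    return (new - old) / old * 100.0


# Figures copied from the table above (releases 29.0 and 30.0).
entries_29, entries_30 = 38303, 40292
residues_29, residues_30 = 13_464_008, 14_147_368

entry_growth = percent_increase(entries_29, entries_30)
residue_growth = percent_increase(residues_29, residues_30)
mean_length = residues_30 / entries_30

print(f"entries:     +{entry_growth:.1f}%")
print(f"amino acids: +{residue_growth:.1f}%")
print(f"mean sequence length in release 30.0: {mean_length:.0f} residues")
```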



      2. DESCRIPTION OF THE CHANGES MADE TO SWISS-PROT SINCE RELEASE 29

   2.1  Sequences and annotations

   About 2005 sequences have been added since release 29, the sequence data
   of 296 existing entries has been updated, and the annotations of 5260
   entries have been revised.

   We are continuing to clean up the various representations of domains in
   the feature lines (especially the usage of the feature keys "DOMAIN",
   "REPEAT", "DNA_BIND", and "SITE"). We have also undertaken an overall
   revision of the CC topics "SUBCELLULAR LOCATION", "SUBUNIT", and
   "CAUTION".


   2.2  What's happening with the model organisms

   As we announced in the last three releases, we have selected a number of
   organisms that are the target of genome sequencing and/or mapping
   projects and for which we intend to:

   -  Be as complete as possible. All sequences available at a given time
      should be immediately included in SWISS-PROT. This also includes
      sequence corrections and updates.
   -  Provide a high level of annotations.
   -  Provide cross-references to specialized database(s) that contain,
      among other data, some genetic information about the genes that code
      for these proteins.
   -  Provide specific indices or documents.

   What was  done since  the last  release or  in preparation  for the next
   release:

   -  We have  added Bacillus  subtilis as  a model  organism. We will soon
      link SWISS-PROT to the SubtiList database currently being designed by
      Ivan Moszer of the Pasteur Institute in Paris.
   -  The next  release of  LISTA will  include accession  numbers for each
      gene entry;  we will  therefore be able to cross-reference SWISS-PROT
      to LISTA.

   Here is the current status of the model organisms:

   Organism        Database                    Index file       Number of
                   cross-referenced                             sequences
   --------------  --------------------------  --------------   ---------
   B.subtilis      SubtiList (in preparation)  In preparation         777
   C.elegans       WormPep                     CELEGANS.TXT           688
   D.discoideum    DictyDB                     DICTY.TXT              198
   D.melanogaster  FlyBase                     In preparation         647
   E.coli          EcoGene                     ECOLI.TXT             2901
   H.sapiens       MIM                         MIMTOSP.TXT           2914
   S.cerevisiae    LISTA (in preparation)      YEAST.TXT             2287


   2.3  Changes in the CC line

   We have introduced in this release a new 'topic' for the comments (CC)
   line-type: DOMAIN. This topic is used to describe the general domain
   structure of a protein. It is intended for general statements on the
   domain organization of a specific protein and supplements the domain
   information available in the feature table.

   Examples of its usage:

   CC   -!- DOMAIN: CONTAINS A COILED-COIL DOMAIN ESSENTIAL FOR VESICULAR
   CC       TRANSPORT AND A DISPENSABLE C-TERMINAL REGION.

   CC   -!- DOMAIN: THE B CHAIN IS COMPOSED OF TWO DOMAINS, EACH DOMAIN
   CC       CONSISTS OF 3 HOMOLOGOUS SUBDOMAINS (ALPHA, BETA, GAMMA).


   The topic "ALTERNATIVE SPLICING" has been renamed to "ALTERNATIVE
   PRODUCTS", as it is used to describe the existence of related protein
   sequence(s) produced either by alternative splicing of the same gene(s)
   or by the use of alternative initiation codons.

   Examples of its usage:

   CC   -!- ALTERNATIVE PRODUCTS: SKELETAL MUSCLE AND FIBROBLAST
   CC       TROPOMYOSINS ARE OBTAINED BY ALTERNATIVE MRNA SPLICING.

   CC   -!- ALTERNATIVE PRODUCTS: USING ALTERNATIVE INITIATION CODONS IN
   CC       THE SAME READING FRAME, THE GENE TRANSLATES INTO THREE
   CC       ISOZYMES: ALPHA, BETA AND BETA'.
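
   The CC examples in this section share one layout: a block opens with
   '-!-' followed by the topic name and a colon, and continuation lines
   carry the rest of the text. The Python sketch below groups such lines
   into (topic, text) pairs. It is an illustration only; the function name
   is hypothetical, and it assumes that the topic name never contains a
   colon and that the text starts at the sixth column, as in the examples
   above.

```python
def parse_cc_topics(lines):
    """Group CC lines into (topic, text) pairs; '-!-' opens a new block."""
    topics = []
    for line in lines:
        if not line.startswith("CC"):
            continue  # ignore any other line type
        body = line[5:].rstrip()
        if body.startswith("-!-"):
            topic, _, text = body[3:].partition(":")
            topics.append([topic.strip(), text.strip()])
        elif topics:
            # Continuation line: append to the block opened most recently.
            topics[-1][1] += " " + body.strip()
    return [tuple(t) for t in topics]


example = [
    "CC   -!- DOMAIN: CONTAINS A COILED-COIL DOMAIN ESSENTIAL FOR VESICULAR",
    "CC       TRANSPORT AND A DISPENSABLE C-TERMINAL REGION.",
    "CC   -!- ALTERNATIVE PRODUCTS: SKELETAL MUSCLE AND FIBROBLAST",
    "CC       TROPOMYOSINS ARE OBTAINED BY ALTERNATIVE MRNA SPLICING.",
]
topics = parse_cc_topics(example)
for topic, text in topics:
    print(f"{topic}: {text}")
```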


   2.4  Changes in the DR line

   We  have   added  cross-references   from  SWISS-PROT   to   the   Yeast
   Electrophoresis Protein  Database (YEPD) from the Quest Protein Database
   Center of  Cold Spring  Harbor Laboratory.  These  cross-references  are
   present in the DR lines:

   Data bank identifier:    YEPD
   Primary identifier  :    Protein spot alphanumeric designation.
   Secondary identifier:    None; a dash '-' is stored in that field
   Example             :    DR   YEPD; 4270; -.
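
   A DR line such as the example above can be split into its three fields
   with a few lines of Python. This is a sketch under the assumption that
   every DR line has exactly three semicolon-separated fields and ends with
   a period, as in the YEPD example; the class and function names are
   hypothetical.

```python
from typing import NamedTuple


class CrossReference(NamedTuple):
    database: str   # data bank identifier, e.g. YEPD
    primary: str    # primary identifier
    secondary: str  # secondary identifier, '-' when there is none


def parse_dr_line(line: str) -> CrossReference:
    """Split one DR line into its three fields."""
    if not line.startswith("DR"):
        raise ValueError("not a DR line")
    text = line[2:].strip()
    if text.endswith("."):
        text = text[:-1]  # drop the terminating period
    fields = [field.strip() for field in text.split(";")]
    if len(fields) != 3:
        raise ValueError("expected three fields in a DR line")
    return CrossReference(*fields)


xref = parse_dr_line("DR   YEPD; 4270; -.")
print(xref)
```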


   2.5  Status of the documentation files

   SWISS-PROT is distributed with a large number of documentation files.
   Some of these files have been available for a long time (the user
   manual, release notes, the various indices for authors, citations,
   keywords, etc.), but many have been created recently and we are
   continuously adding new files. The following table lists all the
   documents that are currently available or that will be added in the next
   few months.

   USERMAN .TXT   User manual
   RELNOTES.TXT   Release notes
   SHORTDES.TXT   Short description of entries in SWISS-PROT
   JOURLIST.TXT   List of abbreviations for journals cited
   KEYWLIST.TXT   List of keywords in use
   SPECLIST.TXT   List of organism identification codes
   EXPERTS .TXT   List of on-line experts for PROSITE and SWISS-PROT



   ACINDEX .TXT   Accession number index
   AUTINDEX.TXT   Author index
   CITINDEX.TXT   Citation index
   KEYINDEX.TXT   Keyword index
   SPEINDEX.TXT   Species index

   7TMRLIST.TXT   List of 7-transmembrane G-linked receptor entries
   CDLIST  .TXT   CD nomenclature for surface proteins of human leucocytes
   CELEGANS.TXT   Index  of   Caenorhabditis  elegans   entries  and  their
                  corresponding  gene   designations  and   WormPep  cross-
                  references
   DICTY   .TXT   Index  of  Dictyostelium  discoideum  entries  and  their
                  corresponding   gene   designations  and  DictyDB  cross-
                  references
   EC2DTOSP.TXT   Index of  Escherichia coli  Gene-protein database entries
                  referenced in SWISS-PROT
   ECOLI   .TXT   Index of  Escherichia coli  K12 chromosomal  entries  and
                  their corresponding EcoGene cross-reference [4]
   EMBLTOSP.TXT   Index of EMBL Database entries referenced in SWISS-PROT
   EXTRADOM.TXT   Nomenclature of extracellular domains [1]
   GLYCOSYL.TXT   Index of  glycosyl hydrolases  classified by  families on
                  the basis of sequence similarities [2]
   HOXLIST .TXT   Vertebrate homeobox proteins: nomenclature and index
   MIMTOSP .TXT   Index of MIM entries referenced in SWISS-PROT
   NOMLIST .TXT   List of nomenclature related references for proteins
   PDBTOSP .TXT   Index of Brookhaven PDB entries referenced in SWISS-PROT
   PLASTID .TXT   List of chloroplast and cyanelle encoded proteins
   RESTRIC .TXT   List of restriction enzyme and methylase entries
   RIBOSOMP.TXT   Index of ribosomal proteins classified by families on the
                  basis of sequence similarities [2]
   YEAST   .TXT   Index  of  Saccharomyces  cerevisiae  entries  and  their
                  corresponding gene designations
   YEAST2  .TXT   Yeast Chromosome II entries [1]
   YEAST3  .TXT   Yeast Chromosome III entries [1]
   YEAST8  .TXT   Yeast Chromosome VIII entries [2]
   YEAST11 .TXT   Yeast Chromosome XI entries

   Notes:

   [1]  New in release 30.
   [2]  Will be available starting with release 31 in February 1995.


   2.6  The ExPASy World-Wide Web server

   The World-Wide Web (WWW), which originated at CERN, is a powerful global
   information system merging networked information retrieval and
   hypertext. It gives access, using hypertext links, to the documents and
   information contained in all the existing WWW servers around the world,
   as well as to the data obtainable through other information retrieval
   systems like WAIS, Gopher, X500, etc. To access a WWW server, one has to
   run on a local computer a client program (a WWW browser), which displays
   hypertext documents. The user can then either request a keyword search
   or jump to another document by following a hypertext link. WWW has the
   outstanding advantages of extending the hypertext model to the whole
   world (by allowing hypertext jumps to documents anywhere on the
   Internet) and of being device and user-interface independent (browsers
   exist for a variety of computers and user interfaces, including Unix
   workstations running X Windows, Macintoshes and PCs with Microsoft
   Windows).

   The ExPASy WWW server allows access, using the user-friendly hypertext
   model, to the SWISS-PROT, PROSITE, SWISS-2DPAGE and SWISS-3DIMAGE
   databases and, through any SWISS-PROT protein sequence entry, to other
   databases such as EMBL, PROSITE, REBASE, FlyBase, GCRDb, MaizeDB, OMIM,
   PDB and Medline. Using a browser that can display images, one can also
   remotely access 2D gel image data from SWISS-2DPAGE.

   A WWW  server can  be accessed  on  the  internet  through  its  Uniform
   Resource Locator  (URL), the addressing system defined by the WWW model.
   The URL for the ExPASy WWW server is:

                           http://expasy.hcuge.ch/
   or
                            http://129.195.254.61/

   To access a WWW server, you need to run a browser (or client) program on
   your local computer. Browsers exist for a variety of machines and may be
   obtained by anonymous ftp. ExPASy can be used with any WWW browser, but
   we recommend NCSA Mosaic. It is a very flexible and powerful browser
   with a graphical user interface, available for Unix boxes using
   X11/Motif, for Apple Macintoshes and for Microsoft Windows. You can get
   it from the FTP site: ftp.ncsa.uiuc.edu.

   To access  all the  data available  from SWISS-2DPAGE,  the user's local
   computer needs  to run  an image  viewing program.  For most browsers on
   Unix workstations  the default  program is  xv, a shareware application.
   Similar Windows  or Apple  shareware or  public domain  applications are
   also available.

   For more  information on  the  ExPASy  WWW  server,  you  can  read  the
   following article:

      Appel R.D., Bairoch A., Hochstrasser D.F.
      A new  generation of  information retrieval tools for biologists: the
      example of the ExPASy WWW server.
      Trends Biochem. Sci. 19:258-260(1994).


   Or you can contact Dr. Ron Appel:

      Email: appel@cih.hcuge.ch
      Fax: +41-22-372 61 98








   2.7  Weekly updates of SWISS-PROT

   Weekly updates of SWISS-PROT are available by anonymous FTP. Three files
   are updated at each update:

   new_seq.dat    Contains all the new entries since the last full release.
   upd_seq.dat    Contains the entries for which the sequence data has been
                  updated since the last release.
   upd_ann.dat    Contains the  entries for  which one  or more  annotation
                  fields have been updated since the last release.

   Currently these  files are  available on  the  following  anonymous  ftp
   servers:

   Organization   ExPASy (Geneva University Expert Protein Analysis System)
   Address        expasy.hcuge.ch  (or 129.195.254.61)
   Directory      /databases/swiss-prot/updates

   Organization   National Center for Biotechnology Information (NCBI)
   Address        ncbi.nlm.nih.gov (or 130.14.20.1)
   Directory      /repository/swiss-prot/updates

   Organization   EBI ftp server
   Address        ftp.ebi.ac.uk (193.62.196.6)
   Directory      /pub/databases/swissprot/new

   !!! Important notes !!!

   Although we try to follow a regular schedule, we do not promise to
   update these files every week. In some cases two weeks will elapse
   between two updates.

   Due to the current mechanism used to build a release, the entries that
   are provided in these updates are not guaranteed to be error free. Also,
   for the  same reason,  new  entries  do  not  contain  an  OC  (Organism
   Classification) line.
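
   Because new entries in the weekly updates lack an OC line, it can be
   useful to detect them before loading an update file. The Python sketch
   below splits flat-file text into entries and reports those without an
   OC line. It assumes the usual SWISS-PROT flat-file layout, in which each
   entry starts with an ID line and ends with a '//' line (this layout is
   described in the user manual, not in these notes). The two sample
   entries and all function names are made up for the illustration.

```python
def split_entries(text):
    """Split flat-file text into entries; a line starting with '//' ends one."""
    entries, current = [], []
    for line in text.splitlines():
        if line.startswith("//"):
            if current:
                entries.append(current)
            current = []
        elif line.strip():
            current.append(line)
    if current:
        entries.append(current)  # tolerate a missing final terminator
    return entries


def entry_name(entry):
    """Return the entry name taken from the ID line, or None if absent."""
    for line in entry:
        if line.startswith("ID"):
            return line.split()[1]
    return None


def entries_without_oc(entries):
    """List the names of the entries that have no OC line."""
    return [entry_name(entry) for entry in entries
            if not any(line.startswith("OC") for line in entry)]


# Hypothetical sample data, not real SWISS-PROT entries.
sample = """ID   TEST1_HUMAN     STANDARD;      PRT;    10 AA.
OS   HOMO SAPIENS (HUMAN).
OC   EUKARYOTA; METAZOA.
//
ID   TEST2_ECOLI     STANDARD;      PRT;    12 AA.
OS   ESCHERICHIA COLI.
//
"""
missing = entries_without_oc(split_entries(sample))
print(missing)
```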



                            3. ENZYME AND PROSITE

   3.1  The ENZYME data bank

   Release 17.0  of the  ENZYME data bank is distributed with release 30 of
   SWISS-PROT. ENZYME  release 17.0  contains information  relative to 3546
   enzymes.

   3.2  The PROSITE data bank

   Release 12.1 of the PROSITE data bank is distributed with release 30 of
   SWISS-PROT. Release 12.1 contains 785 documentation chapters that
   describe 1029 different patterns, rules and profiles/matrices. Release
   12.1 does not really represent a new release; the only changes between
   releases 12.0 and 12.1 are updates to the pointers to the SWISS-PROT
   entries whose names have been modified between releases 29 and 30. The
   next release of PROSITE (13.0) will be distributed with release 31 of
   SWISS-PROT.

                             WE NEED YOUR HELP !

   We welcome feedback from our users. We would especially appreciate being
   notified if you find that sequences belonging to your field of
   expertise are missing from the data bank. We would also like to be
   notified about annotations to be updated, if, for example, the function
   of a protein has been clarified or if new post-translational information
   has become available.


                         APPENDIX A: SOME STATISTICS



   A.1  Amino acid composition

        A.1.1  Composition in percent for the complete data bank

   Ala (A) 7.60   Gln (Q) 4.03   Leu (L) 9.22   Ser (S) 7.15
   Arg (R) 5.23   Glu (E) 6.28   Lys (K) 5.84   Thr (T) 5.81
   Asn (N) 4.48   Gly (G) 6.95   Met (M) 2.36   Trp (W) 1.28
   Asp (D) 5.29   His (H) 2.24   Phe (F) 4.01   Tyr (Y) 3.21
   Cys (C) 1.76   Ile (I) 5.61   Pro (P) 5.00   Val (V) 6.52

   Asx (B) 0.005  Glx (Z) 0.005  Xaa (X) 0.02


        A.1.2  Classification of the amino acids by their frequency

   Leu, Ala, Ser, Gly, Val, Glu, Lys, Thr, Ile, Asp, Arg, Pro, Asn, Gln,
   Phe, Tyr, Met, His, Cys, Trp
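
   The ranking in A.1.2 follows directly from the percentages in A.1.1. As
   a small illustration, the Python sketch below sorts the 20 standard
   amino acids by the frequencies listed above and also sums them; the
   variable names are arbitrary. The sum comes out slightly below 100,
   which is presumably an effect of rounding and of the ambiguous codes
   (B, Z, X) being listed separately.

```python
# Percentages copied from table A.1.1 (standard amino acids only).
composition = {
    "Ala": 7.60, "Arg": 5.23, "Asn": 4.48, "Asp": 5.29, "Cys": 1.76,
    "Gln": 4.03, "Glu": 6.28, "Gly": 6.95, "His": 2.24, "Ile": 5.61,
    "Leu": 9.22, "Lys": 5.84, "Met": 2.36, "Phe": 4.01, "Pro": 5.00,
    "Ser": 7.15, "Thr": 5.81, "Trp": 1.28, "Tyr": 3.21, "Val": 6.52,
}

# Most frequent residue first, as in A.1.2.
ranked = sorted(composition, key=composition.get, reverse=True)
total = sum(composition.values())

print(", ".join(ranked))
print(f"sum of the listed percentages: {total:.2f}")
```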



   A.2  Distribution of the sequences by their organism of origin

   Total number of species represented in this release of SWISS-PROT: 4550

        A.2.1 Table of the frequency of occurrence of species

        Species represented 1x: 2045
                            2x:  740
                            3x:  414
                            4x:  255
                            5x:  187
                            6x:  192
                            7x:  115
                            8x:   80
                            9x:   95
                           10x:   44
                       11- 20x:  175
                       21- 50x:  122
                       51-100x:   42
                         >100x:   44















        A.2.2  Table of the most represented species

    Number   Frequency          Species
         1        2914          Human
         2        2901          Escherichia coli
         3        2287          Baker's yeast (Saccharomyces cerevisiae)
         4        1748          Mouse
         5        1610          Rat
         6         777          Bacillus subtilis
         7         723          Bovine
         8         688          Caenorhabditis elegans
         9         647          Fruit fly (Drosophila melanogaster)
        10         551          Chicken
        11         493          Salmonella typhimurium
        12         432          African clawed frog (Xenopus laevis)
        13         393          Rabbit
        14         347          Pig
        15         251          Arabidopsis thaliana (Mouse-ear cress)
                   251          Vaccinia virus (strain Copenhagen)
        17         246          Maize
        18         224          Fission yeast (Schizosaccharomyces pombe)
        19         210          Rice
        20         200          Bacteriophage T4
        21         199          Pseudomonas aeruginosa
        22         198          Slime mold (Dictyostelium discoideum)
        23         193          Human cytomegalovirus (strain AD169)
        24         184          Tobacco
        25         183          Vaccinia virus (strain WR)
        26         176          Pea
        27         170          Wheat
        28         163          Barley
        29         151          Marchantia polymorpha (Liverwort)
        30         147          Dog
        31         146          Variola virus
        32         141          Staphylococcus aureus
                   141          Sheep
        34         139          Soybean
        35         137          Pseudomonas putida
                   137          Rhodobacter capsulatus
        37         133          Spinach
        38         130          Neurospora crassa
        39         128          Klebsiella pneumoniae
        40         117          Bacillus stearothermophilus
                   117          Tomato
        42         113          Agrobacterium tumefaciens
        43         108          Potato
        44         103          Rhizobium meliloti










   A.3  Distribution of the sequences by size

               From   To  Number             From   To   Number
                  1-  50    2250             1001-1100      378
                 51- 100    3939             1101-1200      267
                101- 150    5433             1201-1300      198
                151- 200    3898             1301-1400      120
                201- 250    3409             1401-1500      115
                251- 300    3036             1501-1600       58
                301- 350    2862             1601-1700       55
                351- 400    2914             1701-1800       48
                401- 450    2171             1801-1900       57
                451- 500    2291             1901-2000       37
                501- 550    1656             2001-2100       20
                551- 600    1149             2101-2200       48
                601- 650     813             2201-2300       54
                651- 700     617             2301-2400       21
                701- 750     586             2401-2500       26
                751- 800     445             >2500          122
                801- 850     346
                851- 900     373
                901- 950     246
                951-1000     234


   A.4  Longest sequences

   The longest sequences (>=4000 residues) are listed here:

                               HTS1_COCCA  5217
                               FAT_DROME   5147
                               RYNR_RABIT  5037
                               RYNR_HUMAN  5032
                               RYNC_RABIT  4969
                               DYHC_DICDI  4725
                               DYHC_DROME  4639
                               APB_HUMAN   4563
                               APOA_HUMAN  4548
                               RRPA_CVMJH  4488
                               DYHC_TRIGR  4466
                               GRSB_BACBR  4451
                               PLEC_RAT    4140
                               DYHC_YEAST  4092
                               RRPA_CVH22  4085













   A.5  List of the most cited journals in SWISS-PROT

   Citations            Journal abbreviation

   4345                 J. BIOL. CHEM.
   3050                 NUCLEIC ACIDS RES.
   2777                 PROC. NATL. ACAD. SCI. U.S.A.
   1755                 J. BACTERIOL.
   1528                 FEBS LETT.
   1452                 GENE
   1407                 EUR. J. BIOCHEM.
   1243                 EMBO J.
   1227                 NATURE
   1171                 BIOCHEM. BIOPHYS. RES. COMMUN.
   1134                 BIOCHEMISTRY
    896                 J. MOL. BIOL.
    879                 BIOCHIM. BIOPHYS. ACTA
    869                 CELL
    800                 MOL. CELL. BIOL.
    715                 MOL. GEN. GENET.
    683                 VIROLOGY
    641                 BIOCHEM. J.
    620                 PLANT MOL. BIOL.
    535                 SCIENCE
    521                 J. BIOCHEM.
    475                 MOL. MICROBIOL.
    436                 J. VIROL.
    382                 J. GEN. VIROL.
    268                 J. CELL BIOL.
    247                 BIOL. CHEM. HOPPE-SEYLER
    242                 GENOMICS
    232                 GENES DEV.
    213                 HOPPE-SEYLER'S Z. PHYSIOL. CHEM.
    212                 ARCH. BIOCHEM. BIOPHYS.
    210                 CURR. GENET.
    201                 J. IMMUNOL.
    174                 MOL. BIOCHEM. PARASITOL.
    171                 MOL. ENDOCRINOL.
    169                 YEAST
    164                 J. GEN. MICROBIOL.
    154                 INFECT. IMMUN.
    150                 J. CLIN. INVEST.
    142                 DNA
    138                 ONCOGENE
    123                 PLANT PHYSIOL.
    122                 J. EXP. MED.
    122                 FEMS MICROBIOL. LETT.
    114                 J. MOL. EVOL.
    111                 AM. J. HUM. GENET.
    101                 HUM. MOL. GENET.






           APPENDIX B: RELATIONSHIPS BETWEEN BIOMOLECULAR DATABASES

   The current  status of the relationships (cross-references) between some
   biomolecular databases is shown in the following schematic:

                                                       **********************
                        ***********************        * EPD [Euk. Promot.] *
                        *  EMBL Nucleotide    * <----> **********************
                        *  Sequence Data      *
******************      *  Library            *        **********************
* FLYBASE        * <--> *********************** <----- * ECD [E. coli map]  *
* [Drosophila    *                ^   ^  ^             **********************
* genomic d.b.]  * <---------+    |   |  |
******************           |    |   |  +------------ **********************
                             |    |   |                * TFD [Trans. fact.] *
                             |    |   |                **********************
******************           |    |   |
* MaizeDb        * <------+  |    |   |                **********************
******************        |  |    |   +--------------> * GCRDb [7TM recep.] *
                          |  |    |   |                **********************
******************        |  |    |   |
* WormPep        *        |  |    |   |                **********************
* [C.elegans]    * <----+ |  |    |   |       +------> * DictyDB [D.disco.] *
******************      | |  |    |   |       |        **********************
                        | |  |    |   |       |
******************      | v  v    v   v       v        **********************
* REBASE         *      ***********************        * ENZYME [Nomencl.]  *
* [Restriction   * <--- *  SWISS-PROT         * <----- **********************
*  enzymes]      *      *  Protein Sequence   *            |
******************      *  Data Bank          *            v
                        ***********************        **********************
******************       ^  ^  ^  | |  ^ ^  |          * OMIM   [Diseases]  *
* EcoGene/EcoSeq *       |  |  |  | |  | |  +--------> **********************
* [E. coli]      * <-----+  |  |  | |  | |
******************          |  |  | |  | +-----------> **********************
                            |  |  | |  |               * ECO2DBASE     [2D] *
                            |  |  | |  |               **********************
******************          |  |  | |  |
* PROSITE        * <--------+  |  | |  +-------------> **********************
* [Patterns]     *             |  | |                  * SWISS-2DPAGE  [2D] *
******************             |  | |                  **********************
             |                 |  | |
             |                 |  | +----------------> **********************
             |                 |  |                    * Aarhus/Ghent  [2D] *
             |                 |  +---------------+    **********************
             |                 v                  |
             |          ***********************   |    **********************
             +--------> * PDB [3D structures] *   +--> * YEPD (yeast)  [2D] *
                        ***********************        **********************










<PAGE>
  

Swiss-Prot release 29.0

Published June 1, 1994

                    SWISS-PROT RELEASE 29.0 RELEASE NOTES

                               1. INTRODUCTION

   1.1  Evolution

   Release 29.0  of SWISS-PROT  contains 38303 sequence entries, comprising
   13'464'008 amino acids abstracted from 36638 references. This represents
   an increase  of 7.7% over release 28. The recent growth of the data bank
   is summarized below.

   Release    Date   Number of entries     Nb of amino acids

   3.0        11/86               4160               969 641
   4.0        04/87               4387             1 036 010
   5.0        09/87               5205             1 327 683
   6.0        01/88               6102             1 653 982
   7.0        04/88               6821             1 885 771
   8.0        08/88               7724             2 224 465
   9.0        11/88               8702             2 498 140
   10.0       03/89              10008             2 952 613
   11.0       07/89              10856             3 265 966
   12.0       10/89              12305             3 797 482
   13.0       01/90              13837             4 347 336
   14.0       04/90              15409             4 914 264
   15.0       08/90              16941             5 486 399
   16.0       11/90              18364             5 986 949
   17.0       02/91              20024             6 524 504
   18.0       05/91              20772             6 792 034
   19.0       08/91              21795             7 173 785
   20.0       11/91              22654             7 500 130
   21.0       03/92              23742             7 866 596
   22.0       05/92              25044             8 375 696
   23.0       08/92              26706             9 011 391
   24.0       12/92              28154             9 545 427
   25.0       04/93              29955            10 214 020
   26.0       07/93              31808            10 875 091
   27.0       10/93              33329            11 484 420
   28.0       02/94              36000            12 496 420
   29.0       06/94              38303            13 464 008

   1.2  Source of data

   Release 29.0  has been  updated using protein sequence data from release
   40.0 of  the PIR (Protein Identification Resource) protein data bank, as
   well as translation of nucleotide sequence data from release 38.0 of the
   EMBL Nucleotide Sequence Database.

   As an indication of the source of the sequence data in the SWISS-PROT
   data bank, we list here the statistics concerning the DR (Database
   cross-references) pointer lines:

   Entries with pointer(s) to only PIR entri(es):            4682
   Entries with pointer(s) to only EMBL entri(es):           5191
   Entries with pointer(s) to both EMBL and PIR entri(es):  27691
   Entries with no pointer lines:                             739
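
   The four categories above are mutually exclusive, so they should add up
   to the 38303 entries of release 29.0. The short Python sketch below
   performs that check and derives the share of each category; the
   dictionary keys are shorthand labels chosen for this example.

```python
# Counts copied from the statistics above (release 29.0).
pointer_counts = {
    "PIR only": 4682,
    "EMBL only": 5191,
    "EMBL and PIR": 27691,
    "no pointer lines": 739,
}
release_29_entries = 38303  # from section 1.1

total = sum(pointer_counts.values())
shares = {label: count / total * 100.0
          for label, count in pointer_counts.items()}

print("total entries accounted for:", total)
for label, share in shares.items():
    print(f"{label:<18} {share:5.1f}%")
```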



      2. DESCRIPTION OF THE CHANGES MADE TO SWISS-PROT SINCE RELEASE 28

   2.1  Sequences and annotations

   About 2320 sequences have been added since release 28, the sequence data
   of 351 existing entries has been updated, and the annotations of 7300
   entries have been revised.

   We are continuing to clean up the various representations of domains in
   the feature lines (especially the usage of the feature keys "DOMAIN",
   "REPEAT", "DNA_BIND", and "SITE"). We have also undertaken an overall
   revision of the CC topics "SUBCELLULAR LOCATION", "SUBUNIT", and
   "CAUTION".

   2.2  What's happening with the model organisms

   As we announced in the last two releases, we have selected a number of
   organisms that are the target of genome sequencing and/or mapping
   projects and for which we intend to:

   -  Be as complete as possible. All sequences available at a given time
      should be immediately included in SWISS-PROT. This also includes
      sequence corrections and updates.
   -  Provide a high level of annotations.
   -  Provide cross-references to specialized database(s) that contain,
      among other data, some genetic information about the genes that code
      for these proteins.
   -  Provide specific indices or documents.

   What was  done since  the last  release or  in preparation  for the next
   release:

   -  We have  added Homo  sapiens (human) as the sixth model organism (see
      the next section for some additional information).
   -  In the  next release  we will  add Bacillus  subtilis as  the seventh
      organism. We will link SWISS-PROT to the SubtiList database currently
      being designed by Ivan Moszer of the Pasteur Institute in Paris.
   -  The next  release of  LISTA will  include accession  numbers for each
      gene entry;  we will  therefore be able to cross-reference SWISS-PROT
      to LISTA.

   Here is the current status of the model organisms:

   Organism        Database                    Index file       Number of
                   cross-referenced                             sequences
   --------------  --------------------------  --------------   ---------
   B.subtilis      SubtiList (in preparation)  In preparation         563
   C.elegans       WormPep                     CELEGANS.TXT           679
   D.discoideum    DictyDB                     DICTY.TXT              198
   D.melanogaster  FlyBase                     In preparation         600
   E.coli          EcoGene                     ECOLI.TXT             2674
   H.sapiens       MIM                         MIMTOSP.TXT           2862
   S.cerevisiae    LISTA (in preparation)      YEAST.TXT             1951





   2.3  Human genetic diseases

   We have made a major effort to add to SWISS-PROT data relevant to human
   genetic diseases. This effort has mainly dealt with the following
   enhancements:

   a) In sequence entries associated with one or more genetic diseases, we
      have updated and expanded the annotations characterizing those
      diseases. These annotations are stored in the CC line topic
      'DISEASE'.

      Examples:

   CC   -!- DISEASE: DEFECTS IN GALNS ARE A CAUSE OF  MUCOPOLYSACCHARIDOSIS
   CC       TYPE IVA (MPS IVA) (ALSO KNOWN AS MORQUIO A SYNDROME) WHICH IS
   CC       CHARACTERIZED BY SPECIFIC SPONDYLOEPIPHYSEAL DYSPLASIA, SHORT
   CC       TRUNK DWARFISM, COXA VALGA, ODONTOID HYPOPLASIA, CORNEAL
   CC       OPACITIES, PRESERVATION OF INTELLIGENCE, AND EXCESSIVE URINARY
   CC       EXCRETION OF KERATAN SULFATE AND CHONDROITIN-6-SULFATE.

   CC   -!- DISEASE: DEFECTS IN KRT9 ARE A CAUSE OF EPIDERMOLYTIC
   CC       PALMOPLANTAR KERATODERMA (EPPK), AN AUTOSOMAL DOMINANT DISEASE
   CC       CHARACTERIZED BY DIFFUSE THICKENING OF THE EPIDERMIS ON THE
   CC       ENTIRE SURFACE OF PALMS AND SOLES SHARPLY BORDERED WITH
   CC       ERYTHEMATOUS MARGINS.


   b) We have entered in SWISS-PROT all the mutations linked with genetic
      diseases or polymorphisms as long as they are not frameshift or
      nonsense mutations. These mutations are described in the feature table
      ('VARIANT' key) and the relevant references have been added.

      Partial example  (from entry  P07949 / KRET_HUMAN) describing the RET
      protein which is linked with the diseases MEN2A, MEN2B, MTC and HSCR:

   RN   [7]
   RP   VARIANT MEN2B MET-929.
   RM   94159102
   RA   HOFSTRA R.M.W., LANDSVATER R.M., CECCHERINI I., STULP R.P.,
   RA   STELWAGEN T., LUO Y., PASINI B., HOEPPENER J.W.M.,
   RA   VAN AMSTEL H.K.P., ROMEO G., LIPS C.J.M., BUYS C.H.C.M.;
   RL   NATURE 367:375-376(1994).
   RN   [8]
   RP   VARIANTS HSCR PRO-765; GLN-897 AND GLY-972.
   RM   94159103
   RA   ROMEO G., RONCHETTO P., LUO Y., BARONE V., SERI M., CECCHERINI I.,
   RA   PASINI B., BOCCIARDI R., LERONE M., KAARLAINEN H., MARTUCCIELLO G.;
   RL   NATURE 367:377-378(1994).


   FT   VARIANT     765    765       S -> P (IN HSCR).
   FT   VARIANT     897    897       R -> Q (IN HSCR).
   FT   VARIANT     929    929       T -> M (IN MEN2B).
   FT   VARIANT     972    972       R -> G (IN HSCR).






   c) A new  CC topic  'POLYMORPHISM' has been implemented. Examples of its
      use:

   CC   -!- POLYMORPHISM: THE ALLELIC FORM OF THE ENZYME WITH GLN-191
   CC       HYDROLYZES PARAOXON WITH A LOW TURNOVER NUMBER AND THE ONE WITH
   CC       ARG-191 WITH A HIGH TURNOVER NUMBER.

   CC   -!- POLYMORPHISM: OVER 80 VARIANTS OF HUMAN DBP HAVE BEEN
   CC       IDENTIFIED. THE THREE MOST COMMON ALLELES ARE CALLED GC1F,
   CC       GC1S, AND GC2. THE SEQUENCE SHOWN IS THAT OF THE GC2 ALLELE.

   d) New keywords have been introduced:

      - 'DISEASE MUTATION' is used  for sequences in which there is at least
        one known disease-inducing mutation.
      - 'POLYMORPHISM' is used in  each entry  where "neutral" variants have
        been found (at the level of the protein sequence).
      - 'CHROMOSOMAL TRANSLOCATION' is used to indicate proteins whose genes
        are known to be involved in chromosomal translocations.
      - Keywords have been implemented for genetic diseases linked with more
        than a single gene/protein. These keywords are:

   ALBINISM
   ALZHEIMER'S DISEASE
   AMYOTROPHIC LATERAL SCLEROSIS
   ATHEROSCLEROSIS
   AUTOIMMUNE ENCEPHALOMYELITIS
   AUTOIMMUNE UVEITIS
   BERNARD SOULIER SYNDROME
   CHARCOT-MARIE-TOOTH DISEASE
   CHRONIC GRANULOMATOUS DISEASE
   COCKAYNE'S SYNDROME
   DEJERINE-SOTTAS SYNDROME
   DIABETES
   DOWN'S SYNDROME
   DWARFISM
   ELLIPTOCYTOSIS
   EMPHYSEMA
   GAUCHER DISEASE
   GLYCOGEN STORAGE DISEASE
   GM2-GANGLIOSIDOSIS
   GOUT
   HEMOPHILIA
   HEREDITARY HEMOLYTIC ANEMIA
   HYPERLIPOPROTEINEMIA
   LEBER'S HEREDITARY OPTIC NEUROPATHY
   MAPLE SYRUP URINE DISEASE
   METACHROMATIC LEUCODYSTROPHY
   MUCOPOLYSACCHARIDOSIS
   PHENYLKETONURIA
   PSEUDOHERMAPHRODITISM
   RETINITIS PIGMENTOSA
   SCID
   SYSTEMIC LUPUS ERYTHEMATOSUS
   THROMBOPHILIA
   VON WILLEBRAND DISEASE
   XERODERMA PIGMENTOSUM

   e) The GDB list of genes has been used to update the GN (gene name) line
      of many SWISS-PROT entries.
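
   The 'VARIANT' feature lines shown in b) above lend themselves to simple
   machine parsing. The following is a minimal sketch in Python, assuming
   that each line has the layout of the KRET_HUMAN example (key, start and
   end positions, then 'X -> Y (IN DISEASE).'). The function name and the
   regular expression are illustrative only and are not part of
   SWISS-PROT; descriptions in any other form are simply skipped.

```python
import re

# Layout assumed from the KRET_HUMAN example: key, start, end, then
# "X -> Y (IN DISEASE)." where the disease part is optional.
VARIANT_RE = re.compile(
    r"^FT\s+VARIANT\s+(\d+)\s+(\d+)\s+([A-Z])\s*->\s*([A-Z])"
    r"\s*(?:\(IN ([^)]+)\))?\.?\s*$"
)


def parse_variants(lines):
    """Return one dict per single-residue substitution found in the FT lines."""
    variants = []
    for line in lines:
        match = VARIANT_RE.match(line.strip())
        if match is None:
            continue  # not a VARIANT line, or a form this sketch does not cover
        start, end, original, replacement, disease = match.groups()
        variants.append({
            "start": int(start),
            "end": int(end),
            "from": original,
            "to": replacement,
            "disease": disease,  # None when no "(IN ...)" part is present
        })
    return variants
```

   Applied to the four FT lines of the example, this gives one record per
   line, for instance position 929, T replaced by M, disease MEN2B.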


   2.4  Changes in the DR line

   We  have  added  cross-references  from  SWISS-PROT  to  two  additional
   databases:

   -  The G-protein--coupled  receptor database  (GCRDb)  prepared  by  Lee
      Frank Kolakowski at the Massachusetts General Hospital Renal Unit.
      Reference: Kolakowski L.F. Jr.; Receptors Channels In press(1994).

   -  The Maize  Genome Database  (MaizeDB) developed by the USDA-ARS Maize
      Genome Project  as part  of the National Agricultural Library's Plant
      Genome Research Program.


   These cross-references are present in the DR lines:

   Data bank identifier:    GCRDB
   Primary identifier  :    Unique identifier  (accession  number)  of  the
                            entry
   Secondary identifier:    None; a dash '-' is stored in that field
   Example             :    DR   GCRDB; GCR_0087; -.


   Data bank identifier:    MAIZEDB
   Primary identifier  :    'Gene-product' accession ID
   Secondary identifier:    None; a dash '-' is stored in that field
   Example             :    DR   MAIZEDB; 25342; -.
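
   As an illustration of the three-field layout described above, here is a
   minimal Python sketch that splits a DR line into its data bank
   identifier, primary identifier and secondary identifier. It assumes the
   fields are separated by semicolons and that the line ends with a
   period, as in the two examples; the function name is hypothetical.

```python
def parse_dr_line(line):
    """Split a DR line into (databank, primary_id, secondary_id).

    A dash in the secondary field is returned as None.
    """
    text = line.strip()
    if not text.startswith("DR"):
        raise ValueError("not a DR line: %r" % line)
    # Drop the line code and the terminating period, then split on ';'.
    body = text[2:].strip().rstrip(".")
    fields = [field.strip() for field in body.split(";")]
    if len(fields) != 3:
        raise ValueError("expected three fields in: %r" % line)
    databank, primary_id, secondary_id = fields
    if secondary_id == "-":
        secondary_id = None
    return databank, primary_id, secondary_id
```

   For the GCRDB example above this returns the data bank 'GCRDB', the
   primary identifier 'GCR_0087' and no secondary identifier.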


   We  have   removed  from   SWISS-PROT  cross-references   to  TFD   (the
   Transcription Factor  Database). The  main reason  for this  decision is
   that the  information stored  in the 'polypeptide' table of TFD does not
   expand on the data present in the corresponding SWISS-PROT entry.


   2.5  Status of the documentation files

   SWISS-PROT is distributed with a large number of documentation files.
   Some of these files have been available for a long time (the user
   manual, release notes, the various indices for authors, citations,
   keywords, etc.), but many have been created recently and we are
   continuously adding new files. The following table lists all the
   documents that are currently available or that will be added in the next
   few months.

   USERMAN .TXT   User manual
   RELNOTES.TXT   Release notes
   SHORTDES.TXT   Short description of entries in SWISS-PROT
   JOURLIST.TXT   List of abbreviations for journals cited
   KEYWLIST.TXT   List of keywords in use
   SPECLIST.TXT   List of organism identification codes
   EXPERTS .TXT   List of on-line experts for PROSITE and SWISS-PROT [1, 3]

   ACINDEX .TXT   Accession number index
   AUTINDEX.TXT   Author index
   CITINDEX.TXT   Citation index
   KEYINDEX.TXT   Keyword index
   SPEINDEX.TXT   Species index

   7TMRLIST.TXT   List of 7-transmembrane G-linked receptors entries
   CDLIST  .TXT   CD nomenclature for surface proteins of human leucocytes
   CELEGANS.TXT   Index  of   Caenorhabditis  elegans   entries  and  their
                  corresponding  gene   designations  and   WormPep  cross-
                  references
   DICTY   .TXT   Index  of  Dictyostelium  discoideum  entries  and  their
                  corresponding   gene   designations  and  DictyDB  cross-
                  references
   EC2DTOSP.TXT   Index of  Escherichia coli  Gene-protein database entries
                  referenced in SWISS-PROT
   ECOLI   .TXT   Index of  Escherichia coli  K12 chromosomal  entries  and
                  their corresponding EcoGene cross-reference [4]
   EMBLTOSP.TXT   Index of EMBL Database entries referenced in SWISS-PROT
   GLYCOSYL.TXT   Index of  glycosyl hydrolases  classified by  families on
                  the basis of sequence similarities [2]
   HOXLIST .TXT   Vertebrate homeobox proteins: nomenclature and index
   MIMTOSP .TXT   Index of MIM entries referenced in SWISS-PROT
   NOMLIST .TXT   List of nomenclature related references for proteins [1]
   PDBTOSP .TXT   Index of Brookhaven PDB entries referenced in SWISS-PROT
   PLASTID .TXT   List of chloroplast and cyanelle encoded proteins
   RESTRIC .TXT   List of restriction enzymes and methylases entries [1]
   RIBOSOMP.TXT   Index of ribosomal proteins classified by families on the
                  basis of sequence similarities [2]
   YEAST   .TXT   Index  of  Saccharomyces  cerevisiae  entries  and  their
                  corresponding gene designations
   YEAST11 .TXT   Yeast Chromosome XI entries [1]


   Notes:

   [1]  New in release 29.
   [2]  Will be available starting with release 30 in October 1994.
   [3]  The list  of on-line  experts used to be an appendix of the release
        notes. We now provide it as a separate document.
   [4]  The format  of this  file was slightly modified in this release; we
        added a field that  indicates  if the  3D  structure  of  an E.coli
        protein is available in PDB.











   2.6  The ExPASy World-Wide Web server

        2.6.1  Background information

   The World-Wide Web (WWW), which originated at CERN, is a powerful global
   information  system   merging  networked   information   retrieval   and
   hypertext. It  gives access, using hypertext links, to the documents and
   information contained  in all the existing WWW servers around the world,
   as well  as to  the data  obtainable through other information retrieval
   systems like WAIS, Gopher, X500, etc. To access a WWW server, one has to
   run on a local computer a client program (a WWW browser), which displays
   hypertext documents.  The user  can then either request a keyword search
   or jump to another document by following a hypertext link. WWW has the
   outstanding advantages of extending the hypertext model to the whole
   world (by allowing hypertext jumps to documents anywhere on the
   Internet) and of being device and user-interface independent (browsers
   exist for a variety of computers and user interfaces, including Unix
   workstations running X Windows, Macintoshes and PCs with Microsoft
   Windows).

   The ExPASy  WWW server  allows access, using the user-friendly hypertext
   model,  to  the  SWISS-PROT,  PROSITE,  SWISS-2DPAGE  and  SWISS-3DIMAGE
   databases and,  through any  SWISS-PROT protein sequence entry, to other
   databases such  as EMBL, PROSITE, REBASE, FlyBase, GCRDb, MaizeDB, OMIM,
   PDB and Medline. Using a browser that is able to display images, one
   can also remotely access 2D gel image data from SWISS-2DPAGE.

   A WWW  server can  be accessed  on  the  internet  through  its  Uniform
   Resource Locator  (URL), the addressing system defined by the WWW model.
   The URL for the ExPASy WWW server is:

                           http://expasy.hcuge.ch/
   or
                            http://129.195.254.61/

   To access a WWW server, you need to run a browser (or client) program on
   your local computer. Browsers exist for a variety of machines and may be
   obtained by anonymous ftp. ExPASy can be used with any WWW browser, but
   we recommend NCSA Mosaic. It is a very flexible and powerful browser
   with a graphical user interface, available for Unix boxes using
   X11/Motif, for Apple Macintoshes and for Microsoft Windows. You can get
   it from the FTP site: ftp.ncsa.uiuc.edu.

   To access  all the  data available  from SWISS-2DPAGE,  the user's local
   computer needs  to run  an image  viewing program.  For most browsers on
   Unix workstations  the default  program is  xv, a shareware application.
   Similar Windows  or Apple  shareware or  public domain  applications are
   also available.

   For more  information on  the  ExPASy  WWW  server,  you  can  read  the
   following article:

      Appel R.D., Bairoch A., Hochstrasser D.F.
      A new  generation of  information retrieval tools for biologists: the
      example of the ExPASy WWW server.
      Trends Biochem. Sci. 19:258-260(1994).




   Or you can contact Dr. Ron Appel:

      Email: appel@cih.hcuge.ch
      Fax: +41-22-372 61 98


        2.6.2  Changes to the WWW ExPASy server

   There have been quite a number of changes to the server in the last
   three months. We list in particular the following enhancements:

   -  A direct  entry point to PROSITE has been implemented. It is possible
      to search in PROSITE by description (title of the entry), entry name,
      accession number, author name and by performing a full text search.
   -  It is  now possible to retrieve either the EMBL or GenBank version of
      a cross-referenced nucleotide sequence entry.
   -  Active cross-references  are now  provided to  GCRDb and MaizeDB (see
      section 2.4 above).
   -  New SWISS-PROT  documents such  as RESTRIC.TXT  or  YEAST11.TXT  (see
      section 2.5 above) are available as hypertext documents.


   2.7  Weekly updates of SWISS-PROT

   Since release 24, we have provided weekly updates of SWISS-PROT.

   The weekly  updates are  available by  anonymous FTP.  Three  files  are
   updated at each update:

   new_seq.dat    Contains all the new entries since the last full release.
   upd_seq.dat    Contains the entries for which the sequence data has been
                  updated since the last release.
   upd_ann.dat    Contains the  entries for  which one  or more  annotation
                  fields have been updated since the last release.

   Currently these  files are  available on  the  following  anonymous  ftp
   servers:

   Organization   ExPASy (Geneva University Expert Protein Analysis System)
   Address        expasy.hcuge.ch  (or 129.195.254.61)
   Directory      /databases/swiss-prot/updates

   Organization   National Center for Biotechnology Information (NCBI)
   Address        ncbi.nlm.nih.gov (or 130.14.20.1)
   Directory      /repository/swiss-prot/updates

   Organization   EMBL ftp server
   Address        ftp.embl-heidelberg.de (or 192.54.41.33)
   Directory      /pub/databases/swissprot/new

   !!! Important notes !!!

   Although we try to follow a regular schedule, we do not promise to
   update these files every week. In some cases two weeks will elapse
   between two updates.





   Due to the current mechanism used to build a release, the entries that
   are provided in these updates are not guaranteed to be error free. Also,
   for the same reason, new entries do not contain an OC (Organism
   Classification) line.
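
   To illustrate how the three update files described in this section
   might be applied to a local copy of the last full release, here is a
   minimal Python sketch. It rests on assumptions that are not stated in
   this section: that each entry begins with an ID line whose second field
   is the entry name, that entries are terminated by a line containing
   only '//', and that an updated entry simply replaces the entry of the
   same name. All function names are hypothetical.

```python
def split_entries(text):
    """Yield entries as lists of lines, assuming '//' terminates each entry."""
    entry = []
    for line in text.splitlines():
        if line.strip() == "//":
            if entry:
                yield entry
            entry = []
        else:
            entry.append(line)
    if entry:  # tolerate a missing final terminator
        yield entry


def entry_name(entry):
    """Return the second field of the ID line (assumed to be the entry name)."""
    for line in entry:
        if line.startswith("ID"):
            return line.split()[1]
    raise ValueError("entry has no ID line")


def apply_updates(release, new_seq, upd_seq, upd_ann):
    """Return {entry name: entry lines} after applying the three update files."""
    merged = {entry_name(e): e for e in split_entries(release)}
    for update in (new_seq, upd_seq, upd_ann):
        for entry in split_entries(update):
            merged[entry_name(entry)] = entry  # add new or replace existing
    return merged


def lacks_oc_line(entry):
    """True if the entry has no OC line, as is the case for new entries."""
    return not any(line.startswith("OC") for line in entry)
```

   Since new entries in the updates carry no OC line, a program relying on
   the classification may want to test entries with something like
   lacks_oc_line() before using them.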


                            3. ENZYME AND PROSITE

   3.1  The ENZYME data bank

   Release 16.0 of the ENZYME data bank is distributed with release 29 of
   SWISS-PROT. ENZYME release 16.0 contains information on 3546 enzymes.
   For the first time we have integrated information sent directly to us
   by the Enzyme nomenclature subcommittee of the NCB-IUBMB.


   3.2  The PROSITE data bank

        3.2.1  Statistics for release 12

   Release 12.0 of the PROSITE data bank is distributed with release 29 of
   SWISS-PROT. Release 12 contains 785 documentation chapters that
   describe 1029 different patterns, rules and profiles/matrices. Since
   the last major release of PROSITE (release 11.0 of October 1993), 71 new
   chapters have been added and 338 chapters have been updated.

   Out of a total of 38303 entries in SWISS-PROT, 18786 are
   cross-referenced in PROSITE (excluding the false positives). This
   accounts for about 49% of the sequences in SWISS-PROT.
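
   The percentage quoted above follows directly from the two counts. As a
   small worked check in Python (the function name is illustrative):

```python
def coverage_percent(cross_referenced, total):
    """Share of entries that carry a cross-reference, in percent."""
    return 100.0 * cross_referenced / total


# Counts taken from the paragraph above.
share = coverage_percent(18786, 38303)
print(f"{share:.1f}% of SWISS-PROT entries are cross-referenced in PROSITE")
```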

   The next  release of  PROSITE (12.1) will be distributed with release 30
   of SWISS-PROT.


        3.2.2  List of the new entries in release 12

      Ly-6 / u-PAR domain signature
      Nuclear transition protein 2 signatures
      Ribosomal protein L19 signature
      Ribosomal protein L20 signature
      Ribosomal protein L35 signature
      Ribosomal protein L1e signature
      Ribosomal protein S2 signatures
      Ribosomal protein S7e signature
      Ribosomal protein S21e signature
      Ribosomal protein S28e signature
      DnaA protein signature
      NAD-dependent glycerol-3-phosphate dehydrogenase signature
      FAD-dependent glycerol-3-phosphate dehydrogenase signatures
      Mannitol dehydrogenases signature
      Coproporphyrinogen III oxidase signature
      Bacterial-type phytoene dehydrogenase signature
      Ergosterol biosynthesis ERG4/ERG24 family signatures
      Transaldolase active site
      Myristoyl-CoA:protein N-myristoyltransferase signatures
      PTS EIIB domains cysteine phosphorylation site signature
      Eukaryotic RNA polymerases 15 Kd subunits signature
      Protein phosphatase 2A regulatory subunit PR55 signatures
      Protein phosphatase 2C signature
      Glycosyl hydrolases family 16 signature
      Glycosyl hydrolases family 25 active sites signature
      Glycosyl hydrolases family 39 putative active site
      Ubiquitin carboxyl-terminal hydrolases family 2 signatures
      Glycoprotease family signature
      Dehydroquinase class I active site
      Dehydroquinase class II signature
      Imidazoleglycerol-phosphate dehydratase signatures
      Cysteine synthase/cystathionine beta-synthase P-phosphate
      Glyoxalase I signatures
      6-pyruvoyl tetrahydropterin synthase signatures
      Phosphomannose isomerase type I signatures
      Folylpolyglutamate synthase signatures
      Transposases, Mutator family, signature
      OHHL biosynthesis luxI family signature
      Succinate dehydrogenase cytochrome b subunit signatures
      Globins profile
      PTR2 family proton/oligopeptide symporters signatures
      glpT family of transporters signature
      Bacterial formate and nitrite transporters signatures
      Fungal hydrophobins signature
      G-protein coupled receptors family 3 signatures
      Antenna complexes alpha and beta subunits signatures
      Photosystem I psaG and psaK proteins signature
      ER lumen protein retaining receptor signatures
      Neuromedin U signature
      Urotensin II signature
      Neutrophil bactenecins signatures
      Gamma-thionins family signature
      Streptomyces subtilisin-type inhibitors signature
      Heat shock hsp20 proteins family profile
      Bacterial export FHIPEP family signature
      Cytochrome c oxidase assembly factor COX10/ctaB/cyoE signature
      Cyclin-dependent kinases regulatory subunits signatures
      ADP-ribosylation factors family signature
      SAR1 family signature
      Initiation factor 3 signature
      Transcription termination factor nusG signature
      BTG1 family signature
      G10 protein signatures
      Clathrin adaptor complexes medium chain signatures
      Clathrin adaptor complexes small chain signature
      Extracellular proteins SCP/Tpx-1/Ag5/PR-1/Sc7 signatures
      Oxysterol-binding protein family signature
      Serum amyloid A proteins signature
      Spermadhesins family signatures
      Syndecans signature
      Translationally controlled tumor protein signatures







        3.2.3  Status of profiles in PROSITE

   There are  a number  of  protein  families  as  well  as  functional  or
   structural domains  that cannot  be detected using patterns due to their
   extreme sequence  divergence. Typical  examples of  important functional
   domains which  are weakly  conserved are the immunoglobulin domains, the
   SH2 and SH3 domains, or the fibronectin type III domain. In such domains
   there are only a few sequence positions which are well conserved. Any
   attempt to build a consensus pattern for such regions will either fail
   to pick up a significant proportion of the protein sequences that
   contain such a region (false negatives) or will pick up too many
   proteins that do not contain the region (false positives). The use of
   techniques based on weight matrices or profiles allows the detection of
   such proteins or domains. Philipp Bucher and Kay Hofmann at ISREC in
   Lausanne and I are collaborating to include such methods in PROSITE.
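
   The difference between a pattern and a weight matrix can be shown with
   a toy example. The Python sketch below is not the profile method used
   in PROSITE; it builds a simple log-odds weight matrix from a few
   invented, ungapped, aligned peptides, assuming a uniform background
   frequency and a pseudocount of 1, and compares it with a strict
   pattern written as a regular expression. All sequences and names are
   made up for the illustration.

```python
import math
import re

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"


def build_weight_matrix(aligned, pseudocount=1.0):
    """Log-odds scores per position from equal-length aligned sequences."""
    background = 1.0 / len(ALPHABET)  # uniform background (an assumption)
    matrix = []
    for pos in range(len(aligned[0])):
        column = [seq[pos] for seq in aligned]
        total = len(column) + pseudocount * len(ALPHABET)
        matrix.append({
            aa: math.log(((column.count(aa) + pseudocount) / total) / background)
            for aa in ALPHABET
        })
    return matrix


def best_score(matrix, sequence):
    """Highest score of any ungapped window of the sequence."""
    width = len(matrix)
    best = float("-inf")
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        best = max(best, sum(matrix[i][aa] for i, aa in enumerate(window)))
    return best


# Invented family members and a strict pattern derived from most of them.
family = ["GKSTL", "GKTTL", "GRSTV", "AKSTL"]
pattern = re.compile(r"GK[ST]TL")
matrix = build_weight_matrix(family)
```

   With these data the divergent peptide 'ARSTV' is missed by the pattern
   (a false negative) but still obtains a positive weight matrix score,
   whereas an unrelated peptide such as 'WWWWW' scores below zero. This
   sketch only handles the twenty standard residues.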

   This is the first release of PROSITE to include weight matrices (also
   known as profiles). In this release only two profile entries are
   available (for the hsp20 family of small chaperones and for globins). We
   plan to add many new profiles for the next major release (release 13) as
   well as to replace some of the existing pattern entries by profiles.

   None of the many academic or commercial programs which have been
   developed to scan PROSITE can currently make use of the profile entries.
   We are therefore distributing, with PROSITE, the source code (C
   language) of two programs that should help software developers to
   implement profile-specific routines in their application(s):

   scan4prf Loads a  sequence from a file and scans it with all (or one) of
   the PROSITE profiles.

   srch4prf Loads a  profile from  a file  and scans  for that profile in a
   SWISS-PROT data base file.

   These programs will soon be available in the respective /prosite
   directory of the NCBI and ExPASy anonymous FTP servers (for the
   addresses, see section 2.7).

   Important notice for software developers: the integration of profiles
   into PROSITE did not "break" the current format. The profile entries in
   the PROFILE.DAT file are tagged with the token "MATRIX" on the "ID"
   line; a new line-type "MA" is used in these entries to store all the
   parameters specific to weight matrices. The full description of the
   format of the "MA" line-type is available as a user's manual (file
   name: PROFILE.TXT) that is part of the PROSITE distribution files. The
   format of the PROSITE.DOC file has not been changed.
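
   A program that does not yet handle profiles can at least recognize the
   profile entries and set them aside. The following Python sketch does
   so using only the two facts given above: the token "MATRIX" on the "ID"
   line and the "MA" line-type. It assumes, in addition, that every entry
   starts with an "ID" line and ends with a line containing only '//';
   the content of the "MA" lines is not interpreted (see PROFILE.TXT for
   their format). The function name is hypothetical.

```python
import re


def profile_entries(text):
    """Yield (id_line, ma_lines) for each entry tagged MATRIX on its ID line."""
    id_line, ma_lines = None, []

    def is_profile(line):
        # Compare whole tokens so that punctuation such as ';' or '.' is ignored.
        return line is not None and "MATRIX" in re.findall(r"[A-Za-z0-9_]+", line)

    for line in text.splitlines():
        if line.startswith("ID"):
            id_line, ma_lines = line, []
        elif line.startswith("MA"):
            ma_lines.append(line)
        elif line.strip() == "//":
            # Assumed end-of-entry marker.
            if is_profile(id_line):
                yield id_line, ma_lines
            id_line, ma_lines = None, []
    if is_profile(id_line):  # tolerate a missing final terminator
        yield id_line, ma_lines
```

   Entries that are not yielded by this function can be processed exactly
   as before, which is the sense in which the format was not "broken".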



        3.2.4  Author index file

   Starting with  this release, we distribute a file that contains an index
   of the authors (and on-line experts) referenced in the PROSITE.DOC file.
   The name of this file is 'PAUTINDX.TXT'.







                             WE NEED YOUR HELP !

   We welcome feedback from our users. We would especially appreciate
   being notified if you find that sequences belonging to your field of
   expertise are missing from the data bank. We would also like to be
   notified about annotations to be updated if, for example, the function
   of a protein has been clarified or if new post-translational information
   has become available.



















































                         APPENDIX A: SOME STATISTICS



   A.1  Amino acid composition


        A.1.1  Composition in percent for the complete data bank

   Ala (A) 7.60   Gln (Q) 4.03   Leu (L) 9.21   Ser (S) 7.15
   Arg (R) 5.23   Glu (E) 6.27   Lys (K) 5.83   Thr (T) 5.82
   Asn (N) 4.47   Gly (G) 6.97   Met (M) 2.36   Trp (W) 1.29
   Asp (D) 5.28   His (H) 2.25   Phe (F) 4.00   Tyr (Y) 3.21
   Cys (C) 1.78   Ile (I) 5.58   Pro (P) 5.02   Val (V) 6.52

   Asx (B) 0.005  Glx (Z) 0.005  Xaa (X) 0.02


        A.1.2  Classification of the amino acids by their frequency

   Leu, Ala, Ser, Gly, Val, Glu, Lys, Thr, Ile, Asp, Arg, Pro, Asn, Gln,
   Phe, Tyr, Met, His, Cys, Trp
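
   The ranking in A.1.2 is simply the table of A.1.1 sorted by decreasing
   percentage. The Python sketch below shows, under the assumption that
   sequences are available as plain strings of one-letter codes, how such
   a composition can be computed and ranked; the function names are
   illustrative. The dictionary reproduces the values of A.1.1 for the
   twenty standard amino acids.

```python
from collections import Counter


def composition_percent(sequences):
    """Percentage of each residue over all the given sequences."""
    counts = Counter()
    for seq in sequences:
        counts.update(seq)
    total = sum(counts.values())
    return {aa: 100.0 * n / total for aa, n in counts.items()}


def rank_by_frequency(composition):
    """Residues ordered from the most to the least frequent."""
    return sorted(composition, key=composition.get, reverse=True)


# Values copied from table A.1.1 (B, Z and X omitted).
TABLE_A11 = {
    "Ala": 7.60, "Gln": 4.03, "Leu": 9.21, "Ser": 7.15,
    "Arg": 5.23, "Glu": 6.27, "Lys": 5.83, "Thr": 5.82,
    "Asn": 4.47, "Gly": 6.97, "Met": 2.36, "Trp": 1.29,
    "Asp": 5.28, "His": 2.25, "Phe": 4.00, "Tyr": 3.21,
    "Cys": 1.78, "Ile": 5.58, "Pro": 5.02, "Val": 6.52,
}

print(", ".join(rank_by_frequency(TABLE_A11)))
```

   Run on the values of A.1.1, the ranking function reproduces the order
   listed in A.1.2, from Leu down to Trp.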



   A.2  Distribution of the sequences by their organism of origin

   Total number of species represented in this release of SWISS-PROT: 4471


        A.2.1 Table of the frequency of occurrence of species


        Species represented 1x: 2010
                            2x:  735
                            3x:  402
                            4x:  255
                            5x:  182
                            6x:  196
                            7x:  102
                            8x:   79
                            9x:   88
                           10x:   45
                       11- 20x:  176
                       21- 50x:  121
                       51-100x:   37
                         >100x:   43
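
   A table of this kind can be derived by counting the entries of each
   species and then counting the species in each row. The Python sketch
   below assumes that one species name (or code) is available per entry;
   the function names are hypothetical. The list reproduces the rows of
   the table above.

```python
from collections import Counter


def bin_label(n):
    """Map a number of entries per species to the row label of the table."""
    if n < 1:
        raise ValueError("a species must occur at least once")
    if n <= 10:
        return f"{n}x"
    if n <= 20:
        return "11-20x"
    if n <= 50:
        return "21-50x"
    if n <= 100:
        return "51-100x"
    return ">100x"


def occurrence_table(species_of_entries):
    """Number of species per row, given one species name per entry."""
    entries_per_species = Counter(species_of_entries)
    return Counter(bin_label(n) for n in entries_per_species.values())


# Row values copied from the table above, in order from 1x to >100x.
TABLE_A21 = [2010, 735, 402, 255, 182, 196, 102, 79, 88, 45, 176, 121, 37, 43]
print(sum(TABLE_A21))
```

   The rows of the table add up to 4471, the total number of species
   given at the start of section A.2.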















        A.2.2  Table of the most represented species

    Number   Frequency          Species
         1        2862          Human
         2        2674          Escherichia coli
         3        1951          Baker's yeast (Saccharomyces cerevisiae)
         4        1697          Mouse
         5        1565          Rat
         6         710          Bovine
         7         679          Caenorhabditis elegans
         8         633          Fruit fly (Drosophila melanogaster)
         9         563          Bacillus subtilis
        10         542          Chicken
        11         410          African clawed frog (Xenopus laevis)
        12         394          Salmonella typhimurium
        13         387          Rabbit
        14         339          Pig
        15         251          Vaccinia virus (strain Copenhagen)
        16         239          Maize
        17         229          Arabidopsis thaliana (Mouse-ear cress)
        18         221          Fission yeast (Schizosaccharomyces pombe)
        19         200          Bacteriophage T4
        20         198          Slime mold (Dictyostelium discoideum)
        21         197          Rice
        22         193          Human cytomegalovirus (strain AD169)
        23         185          Pseudomonas aeruginosa
        24         183          Vaccinia virus (strain WR)
        25         180          Tobacco
        26         174          Pea
        27         168          Wheat
        28         158          Barley
        29         146          Variola virus
        30         142          Dog
        31         139          Sheep
        32         137          Soybean
        33         134          Staphylococcus aureus
        34         131          Spinach
        35         127          Pseudomonas putida
        36         124          Neurospora crassa
        37         122          Marchantia polymorpha (Liverwort)
        38         121          Rhodobacter capsulatus
        39         119          Klebsiella pneumoniae
        40         111          Agrobacterium tumefaciens
        41         108          Bacillus stearothermophilus
        42         104          Tomato
        43         101          Rhizobium meliloti













   A.3  Distribution of the sequences by size

               From   To  Number             From   To   Number
                  1-  50    2159             1001-1100      363
                 51- 100    3748             1101-1200      257
                101- 150    5182             1201-1300      196
                151- 200    3711             1301-1400      115
                201- 250    3240             1401-1500      113
                251- 300    2860             1501-1600       54
                301- 350    2687             1601-1700       55
                351- 400    2771             1701-1800       45
                401- 450    2065             1801-1900       51
                451- 500    2192             1901-2000       37
                501- 550    1560             2001-2100       19
                551- 600    1089             2101-2200       48
                601- 650     776             2201-2300       53
                651- 700     580             2301-2400       19
                701- 750     558             2401-2500       25
                751- 800     425             >2500          119
                801- 850     328
                851- 900     353
                901- 950     229
                951-1000     221


   Currently the ten longest sequences are:

                            HTS1_COCCA  5217 a.a.
                             FAT_DROME  5147 a.a.
                            RYNR_RABIT  5037 a.a.
                            RYNR_HUMAN  5032 a.a.
                            RYNC_RABIT  4969 a.a.
                            DYHC_DICDI  4725 a.a.
                             APB_HUMAN  4563 a.a.
                            APOA_HUMAN  4548 a.a.
                            RRPA_CVMJH  4488 a.a.
                            DYHC_TRIGR  4466 a.a.
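
   The size classes used in the table of this section are 50 residues
   wide up to 1000, 100 residues wide from 1001 to 2500, and a single
   class above 2500. A minimal Python sketch of this binning rule follows;
   the function name is illustrative.

```python
def size_bin(length):
    """Return the (from, to) size class of the table for a sequence length.

    The open-ended class above 2500 is returned as (2501, None).
    """
    if length < 1:
        raise ValueError("sequence length must be at least 1")
    if length > 2500:
        return (2501, None)
    width = 50 if length <= 1000 else 100
    low = ((length - 1) // width) * width + 1
    return (low, low + width - 1)


# The longest sequence listed above falls in the open-ended class.
print(size_bin(5217))
```

   The counts of all size classes in the table add up to 38303, the
   total number of entries quoted in section 3.2.1.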






















           APPENDIX B: RELATIONSHIPS BETWEEN BIOMOLECULAR DATABASES

   The current  status of the relationships (cross-references) between some
   biomolecular databases is shown in the following schematic:

                                                       **********************
                        ***********************        * EPD [Euk. Promot.] *
                        *  EMBL Nucleotide    * <----> **********************
                        *  Sequence Data      *
******************      *  Library            *        **********************
* FLYBASE        * <--> *********************** <----- * ECD [E. coli map]  *
* [Drosophila    *                ^   ^  ^             **********************
* genomic d.b.]  * <---------+    |   |  |
******************           |    |   |  +------------ **********************
                             |    |   |                * TFD [Trans. fact.] *
                             |    |   |                **********************
******************           |    |   |
* MaizeDb        * <------+  |    |   |                **********************
******************        |  |    |   +--------------> * GCRDb [7TM recep.] *
                          |  |    |   |                **********************
******************        |  |    |   |
* WormPep        *        |  |    |   |                **********************
* [C.elegans]    * <----+ |  |    |   |       +------> * DictyDB [D.disco.] *
******************      | |  |    |   |       |        **********************
                        | |  |    |   |       |
******************      | v  v    v   v       v        **********************
* REBASE         *      ***********************        * ENZYME [Nomencl.]  *
* [Restriction   * <--- *  SWISS-PROT         * <----- **********************
*  enzymes]      *      *  Protein Sequence   *            |
******************      *  Data Bank          *            v
                        ***********************        **********************
******************       ^  ^  |  |  ^   ^  |          * OMIM   [Diseases]  *
* EcoGene/EcoSeq *       |  |  |  |  |   |  +--------> **********************
* [E. coli]      * <-----+  |  |  |  |   |
******************          |  |  |  |   +-----------> **********************
                            |  |  |  |                 * ECO2DBASE     [2D] *
                            |  |  |  |                 **********************
******************          |  |  |  |
* PROSITE        * <--------+  |  |  +---------------> **********************
* [Patterns]     *             |  |                    * SWISS-2DPAGE  [2D] *
******************             |  +---------------+    **********************
             |                 v                  |
             |          ***********************   |    **********************
             +--------> * PDB [3D structures] *   +--> * Aarhus/Ghent  [2D] *
                        ***********************        **********************













Swiss-Prot release 28.0

Published February 1, 1994

                    SWISS-PROT RELEASE 28.0 RELEASE NOTES


                               1. INTRODUCTION

   1.1  Evolution

   Release 28.0  of SWISS-PROT  contains 36000 sequence entries, comprising
   12'496'420 amino acids abstracted from 33903 references. This represents
   an increase of 10.1% over release 27. The recent growth of the data bank
   is summarized below.

   Release    Date   Number of entries     Nb of amino acids

   3.0        11/86               4160               969 641
   4.0        04/87               4387             1 036 010
   5.0        09/87               5205             1 327 683
   6.0        01/88               6102             1 653 982
   7.0        04/88               6821             1 885 771
   8.0        08/88               7724             2 224 465
   9.0        11/88               8702             2 498 140
   10.0       03/89              10008             2 952 613
   11.0       07/89              10856             3 265 966
   12.0       10/89              12305             3 797 482
   13.0       01/90              13837             4 347 336
   14.0       04/90              15409             4 914 264
   15.0       08/90              16941             5 486 399
   16.0       11/90              18364             5 986 949
   17.0       02/91              20024             6 524 504
   18.0       05/91              20772             6 792 034
   19.0       08/91              21795             7 173 785
   20.0       11/91              22654             7 500 130
   21.0       03/92              23742             7 866 596
   22.0       05/92              25044             8 375 696
   23.0       08/92              26706             9 011 391
   24.0       12/92              28154             9 545 427
   25.0       04/93              29955            10 214 020
   26.0       07/93              31808            10 875 091
   27.0       10/93              33329            11 484 420
   28.0       02/94              36000            12 496 420


   1.2  Source of data

   Release 28.0  has been  updated using protein sequence data from release
   38.0 of  the PIR (Protein Identification Resource) protein data bank, as
   well as translation of nucleotide sequence data from release 37.0 of the
   EMBL Nucleotide Sequence Database.

   As an indication of the source of the sequence data in the SWISS-PROT
   data bank, we list here the statistics concerning the DR (Database
   cross-references) pointer lines:

   Entries with pointer(s) to only PIR entries:              4624
   Entries with pointer(s) to only EMBL entries:             5593
   Entries with pointer(s) to both EMBL and PIR entries:    25048
   Entries with no pointer lines:                             735
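
   As a quick consistency check, the four categories above can be summed
   and compared with the 36000 entries listed for release 28.0 in section
   1.1. This is a minimal sketch; the variable names are chosen for
   illustration only.

```python
# Counts copied from the DR pointer statistics above.
pointer_counts = {
    "PIR only": 4624,
    "EMBL only": 5593,
    "EMBL and PIR": 25048,
    "no pointer": 735,
}

total_entries = sum(pointer_counts.values())

# Share of each category, in percent of all entries.
shares = {
    name: 100.0 * count / total_entries
    for name, count in pointer_counts.items()
}

print(total_entries)
for name, share in shares.items():
    print(f"{name:<14}{share:5.1f}%")
```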


<PAGE>


      2. DESCRIPTION OF THE CHANGES MADE TO SWISS-PROT SINCE RELEASE 27


   2.1  Sequences and annotations

   About 2700 sequences have been added since release 27, the sequence data
   of 379  existing entries  has been  updated and  the annotations of 5700
   entries have been revised.

   In particular, we have started a process to 'clean up' the various
   representations of domains in the feature lines (especially the usage of
   the feature keys "DOMAIN", "REPEAT", "DNA_BIND", and "SITE"). We have
   also undertaken an overall revision of the CC topics "SUBCELLULAR
   LOCATION", "SUBUNIT" and "CAUTION". Most of the work has already been
   carried out for this release and we plan to finish this major annotation
   revamping for the next release.

   2.2  What's happening with the model organisms

   As we announced in the last release, we have selected a number of
   organisms that are the target of genome sequencing and/or mapping
   projects and for which we intend to:

   -  Be as complete as possible. All sequences available at a given time
      should be immediately included in SWISS-PROT. This also includes
      sequence corrections and updates.
   -  Provide a high level of annotation.
   -  Provide cross-references to specialized database(s) that contain,
      among other data, some genetic information about the genes that code
      for these proteins.
   -  Provide specific indices or documents.

   Thanks to  a collaborative  effort with Douglas Smith and Bill Loomis of
   UCSD we  have added  a fifth  organism, Dictyostelium  discoideum (slime
   mold), to  our list of model organisms. Many new sequences were added at
   this  release  and  a  new  document  file  (DICTY.TXT)  lists  all  the
   D.discoideum sequence entries in SWISS-PROT and their corresponding gene
   names.

   At this release we have also started our collaboration with the group at
   the Sanger Genome Center in Hinxton (UK) and we have added 516 new
   C.elegans sequences, most of which are translations of sequencing data
   from the genome project. A new document file (CELEGANS.TXT) lists all
   the C.elegans sequence entries in SWISS-PROT and their corresponding
   gene names and, when appropriate, their cosmid-derived names.

   Here is the current status of the five model organisms:

   Organism         Database                Index file        Number of
                    cross-referenced                          sequences
   --------------   ----------------------  --------------    ---------
   C.elegans        WormPep                 CELEGANS.TXT            672
   D.discoideum     DictyDB                 DICTY.TXT               183
   D.melanogaster   FlyBase                 In preparation          600
   E.coli           EcoGene                 ECOLI.TXT              2555
   S.cerevisiae     LISTA (in preparation)  YEAST.TXT              1731


<PAGE>


   2.3  The ExPASy World-Wide Web server

        2.3.1  Background information

   The World-Wide Web (WWW), which originated at CERN, is a powerful global
   information  system   merging  networked   information   retrieval   and
   hypertext. It  gives access, using hypertext links, to the documents and
   information contained  in all the existing WWW servers around the world,
   as well  as to  the data  obtainable through other information retrieval
   systems like WAIS, Gopher, X500, etc. To access a WWW server, one has to
   run on a local computer a client program (a WWW browser), which displays
   hypertext documents.  The user  can then either request a keyword search
   or jump to another document by following a hypertext link. WWW has the
   outstanding advantages of extending the hypertext model to the whole
   world (by allowing hypertext jumps to documents anywhere on the
   Internet) and of being device and user-interface independent (browsers
   exist for a variety of computers and user interfaces, including Unix
   workstations running XWindows, Macintoshes and PCs with Microsoft
   Windows).

   The ExPASy  WWW server  allows access, using the user-friendly hypertext
   model, to  the SWISS-PROT  and SWISS-2DPAGE  databases and,  through any
   SWISS-PROT protein  sequence entry,  to other  databases such  as  EMBL,
   PROSITE, REBASE, FlyBase, PDB, OMIM and Medline. Using a browser which
   is able to display images, one can also remotely access 2D gel image
   data from SWISS-2DPAGE.

   A WWW  server can  be accessed  on  the  internet  through  its  Uniform
   Resource Locator  (URL), the addressing system defined by the WWW model.
   The URL for the ExPASy WWW server is:

                           http://expasy.hcuge.ch/
   or
                            http://129.195.254.61/

   To access a WWW server, you need to run a browser (or client) program on
   your local computer. Browsers exist for a variety of machines and may be
   obtained by anonymous ftp. ExPASy can be used with any WWW browser, but
   we recommend NCSA Mosaic. It is a very flexible and powerful browser
   with a graphical user interface, available for Unix boxes using
   X11/Motif, for Apple Macintoshes and for Microsoft Windows. You can get
   it from the FTP site: ftp.ncsa.uiuc.edu.

   To access  all the  data available  from SWISS-2DPAGE,  the user's local
   computer needs  to run  an image  viewing program.  For most browsers on
   Unix workstations  the default  program is  xv, a shareware application.
   Similar Windows  or Apple  shareware or  public domain  applications are
   also available.

   For more  information on  the  ExPASy  WWW  server,  you  can  read  the
   following article:

      Appel R.D., Sanchez J.-C., Bairoch A., Golaz O., Miu M., Pasquali C.,
      Reynaldo Vargas J., Hughes G.J., Hochstrasser D.F.
      Electrophoresis 14:1232-1238(1993).

   Or you can contact Dr. Ron Appel:

      Email: appel@cih.hcuge.ch
      Fax: +41-22-372 61 98

        2.3.2  Changes to the WWW ExPASy server

   There have been quite a number of changes to the server in the last
   three months. We want to list specifically the following enhancements:

   -  It is  now possible to retrieve the Medline abstract of any reference
      in SWISS-PROT.
   -  Full text searches of SWISS-PROT have been implemented.
   -  The data  available on the server includes the latest full release of
      SWISS-PROT as well as the cumulative weekly updates.
   -  Most SWISS-PROT documents, such as the new indices for the model
      organisms, are available as hypertext documents.
   -  The SWISS-2DPAGE  part of  the server  has been greatly enhanced with
      new functionalities.

   2.4  Changes in the DR line

   We have  added cross-references  to the  Dictyostelium discoideum genome
   database (DictyDB)  (see section  2.2  of  these  notes).  These  cross-
   references are present in the DR lines:

   Data bank identifier:  DICTYDB
   Primary identifier:    Unique identifier  attributed by  DictyDB to  the
                          gene coding for the protein.
   Secondary identifier:  The gene  designation (name).  A "-"  is  present
                          when no gene name has yet been assigned.
   Example:               DR   DICTYDB; DD01047; MYOA.
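
   For software developers, the three fields of such a DR line can be
   separated as in the following sketch. It assumes only the layout shown
   in the example above (fields separated by ";" and a final "."); the
   class and function names, and the identifier DD00001 used in the second
   call, are hypothetical and serve purely as illustration.

```python
from typing import NamedTuple, Optional


class DictyDbXref(NamedTuple):
    primary_id: str           # identifier attributed by DictyDB to the gene
    gene_name: Optional[str]  # None when the DR line carries "-"


def parse_dictydb_dr(line: str) -> Optional[DictyDbXref]:
    """Return the DictyDB cross-reference of a DR line, or None if it is not one."""
    if not line.startswith("DR"):
        return None
    # Drop the line code, surrounding blanks and the terminating period.
    payload = line[2:].strip().rstrip(".")
    fields = [field.strip() for field in payload.split(";")]
    if len(fields) != 3 or fields[0] != "DICTYDB":
        return None
    gene_name = None if fields[2] == "-" else fields[2]
    return DictyDbXref(primary_id=fields[1], gene_name=gene_name)


xref = parse_dictydb_dr("DR   DICTYDB; DD01047; MYOA.")
unnamed = parse_dictydb_dr("DR   DICTYDB; DD00001; -.")  # hypothetical entry
print(xref)
print(unnamed)
```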


   2.5  Weekly updates of SWISS-PROT

   Since release 24, we have provided weekly updates of SWISS-PROT.

   [Note: because we were in the process of 'cleaning up' the annotations
   of many entries (see section 2.1), we temporarily stopped providing
   weekly updates from December 1993 to February 1994.]

   The weekly  updates are  available by  anonymous FTP.  Three  files  are
   updated at each update:

   new_seq.dat    Contains all the new entries since the last full release.
   upd_seq.dat    Contains the entries for which the sequence data has been
                  updated since the last release.
   upd_ann.dat    Contains the  entries for  which one  or more  annotation
                  fields have been updated since the last release.

   Currently these files are available on the following anonymous ftp
   servers:

   Organization   ExPASy (Geneva University Expert Protein Analysis System)
   Address        expasy.hcuge.ch  (or 129.195.254.61)
   Directory      /databases/swiss-prot/updates

   Organization   National Center for Biotechnology Information (NCBI)
   Address        ncbi.nlm.nih.gov (or 130.14.20.1)
   Directory      /repository/swiss-prot/updates

   Organization   EMBL ftp server
   Address        ftp.embl-heidelberg.de (or 192.54.41.33)
   Directory      /pub/databases/swissprot/new

   !!! Important notes !!!

   Although we try to follow a regular schedule, we do not promise to
   update these files every week. In some cases two weeks will elapse
   between two updates.

   Due to the current mechanism used to build a release, the entries that
   are provided in these updates are not guaranteed to be error free. Also,
   for the same reason, new entries do not contain an OC (Organism
   Classification) line.



                            3. ENZYME AND PROSITE

   3.1  The ENZYME data bank

   Release 15.0  of the  ENZYME data bank is distributed with release 28 of
   SWISS-PROT. ENZYME  release 15.0  contains information  relative to 3489
   enzymes.


   3.2  The PROSITE data bank

        3.2.1  Release 11.1

   Release 11.1 of the PROSITE data bank is distributed with release 28 of
   SWISS-PROT. Release 11.1 contains 715 documentation chapters that
   describe 926 different patterns. Release 11.1 does not really represent
   a new release; the only changes between releases 11.0 and 11.1 are
   updates of the pointers to the SWISS-PROT entries whose names have been
   modified between releases 27 and 28. The next release of PROSITE (12.0)
   will be distributed with release 29 of SWISS-PROT.

        3.2.2  Future developments

   Starting with the next major release (12.0 of June 1994), PROSITE will
   be extended to include weight matrices (also known as profiles). There
   are a number of protein families as well as functional or structural
   domains that cannot be detected using patterns due to their extreme
   sequence divergence. Typical examples of important functional domains
   which are weakly conserved are the immunoglobulin domains, the SH2 and
   SH3 domains, or the fibronectin type III domain. In such domains there
   are only a few sequence positions which are well conserved. Any attempt
   to build a consensus pattern for such regions will either fail to pick
   up a significant proportion of the protein sequences that contain such
   a region (false negatives) or will pick up too many proteins that do
   not contain the region (false positives). The use of techniques based
   on weight matrices or profiles allows the detection of such proteins or
   domains. Dr. Philipp Bucher at ISREC in Lausanne and I are collaborating
   to include such methods in PROSITE. This collaboration also includes
   other participants such as Roland Luethy (AMGEN), Michael Gribskov
   (SDSC) and Steve Altschul (NCBI). If you are interested in participating
   in this project please contact Philipp Bucher at:

                          pbucher@isrec-sun1.unil.ch

   Important notice for software developers: the integration of profiles
   into PROSITE will not "break" the current format. The profile entries
   in the PROSITE.DAT file will be tagged with the token "MATRIX" on the
   "ID" line (currently, only "PATTERN" and "RULE" are used as tokens); a
   new line type "MA" will be used in these entries to store all the
   parameters specific to weight matrices. The format of the PROSITE.DOC
   file will not be changed.

   The full  description of  the format  of the  PROSITE profile extensions
   will be  available in  a couple  of weeks as a user's manual (file name:
   PROFILE.TXT) that will be posted on the ExPASy and NCBI file servers.

   Organization   ExPASy (Geneva University Expert Protein Analysis System)
   Address        expasy.hcuge.ch  (or 129.195.254.61)
   Directory      /databases/prosite

   Organization   National Center for Biotechnology Information (NCBI)
   Address        ncbi.nlm.nih.gov (or 130.14.20.1)
   Directory      /repository/prosite


   Here is an example of a PROSITE profile:


ID   SH3; MATRIX.
AC   PS90001;
DT   JUN-1994 (CREATED); JUN-1994 (DATA UPDATE); JUN-1994 (INFO UPDATE).
DE   SH3 domain profile.
CC   /TAXO-RANGE=??E??; /MAX-REPEAT=2;
MA   /GENERAL_SPEC: ALPHABET='ACDEFGHIKLMNPQRSTVWY';
MA   /DISJOINT: DEFINITION=PROTECT; N1=1; N2=53;
MA   /NORMALIZATION: MODE=1; FUNCTION=GRIBSKOV; R1=2.97; R2=-0.0035;
MA      R3=0.7386; R4=-1.001; R5=0.208; TEXT='ZScore';
MA   /NORMALIZATION: MODE=2; FUNCTION=LINEAR; R1=0.0; R2=100.0;
MA      TEXT='OrigScore';
MA   /CUT_OFF: LEVEL=0; SCORE=600; N_SCORE=7.0; MODE=1;
MA   /DEFAULT: MI=-26; I=-3; IM=0; MD=-26; D=-3; DM=0;
MA   /M: SY='F';M=-2,-3,-3,-4,2,-3,-2,1,-2,0,-1,-2,-3,-3,-4,-2,-1,0,-5,2;
MA   /M: SY='I';M=-1,-5,-2,-3,-2,-3,0,1,1,-1,1,-1,-2,-1,1,-1,0,1,-4,-4;
MA   /M: SY='A';M=2,-3,1,0,-5,2,-2,-1,-1,-3,-2,1,1,0,-2,2,2,0,-8,-5;
MA   /M: SY='L';M=-3,-8,-5,-4,2,-6,-2,2,-4,6,4,-3,-3,-2,-3,-3,-2,1,-3,0;
MA   /M: SY='Y';M=-4,-2,-6,-6,9,-7,0,-1,-5,-1,-3,-3,-6,-5,-6,-4,-4,-4,-1,11;
MA   /M: SY='D';M=1,-6,3,3,-7,0,0,-2,-1,-4,-3,2,0,1,-2,0,0,-2,-9,-6;
MA   /M: SY='Y';M=-5,-3,-6,-6,10,-7,-1,-1,-2,-1,-2,-3,-6,-5,-5,-4,-4,-4,-1,11;
MA   /M: SY='K';M=-1,-6,1,1,-4,-2,0,-2,2,-3,-1,1,-1,1,1,0,0,-3,-7,-6;
MA   /M: SY='A';M=1,-4,1,0,-5,1,-1,-1,0,-3,-1,1,0,0,0,1,1,-1,-7,-6;
MA   /M: SY='R';M=0,-5,0,0,-5,-1,0,-1,1,-3,-1,1,0,1,1,0,0,-2,-5,-5;
MA   /M: SY='R';M=0,-5,1,1,-6,0,1,-2,1,-4,-2,1,0,1,2,1,0,-2,-5,-5;
MA   /M: SY='E';M=1,-6,2,2,-6,0,0,-2,-1,-4,-2,1,1,1,-1,0,0,-3,-8,-6;
MA   /M: SY='D';M=0,-6,2,2,-6,0,1,-3,0,-5,-3,2,-1,2,-1,0,0,-4,-7,-4;
MA   /M: SY='D';M=0,-8,4,3,-6,0,0,-2,-1,-3,-2,2,-2,2,-2,0,-1,-3,-9,-6;
MA   /M: SY='L';M=-2,-8,-5,-5,2,-5,-3,3,-4,7,5,-4,-3,-3,-4,-3,-2,3,-4,-2;
MA   /M: SY='S';M=1,-4,1,1,-5,1,0,-2,1,-4,-2,1,0,0,0,1,1,-2,-6,-5;
MA   /M: SY='F';M=-3,-7,-6,-6,6,-5,-3,3,-2,5,3,-4,-5,-4,-5,-4,-3,1,-3,3;
MA   /M: SY='Q';M=-1,-6,0,0,-3,-2,1,-1,1,-2,0,0,-1,1,1,-1,0,-1,-6,-4;
MA   /M: SY='K';M=-1,-8,0,1,-3,-2,0,-2,3,-3,0,1,0,2,2,0,0,-3,-6,-6;
MA   /M: SY='G';M=2,-5,1,0,-7,7,-3,-4,-2,-6,-4,1,-1,-2,-4,2,0,-2,-10,-8;
MA   /M: SY='D';M=1,-7,5,4,-8,1,1,-3,0,-5,-3,2,-1,2,-2,0,0,-4,-10,-6;
MA   /M: SY='I';M=0,-5,-1,-2,-2,-2,-1,2,0,0,1,-1,-2,0,0,-1,0,1,-6,-5;
MA   /M: SY='L';M=-2,-6,-5,-5,3,-5,-3,4,-3,6,4,-4,-4,-3,-4,-3,-2,3,-5,0;
MA   /M: SY='Q';M=-1,-5,-1,-1,-3,-2,0,0,0,-2,-1,0,-1,0,0,-1,0,-1,-6,-3;
MA   /M: SY='V';M=0,-4,-3,-4,-1,-3,-3,5,-3,3,3,-2,-2,-2,-3,-2,0,5,-8,-4;
MA   /M: SY='L';M=-1,-6,-3,-3,-1,-3,-2,2,-3,3,2,-2,-2,-2,-3,-2,-1,2,-5,-3;
MA   /M: SY='D';M=0,-6,3,3,-6,0,1,-3,2,-5,-2,2,-1,2,1,0,0,-4,-7,-5;
MA   /M: SY='K';M=-1,-6,0,0,-2,-1,0,-3,3,-4,-1,1,-1,0,1,0,0,-3,-6,-4;
MA   /M: SY='N';M=1,-4,1,1,-5,0,0,-2,0,-3,-2,1,1,0,-1,1,1,-1,-7,-5;
MA      /I: MI=0; I=-1; MD=0; /M SY='X'; M=0; D=-1;
MA   /M: SY='G';M=1,-5,0,0,-5,1,-2,-1,-2,-3,-2,0,0,-1,-2,0,0,-1,-8,-6;
MA   /M: SY='G';M=1,-6,3,3,-7,3,0,-4,-1,-5,-4,2,-1,1,-2,1,0,-3,-10,-6;
MA   /M: SY='W';M=-9,-12,-9,-11,1,-11,-4,-8,-5,-3,-6,-6,-8,-7,3,-4,-8,-9,26,0;
MA   /M: SY='W';M=-7,-9,-9,-9,0,-9,-4,-5,-5,-1,-4,-6,-7,-6,2,-3,-6,-6,18,-1;
MA   /M: SY='K';M=-1,-7,0,0,-3,-2,0,-2,2,-3,-1,1,-1,1,2,0,-1,-3,-5,-5;
MA   /M: SY='G';M=2,-3,0,-1,-6,3,-3,-2,-3,-4,-3,0,0,-2,-3,1,0,0,-10,-6;
MA   /M: SY='Q';M=-2,-6,0,0,-3,-3,1,-2,0,-2,-1,0,-2,1,1,-1,-1,-3,-5,-3;
MA      /I: MI=0; I=-2; MD=0; /M SY='X'; M=0; D=-2;
MA   /M: SY='T';M=0,-4,-1,-1,-4,0,-2,0,-1,-2,0,0,-1,-1,-1,0,1,0,-7,-5;
MA   /M: SY='T';M=0,-5,0,0,-3,-1,-1,-1,1,-3,-1,1,-1,0,0,1,1,-1,-6,-4;
MA   /M: SY='G';M=0,-5,0,-1,-5,3,-2,-3,-1,-5,-3,0,-1,-1,-1,1,0,-2,-7,-6;
MA   /M: SY='K';M=0,-6,1,1,-5,-1,1,-2,2,-4,-1,1,-1,2,2,0,0,-3,-6,-6;
MA   /M: SY='R';M=-1,-6,-1,-1,-5,-3,1,-1,1,-3,-1,0,-1,1,3,-1,-1,-2,-2,-6;
MA   /M: SY='G';M=1,-5,0,0,-6,6,-3,-3,-3,-5,-4,0,-1,-2,-4,1,0,-2,-10,-6;
MA   /M: SY='W';M=-5,-5,-5,-5,2,-6,-2,-2,-4,-1,-3,-3,-6,-5,-3,-3,-4,-4,4,3;
MA   /M: SY='F';M=-3,-5,-6,-6,6,-5,-3,4,-1,3,2,-4,-4,-5,-4,-3,-2,2,-4,3;
MA   /M: SY='P';M=2,-4,-1,-1,-7,-1,0,-3,-2,-4,-3,-1,8,0,0,1,0,-2,-8,-7;
MA   /M: SY='G';M=1,-3,0,0,-4,2,-1,-2,0,-3,-2,0,0,-1,-1,1,1,-1,-6,-5;
MA   /M: SY='N';M=1,-5,2,1,-5,0,1,-2,1,-4,-2,2,0,0,0,1,1,-2,-7,-4;
MA   /M: SY='Y';M=-5,-1,-7,-7,10,-8,-1,-1,-5,-1,-3,-3,-7,-6,-6,-4,-4,-5,0,13;
MA   /M: SY='V';M=0,-3,-3,-5,-2,-2,-3,5,-3,2,2,-2,-2,-3,-4,-1,0,5,-8,-5;
MA   /M: SY='E';M=1,-6,2,3,-6,0,0,-2,1,-4,-2,1,0,2,0,0,0,-3,-8,-6;
MA   /M: SY='P';M=0,-5,-1,-1,-2,-2,-1,-2,-1,-3,-2,0,1,-1,-2,0,-1,-2,-6,-3;
//
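
   To illustrate how the "/M:" lines of such an entry can be read, the
   sketch below parses the first three match positions of the example and
   sums the scores of an ungapped sequence segment against them. It is not
   the profile search method itself: the insert/delete parameters ("/I:",
   "/DEFAULT") and the normalization functions are ignored, and the
   function names are assumptions made for this example. The only facts
   taken from the entry are the alphabet order given on the GENERAL_SPEC
   line and the layout of the "/M:" lines.

```python
ALPHABET = "ACDEFGHIKLMNPQRSTVWY"  # order given on the /GENERAL_SPEC line

# First three match positions, copied verbatim from the example entry.
MA_LINES = [
    "MA   /M: SY='F';M=-2,-3,-3,-4,2,-3,-2,1,-2,0,-1,-2,-3,-3,-4,-2,-1,0,-5,2;",
    "MA   /M: SY='I';M=-1,-5,-2,-3,-2,-3,0,1,1,-1,1,-1,-2,-1,1,-1,0,1,-4,-4;",
    "MA   /M: SY='A';M=2,-3,1,0,-5,2,-2,-1,-1,-3,-2,1,1,0,-2,2,2,0,-8,-5;",
]


def parse_match_line(line: str) -> tuple[str, dict[str, int]]:
    """Split one '/M:' line into its consensus symbol and residue scores."""
    body = line.split("/M:", 1)[1]
    symbol = body.split("SY='", 1)[1][0]
    values = body.split(";M=", 1)[1].strip().rstrip(";")
    scores = [int(value) for value in values.split(",")]
    if len(scores) != len(ALPHABET):
        raise ValueError("expected one score per alphabet symbol")
    return symbol, dict(zip(ALPHABET, scores))


def ungapped_score(columns: list[dict[str, int]], segment: str) -> int:
    """Sum the match scores of a segment aligned without gaps."""
    if len(segment) != len(columns):
        raise ValueError("segment and profile must have the same length")
    return sum(column[residue] for column, residue in zip(columns, segment))


parsed = [parse_match_line(line) for line in MA_LINES]
consensus = "".join(symbol for symbol, _ in parsed)
columns = [scores for _, scores in parsed]

print(consensus)                       # FIA
print(ungapped_score(columns, "FIA"))  # 5
print(ungapped_score(columns, "WWW"))  # -17
```

   A segment that resembles the consensus accumulates a positive score
   while an unrelated one is penalized at every position, which is what
   allows weakly conserved domains to be detected without requiring an
   exact pattern match.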




<PAGE>


                             WE NEED YOUR HELP !

   We welcome feedback from our users. We would especially appreciate it if
   you would notify us when sequences belonging to your field of expertise
   are missing from the data bank. We would also like to be notified about
   annotations that need updating, for example if the function of a protein
   has been clarified or if new post-translational information has become
   available.


<PAGE>


                         APPENDIX A: SOME STATISTICS



   A.1  Amino acid composition

        A.1.1  Composition in percent for the complete data bank

   Ala (A) 7.62   Gln (Q) 4.02   Leu (L) 9.20   Ser (S) 7.13
   Arg (R) 5.23   Glu (E) 6.26   Lys (K) 5.83   Thr (T) 5.82
   Asn (N) 4.47   Gly (G) 6.98   Met (M) 2.36   Trp (W) 1.29
   Asp (D) 5.28   His (H) 2.25   Phe (F) 4.01   Tyr (Y) 3.22
   Cys (C) 1.77   Ile (I) 5.59   Pro (P) 5.01   Val (V) 6.52

   Asx (B) 0.005  Glx (Z) 0.005  Xaa (X) 0.02


        A.1.2  Classification of the amino acids by their frequency

   Leu, Ala, Ser, Gly, Val, Glu, Lys, Thr, Ile, Asp, Arg, Pro, Asn, Gln,
   Phe, Tyr, Met, His, Cys, Trp
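
   The ranking in A.1.2 follows directly from the percentages in A.1.1, as
   the following sketch shows. The values are copied from the table above
   (the ambiguity codes B, Z and X are left out); the variable names are
   illustrative only.

```python
# Percentages copied from table A.1.1 (standard amino acids only).
composition = {
    "Ala": 7.62, "Gln": 4.02, "Leu": 9.20, "Ser": 7.13,
    "Arg": 5.23, "Glu": 6.26, "Lys": 5.83, "Thr": 5.82,
    "Asn": 4.47, "Gly": 6.98, "Met": 2.36, "Trp": 1.29,
    "Asp": 5.28, "His": 2.25, "Phe": 4.01, "Tyr": 3.22,
    "Cys": 1.77, "Ile": 5.59, "Pro": 5.01, "Val": 6.52,
}

# Sort the three-letter codes from the most to the least frequent residue.
ranked = sorted(composition, key=composition.get, reverse=True)
total_percent = sum(composition.values())

print(", ".join(ranked))
print(f"sum of the twenty percentages: {total_percent:.2f}")
```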



   A.2  Distribution of the sequences by their organism of origin

   Total number of species represented in this release of SWISS-PROT: 4283

        A.2.1 Table of the frequency of occurrence of species

        Species represented 1x: 1939
                            2x:  703
                            3x:  398
                            4x:  250
                            5x:  179
                            6x:  160
                            7x:   94
                            8x:   68
                            9x:   87
                           10x:   47
                       11- 20x:  170
                       21- 50x:  111
                       51-100x:   36
                         >100x:   41
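
   The classes of table A.2.1 add up to the total number of species quoted
   above. A minimal check, with the counts copied from the table and
   illustrative variable names:

```python
# Number of species per frequency class, copied from table A.2.1.
species_per_class = {
    "1x": 1939, "2x": 703, "3x": 398, "4x": 250, "5x": 179,
    "6x": 160, "7x": 94, "8x": 68, "9x": 87, "10x": 47,
    "11-20x": 170, "21-50x": 111, "51-100x": 36, ">100x": 41,
}

total_species = sum(species_per_class.values())
# Share of species that are represented by a single sequence entry.
single_entry_share = 100.0 * species_per_class["1x"] / total_species

print(total_species)
print(f"{single_entry_share:.1f}% of the species occur only once")
```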


<PAGE>




        A.2.2  Table of the most represented species


    Number   Frequency          Species
         1        2663          Human
         2        2555          Escherichia coli
         3        1731          Baker's yeast (Saccharomyces cerevisiae)
         4        1578          Mouse
         5        1453          Rat
         6         679          Bovine
         7         672          Caenorhabditis elegans
         8         600          Fruit fly (Drosophila melanogaster)
         9         529          Bacillus subtilis
        10         515          Chicken
        11         380          African clawed frog (Xenopus laevis)
        12         367          Rabbit
        13         352          Salmonella typhimurium
        14         327          Pig
        15         251          Vaccinia virus (strain Copenhagen)
        16         236          Maize
        17         211          Arabidopsis thaliana (Mouse-ear cress)
        18         200          Bacteriophage T4
        19         193          Human cytomegalovirus (strain AD169)
        20         183          Slime mold (Dictyostelium discoideum)
                   183          Vaccinia virus (strain WR)
        22         182          Rice
        23         176          Pseudomonas aeruginosa
        24         170          Tobacco
        25         169          Pea
        26         165          Wheat
        27         164          Fission yeast (Schizosaccharomyces pombe)
        28         149          Barley
        29         146          Variola virus
        30         138          Dog
        31         136          Soybean
        32         131          Staphylococcus aureus
                   131          Sheep
        34         129          Spinach
        35         122          Neurospora crassa
        36         120          Marchantia polymorpha (Liverwort)
        37         118          Rhodobacter capsulatus
        38         116          Pseudomonas putida
        39         112          Klebsiella pneumoniae
        40         108          Agrobacterium tumefaciens


<PAGE>




   A.3  Distribution of the sequences by size



               From   To  Number             From   To   Number
                  1-  50    2081             1001-1100      342
                 51- 100    3595             1101-1200      227
                101- 150    4999             1201-1300      176
                151- 200    3459             1301-1400      103
                201- 250    3057             1401-1500      102
                251- 300    2681             1501-1600       51
                301- 350    2516             1601-1700       49
                351- 400    2572             1701-1800       40
                401- 450    1943             1801-1900       43
                451- 500    2065             1901-2000       31
                501- 550    1413             2001-2100       17
                551- 600    1014             2101-2200       43
                601- 650     714             2201-2300       50
                651- 700     533             2301-2400       18
                701- 750     514             2401-2500       22
                751- 800     393             >2500          114
                801- 850     305
                851- 900     319
                901- 950     199
                951-1000     200
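
   The size classes above can be summed to recover the total number of
   entries, and the class containing the median sequence length can be
   located from the cumulative counts. The sketch below copies the counts
   from the table; the variable names and the way the median class is
   located are choices made for this illustration.

```python
# (size class, number of sequences), copied from table A.3 in order of length.
size_classes = [
    ("1-50", 2081), ("51-100", 3595), ("101-150", 4999), ("151-200", 3459),
    ("201-250", 3057), ("251-300", 2681), ("301-350", 2516), ("351-400", 2572),
    ("401-450", 1943), ("451-500", 2065), ("501-550", 1413), ("551-600", 1014),
    ("601-650", 714), ("651-700", 533), ("701-750", 514), ("751-800", 393),
    ("801-850", 305), ("851-900", 319), ("901-950", 199), ("951-1000", 200),
    ("1001-1100", 342), ("1101-1200", 227), ("1201-1300", 176),
    ("1301-1400", 103), ("1401-1500", 102), ("1501-1600", 51),
    ("1601-1700", 49), ("1701-1800", 40), ("1801-1900", 43),
    ("1901-2000", 31), ("2001-2100", 17), ("2101-2200", 43),
    ("2201-2300", 50), ("2301-2400", 18), ("2401-2500", 22), (">2500", 114),
]

total_sequences = sum(count for _, count in size_classes)

# Walk through the classes until half of all sequences have been counted.
running = 0
median_class = None
for label, count in size_classes:
    running += count
    if running >= total_sequences / 2:
        median_class = label
        break

print(total_sequences)
print(median_class)
```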




   Currently the ten longest sequences are:


                            HTS1_COCCA  5217 a.a.
                             FAT_DROME  5147 a.a.
                            RYNR_RABIT  5037 a.a.
                            RYNR_HUMAN  5032 a.a.
                            RYNC_RABIT  4969 a.a.
                            DYHC_DICDI  4725 a.a.
                             APB_HUMAN  4563 a.a.
                            APOA_HUMAN  4548 a.a.
                            RRPA_CVMJH  4488 a.a.
                            DYHC_TRIGR  4466 a.a.


<PAGE>


                         APPENDIX B: ON-LINE EXPERTS



   B.1  List of on-line experts for PROSITE and SWISS-PROT


Field of expertise            Name               Email address
---------------------------   ------------------ ----------------------------
Alcohol dehydrogenases        Joernvall H.       hans.jornvall@k1m.ki.se
                              Persson B.         bengt.persson@embl-heidelberg.de
Aldehyde dehydrogenases       Joernvall H.       hans.jornvall@k1m.ki.se
                              Persson B.         bengt.persson@embl-heidelberg.de
Alpha-crystallins/HSP-20      Leunissen J.A.M.   jackl@caos.caos.kun.nl
                              de Jong W.         u629000@hnykun11.bitnet
Alpha-2-macroglobulins        Van Leuven F.      fred@blekul13.bitnet
AA-tRNA synthetases class II  Leberman R.        leberman@frembl51.bitnet
Apolipoproteins               Boguski M.S.       boguski@ncbi.nlm.nih.gov
AraC family HTH proteins      Ramos J.L.         jlramos@cnbvx3.cnb.uam.es
Arrestins                     Kolakowski L.F.Jr. lfk@receptor.mgh.harvard.edu
Asparaginase / glutaminase    Gribskov M.        gribskov@sdsc.edu
ATP synthase c subunit        Recipon H.         recipon@ncbi.nlm.nih.gov
Band 4.1 family proteins      Rees J.            jrees@vax.oxford.ac.uk
Beta-lactamases               Brannigan J.       jab5@vaxa.york.ac.uk
Beta-transducin family        Boguski M.S.       boguski@ncbi.nlm.nih.gov
C-type lectin domain          Drickamer K.       drick@cuhhca.hhmi.columbia.edu
Chalcone/stilbene synthases   Schroeder J.       raf@sun1.ruf.uni-freiburg.de
Chaperonins cpn10/cpn60       Georgopoulos C.    georgopo@cmu.unige.ch
Chaperonins TCP1 family       Willison K.R.      willison@icr.ac.uk
Chitinases                    Henrissat B.       bernie@cermav.grenet.fr
Clusterin                     Peitsch M.C.       mcp13936@ggr.co.uk
Cold shock domain             Landsman D.        landsman@ncbi.nlm.nih.gov
CTF/NF-I                      Mermod N.          nmermod@ulys.unil.ch
                              Gronostajski R.    gronosr@ccsmtp.ccf.org
Cytochromes P450              Holsztynska E.J.   ela@netcom.uucp
                                                 netcom!ela@apple.com
DEAD-box helicases            Linder P.          linder@urz.unibas.ch
Deoxyribonuclease I           Peitsch M.C.       mcp13936@ggr.co.uk
dnaJ family                   Kelley W.          kelley@cmu.unige.ch
EF-hand calcium-binding       Cox J.A.           cox@sc2a.unige.ch
                              Kretsinger R.H.    rhk5i@virginia.bitnet
Elongation factor 1           Amons R.           wmbamons@rulgl.leidenuniv.nl
Enoyl-CoA hydratase           Hofmann K.O.       khofmann@biomed.biolan.uni-koeln.de
Fatty acid desaturases        Piffanelli P.      piffanelli@jii.afrc.ac.uk
fruR/lacI family HTH proteins Reizer J.          jreizer@ucsd.edu
GATA-type zinc-fingers        Boguski M.S.       boguski@ncbi.nlm.nih.gov
GDP/GTP dissociation stimul.  Boguski M.S.       boguski@ncbi.nlm.nih.gov
GltP family of transporters   Hofmann K.O.       khofmann@biomed.biolan.uni-koeln.de
Glucanases                    Henrissat B.       bernie@cermav.grenet.fr
                              Beguin P.          phycel@pasteur.bitnet
Glutamine synthetase          Tateno Y.          ytateno@genes.nig.ac.jp
G-protein coupled receptors   Chollet A.         arc3029@ggr.co.uk
                              Attwood T.K.       bph6tka@biovax.leeds.ac.uk
                              Kolakowski L.F.Jr. lfk@receptor.mgh.harvard.edu
GTPase-activating proteins    Boguski M.S.       boguski@ncbi.nlm.nih.gov
HIT family                    Seraphin B.        seraphin@embl-heidelberg.de
HMG1/2 and HMG-14/17          Landsman D.        landsman@ncbi.nlm.nih.gov
Inorganic pyrophosphatases    Kolakowski L.F.Jr. lfk@receptor.mgh.harvard.edu
Integrases                    Roy P.H.           2020000@saphir.ulaval.ca
Kringle domain                Ikeo K.            kikeo@genes.nig.ac.jp
Lipocalins                    Boguski M.S.       boguski@ncbi.nlm.nih.gov
                              Peitsch M.C.       mcp13936@ggr.co.uk
lysR family HTH proteins      Henikoff S.        henikoff@sparky.fhcrc.org
MAC components / perforin     Peitsch M.C.       mcp13936@ggr.co.uk
Malic enzymes                 Glynias M.         mglynias@ncsa.uiuc.edu
MAM domain                    Bork P.            bork@embl-heidelberg.de
MIP family proteins           Reizer J.          jreizer@ucsd.edu
Myelin proteolipid protein    Hofmann K.O.       khofmann@biomed.biolan.uni-koeln.de
Pancreatic trypsin inhibitor  Ikeo K.            kikeo@genes.nig.ac.jp
PEP requiring enzymes         Reizer J.          jreizer@ucsd.edu
pfkB carbohydrate kinases     Reizer J.          jreizer@ucsd.edu
Phosphomannose isomerases     Proudfoot A.E.I.   aep6830@ggr.co.uk
Phytochromes                  Partis M.D.        partis@afrc.ac.uk
Plant viruses icosahedral     Koonin E.V.        koonin@ncbi.nlm.nih.gov
capsid proteins
Protein kinases               Quinn A.M.         quinn@biomed.med.yale.edu
                              Hunter T.          hunter@salk-sc2.sdsc.edu
PTS proteins                  Reizer J.          jreizer@ucsd.edu
Restriction-modification      Bickle T.          bickle@urz.unibas.ch
            enzymes           Roberts R.J.       roberts@neb.com
Ribosomal protein S15         Ellis S.R.         srelli01@ulkyvm.bitnet
Ring-cleavage dioxygenases    Harayama S.        sharayam@ddbj.nig.ac.jp
Signal sequence peptidases    von Heijne G.      gvh@csb.ki.se
                              Dalbey R.E.        rdalbey@magnus.acs.ohio-state.edu
Sodium symporters             Reizer J.          jreizer@ucsd.edu
Subtilases                    Brannigan J.       jab5@vaxa.york.ac.uk
                              Siezen R.J.        nizo@caos.caos.kun.nl
Thiol proteases               Turk B.            turk@ijs.ac.mail.yu
Thiol proteases inhibitors    Turk B.            turk@ijs.ac.mail.yu
TNF family                    Jongeneel C.V.     vjongene@isrecmail.unil.ch
TPR repeats                   Boguski M.S.       boguski@ncbi.nlm.nih.gov
Transit peptides              von Heijne G.      gvh@csb.ki.se
Type-II membrane antigens     Levy S.            levy@cellbio.stanford.edu
Uracil-DNA glycosylase        Aasland R.         aasland@embl-heidelberg.de
Vitamin K-depend. Gla domain  Price P.A.         pprice@ucsd.edu
XPGC protein                  Clarkson S.G.      clarkson@cmu.unige.ch
Xylose isomerase              Jenkins J.         jenkins@frira.afrc.ac.uk
WAP-type domain               Claverie J.-M.     jmc@ncbi.nlm.nih.gov
ZP domain                     Bork P.            bork@embl-heidelberg.de

African swine fever virus     Yanez R.J.         ryanez@cbm2.uam.es
Bacteriophage P4              Halling C.         chh9@midway.uchicago.edu
Caenorhabditis elegans        Sonnhammer E.      esr@mrc-molecular.biology.cam.ac.uk
Chloroplast encoded proteins  Hallick R.B.       hallick@arizona.edu
Dictyostelium discoideum      Smith D.W.         dsmith@ucsd.edu
Drosophila                    Ashburner M.       ma11@phx.cam.ac.uk
Escherichia coli              Rudd K.            rudd@ncbi.nlm.nih.gov
Salmonella typhimurium        Rudd K.            rudd@ncbi.nlm.nih.gov
Snakes                        Stocklin R.        stocklin@cmu.unige.ch



<PAGE>




   B.2  Requirements to fulfill to become an on-line expert

   An expert should be a scientist working with specific family(ies) of
   proteins (or specific domains) and who would:

   a) Review the protein sequences in SWISS-PROT and the patterns/matrices
      in PROSITE relevant to their field of research.
   b) Agree to be contacted by people who have obtained new sequence(s)
      which seem to belong to "their" family(ies) of proteins.
   c) Have access  to electronic  mail and be willing to use it to send and
      receive data.

   If you are willing to be part of this scheme please contact Amos Bairoch
   at the following electronic mail address:

                             bairoch@cmu.unige.ch


<PAGE>




           APPENDIX C: RELATIONSHIPS BETWEEN BIOMOLECULAR DATABASES

   The current  status of the relationships (cross-references) between some
   biomolecular databases is shown in the following schematic:


                                                       **********************
                        ***********************        * EPD [Euk. Promot.] *
                        *  EMBL Nucleotide    * <----> **********************
                        *  Sequence Data      *
******************      *  Library            *        **********************
* FLYBASE        * <--> *********************** <----- * ECD [E. coli map]  *
* [Drosophila    *                ^      ^             **********************
* genomic d.b.]  * <--------+     |      |
******************          |     |      +------------ **********************
                            |     |                    * TFD [Trans. fact.] *
                            |     |      +-----------> **********************
******************          |     |      |
* WormPep        *          |     |      |             **********************
* [C.elegans]    * <----+   |     |      |    +------> * DictyDB [D.disco.] *
******************      |   |     |      |    |        **********************
                        |   |     |      |    |
******************      |   v     v      v    v        **********************
* REBASE         *      ***********************        * ENZYME [Nomencl.]  *
* [Restriction   * <--- *  SWISS-PROT         * <----- **********************
*  enzymes]      *      *  Protein Sequence   *            |
******************      *  Data Bank          *            v
                        ***********************        **********************
******************       ^  ^  |  |  ^   ^  |          * OMIM   [Diseases]  *
* EcoGene/EcoSeq *       |  |  |  |  |   |  +--------> **********************
* [E. coli]      * <-----+  |  |  |  |   |
******************          |  |  |  |   +-----------> **********************
                            |  |  |  |                 * ECO2DBASE     [2D] *
                            |  |  |  |                 **********************
******************          |  |  |  |
* PROSITE        * <--------+  |  |  +---------------> **********************
* [Patterns]     *             |  |                    * SWISS-2DPAGE  [2D] *
******************             |  +---------------+    **********************
             |                 v                  |
             |          ***********************   |    **********************
             +--------> * PDB [3D structures] *   +--> * Aarhus/Ghent  [2D] *
                        ***********************        **********************














Swiss-Prot release 27.0

Published October 1, 1993

                    SWISS-PROT RELEASE 27.0 RELEASE NOTES


                               1. INTRODUCTION

   1.1  Evolution

   Release 27.0  of SWISS-PROT  contains 33329 sequence entries, comprising
   11'484'420 amino acids abstracted from 32314 references. This represents
   an increase  of 5.6% over release 26. The recent growth of the data bank
   is summarized below.

   Release    Date   Number of entries     Nb of amino acids

   3.0        11/86               4160               969 641
   4.0        04/87               4387             1 036 010
   5.0        09/87               5205             1 327 683
   6.0        01/88               6102             1 653 982
   7.0        04/88               6821             1 885 771
   8.0        08/88               7724             2 224 465
   9.0        11/88               8702             2 498 140
   10.0       03/89              10008             2 952 613
   11.0       07/89              10856             3 265 966
   12.0       10/89              12305             3 797 482
   13.0       01/90              13837             4 347 336
   14.0       04/90              15409             4 914 264
   15.0       08/90              16941             5 486 399
   16.0       11/90              18364             5 986 949
   17.0       02/91              20024             6 524 504
   18.0       05/91              20772             6 792 034
   19.0       08/91              21795             7 173 785
   20.0       11/91              22654             7 500 130
   21.0       03/92              23742             7 866 596
   22.0       05/92              25044             8 375 696
   23.0       08/92              26706             9 011 391
   24.0       12/92              28154             9 545 427
   25.0       04/93              29955            10 214 020
   26.0       07/93              31808            10 875 091
   27.0       10/93              33329            11 484 420
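
   The growth figures quoted above can be re-derived from the last two
   rows of the table. The following Python sketch is illustrative only;
   the helper name percent_growth is our own and is not part of any
   SWISS-PROT software.

```python
# Figures copied from the table above (releases 26.0 and 27.0).
ENTRIES = {"26.0": 31808, "27.0": 33329}
AMINO_ACIDS = {"26.0": 10875091, "27.0": 11484420}


def percent_growth(old: int, new: int) -> float:
    """Return the relative increase from old to new, in percent."""
    return 100.0 * (new - old) / old


entry_growth = percent_growth(ENTRIES["26.0"], ENTRIES["27.0"])
residue_growth = percent_growth(AMINO_ACIDS["26.0"], AMINO_ACIDS["27.0"])
mean_length = AMINO_ACIDS["27.0"] / ENTRIES["27.0"]

print(f"entries:     +{entry_growth:.1f}%")
print(f"amino acids: +{residue_growth:.1f}%")
print(f"mean sequence length in 27.0: {mean_length:.0f} residues")
```

   Run on these numbers, the 5.6% quoted in the text corresponds to the
   growth in amino acids; the number of entries grew by about 4.8%.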

   1.2  Source of data

   Release 27.0  has been  updated using protein sequence data from release
   37.0 of  the PIR (Protein Identification Resource) protein data bank, as
   well as translation of nucleotide sequence data from release 36.0 of the
   EMBL Nucleotide Sequence Database.

   As an indication of the source of the sequence data in the SWISS-PROT
   data bank, we list here the statistics concerning the DR (Database
   cross-references) pointer lines:

   Entries with pointer(s) to only PIR entries:             4553
   Entries with pointer(s) to only EMBL entries:            4557
   Entries with pointer(s) to both EMBL and PIR entries:    23557
   Entries with no pointer lines:                             662
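
   As a consistency check, the four categories above should partition
   the data bank, so their counts should add up to the 33329 entries
   reported in section 1.1. A minimal Python sketch (the dictionary keys
   are our own labels):

```python
# Counts copied from the DR pointer statistics above.
dr_pointer_counts = {
    "PIR only": 4553,
    "EMBL only": 4557,
    "EMBL and PIR": 23557,
    "no pointer lines": 662,
}
TOTAL_ENTRIES = 33329  # release 27.0 total from section 1.1

total = sum(dr_pointer_counts.values())
assert total == TOTAL_ENTRIES, "categories should partition the data bank"

# Print each category with its share of the data bank.
for label, count in dr_pointer_counts.items():
    print(f"{label:<18}{count:>6}  {100.0 * count / total:5.1f}%")
```

   The check passes, and about 71% of the entries carry pointers to both
   EMBL and PIR.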



      2. DESCRIPTION OF THE CHANGES MADE TO SWISS-PROT SINCE RELEASE 26


   2.1  Sequences and annotations

   About 1532 sequences have been added since release 26, the sequence data
   of 213 existing entries has been updated, and the annotations of 3000
   entries have been revised. In particular, we have used review articles
   to update the annotations of the following groups or families of
   proteins:

   -  Aspartate and glutamate racemases
   -  Bacteriophage T4 proteins
   -  Band 3 anion proteins
   -  Beta amylases
   -  Cysteine synthases
   -  Deoxyribonuclease I
   -  Ependymins
   -  Epimorphin family
   -  GTP1/Obg family
   -  Fork head domain proteins
   -  G-linked receptors family 2
   -  Glutamate 5-kinase
   -  HIT family
   -  Lysyl oxidases
   -  mutT domain proteins
   -  Nitrilases / cyanide hydratase
   -  Peripherin / rom-1
   -  Phosphatidylinositol 3-kinases
   -  Pollen proteins Ole e I family
   -  Prokaryotic transglycosylases
   -  Protein prenyltransferases alpha subunit repeat
   -  Renal dipeptidases
   -  Trehalases



   2.2  A special emphasis on four "model" organisms

   We have selected four organisms that are the target of genome sequencing
   and/or mapping projects and for which we intend to:

   -  Be as complete as possible. All sequences available at a given time
      should be immediately included in SWISS-PROT. This also includes
      sequence corrections and updates.
   -  Provide a high level of annotation.
   -  Provide cross-references to specialized database(s) that contain,
      among other data, some genetic information about the genes that code
      for these proteins.
   -  Provide specific indices or documents.

   The four organisms selected are:

   o  Caenorhabditis elegans (worm)
   o  Drosophila melanogaster (fly)
   o  Escherichia coli
   o  Saccharomyces cerevisiae (yeast)

   Such a special effort has been going on for more than a year for
   Escherichia coli (thanks to a very fruitful collaboration with Ken
   Rudd). It has started in this release for yeast: about 300 new yeast
   sequences were entered, about 1500 entries were reannotated, and a new
   document (YEAST.TXT) is provided that lists yeast entries in SWISS-PROT
   classified by gene name and synonym(s). The next release will target
   C.elegans thanks to a collaboration with the group at the Sanger Genome
   Center in Hinxton (UK). The Drosophila "project" should start a bit
   later.

   Organism          Database                 Index file
                     X-referenced             provided
   --------------    ----------------------   --------------
   C.elegans         WormPep                  Next release
   D.melanogaster    Flybase                  In preparation
   E.coli            EcoGene                  ECOLI.TXT
   S.cerevisiae      LISTA (in preparation)   YEAST.TXT


   2.3  The Expasy World-Wide Web server

   Recent months have seen a tremendous increase in the availability of
   software tools and applications that make efficient use of the varied
   resources available on the Internet. Three of these Networked
   Information Retrieval (NIR) tools are widely used by the biological
   community: WAIS (Wide Area Information Server), Gopher and, more
   recently, the World-Wide Web (WWW). As many organizations already
   provide WAIS or Gopher servers that offer access to SWISS-PROT and
   PROSITE, we felt that there was no need to set up such a service
   ourselves. However, no such server was yet available for WWW.

   The World-Wide Web (WWW), which originated at CERN, is a powerful global
   information  system   merging  networked   information   retrieval   and
   hypertext. It  gives access, using hypertext links, to the documents and
   information contained  in all the existing WWW servers around the world,
   as well  as to  the data  obtainable through other information retrieval
   systems like WAIS, Gopher, X500, etc. To access a WWW server, one has to
   run on a local computer a client program (a WWW browser), which displays
   hypertext documents.  The user  can then either request a keyword search
   or jump to another document by following a hypertext link. WWW has the
   outstanding advantages of extending the hypertext model to the whole
   world (by allowing hypertext jumps to documents anywhere on the
   Internet) and of being device and user-interface independent (browsers
   exist for a variety of computers and user interfaces, including Unix
   workstations running X Windows, Macintoshes and PCs with Microsoft
   Windows).




   A WWW  server has  been set  up by  Ron Appel  from the  group of  Denis
   Hochstrasser at  the Faculty of Medicine of the University of  Geneva on
   the ExPASy  molecular biology  server. It allows access, using the user-
   friendly hypertext  model, to  the SWISS-PROT and SWISS-2DPAGE databases
   and, through  any SWISS-PROT  protein sequence entry, to other databases
   such as EMBL, PROSITE, REBASE, Flybase, PDB and OMIM. Using a browser
   that is able to display images, one can also remotely access 2D gel
   image data from SWISS-2DPAGE.

   A WWW  server can  be accessed  on  the  internet  through  its  Uniform
   Resource Locator  (URL), the addressing system defined by the WWW model.
   The URL for the ExPASy molecular biology WWW server is:

             http://expasy.hcuge.ch/

   or

             http://129.195.254.61/

   To access a WWW server, you need to run a browser (or client) program on
   your local computer. Browsers exist for a variety of machines and may be
   obtained by  anonymous ftp. Here is a selected list (taken from the CERN
   WWW server)  of currently  available browsers  and the  ftp address from
   which they can be retrieved:

   NCSA Mosaic    a very flexible and powerful browser with a graphical
                  user interface. Available for Unix boxes using X11/Motif,
                  for Macintoshes and for Microsoft Windows. FTP site:
                  ftp.ncsa.uiuc.edu (in /Web/xmosaic).

   lynx           a full screen browser for vt100s using full screen, arrow
                  keys, highlighting,  etc. FTP site: ftp2.cc.ukans.edu (in
                  /pub/lynx).

   www            a basic  line mode  browser giving access to WWW from any
                  dumb terminal. FTP site: info.cern.ch (in /pub/www).



   To access all the data available from SWISS-2DPAGE, the user's local
   computer needs to run an image viewing program. For most browsers on
   Unix workstations the default program is xv, a shareware application
   developed by John Bradley at the University of Pennsylvania. The program
   can be found by ftp at export.lcs.mit.edu (in /contrib).

   For more  information on  the ExPASy  WWW server, please contact Dr. Ron
   Appel:

        Email: appel@cih.hcuge.ch
        Tel: +41-22-372 62 64
        Fax: +41-22-372 61 98






   2.4  Changes in the DR line

   We have  added cross-references  to the  WormPep collection of candidate
   protein translations  from the  Caenorhabditis elegans genome sequencing
   project (see  section 2.2  of these  notes). These  cross-references are
   present in the DR lines:

   Data bank identifier: WORMPEP
   Primary identifier  : Cosmid-derived name  given to that protein by the
                         C.elegans genome sequencing project.  In  general
                         this name will not change.
   Secondary identifier: Number  attributed   by  the   C.elegans   genome
                         sequencing project  to that protein.  This number
                         will change when the sequence is updated.
   Example             : DR   WORMPEP; ZK637.7; CE00437.
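
   For software developers, the three fields of such a DR line can be
   extracted as follows. This is a minimal Python sketch based only on
   the example line above; the names CrossReference and parse_dr_line are
   our own, and the sketch assumes a DR line with exactly three
   semicolon-separated fields terminated by a period.

```python
from typing import NamedTuple


class CrossReference(NamedTuple):
    database: str
    primary_id: str
    secondary_id: str


def parse_dr_line(line: str) -> CrossReference:
    """Split a three-field DR line into its components."""
    if not line.startswith("DR"):
        raise ValueError(f"not a DR line: {line!r}")
    # Drop the two-letter line code and the terminating period,
    # then split the remaining text on semicolons.
    fields = [f.strip() for f in line[2:].strip().rstrip(".").split(";")]
    if len(fields) != 3:
        raise ValueError(f"expected 3 fields, got {len(fields)}: {line!r}")
    return CrossReference(*fields)


xref = parse_dr_line("DR   WORMPEP; ZK637.7; CE00437.")
print(xref.database, xref.primary_id, xref.secondary_id)
```

   Since the secondary identifier changes whenever the sequence is
   updated, programs that need a stable key should rely on the primary
   identifier.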


   2.5  Weekly updates of SWISS-PROT

   Since release 24, we have provided weekly updates of SWISS-PROT. These
   updates are available by anonymous FTP. Three files are updated every
   week:

   new_seq.dat    Contains all the new entries since the last full release.
   upd_seq.dat    Contains the entries for which the sequence data has been
                  updated since the last release.
   upd_ann.dat    Contains the  entries for  which one  or more  annotation
                  fields have been updated since  the last release.

   Currently these  files are  available on  the  following  anonymous  ftp
   servers:

   Organization   EMBL ftp server
   Address        ftp.embl-heidelberg.de (or 192.54.41.33)
   Directory      /pub/databases/swissprot/new

   Organization   ExPASy (Geneva University Expert Protein Analysis System)
   Address        expasy.hcuge.ch  (or 129.195.254.61)
   Directory      /databases/swiss-prot/updates

   Organization   National Center for Biotechnology Information (NCBI)
   Address        ncbi.nlm.nih.gov (or 130.14.20.1)
   Directory      /repository/swiss-prot/updates

   !!! Important notes !!!

   Although we try to follow a regular schedule, we do not promise to
   update these files every week. In some cases two weeks will elapse
   between two updates.

   Due to the current mechanism used to build a release, the entries that
   are provided in these updates are not guaranteed to be error free. Also,
   for the same reason, new entries do not contain an OC (Organism
   Classification) line.
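
   Programs that read the weekly update files should therefore not
   assume that every entry has an OC line. The Python sketch below shows
   one way to detect such entries. It assumes the usual SWISS-PROT flat
   file layout, in which each line starts with a two-letter line code
   and each entry ends with a "//" line (this comes from the user
   manual, not from these notes). The function names and the two sample
   entries are hypothetical.

```python
from typing import Iterable, Iterator, List


def iter_entries(lines: Iterable[str]) -> Iterator[List[str]]:
    """Yield each entry as a list of lines, without the '//' terminator."""
    entry: List[str] = []
    for raw in lines:
        line = raw.rstrip("\n")
        if line.startswith("//"):
            if entry:
                yield entry
            entry = []
        elif line.strip():
            entry.append(line)
    if entry:  # tolerate a file whose last entry lacks a terminator
        yield entry


def has_line_code(entry: List[str], code: str) -> bool:
    """Return True if any line of the entry starts with the given code."""
    return any(line[:2] == code for line in entry)


# Hypothetical two-entry excerpt; identifiers and content are made up.
sample = """\
ID   TEST1_HUMAN     STANDARD;      PRT;    10 AA.
OC   Eukaryota; Metazoa.
SQ   SEQUENCE   10 AA;
     MKTAYIAKQR
//
ID   TEST2_YEAST     STANDARD;      PRT;     8 AA.
SQ   SEQUENCE   8 AA;
     MSTNPKPQ
//
""".splitlines()

# Collect the entry names (second token of the ID line) lacking an OC line.
missing_oc = [entry[0].split()[1] for entry in iter_entries(sample)
              if not has_line_code(entry, "OC")]
print(missing_oc)
```

   To apply the same check to a real update file, pass an open file
   object (for example new_seq.dat) to iter_entries instead of the
   sample list.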




                            3. ENZYME AND PROSITE

   3.1  The ENZYME data bank

   Release 14.0 of the ENZYME data bank is distributed with release 27 of
   SWISS-PROT. ENZYME release 14.0 contains information on 3489 enzymes.



   3.2  The PROSITE data bank

   3.2.1  What's new in release 11.0

   Release 11.0 of the PROSITE data bank is distributed with release 27 of
   SWISS-PROT. Release 11.0 contains 715 documentation chapters that
   describe 926 different patterns. Since the last major release of
   PROSITE (release 10.00 of December 1992), 80 new chapters have been
   added and 306 chapters have been updated. The new chapters are listed
   below:

   -  Protein splicing signature
   -  CAP-Gly domain signature
   -  MAM domain signature
   -  Prokaryotic transcription elongation factors signatures
   -  MCM2/3/5 family signature
   -  XPGC protein signatures
   -  Bacterial regulatory proteins, arsR family signature
   -  Bacterial regulatory proteins, deoR family signature
   -  Dps protein family signatures
   -  Ribosomal protein L27 signature
   -  Ribosomal protein L36 signature
   -  mutT domain signature
   -  3-hydroxyisobutyrate dehydrogenase signature
   -  Dihydroorotate dehydrogenase signatures
   -  Alanine dehydrogenase and pyridine nucleotide transhydrogenase
   -  Lysyl oxidase putative copper-binding region signature
   -  6-hydroxy-D-nicotine oxidase and reticuline oxidase FAD-binding
   -  Cytochrome c oxidase subunit VB, zinc binding region signature
   -  Indoleamine 2,3-dioxygenase signatures
   -  Glycine radical signature
   -  Uroporphyrin-III C-methyltransferase signatures
   -  Protein prenyltransferases alpha subunit repeat signature
   -  Phosphatidylinositol 3-kinase signatures
   -  Glutamate 5-kinase signature
   -  Guanylate kinase signature
   -  ADP-glucose pyrophosphorylase signatures
   -  2'-5'-oligoadenylate synthetases signatures
   -  Deoxyribonuclease I signatures
   -  Glucoamylase active site region signature
   -  Trehalase signatures
   -  Glycosyl hydrolases family 8 signature
   -  Prokaryotic transglycosylases signature
   -  Renal dipeptidase active site
   -  Serine proteases, ompT family signatures
   -  Proteasome B-type subunits signature
   -  Signal peptidases II signature
   -  Cytidine & deoxycytidylate deaminases zinc-binding region signature
   -  GTP cyclohydrolase I signatures
   -  Nitrilases / cyanide hydratase signatures
   -  Orn/DAP/Arg decarboxylases family 2 signatures
   -  Uroporphyrinogen decarboxylase signatures
   -  Alpha-isopropylmalate and homocitrate synthases signatures
   -  Beta-eliminating lyases pyridoxal-phosphate attachment site
   -  Dihydroxy-acid and 6-phosphogluconate dehydratases signatures
   -  Prephenate dehydratase signatures
   -  Cysteine synthase pyridoxal-phosphate attachment site
   -  Cys/Met metabolism enzymes pyridoxal-phosphate attachment site
   -  Cytochrome c and c1 heme lyases signatures
   -  Aspartate and glutamate racemases signatures
   -  Mandelate racemase / muconate lactonizing enzyme family signatures
   -  Phosphoglucomutase and phosphomannomutase phosphoserine signature
   -  D-alanine--D-alanine ligase signatures
   -  Carbamoyl-phosphate synthase subdomain signatures
   -  Nickel-dependent hydrogenases b-type cytochrome subunit signatures
   -  Adrenodoxin family, iron-sulfur binding region signature
   -  ABC-2 type transport system integral membrane proteins signature
   -  Acyl-CoA-binding protein signature
   -  LacY family proton/sugar symporters signatures
   -  Sodium:alanine symporter family signature
   -  Sodium:galactoside symporter family signature
   -  Osteopontin signature
   -  Peripherin / rom-1 signature
   -  Interleukins -4 and -13 signature
   -  Erythropoietin signature
   -  Galanin signature
   -  Chaperonins clpA/B signatures
   -  Bacterial type II secretion system protein D signature
   -  Bacterial type II secretion system protein F signature
   -  MARCKS family signatures
   -  Elongation factor 1 beta/beta'/delta chain signatures
   -  Eukaryotic initiation factor 4E signature
   -  Calsequestrin signatures
   -  GTP1/OBG family signature
   -  HIT family signature
   -  Ependymins signatures
   -  Epimorphin family signature
   -  Yeast PIR proteins repeats signature
   -  Oleosins signature
   -  Pollen proteins Ole e I family signature
   -  Hypothetical YCR59c/yigZ family signature




   3.2.2  Future developments

   Starting with the next major release (12.0 of May 1994), PROSITE will
   be extended to include weight matrices (also known as profiles). There
   are a number of protein families as well as functional or structural
   domains that cannot be detected using patterns due to their extreme
   sequence divergence. Typical examples of important functional domains
   that are weakly conserved are the immunoglobulin domains, the SH2 and
   SH3 domains, and the fibronectin type III domain. In such domains there
   are only a few sequence positions that are well conserved. Any attempt
   to build a consensus pattern for such regions will either fail to
   pick up a significant proportion of the protein sequences that contain
   such a region (false negatives) or will pick up too many proteins that
   do not contain the region (false positives). The use of techniques based
   on weight matrices or profiles allows the detection of such proteins or
   domains. Dr. Philipp Bucher at ISREC in Lausanne and I are
   collaborating to include such methods in PROSITE. This collaboration
   also includes other participants such as Roland Luethy (AMGEN), Michael
   Gribskov (SDSC) and Steve Altschul (NCBI). If you are interested in
   participating in this project please contact Philipp Bucher at:

                          pbucher@isrec-sun1.unil.ch
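
   The difference between a strict consensus pattern and a weight matrix
   can be illustrated with a toy example. The Python sketch below is not
   the PROSITE profile format or scoring scheme; the four-residue motif,
   the scores, the default penalty and the threshold are all invented
   for illustration.

```python
import re

# Toy consensus pattern: every position must match exactly.
PATTERN = re.compile(r"[LIV]G[DE]K")

# Toy weight matrix: one dict per motif position, residue -> score.
# Residues not listed at a position receive DEFAULT.
MATRIX = [
    {"L": 2, "I": 2, "V": 2, "M": 1},
    {"G": 3},
    {"D": 2, "E": 2},
    {"K": 3},
]
DEFAULT = -2
THRESHOLD = 7


def best_window_score(seq: str) -> float:
    """Return the highest matrix score over all windows of the sequence."""
    width = len(MATRIX)
    if len(seq) < width:
        return float("-inf")
    return max(
        sum(col.get(res, DEFAULT) for col, res in zip(MATRIX, seq[i:i + width]))
        for i in range(len(seq) - width + 1)
    )


for seq in ("AALGDKAA", "AAMGDKAA", "AAAGAKAA"):
    by_pattern = PATTERN.search(seq) is not None
    by_matrix = best_window_score(seq) >= THRESHOLD
    print(f"{seq}  pattern={by_pattern}  matrix={by_matrix}")
```

   In this toy case the second sequence differs from the consensus at
   one weakly conserved position: the pattern misses it, whereas the
   matrix score stays above the threshold. The third sequence is
   rejected by both methods.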

   We will include in the next release notes of SWISS-PROT (Release 28 of
   February 1994) a brief description of the PROSITE syntax extension for
   profiles. The full description will be available in the User's Manual of
   release 11.1 of PROSITE (February 1994).

   Important notice for software developers: the integration of profiles
   into PROSITE will not "break" the current format. The profile entries
   in the PROFILE.DAT file will be tagged with the token "MATRIX" on the
   "ID" line (currently, only "PATTERN" and "RULE" are used as tokens); a
   new line type "MA" will be used in these entries to store all the
   parameters specific to weight matrices. The format of the PROFILE.DOC
   file will not be changed.
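
   In practice, existing parsers only need to recognise the new token
   and skip or dispatch the corresponding entries. The Python sketch
   below assumes the ID line layout "ID   NAME; TOKEN." used by current
   PROSITE entries; the function name entry_type and the MATRIX entry
   name are hypothetical.

```python
from collections import Counter

KNOWN_TOKENS = {"PATTERN", "RULE", "MATRIX"}


def entry_type(id_line: str) -> str:
    """Return the entry-type token from an 'ID   NAME; TOKEN.' line."""
    if not id_line.startswith("ID"):
        raise ValueError(f"not an ID line: {id_line!r}")
    name, _, token = id_line[2:].strip().partition(";")
    token = token.strip().rstrip(".")
    if token not in KNOWN_TOKENS:
        raise ValueError(f"unknown entry type {token!r} for {name.strip()!r}")
    return token


id_lines = [
    "ID   ASN_GLYCOSYLATION; PATTERN.",
    "ID   EXAMPLE_DOMAIN; MATRIX.",  # hypothetical profile entry
]
counts = Counter(entry_type(line) for line in id_lines)
print(dict(counts))
```

   A program that handles only patterns can then ignore every entry for
   which entry_type returns "MATRIX", including its "MA" lines.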



                            4. WE NEED YOUR HELP !

   We welcome feedback from our users. We would especially appreciate it
   if you would notify us when sequences belonging to your field of
   expertise are missing from the data bank. We would also like to be
   notified about annotations to be updated if, for example, the function
   of a protein has been clarified or if new post-translational information
   has become available.










                         APPENDIX A: SOME STATISTICS



   A.1  Amino acid composition

        A.1.1  Composition in percent for the complete data bank

   Ala (A) 7.66   Gln (Q) 4.02   Leu (L) 9.20   Ser (S) 7.09
   Arg (R) 5.23   Glu (E) 6.26   Lys (K) 5.81   Thr (T) 5.83
   Asn (N) 4.45   Gly (G) 7.04   Met (M) 2.35   Trp (W) 1.30
   Asp (D) 5.27   His (H) 2.25   Phe (F) 3.99   Tyr (Y) 3.22
   Cys (C) 1.78   Ile (I) 5.55   Pro (P) 5.03   Val (V) 6.52

   Asx (B) 0.005  Glx (Z) 0.005  Xaa (X) 0.02


        A.1.2  Classification of the amino acids by their frequency

   Leu, Ala, Ser, Gly, Val, Glu, Thr, Lys, Ile, Asp, Arg, Pro, Asn, Gln,
   Phe, Tyr, Met, His, Cys, Trp
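
   The ranking in A.1.2 follows directly from the percentages in A.1.1.
   The short Python sketch below re-derives it; the variable names are
   our own and the values are copied from the table above.

```python
# Percentages copied from table A.1.1.
COMPOSITION = {
    "Ala": 7.66, "Gln": 4.02, "Leu": 9.20, "Ser": 7.09,
    "Arg": 5.23, "Glu": 6.26, "Lys": 5.81, "Thr": 5.83,
    "Asn": 4.45, "Gly": 7.04, "Met": 2.35, "Trp": 1.30,
    "Asp": 5.27, "His": 2.25, "Phe": 3.99, "Tyr": 3.22,
    "Cys": 1.78, "Ile": 5.55, "Pro": 5.03, "Val": 6.52,
}
AMBIGUOUS = {"Asx": 0.005, "Glx": 0.005, "Xaa": 0.02}

# Sort the twenty standard amino acids by decreasing frequency.
ranking = sorted(COMPOSITION, key=COMPOSITION.get, reverse=True)
print(", ".join(ranking))

total = sum(COMPOSITION.values()) + sum(AMBIGUOUS.values())
print(f"sum of listed percentages: {total:.2f}")
```

   The listed percentages add up to slightly less than 100, presumably
   because each value is rounded to two decimal places.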



   A.2  Distribution of the sequences by their organism of origin

   Total number of species represented in this release of SWISS-PROT: 4143

        A.2.1 Table of the frequency of occurrence of species

        Species represented 1x: 1879
                            2x:  685
                            3x:  387
                            4x:  247
                            5x:  186
                            6x:  143
                            7x:   88
                            8x:   78
                            9x:   76
                           10x:   48
                       11- 20x:  151
                       21- 50x:  103
                       51-100x:   33
                         >100x:   39
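
   The classes in this table should account for every species in the
   release. A minimal Python check, with the counts copied from the
   table above (the dictionary keys are our own labels):

```python
# Number of species (value) that occur a given number of times (key),
# copied from table A.2.1.
SPECIES_BY_OCCURRENCE = {
    "1": 1879, "2": 685, "3": 387, "4": 247, "5": 186,
    "6": 143, "7": 88, "8": 78, "9": 76, "10": 48,
    "11-20": 151, "21-50": 103, "51-100": 33, ">100": 39,
}
TOTAL_SPECIES = 4143  # stated at the top of section A.2

total = sum(SPECIES_BY_OCCURRENCE.values())
assert total == TOTAL_SPECIES

single = SPECIES_BY_OCCURRENCE["1"]
print(f"{single} of {total} species ({100.0 * single / total:.1f}%) "
      "are represented by a single entry")
```

   The classes do add up to 4143, and about 45% of the species are
   represented by a single entry.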













        A.2.2  Table of the most represented species


    Number   Frequency          Species
         1        2530          Human
         2        2376          Escherichia coli
         3        1563          Baker's yeast (Saccharomyces cerevisiae)
         4        1496          Mouse
         5        1385          Rat
         6         654          Bovine
         7         579          Fruit fly (Drosophila melanogaster)
         8         495          Bacillus subtilis
         9         489          Chicken
        10         369          African clawed frog (Xenopus laevis)
        11         341          Salmonella typhimurium
                   341          Rabbit
        13         311          Pig
        14         251          Vaccinia virus (strain Copenhagen)
        15         224          Maize
        16         200          Bacteriophage T4
        17         193          Human cytomegalovirus (strain AD169)
        18         190          Arabidopsis thaliana (Mouse-ear cress)
        19         183          Vaccinia virus (strain WR)
        20         180          Rice
        21         166          Pseudomonas aeruginosa
                   166          Tobacco
        23         164          Pea
        24         162          Wheat
        25         155          Caenorhabditis elegans
                   155          Fission yeast (Schizosaccharomyces pombe)
        27         138          Barley
        28         133          Soybean
        29         131          Slime mold (Dictyostelium discoideum)
        30         130          Spinach
        31         129          Staphylococcus aureus
        32         127          Sheep
        33         119          Marchantia polymorpha (Liverwort)
        34         118          Rhodobacter capsulatus
        35         117          Dog
        36         114          Pseudomonas putida
        37         111          Neurospora crassa
        38         110          Klebsiella pneumoniae
        39         104          Bacillus stearothermophilus












   A.3  Distribution of the sequences by size



               From   To  Number             From   To   Number
                  1-  50    1966             1001-1100      321
                 51- 100    3345             1101-1200      199
                101- 150    4723             1201-1300      156
                151- 200    3191             1301-1400       97
                201- 250    2804             1401-1500       86
                251- 300    2469             1501-1600       44
                301- 350    2289             1601-1700       46
                351- 400    2377             1701-1800       37
                401- 450    1780             1801-1900       43
                451- 500    1933             1901-2000       31
                501- 550    1325             2001-2100       13
                551- 600     922             2101-2200       40
                601- 650     645             2201-2300       48
                651- 700     488             2301-2400       16
                701- 750     475             2401-2500       18
                751- 800     368             >2500          100
                801- 850     270
                851- 900     293
                901- 950     182
                951-1000     189
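
   The size classes used above are 50 residues wide up to 1000 residues
   and 100 residues wide from 1001 to 2500, with a single class for
   longer sequences. A Python sketch of a function that assigns a
   sequence length to its class (the name size_bin is our own, and the
   labels are written without the padding used in the table):

```python
def size_bin(length: int) -> str:
    """Return the table's size-class label for a sequence length."""
    if length < 1:
        raise ValueError("sequence length must be positive")
    if length > 2500:
        return ">2500"
    width = 50 if length <= 1000 else 100
    upper = -(-length // width) * width  # round up to the class upper bound
    return f"{upper - width + 1}-{upper}"


for length in (1, 50, 51, 1000, 1001, 2500, 5037):
    print(f"{length:>5} -> {size_bin(length)}")
```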




   Currently the ten largest sequences are:


                            RYNR_RABIT  5037 a.a.
                            RYNR_HUMAN  5032 a.a.
                            APB_HUMAN   4563 a.a.
                            APOA_HUMAN  4548 a.a.
                            RRPA_CVMJH  4488 a.a.
                            DYHC_TRIGR  4466 a.a.
                            GRSB_BACBR  4451 a.a.
                            PLEC_RAT    4140 a.a.
                            POLG_BVDV   3988 a.a.
                            VGF1_IBVB   3951 a.a.













                         APPENDIX B: ON-LINE EXPERTS



   B.1  List of on-line experts for PROSITE and SWISS-PROT


Field of expertise            Name               Email address
---------------------------   ------------------ ----------------------------
Alcohol dehydrogenases        Joernvall H.       hans.jornvall@k1m.ki.se
                              Persson B.         bengt.persson@embl-
                                                 heidelberg.de
Aldehyde dehydrogenases       Joernvall H.       hans.jornvall@k1m.ki.se
                              Persson B.         bengt.persson@embl-
                                                 heidelberg.de
Alpha-crystallins/HSP-20      Leunissen J.A.M.   jackl@caos.caos.kun.nl
                              de Jong W.         u629000@hnykun11.bitnet
Alpha-2-macroglobulins        Van Leuven F.      fred@blekul13.bitnet
AA-tRNA synthetases class II  Leberman R.        leberman@frembl51.bitnet
Apolipoproteins               Boguski M.S.       boguski@ncbi.nlm.nih.gov
AraC family HTH proteins      Ramos J.L.         jlramos@cnbvx3.cnb.uam.es
Arrestins                     Kolakowski L.F.Jr. kolakowski@helix.mgh.
                                                 harvard.edu
Asparaginase / glutaminase    Gribskov M.        gribskov@sdsc.edu
ATP synthase c subunit        Recipon H.         recipon@ncbi.nlm.nih.gov
Band 4.1 family proteins      Rees J.            jrees@vax.oxford.ac.uk
Beta-lactamases               Brannigan J.       jab5@vaxa.york.ac.uk
Beta-transducin family        Boguski M.S.       boguski@ncbi.nlm.nih.gov
C-type lectin domain          Drickamer K.       drick@cuhhca.hhmi.columbia.
                                                 edu
Chalcone/stilbene synthases   Schroeder J.       raf@sun1.ruf.uni-freiburg.de
Chaperonins cpn10/cpn60       Georgopoulos C.    georgopo@cmu.unige.ch
Chaperonins TCP1 family       Willison K.R.      willison@icr.ac.uk
Chitinases                    Henrissat B.       bernie@cermav.grenet.fr
Clusterin                     Peitsch M.C.       peitsch@ulbio1.unil.ch
Cold shock domain             Landsman D.        landsman@ncbi.nlm.nih.gov
CTF/NF-I                      Mermod N.          nmermod@ulys.unil.ch
                              Gronostajski R.    gronosr@ccsmtp.ccf.org
Cytochromes P450              Holsztynska E.J.   ela@netcom.uucp
                                                 netcom!ela@apple.com
DEAD-box helicases            Linder P.          linder@urz.unibas.ch
Deoxyribonuclease I           Peitsch M.C.       peitsch@ulbio1.unil.ch
dnaJ family                   Kelley W.          kelley@cmu.unige.ch
EF-hand calcium-binding       Cox J.A.           cox@sc2a.unige.ch
                              Kretsinger R.H.    rhk5i@virginia.bitnet
Elongation factor 1           Amons R.           wmbamons@rulgl.leidenuniv.nl
Enoyl-CoA hydratase           Hofmann K.O.       khofmann@biomed.biolan.uni-
                                                 koeln.de
Fatty acid desaturases        Piffanelli P.      piffanelli@jii.afrc.ac.uk
fruR/lacI family HTH proteins Reizer J.          jreizer@ucsd.edu
GATA-type zinc-fingers        Boguski M.S.       boguski@ncbi.nlm.nih.gov
GDP/GTP dissociation stimul.  Boguski M.S.       boguski@ncbi.nlm.nih.gov
GltP family of transporters   Hofmann K.O.       khofmann@biomed.biolan.uni-
                                                 koeln.de
Glucanases                    Henrissat B.       bernie@cermav.grenet.fr
                              Beguin P.          phycel@pasteur.bitnet
Glutamine synthetase          Tateno Y.          ytateno@genes.nig.ac.jp
G-protein coupled receptors   Chollet A.         arc3029@ggr.co.uk
                              Attwood T.K.       bph6tka@biovax.leeds.ac.uk
                              Kolakowski L.F.Jr. kolakowski@helix.mgh.
                                                 harvard.edu
GTPase-activating proteins    Boguski M.S.       boguski@ncbi.nlm.nih.gov
HIT family                    Seraphin B.        seraphin@embl-heidelberg.de
HMG1/2 and HMG-14/17          Landsman D.        landsman@ncbi.nlm.nih.gov
Inorganic pyrophosphatases    Kolakowski L.F.Jr. kolakowski@helix.mgh.
                                                 harvard.edu
Integrases                    Roy P.H.           2020000@saphir.ulaval.ca
Kringle domain                Ikeo K.            kikeo@genes.nig.ac.jp
Lipocalins                    Boguski M.S.       boguski@ncbi.nlm.nih.gov
                              Peitsch M.C.       peitsch@ulbio1.unil.ch
lysR family HTH proteins      Henikoff S.        henikoff@sparky.fhcrc.org
MAC components / perforin     Peitsch M.C.       peitsch@ulbio1.unil.ch
Malic enzymes                 Glynias M.         mglynias@ncsa.uiuc.edu
MAM domain                    Bork P.            bork@embl-heidelberg.de
MIP family proteins           Reizer J.          jreizer@ucsd.edu
Myelin proteolipid protein    Hofmann K.O.       khofmann@biomed.biolan.uni-
                                                 koeln.de
Pancreatic trypsin inhibitor  Ikeo K.            kikeo@genes.nig.ac.jp
PEP requiring enzymes         Reizer J.          jreizer@ucsd.edu
pfkB carbohydrate kinases     Reizer J.          jreizer@ucsd.edu
Phytochromes                  Partis M.D.        partis@gcri.afrc.ac.uk
Plant viruses icosahedral     Koonin E.V.        koonin@ncbi.nlm.nih.gov
capsid proteins
Protein kinases               Hanks S.           hanks@vuctrvax.bitnet
                              Hunter T.          hunter@salk-sc2.sdsc.edu
PTS proteins                  Reizer J.          jreizer@ucsd.edu
Restriction-modification      Bickle T.          bickle@urz.unibas.ch
            enzymes           Roberts R.J.       roberts@neb.com
Ribosomal protein S15         Ellis S.R.         srelli01@ulkyvm.bitnet
Ring-cleavage dioxygenases    Harayama S.        sharayam@ddbj.nig.ac.jp
Signal sequence peptidases    von Heijne G.      gvh@csb.ki.se
                              Dalbey R.E.        rdalbey@magnus.acs.ohio-
                                                 state.edu
Sodium symporters             Reizer J.          jreizer@ucsd.edu
Subtilases                    Brannigan J.       jab5@vaxa.york.ac.uk
                              Siezen R.J.        nizo@caos.caos.kun.nl
Thiol proteases               Turk B.            turk@ijs.ac.mail.yu
Thiol proteases inhibitors    Turk B.            turk@ijs.ac.mail.yu
TNF family                    Jongeneel C.V.     vjongene@isrecmail.unil.ch
TPR repeats                   Boguski M.S.       boguski@ncbi.nlm.nih.gov
Transit peptides              von Heijne G.      gvh@csb.ki.se
Type-II membrane antigens     Levy S.            levy@cellbio.stanford.edu
Uracil-DNA glycosylase        Aasland R.         aasland@embl-heidelberg.de
Vitamin K-depend. Gla domain  Price P.A.         pprice@ucsd.edu
XPGC protein                  Clarkson S.G.      clarkson@cmu.unige.ch
Xylose isomerase              Jenkins J.         jenkins@frira.afrc.ac.uk
WAP-type domain               Claverie J.-M.     jmc@ncbi.nlm.nih.gov
ZP domain                     Bork P.            bork@embl-heidelberg.de


African swine fever virus     Yanez R.J.         ryanez@cbm2.uam.es
Bacteriophage P4              Halling C.         chh9@midway.uchicago.edu
Caenorhabditis elegans        Sonnhammer E.      esr@mrc-molecular.biology.
                                                 cam.ac.uk
Chloroplast encoded proteins  Hallick R.B.       hallick@arizona.edu
Drosophila                    Ashburner M.       ma11@phx.cam.ac.uk
Escherichia coli              Rudd K.            rudd@ncbi.nlm.nih.gov
Salmonella typhimurium        Rudd K.            rudd@ncbi.nlm.nih.gov
Snakes                        Stocklin R.        stocklin@cmu.unige.ch




   B.2  Requirements to fulfill to become an on-line expert

   An expert  should be  a scientist  working with  specific family(ies) of
   proteins (or specific domains) and who would:

   a) Review the  protein sequences in SWISS-PROT and the patterns/matrices
      in PROSITE relevant to their field of research.
   b) Agree to  be contacted  by people  that have obtained new sequence(s)
      which seem to belong to "their" family(ies) of proteins.
   c) Have access  to electronic  mail and be willing to use it to send and
      receive data.

   If you are willing to be part of this scheme please contact Amos Bairoch
   at the following electronic mail address:

                             bairoch@cmu.unige.ch


           APPENDIX C: RELATIONSHIPS BETWEEN BIOMOLECULAR DATABASES

   The current  status of the relationships (cross-references) between some
   biomolecular databases is shown in the following schematic:


                                                       **********************
                        ***********************        * EPD [Euk. Promot.] *
                        *  EMBL Nucleotide    * <----> **********************
                        *  Sequence Data      *
******************      *  Library            *        **********************
* FLYBASE        * <--> *********************** <----- * ECD [E. coli map]  *
* [Drosophila    *                ^         ^          **********************
* genomic d.b.]  * <------+       |         |
******************        |       |         +--------- **********************
                          |       |                    * TFD [Trans. fact.] *
                          |       |         +--------> **********************
                          |       |         |
******************        v       v         v          **********************
* REBASE         *      ***********************        * ENZYME [Nomencl.]  *
* [Restriction   * <--- *  SWISS-PROT         * <----- **********************
*  enzymes]      *      *  Protein Sequence   *            |
******************      *  Data Bank          *            v
                        ***********************        **********************
******************       ^  ^  |  |  ^   ^  |          * OMIM   [Diseases]  *
* EcoGene/EcoSeq *       |  |  |  |  |   |  +--------> **********************
* [E. coli]      * <-----+  |  |  |  |   |
******************          |  |  |  |   +-----------> **********************
                            |  |  |  |                 * ECO2DBASE     [2D] *
                            |  |  |  |                 **********************
******************          |  |  |  |
* PROSITE        * <--------+  |  |  +---------------> **********************
* [Patterns]     *             |  |                    * SWISS-2DPAGE  [2D] *
******************             |  +---------------+    **********************
             |                 v                  |
             |          ***********************   |    **********************
             +--------> * PDB [3D structures] *   +--> * Aarhus/Ghent  [2D] *
                        ***********************        **********************
  

Swiss-Prot release 26.0

Published July 1, 1993


                    SWISS-PROT RELEASE 26.0 RELEASE NOTES


                               1. INTRODUCTION

   1.1  Evolution

   Release 26.0  of SWISS-PROT  contains 31808 sequence entries, comprising
   10'875'091 amino acids abstracted from 30967 references. This represents
   an increase  of 6.5% over release 25. The recent growth of the data bank
   is summarized below.

   Release    Date   Number of entries     Nb of amino acids

   3.0        11/86               4160               969 641
   4.0        04/87               4387             1 036 010
   5.0        09/87               5205             1 327 683
   6.0        01/88               6102             1 653 982
   7.0        04/88               6821             1 885 771
   8.0        08/88               7724             2 224 465
   9.0        11/88               8702             2 498 140
   10.0       03/89              10008             2 952 613
   11.0       07/89              10856             3 265 966
   12.0       10/89              12305             3 797 482
   13.0       01/90              13837             4 347 336
   14.0       04/90              15409             4 914 264
   15.0       08/90              16941             5 486 399
   16.0       11/90              18364             5 986 949
   17.0       02/91              20024             6 524 504
   18.0       05/91              20772             6 792 034
   19.0       08/91              21795             7 173 785
   20.0       11/91              22654             7 500 130
   21.0       03/92              23742             7 866 596
   22.0       05/92              25044             8 375 696
   23.0       08/92              26706             9 011 391
   24.0       12/92              28154             9 545 427
   25.0       04/93              29955            10 214 020
   26.0       07/93              31808            10 875 091
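
   As a worked check of the growth figure quoted above, the following
   Python sketch recomputes the increase between releases 25.0 and 26.0
   from the last two rows of the table. The function name is ours and is
   not part of any SWISS-PROT tool. The result suggests that the quoted
   6.5% refers to the number of amino acids; the number of entries grew
   by about 6.2%.

```python
def percent_increase(old, new):
    """Relative growth from old to new, in percent."""
    return (new - old) / old * 100.0

# Last two rows of the table above (releases 25.0 and 26.0).
entries_25, entries_26 = 29955, 31808
residues_25, residues_26 = 10214020, 10875091

entry_growth = percent_increase(entries_25, entries_26)
residue_growth = percent_increase(residues_25, residues_26)

print(f"entries:     {entry_growth:.1f}%")
print(f"amino acids: {residue_growth:.1f}%")
```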

   1.2  Source of data

   Release 26.0  has been  updated using protein sequence data from release
   36.0 of  the PIR (Protein Identification Resource) protein data bank, as
   well as translation of nucleotide sequence data from release 35.0 of the
   EMBL Nucleotide Sequence Database.

   As an indication of the source of the sequence data in the SWISS-PROT
   data bank we list here the statistics concerning the DR (Database cross-
   references) pointer lines:

   Entries with pointer(s) to only PIR entri(es):            4485
   Entries with pointer(s) to only EMBL entri(es):           4471
   Entries with pointer(s) to both EMBL and PIR entri(es):  22181
   Entries with no pointer lines:                             671
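
   The four counts above add up to 31808, the number of entries in this
   release, so every entry falls into exactly one category. A minimal
   Python sketch of how such a tally could be produced from a flat file
   follows. It assumes the usual line layout (DR lines of the form
   "DR   EMBL; ...", entries terminated by "//"). The function names and
   the sample entries are hypothetical, and counting an entry whose DR
   lines point only to other data banks as "none" is our assumption, not
   something stated in these notes.

```python
from collections import Counter

def category(databases):
    """Map the set of data banks named in an entry's DR lines to a label."""
    has_embl = "EMBL" in databases
    has_pir = "PIR" in databases
    if has_embl and has_pir:
        return "both"
    if has_embl:
        return "EMBL only"
    if has_pir:
        return "PIR only"
    return "none"

def classify_entries(flat_file_text):
    """Tally entries by the EMBL/PIR pointers found in their DR lines."""
    counts = Counter()
    databases = set()
    for line in flat_file_text.splitlines():
        if line.startswith("//"):            # end of the current entry
            counts[category(databases)] += 1
            databases = set()
        elif line.startswith("DR   "):
            # The first field of a DR line is the data bank name.
            databases.add(line[5:].split(";")[0].strip())
    return counts

# Hypothetical, heavily abbreviated entries for illustration only.
SAMPLE = """\
ID   TEST1_HUMAN
DR   EMBL; X00001; HSTEST1.
DR   PIR; A00001; A00001.
//
ID   TEST2_ECOLI
DR   PIR; B00002; B00002.
//
ID   TEST3_YEAST
DR   EMBL; X00003; SCTEST3.
DR   PROSITE; PS00001; ASN_GLYCOSYLATION.
//
ID   TEST4_MOUSE
//
"""

counts = classify_entries(SAMPLE)
print(dict(counts))
```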




      2. DESCRIPTION OF THE CHANGES MADE TO SWISS-PROT SINCE RELEASE 25


   2.1  Sequences and annotations

   About 1875 sequences have been added since release 25, the sequence data
   of 286  existing entries  has been  updated and  the annotations of 4100
   entries have  been revised.  In particular we have used review articles
   to update  the annotations  of  the  following  groups  or  families  of
   proteins:

   -  6-hydroxy-D-nicotine oxidase and reticuline oxidase
   -  ABC-2 type transport system integral membrane proteins
   -  Acyl-CoA-binding protein
   -  Beta-eliminating lyases
   -  Calsequestrins
   -  CAP-Gly domain protein
   -  Carbamoyl-phosphate synthase
   -  Chaperonins clpA/B
   -  Chloroplast encoded proteins
   -  Cys/Met metabolism enzymes
   -  D-alanine--D-alanine ligases
   -  Dihydroxy-acid and 6-phosphogluconate dehydratases
   -  Galanin
   -  General secretion pathway (GSP) proteins
   -  Guanylate kinase family
   -  GTP cyclohydrolase I
   -  Hox complex proteins
   -  Indoleamine 2,3-dioxygenase
   -  MCM2/3/5 family
   -  Nickel-dependent hydrogenases b-type cytochrome subunit
   -  Ornithine/DAP/arginine decarboxylases family 2
   -  Osteopontin
   -  'POU' domain proteins
   -  Prephenate dehydratase
   -  Proteins containing cyclic nucleotide-binding domain(s)
   -  Proteasome A-type and B-type subunits
   -  Sodium:alanine symporters
   -  Sodium:galactoside symporters
   -  Sodium:neurotransmitter symporters


   2.2  Weekly updates of SWISS-PROT

   Since release 24,  we have provided weekly updates of SWISS-PROT. These
   updates are available by anonymous FTP. Three files are updated every week:

   new_seq.dat    Contains all the new entries since the last full release.
   upd_seq.dat    Contains the entries for which the sequence data has been
                  updated since the last release.
   upd_ann.dat    Contains the  entries for  which one  or more  annotation
                  fields have been updated since  the last release.




<PAGE>




   Currently these  files are  available on  the  following  anonymous  ftp
   servers:

   Organization   EMBL ftp server
   Address        ftp.embl-heidelberg.de (or 192.54.41.33)
   Directory      /pub/databases/swissprot/new

   Organization   Basel Biozentrum Biocomputing server (EMBnet SWISS node)
   Address        bioftp.unibas.ch (or 131.152.8.1)
   Directory      /archive_data/brand_new/swissprot/updates

   Organization   ExPASy (Geneva University Expert Protein Analysis System)
   Address        expasy.hcuge.ch  (or 129.195.254.61)
   Directory      /databases/swiss-prot/updates

   Organization   National Center for Biotechnology Information (NCBI)
   Address        ncbi.nlm.nih.gov (or 130.14.20.1)
   Directory      /repository/swiss-prot/updates


   !!! Important notes !!!

   Although we  try to  follow a  regular schedule,  we do  not promise  to
   update these files every week. In some cases two weeks will elapse
   between two updates.

   Due to the current mechanism used to build a release, the entries that
   are provided in these updates are not guaranteed to be error-free. Also,
   for the  same reason,  new  entries  do  not  contain  an  OC  (Organism
   Classification) line.



                            3. ENZYME AND PROSITE

   3.1  The ENZYME data bank

        3.1.1  Statistics

   Release 13.0  of the  ENZYME data bank is distributed with release 26 of
   SWISS-PROT. ENZYME  release 13.0  contains information  relative to 3489
   enzymes.


        3.1.2  Important change to the User's manual

   A form  is now  provided that  can be  used to  fill in  the information
   necessary for  the Nomenclature  Committee of  the IUBMB to assign an EC
   number or to correct errors in existing entries. This form can be found
   at the end of the User's manual of the ENZYME data bank (Appendix 1).

   The NC-IUBMB  will regularly  send  us  updates  and  additions  to  the
   nomenclature so  that they  can be  integrated into  the data  bank in a
   timely manner.


        3.1.3  Citation

   A paper  has been published that briefly describes the ENZYME data bank.
   You can use it if you want to cite ENZYME in a publication:

      Bairoch A.
      The ENZYME data bank
      Nucleic Acids Res. 21:3155-3156(1993).


   3.2  The PROSITE data bank

   Release 10.2  of the PROSITE data bank is distributed with release 26 of
   SWISS-PROT.  Release  10.2  contains  635  documentation  chapters  that
   describe 803 different patterns. Release 10.2 does not really represent
   a new release; the only changes between releases 10.0, 10.1 and 10.2 are
   updates to the pointers to the SWISS-PROT entries whose names have been
   modified between  releases 25 and 26. The next release of PROSITE (11.0)
   will be distributed with release 27 of SWISS-PROT.

   Normally release  11 should  have been  distributed with  release 26  of
   SWISS-PROT, but technical difficulties have delayed this new release,
   which will offer many new patterns.



                            4. WE NEED YOUR HELP !

   We welcome feedback from our users. We would especially appreciate it if
   you would notify us when sequences belonging to your field of expertise
   are missing from the data bank. We would also like to be notified about
   annotations that need updating, for example when the function of a
   protein has been clarified or new post-translational information has
   become available.


                         APPENDIX A: SOME STATISTICS



   A.1  Amino acid composition

        A.1.1  Composition in percent for the complete data bank

   Ala (A) 7.68   Gln (Q) 4.03   Leu (L) 9.19   Ser (S) 7.06
   Arg (R) 5.25   Glu (E) 6.26   Lys (K) 5.80   Thr (T) 5.83
   Asn (N) 4.43   Gly (G) 7.08   Met (M) 2.35   Trp (W) 1.31
   Asp (D) 5.26   His (H) 2.25   Phe (F) 3.99   Tyr (Y) 3.22
   Cys (C) 1.79   Ile (I) 5.52   Pro (P) 5.04   Val (V) 6.53

   Asx (B) 0.005  Glx (Z) 0.005  Xaa (X) 0.02


        A.1.2  Classification of the amino acids by their frequency

   Leu, Ala,  Gly, Ser,  Val, Glu,  Thr, Lys, Ile, Asp, Arg, Pro, Asn, Gln,
   Phe, Tyr, Met, His, Cys, Trp
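
   The ordering in A.1.2 can be reproduced from the percentages in A.1.1.
   The short Python sketch below is only an illustration of that step; the
   variable names are ours, and the ambiguity codes B, Z and X are left out
   because the ranking lists only the twenty standard amino acids.

```python
# Percentages copied from table A.1.1 (B, Z and X omitted).
composition = {
    "Ala": 7.68, "Gln": 4.03, "Leu": 9.19, "Ser": 7.06,
    "Arg": 5.25, "Glu": 6.26, "Lys": 5.80, "Thr": 5.83,
    "Asn": 4.43, "Gly": 7.08, "Met": 2.35, "Trp": 1.31,
    "Asp": 5.26, "His": 2.25, "Phe": 3.99, "Tyr": 3.22,
    "Cys": 1.79, "Ile": 5.52, "Pro": 5.04, "Val": 6.53,
}

# Sort the three-letter codes from most to least frequent.
ranking = sorted(composition, key=composition.get, reverse=True)
print(", ".join(ranking))
```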



   A.2  Distribution of the sequences by their organism of origin

   Total number of species represented in this release of SWISS-PROT: 4052

        A.2.1  Table of the frequency of occurrence of species

        Species represented 1x: 1844
                            2x:  662
                            3x:  385
                            4x:  243
                            5x:  179
                            6x:  135
                            7x:   87
                            8x:   76
                            9x:   81
                           10x:   45
                       11- 20x:  147
                       21- 50x:   98
                       51-100x:   32
                         >100x:   38


        A.2.2  Table of the most represented species

    Number   Frequency          Species
         1        2454          Human
         2        2222          Escherichia coli
         3        1439          Mouse
         4        1339          Rat
         5        1220          Baker's yeast (Saccharomyces cerevisiae)
         6         634          Bovine
         7         560          Fruit fly (Drosophila melanogaster)
         8         477          Chicken
         9         454          Bacillus subtilis
        10         362          African clawed frog (Xenopus laevis)
        11         340          Salmonella typhimurium
        12         333          Rabbit
        13         298          Pig
        14         251          Vaccinia virus (strain Copenhagen)
        15         222          Maize
        16         193          Human cytomegalovirus (strain AD169)
        17         177          Arabidopsis thaliana (Mouse-ear cress)
                   177          Rice
        19         176          Vaccinia virus (strain WR)
        20         167          Bacteriophage T4
        21         161          Pea
        22         159          Tobacco
                   159          Wheat
        24         151          Pseudomonas aeruginosa
        25         142          Caenorhabditis elegans
        26         141          Fission yeast (Schizosaccharomyces pombe)
        27         133          Barley
        28         129          Staphylococcus aureus
        29         127          Spinach
        30         125          Soybean
        31         123          Sheep
        32         122          Slime mold (Dictyostelium discoideum)
        33         119          Marchantia polymorpha (Liverwort)
        34         118          Rhodobacter capsulatus
        35         115          Dog
        36         113          Pseudomonas putida
        37         110          Neurospora crassa
                   110          Klebsiella pneumoniae


   A.3  Distribution of the sequences by size



               From   To  Number             From   To   Number
                  1-  50    1915             1001-1100      306
                 51- 100    3248             1101-1200      188
                101- 150    4568             1201-1300      151
                151- 200    3081             1301-1400       95
                201- 250    2653             1401-1500       80
                251- 300    2358             1501-1600       41
                301- 350    2165             1601-1700       43
                351- 400    2244             1701-1800       36
                401- 450    1657             1801-1900       40
                451- 500    1851             1901-2000       31
                501- 550    1247             2001-2100       12
                551- 600     871             2101-2200       37
                601- 650     614             2201-2300       47
                651- 700     445             2301-2400       16
                701- 750     437             2401-2500       17
                751- 800     351             >2500           94
                801- 850     256
                851- 900     270
                901- 950     166
                951-1000     177
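
   The bins used in the table above are 50 residues wide up to 1000, 100
   residues wide from 1001 to 2500, and open-ended beyond that. As a
   sketch of how a sequence length maps to its bin (the function name is
   ours; of the sample lengths, only 5037 and 4140 come from this
   document, the rest are made up to exercise the bin edges):

```python
from collections import Counter

def size_bin(length):
    """Return the label of the table bin that a sequence length falls in."""
    if length < 1:
        raise ValueError("sequence length must be positive")
    if length > 2500:
        return ">2500"
    width = 50 if length <= 1000 else 100
    low = (length - 1) // width * width + 1
    return f"{low}-{low + width - 1}"

lengths = [5037, 4140, 50, 51, 1000, 1001, 2500]
histogram = Counter(size_bin(n) for n in lengths)

for label, count in histogram.items():
    print(f"{label:>10}  {count}")
```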


   Currently the ten largest sequences are:


                            RYNR_RABIT  5037 a.a.
                            RYNR_HUMAN  5032 a.a.
                            APB_HUMAN   4563 a.a.
                            APOA_HUMAN  4548 a.a.
                            RRPA_CVMJH  4488 a.a.
                            DYHC_TRIGR  4466 a.a.
                            GRSB_BACBR  4451 a.a.
                            PLEC_RAT    4140 a.a.
                            POLG_BVDV   3988 a.a.
                            VGF1_IBVB   3951 a.a.


                         APPENDIX B: ON-LINE EXPERTS


   B.1  List of on-line experts for PROSITE and SWISS-PROT


Field of expertise            Name               Email address
---------------------------   ------------------ ----------------------------
Alcohol dehydrogenases        Joernvall H.       hans.jornvall@k1m.ki.se
                              Persson B.         bengt.persson@embl-
                                                 heidelberg.de
Aldehyde dehydrogenases       Joernvall H.       hans.jornvall@k1m.ki.se
                              Persson B.         bengt.persson@embl-
                                                 heidelberg.de
Alpha-crystallins/HSP-20      Leunissen J.A.M.   jackl@caos.caos.kun.nl
                              de Jong W.         u629000@hnykun11.bitnet
Alpha-2-macroglobulins        Van Leuven F.      fred@blekul13.bitnet
AA-tRNA synthetases class II  Leberman R.        leberman@frembl51.bitnet
Apolipoproteins               Boguski M.S.       boguski@ncbi.nlm.nih.gov
AraC family HTH proteins      Ramos J.L.         jlramos@cnbvx3.cnb.uam.es
Arrestins                     Kolakowski L.F.Jr. kolakowski@helix.mgh.
                                                 harvard.edu
ATP synthase c subunit        Recipon H.         recipon@ncbi.nlm.nih.gov
Band 4.1 family proteins      Rees J.            jrees@vax.oxford.ac.uk
Beta-lactamases               Brannigan J.       jab5@vaxa.york.ac.uk
Beta-transducin family        Boguski M.S.       boguski@ncbi.nlm.nih.gov
C-type lectin domain          Drickamer K.       drick@cuhhca.hhmi.columbia.
                                                 edu
Chalcone/stilbene synthases   Schroeder J.       raf@sun1.ruf.uni-freiburg.de
Chaperonins cpn10/cpn60       Georgopoulos C.    georgopo@cmu.unige.ch
Chaperonins TCP1 family       Willison K.R.      willison@icr.ac.uk
Chitinases                    Henrissat B.       bernie@cermav.grenet.fr
Clusterin                     Peitsch M.C.       peitsch@ulbio1.unil.ch
Cold shock domain             Landsman D.        landsman@ncbi.nlm.nih.gov
CTF/NF-I                      Mermod N.          nmermod@ulys.unil.ch
                              Gronostajski R.    gronosr@ccsmtp.ccf.org
Cytochromes P450              Holsztynska E.J.   ela@netcom.uucp
                                                 netcom!ela@apple.com
DEAD-box helicases            Linder P.          linder@urz.unibas.ch
dnaJ family                   Kelley W.          kelley@cmu.unige.ch
EF-hand calcium-binding       Cox J.A.           cox@sc2a.unige.ch
                              Kretsinger R.H.    rhk5i@virginia.bitnet
Elongation factor 1           Amons R.           wmbamons@rulgl.leidenuniv.nl
Enoyl-CoA hydratase           Hofmann K.O.       khofmann@biomed.biolan.uni-
                                                 koeln.de
Fatty acid desaturases        Piffanelli P.      piffanelli@jii.afrc.ac.uk
fruR/lacI family HTH proteins Reizer J.          jreizer@ucsd.edu
GATA-type zinc-fingers        Boguski M.S.       boguski@ncbi.nlm.nih.gov
GDP/GTP dissociation stimul.  Boguski M.S.       boguski@ncbi.nlm.nih.gov
GltP family of transporters   Hofmann K.O.       khofmann@biomed.biolan.uni-
                                                 koeln.de
Glucanases                    Henrissat B.       bernie@cermav.grenet.fr
                              Beguin P.          phycel@pasteur.bitnet
Glutamine synthetase          Tateno Y.          ytateno@genes.nig.ac.jp
G-protein coupled receptors   Chollet A.         arc3029@ggr.co.uk
                              Attwood T.K.       bph6tka@biovax.leeds.ac.uk
GTPase-activating proteins    Boguski M.S.       boguski@ncbi.nlm.nih.gov
HIT family                    Seraphin B.        seraphin@embl-heidelberg.de
HMG1/2 and HMG-14/17          Landsman D.        landsman@ncbi.nlm.nih.gov
Inorganic pyrophosphatases    Kolakowski L.F.Jr. kolakowski@helix.mgh.
                                                 harvard.edu
Integrases                    Roy P.H.           2020000@saphir.ulaval.ca
Kringle domain                Ikeo K.            kikeo@genes.nig.ac.jp
Lipocalins                    Boguski M.S.       boguski@ncbi.nlm.nih.gov
                              Peitsch M.C.       peitsch@ulbio1.unil.ch
lysR family HTH proteins      Henikoff S.        henikoff@sparky.fhcrc.org
MAC components / perforin     Peitsch M.C.       peitsch@ulbio1.unil.ch
Malic enzymes                 Glynias M.         mglynias@ncsa.uiuc.edu
MAM domain                    Bork P.            bork@embl-heidelberg.de
MIP family proteins           Reizer J.          jreizer@ucsd.edu
Myelin proteolipid protein    Hofmann K.O.       khofmann@biomed.biolan.uni-
                                                 koeln.de
Pancreatic trypsin inhibitor  Ikeo K.            kikeo@genes.nig.ac.jp
PEP requiring enzymes         Reizer J.          jreizer@ucsd.edu
pfkB carbohydrate kinases     Reizer J.          jreizer@ucsd.edu
Phytochromes                  Partis M.D.        partis@gcri.afrc.ac.uk
Protein kinases               Hanks S.           hanks@vuctrvax.bitnet
                              Hunter T.          hunter@salk-sc2.sdsc.edu
PTS proteins                  Reizer J.          jreizer@ucsd.edu
Restriction-modification      Bickle T.          bickle@urz.unibas.ch
            enzymes           Roberts R.J.       roberts@neb.com
Ribosomal protein S15         Ellis S.R.         srelli01@ulkyvm.bitnet
Ring-cleavage dioxygenases    Harayama S.        sharayam@ddbj.nig.ac.jp
Signal sequence peptidases    von Heijne G.      gvh@csb.ki.se
                              Dalbey R.E.        rdalbey@magnus.acs.ohio-
                                                 state.edu
Sodium symporters             Reizer J.          jreizer@ucsd.edu
Subtilases                    Brannigan J.       jab5@vaxa.york.ac.uk
                              Siezen R.J.        nizo@caos.caos.kun.nl
Thiol proteases               Turk B.            turk@ijs.ac.mail.yu
Thiol proteases inhibitors    Turk B.            turk@ijs.ac.mail.yu
TNF family                    Jongeneel C.V.     vjongene@isrecmail.unil.ch
TPR repeats                   Boguski M.S.       boguski@ncbi.nlm.nih.gov
Transit peptides              von Heijne G.      gvh@csb.ki.se
Type-II membrane antigens     Levy S.            levy@cellbio.stanford.edu
Uracil-DNA glycosylase        Aasland R.         aasland@bio.uib.no
Vitamin K-depend. Gla domain  Price P.A.         pprice@ucsd.edu
XPGC protein                  Clarkson S.G.      clarkson@cmu.unige.ch
Xylose isomerase              Jenkins J.         jenkins@frira.afrc.ac.uk
WAP-type domain               Claverie J.-M.     jmc@ncbi.nlm.nih.gov
ZP domain                     Bork P.            bork@embl-heidelberg.de


African swine fever virus     Yanez R.J.         ryanez@cbm2.uam.es
Bacteriophage P4              Halling C.         chh9@midway.uchicago.edu
Chloroplast encoded proteins  Hallick R.B.       hallick@arizona.edu
Drosophila                    Ashburner M.       ma11@phx.cam.ac.uk
Escherichia coli              Rudd K.            rudd@ncbi.nlm.nih.gov
Salmonella typhimurium        Rudd K.            rudd@ncbi.nlm.nih.gov
Snakes                        Stocklin R.        stocklin@cmu.unige.ch


   B.2  Requirements to fulfill to become an on-line expert

   An expert should be a scientist working with specific family(ies) of
   proteins (or specific domains) and who would:

   a) Review the protein sequences in SWISS-PROT and the patterns/matrices
      in PROSITE relevant to their field of research.
   b) Agree to be contacted by people that have obtained new sequence(s)
      which seem to belong to "their" family(ies) of proteins.
   c) Have access to electronic mail and be willing to use it to send and
      receive data.

   If you are willing to be part of this scheme please contact Amos Bairoch
   at the following electronic mail address:

                             bairoch@cmu.unige.ch


           APPENDIX C: RELATIONSHIPS BETWEEN BIOMOLECULAR DATABASES

   The current  status of the relationships (cross-references) between some
   biomolecular databases is shown in the following schematic:

                                                       **********************
                        ***********************        * EPD [Euk. Promot.] *
                        *  EMBL Nucleotide    * <----> **********************
                        *  Sequence Data      *
******************      *  Library            *        **********************
* FLYBASE        * <--> *********************** <----- * ECD [E. coli map]  *
* [Drosophila    *                ^         ^          **********************
* genomic d.b.]  * <------+       |         |
******************        |       |         +--------- **********************
                          |       |                    * TFD [Trans. fact.] *
                          |       |         +--------> **********************
                          |       |         |
******************        v       v         v          **********************
* REBASE         *      ***********************        * ENZYME [Nomencl.]  *
* [Restriction   * <--- *  SWISS-PROT         * <----- **********************
*  enzymes]      *      *  Protein Sequence   *            |
******************      *  Data Bank          *            v
                        ***********************        **********************
******************       ^  ^  |  |  ^   ^  |          * OMIM   [Diseases]  *
* EcoGene/EcoSeq *       |  |  |  |  |   |  +--------> **********************
* [E. coli]      * <-----+  |  |  |  |   |
******************          |  |  |  |   +-----------> **********************
                            |  |  |  |                 * ECO2DBASE     [2D] *
                            |  |  |  |                 **********************
******************          |  |  |  |
* PROSITE        * <--------+  |  |  +---------------> **********************
* [Patterns]     *             |  |                    * SWISS-2DPAGE  [2D] *
******************             |  +---------------+    **********************
             |                 v                  |
             |          ***********************   |    **********************
             +--------> * PDB [3D structures] *   +--> * Aarhus/Ghent  [2D] *
                        ***********************        **********************
  

Swiss-Prot release 25.0

Published April 1, 1993


                    SWISS-PROT RELEASE 25.0 RELEASE NOTES


                               1. INTRODUCTION

   1.1  Evolution

   Release 25.0  of SWISS-PROT  contains 29955 sequence entries, comprising
   10'214'020 amino acids abstracted from 29176 references. This represents
   an increase of 7% over release 24. The recent growth of the data bank is
   summarized below.

   Release    Date   Number of entries     Nb of amino acids

    3.0       11/86               4160               969 641
    4.0       04/87               4387             1 036 010
    5.0       09/87               5205             1 327 683
    6.0       01/88               6102             1 653 982
    7.0       04/88               6821             1 885 771
    8.0       08/88               7724             2 224 465
    9.0       11/88               8702             2 498 140
   10.0       03/89              10008             2 952 613
   11.0       07/89              10856             3 265 966
   12.0       10/89              12305             3 797 482
   13.0       01/90              13837             4 347 336
   14.0       04/90              15409             4 914 264
   15.0       08/90              16941             5 486 399
   16.0       11/90              18364             5 986 949
   17.0       02/91              20024             6 524 504
   18.0       05/91              20772             6 792 034
   19.0       08/91              21795             7 173 785
   20.0       11/91              22654             7 500 130
   21.0       03/92              23742             7 866 596
   22.0       05/92              25044             8 375 696
   23.0       08/92              26706             9 011 391
   24.0       12/92              28154             9 545 427
   25.0       04/93              29955            10 214 020
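
   The growth between two releases can be recomputed from the table. The
   following is a minimal Python sketch using only the figures for
   releases 24.0 and 25.0; the names RELEASES and percent_growth are
   illustrative and not part of any SWISS-PROT tool.

```python
# Figures copied from the growth table (releases 24.0 and 25.0).
RELEASES = {
    "24.0": {"entries": 28154, "amino_acids": 9_545_427},
    "25.0": {"entries": 29955, "amino_acids": 10_214_020},
}


def percent_growth(old: int, new: int) -> float:
    """Return the relative increase from old to new, in percent."""
    return (new - old) / old * 100.0


prev, curr = RELEASES["24.0"], RELEASES["25.0"]
entry_growth = percent_growth(prev["entries"], curr["entries"])
residue_growth = percent_growth(prev["amino_acids"], curr["amino_acids"])

print(f"entries:     +{entry_growth:.1f}%")
print(f"amino acids: +{residue_growth:.1f}%")
```

   Counting entries gives a slightly lower figure than counting amino
   acids, so it is worth stating which measure a growth figure refers to.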

   1.2  Source of data

   Release 25.0  has been  updated using protein sequence data from release
   35.0 of  the PIR (Protein Identification Resource) protein data bank, as
   well as translation of nucleotide sequence data from release 34.0 of the
   EMBL Nucleotide Sequence Database.

   As an  indication of  the source  of the sequence data in the SWISS-PROT
   data bank we list here the statistics concerning the DR (Database cross-
   references) pointer lines:

   Entries with pointer(s) to only PIR entri(es):               4420
   Entries with pointer(s) to only EMBL entri(es):              4235
   Entries with pointer(s) to both EMBL and PIR entri(es):     20694
   Entries with no pointer lines:                                606
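
   As a quick consistency check, the four categories above should add up
   to the number of entries in the release. A small Python sketch, with
   the counts copied from the list above and illustrative variable names:

```python
# Counts copied from the DR pointer statistics for release 25.0.
POINTER_COUNTS = {
    "PIR only": 4420,
    "EMBL only": 4235,
    "EMBL and PIR": 20694,
    "no pointer lines": 606,
}
TOTAL_ENTRIES = 29955  # number of entries stated in section 1.1

total = sum(POINTER_COUNTS.values())
shares = {label: count / total * 100.0 for label, count in POINTER_COUNTS.items()}

for label, share in shares.items():
    print(f"{label:<18}{share:5.1f}%")
print("categories cover every entry:", total == TOTAL_ENTRIES)
```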


      2. DESCRIPTION OF THE CHANGES MADE TO SWISS-PROT SINCE RELEASE 24


   2.1  Sequences and annotations

   About 1820 sequences have been added since release 24, the sequence data
   of 191  existing entries  has been  updated and  the annotations of 2900
   entries have  been revised.  In particular we have used review  articles
   to update  the annotations  of  the  following  groups  or  families  of
   proteins:

   -  2'-5'-oligoadenylate synthetases
   -  Adaptins
   -  ADP-glucose pyrophosphorylases
   -  Alanine dehydrogenases and pyridine nucleotide transhydrogenases
   -  Alpha-isopropylmalate and homocitrate synthases
   -  Ank repeat proteins
   -  Bacterial regulatory proteins, arsR family
   -  Cobalt-binding eukaryotic proteins
   -  Cytochrome c and c1 heme lyases
   -  Elongation factor 1 beta/beta'/delta and gamma chains
   -  Eukaryotic initiation factor 4E
   -  Erythropoietins
   -  Glycosyl hydrolases family 8
   -  Glucoamylases
   -  Heat shock 20 Kd family
   -  Interleukins-4
   -  MARCKS family
   -  Oleosins
   -  Prokaryotic transcription elongation factors (greA/B)
   -  Serine proteases, ompT family
   -  Tyrosine protein kinases
   -  Uroporphyrin-III C-methyltransferases
   -  XPGC family proteins


   2.2  SWISS-PROT and 2D gel databases

   Two-dimensional (2D)  gel techniques  have made enormous progress in the
   last few  years. One  of the consequences of this evolution has been the
   development of  databases that  contain master  gels from  a variety  of
   mammalian tissues  or from  bacterial sources. These databases are going
   to play an increasingly important role in the analysis of genomes and of
   molecular diseases.  2D gel  databases generally  contain  one  or  more
   master images  of the  gels that  correspond to  the tissue  or organism
   studied; spots on these images are attributed an identification code and
   a variable  percentage of  these spots are linked to known proteins. The
   identification of  a protein  on a 2D gel is generally carried out using
   antibodies or  by microsequencing.  Microsequencing of 2D gel spots also
   produces partial sequences and physico-chemical data for a number of yet
   uncharacterized proteins.

   SWISS-PROT has committed itself to work in close collaboration with a
   number of groups developing 2D gel databases. Since last year,
   cross-references have been available to the gene-protein database of
   Escherichia coli K-12 (now called ECO2DBASE) [1], and that database in
   turn now contains cross-references to SWISS-PROT. As a second step we
   have expanded our links to 2D gel databases by integrating data from the
   following sources:

   -  The Human  2D gel  protein database of the Faculty of Medicine of the
      University of  Geneva (known as SWISS-2DPAGE). SWISS-2DPAGE currently
      contains data  concerning plasma [2] and liver [3] proteins, but will
      soon include additional tissues.

   -  The Human  keratinocyte 2D gel protein database from the universities
      of Aarhus and Ghent [4] (known as AARHUS/GHENT-2DPAGE).

   For both of the above databases we provide:

   a) Cross-references (using the DR line) to the identifiers for the
      spots corresponding to known or unknown microsequenced proteins.
   b) New entries for microsequences that correspond to novel, as yet
      unidentified, proteins.
   c) In some cases, the extent of the microsequences for already known
      proteins that are not yet well characterized. The availability of
      such microsequences makes it possible, for example, to confirm the
      position of a signal sequence cleavage site or the correctness of a
      translated genomic sequence.

   In the  near future  the collaboration with the group of D. Hochstrasser
   which produces  the  SWISS-2DPAGE  database  will  be  expanded  in  the
   following directions:

   a) The MELANIE software package [5], a complete system for the analysis
      of 2D gels developed by the group of Hochstrasser, will allow its
      users to navigate back and forth between SWISS-2DPAGE and SWISS-PROT.
      For more information on MELANIE, please contact Dr. Ron Appel:

      Email: appel@cih.hcuge.ch
      Tel: +41-22-372 62 64
      Fax: +41-22-372 61 98

   b) A file server will be set up that will allow anyone with a network
      connection to obtain annotated graphic files containing the region of
      the gels that corresponds to a selected SWISS-PROT entry linked to
      SWISS-2DPAGE.


   [1]  VanBogelen R.A., Sankar P., Clark R.L., Bogan J.A., Neidhardt F.C.
        Electrophoresis 13:1014-1054(1992).
   [2]  Hughes G.J.,  Frutiger S.,  Paquet  N.,  Ravier  F.,  Pasquali  C.,
        Sanchez J.-C., James R., Tissot J.-D., Bjellqvist B., Hochstrasser
        D.F.
        Electrophoresis 13:707-714(1992).
   [3]  Hochstrasser D.F.,  Frutiger S.,  Paquet N., Bairoch A., Ravier F.,
        Pasquali C., Sanchez J.-C., Tissot J.-D., Bjellqvist B., Vargas R.,
        Appel R.D., Hughes G.J.
        Electrophoresis 13:992-1001(1992).
   [4]  Celis J.E.,  Rasmussen H.H.,  Madsen P.,  Leffers  H.,  Honore  B.,
        Dejgaard K., Gesser B., Olsen E., Gromov P., Hoffmann H.J., Nielsen
        M., Celis A.,  Basse B.,  Lauridsen J.B.,  Ratz  G.P.,  Nielsen H.,
        Andersen A.H., Walbum E., Kjaergaard I.,  Puype M.,  Van Damme  J.,
        Vandekerckhove J.
        Electrophoresis 13:893-959(1992).
   [5]  Appel R.,  Hochstrasser D.F.,  Funk M., Vargas J.R., Pellegrini C.,
        Muller A.F., Scherrer J.-R.
        Electrophoresis 12:722-735(1991).


   2.3  Changes in the DR line

   a) The cross-references  to the  SWISS-2DPAGE and AARHUS/GHENT-2DPAGE 2D
      gel databases (see the description in the above section).

   Data bank identifier:  SWISS-2DPAGE
   Primary identifier  :  The protein spot alphanumeric designation.
                          Note:  SWISS-2DPAGE   uses   SWISS-PROT   primary
                          accession numbers as the alphanumeric designation
                          of spots  that  are linked to SWISS-PROT entries.
   Secondary identifier:  The species  of origin (currently only `HUMAN' is
                          used).
   Example             :  DR   SWISS-2DPAGE; P10599; HUMAN.

   Data bank identifier:  AARHUS/GHENT-2DPAGE
   Primary identifier  :  The protein spot alphanumeric designation.
   Secondary identifier:  Either  `IEF'  (for   isoelectric  focusing)   or
                          `NEPHGE'   (for   non-equilibrium   pH   gradient
                          electrophoresis).
   Example             :  DR   AARHUS/GHENT-2DPAGE; 8006; IEF.
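
   The three fields of such a DR line can be separated mechanically. The
   following Python sketch assumes the layout shown in the two examples
   above (the line code `DR', then three fields separated by `;' and
   closed by a period); CrossReference and parse_dr_line are hypothetical
   names, not part of any official parser.

```python
from typing import NamedTuple


class CrossReference(NamedTuple):
    database: str   # data bank identifier, e.g. SWISS-2DPAGE
    primary: str    # primary identifier, e.g. the spot designation
    secondary: str  # secondary identifier, e.g. HUMAN or IEF


def parse_dr_line(line: str) -> CrossReference:
    """Split a three-field DR line into its identifiers."""
    if not line.startswith("DR"):
        raise ValueError(f"not a DR line: {line!r}")
    # Drop the line code and the closing period, then split on ';'.
    body = line[2:].strip().rstrip(".")
    fields = [field.strip() for field in body.split(";")]
    if len(fields) != 3:
        raise ValueError(f"expected 3 fields, got {len(fields)}: {line!r}")
    return CrossReference(*fields)


print(parse_dr_line("DR   SWISS-2DPAGE; P10599; HUMAN."))
print(parse_dr_line("DR   AARHUS/GHENT-2DPAGE; 8006; IEF."))
```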


   b) Instead of using the release number as the secondary identifier for a
      cross-reference to  the FlyBase  database, we  now use  the gene name
      indicated in the UID table of this database. Example:

      DR   FLYBASE; 00055; RELEASE 9209.

      is now:

      DR   FLYBASE; 00055; ADH.


   c) Instead of  using "EC-2D-GEL"  to designate  the 2D  gel gene-protein
      database of  Escherichia coli,  we now use "ECO2DBASE" to reflect the
      new official name of that database. Example:

      DR   EC-2D-GEL; I026.0; 4TH EDITION.

      is now:

      DR   ECO2DBASE; I026.0; 5TH EDITION.
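
   Software that still holds entries from earlier releases may need to
   map the old data bank identifier to the new one. A minimal Python
   sketch of that renaming, assuming the five-character `DR   ' prefix
   seen in the examples; RENAMED_DATABASES and rename_database are
   hypothetical names. It only renames the data bank identifier: the
   edition in the secondary identifier is left as found, and the FlyBase
   change described in b) is not covered, since it needs the gene names
   from the FlyBase UID table.

```python
# Old-to-new data bank identifiers, taken from section 2.3 c).
RENAMED_DATABASES = {"EC-2D-GEL": "ECO2DBASE"}


def rename_database(line: str) -> str:
    """Return the DR line with an outdated data bank identifier replaced."""
    if not line.startswith("DR"):
        return line  # other line types are passed through unchanged
    tag, rest = line[:5], line[5:]
    database, sep, remainder = rest.partition(";")
    new_name = RENAMED_DATABASES.get(database.strip(), database)
    return f"{tag}{new_name}{sep}{remainder}"


print(rename_database("DR   EC-2D-GEL; I026.0; 4TH EDITION."))
```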



   2.4  Weekly update of SWISS-PROT

   Since the last release we have been providing weekly updates of
   SWISS-PROT. These updates are available by anonymous FTP. Three files
   are updated every week:

   new_seq.dat    Contains all the new entries since the last full release.
   upd_seq.dat    Contains the entries for which the sequence data has been
                  updated since the last release.
   upd_ann.dat    Contains the  entries for  which one  or more  annotation
                  fields have been updated since the last release.

   Currently these  files are  available on  the  following  anonymous  ftp
   servers:

   Organization   EMBL ftp server
   Address        ftp.embl-heidelberg.de (or 192.54.41.33)
   Directory      /pub/databases/swissprot/new

   Organization   Basel Biozentrum Biocomputing server (EMBnet SWISS node)
   Address        bioftp.unibas.ch (or 131.152.8.1)
   Directory      /archive_data/brand_new/swissprot/updates

   Organization   ExPASy (Geneva University Expert Protein Analysis System)
   Address        expasy.hcuge.ch  (or 129.195.254.61)
   Directory      /databases/swiss-prot/updates

   Organization   National Center for Biotechnology Information (NCBI)
   Address        ncbi.nlm.nih.gov (or 130.14.20.1)
   Directory      /repository/swiss-prot/updates
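
   Retrieving the three update files by anonymous FTP could be scripted
   along the following lines. This is only a sketch using Python's
   standard ftplib module: the host names and directories are copied from
   the list above, date from 1993 and are probably no longer reachable,
   and SERVERS, UPDATE_FILES and fetch_updates are hypothetical names.

```python
from ftplib import FTP
from pathlib import Path

# Hosts and directories copied from the server list; these are historical
# addresses and are assumed here only for illustration.
SERVERS = {
    "EMBL": ("ftp.embl-heidelberg.de", "/pub/databases/swissprot/new"),
    "Basel": ("bioftp.unibas.ch", "/archive_data/brand_new/swissprot/updates"),
    "ExPASy": ("expasy.hcuge.ch", "/databases/swiss-prot/updates"),
    "NCBI": ("ncbi.nlm.nih.gov", "/repository/swiss-prot/updates"),
}
UPDATE_FILES = ("new_seq.dat", "upd_seq.dat", "upd_ann.dat")


def fetch_updates(server: str, dest: Path, timeout: float = 30.0) -> list[Path]:
    """Download the three weekly update files from one server into dest."""
    host, directory = SERVERS[server]
    dest.mkdir(parents=True, exist_ok=True)
    saved = []
    with FTP(host, timeout=timeout) as ftp:
        ftp.login()  # called without arguments, this logs in anonymously
        ftp.cwd(directory)
        for name in UPDATE_FILES:
            target = dest / name
            with target.open("wb") as handle:
                ftp.retrbinary(f"RETR {name}", handle.write)
            saved.append(target)
    return saved
```

   A call such as fetch_updates("ExPASy", Path("updates")) would then
   place the three files in a local directory named updates.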

   !!! Important notes !!!

   Although we try to follow a regular schedule, we do not promise to
   update these files every week. In some cases two weeks will elapse
   between two updates.

   Due to the current mechanism used to build a release, the entries
   provided in these updates are not guaranteed to be error-free. Also,
   for the same reason, new entries do not contain an OC (Organism
   Classification) line.



   2.5  Cancelling the announced change in the RA line concerning the
   author names format

   After reconsideration, the RA line change announced in release 24 has
   been cancelled, as the expected benefits did not seem to compensate for
   possible negative effects on existing software. The format of the RA
   lines therefore remains unchanged both in the EMBL Nucleotide Sequence
   Database and in SWISS-PROT.



                            3. ENZYME AND PROSITE

   3.1  The ENZYME data bank

   Release 12.0  of the  ENZYME data bank is distributed with release 25 of
   SWISS-PROT. ENZYME  release 12.0  contains information  relative to 3489
   enzymes.

   3.2  The PROSITE data bank

   Release 10.1 of the PROSITE data bank is distributed with release 25 of
   SWISS-PROT. Release 10.1 contains 635 documentation chapters that
   describe 803 different patterns. Release 10.1 does not really represent
   a new release; the only changes between releases 10.0 and 10.1 are
   updates of the pointers to the SWISS-PROT entries whose names have been
   modified between releases 24 and 25. The next release of PROSITE (11.0)
   will be distributed with release 26 of SWISS-PROT.



                            4. WE NEED YOUR HELP !

   We welcome feedback from our users. We would especially appreciate it if
   you notified us when sequences belonging to your field of expertise are
   missing from the data bank. We would also like to be notified about
   annotations that need updating if, for example, the function of a
   protein has been clarified or new post-translational information has
   become available.



                         APPENDIX A: SOME STATISTICS



   A.1  Amino acid composition

        A.1.1  Composition in percent for the complete data bank

   Ala (A) 7.68   Gln (Q) 4.03   Leu (L) 9.17   Ser (S) 7.07
   Arg (R) 5.26   Glu (E) 6.26   Lys (K) 5.81   Thr (T) 5.84
   Asn (N) 4.43   Gly (G) 7.08   Met (M) 2.35   Trp (W) 1.31
   Asp (D) 5.26   His (H) 2.26   Phe (F) 3.98   Tyr (Y) 3.22
   Cys (C) 1.80   Ile (I) 5.51   Pro (P) 5.05   Val (V) 6.52

   Asx (B) 0.01   Glx (Z) 0.01   Xaa (X) 0.03


        A.1.2  Classification of the amino acids by their frequency

   Leu, Ala, Gly, Ser, Val, Glu, Thr, Lys, Ile, Asp, Arg, Pro, Asn, Gln,
   Phe, Tyr, Met, His, Cys, Trp
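
   The classification in A.1.2 follows from sorting the percentages of
   A.1.1 in decreasing order. A minimal Python sketch, with the values
   copied from the table (B, Z and X are left out) and illustrative
   names. Asp and Arg have the same percentage, so their relative order
   is arbitrary and may differ from the list above.

```python
# Percentages copied from table A.1.1 for the twenty standard amino acids.
COMPOSITION = {
    "Ala": 7.68, "Arg": 5.26, "Asn": 4.43, "Asp": 5.26, "Cys": 1.80,
    "Gln": 4.03, "Glu": 6.26, "Gly": 7.08, "His": 2.26, "Ile": 5.51,
    "Leu": 9.17, "Lys": 5.81, "Met": 2.35, "Phe": 3.98, "Pro": 5.05,
    "Ser": 7.07, "Thr": 5.84, "Trp": 1.31, "Tyr": 3.22, "Val": 6.52,
}

# Sort the three-letter codes from most to least frequent.
ranking = sorted(COMPOSITION, key=lambda aa: COMPOSITION[aa], reverse=True)
print(", ".join(ranking))
```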



   A.2  Distribution of the sequences by their organism of origin

   Total number of species represented in this release of SWISS-PROT: 3876

        A.2.1 Table of the frequency of occurrence of species

        Species represented 1x: 1755
                            2x:  633
                            3x:  376
                            4x:  240
                            5x:  169
                            6x:  133
                            7x:   81
                            8x:   74
                            9x:   79
                           10x:   38
                       11- 20x:  144
                       21- 50x:   92
                       51-100x:   29
                         >100x:   37

        A.2.2  Table of the most represented species

    Number   Frequency          Species
         1        2297          Human
         2        2039          Escherichia coli
         3        1367          Mouse
         4        1262          Rat
         5        1143          Baker's yeast (Saccharomyces cerevisiae)
         6         593          Bovine
         7         527          Fruit fly (Drosophila melanogaster)
         8         463          Chicken
         9         431          Bacillus subtilis
        10         334          African clawed frog (Xenopus laevis)
        11         330          Salmonella typhimurium
        12         315          Rabbit
        13         287          Pig
        14         251          Vaccinia virus (strain Copenhagen)
        15         211          Maize
        16         193          Human cytomegalovirus (strain AD169)
        17         173          Vaccinia virus (strain WR)
        18         171          Rice
        19         168          Bacteriophage T4
        20         164          Arabidopsis thaliana (Mouse-ear cress)
        21         154          Tobacco
        22         151          Wheat
        23         148          Pea
        24         145          Pseudomonas aeruginosa
        25         139          Caenorhabditis elegans
        26         129          Barley
        27         127          Fission yeast (Schizosaccharomyces pombe)
        28         123          Staphylococcus aureus
        29         120          Spinach
        30         119          Sheep
        31         118          Soybean
        32         117          Marchantia polymorpha (Liverwort)
                   117          Slime mold (Dictyostelium discoideum)
        34         106          Dog
                   106          Neurospora crassa
        36         103          Klebsiella pneumoniae
        37         102          Pseudomonas putida



   A.3  Distribution of the sequences by size



               From   To  Number             From   To   Number
                  1-  50    1786             1001-1100      287
                 51- 100    3037             1101-1200      169
                101- 150    4386             1201-1300      142
                151- 200    2924             1301-1400       92
                201- 250    2493             1401-1500       78
                251- 300    2209             1501-1600       39
                301- 350    2035             1601-1700       39
                351- 400    2110             1701-1800       35
                401- 450    1546             1801-1900       38
                451- 500    1738             1901-2000       29
                501- 550    1175             2001-2100       12
                551- 600     828             2101-2200       33
                601- 650     575             2201-2300       43
                651- 700     423             2301-2400       14
                701- 750     402             2401-2500       16
                751- 800     326             >2500           90
                801- 850     240
                851- 900     248
                901- 950     153
                951-1000     165
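
   The size classes of this table are 50 residues wide up to 1000, 100
   residues wide up to 2500, with a single open class beyond that. A
   Python sketch of how a sequence length could be assigned to its class
   follows; size_class is a hypothetical helper and the sample lengths
   are purely illustrative.

```python
from collections import Counter


def size_class(length: int) -> str:
    """Return the label of the size class a sequence length falls into."""
    if length < 1:
        raise ValueError("sequence length must be positive")
    if length > 2500:
        return ">2500"
    # Classes are 50 wide up to 1000 residues and 100 wide above that.
    width = 50 if length <= 1000 else 100
    low = (length - 1) // width * width + 1
    return f"{low}-{low + width - 1}"


# Illustrative lengths only, chosen to hit the class boundaries.
sample_lengths = [50, 51, 1000, 1001, 2500, 5037]
tally = Counter(size_class(n) for n in sample_lengths)
for label, count in tally.items():
    print(f"{label:>10}  {count}")
```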




   Currently the ten largest sequences are:


                            RYNR_RABIT  5037 a.a.
                            RYNR_HUMAN  5032 a.a.
                            APB_HUMAN   4563 a.a.
                            APOA_HUMAN  4548 a.a.
                            RRPA_CVMJH  4488 a.a.
                            DYHC_TRIGR  4466 a.a.
                            GRSB_BACBR  4451 a.a.
                            PLEC_RAT    4140 a.a.
                            POLG_BVDV   3988 a.a.
                            VGF1_IBVB   3951 a.a.



                         APPENDIX B: ON-LINE EXPERTS



   B.1  List of on-line experts for PROSITE and SWISS-PROT


Field of expertise            Name               Email address
---------------------------   ------------------ ----------------------------
Alcohol dehydrogenases        Joernvall H.       hans.jornvall@k1m.ki.se
                              Persson B.         bengt.persson@embl-heidelberg.
                                                 de
Aldehyde dehydrogenases       Joernvall H.       hans.jornvall@k1m.ki.se
                              Persson B.         bengt.persson@embl-heidelberg.
                                                 de
Alpha-crystallins/HSP-20      Leunissen J.A.M.   jackl@caos.caos.kun.nl
                              de Jong W.         u629000@hnykun11.bitnet
Alpha-2-macroglobulins        Van Leuven F.      fred@blekul13.bitnet
AA-tRNA synthetases class II  Leberman R.        leberman@frembl51.bitnet
Apolipoproteins               Boguski M.S.       boguski@ncbi.nlm.nih.gov
AraC family HTH proteins      Ramos J.L.         jlramos@cnbvx3.cnb.uam.es
Arrestins                     Kolakowski L.F.Jr. kolakowski@helix.mgh.harvard.
                                                 edu
ATP synthase c subunit        Recipon H.         recipon@ncbi.nlm.nih.gov
Band 4.1 family proteins      Rees J.            jrees@vax.oxford.ac.uk
Beta-lactamases               Brannigan J.       jab5@vaxa.york.ac.uk
Beta-transducin family        Boguski M.S.       boguski@ncbi.nlm.nih.gov
C-type lectin domain          Drickamer K.       drick@cuhhca.hhmi.columbia.
                                                 edu
Chalcone/stilbene synthases   Schroeder J.       raf@sun1.ruf.uni-freiburg.de
Chaperonins cpn10/cpn60       Georgopoulos C.    georgopo@cmu.unige.ch
Chaperonins TCP1 family       Willison K.R.      willison@icr.ac.uk
Chitinases                    Henrissat B.       bernie@cermav.grenet.fr
Clusterin                     Peitsch M.C.       peitsch@ulbio1.unil.ch
Cold shock domain             Landsman D.        landsman@ncbi.nlm.nih.gov
CTF/NF-I                      Mermod N.          nmermod@ulys.unil.ch
                              Gronostajski R.    gronosr@ccsmtp.ccf.org
Cytochromes P450              Holsztynska E.J.   ela@netcom.uucp
                                                 netcom!ela@apple.com
DEAD-box helicases            Linder P.          linder@urz.unibas.ch
dnaJ family                   Kelley W.          kelley@cmu.unige.ch
EF-hand calcium-binding       Cox J.A.           cox@sc2a.unige.ch
                              Kretsinger R.H.    rhk5i@virginia.bitnet
Elongation factor 1           Amons R.           wmbamons@rulgl.leidenuniv.nl
Enoyl-CoA hydratase           Hofmann K.O.       khofmann@biomed.biolan.
                                                 uni-koeln.de
fruR/lacI family HTH proteins Reizer J.          jreizer@ucsd.edu
GATA-type zinc-fingers        Boguski M.S.       boguski@ncbi.nlm.nih.gov
GDP/GTP dissociation stimul.  Boguski M.S.       boguski@ncbi.nlm.nih.gov
GltP family of transporters   Hofmann K.O.       khofmann@biomed.biolan.
                                                 uni-koeln.de
Glucanases                    Henrissat B.       bernie@cermav.grenet.fr
                              Beguin P.          phycel@pasteur.bitnet
Glutamine synthetase          Tateno Y.          ytateno@genes.nig.ac.jp
G-protein coupled receptors   Chollet A.         arc3029@ggr.co.uk
                              Attwood T.K.       bph6tka@biovax.leeds.ac.uk
GTPase-activating proteins    Boguski M.S.       boguski@ncbi.nlm.nih.gov
HMG1/2 and HMG-14/17          Landsman D.        landsman@ncbi.nlm.nih.gov
Inorganic pyrophosphatases    Kolakowski L.F.Jr. kolakowski@helix.mgh.harvard.
                                                 edu
Integrases                    Roy P.H.           2020000@saphir.ulaval.ca
Kringle domain                Ikeo K.            kikeo@genes.nig.ac.jp
Lipocalins                    Boguski M.S.       boguski@ncbi.nlm.nih.gov
                              Peitsch M.C.       peitsch@ulbio1.unil.ch
lysR family HTH proteins      Henikoff S.        henikoff@sparky.fhcrc.org
MAC components / perforin     Peitsch M.C.       peitsch@ulbio1.unil.ch
Malic enzymes                 Glynias M.         mglynias@ncsa.uiuc.edu
MAM domain                    Bork P.            bork@embl-heidelberg.de
MIP family proteins           Reizer J.          jreizer@ucsd.edu
Myelin proteolipid protein    Hofmann K.O.       khofmann@biomed.biolan.
                                                 uni-koeln.de
Pancreatic trypsin inhibitor  Ikeo K.            kikeo@genes.nig.ac.jp
PEP requiring enzymes         Reizer J.          jreizer@ucsd.edu
pfkB carbohydrate kinases     Reizer J.          jreizer@ucsd.edu
Phytochromes                  Partis M.D.        partis@gcri.afrc.ac.uk
Protein kinases               Hanks S.           hanks@vuctrvax.bitnet
                              Hunter T.          hunter@salk.bitnet
PTS proteins                  Reizer J.          jreizer@ucsd.edu
Restriction-modification      Bickle T.          bickle@urz.unibas.ch
            enzymes           Roberts R.J.       roberts@neb.com
Ribosomal protein S3          Hallick R.         hallick%biotec@arizona.edu
Ribosomal protein S15         Ellis S.R.         srelli01@ulkyvm.bitnet
Ring-cleavage dioxygenases    Harayama S.        harayama@cmu.unige.ch
Signal sequence peptidases    von Heijne G.      gvh@csb.ki.se
                              Dalbey R.E.        rdalbey@magnus.acs.ohio-state.
                                                 edu
Sodium symporters             Reizer J.          jreizer@ucsd.edu
Subtilases                    Brannigan J.       jab5@vaxa.york.ac.uk
                              Siezen R.J.        nizo@caos.caos.kun.nl
Thiol proteases               Turk B.            turk@ijs.ac.mail.yu
Thiol proteases inhibitors    Turk B.            turk@ijs.ac.mail.yu
TNF family                    Jongeneel C.V.     vjongene@isrecmail.unil.ch
TPR repeats                   Boguski M.S.       boguski@ncbi.nlm.nih.gov
Transit peptides              von Heijne G.      gvh@csb.ki.se
Type-II membrane antigens     Levy S.            levy@cellbio.stanford.edu
Uracil-DNA glycosylase        Aasland R.         aasland@bio.uib.no
Vitamin K-depend. Gla domain  Price P.A.         pprice@ucsd.edu
XPGC protein                  Clarkson S.G.      clarkson@cmu.unige.ch
Xylose isomerase              Jenkins J.         jenkins@frira.afrc.ac.uk
WAP-type domain               Claverie J.-M.     jmc@ncbi.nlm.nih.gov
ZP domain                     Bork P.            bork@embl-heidelberg.de

African swine fever virus     Yanez R.J.         ryanez@cbm2.uam.es
Bacteriophage P4              Halling C.         chh9@midway.uchicago.edu
Drosophila                    Ashburner M.       ma11@phx.cam.ac.uk
Escherichia coli              Rudd K.            rudd@ncbi.nlm.nih.gov
Salmonella typhimurium        Rudd K.            rudd@ncbi.nlm.nih.gov
Snakes                        Stocklin R.        stocklin@cmu.unige.ch
Yeast chromosome I            Ouellette F.       francis@monod.biol.mcgill.ca




   B.2  Requirements to fulfill to become an on-line expert

   An expert should be a scientist working with one or more specific
   families of proteins (or specific domains) who would:

   a) Review the protein sequences in SWISS-PROT and the patterns/matrices
      in PROSITE relevant to their field of research.
   b) Agree to be contacted by people who have obtained new sequence(s)
      which seem to belong to "their" family or families of proteins.
   c) Have access to electronic mail and be willing to use it to send and
      receive data.

   If you are willing to be part of this scheme please contact Amos Bairoch
   at one of the following electronic mail addresses:

                             bairoch@cmu.unige.ch
                           bairoch@cgecmu51.bitnet



           APPENDIX C: RELATIONSHIPS BETWEEN BIOMOLECULAR DATABASES

   The current  status of the relationships (cross-references) between some
   biomolecular databases is shown in the following schematic:

                                                       **********************
                        ***********************        * EPD [Euk. Promot.] *
                        *  EMBL Nucleotide    * <----> **********************
                        *  Sequence Data      *
******************      *  Library            *        **********************
* FLYBASE        * <--> *********************** <----- * ECD [E. coli map]  *
* [Drosophila    *                ^         ^          **********************
* genomic d.b.]  * <------+       |         |
******************        |       |         +--------- **********************
                          |       |                    * TFD [Trans. fact.] *
                          |       |         +--------> **********************
                          |       |         |
******************        v       v         v          **********************
* REBASE         *      ***********************        * ENZYME [Nomencl.]  *
* [Restriction   * <--- *  SWISS-PROT         * <----- **********************
*  enzymes]      *      *  Protein Sequence   *            |
******************      *  Data Bank          *            v
                        ***********************        **********************
******************       ^  ^  |  |  ^   ^  |          * OMIM   [Diseases]  *
* EcoGene/EcoSeq *       |  |  |  |  |   |  +--------> **********************
* [E. coli]      * <-----+  |  |  |  |   |
******************          |  |  |  |   +-----------> **********************
                            |  |  |  |                 * ECO2DBASE     [2D] *
                            |  |  |  |                 **********************
******************          |  |  |  |
* PROSITE        * <--------+  |  |  +---------------> **********************
* [Patterns]     *             |  |                    * SWISS-2DPAGE  [2D] *
******************             |  +---------------+    **********************
             |                 v                  |
             |          ***********************   |    **********************
             +--------> * PDB [3D structures] *   +--> * Aarhus/Ghent  [2D] *
                        ***********************        **********************


Swiss-Prot release 24.0

Published December 1, 1992



                    SWISS-PROT RELEASE 24.0 RELEASE NOTES


                               1. INTRODUCTION

   1.1  Evolution

   Release 24.0 of SWISS-PROT contains 28154 sequence entries, comprising
   9'545'427 amino acids abstracted from 27750 references. This represents
   an increase of 5.9% in amino acids (5.4% in entries) over release 23.
   The recent growth of the data bank is summarized below.

   Release    Date   Number of entries     Nb of amino acids

   3.0        11/86               4160               969 641
   4.0        04/87               4387             1 036 010
   5.0        09/87               5205             1 327 683
   6.0        01/88               6102             1 653 982
   7.0        04/88               6821             1 885 771
   8.0        08/88               7724             2 224 465
   9.0        11/88               8702             2 498 140
   10.0       03/89              10008             2 952 613
   11.0       07/89              10856             3 265 966
   12.0       10/89              12305             3 797 482
   13.0       01/90              13837             4 347 336
   14.0       04/90              15409             4 914 264
   15.0       08/90              16941             5 486 399
   16.0       11/90              18364             5 986 949
   17.0       02/91              20024             6 524 504
   18.0       05/91              20772             6 792 034
   19.0       08/91              21795             7 173 785
   20.0       11/91              22654             7 500 130
   21.0       03/92              23742             7 866 596
   22.0       05/92              25044             8 375 696
   23.0       08/92              26706             9 011 391
   24.0       12/92              28154             9 545 427

   1.2  Source of data

   Release 24.0  has been  updated using protein sequence data from release
   34.0 of  the PIR (Protein Identification Resource) protein data bank, as
   well as translation of nucleotide sequence data from release 33.0 of the
   EMBL Nucleotide Sequence Database.

   As an  indication of  the source  of the sequence data in the SWISS-PROT
   data bank we list here the statistics concerning the DR (Database cross-
   references) pointer lines:

   Entries with pointer(s) to only PIR entri(es):            4411
   Entries with pointer(s) to only EMBL entri(es):           3691
   Entries with pointer(s) to both EMBL and PIR entri(es):  19493
   Entries with no pointer lines:                             559


      2. DESCRIPTION OF THE CHANGES MADE TO SWISS-PROT SINCE RELEASE 23


   2.1  Sequences and annotations

   About 1466 sequences have been added since release 23, the sequence data
   of 196  existing entries  has been  updated and  the annotations of 3300
   entries have  been revised.  In particular we have used review  articles
   to update  the annotations  of  the  following  groups  or  families  of
   proteins:

   -  14-3-3 proteins
   -  5'-nucleotidases
   -  7,8-dihydro-6-hydroxymethylpterin-pyrophosphokinase (HPPK)
   -  Actin-capping proteins alpha subunits
   -  Bacterial regulatory proteins, crp/fnr family
   -  Beta-lactamases class B
   -  Calreticulins and calnexins
   -  Chaperonins TCP-1
   -  Chitinases
   -  Clusterins
   -  Cold shock proteins
   -  Dihydropteroate synthase (DHPS)
   -  DNA-directed DNA polymerases (C family)
   -  Glutamyl-tRNA reductases
   -  Glycoprotein hormones
   -  Granulins
   -  Guanine-nucleotide releasing factors CDC24 family
   -  Hirudins
   -  Leader peptidase family
   -  Neurotransmitters transporters
   -  Pancreatic hormone family
   -  Prokaryotic-type release factors
   -  Prolyl endopeptidases
   -  Receptor tyrosine kinase class V (eph, eck, elk, etc.)
   -  secY proteins
   -  Serine proteases, subtilisin family (subtilases)
   -  Transketolases
   -  Transcription factor TFIIB
   -  Transthyretins
   -  Visual pigments (opsins)
   -  Wnt-1 family


   2.2  Weekly update of SWISS-PROT

   Starting with  this release  we are  providing weekly  updates of SWISS-
   PROT. These  updates are available by anonymous FTP. Three files will be
   updated every week:

   new_seq.dat    Contains all the new entries since the last full release.
   upd_seq.dat    Contains the entries for which the sequence data has been
                  updated since the last release.
   upd_ann.dat    Contains the  entries for  which one  or more  annotation
                  fields have been updated since the last release.

   Currently these  files are  available on  the  following  anonymous  ftp
   servers:

   Organization   EMBL ftp server
   Address        ftp.embl-heidelberg.de (or 192.54.41.33)
   Directory      /pub/databases/swissprot/new

   Organization   ExPASy (Geneva University Expert Protein Analysis System)
   Address        expasy.hcuge.ch  (or 129.195.254.61)
   Directory      /databases/swiss-prot/updates

   Organization   National Center for Biotechnology Information (NCBI)
   Address        ncbi.nlm.nih.gov (or 130.14.20.1)
   Directory      /repository/swiss-prot/updates


   !!! Important notes !!!

   Although we  are going  to try  to follow  a regular schedule, we do not
   promise to  update these  files every week. In some cases two weeks will
   elapse between two updates.

   Due to  the current  mechanism used  to build a release the entries that
   are provided in these updates are not guaranteed to be error free. Also,
   for the  same reason,  new entries  will not  contain  an  OC  (Organism
   Classification) line.
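
   As an illustration only, the three weekly files could be fetched with a
   short Python script along the following lines. The host names and
   directories are those listed above and may well no longer be reachable.
   The helper names (update_paths, fetch_updates) and the assumption that
   each file sits directly in the listed directory are ours and are not
   part of the SWISS-PROT distribution.

```python
from ftplib import FTP

# Anonymous FTP servers and directories as listed in section 2.2.
SERVERS = {
    "ftp.embl-heidelberg.de": "/pub/databases/swissprot/new",
    "expasy.hcuge.ch": "/databases/swiss-prot/updates",
    "ncbi.nlm.nih.gov": "/repository/swiss-prot/updates",
}

# The three files that are refreshed with each update.
UPDATE_FILES = ("new_seq.dat", "upd_seq.dat", "upd_ann.dat")


def update_paths(host):
    """Return the remote paths of the three update files on one server.

    Assumption: each file sits directly in the listed directory.
    """
    directory = SERVERS[host]
    return [f"{directory}/{name}" for name in UPDATE_FILES]


def fetch_updates(host, timeout=60):
    """Download the three update files from one server (anonymous login)."""
    with FTP(host, timeout=timeout) as ftp:
        ftp.login()  # no arguments means an anonymous login
        for path in update_paths(host):
            local_name = path.rsplit("/", 1)[1]
            with open(local_name, "wb") as out:
                ftp.retrbinary(f"RETR {path}", out.write)


if __name__ == "__main__":
    # Only list the paths here; call fetch_updates(host) to download.
    for remote in update_paths("expasy.hcuge.ch"):
        print(remote)
```

   Since the files are not guaranteed to change every week, a script of
   this kind would typically be run on a schedule and simply overwrite the
   previous copies.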


                    3. CHANGES PLANNED FOR FUTURE RELEASES

   3.1  Change in the RA line concerning the author names format

   As from  release 25  in March  1993 we  will change the format of author
   names on  RA lines  to conform  to  that  used  by  major  bibliographic
   databases such  as Medline.  The main  change is  that the  periods  and
   hyphens ("-") which currently appear within initials will not appear any
   more. For example, the current:

   RA   Wilson A.C., Smith J.-C.;

   will appear as:

   RA   Wilson AC, Smith JC;
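
   The planned conversion can be sketched in Python as follows. This is a
   minimal illustration, not the program used to build the data bank: the
   function name convert_ra_line is ours, the input is assumed to be a
   single RA line ending in a semicolon, and suffixes such as "Jr." are
   not treated specially.

```python
def convert_ra_line(line):
    """Rewrite one RA line from the current format to the planned one.

    Periods and hyphens are removed from the initials only; surnames are
    left untouched. Assumes a complete RA line that ends with ';'.
    """
    prefix, authors = line[:5], line[5:].rstrip().rstrip(";")
    converted = []
    for author in authors.split(","):
        surname, sep, initials = author.strip().rpartition(" ")
        if not sep:
            # A bare name without initials is kept unchanged.
            converted.append(initials)
            continue
        initials = initials.replace(".", "").replace("-", "")
        converted.append(f"{surname} {initials}")
    return prefix + ", ".join(converted) + ";"


print(convert_ra_line("RA   Wilson A.C., Smith J.-C.;"))
# prints: RA   Wilson AC, Smith JC;
```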



                            4. ENZYME AND PROSITE

   4.1  The ENZYME data bank

   Release 11.0  of the  ENZYME data bank is distributed with release 24 of
   SWISS-PROT. ENZYME  release 11.0  contains information  relative to 3489
   enzymes. The  data bank  has been  significantly modified  to take  into
   account the  information available  in the  new edition of the IUPAC-IUB
   Enzyme Nomenclature  book which  describes many  new enzymes and updates
   the information concerning existing ones.

   4.2  The PROSITE data bank

   Release 10.0  of the PROSITE data bank is distributed with release 24 of
   SWISS-PROT.  Release  10.0  contains  635  documentation  chapters  that
   describe  803  different patterns.  Since  the  last  major  release  of
   PROSITE (release 9.00 of June 1992), 55 new chapters have been added and
   about 255 chapters have been updated. The new chapters are listed below.

   -  14-3-3 proteins signatures
   -  5'-nucleotidase signatures
   -  7,8-dihydro-6-hydroxymethylpterin-pyrophosphokinase signature
   -  Aminotransferases class-IV signature
   -  AP endonucleases family 1 signatures
   -  AP endonucleases family 2 signatures
   -  ArgE / dapE / CPG2 family signatures
   -  Barwin domain signatures
   -  Beta-lactamases class B signatures
   -  Calreticulin family signatures
   -  Chaperonins TCP-1 signatures
   -  Chitinases class I signatures
   -  Chorismate synthase signatures
   -  Dihydropteroate synthase signatures
   -  Electron transfer flavoprotein alpha-subunit signature
   -  Endonuclease III iron-sulfur binding region signature
   -  Enterobacterial virulence outer membrane protein signatures
   -  Formate--tetrahydrofolate ligase signatures
   -  F-actin capping protein alpha subunit signatures
   -  Germin family signature
   -  GltP / dctA family of transporters signatures
   -  Glutamyl-tRNA reductase signature
   -  Glycoprotein hormones alpha chain signatures
   -  Glycosyl hydrolases family 11 active site signatures
   -  Glycosyl hydrolases family 3 active site
   -  Granulins signature
   -  Granulocyte-macrophage colony-stimulating factor signature
   -  Guanine-nucleotide dissociation stimulators CDC24 family signature
   -  Guanine-nucleotide dissociation stimulators CDC25 family signature
   -  Involucrin signature
   -  Phosphoglucomutase & phosphomannomutase phosphohistidine signature
   -  Prokaryotic ornithine and lysine decarboxylases pyridoxal-phosphate
   -  Prokaryotic-type carbonic anhydrases signatures
   -  Prokaryotic-type peptide chain release factors signature
   -  Prolyl endopeptidase family serine active site
   -  Protein secY signatures
   -  Receptor tyrosine kinase class V signatures
   -  Riboflavin synthase alpha chain family Lum-binding site signature
   -  Ribosomal protein L13 signature
   -  Ribosomal protein L30e signature
   -  Ribosomal protein L34 signature
   -  Ribosomal protein S16 signature
   -  Ribosomal protein S17e signature
   -  Ribosomal protein S26e signature
   -  Sigma-54 factors family signatures
   -  Sigma-70 factors family signatures
   -  Single-strand binding protein family signatures
   -  Stress-induced proteins SRP1/TIP1 family signature
   -  S-adenosyl-L-homocysteine hydrolase signatures
   -  Tetrahydrofolate dehydrogenase/cyclohydrolase signatures
   -  Transcription factor TFIIB repeat signature
   -  Transketolase signatures
   -  Transthyretin signatures
   -  WHEP-TRS domain signature
   -  XPAC protein signatures



                            5. WE NEED YOUR HELP !

   We welcome  feedback from our users. We would especially appreciate being
   notified  if  you find  that sequences  belonging to  your field of
   expertise are  missing from  the data  bank. We  also would  like to  be
   notified about  annotations to be updated, if, for example, the function
   of a protein has been clarified or if new post-translational information
   has become available.


                         APPENDIX A: SOME STATISTICS



   A.1  Amino acid composition

        A.1.1  Composition in percent for the complete data bank

   Ala (A) 7.68   Gln (Q) 4.03   Leu (L) 9.15   Ser (S) 7.07
   Arg (R) 5.25   Glu (E) 6.25   Lys (K) 5.80   Thr (T) 5.85
   Asn (N) 4.43   Gly (G) 7.11   Met (M) 2.34   Trp (W) 1.30
   Asp (D) 5.26   His (H) 2.26   Phe (F) 3.97   Tyr (Y) 3.22
   Cys (C) 1.81   Ile (I) 5.50   Pro (P) 5.07   Val (V) 6.51

   Asx (B) 0.01   Glx (Z) 0.01   Xaa (X) 0.03


        A.1.2  Classification of the amino acids by their frequency

   Leu, Ala, Gly, Ser, Val, Glu, Thr, Lys, Ile, Asp, Arg, Pro, Asn, Gln,
   Phe, Tyr, Met, His, Cys, Trp
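
   The ordering in A.1.2 follows directly from the percentages in A.1.1.
   The short Python sketch below reproduces it. The values are copied from
   the table above, the ambiguity codes B, Z and X are left out, and the
   names COMPOSITION and rank_by_frequency are ours.

```python
# Percent composition copied from table A.1.1 (B, Z and X are left out).
COMPOSITION = {
    "Ala": 7.68, "Gln": 4.03, "Leu": 9.15, "Ser": 7.07,
    "Arg": 5.25, "Glu": 6.25, "Lys": 5.80, "Thr": 5.85,
    "Asn": 4.43, "Gly": 7.11, "Met": 2.34, "Trp": 1.30,
    "Asp": 5.26, "His": 2.26, "Phe": 3.97, "Tyr": 3.22,
    "Cys": 1.81, "Ile": 5.50, "Pro": 5.07, "Val": 6.51,
}


def rank_by_frequency(composition):
    """Return residue names ordered from most to least frequent."""
    return sorted(composition, key=composition.get, reverse=True)


print(", ".join(rank_by_frequency(COMPOSITION)))
```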



   A.2  Distribution of the sequences by their organism of origin

   Total number of species represented in this release of SWISS-PROT: 3698

        A.2.1 Table of the frequency of occurrence of species

        Species represented 1x: 1665
                            2x:  617
                            3x:  359
                            4x:  236
                            5x:  153
                            6x:  117
                            7x:   86
                            8x:   65
                            9x:   76
                           10x:   31
                       11- 20x:  147
                       21- 50x:   86
                       51-100x:   24
                         >100x:   36
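
   As a consistency check, the classes in this table can be added up and
   compared with the total number of species quoted above. The Python
   sketch below does so. The counts are copied from the table and the
   variable names are ours.

```python
# Number of species per occurrence class, copied from table A.2.1.
SPECIES_PER_CLASS = {
    "1x": 1665, "2x": 617, "3x": 359, "4x": 236, "5x": 153,
    "6x": 117, "7x": 86, "8x": 65, "9x": 76, "10x": 31,
    "11-20x": 147, "21-50x": 86, "51-100x": 24, ">100x": 36,
}

total_species = sum(SPECIES_PER_CLASS.values())
# Share of species that are represented by a single sequence only.
single_share = 100 * SPECIES_PER_CLASS["1x"] / total_species

print(total_species)           # 3698, the total stated in A.2
print(round(single_share, 1))  # percentage of species seen only once
```

   The 36 species in the ">100x" class are the ones listed individually in
   table A.2.2.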












        A.2.2  Table of the most represented species

    Number   Frequency          Species
         1        2094          Human
         2        1991          Escherichia coli
         3        1297          Mouse
         4        1198          Rat
         5        1092          Baker's yeast (Saccharomyces cerevisiae)
         6         576          Bovine
         7         496          Fruit fly (Drosophila melanogaster)
         8         448          Chicken
         9         423          Bacillus subtilis
        10         318          African clawed frog (Xenopus laevis)
        11         316          Salmonella typhimurium
        12         303          Rabbit
        13         278          Pig
        14         251          Vaccinia virus (strain Copenhagen)
        15         209          Maize
        16         193          Human cytomegalovirus (strain AD169)
        17         168          Bacteriophage T4
        18         162          Vaccinia virus (strain WR)
        19         158          Rice
        20         147          Tobacco
        21         140          Wheat
        22         136          Pseudomonas aeruginosa
        23         134          Arabidopsis thaliana (Mouse-ear cress)
                   134          Pea
        25         128          Barley
        26         120          Staphylococcus aureus
        27         117          Fission yeast (Schizosaccharomyces pombe)
                   117          Marchantia polymorpha (Liverwort)
        29         116          Spinach
        30         113          Sheep
        31         111          Slime mold (Dictyostelium discoideum)
                   111          Soybean
        33         109          Caenorhabditis elegans
        34         104          Dog
        35         103          Neurospora crassa
        36         101          Pseudomonas putida


   A.3  Distribution of the sequences by size



               From   To  Number             From   To   Number
                  1-  50    1706             1001-1100      270
                 51- 100    2911             1101-1200      157
                101- 150    4223             1201-1300      133
                151- 200    2703             1301-1400       86
                201- 250    2317             1401-1500       71
                251- 300    2091             1501-1600       38
                301- 350    1909             1601-1700       36
                351- 400    1878             1701-1800       33
                401- 450    1423             1801-1900       36
                451- 500    1633             1901-2000       27
                501- 550    1095             2001-2100       10
                551- 600     786             2101-2200       33
                601- 650     543             2201-2300       40
                651- 700     402             2301-2400       13
                701- 750     385             2401-2500       15
                751- 800     309             >2500           78
                801- 850     227
                851- 900     236
                901- 950     145
                951-1000     156
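
   The size classes can be summed in the same way. In the Python sketch
   below the counts are copied from the table above and the names are
   ours. The total comes to 28154, which equals the sum of the four DR
   pointer categories given for this release. The share of sequences of at
   most 500 residues is derived from the table and is not a figure quoted
   in these notes.

```python
# (lower bound, upper bound, number of sequences) copied from table A.3;
# an upper bound of None stands for the open-ended ">2500" class.
SIZE_BINS = [
    (1, 50, 1706), (51, 100, 2911), (101, 150, 4223), (151, 200, 2703),
    (201, 250, 2317), (251, 300, 2091), (301, 350, 1909), (351, 400, 1878),
    (401, 450, 1423), (451, 500, 1633), (501, 550, 1095), (551, 600, 786),
    (601, 650, 543), (651, 700, 402), (701, 750, 385), (751, 800, 309),
    (801, 850, 227), (851, 900, 236), (901, 950, 145), (951, 1000, 156),
    (1001, 1100, 270), (1101, 1200, 157), (1201, 1300, 133),
    (1301, 1400, 86), (1401, 1500, 71), (1501, 1600, 38), (1601, 1700, 36),
    (1701, 1800, 33), (1801, 1900, 36), (1901, 2000, 27), (2001, 2100, 10),
    (2101, 2200, 33), (2201, 2300, 40), (2301, 2400, 13), (2401, 2500, 15),
    (2501, None, 78),
]

total_sequences = sum(count for _, _, count in SIZE_BINS)
up_to_500 = sum(count for _, upper, count in SIZE_BINS
                if upper is not None and upper <= 500)
share_up_to_500 = 100 * up_to_500 / total_sequences

print(total_sequences)
print(round(share_up_to_500, 1))
```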


   Currently the ten largest sequences are:


                            RYNR_RABIT  5037 a.a.
                            RYNR_HUMAN  5032 a.a.
                            APB_HUMAN   4563 a.a.
                            APOA_HUMAN  4548 a.a.
                            DYHC_TRIGR  4466 a.a.
                            POLG_BVDV   3988 a.a.
                            VGF1_IBVB   3951 a.a.
                            POLG_HCVA   3898 a.a.
                            POLG_HCVB   3898 a.a.
                            ACVT_PENCH  3791 a.a.


                         APPENDIX B: ON-LINE EXPERTS


   B.1  List of on-line experts for PROSITE and SWISS-PROT


Field of expertise            Name               Email address
---------------------------   ------------------ -----------------------------
Alcohol dehydrogenases        Joernvall H.       hans.jornvall@k1m.ki.se
                              Persson B.         bengt.persson@embl-heidelberg.de
Aldehyde dehydrogenases       Joernvall H.       hans.jornvall@k1m.ki.se
                              Persson B.         bengt.persson@embl-heidelberg.de
Alpha-crystallins/HSP-20      Leunissen J.A.M.   jackl@caos.caos.kun.nl
                              de Jong W.         u629000@hnykun11.bitnet
Alpha-2-macroglobulins        Van Leuven F.      fred@blekul13.bitnet
AA-tRNA synthetases class II  Leberman R.        leberman@frembl51.bitnet
Apolipoproteins               Boguski M.S.       boguski@ncbi.nlm.nih.gov
Arrestins                     Kolakowski L.F.Jr. kolakowski@helix.mgh.harvard.edu
ATP synthase c subunit        Recipon H.         recipon@ncbi.nlm.nih.gov
Band 4.1 family proteins      Rees J.            jrees@vax.oxford.ac.uk
Beta-lactamases               Brannigan J.       jab5@vaxa.york.ac.uk
Beta-transducin family        Boguski M.S.       boguski@ncbi.nlm.nih.gov
C-type lectin domain          Drickamer K.       drick@cuhhca.hhmi.columbia.edu
Chalcone/stilbene synthases   Schroeder J.       raf@sun1.ruf.uni-freiburg.de
Chaperonins cpn10/cpn60       Georgopoulos C.    georgopo@cmu.unige.ch
Chaperonins TCP1 family       Willison K.R.      willison@icr.ac.uk
Chitinases                    Henrissat B.       bernie@cermav.grenet.fr
Clusterin                     Peitsch M.C.       peitsch@ulbio1.unil.ch
Cold shock domain             Landsman D.        landsman@ncbi.nlm.nih.gov
CTF/NF-I                      Mermod N.          nmermod@ulys.unil.ch
                              Gronostajski R.    gronosr@ccsmtp.ccf.org
Cytochromes P450              Holsztynska E.J.   ela@netcom.uucp
                                                 netcom!ela@apple.com
DEAD-box helicases            Linder P.          linder@urz.unibas.ch
dnaJ family                   Kelley W.          kelley@cmu.unige.ch
EF-hand calcium-binding       Cox J.A.           cox@sc2a.unige.ch
                              Kretsinger R.H.    rhk5i@virginia.bitnet
Enoyl-CoA hydratase           Hofmann K.O.       khofmann@biomed.biolan.uni-koeln.de
fruR/lacI family HTH proteins Reizer J.          jreizer@ucsd.edu
GATA-type zinc-fingers        Boguski M.S.       boguski@ncbi.nlm.nih.gov
GDP/GTP dissociation stimul.  Boguski M.S.       boguski@ncbi.nlm.nih.gov
GltP family of transporters   Hofmann K.O.       khofmann@biomed.biolan.uni-koeln.de
Glucanases                    Henrissat B.       bernie@cermav.grenet.fr
                              Beguin P.          phycel@pasteur.bitnet
Glutamine synthetase          Tateno Y.          ytateno@genes.nig.ac.jp
G-protein coupled receptors   Chollet A.         chollet@clients.switch.ch
                              Attwood T.K.       bph6tka@biovax.leeds.ac.uk
GTPase-activating proteins    Boguski M.S.       boguski@ncbi.nlm.nih.gov
HMG1/2 and HMG-14/17          Landsman D.        landsman@ncbi.nlm.nih.gov
Inorganic pyrophosphatases    Kolakowski L.F.Jr. kolakowski@helix.mgh.harvard.edu
Integrases                    Roy P.H.           2020000@saphir.ulaval.ca
Kringle domain                Ikeo K.            kikeo@genes.nig.ac.jp
Lipocalins                    Boguski M.S.       boguski@ncbi.nlm.nih.gov
                              Peitsch M.C.       peitsch@ulbio1.unil.ch
lysR family HTH proteins      Henikoff S.        henikoff@sparky.fhcrc.org
MAC components / perforin     Peitsch M.C.       peitsch@ulbio1.unil.ch
Malic enzymes                 Glynias M.         mglynias@ncsa.uiuc.edu
Myelin proteolipid protein    Hofmann K.O.       khofmann@biomed.biolan.uni-koeln.de
Pancreatic trypsin inhibitor  Ikeo K.            kikeo@genes.nig.ac.jp
PEP requiring enzymes         Reizer J.          jreizer@ucsd.edu
pfkB carbohydrate kinases     Reizer J.          jreizer@ucsd.edu
Phytochromes                  Partis M.D.        partis@gcri.afrc.ac.uk
Protein kinases               Hanks S.           hanks@vuctrvax.bitnet
                              Hunter T.          hunter@salk.bitnet
PTS proteins                  Reizer J.          jreizer@ucsd.edu
Restriction-modification      Bickle T.          bickle@urz.unibas.ch
            enzymes           Roberts R.J.       roberts@neb.com
Ribosomal protein S3          Hallick R.         hallick%biotec@arizona.edu
Ribosomal protein S15         Ellis S.R.         srelli01@ulkyvm.bitnet
Ring-cleavage dioxygenases    Harayama S.        harayama@cmu.unige.ch
Signal sequence peptidases    von Heijne G.      gvh@csb.ki.se
                              Dalbey R.E.        rdalbey@magnus.acs.ohio-state.edu
Sodium symporters             Reizer J.          jreizer@ucsd.edu
Subtilases                    Brannigan J.       jab5@vaxa.york.ac.uk
                              Siezen R.J.        nizo@caos.caos.kun.nl
Thiol proteases               Turk B.            turk@ijs.ac.mail.yu
Thiol proteases inhibitors    Turk B.            turk@ijs.ac.mail.yu
TNF family                    Jongeneel C.V.     vjongene@isrecmail.unil.ch
TPR repeats                   Boguski M.S.       boguski@ncbi.nlm.nih.gov
Transit peptides              von Heijne G.      gvh@csb.ki.se
Type-II membrane antigens     Levy S.            levy@cellbio.stanford.edu
Uracil-DNA glycosylase        Aasland R.         aasland@bio.uib.no
Vitamin K-depend. Gla domain  Price P.A.         pprice@ucsd.edu
Xylose isomerase              Jenkins J.         jenkins@frira.afrc.ac.uk
WAP-type domain               Claverie J.-M.     jmc@ncbi.nlm.nih.gov
ZP domain                     Bork P.            bork@embl-heidelberg.de

African swine fever virus     Yanez R.J.         ryanez@cbm2.uam.es
Bacteriophage P4              Halling C.         chh9@midway.uchicago.edu
Drosophila                    Ashburner M.       ma11@phx.cam.ac.uk
Escherichia coli              Rudd K.            rudd@ncbi.nlm.nih.gov
Salmonella typhimurium        Rudd K.            rudd@ncbi.nlm.nih.gov
Snakes                        Stocklin R.        stocklin@cmu.unige.ch
Yeast chromosome I            Ouellette F.       francis@monod.biol.mcgill.ca


   B.2  Requirements for becoming an on-line expert

   An expert  should be  a scientist  working with  specific famili(es)  of
   proteins (or specific domains) who would:

   a) Review the  protein sequences in SWISS-PROT and the patterns/matrices
      in PROSITE relevant to their field of research.
   b) Agree to  be contacted  by people  who have obtained new sequence(s)
      which seem to belong to "their" famili(es) of proteins.
   c) Have access  to electronic  mail and be willing to use it to send and
      receive data.

   If you are willing to be part of this scheme please contact Amos Bairoch
   at one of the following electronic mail addresses:

                             bairoch@cmu.unige.ch
                           bairoch@cgecmu51.bitnet


           APPENDIX C: RELATIONSHIPS BETWEEN BIOMOLECULAR DATABASES


   The current  status of the relationships (cross-references) between some
   biomolecular databases is shown in the following schematic:


                                                       **********************
                        *********************** <----- * EPD [Euk. Promot.] *
                        *  EMBL Nucleotide    * -----> **********************
                        *  Sequence Data      *
***************** ----> *  Library            *        **********************
* FLYBASE       * <---- *********************** <----- * ECD [E. coli map]  *
* [Drosophila   *                ^  |       ^          **********************
* genetic maps] * --------+      |  |       |
***************** <-----+ |      |  |       +--------- **********************
                        | |      |  |       +--------- * TFD [Trans. fact.] *
                        | |      |  |       | +------> **********************
                        | |      |  |       | |
*****************       | v      |  v       v |        **********************
* REBASE        *       ***********************        * ENZYME [Nomencl.]  *
* [Restriction  * <---- *  SWISS-PROT         * <----- **********************
*  enzymes]     *       *  Protein Sequence   *            |
*****************       *  Data Bank          *            v
                        ***********************        **********************
*****************         | ^  |  ^ |  ^ |  |          * OMIM   [Diseases]  *
* PROSITE       * <-------+ |  |  | |  | |  +--------> **********************
* [Patterns]    * ----------+  |  | |  | |
*****************              |  | |  | +-----------> **********************
             |                 |  | |  +-------------- * E. coli 2D gels    *
             |                 |  | |                  **********************
             |                 |  | |
             |                 |  | +----------------> **********************
             |                 |  +------------------- * EcoGene/EcoSeq     *
             |                 v                       **********************
             |          ***********************
             +--------> * PDB [3D structures] *
                        ***********************
  

Swiss-Prot release 23.0

Published August 1, 1992



                    SWISS-PROT RELEASE 23.0 RELEASE NOTES


                               1. INTRODUCTION

   1.1  Evolution

   Release 23.0  of SWISS-PROT  contains 26706 sequence entries, comprising
   9'011'391 amino  acids abstracted from 26485 references. This represents
   an increase  of 7.6% over release 22. The recent growth of the data bank
   is summarized below.

   Release    Date   Number of entries     Nb of amino acids

   3.0        11/86               4160               969 641
   4.0        04/87               4387             1 036 010
   5.0        09/87               5205             1 327 683
   6.0        01/88               6102             1 653 982
   7.0        04/88               6821             1 885 771
   8.0        08/88               7724             2 224 465
   9.0        11/88               8702             2 498 140
   10.0       03/89              10008             2 952 613
   11.0       07/89              10856             3 265 966
   12.0       10/89              12305             3 797 482
   13.0       01/90              13837             4 347 336
   14.0       04/90              15409             4 914 264
   15.0       08/90              16941             5 486 399
   16.0       11/90              18364             5 986 949
   17.0       02/91              20024             6 524 504
   18.0       05/91              20772             6 792 034
   19.0       08/91              21795             7 173 785
   20.0       11/91              22654             7 500 130
   21.0       03/92              23742             7 866 596
   22.0       05/92              25044             8 375 696
   23.0       08/92              26706             9 011 391
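
   The growth between consecutive releases can be recomputed from this
   table. The Python sketch below does so for the three most recent rows.
   The function and variable names are ours. Computed this way, the 7.6%
   quoted in section 1.1 corresponds to the number of amino acids, while
   the number of entries grows by about 6.6%.

```python
# (release, number of entries, number of amino acids), copied from the
# three most recent rows of the table in section 1.1.
RELEASES = [
    ("21.0", 23742, 7866596),
    ("22.0", 25044, 8375696),
    ("23.0", 26706, 9011391),
]


def percent_growth(old, new):
    """Relative increase from old to new, in percent."""
    return 100.0 * (new - old) / old


# Compare each release with the one before it.
for (_, e0, a0), (release, e1, a1) in zip(RELEASES, RELEASES[1:]):
    print(release,
          round(percent_growth(e0, e1), 1),   # growth in entries
          round(percent_growth(a0, a1), 1))   # growth in amino acids
```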

   1.2  Source of data

   Release 23.0  has been  updated using protein sequence data from release
   33.0 of  the PIR (Protein Identification Resource) protein data bank, as
   well as translation of nucleotide sequence data from release 31.0 of the
   EMBL Nucleotide Sequence Database.

   As an  indication of  the source  of the sequence data in the SWISS-PROT
   data bank we list here the statistics concerning the DR (Database cross-
   references) pointer lines:

   Entries with pointer(s) to only PIR entri(es):               4368
   Entries with pointer(s) to only EMBL entri(es):              3365
   Entries with pointer(s) to both EMBL and PIR entri(es):     18444
   Entries with no pointer lines:                                529
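
   The four categories are mutually exclusive, so they should add up to
   the 26706 entries reported in section 1.1. A short Python sketch of
   that check, with each category's share of the data bank, is given
   below. The counts are copied from the list above and the names are
   ours.

```python
# DR pointer statistics for release 23, copied from the list above.
POINTER_COUNTS = {
    "PIR only": 4368,
    "EMBL only": 3365,
    "EMBL and PIR": 18444,
    "no pointer lines": 529,
}

total_entries = sum(POINTER_COUNTS.values())

# Print each category with its count and its share of all entries.
for category, count in POINTER_COUNTS.items():
    share = 100 * count / total_entries
    print(f"{category:18s}{count:6d}{share:7.1f}%")

print(total_entries)  # 26706, the number of entries in release 23.0
```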


      2. DESCRIPTION OF THE CHANGES MADE TO SWISS-PROT SINCE RELEASE 22


   2.1  Sequences and annotations

   About 1680 sequences have been added since release 22, the sequence data
   of 235  existing entries  has been  updated and  the annotations of 3400
   entries have  been revised.  In particular we have used review  articles
   to update  the annotations  of  the  following  groups  or  families  of
   proteins:

   -  AP endonucleases
   -  Bacterial regulatory proteins, lacI family
   -  Electron transfer flavoprotein alpha-subunit
   -  Enterobacterial virulence outer membrane protein
   -  Formate--tetrahydrofolate ligase
   -  Germin family
   -  Guanine-nucleotide releasing factors CDC25 family
   -  Lipoxygenases
   -  Prokaryotic ornithine and lysine decarboxylases
   -  Prokaryotic-type carbonic anhydrases
   -  Riboflavin synthase alpha chain family
   -  Ribosomal proteins
   -  Sigma-54 factors family
   -  Sigma-70 factors family
   -  Single strand binding protein family
   -  Stress-induced proteins SRP1/TIP1 family
   -  TNF family

                    3. CHANGES PLANNED FOR FUTURE RELEASES

   3.1  Change in the RA line concerning the author names format

   As from  release 25  in March  1993 we  will change the format of author
   names on  RA lines  to conform  to  that  used  by  major  bibliographic
   databases such  as Medline.  The main  change is  that the  periods  and
   hyphens ("-") which currently appear within initials will not appear any
   more. For example, the current:

   RA   Wilson A.C., Smith J.-C.;

   will appear as:

   RA   Wilson AC, Smith JC;

   3.2  Weekly update of SWISS-PROT

   Starting with release 24 in November 1992 we will provide weekly updates
   of SWISS-PROT. Instructions  on  how  to access the update files will be
   given at the next release.



                            4. ENZYME AND PROSITE

   4.1  The ENZYME data bank

   Release 10.0  of the  ENZYME data bank is distributed along with release
   23 of  SWISS-PROT. ENZYME  release 10.0 contains information relative to
   3183 enzymes.  The data  bank will probably be significantly modified at
   the next  release due  to the publication of a new edition of the IUPAC-
   IUB Enzyme Nomenclature book which describes many new enzymes and updates
   the information concerning existing ones.

   4.2  The PROSITE data bank

   Release 9.10  of the PROSITE data bank is distributed along with release
   23 of  SWISS-PROT. Release 9.10 contains 580 documentation chapters that
   describe 689  different patterns. Release 9.10 does not really represent
   a new  release; the  only changes  between  release  9.0  and  9.10  are
   updates to  the pointers to the SWISS-PROT entries whose names have been
   modified between  release 22  and 23. The next release of PROSITE (10.0)
   will be distributed with release 24 of SWISS-PROT.


                            5. WE NEED YOUR HELP !

   We welcome  feedback from our users. We would especially appreciate being
   notified  if  you find  that sequences  belonging to  your field of
   expertise are  missing from  the data  bank. We  also would  like to  be
   notified about  annotations to be updated, if, for example, the function
   of a protein has been clarified or if new post-translational information
   has become available.


                         APPENDIX A: SOME STATISTICS



   A.1  Amino acid composition

        A.1.1  Composition in percent for the complete data bank

   Ala (A) 7.66   Gln (Q) 4.06   Leu (L) 9.15   Ser (S) 7.07
   Arg (R) 5.24   Glu (E) 6.25   Lys (K) 5.82   Thr (T) 5.84
   Asn (N) 4.45   Gly (G) 7.10   Met (M) 2.34   Trp (W) 1.31
   Asp (D) 5.25   His (H) 2.26   Phe (F) 3.97   Tyr (Y) 3.21
   Cys (C) 1.80   Ile (I) 5.50   Pro (P) 5.06   Val (V) 6.50

   Asx (B) 0.01   Glx (Z) 0.01   Xaa (X) 0.03


        A.1.2  Classification of the amino acids by their frequency

   Leu, Ala, Gly, Ser, Val, Glu, Thr, Lys, Ile, Asp, Arg, Pro, Asn, Gln,
   Phe, Tyr, Met, His, Cys, Trp



   A.2  Distribution of the sequences by their organism of origin

   Total number of species represented in this release of SWISS-PROT: 3497

        A.2.1 Table of the frequency of occurrence of species

        Species represented 1x: 1537
                            2x:  612
                            3x:  345
                            4x:  222
                            5x:  148
                            6x:  117
                            7x:   76
                            8x:   60
                            9x:   71
                           10x:   30
                       11- 20x:  144
                       21- 50x:   78
                       51-100x:   24
                         >100x:   33


        A.2.2  Table of the most represented species

    Number   Frequency          Species
         1        2018          Human
         2        1918          Escherichia coli
         3        1220          Mouse
         4        1154          Rat
         5        1053          Baker's yeast (Saccharomyces cerevisiae)
         6         556          Bovine
         7         485          Fruit fly (Drosophila melanogaster)
         8         428          Chicken
         9         402          Bacillus subtilis
        10         311          Salmonella typhimurium
        11         310          African clawed frog (Xenopus laevis)
        12         297          Rabbit
        13         273          Pig
        14         251          Vaccinia virus (strain Copenhagen)
        15         197          Maize
        16         193          Human cytomegalovirus (strain AD169)
        17         168          Bacteriophage T4
        18         159          Vaccinia virus (strain WR)
        19         153          Rice
        20         140          Tobacco
        21         138          Wheat
        22         128          Pea
        23         120          Barley
        24         119          Pseudomonas aeruginosa
                   119          Staphylococcus aureus
        26         117          Marchantia polymorpha (liverwort)
        27         116          Arabidopsis thaliana (Mouse-ear cress)
        28         111          Slime mold (Dictyostelium discoideum)
        29         110          Fission yeast (Schizosaccharomyces pombe)
        30         106          Soybean
        31         104          Caenorhabditis elegans
                   104          Sheep
                   104          Spinach
        34         100          Klebsiella pneumoniae
                   100          Pseudomonas putida
                   100          Dog


   A.3  Distribution of the sequences by size



               From   To  Number             From   To   Number
                  1-  50    1644             1001-1100      258
                 51- 100    2839             1101-1200      147
                101- 150    4010             1201-1300      129
                151- 200    2576             1301-1400       79
                201- 250    2168             1401-1500       64
                251- 300    1987             1501-1600       37
                301- 350    1804             1601-1700       32
                351- 400    1773             1701-1800       32
                401- 450    1340             1801-1900       35
                451- 500    1490             1901-2000       27
                501- 550    1053             2001-2100       10
                551- 600     742             2101-2200       32
                601- 650     512             2201-2300       39
                651- 700     378             2301-2400       13
                701- 750     367             2401-2500       14
                751- 800     291             >2500           73
                801- 850     216
                851- 900     220
                901- 950     140
                951-1000     135


   Currently the ten largest sequences are:


                            RYNR_RABIT  5037 a.a.
                            RYNR_HUMAN  5032 a.a.
                            APB_HUMAN   4563 a.a.
                            APOA_HUMAN  4548 a.a.
                            DYHC_TRIGR  4466 a.a.
                            POLG_BVDV   3988 a.a.
                            VGF1_IBVB   3951 a.a.
                            POLG_HCVA   3898 a.a.
                            POLG_HCVB   3898 a.a.
                            ACVT_PENCH  3791 a.a.


                         APPENDIX B: ON-LINE EXPERTS



   B.1  List of on-line experts for PROSITE and SWISS-PROT


Field of expertise            Name               Email address
---------------------------   ------------------ ----------------------------
Alcohol dehydrogenases        Joernvall H.       hans.jornvall@k1m.ki.se
                              Persson B.         bengt@medfys.ki.se
Aldehyde dehydrogenases       Joernvall H.       hans.jornvall@k1m.ki.se
                              Persson B.         bengt@medfys.ki.se
Alpha-crystallins/HSP-20      Leunissen J.A.M.   jackl@caos.caos.kun.nl
                              de Jong W.         u629000@hnykun11.bitnet
Alpha-2-macroglobulins        Van Leuven F.      fred@blekul13.bitnet
AA-tRNA synthetases class II  Leberman R.        leberman@frembl51.bitnet
Apolipoproteins               Boguski M.S.       boguski@ncbi.nlm.nih.gov
Arrestins                     Kolakowski L.F.Jr. kolakowski@helix.mgh.harvard.edu
Band 4.1 family proteins      Rees J.            jrees@vax.oxford.ac.uk
Beta-lactamases               Brannigan J.       jab5@vaxa.york.ac.uk
Beta-transducin family        Boguski M.S.       boguski@ncbi.nlm.nih.gov
Chalcone/stilbene synthases   Schroeder J.       raf@sun1.ruf.uni-freiburg.de
Chaperonins cpn10/cpn60       Georgopoulos C.    georgopo@cmu.unige.ch
Chaperonins TCP1 family       Willison K.R.      willison@icrf.ac.uk
Chitinases                    Henrissat B.       cermav@frgren81.bitnet
Clusterin                     Peitsch M.C.       peitsch@ulbio1.unil.ch
CTF/NF-I                      Mermod N.          nmermod@ulys.unil.ch
Cytochromes P450              Holsztynska E.J.   ela@netcom.uucp
                                                 netcom!ela@apple.com
DEAD-box helicases            Linder P.          linder@urz.unibas.ch
dnaJ family                   Kelley W.          kelley@cmu.unige.ch
EF-hand calcium-binding       Cox J.A.           cox@sc2a.unige.ch
                              Kretsinger R.H.    rhk5i@virginia.bitnet
Enoyl-CoA hydratase           Hofmann K.O.       khofmann@cipvax.biolan.uni-koeln.de
fruR/lacI family HTH proteins Reizer J.          jreizer@ucsd.edu
GATA-type zinc-fingers        Boguski M.S.       boguski@ncbi.nlm.nih.gov
Glucanases                    Henrissat B.       cermav@frgren81.bitnet
                              Beguin P.          phycel@pasteur.bitnet
G-protein coupled receptors   Chollet A.         chollet@clients.switch.ch
                              Attwood T.K.       bph6tka@biovax.leeds.ac.uk
GTPase-activating proteins    Boguski M.S.       boguski@ncbi.nlm.nih.gov
HMG1/2 and HMG-14/17          Landsman D.        landsman@ncbi.nlm.nih.gov
Inorganic pyrophosphatases    Kolakowski L.F.Jr. kolakowski@helix.mgh.harvard.edu
Integrases                    Roy P.H.           2020000@saphir.ulaval.ca
Lipocalins                    Boguski M.S.       boguski@ncbi.nlm.nih.gov
                              Peitsch M.C.       peitsch@ulbio1.unil.ch
lysR family HTH proteins      Henikoff S.        henikoff@sparky.fhcrc.org
MAC components / perforin     Peitsch M.C.       peitsch@ulbio1.unil.ch
Malic enzymes                 Glynias M.         mglynias@ncsa.uiuc.edu
Myelin proteolipid protein    Hofmann K.O.       khofmann@cipvax.biolan.uni-koeln.de
PEP requiring enzymes         Reizer J.          jreizer@ucsd.edu
pfkB carbohydrate kinases     Reizer J.          jreizer@ucsd.edu
Phytochromes                  Partis M.D.        partis@gcri.afrc.ac.uk
Protein kinases               Hanks S.           hanks@vuctrvax.bitnet
                              Hunter T.          hunter@salk.bitnet
PTS proteins                  Reizer J.          jreizer@ucsd.edu
Restriction-modification      Bickle T.          bickle@urz.unibas.ch
            enzymes           Roberts R.J.       roberts@neb.com
Ribosomal protein S3          Hallick R.         hallick%biotec@arizona.edu
Ribosomal protein S15         Ellis S.R.         srelli01@ulkyvm.bitnet
Ring-cleavage dioxygenases    Harayama S.        harayama@cmu.unige.ch
Sodium symporters             Reizer J.          jreizer@ucsd.edu
Subtilases                    Brannigan J.       jab5@vaxa.york.ac.uk
Thiol proteases               Turk B.            turk@ijs.ac.mail.yu
Thiol proteases inhibitors    Turk B.            turk@ijs.ac.mail.yu
TNF family                    Jongeneel C.V.     vjongene@isrec.arcom.ch
TPR repeats                   Boguski M.S.       boguski@ncbi.nlm.nih.gov
Transit peptides              von Heijne G.      gunnar@cbts.sunet.se
Type-II membrane antigens     Levy S.            levy@cellbio.stanford.edu
Uracil-DNA glycosylase        Aasland R.         aasland@bio.uib.no
Xylose isomerase              Jenkins J.         jenkins@frira.afrc.ac.uk
WAP-type domain               Claverie J.-M.     jmc@ncbi.nlm.nih.gov
ZP domain                     Bork P.            bork@embl-heidelberg.de

African swine fever virus     Yanez R.J.         ryanez@cbm2.uam.es
Bacteriophage P4              Halling C.         chh9@midway.uchicago.edu
Drosophila                    Ashburner M.       ma11@phx.cam.ac.uk
Escherichia coli              Rudd K.            rudd@ncbi.nlm.nih.gov
Salmonella typhimurium        Rudd K.            rudd@ncbi.nlm.nih.gov
Snakes                        Stocklin R.        stocklin@cmu.unige.ch
Yeast chromosome I            Ouellette F.       francis@monod.biol.mcgill.ca


   B.2  Requirements to fulfill to become an on-line expert

   An expert should be a scientist working with specific family(ies) of
   proteins (or specific domains) who would:

   a) Review the protein sequences in SWISS-PROT and the patterns/matrices
      in PROSITE relevant to their field of research.
   b) Agree to be contacted by people who have obtained new sequence(s)
      which seem to belong to "their" family(ies) of proteins.
   c) Have access to electronic mail and be willing to use it to send and
      receive data.

   If you are willing to be part of this scheme, please contact Amos Bairoch
   at one of the following electronic mail addresses:

                             bairoch@cmu.unige.ch
                           bairoch@cgecmu51.bitnet


           APPENDIX C: RELATIONSHIPS BETWEEN BIOMOLECULAR DATABASES

   The current  status of the relationships (cross-references) between some
   biomolecular databases is shown in the following schematic:

                                                       **********************
                        *********************** <----- * EPD [Euk. Promot.] *
                        *  EMBL Nucleotide    * -----> **********************
                        *  Sequence Data      *
***************** ----> *  Library            *        **********************
* FLYBASE       * <---- *********************** <----- * ECD [E. coli map]  *
* [Drosophila   *                ^  |       ^          **********************
* genetic maps] * --------+      |  |       |
***************** <-----+ |      |  |       +--------- **********************
                        | |      |  |       +--------- * TFD [Trans. fact.] *
                        | |      |  |       | +------> **********************
                        | |      |  |       | |
*****************       | v      |  v       v |        **********************
* REBASE        *       ***********************        * ENZYME [Nomencl.]  *
* [Restriction  * <---- *  SWISS-PROT         * <----- **********************
*  enzymes]     *       *  Protein Sequence   *            |
*****************       *  Data Bank          *            v
                        ***********************        **********************
*****************         | ^  |  ^ |  ^ |  |          * OMIM   [Diseases]  *
* PROSITE       * <-------+ |  |  | |  | |  +--------> **********************
* [Patterns]    * ----------+  |  | |  | |
*****************              |  | |  | +-----------> **********************
             |                 |  | |  +-------------- * E. coli 2D gels    *
             |                 |  | |                  **********************
             |                 |  | |
             |                 |  | +----------------> **********************
             |                 |  +------------------- * EcoGene/EcoSeq     *
             |                 v                       **********************
             |          ***********************
             +--------> * PDB [3D structures] *
                        ***********************


Swiss-Prot release 22.0

Published May 1, 1992



                    SWISS-PROT RELEASE 22.0 RELEASE NOTES


                               1. INTRODUCTION

   1.1  Evolution

   Release 22.0 of SWISS-PROT contains 25044 sequence entries, comprising
   8'375'696 amino acids abstracted from 25613 references. This represents
   an increase of 6.5% in amino acids (about 5.5% in entries) over release
   21. The recent growth of the data bank is summarized below.

   Release    Date   Number of entries     Nb of amino acids

   3.0        11/86               4160               969 641
   4.0        04/87               4387             1 036 010
   5.0        09/87               5205             1 327 683
   6.0        01/88               6102             1 653 982
   7.0        04/88               6821             1 885 771
   8.0        08/88               7724             2 224 465
   9.0        11/88               8702             2 498 140
   10.0       03/89              10008             2 952 613
   11.0       07/89              10856             3 265 966
   12.0       10/89              12305             3 797 482
   13.0       01/90              13837             4 347 336
   14.0       04/90              15409             4 914 264
   15.0       08/90              16941             5 486 399
   16.0       11/90              18364             5 986 949
   17.0       02/91              20024             6 524 504
   18.0       05/91              20772             6 792 034
   19.0       08/91              21795             7 173 785
   20.0       11/91              22654             7 500 130
   21.0       03/92              23742             7 866 596
   22.0       05/92              25044             8 375 696
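
   As a worked check of the growth figure quoted above, the following Python
   sketch recomputes the increase from release 21.0 to release 22.0 using
   the last two rows of the table. The helper name percent_increase is
   illustrative only and is not part of any SWISS-PROT software. With these
   figures, the 6.5% quoted corresponds to the amino acid count, while the
   number of entries grew by about 5.5%.

```python
# Figures copied from the growth table (releases 21.0 and 22.0).
releases = {
    "21.0": {"entries": 23742, "amino_acids": 7_866_596},
    "22.0": {"entries": 25044, "amino_acids": 8_375_696},
}


def percent_increase(old: int, new: int) -> float:
    """Return the relative growth from old to new, in percent."""
    return (new - old) / old * 100.0


entry_growth = percent_increase(
    releases["21.0"]["entries"], releases["22.0"]["entries"]
)
residue_growth = percent_increase(
    releases["21.0"]["amino_acids"], releases["22.0"]["amino_acids"]
)

print(f"entries:     {entry_growth:.1f}%")    # entries:     5.5%
print(f"amino acids: {residue_growth:.1f}%")  # amino acids: 6.5%
```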


   1.2  Source of data

   Release 22.0  has been  updated using protein sequence data from release
   31.0 of  the PIR (Protein Identification Resource) protein data bank, as
   well as translation of nucleotide sequence data from release 30.0 of the
   EMBL Nucleotide Sequence Database.

   As an indication of the source of the sequence data in the SWISS-PROT
   data bank we list here the statistics concerning the DR (Database cross-
   references) pointer lines:

   Entries with pointer(s) to only PIR entry(ies):              4309
   Entries with pointer(s) to only EMBL entry(ies):             3327
   Entries with pointer(s) to both EMBL and PIR entry(ies):    16868
   Entries with no pointer lines:                                540
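
   The four categories above partition the data bank, so their counts should
   add up to the number of entries given in section 1.1. The Python sketch
   below checks this and derives the share of each category. The dictionary
   keys are shorthand labels chosen for this example.

```python
# Counts copied from the DR pointer statistics for release 22.0.
dr_counts = {
    "PIR only": 4309,
    "EMBL only": 3327,
    "EMBL and PIR": 16868,
    "no pointers": 540,
}
total_entries = 25044  # number of entries stated in section 1.1

# The categories are mutually exclusive, so they must sum to the total.
assert sum(dr_counts.values()) == total_entries

shares = {
    category: round(100 * count / total_entries, 1)
    for category, count in dr_counts.items()
}
for category, count in dr_counts.items():
    print(f"{category:<14}{count:>6}  {shares[category]:5.1f}%")
```

   About two thirds of the entries point to both EMBL and PIR.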


      2. DESCRIPTION OF THE CHANGES MADE TO SWISS-PROT SINCE RELEASE 21


   2.1  Sequences and annotations

   About 1320 sequences have been added since release 21, the sequence data
   of 180 existing entries has been updated and the annotations of 3800
   entries have been revised. In particular we have used review articles
   to update the annotations of the following groups or families of
   proteins:

   -  Bacterial regulatory proteins, luxR family
   -  Bacterial regulatory proteins, ntrC family
   -  Band 4.1 proteins family
   -  Beta-amylases
   -  Beta-transducin family
   -  BetvI family pathogenesis proteins
   -  bglG-type antiterminator family
   -  Bromodomain proteins
   -  CBF/NF-Y subunits
   -  CDC48 / PAS1 / SEC18 / TBP-1 family
   -  Chaperonins cpn10
   -  Colipases
   -  Collagens
   -  D-amino acid oxidases
   -  D-isomer specific 2-hydroxyacid dehydrogenases
   -  Dihydrodipicolinate synthases
   -  DnaJ-like proteins
   -  Endoglucanases / exoglucanases / xylanases
   -  Fork head domain proteins
   -  G-protein coupled receptors family 2
   -  GMC oxidoreductases
   -  Herpes viruses proteins
   -  Inositol monophosphatase family
   -  LIM domain proteins
   -  Mevalonate and phosphomevalonate kinases
   -  Mitochondrial creatine kinases
   -  NADH-ubiquinone oxidoreductases peripheral subunits
   -  Orotidine 5'-phosphate decarboxylase
   -  papD family chaperonins
   -  Polygalacturonases
   -  Polyprenyl synthetases
   -  Regulator of chromosome condensation (RCC1)
   -  Respiratory-chain NADH dehydrogenase nuclear encoded subunits
   -  Ribonuclease P protein component
   -  Serine proteases, V8-type family
   -  Snake toxins
   -  Thiol (cysteine) proteases
   -  TNF receptors / low-affinity NGF receptor / CD40 family
   -  Vinculins and alpha-catenins
   -  ZP domain proteins


   2.2  New topic for the comments (CC) line type

   We have introduced in this release a new 'topic' for the comments (CC)
   line type: PTM. This topic is used to describe post-translational
   modification(s) whose position(s) on the sequence are not yet known and
   which thus cannot be shown in the feature table.

   Examples of its usage:

   CC   -!- PTM: PHOSPHORYLATED IN VITRO ON SERINE(S) AND THREONINE(S)
   CC       BY PKC.

   CC   -!- PTM: HEAVILY GLYCOSYLATED WITH SULFATED N-LINKED CARBOHYDRATES.

   CC   -!- PTM: SIX DISULFIDE BONDS ARE PRESENT.
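
   As an illustration of how such comments can be read by a program, the
   Python sketch below collects CC blocks by topic and joins continuation
   lines. It assumes only the layout visible in the examples above (a block
   starts with "-!- TOPIC:" and continuation lines carry no "-!-"). The
   function name parse_cc_topics is hypothetical, and the sample list mixes
   two of the examples above as if they came from a single entry.

```python
from typing import Dict, List, Optional


def parse_cc_topics(lines: List[str]) -> Dict[str, List[str]]:
    """Group CC comment blocks by topic, joining continuation lines."""
    topics: Dict[str, List[str]] = {}
    current_topic: Optional[str] = None
    for line in lines:
        if not line.startswith("CC"):
            continue  # not a comment line
        body = line[2:].strip()
        if body.startswith("-!-"):
            # Start of a new block: "-!- TOPIC: text"
            topic, _, text = body[3:].strip().partition(":")
            current_topic = topic.strip()
            topics.setdefault(current_topic, []).append(text.strip())
        elif current_topic is not None and body:
            # Continuation line: append to the most recent block.
            topics[current_topic][-1] += " " + body
    return topics


entry_lines = [
    "CC   -!- PTM: PHOSPHORYLATED IN VITRO ON SERINE(S) AND THREONINE(S)",
    "CC       BY PKC.",
    "CC   -!- PTM: SIX DISULFIDE BONDS ARE PRESENT.",
]
ptm_comments = parse_cc_topics(entry_lines).get("PTM", [])
for comment in ptm_comments:
    print(comment)
```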



   2.3  New feature key

   The UNSURE key is used to describe region(s) of a sequence for which the
   authors are unsure about the sequence assignment.



                            3. ENZYME AND PROSITE

   3.1  The ENZYME data bank

   Release 9.0 of the ENZYME data bank is distributed along with release 22
   of SWISS-PROT. ENZYME release 9.0 contains information on 3179 enzymes.
   The data bank is complete and up to date. Until new enzyme nomenclature
   data is published, we only plan to update the SWISS-PROT pointers at
   each release of the protein sequence data bank, correct any errors, and
   complete the information concerning synonyms and cofactors using the
   literature.

   3.2  The PROSITE data bank

   Release 9.00 of the PROSITE data bank is distributed along with release
   22 of SWISS-PROT. Release 9.00 contains 580 documentation chapters that
   describe 689 different patterns.

   Since the last major release of PROSITE (release 8.00 of December 1991),
   50 new  chapters have  been added  and  about  200  chapters  have  been
   updated. The new chapters are listed below.

   -  Acid phosphatases signature
   -  Bacterial protein export pilT protein family signature
   -  Bacterial regulatory proteins, luxR family signature
   -  Bacterial Ribonuclease P protein component signature
   -  Band 4.1 family domain signatures
   -  Beta-transducin family Trp-Asp repeats signature
   -  Bromodomain
   -  CBF/NF-Y subunits signatures
   -  CDC48 / PAS1 / SEC18 family signature
   -  Chaperonins cpn10 signature
   -  C-type lectin domain signature
   -  Dihydrodipicolinate synthetase signatures
   -  dnaJ domains signatures
   -  D-amino acid oxidase signature
   -  Fork head domain signatures
   -  GHMP kinases putative ATP-binding domain
   -  Glycosyl hydrolases family 2 active site
   -  Glycosyl hydrolases family 32 active site
   -  Glycosyl hydrolases family 5 signature
   -  Glycosyl hydrolases family 6 signatures
   -  GMC oxidoreductases signatures
   -  Gram-negative pili assembly chaperone signature
   -  G-protein coupled receptors family 2 signatures
   -  Histidinol dehydrogenase active site
   -  Indole-3-glycerol phosphate synthase signature
   -  Inositol monophosphatase family signatures
   -  Leucine aminopeptidase signature
   -  Methionine aminopeptidase signature
   -  Neurotransmitters transporters signature
   -  NifH/frxC family signature
   -  Osteonectin domain signatures
   -  PTN/MK heparin-binding protein family signatures
   -  RecF protein signatures
   -  Regulator of chromosome condensation signatures
   -  Respiratory-chain NADH dehydrogenase subunit 1 signatures
   -  Respiratory-chain NADH dehydrogenase 51 Kd subunit signatures
   -  Respiratory-chain NADH dehydrogenase 75 Kd subunit signatures
   -  Ribosomal protein L30 signature
   -  Ribosomal protein L9 signature
   -  Ribosomal protein S13 signature
   -  Ribosomal protein S19e signature
   -  Ribosomal protein S4 signature
   -  Serine proteases, V8 family, active sites
   -  Sigma-54 interaction domain signatures
   -  Thymidine phosphorylase signature
   -  Tissue factor signature
   -  TNFR/NGFR family cysteine-rich region signature
   -  Transcriptional antiterminators bglG family signature
   -  Vinculin family signatures
   -  ZP domain signature

                            4. WE NEED YOUR HELP!

   We welcome feedback from our users. We would especially appreciate it if
   you would notify us when sequences belonging to your field of expertise
   are missing from the data bank. We would also like to be notified about
   annotations that need updating, for example if the function of a protein
   has been clarified or if new post-translational information has become
   available.


                         APPENDIX A: SOME STATISTICS



   A.1  Amino acid composition


        A.1.1  Composition in percent for the complete data bank

   Ala (A) 7.64   Gln (Q) 4.07   Leu (L) 9.14   Ser (S) 7.10
   Arg (R) 5.24   Glu (E) 6.26   Lys (K) 5.82   Thr (T) 5.84
   Asn (N) 4.47   Gly (G) 7.09   Met (M) 2.33   Trp (W) 1.30
   Asp (D) 5.25   His (H) 2.27   Phe (F) 3.97   Tyr (Y) 3.21
   Cys (C) 1.81   Ile (I) 5.49   Pro (P) 5.06   Val (V) 6.49

   Asx (B) 0.01   Glx (Z) 0.01   Xaa (X) 0.03


        A.1.2  Classification of the amino acids by their frequency

   Leu, Ala, Ser, Gly, Val, Glu, Thr, Lys, Ile, Asp, Arg, Pro, Asn, Gln,
   Phe, Tyr, Met, His, Cys, Trp
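
   The ordering in A.1.2 follows directly from the percentages in A.1.1.
   As a small sketch, the Python lines below sort the twenty standard amino
   acids by decreasing percentage. The ambiguity codes Asx, Glx and Xaa are
   left out, as they are in the list above.

```python
# Percentages copied from table A.1.1 (complete data bank).
composition = {
    "Ala": 7.64, "Gln": 4.07, "Leu": 9.14, "Ser": 7.10,
    "Arg": 5.24, "Glu": 6.26, "Lys": 5.82, "Thr": 5.84,
    "Asn": 4.47, "Gly": 7.09, "Met": 2.33, "Trp": 1.30,
    "Asp": 5.25, "His": 2.27, "Phe": 3.97, "Tyr": 3.21,
    "Cys": 1.81, "Ile": 5.49, "Pro": 5.06, "Val": 6.49,
}

# Sort residue names by decreasing frequency.
ranked = sorted(composition, key=composition.get, reverse=True)
print(", ".join(ranked))
```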




   A.2  Repartition of the sequences by their organism of origin

   Total number of species represented in this release of SWISS-PROT: 3325


        A.2.1 Table of the frequency of occurrence of species

        Species represented 1x: 1465
                            2x:  599
                            3x:  323
                            4x:  197
                            5x:  143
                            6x:  111
                            7x:   74
                            8x:   51
                            9x:   65
                           10x:   35
                       11- 20x:  140
                       21- 50x:   68
                       51-100x:   24
                         >100x:   30
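
   The classes in this table cover every species exactly once, so their
   counts should add up to the total of 3325 species stated above. The
   Python sketch below performs that check and computes the share of
   species represented by a single entry. The variable names are chosen
   for this example only.

```python
# Counts copied from table A.2.1 (number of species per entry-count class).
species_frequency = {
    "1x": 1465, "2x": 599, "3x": 323, "4x": 197, "5x": 143,
    "6x": 111, "7x": 74, "8x": 51, "9x": 65, "10x": 35,
    "11-20x": 140, "21-50x": 68, "51-100x": 24, ">100x": 30,
}
total_species = 3325  # total stated at the start of section A.2

# Every species falls into exactly one class.
assert sum(species_frequency.values()) == total_species

single_share = 100 * species_frequency["1x"] / total_species
print(f"species represented once: {single_share:.1f}% of all species")
```

   Under these figures, somewhat less than half of the species are
   represented by a single sequence.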


        A.2.2  Table of the most represented species


    Number   Frequency          Species
         1        1943          Human
         2        1785          Escherichia coli
         3        1152          Mouse
         4        1099          Rat
         5        1004          Baker's yeast (Saccharomyces cerevisiae)
         6         532          Bovine
         7         474          Fruit fly (Drosophila melanogaster)
         8         406          Chicken
         9         377          Bacillus subtilis
        10         290          African clawed frog (Xenopus laevis)
        11         282          Rabbit
        12         258          Pig
        13         251          Vaccinia virus (strain Copenhagen)
        14         245          Salmonella typhimurium
        15         193          Human cytomegalovirus (strain AD169)
        16         183          Maize
        17         168          Bacteriophage T4
        18         153          Vaccinia virus (strain WR)
        19         142          Rice
        20         129          Tobacco
        21         128          Wheat
        22         121          Pea
        23         113          Staphylococcus aureus
        24         112          Pseudomonas aeruginosa
        25         111          Barley
        26         109          Slime mold (Dictyostelium discoideum)
        27         106          Arabidopsis thaliana (Mouse-ear cress)
        28         104          Fission yeast (Schizosaccharomyces pombe)
        29         102          Sheep
                   102          Spinach
        31         100          Pseudomonas putida
                   100          Soybean
        33          97          Dog
        34          96          Caenorhabditis elegans
        35          95          Neurospora crassa
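
   In the table above, species with the same frequency share a rank (the
   rank is left blank on the second row) and the next rank is skipped, for
   example 29, 29, 31. The Python sketch below reproduces that numbering
   for a few rows of the table. The function name competition_ranks and
   its start parameter are illustrative assumptions, not part of any
   SWISS-PROT tool.

```python
from typing import List, Tuple


def competition_ranks(
    rows: List[Tuple[str, int]], start: int = 1
) -> List[Tuple[int, str, int]]:
    """Rank (species, frequency) rows so that ties share a rank."""
    # sorted() is stable, so tied rows keep their input order.
    ordered = sorted(rows, key=lambda row: row[1], reverse=True)
    ranked: List[Tuple[int, str, int]] = []
    for offset, (species, frequency) in enumerate(ordered):
        if ranked and ranked[-1][2] == frequency:
            rank = ranked[-1][0]  # tie: reuse the previous rank
        else:
            rank = start + offset  # otherwise the rank is the position
        ranked.append((rank, species, frequency))
    return ranked


# Rows 28 to 33 of table A.2.2.
rows = [
    ("Fission yeast", 104),
    ("Sheep", 102),
    ("Spinach", 102),
    ("Pseudomonas putida", 100),
    ("Soybean", 100),
    ("Dog", 97),
]
ranked = competition_ranks(rows, start=28)
for rank, species, frequency in ranked:
    print(f"{rank:>3} {frequency:>5}  {species}")
```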


   A.3  Repartition of the sequences by size



               From   To  Number             From   To   Number
                  1-  50    1597             1001-1100      239
                 51- 100    2698             1101-1200      140
                101- 150    3791             1201-1300      117
                151- 200    2402             1301-1400       77
                201- 250    2024             1401-1500       62
                251- 300    1838             1501-1600       33
                301- 350    1683             1601-1700       29
                351- 400    1640             1701-1800       29
                401- 450    1257             1801-1900       34
                451- 500    1394             1901-2000       27
                501- 550     980             2001-2100       10
                551- 600     700             2101-2200       28
                601- 650     488             2201-2300       31
                651- 700     338             2301-2400       11
                701- 750     329             2401-2500       12
                751- 800     275             >2500           61
                801- 850     204
                851- 900     209
                901- 950     127
                951-1000     130
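
   The size classes in this table are 50 residues wide up to 1000 residues,
   100 residues wide from 1001 to 2500, and open-ended beyond that. As a
   minimal sketch, the Python function below maps a sequence length to the
   label of its class. The name size_bin is hypothetical and the sample
   lengths are arbitrary, apart from 5037 and 3759, which are taken from
   the list of largest sequences.

```python
from collections import Counter


def size_bin(length: int) -> str:
    """Return the label of the size class that contains `length`."""
    if length < 1:
        raise ValueError("sequence length must be at least 1")
    if length > 2500:
        return ">2500"
    # Classes are 50 wide up to 1000 residues and 100 wide afterwards.
    width = 50 if length <= 1000 else 100
    low = (length - 1) // width * width + 1
    return f"{low}-{low + width - 1}"


sample_lengths = [50, 51, 120, 1000, 1001, 2500, 3759, 5037]
histogram = Counter(size_bin(n) for n in sample_lengths)
for label, count in histogram.items():
    print(f"{label:>10} {count:>4}")
```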



   Currently the ten largest sequences are:


                            RYNR_RABIT  5037 a.a.
                            RYNR_HUMAN  5032 a.a.
                            APB_HUMAN   4563 a.a.
                            APOA_HUMAN  4548 a.a.
                            DYHC_TRIGR  4466 a.a.
                            POLG_BVDV   3988 a.a.
                            POLG_HCVA   3898 a.a.
                            POLG_HCVB   3898 a.a.
                            ACVT_PENCH  3791 a.a.
                            TRX_DROME   3759 a.a.


                         APPENDIX B: ON-LINE EXPERTS


   B.1  List of on-line experts for PROSITE and SWISS-PROT

Field of expertise            Name               Email address
---------------------------   ------------------ ----------------------------
African swine fever virus     Yanez R.J.         ryanez@cbm2.uam.es
Alcohol dehydrogenases        Joernvall H.       hans.jornvall@k1m.ki.se
                              Persson B.         bengt@medfys.ki.se
Aldehyde dehydrogenases       Joernvall H.       hans.jornvall@k1m.ki.se
                              Persson B.         bengt@medfys.ki.se
Alpha-crystallins/HSP-20      Leunissen J.A.M.   jackl@caos.caos.kun.nl
                              de Jong W.         u629000@hnykun11.bitnet
Alpha-2-macroglobulins        Van Leuven F.      fred@blekul13.bitnet
AA-tRNA synthetases class II  Leberman R.        leberman@frembl51.bitnet
Apolipoproteins               Boguski M.S.       boguski@ncbi.nlm.nih.gov
Arrestins                     Kolakowski L.F.Jr. kolakowski@helix.mgh.harvard.edu
Band 4.1 family proteins      Rees J.            jrees@vax.oxford.ac.uk
Beta-lactamases               Brannigan J.       jab5@vaxa.york.ac.uk
Beta-transducin family        Boguski M.S.       boguski@ncbi.nlm.nih.gov
Chalcone/stilbene synthases   Schroeder J.       raf@sun1.ruf.uni-freiburg.de
Chitinases                    Henrissat B.       cermav@frgren81.bitnet
Clusterin                     Peitsch M.C.       peitsch@ulbio1.unil.ch
CTF/NF-I                      Mermod N.          nmermod@ulys.unil.ch
Cytochromes P450              Holsztynska E.J.   ela@netcom.uucp
                                                 netcom!ela@apple.com
DEAD-box helicases            Linder P.          linder@urz.unibas.ch
EF-hand calcium-binding       Cox J.A.           cox@sc2a.unige.ch
                              Kretsinger R.H.    rhk5i@virginia.bitnet
Enoyl-CoA hydratase           Hofmann K.O.       khofmann@cipvax.biolan.uni-koeln.de
fruR/lacI family HTH proteins Reizer J.          jreizer@ucsd.edu
GATA-type zinc-fingers        Boguski M.S.       boguski@ncbi.nlm.nih.gov
Glucanases                    Henrissat B.       cermav@frgren81.bitnet
                              Beguin P.          phycel@pasteur.bitnet
G-protein coupled receptors   Chollet A.         chollet@clients.switch.ch
                              Attwood T.K.       bph6tka@biovax.leeds.ac.uk
GTPase-activating proteins    Boguski M.S.       boguski@ncbi.nlm.nih.gov
HMG1/2 and HMG-14/17          Landsman D.        landsman@ncbi.nlm.nih.gov
Inorganic pyrophosphatases    Kolakowski L.F.Jr. kolakowski@helix.mgh.harvard.edu
Integrases                    Roy P.H.           2020000@lavalvx1.bitnet
Lipocalins                    Boguski M.S.       boguski@ncbi.nlm.nih.gov
                              Peitsch M.C.       peitsch@ulbio1.unil.ch
lysR family HTH proteins      Henikoff S.        henikoff@sparky.fhcrc.org
MAC components / perforin     Peitsch M.C.       peitsch@ulbio1.unil.ch
Myelin proteolipid protein    Hofmann K.O.       khofmann@cipvax.biolan.uni-koeln.de
PEP requiring enzymes         Reizer J.          jreizer@ucsd.edu
pfkB carbohydrate kinases     Reizer J.          jreizer@ucsd.edu
Phytochromes                  Partis M.D.        partis@gcri.afrc.ac.uk
Protein kinases               Hanks S.           hanks@vuctrvax.bitnet
                              Hunter T.          hunter@salk.bitnet
PTS proteins                  Reizer J.          jreizer@ucsd.edu
Restriction-modification      Bickle T.          bickle@urz.unibas.ch
            enzymes           Roberts R.J.       roberts@cshl.org
Ribosomal protein S3          Hallick R.         hallick%biotec@arizona.edu
Ribosomal protein S15         Ellis S.R.         srelli01@ulkyvm.bitnet
Ring-cleavage dioxygenases    Harayama S.        harayama@cmu.unige.ch
Sodium symporters             Reizer J.          jreizer@ucsd.edu
Subtilases                    Brannigan J.       jab5@vaxa.york.ac.uk
Thiol proteases               Turk B.            turk@ijs.ac.mail.yu
Thiol proteases inhibitors    Turk B.            turk@ijs.ac.mail.yu
TPR repeats                   Boguski M.S.       boguski@ncbi.nlm.nih.gov
Transit peptides              von Heijne G.      gunnar@cbts.sunet.se
Type-II membrane antigens     Levy S.            levy@cellbio.stanford.edu
Uracil-DNA glycosylase        Aasland R.         aasland@bio.uib.no
Xylose isomerase              Jenkins J.         jenkins@frira.afrc.ac.uk
WAP-type domain               Claverie J.-M.     jmc@ncbi.nlm.nih.gov
ZP domain                     Bork P.            bork@embl-heidelberg.de


Bacteriophage P4              Halling C.         chh9@midway.uchicago.edu
Drosophila                    Ashburner M.       ma11@phx.cam.ac.uk
Escherichia coli              Rudd K.            rudd@ncbi.nlm.nih.gov
Salmonella typhimurium        Rudd K.            rudd@ncbi.nlm.nih.gov
Snakes                        Stocklin R.        stocklin@cmu.unige.ch
Yeast chromosome I            Ouellette F.       francis@monod.biol.mcgill.ca



   B.2  Requirements to fulfill to become an on-line expert

   An expert should be a scientist working with specific family(ies) of
   proteins (or specific domains) who would:

   a) Review the protein sequences in SWISS-PROT and the patterns/matrices
      in PROSITE relevant to their field of research.
   b) Agree to be contacted by people who have obtained new sequence(s)
      which seem to belong to "their" family(ies) of proteins.
   c) Have access to electronic mail and be willing to use it to send and
      receive data.

   If you are willing to be part of this scheme, please contact Amos Bairoch
   at one of the following electronic mail addresses:

                             bairoch@cmu.unige.ch
                           bairoch@cgecmu51.bitnet


           APPENDIX C: RELATIONSHIPS BETWEEN BIOMOLECULAR DATABASES

   The current  status of the relationships (cross-references) between some
   biomolecular databases is shown in the following schematic:

   Last updated: March, 1992.

                                                       **********************
                        *********************** <----- * EPD [Euk. Promot.] *
                        *  EMBL Nucleotide    * -----> **********************
                        *  Sequence Data      *
***************** ----> *  Library            *        **********************
* FLYBASE       * <---- *********************** <----- * ECD [E. coli map]  *
* [Drosophila   *                ^  |       ^          **********************
* genetic maps] * --------+      |  |       |
***************** <-----+ |      |  |       +--------- **********************
                        | |      |  |       +--------- * TFD [Trans. fact.] *
                        | |      |  |       | +------> **********************
                        | |      |  |       | |
*****************       | v      |  v       v |        **********************
* REBASE        *       ***********************        * ENZYME [Nomencl.]  *
* [Restriction  * <---- *  SWISS-PROT         * <----- **********************
*  enzymes]     *       *  Protein Sequence   *            |
*****************       *  Data Bank          *            v
                        ***********************        **********************
*****************         | ^  |  ^ |  ^ |  |          * OMIM   [Diseases]  *
* PROSITE       * <-------+ |  |  | |  | |  +--------> **********************
* [Patterns]    * ----------+  |  | |  | |
*****************              |  | |  | +-----------> **********************
             |                 |  | |  +-------------- * E. coli 2D gels    *
             |                 |  | |                  **********************
             |                 |  | |
             |                 |  | +----------------> **********************
             |                 |  +------------------- * EcoGene/EcoSeq     *
             |                 v                       **********************
             |          ***********************
             +--------> * PDB [3D structures] *
                        ***********************


Swiss-Prot release 21.0

Published March 1, 1992



                    SWISS-PROT RELEASE 21.0 RELEASE NOTES


                               1. INTRODUCTION

   1.1  Evolution

   Release 21.0  of SWISS-PROT  contains 23742 sequence entries, comprising
   7'866'596 amino  acids abstracted from 23919 references. This represents
   an increase of 5% over release 20. The recent growth of the data bank is
   summarized below.

   Release    Date   Number of entries     Nb of amino acids

   3.0        11/86               4160               969 641
   4.0        04/87               4387             1 036 010
   5.0        09/87               5205             1 327 683
   6.0        01/88               6102             1 653 982
   7.0        04/88               6821             1 885 771
   8.0        08/88               7724             2 224 465
   9.0        11/88               8702             2 498 140
   10.0       03/89              10008             2 952 613
   11.0       07/89              10856             3 265 966
   12.0       10/89              12305             3 797 482
   13.0       01/90              13837             4 347 336
   14.0       04/90              15409             4 914 264
   15.0       08/90              16941             5 486 399
   16.0       11/90              18364             5 986 949
   17.0       02/91              20024             6 524 504
   18.0       05/91              20772             6 792 034
   19.0       08/91              21795             7 173 785
   20.0       11/91              22654             7 500 130
   21.0       03/92              23742             7 866 596



   1.2  Source of data

   Release 21.0  has been  updated using protein sequence data from release
   31.0 of  the PIR (Protein Identification Resource) protein data bank, as
   well as translation of nucleotide sequence data from release 29.0 of the
   EMBL Nucleotide Sequence Database.

   As an  indication of  the source  of the sequence data in the SWISS-PROT
   data bank we list here the statistics concerning the DR (Database cross-
   references) pointer lines:

   Entries with pointer(s) to only PIR entries:             4198
   Entries with pointer(s) to only EMBL entries:            3031
   Entries with pointer(s) to both EMBL and PIR entries:   16003
   Entries with no pointer lines:                            510


      2. DESCRIPTION OF THE CHANGES MADE TO SWISS-PROT SINCE RELEASE 20


   2.1  Sequences and annotations

   About 1100 sequences have been added since release 20, the sequence data
   of 150  existing entries  has been  updated and  the annotations of 2860
   entries have  been revised.  In particular we have used review  articles
   to update  the annotations  of  the  following  groups  or  families  of
   proteins:

   -  Acid phosphatases
   -  Acylphosphatases
   -  Bacterial regulatory proteins, luxR family
   -  Cyclins
   -  Cytochromes P450
   -  C-type lectin domain proteins
   -  Histidinol dehydrogenases
   -  Indole-3-glycerol phosphate synthases
   -  Microviridae sequences
   -  Myoglobins
   -  Osteonectin domain proteins
   -  Snake venom phospholipases A2
   -  RecF proteins
   -  Scorpion venom beta-toxins
   -  PTN/MK heparin-binding protein family
   -  Tissue factor


   2.2  Change in the format of the entry names

   The dollar  sign `$'  in entry names has been replaced by the underscore
   character `_'.  This change  is made  on  behalf of  users of  sequence
   analysis software  running under  the Unix  operating system,  where the
   dollar sign  is a  reserved symbol.  Example: the entry name `CYC$HUMAN'
   has been changed to `CYC_HUMAN'.
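
   For programs that still hold names in the release 20 form, the change
   amounts to a single character substitution. A minimal Python sketch
   (the function name is ours and is not part of any SWISS-PROT software):

```python
def convert_entry_name(old_name: str) -> str:
    """Rewrite a pre-release-21 entry name such as CYC$HUMAN as CYC_HUMAN."""
    # Names already in the new form contain no dollar sign and are unchanged.
    return old_name.replace("$", "_")
```

   Applied to the example above, `CYC$HUMAN' becomes `CYC_HUMAN'.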


   2.3  New line type GN

   The GN  (Gene Name)  line is  a new  line that  is used  to indicate the
   name(s) of  the gene(s)  that encode  the protein  being described.

   Previously this  information used to be found in the DE line as shown in
   the following example.

   In previous releases:

        DE   SERUM ALBUMIN PRECURSOR (GENE NAME: ALB).

   In the current release:

        DE   SERUM ALBUMIN PRECURSOR.
        GN   ALB.

   The format of the GN line is:

        GN   NAME1[ AND|OR NAME2...].

   Examples:

        GN   ALB.
        GN   REX-1.

   It often  occurs that  more than  one gene  name has been assigned to an
   individual locus.  In that  case all  the synonyms  are listed. The word
   `OR' separates the different designations. The first name in the list is
   assumed to be the most correct (or most current) designation. Example:

        GN   HNS OR DRDX OR OSMZ OR BGLY.

   In a few cases,  multiple genes  encode an  identical protein  sequence.
   In that  case all  the different  gene names  are listed. The word `AND'
   separates the designations. Example:

        GN   CECA1 AND CECA2.

   In very  rare cases  (only one  occurrence has been found in the current
   release) `AND'  and `OR' can both be present. In that case  parentheses
   are used as shown in the following example:

        GN   GVPA AND (GVPB OR GVPA2).
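
   The GN format described above can be read with a small recursive
   parser. The following Python sketch is only an illustration: the
   function name and the nested-tuple result are our own choices, and it
   assumes that no gene is literally named `AND' or `OR'.

```python
import re

def parse_gn(line: str):
    """Parse a GN line into a gene name or an (operator, operands) tuple."""
    if not line.startswith("GN   ") or not line.rstrip().endswith("."):
        raise ValueError("not a GN line: %r" % line)
    body = line.rstrip()[5:-1]          # drop the line code and final period
    tokens = re.findall(r"\(|\)|[^\s()]+", body)
    pos = 0

    def expr():
        # NAME [AND|OR NAME ...]; one operator per parenthesis level.
        nonlocal pos
        operands = [term()]
        op = None
        while pos < len(tokens) and tokens[pos] in ("AND", "OR"):
            if op is None:
                op = tokens[pos]
            elif tokens[pos] != op:
                raise ValueError("mixed AND/OR requires parentheses")
            pos += 1
            operands.append(term())
        return operands[0] if op is None else (op, operands)

    def term():
        # A single gene name or a parenthesised sub-expression.
        nonlocal pos
        if pos >= len(tokens):
            raise ValueError("unexpected end of GN line")
        tok = tokens[pos]
        pos += 1
        if tok == "(":
            inner = expr()
            if pos >= len(tokens) or tokens[pos] != ")":
                raise ValueError("missing closing parenthesis")
            pos += 1
            return inner
        if tok in (")", "AND", "OR"):
            raise ValueError("unexpected token %r" % tok)
        return tok

    tree = expr()
    if pos != len(tokens):
        raise ValueError("trailing tokens in GN line")
    return tree
```

   With this sketch, the synonym example yields an `OR' node holding four
   names, and the last example yields an `AND' node whose second operand
   is itself an `OR' node.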



   2.4  New line type RM

   The RM  (Reference Medline)  line is used to indicate the Medline Unique
   Identifier (UID)  of a reference. Previously this information was listed
   in the  RC line  using the  `MEDLINE' token  as shown  in the  following
   example.

   In previous releases:

        RC   MEDLINE=90205618;

   In the current release:

        RM   90205618

   The format of the RM line is:

        RM   nnnnnnnn

   where `nnnnnnnn' is the eight digit Medline Unique Identifier (UID).
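
   Software that reads entries from older releases may want to rewrite
   the old form on the fly. The Python sketch below is a hypothetical
   helper; it assumes the RC line carries nothing but the `MEDLINE' token,
   as in the example, and leaves every other line untouched.

```python
import re

# Old-style line: "RC   MEDLINE=nnnnnnnn;" with an eight digit UID.
_OLD_MEDLINE = re.compile(r"^RC   MEDLINE=(\d{8});\s*$")

def rc_to_rm(line: str) -> str:
    """Return the RM form of an old RC MEDLINE line, or the line unchanged."""
    match = _OLD_MEDLINE.match(line)
    if match is None:
        return line
    return "RM   " + match.group(1)
```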



   2.5  Secondary structure information

   Thanks to  the help  of Chris  Sander  and  Reinhard  Schneider  of  the
   Biocomputing group  at EMBL  we have  added  to  the  feature  table  of
   sequence  entries   of  proteins   whose  tertiary  structure  is  known
   experimentally, the  secondary structure  information  corresponding  to
   that protein.  The secondary  structure assignment  is made according to
   DSSP (see Kabsch W., Sander C.; Biopolymers, 22:2577-2637(1983)) and the
   information is  extracted from  the coordinate  data sets of the Protein
   Data Bank (PDB).

   In the  feature table  only  three  types  of  secondary  structure  are
   specified: helices  (HELIX), beta-strands  (STRAND) and  turns  (TURN).
   Residues not  specified in  one of  these classes  are in  a  `loop'  or
   `random-coil' structure.  Because the  DSSP assignment has more than the
   three  common   secondary  structure  classes,  we  have  converted  the
   following DSSP assignments to HELIX, STRAND and TURN:

   DSSP   DSSP definition                                 SWISS-PROT
   code                                                   assignment
   ----   ---------------------------------------------   --------------
   H      Alpha-helix                                     HELIX
   G      3(10) helix                                     HELIX
   I      Pi-helix                                        HELIX
   E      Hydrogen bonded beta-strand (extended strand)   STRAND
   B      Residue in an isolated beta-bridge              STRAND
   T      H-bonded turn (3-turn, 4-turn or 5-turn)        TURN
   S      Bend (five-residue bend centered at residue i)  Not specified
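
   The conversion table can be expressed as a simple lookup. The Python
   sketch below is an illustration only, not the program used to build
   the release: it takes a per-residue string of DSSP codes (a hypothetical
   input format) and returns feature segments with 1-based positions. It
   assumes that adjacent residues mapping to the same key, for example an
   alpha-helix followed directly by a 3(10) helix, form a single segment.

```python
from itertools import groupby

# DSSP code -> SWISS-PROT feature key, as in the table above.
# 'S' (bend) and any other character are left unspecified.
DSSP_TO_SWISSPROT = {
    "H": "HELIX", "G": "HELIX", "I": "HELIX",
    "E": "STRAND", "B": "STRAND",
    "T": "TURN",
}

def dssp_to_features(dssp: str):
    """Return a list of (key, start, end) tuples, positions 1-based inclusive."""
    features = []
    pos = 1
    for key, run in groupby(dssp, key=DSSP_TO_SWISSPROT.get):
        length = len(list(run))
        if key is not None:
            features.append((key, pos, pos + length - 1))
        pos += length
    return features
```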


   One should be aware of the following facts:

   a) Segment Length. For helices (alpha and 3-10), the residue just before
      and just after the helix as given by DSSP participates in the helical
      hydrogen bonding  pattern with  a single  H-bond. For  some practical
      purposes, one  can therefore extend the HELIX range by one residue on
      each side. E.g. HELIX 25-35 instead of HELIX 26-34. Also, the ends of
      secondary  structure   segments  are  less  well  defined  for  lower
      resolution structures. A fluctuation of +/- one residue is common.

   b) Missing segments.  In low resolution structures, badly formed helices
      or strands may be omitted in the DSSP definition.

   c) Special helices  and  strands.  Helices  of  length  three  are  3-10
      helices, those  of length four and longer are either alpha-helices or
      3-10 helices  (pi helices are extremely rare). A strand of length one
      corresponds to a residue in an isolated beta-bridge. Such bridges can
      be structurally important.

   d) Missing secondary  structure. No  secondary  structure  is  currently
      given in the feature table in the following cases:

      - No sequence data in the PDB entry.
      - Structure for which only C-alpha coordinates are in PDB.
      - NMR structure with more than one coordinate data set.
      - Model (i.e. theoretical) structure.



   2.6  Feature key name change

   The secondary  structure description feature key `BETA' has been renamed
   `STRAND' (see the section above for its current definition).



   2.7  Alu-derived warning entries

   Following the  advice and  in collaboration with Jean-Michel Claverie of
   the National  Center for  Biotechnology  Information  (NCBI,  Washington
   D.C.) we  have added  to SWISS-PROT Alu-derived "warning" entries. These
   entries are  provided in  order to  avoid  the  further  'pollution'  of
   protein sequence databases with Alu-derived amino acid sequences.

   Alu repetitive  sequences are  interspersed in human and primate genomes
   with an  average spacing  of 3 kb. Some of them are actively transcribed
   by pol  III. Normal  transcripts may contain Alu-derived sequences in 5'
   or 3' untranslated regions. However, cDNA libraries also contain partial
   and/or  rearranged  cDNAs  ligated  with  Alu-derived  sequence  in  any
   orientation. This  has been  overlooked on  several occasions,  with the
   consequence  of   erroneous  Alu-derived   amino  acid  sequences  being
   reported.

   Various analyses  indicate that  Alu repeats fall into six classes (A to
   F). Therefore  six "warning"  entries have  been constituted  from  the
   conceptual translations,  in all  six frames,  of one  random member  of
   each of  these classes  of Alu  repeats. Any significant similarity of a
   putative protein sequence with an Alu-translated entry must be taken  as
   a warning  that part  of an  Alu repeat  may  have  been  artifactually
   included in the coding nucleotide sequence.
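
   For readers unfamiliar with six-frame conceptual translation, the
   Python sketch below shows the general idea using the standard genetic
   code. It is not the procedure used to build the warning entries, and
   the function names are our own. Unrecognized codons are rendered as
   `X' and stop codons as `*'.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, codons enumerated in TCAG order (TTT, TTC, TTA, ...).
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def translate(dna: str) -> str:
    """Translate complete codons from the start of dna; ignore a partial tail."""
    codons = (dna[i:i + 3] for i in range(0, len(dna) - 2, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)

def six_frame_translations(dna: str):
    """Return three forward-strand frames, then three reverse-strand frames."""
    dna = dna.upper()
    reverse = dna.translate(COMPLEMENT)[::-1]
    return [translate(strand[offset:])
            for strand in (dna, reverse)
            for offset in range(3)]
```

   A putative protein can then be compared against each of the six
   translations with any sequence similarity tool.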

   These sequences have been assigned accession numbers P23959 (ALUA_HUMAN)
   to P23964 (ALUF_HUMAN).



   2.8  Feature lines `spring cleaning'

   We are  in the  process of  `cleaning' up  the comments  part of feature
   lines to homogenize the description of specific domains and sites.

   For example  regions enriched in one or more types of amino acid are now
   described using the general format:

        FT   DOMAIN      xxx    xxx       AA1[/AA2/.../AAN]-RICH.

   Where AA1,  AA2, ...  AAN are  valid amino-acid  three letter codes (the
   twenty standard  codes with  the addition  of `GLA' for gamma-carboxy-
   glutamic acid).

   Examples:

        FT   DOMAIN       12     45       PRO-RICH.
        FT   DOMAIN      123    456       ASP/GLU-RICH (ACIDIC).
        FT   DOMAIN      246    678       SER/THR-RICH (LINKER REGION).
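
   The homogenized format lends itself to automatic parsing. The Python
   sketch below is a hypothetical reader for such lines; the regular
   expression and the function name are our own, and the optional
   parenthesized comment is inferred from the examples above.

```python
import re

# The twenty standard three letter codes plus GLA.
VALID_CODES = {
    "ALA", "ARG", "ASN", "ASP", "CYS", "GLN", "GLU", "GLY", "HIS", "ILE",
    "LEU", "LYS", "MET", "PHE", "PRO", "SER", "THR", "TRP", "TYR", "VAL",
    "GLA",
}

RICH_RE = re.compile(
    r"^FT   DOMAIN\s+(\d+)\s+(\d+)\s+"
    r"([A-Z]{3}(?:/[A-Z]{3})*)-RICH(?: \((.+)\))?\.$"
)

def parse_rich_domain(line: str):
    """Return (start, end, residues, comment), or None for other lines."""
    match = RICH_RE.match(line.rstrip())
    if match is None:
        return None
    residues = match.group(3).split("/")
    unknown = [code for code in residues if code not in VALID_CODES]
    if unknown:
        raise ValueError("invalid amino acid code(s): %s" % ", ".join(unknown))
    return int(match.group(1)), int(match.group(2)), residues, match.group(4)
```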

   Many other  changes of  this nature  have either  been completed in this
   release or are in the process of being carried out.

   Also  note   that  `non-experimental'  derived  features  are  now  only
   indicated by the qualifiers `PROBABLE', `POTENTIAL', or `BY SIMILARITY';
   the use  of qualifiers such as `PUTATIVE', `POSSIBLE', `TENTATIVE', etc.
   has been discontinued.

   This cleaning process will continue in the next two or three releases.


                            3. FORTHCOMING CHANGES

   The following changes will be implemented starting with release 22.

   3.1  A new feature table key: UNSURE

   The UNSURE  key will  be used  to describe  region(s) of  a sequence for
   which the authors are unsure about the sequence assignment.

   3.2  Others

   Other changes  are planned,  but we  are already  past our  deadline  to
   prepare this release so here are some very brief notes!

   -  An ASN.1  version of  SWISS-PROT will  soon be  officially  available
      (thanks to  Mark Cavanaugh of the NCBI). Software developers who  are
      interested in  such a  version can already obtain a beta-test release
      of  SWISS-PROT  21  in  ASN.1  format  (for  details  contact  me  at
      bairoch@cmu.unige.ch).
   -  We are thinking of some new topics for the CC lines.
   -  New developments  concerning the integration of SWISS-PROT with other
      data banks are in the 'pipeline'.


                            4. ENZYME AND PROSITE

   4.1  The ENZYME data bank

   Release 8.0 of the ENZYME data bank is distributed along with release 21
   of SWISS-PROT.  ENZYME release 8.0 contains information on 3073 enzymes.
   The data  bank is  complete and up to date. Until new enzyme nomenclature
   data is published we only plan to update the SWISS-PROT pointers at each
   release of  the protein  sequence data  bank, correct  any  errors,  and
   complete the  information concerning  synonyms and  cofactors using  the
   literature.

   4.2  The PROSITE data bank

   Release 8.10  of the PROSITE data bank is distributed along with release
   21 of  SWISS-PROT. Release 8.10 contains 530 documentation chapters that
   describe 605 different patterns. Release 8.10 does not really  represent
   a new  release; the  only changes  between  release  8.0  and  8.10  are
   updates to the pointers to the SWISS-PROT entries whose names have  been
   modified between  release 20  and 21.  The next release of PROSITE (9.0)
   will be distributed with release 22 of SWISS-PROT.


                            5. WE NEED YOUR HELP !

   We welcome  feedback from our users. We would especially appreciate  it
   if you  notified us  of sequences  belonging to your field of expertise
   that are  missing from the data bank. We would also like to be notified
   about annotations  to be  updated, for  example if  the function  of  a
   protein has been clarified or if new post-translational information has
   become available.


                         APPENDIX A: SOME STATISTICS



   A.1  Amino acid composition

        A.1.1  Composition in percent for the complete data bank

   Ala (A) 7.65   Gln (Q) 4.07   Leu (L) 9.15   Ser (S) 7.08
   Arg (R) 5.23   Glu (E) 6.26   Lys (K) 5.83   Thr (T) 5.84
   Asn (N) 4.45   Gly (G) 7.10   Met (M) 2.33   Trp (W) 1.30
   Asp (D) 5.24   His (H) 2.27   Phe (F) 3.97   Tyr (Y) 3.22
   Cys (C) 1.81   Ile (I) 5.46   Pro (P) 5.08   Val (V) 6.49

   Asx (B) 0.01   Glx (Z) 0.01   Xaa (X) 0.03


        A.1.2  Classification of the amino acids by their frequency

   Leu, Ala, Gly, Ser, Val, Glu, Thr, Lys, Ile, Asp, Arg, Pro, Asn, Gln,
   Phe, Tyr, Met, His, Cys, Trp
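
   The ranking in A.1.2 follows directly from the percentages in A.1.1.
   The short Python sketch below reproduces it by sorting the twenty
   standard amino acids in decreasing order of frequency (the variable
   names are ours).

```python
# Percentages copied from table A.1.1 (Asx, Glx and Xaa are left out).
COMPOSITION = {
    "Ala": 7.65, "Arg": 5.23, "Asn": 4.45, "Asp": 5.24, "Cys": 1.81,
    "Gln": 4.07, "Glu": 6.26, "Gly": 7.10, "His": 2.27, "Ile": 5.46,
    "Leu": 9.15, "Lys": 5.83, "Met": 2.33, "Phe": 3.97, "Pro": 5.08,
    "Ser": 7.08, "Thr": 5.84, "Trp": 1.30, "Tyr": 3.22, "Val": 6.49,
}

# Most frequent residue first.
RANKING = sorted(COMPOSITION, key=COMPOSITION.get, reverse=True)
```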



   A.2  Repartition of the sequences by their organism of origin

   Total number of species represented in this release of SWISS-PROT: 3159

        A.2.1 Table of the frequency of occurrence of species

        Species represented 1x: 1373
                            2x:  572
                            3x:  319
                            4x:  191
                            5x:  135
                            6x:  101
                            7x:   72
                            8x:   49
                            9x:   64
                           10x:   33
                       11- 20x:  128
                       21-100x:   95
                         >100x:   27


        A.2.2  Table of the most represented species

    Number   Frequency          Species
         1        1891          Human
         2        1697          Escherichia coli
         3        1116          Mouse
         4        1057          Rat
         5         803          Baker's yeast (Saccharomyces cerevisiae)
         6         519          Bovine
         7         443          Fruit fly (Drosophila melanogaster)
         8         392          Chicken
         9         347          Bacillus subtilis
        10         279          African clawed frog (Xenopus laevis)
        11         271          Rabbit
        12         253          Pig
        13         251          Vaccinia virus (strain Copenhagen)
        14         218          Salmonella typhimurium
        15         193          Human cytomegalovirus (strain AD169)
        16         170          Maize
        17         167          Bacteriophage T4
        18         151          Vaccinia virus (strain WR)
        19         135          Rice
        20         125          Tobacco
        21         121          Wheat
        22         113          Pea
        23         112          Staphylococcus aureus
        24         104          Pseudomonas aeruginosa
        25         103          Slime mold (Dictyostelium discoideum)
                   103          Barley
        27         101          Sheep
        28         100          Fission yeast (Schizosaccharomyces pombe)
        29          95          Spinach
                    95          Dog
                    95          Caenorhabditis elegans
        32          94          Soybean
        33          92          Neurospora crassa
        34          90          Pseudomonas putida
        35          89          Agrobacterium tumefaciens



   A.3  Repartition of the sequences by size



               From   To  Number             From   To   Number
                  1-  50    1539             1001-1100      220
                 51- 100    2602             1101-1200      138
                101- 150    3672             1201-1300      115
                151- 200    2281             1301-1400       68
                201- 250    1932             1401-1500       57
                251- 300    1745             1501-1600       32
                301- 350    1566             1601-1700       26
                351- 400    1537             1701-1800       27
                401- 450    1177             1801-1900       31
                451- 500    1265             1901-2000       26
                501- 550     926             2001-2100       10
                551- 600     644             2101-2200       26
                601- 650     463             2201-2300       31
                651- 700     329             2301-2400       11
                701- 750     310             2401-2500       12
                751- 800     236             >2500           54
                801- 850     194
                851- 900     201
                901- 950     122
                951-1000     117


   Currently the ten largest sequences are:


                            RYNR_RABIT  5037 a.a.
                            RYNR_HUMAN  5032 a.a.
                            APB_HUMAN   4563 a.a.
                            APOA_HUMAN  4548 a.a.
                            DYHC_TRIGR  4466 a.a.
                            POLG_BVDV   3988 a.a.
                            POLG_HCVA   3898 a.a.
                            POLG_HCVB   3898 a.a.
                            TRX_DROME   3759 a.a.
                            ACVA_PENCH  3746 a.a.



                         APPENDIX B: ON-LINE EXPERTS


   B.1  List of on-line experts for PROSITE and SWISS-PROT

Field of expertise            Name               Email address
---------------------------   ------------------ ----------------------------
African swine fever virus     Yanez R.J.         ryanez@cbm2.uam.es
Alcohol dehydrogenases        Joernvall H.       hans.jornvall@k1m.ki.se
                              Persson B.         bengt@medfys.ki.se
Aldehyde dehydrogenases       Joernvall H.       hans.jornvall@k1m.ki.se
                              Persson B.         bengt@medfys.ki.se
Alpha-crystallins/HSP-20      Leunissen J.A.M.   jackl@caos.caos.kun.nl
                              de Jong W.         u629000@hnykun11.bitnet
Alpha-2-macroglobulins        Van Leuven F.      fred@blekul13.bitnet
Apolipoproteins               Boguski M.S.       boguski@ncbi.nlm.nih.gov
Arrestins                     Kolakowski L.F.Jr. kolakowski@helix.mgh.harvard.edu
Bacteriophage P4 proteins     Halling C.         chh9@midway.uchicago.edu
Beta-lactamases               Brannigan J.       jab5@vaxa.york.ac.uk
Chitinases                    Henrissat B.       cermav@frgren81.bitnet
Clusterin                     Peitsch M.C.       peitsch@ulbio1.unil.ch
CTF/NF-I                      Mermod N.          nmermod@ulys.unil.ch
Cytochromes P450              Holsztynska E.J.   ela@netcom.uucp
                                                 netcom!ela@apple.com
DEAD-box helicases            Linder P.          linder@urz.unibas.ch
EF-hand calcium-binding       Cox J.A.           cox@sc2a.unige.ch
                              Kretsinger R.H.    rhk5i@virginia.bitnet
Enoyl-CoA hydratase           Hofmann K.O.       khofmann@cipvax.biolan.uni-koeln.de
fruR/lacI family HTH proteins Reizer J.          jreizer@ucsd.edu
GATA-type zinc-fingers        Boguski M.S.       boguski@ncbi.nlm.nih.gov
Glucanases                    Henrissat B.       cermav@frgren81.bitnet
                              Beguin P.          phycel@pasteur.bitnet
G-protein coupled receptors   Chollet A.         chollet@clients.switch.ch
                              Attwood T.K.       bph6tka@biovax.leeds.ac.uk
GTPase-activating proteins    Boguski M.S.       boguski@ncbi.nlm.nih.gov
HMG1/2 and HMG-14/17          Landsman D.        landsman@ncbi.nlm.nih.gov
Inorganic pyrophosphatases    Kolakowski L.F.Jr. kolakowski@helix.mgh.harvard.edu
Integrases                    Roy P.H.           2020000@lavalvx1.bitnet
Lipocalins                    Boguski M.S.       boguski@ncbi.nlm.nih.gov
                              Peitsch M.C.       peitsch@ulbio1.unil.ch
MAC components / perforin     Peitsch M.C.       peitsch@ulbio1.unil.ch
Myelin proteolipid protein    Hofmann K.O.       khofmann@cipvax.biolan.uni-koeln.de
PEP requiring enzymes         Reizer J.          jreizer@ucsd.edu
Phytochromes                  Partis M.D.        partis@gcri.afrc.ac.uk
Prokaryotic carbohydrate      Reizer J.          jreizer@ucsd.edu
            kinases
Protein kinases               Hanks S.           hanks@vuctrvax.bitnet
                              Hunter T.          hunter@salk.bitnet
PTS proteins                  Reizer J.          jreizer@ucsd.edu
Restriction-modification      Bickle T.          bickle@urz.unibas.ch
            enzymes           Roberts R.J.       roberts@cshl.org
Ribosomal protein S3          Hallick R.         hallick%biotec@arizona.edu
Ribosomal protein S15         Ellis S.R.         srelli01@ulkyvm.bitnet
Ring-cleavage dioxygenases    Harayama S.        harayama@cmu.unige.ch
Sodium symporters             Reizer J.          jreizer@ucsd.edu
Subtilases                    Brannigan J.       jab5@vaxa.york.ac.uk
Thiol proteases               Turk B.            turk@ijs.ac.mail.yu
Thiol proteases inhibitors    Turk B.            turk@ijs.ac.mail.yu
TPR repeats                   Boguski M.S.       boguski@ncbi.nlm.nih.gov
Transit peptides              von Heijne G.      gunnar@cbts.sunet.se
Type-II membrane antigens     Levy S.            levy@cellbio.stanford.edu
Uracil-DNA glycosylase        Aasland R.         aasland@bio.uib.no
Xylose isomerase              Jenkins J.         jenkins@frira.afrc.ac.uk
WAP-type domain               Claverie J.-M.     jmc@ncbi.nlm.nih.gov


   B.2  Requirements to fulfill to become an on-line expert

   An expert  should be  a scientist  working with  one or  more  specific
   families of proteins (or specific domains) who would:

   a) Review the  protein sequences in SWISS-PROT and the patterns/matrices
      in PROSITE relevant to their field of research.
   b) Agree to  be contacted  by people  who have  obtained new sequence(s)
      which seem to belong to "their" family or families of proteins.
   c) Have access  to electronic  mail and be willing to use it to send and
      receive data.

   If you are willing to be part of this scheme please contact Amos Bairoch
   at one of the following electronic mail addresses:

                             bairoch@cmu.unige.ch
                           bairoch@cgecmu51.bitnet


           APPENDIX C: RELATIONSHIPS BETWEEN BIOMOLECULAR DATABASES

   The current  status of the relationships (cross-references) between some
   biomolecular databases is shown in the following schematic:

                                                       **********************
                        *********************** <----- * EPD [Euk. Promot.] *
                        *  EMBL Nucleotide    * -----> **********************
                        *  Sequence Data      *
***************** ----> *  Library            *        **********************
* FLYBASE       * <---- *********************** <----- * ECD [E. coli map]  *
* [Drosophila   *                ^  |       ^          **********************
* genetic maps] * --------+      |  |       |
***************** <-----+ |      |  |       +--------- **********************
                        | |      |  |       +--------- * TFD [Trans. fact.] *
                        | |      |  |       | +------> **********************
                        | |      |  |       | |
*****************       | v      |  v       v |        **********************
* REBASE        *       ***********************        * ENZYME [Nomencl.]  *
* [Restriction  * <---- *  SWISS-PROT         * <----- **********************
*  enzymes]     *       *  Protein Sequence   *            |
*****************       *  Data Bank          *            v
                        ***********************        **********************
*****************         | ^  |  ^ |  ^ |  |          * OMIM   [Diseases]  *
* PROSITE       * <-------+ |  |  | |  | |  +--------> **********************
* [Patterns]    * ----------+  |  | |  | |
*****************              |  | |  | +-----------> **********************
             |                 |  | |  +-------------- * E. coli 2D gels    *
             |                 |  | |                  **********************
             |                 |  | |
             |                 |  | +----------------> **********************
             |                 |  +------------------- * EcoGene/EcoSeq     *
             |                 v                       **********************
             |          ***********************
             +--------> * PDB [3D structures] *
                        ***********************


Swiss-Prot release 20.0

Published November 1, 1991



                    SWISS-PROT RELEASE 20.0 RELEASE NOTES


                               1. INTRODUCTION

   1.1  Evolution

   Release 20.0  of SWISS-PROT  contains 22654 sequence entries, comprising
   7'500'130 amino  acids abstracted from 22830 references. This represents
   an increase of 5% over release 19. The recent growth of the data bank is
   summarized below.

   Release    Date   Number of entries     Nb of amino acids

   3.0        11/86               4160               969 641
   4.0        04/87               4387             1 036 010
   5.0        09/87               5205             1 327 683
   6.0        01/88               6102             1 653 982
   7.0        04/88               6821             1 885 771
   8.0        08/88               7724             2 224 465
   9.0        11/88               8702             2 498 140
   10.0       03/89              10008             2 952 613
   11.0       07/89              10856             3 265 966
   12.0       10/89              12305             3 797 482
   13.0       01/90              13837             4 347 336
   14.0       04/90              15409             4 914 264
   15.0       08/90              16941             5 486 399
   16.0       11/90              18364             5 986 949
   17.0       02/91              20024             6 524 504
   18.0       05/91              20772             6 792 034
   19.0       08/91              21795             7 173 785
   20.0       11/91              22654             7 500 130

   1.2  Source of data

   Release 20.0  has been  updated using protein sequence data from release
   29.0 of  the PIR (Protein Identification Resource) protein data bank, as
   well as translation of nucleotide sequence data from release 28.0 of the
   EMBL Nucleotide Sequence Database.

   As an  indication of  the source  of the sequence data in the SWISS-PROT
   data bank we list here the statistics concerning the DR (Database cross-
   references) pointer lines:

   Entries with pointer(s) to only PIR entries:             4129
   Entries with pointer(s) to only EMBL entries:            2970
   Entries with pointer(s) to both EMBL and PIR entries:   15061
   Entries with no pointer lines:                            494


      2. DESCRIPTION OF THE CHANGES MADE TO SWISS-PROT SINCE RELEASE 19


   2.1  Sequences and annotations

   About 890  sequences have been added since release 19, the sequence data
   of 187  existing entries  has been  updated and  the annotations of 3030
   entries have  been revised.  In particular we have used review  articles
   to update  the annotations  of  the  following  groups  or  families  of
   proteins:

   -  Aminotransferases class-II and class-III
   -  Aromatic amino acids permeases
   -  Avidin/Streptavidin
   -  Bacterial porins
   -  Bacterial regulatory proteins, asnC family
   -  Bacterial regulatory proteins, crp family
   -  Bacterial ring hydroxylating dioxygenases
   -  Beta-ketoacyl synthases
   -  C3HC4-class zinc finger proteins
   -  Chalcone and resveratrol synthases
   -  Chromo domain proteins
   -  DEAD-box family ATP-dependent helicases
   -  Flagella basal body rod proteins
   -  GATA transcription factors
   -  Glutamate / Leucine / Phenylalanine dehydrogenases
   -  Glycosyl hydrolase family 1
   -  Glycosyl hydrolase family 9
   -  Glycosyl hydrolase family 10
   -  Glycosyl hydrolase family 17
   -  Heme oxygenase
   -  High potential iron-sulfur proteins
   -  HIV envelope proteins
   -  Initiation factor 5a (eIF-5a)
   -  Interferon regulatory factors
   -  Myelin P0 protein
   -  Myelin proteolipid protein
   -  PEP-utilizing enzymes
   -  pfkB family prokaryotic carbohydrate kinases
   -  Plant lipid transfer proteins
   -  PTS Hpr component
   -  Ribosomal proteins
   -  Saposins
   -  Serine/threonine protein kinases



   2.2  Changes in the feature table

        2.2.1  The LIPID key

   Until this  release SWISS-PROT  was very  inconsistent in the use of the
   feature table  to annotate  the covalent binding of lipids (fatty acids,
   prenyl groups,  or glycolipids)  to a  specific position  in  a  protein
   sequence. The  attachment of  a myristate  group was  indicated  by  the
   `MYRISTYL' key,  the attachment of a palmitate group was indicated using
   either the  `MOD_RES' or  the `BINDING' keys, prenylation and GPI-anchor
   were indicated using the `BINDING' key.

   To correct  these inconsistencies  we have  introduced in this release a
   new key `LIPID' and we have deleted the `MYRISTYL' key.

   Definition of the new key:

                LIPID  - Covalent binding of a lipidic moiety

   The  chemical  nature  of  the  bound  lipid  moiety  is  given  in  the
   description. The general format of the LIPID description field is:

        FT   LIPID       xxx    xxx       MODIFICATION (COMMENT).

   The modifications which are currently defined are the following:

   MYRISTATE          Myristate group attached through an amide bond to the
                      N-terminal glycine  residue of  the mature  form of a
                      protein [1,2].

   PALMITATE          Palmitate group  attached through a thioester bond to
                      a cysteine  residue or  through an  ester bond  to  a
                      serine or threonine residue [1,2].

   FARNESYL           Farnesyl group attached through a thioether bond to a
                      cysteine residue [3].

   GERANYL-GERANYL    Geranyl-geranyl group  attached through  a  thioether
                      bond to a cysteine residue [3].

   GPI-ANCHOR         Glycosyl-phosphatidylinositol (GPI)  group linked  to
                      the alpha-carboxyl group of the C-terminal residue of
                      the mature form of a protein [4,5].

   N-ACYL DIGLYCERIDE N-terminal  cysteine   of  the   mature  form   of  a
                      prokaryotic lipoprotein  with an  amide-linked  fatty
                      acid and  a glyceryl  group to  which two fatty acids
                      are linked by ester linkages [6].

   [1] Grand R.J.A.
       Biochem. J. 258:626-638(1989).
   [2] McIlhinney R.A.J.
       Trends Biochem. Sci. 15:387-391(1990).
   [3] Glomset J.A., Gelb M.H., Farnsworth C.C.
       Trends Biochem. Sci. 15:139-142(1990).
   [4] Low M.G.
       FASEB J. 3:1600-1608(1989).
   [5] Low M.G.
       Biochimica Biophysica Acta 988:427-454(1989).
   [6] Hayashi S., Wu H.C.
       J. Bioenerg. Biomembr. 22:451-471(1990).

   Examples of LIPID key feature lines:

      FT   LIPID         1      1       MYRISTATE.
      FT   LIPID        65     65       PALMITATE (BY SIMILARITY).
      FT   LIPID       354    354       GPI-ANCHOR.
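
   For readers who write parsing software, the LIPID description format
   given above can be handled along the lines of the following Python
   sketch. The function name parse_lipid_line and the regular expression
   are illustrative assumptions (they are not part of any SWISS-PROT
   distribution); the sketch assumes whitespace-separated FT fields and a
   description that ends with a period, as in the three examples.

```python
import re

# Hypothetical pattern derived from the examples above: key, start, end,
# MODIFICATION, an optional parenthesized comment, and a final period.
LIPID_RE = re.compile(
    r"^FT\s+LIPID\s+(\d+)\s+(\d+)\s+([A-Z][A-Z -]*?)(?:\s+\((.+)\))?\.$"
)

def parse_lipid_line(line):
    """Return (start, end, modification, comment), or None for other lines."""
    match = LIPID_RE.match(line.strip())
    if match is None:
        return None
    start, end, modification, comment = match.groups()
    return int(start), int(end), modification, comment
```

   The comment element is None when no parenthesized comment is present.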


        2.2.2  The VARSPLIC key

   Some genes, through alternative splicing, encode closely related
   proteins that differ only by the presence or absence of one or more
   domains. Generally a single sequence entry represents the longest form
   of the protein, and the feature table is used to indicate the regions
   which differ in alternatively spliced forms. Until now the `VARIANT'
   key was used for this purpose. For example (from entry P04085;
   PGDA$HUMAN):

      FT   VARIANT     194    196       GRP -> DVR (IN SHORT FORM).
      FT   VARIANT     197    211       MISSING (IN SHORT FORM).

   It was  very difficult  to write  software tools  that could distinguish
   between the  above usage of the `VARIANT' key and the more classical use
   to describe  polymorphisms  or  natural  mutations.  We  have  therefore
   introduced  in   this  release  a  new  key  `VARSPLIC'  which  is  used
   specifically to describe splicing variants.

   The example shown above is now represented by:

      FT   VARSPLIC    194    196       GRP -> DVR (IN SHORT FORM).
      FT   VARSPLIC    197    211       MISSING (IN SHORT FORM).
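
   With the new key, software no longer has to inspect the description
   text to tell the two usages apart. A minimal Python sketch follows; the
   function name split_variant_features is an assumption made for
   illustration, and the sketch only looks at the feature key in the
   second whitespace-separated field.

```python
def split_variant_features(ft_lines):
    """Separate splice-variant features from classical VARIANT features."""
    splice, classical = [], []
    for line in ft_lines:
        fields = line.split(None, 4)  # FT, key, from, to, description
        if len(fields) < 4 or fields[0] != "FT":
            continue  # not a complete feature line
        if fields[1] == "VARSPLIC":
            splice.append(line)
        elif fields[1] == "VARIANT":
            classical.append(line)
    return splice, classical
```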


   2.3  Changes in the cross-references lines (DR)

        2.3.1  Cross-references to EcoGene

   Starting with this release we have added cross-references to the EcoGene
   section  of  the  EcoSeq/EcoMap  integrated  Escherichia  coli  database
   prepared  by   Ken  Rudd   at  the  National  Center  for  Biotechnology
   Information (NCBI)  (for a description see: Rudd K.E., Miller W., Werner
   C., Ostell  J., Tolstoshev  C., and Satterfield S.G.; Nucleic Acids Res.
   (1991) 19:637-647).

   These cross-references are present in the DR lines:

   Data bank identifier: ECOGENE
   Primary identifier  : EcoGene gene accession number
   Secondary identifier: Gene designation
   Example             : DR   ECOGENE; EG10075; AROC.
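
   As an illustration, a DR line of this shape can be split into its three
   fields with a few lines of Python. This is a sketch only: the function
   name parse_dr_line is hypothetical, and it assumes exactly three
   semicolon-separated fields terminated by a period, as in the example
   above.

```python
def parse_dr_line(line):
    """Split a DR line into (data_bank, primary_id, secondary_id)."""
    if not line.startswith("DR"):
        raise ValueError("not a DR line")
    body = line[2:].strip()
    if body.endswith("."):
        body = body[:-1]  # drop the terminating period
    parts = [part.strip() for part in body.split(";")]
    if len(parts) != 3:
        raise ValueError("expected three semicolon-separated fields")
    return tuple(parts)
```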

   The collaboration with Ken Rudd goes much further than simply adding
   these cross-references. Thanks to this collaboration we have been able
   to update hundreds of Escherichia coli sequence entries (to add data
   concerning the function of some proteins, to resolve sequence
   conflicts, to add references and comments, etc.). We are also using his
   master list of sequenced genes to pinpoint missing sequences, and we
   have implemented his gene name nomenclature for hypothetical proteins.
   This scheme is described below.

   Unnamed Escherichia  coli hypothetical  proteins and proteins of unknown
   function are  assigned gene  names based  upon their  position on the E.
   coli genomic  physical map. They all begin with the letter `Y'. The next
   two letters designate which 1/100th of the map (starting at the thr
   locus) contains the ORF, in the order YAA, YAB, ..YAJ, YBA, YBB, ..YBJ,
   YCA, YCB, ..YJJ. ORFs within any one of these 100 intervals are given a
   fourth letter  (A-Z) that serves to distinguish them but is not meant to
   convey position information.
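
   The naming scheme can be read as a two-digit base-10 number written with
   the letters A to J. The Python sketch below decodes it under that
   reading; the function names and the 0-based interval numbering
   (YAA = interval 0, YJJ = interval 99) are assumptions made for
   illustration, not part of the nomenclature itself.

```python
LETTERS = "ABCDEFGHIJ"  # ten letters give 10 x 10 = 100 map intervals

def map_interval(gene_name):
    """Return the 0-based map interval (0..99) encoded in a Y-gene name."""
    name = gene_name.upper()
    if len(name) != 4 or name[0] != "Y":
        raise ValueError("expected a four-letter name starting with Y")
    tens = LETTERS.index(name[1])   # raises ValueError outside A-J
    units = LETTERS.index(name[2])
    return 10 * tens + units        # the fourth letter carries no position

def interval_prefix(interval):
    """Inverse mapping: interval 0..99 -> three-letter prefix such as 'YAA'."""
    if not 0 <= interval <= 99:
        raise ValueError("interval must be in 0..99")
    tens, units = divmod(interval, 10)
    return "Y" + LETTERS[tens] + LETTERS[units]
```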

        2.3.2  FlyBase

   The official name of Michael Ashburner's Drosophila Genetic Maps
   database has been changed from `DMAP' to `FlyBase'; we have therefore
   changed the data bank identifier in the DR line from `DMAP' to
   `FLYBASE' for all cross-references to that data collection.



   2.4  Minor change in the RL lines for thesis references

   Up till  now, thesis  references have been formatted as in the following
   example:

        RL   UNPUBLISHED (1972) THESIS, GEORGE WASHINGTON UNIVERSITY, USA.

   In recognition  of the  fact  that  theses  are  generally  regarded  as
   published references  we will  format them  as follows starting with the
   current release:

        RL   THESIS (19YY), INSTITUTION_NAME, COUNTRY.

   Example:

        RL   THESIS (1972), GEORGE WASHINGTON UNIVERSITY, USA.

   For those  of you  who write  software to  parse reference  blocks,  the
   presence of  the word `THESIS' as the first word on the first RL line of
   a reference  block will  thus indicate a thesis reference. The remaining
   text consists of a parenthesized year followed by the institution name
   followed by the country where that institution is located.
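
   A parser for the new thesis format might look like the following Python
   sketch. The name parse_thesis and the regular expression are
   illustrative assumptions; in particular the sketch assumes that the
   country name contains no comma and that the reference fits on a single
   RL line.

```python
import re

# Hypothetical pattern for: RL   THESIS (19YY), INSTITUTION_NAME, COUNTRY.
THESIS_RE = re.compile(r"^RL\s+THESIS \((\d{4})\), (.+), ([^,]+)\.$")

def parse_thesis(rl_line):
    """Return (year, institution, country), or None if not a thesis line."""
    match = THESIS_RE.match(rl_line.strip())
    if match is None:
        return None
    year, institution, country = match.groups()
    return int(year), institution, country
```

   Lines in the old `UNPUBLISHED ... THESIS' layout are deliberately not
   matched and yield None.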


                            3. FORTHCOMING CHANGES

   The following changes will be implemented starting with release 21.

   3.1  Change in the format of the entry names

   The dollar  sign `$'  in entry  names will be replaced by the underscore
   character `_'. This change is made on behalf of users of sequence
   analysis software  running under  the Unix  operating system,  where the
   dollar sign  is a  reserved symbol.  Example: the entry name `CYC$HUMAN'
   will be changed to `CYC_HUMAN'.
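
   Software that must accept entry names from both before and after this
   change could normalize them as sketched below. The function name
   normalize_entry_name is an assumption for illustration.

```python
def normalize_entry_name(name):
    """Map an old-style name such as 'CYC$HUMAN' to the new 'CYC_HUMAN'.

    Names that already use the underscore are returned unchanged.
    """
    return name.replace("$", "_")
```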

   3.2  New line type GN

   The GN (Gene Name) line is a new line that will be used to indicate the
   name(s) of the gene(s) that encode the protein being described.
   Currently this  information is  found in  the DE  line as  shown in  the
   following example:

        DE   SERUM ALBUMIN PRECURSOR (GENE NAME: ALB).

   The format of the GN line will be:

        GN   NAME1[ AND|OR NAME2...].

   Examples:

        GN   ALB.
        GN   REX-1.

   It often  occurs that  more than  one gene  name has been assigned to an
   individual locus. In that case all the synonyms will be listed. The word
   `OR' separates the different designations. The first name in the list is
   assumed to be the most correct (or most current) designation. Example:

        GN   HNS OR DRDX OR OSMZ OR BGLY.

   In a few cases, multiple genes encode an identical protein sequence.
   In that case all the different gene names will be listed. The word `AND'
   separates the designations. Example:

        GN   CECA1 AND CECA2.

   In very rare cases (only one occurrence has been found in the current
   release) `AND' and `OR' may both be present. In that case parentheses
   are used as shown in the following example:

        GN   GVPA AND (GVPB OR GVPA2).
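
   Because `AND', `OR' and parentheses can be combined, a GN line is most
   easily read with a small recursive parser. The Python sketch below is
   one possible approach, not a reference implementation: the function
   names and the nested-tuple result are assumptions, and it assumes a
   single-line GN entry terminated by a period.

```python
import re

TOKEN_RE = re.compile(r"\(|\)|[^\s()]+")

def parse_gn(line):
    """Parse a GN line into a gene name or an (operator, [items]) tuple."""
    body = line.strip()
    if not body.startswith("GN") or not body.endswith("."):
        raise ValueError("not a GN line")
    tokens = TOKEN_RE.findall(body[2:-1])
    tree, pos = _parse_expr(tokens, 0)
    if pos != len(tokens):
        raise ValueError("unexpected token: " + tokens[pos])
    return tree

def _parse_expr(tokens, pos):
    items, operator = [], None
    while True:
        if pos >= len(tokens) or tokens[pos] in (")", "AND", "OR"):
            raise ValueError("gene name or '(' expected")
        if tokens[pos] == "(":
            item, pos = _parse_expr(tokens, pos + 1)
            if pos >= len(tokens) or tokens[pos] != ")":
                raise ValueError("missing closing parenthesis")
            pos += 1
        else:
            item = tokens[pos]
            pos += 1
        items.append(item)
        if pos < len(tokens) and tokens[pos] in ("AND", "OR"):
            if operator not in (None, tokens[pos]):
                raise ValueError("mixed AND/OR requires parentheses")
            operator = tokens[pos]
            pos += 1
        else:
            break
    return (items[0] if operator is None else (operator, items)), pos
```

   For a list of synonyms the first item of the ("OR", [...]) result is the
   preferred designation, following the rule given above.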


   3.3  New line type RM

   The RM  (Reference Medline)  line will  be used  to indicate the Medline
   Unique Identifier  (UID) of  a reference.  This information is currently
   listed in  the RC  line using  the  `MEDLINE'  token  as  shown  in  the
   following example:

        RC   MEDLINE=90205618;

   The format of the RM line will be:

        RM   nnnnnnnn

   where `nnnnnnnn' is the eight digit Medline Unique Identifier (UID).

   Example:

        RM   90205618
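
   Existing files that carry the identifier in an RC line could be
   converted to the planned RM layout as in the following Python sketch.
   The helper name rc_to_rm is hypothetical, and the sketch assumes the
   eight-digit form shown above.

```python
import re

def rc_to_rm(rc_line):
    """Return the RM line equivalent to an RC MEDLINE token, else None."""
    match = re.search(r"\bMEDLINE=(\d{8});", rc_line)
    if match is None:
        return None
    return "RM   " + match.group(1)
```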


   3.4  Secondary structure information

   Thanks to the help of Chris Sander of the Biocomputing group at EMBL, we
   are going to add secondary structure information to the feature table of
   each sequence entry that corresponds to a protein whose tertiary
   structure is known. Complete details regarding this new feature will be
   communicated in the next release notes.



                            4. ENZYME AND PROSITE

   4.1  The ENZYME data bank

   Release 7.0 of the ENZYME data bank is distributed along with release 19
   of SWISS-PROT. ENZYME release 7.0 contains information on 3072 enzymes.
   The data bank is complete and up to date. Until new enzyme nomenclature
   data is published, we only plan to update the SWISS-PROT pointers at
   each release of the protein sequence data bank, correct any errors, and
   complete the information concerning synonyms and cofactors using the
   literature.


   4.2  The PROSITE data bank

   Release 8.00 of the PROSITE data bank is distributed along with release
   20 of SWISS-PROT. Release 8.00 contains 530 documentation chapters that
   describe 605 different patterns.

        4.2.1  What's new in release 8.0

   Since the  last major release of PROSITE (release 7.01 of June 1991), 88
   new chapters have been added and 210 chapters have been updated. The new
   chapters are:

   -  Fibrinogen beta and gamma chains C-terminal domain signature
   -  Somatomedin B domain signature
   -  Cellulose-binding domain, bacterial type
   -  Cellulose-binding domain, fungal type
   -  Zinc finger, C3HC4 type, signature
   -  IRF family signature
   -  TEA domain signature
   -  Fibrillarin signature
   -  Bacterial regulatory proteins, asnC family signature
   -  Bacterial regulatory proteins, merR family signature
   -  HMG-I and HMG-Y DNA-binding domain (A+T-hook)
   -  Chromo domain
   -  Nuclear transition protein 1 signature
   -  Ribosomal protein L6 signature
   -  Ribosomal protein L16 signature
   -  Ribosomal protein L29 signature
   -  Ribosomal protein L33 signature
   -  Ribosomal protein L19e signature
   -  Ribosomal protein L32e signature
   -  Ribosomal protein S3 signature
   -  Ribosomal protein S5 signature
   -  Ribosomal protein S14 signature
   -  Ribosomal protein S4e signature
   -  Ribosomal protein S6e signature
   -  Ribosomal protein S24e signature
   -  FMN-dependent alpha-hydroxy acid dehydrogenases active site
   -  Eukaryotic molybdopterin oxidoreductases signature
   -  Delta 1-pyrroline-5-carboxylate reductase signature
   -  Pyridine nucleotide-disulphide oxidoreductases class-II active site
   -  Respiratory chain NADH dehydrogenase 30 Kd subunit signature
   -  Respiratory chain NADH dehydrogenase 49 Kd subunit signature
   -  Bacterial ring hydroxylating dioxygenases alpha-subunit signature
   -  Heme oxygenase signature
   -  Beta-ketoacyl synthases active site
   -  Transglutaminases active site
   -  Aminotransferases class-II pyridoxal-phosphate attachment site
   -  Aminotransferases class-III pyridoxal-phosphate attachment site
   -  Phosphoserine aminotransferase signature
   -  pfkB family prokaryotic carbohydrate kinases signatures
   -  Phosphoribulokinase signature
   -  Thymidine kinase cellular-type signature
   -  DNA polymerase family X signature
   -  PEP-utilizing enzymes phosphorylation site signature
   -  cAMP phosphodiesterases class-II signature
   -  Ribonuclease III family signature
   -  Ribonuclease T2 family histidine active sites
   -  Glycosyl hydrolases family 1 active site
   -  Glycosyl hydrolases family 9 active site
   -  Glycosyl hydrolases family 10 active site
   -  Glycosyl hydrolases family 17 signature
   -  Alkylbase DNA glycosidases alkA family signature
   -  Matrixins cysteine switch
   -  Amidases signature
   -  ATP synthase c subunit signature
   -  Phosphoenolpyruvate carboxykinase (ATP) signature
   -  Fructose-bisphosphate aldolase class-II signature
   -  Porphobilinogen deaminase cofactor-binding site
   -  Ferrochelatase signature
   -  Aldose 1-epimerase putative active site
   -  Methylmalonyl-CoA mutase signature
   -  Ubiquitin-activating enzyme signature
   -  Adenylosuccinate synthetase active site
   -  Argininosuccinate synthase signatures
   -  Cytochrome b559 subunits heme-binding site signature
   -  High potential iron-sulfur proteins signature
   -  Bacterioferritin signature
   -  Hemerythrins signature
   -  Avidin / Streptavidin family signature
   -  Plant lipid transfer proteins signature
   -  Aromatic amino acids permeases signature
   -  General diffusion gram-negative porins signature
   -  Eukaryotic porin signature
   -  Myelin basic protein signature
   -  Myelin P0 protein signature
   -  Myelin proteolipid protein signature
   -  Synaptophysin / synaptoporin signature
   -  Flagella basal body rod proteins signature
   -  Plant viruses icosahedral capsid proteins 'S' region signature
   -  Bacterial chemotaxis sensory transducers signature
   -  Interleukin-10 signature
   -  LIF / OSM family signature
   -  Pyrokinins signature
   -  Hok/gef family cell toxic proteins signature
   -  Lambdoid phages regulatory protein CIII signature
   -  Stathmin family signature
   -  HlyD family secretion proteins signature
   -  Seminal vesicle protein II repeats signature


        4.2.2  New /SKIP-FLAG qualifier for CC lines

   Some  PROSITE  keys  such  as  those  describing  commonly  found  post-
   translational modifications  (a typical  example is N-glycosylation) are
   found in  the majority of known protein sequences. While it is generally
   useful to note their presence, some programs may want, in some cases, to
   ignore those  keys. For  this purpose  these keys are indicated with the
   following qualifier in their CC lines:

   CC   /SKIP-FLAG=TRUE;
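
   A program that wants to ignore such keys can test for the qualifier as
   in the Python sketch below. The function name has_skip_flag is an
   assumption, as is the representation of a PROSITE entry as a list of
   its lines; the sketch also assumes that qualifiers on a CC line are
   separated by semicolons, as in the line shown above.

```python
def has_skip_flag(entry_lines):
    """True if any CC line of the entry carries /SKIP-FLAG=TRUE."""
    for line in entry_lines:
        if not line.startswith("CC"):
            continue
        qualifiers = [q.strip() for q in line[2:].split(";")]
        if "/SKIP-FLAG=TRUE" in qualifiers:
            return True
    return False
```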


        4.2.3  The new 3D line type

   We have introduced a new line type, 3D (3D-structure), which is used to
   list the code(s) of X-ray crystallography Protein Data Bank (PDB)
   entries that contain structural data corresponding to the sequence
   region described in a PROSITE entry. The format of the 3D line is:

   3D   name; [name2;...]

   Example:

   3D   7WGA; 9WGA; 1WGC; 2WGC;
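
   Reading the PDB codes from such a line is straightforward; a minimal
   Python sketch is given here. The function name parse_3d_line is a
   hypothetical choice, and the sketch simply splits on semicolons.

```python
def parse_3d_line(line):
    """Return the list of PDB codes found on a PROSITE 3D line."""
    if not line.startswith("3D"):
        raise ValueError("not a 3D line")
    return [code.strip() for code in line[2:].split(";") if code.strip()]
```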



                            5. WE NEED YOUR HELP !

   We welcome feedback from our users. We would especially appreciate it if
   you would notify us when sequences belonging to your field of expertise
   are missing from the data bank. We would also like to be notified about
   annotations that need updating, for example if the function of a protein
   has been clarified or if new post-translational information has become
   available.



                         APPENDIX A: SOME STATISTICS



   A.1  Amino acid composition

        A.1.1  Composition in percent for the complete data bank

   Ala (A) 7.65   Gln (Q) 4.08   Leu (L) 9.12   Ser (S) 7.10
   Arg (R) 5.23   Glu (E) 6.27   Lys (K) 5.85   Thr (T) 5.85
   Asn (N) 4.46   Gly (G) 7.10   Met (M) 2.32   Trp (W) 1.30
   Asp (D) 5.25   His (H) 2.27   Phe (F) 3.95   Tyr (Y) 3.21
   Cys (C) 1.81   Ile (I) 5.45   Pro (P) 5.08   Val (V) 6.49

   Asx (B) 0.01   Glx (Z) 0.01   Xaa (X) 0.03


        A.1.2  Classification of the amino acids by their frequency

   Leu, Ala, Gly, Ser, Val, Glu, Lys, Thr, Ile, Asp, Arg, Pro, Asn, Gln,
   Phe, Tyr, Met, His, Cys, Trp
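
   The ordering above follows directly from the percentages in A.1.1, as
   the short Python sketch below illustrates. The names COMPOSITION and
   rank_by_frequency are assumptions; ties (Gly/Ser and Lys/Thr) are left
   in alphabetical order, which matches the list above.

```python
# Percentages copied from table A.1.1, in alphabetical order.
COMPOSITION = {
    "Ala": 7.65, "Arg": 5.23, "Asn": 4.46, "Asp": 5.25, "Cys": 1.81,
    "Gln": 4.08, "Glu": 6.27, "Gly": 7.10, "His": 2.27, "Ile": 5.45,
    "Leu": 9.12, "Lys": 5.85, "Met": 2.32, "Phe": 3.95, "Pro": 5.08,
    "Ser": 7.10, "Thr": 5.85, "Trp": 1.30, "Tyr": 3.21, "Val": 6.49,
}

def rank_by_frequency(composition):
    """Return residue names sorted from most to least frequent."""
    # sorted() is stable, so tied residues keep their input order.
    return sorted(composition, key=lambda aa: composition[aa], reverse=True)
```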



   A.2  Distribution of the sequences by their organism of origin

   Total number of species represented in this release of SWISS-PROT: 3029

        A.2.1 Table of the frequency of occurrence of species

        Species represented 1x: 1303
                            2x:  548
                            3x:  308
                            4x:  189
                            5x:  131
                            6x:  100
                            7x:   73
                            8x:   47
                            9x:   64
                           10x:   32
                       11- 20x:  117
                       21-100x:   92
                         >100x:   25


        A.2.2  Table of the most represented species

    Number   Frequency          Species
         1        1847          Human
         2        1596          Escherichia coli
         3        1083          Mouse
         4        1014          Rat
         5         778          Baker's yeast (Saccharomyces cerevisiae)
         6         511          Bovine
         7         435          Fruit fly (Drosophila melanogaster)
         8         374          Chicken
         9         311          Bacillus subtilis
        10         266          Rabbit
                   266          African clawed frog (Xenopus laevis)
        12         251          Vaccinia virus (strain Copenhagen)
        13         245          Pig
        14         208          Salmonella typhimurium
        15         193          Human cytomegalovirus (strain AD169)
        16         166          Bacteriophage T4
        17         160          Maize
        18         133          Rice
        19         118          Vaccinia virus (strain WR)
                   118          Tobacco
        21         113          Pea
        22         111          Wheat
        23         110          Staphylococcus aureus
        24         101          Slime mold (Dictyostelium discoideum)
                   101          Barley
        26         100          Sheep
        27          94          Fission yeast (Schizosaccharomyces pombe)
        28          93          Pseudomonas aeruginosa
        29          92          Spinach
        30          89          Pseudomonas putida
        31          86          Soybean
                    86          Neurospora crassa
        33          85          Dog
        34          84          Liverwort (Marchantia polymorpha)
        35          82          Klebsiella pneumoniae


   A.3  Distribution of the sequences by size



               From   To  Number             From   To   Number
                  1-  50    1494             1001-1100      210
                 51- 100    2500             1101-1200      131
                101- 150    3559             1201-1300      108
                151- 200    2181             1301-1400       64
                201- 250    1817             1401-1500       54
                251- 300    1634             1501-1600       32
                301- 350    1494             1601-1700       25
                351- 400    1436             1701-1800       26
                401- 450    1104             1801-1900       30
                451- 500    1188             1901-2000       23
                501- 550     881             2001-2100        9
                551- 600     619             2101-2200       25
                601- 650     438             2201-2300       31
                651- 700     319             2301-2400       11
                701- 750     297             2401-2500       13
                751- 800     232             >2500           53
                801- 850     187
                851- 900     194
                901- 950     119
                951-1000     116


   Currently the ten largest sequences are:


                            RYNR$RABIT  5037 a.a.
                            RYNR$HUMAN  5032 a.a.
                            APB$HUMAN   4563 a.a.
                            APOA$HUMAN  4548 a.a.
                            DYHC$TRIGR  4466 a.a.
                            POLG$BVDV   3988 a.a.
                            POLG$HCVA   3898 a.a.
                            POLG$HCVB   3898 a.a.
                            TRX$DROME   3759 a.a.
                            ACVA$PENCH  3746 a.a.















                         APPENDIX B: ON-LINE EXPERTS


   B.1  List of on-line experts for PROSITE and SWISS-PROT


Field of expertise            Name               Email address
---------------------------   ------------------ ----------------------------
African swine fever virus     Yanez R.J.         ryanez@cbm2.uam.es
Alcohol dehydrogenases        Joernvall H.       hans.jornvall@k1m.ki.se
                              Persson B.         bengt@medfys.ki.se
Aldehyde dehydrogenases       Joernvall H.       hans.jornvall@k1m.ki.se
                              Persson B.         bengt@medfys.ki.se
Alpha-crystallins/HSP-20      Leunissen J.A.M.   jackl@caos.caos.kun.nl
                              de Jong W.         u629000@hnykun11.bitnet
Alpha-2-macroglobulins        Van Leuven F.      fred@blekul13.bitnet
Apolipoproteins               Boguski M.S.       boguski@ncbi.nlm.nih.gov
Arrestins                     Kolakowski L.F.Jr. kolakowski@helix.mgh.harvard.edu
Bacteriophage P4 proteins     Halling C.         chh9@midway.uchicago.edu
Beta-lactamases               Brannigan J.       jab5@vaxa.york.ac.uk
Chitinases                    Henrissat B.       cermav@frgren81.bitnet
Clusterin                     Peitsch M.C.       peitsch@ulbio1.unil.ch
CTF/NF-I                      Mermod N.          nmermod@ulys.unil.ch
Cytochromes P450              Holsztynska E.J.   ela@netcom.uucp
                                                 netcom!ela@apple.com
EF-hand calcium-binding       Cox J.A.           cox@sc2a.unige.ch
                              Kretsinger R.H.    rhk5i@virginia.bitnet
Enoyl-CoA hydratase           Hofmann K.O.       khofmann@cipvax.biolan.uni-koeln.de
fruR/lacI family HTH proteins Reizer J.          jreizer@ucsd.edu
GATA-type zinc-fingers        Boguski M.S.       boguski@ncbi.nlm.nih.gov
Glucanases                    Henrissat B.       cermav@frgren81.bitnet
                              Beguin P.          phycel@pasteur.bitnet
G-protein coupled receptors   Chollet A.         chollet@clients.switch.ch
                              Attwood T.K.       bph6tka@biovax.leeds.ac.uk
GTPase-activating proteins    Boguski M.S.       boguski@ncbi.nlm.nih.gov
HMG1/2 and HMG-14/17          Landsman D.        landsman@ncbi.nlm.nih.gov
Inorganic pyrophosphatases    Kolakowski L.F.Jr. kolakowski@helix.mgh.harvard.edu
Integrases                    Roy P.H.           2020000@lavalvx1.bitnet
Lipocalins                    Boguski M.S.       boguski@ncbi.nlm.nih.gov
                              Peitsch M.C.       peitsch@ulbio1.unil.ch
MAC components / perforin     Peitsch M.C.       peitsch@ulbio1.unil.ch
Myelin proteolipid protein    Hofmann K.O.       khofmann@cipvax.biolan.uni-koeln.de
PEP requiring enzymes         Reizer J.          jreizer@ucsd.edu
Phytochromes                  Partis M.D.        partis@gcri.afrc.ac.uk
Prokaryotic carbohydrate      Reizer J.          jreizer@ucsd.edu
            kinases
Protein kinases               Hanks S.           hanks@vuctrvax.bitnet
                              Hunter T.          hunter@salk.bitnet
PTS proteins                  Reizer J.          jreizer@ucsd.edu
Restriction-modification      Bickle T.          bickle@urz.unibas.ch
            enzymes           Roberts R.J.       roberts@cshl.org
Ribosomal protein S3          Hallick R.         hallick%biotec@arizona.edu
Ribosomal protein S15         Ellis S.R.         srelli01@ulkyvm.bitnet
Ring-cleavage dioxygenases    Harayama S.        harayama@cmu.unige.ch
Sodium symporters             Reizer J.          jreizer@ucsd.edu
Subtilases                    Brannigan J.       jab5@vaxa.york.ac.uk
Thiol proteases               Turk B.            turk@ijs.ac.mail.yu
Thiol proteases inhibitors    Turk B.            turk@ijs.ac.mail.yu
TPR repeats                   Boguski M.S.       boguski@ncbi.nlm.nih.gov
Transit peptides              von Heijne G.      gunnar@cbts.sunet.se
Type-II membrane antigens     Levy S.            levy@cellbio.stanford.edu
Uracil-DNA glycosylase        Aasland R.         aasland@bio.uib.no
Xylose isomerase              Jenkins J.         jenkins@frira.afrc.ac.uk



   B.2  Requirements to fulfill to become an on-line expert

   An expert should be a scientist working with one or more specific
   families of proteins (or specific domains) who would:

   a) Review the protein sequences in SWISS-PROT and the patterns/matrices
      in PROSITE relevant to their field of research.
   b) Agree to be contacted by people who have obtained new sequence(s)
      which seem to belong to "their" families of proteins.
   c) Have access  to electronic  mail and be willing to use it to send and
      receive data.

   If you are willing to be part of this scheme please contact Amos Bairoch
   at one of the following electronic mail addresses:

                           bairoch@cgecmu51.bitnet
                             bairoch@cmu.unige.ch
























           APPENDIX C: RELATIONSHIPS BETWEEN BIOMOLECULAR DATABASES

   The current  status of the relationships (cross-references) between some
   biomolecular databases is shown in the following schematic:

                                                       **********************
                        *********************** <----- * EPD [Euk. Promot.] *
                        *  EMBL Nucleotide    * -----> **********************
                        *  Sequence Data      *
***************** ----> *  Library            *        **********************
* FLYBASE       *       *********************** <----- * ECD [E. coli map]  *
* [Drosophila   *                ^  |       ^          **********************
* genetic maps] * --------+      |  |       |
***************** <-----+ |      |  |       +--------- **********************
                        | |      |  |       +--------- * TFD [Trans. fact.] *
                        | |      |  |       | +------> **********************
                        | |      |  |       | |
*****************       | v      |  v       v |        **********************
* REBASE        *       ***********************        * ENZYME [Nomencl.]  *
* [Restriction  * <---- *  SWISS-PROT         * <----- **********************
*  enzymes]     *       *  Protein Sequence   *            |
*****************       *  Data Bank          *            v
                        ***********************        **********************
*****************         | ^     |  ^ | |  |          * OMIM   [Diseases]  *
* PROSITE       * <-------+ |     |  | | |  +--------> **********************
* [Patterns]    * ----------+     |  | | |
*****************                 |  | | +-----------> **********************
             |                    |  | |               * E. coli 2D gels    *
             |                    |  | |               **********************
             |                    |  | |
             |                    |  | +-------------> **********************
             |                    v  +---------------- * EcoGene/EcoSeq     *
             |          ***********************        **********************
             +--------> * PDB [3D structures] *
                        ***********************





















Swiss-Prot release 19.0

Published August 1, 1991



                    SWISS-PROT RELEASE 19.0 RELEASE NOTES


                               1. INTRODUCTION

   1.1  Evolution

   Release 19.0  of SWISS-PROT  contains 21795 sequence entries, comprising
   7'173'785 amino  acids abstracted from 21773 references. This represents
   an increase of 6% over release 18. The recent growth of the data bank is
   summarized below.

   Release    Date   Number of entries     Nb of amino acids

   3.0        11/86               4160               969 641
   4.0        04/87               4387             1 036 010
   5.0        09/87               5205             1 327 683
   6.0        01/88               6102             1 653 982
   7.0        04/88               6821             1 885 771
   8.0        08/88               7724             2 224 465
   9.0        11/88               8702             2 498 140
   10.0       03/89              10008             2 952 613
   11.0       07/89              10856             3 265 966
   12.0       10/89              12305             3 797 482
   13.0       01/90              13837             4 347 336
   14.0       04/90              15409             4 914 264
   15.0       08/90              16941             5 486 399
   16.0       11/90              18364             5 986 949
   17.0       02/91              20024             6 524 504
   18.0       05/91              20772             6 792 034
   19.0       08/91              21795             7 173 785
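
   The growth between two releases can be recomputed from the table. The
   Python sketch below does so for releases 18.0 and 19.0; the function
   name percent_increase is an assumption. On these figures the quoted 6%
   appears to correspond to the amino acid count after rounding, while the
   number of entries grew slightly less.

```python
def percent_increase(old, new):
    """Relative growth from old to new, in percent."""
    return 100.0 * (new - old) / old

# Figures taken from the rows for releases 18.0 and 19.0 above.
entries_growth = percent_increase(20772, 21795)
residues_growth = percent_increase(6792034, 7173785)
print(f"entries: {entries_growth:.1f}%  amino acids: {residues_growth:.1f}%")
# prints: entries: 4.9%  amino acids: 5.6%
```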


   1.2  Source of data

   Release 19.0  has been  updated using protein sequence data from release
   28.0 of  the PIR (Protein Identification Resource) protein data bank, as
   well as translation of nucleotide sequence data from release 27.0 of the
   EMBL Nucleotide Sequence Database.

   As an indication of the source of the sequence data in the SWISS-PROT
   data bank we list here the statistics concerning the DR (Database
   cross-references) pointer lines:

   Entries with pointer(s) to only PIR entri(es):           3912
   Entries with pointer(s) to only EMBL entri(es):          3542
   Entries with pointer(s) to both EMBL and PIR entri(es): 13835
   Entries with no pointer lines:                            506


      2. DESCRIPTION OF THE CHANGES MADE TO SWISS-PROT SINCE RELEASE 18


   2.1  Sequences and annotations

   About 1040 sequences have been added since release 18, the sequence data
   of 194 existing entries has been updated, and the annotations of 2220
   entries have been revised. In particular we have used review articles
   to update the annotations of the following groups or families of
   proteins:

   -  Adenylosuccinate synthetase
   -  Aldose 1-epimerase
   -  Argininosuccinate synthase
   -  Amidases
   -  Bacterial regulatory proteins, lacI family
   -  Bacterial regulatory proteins, merR family
   -  Bacterioferritin
   -  Delta 1-pyrroline-5-carboxylate reductase
   -  DNA polymerase family X
   -  Eukaryotic molybdopterin-dependent oxidoreductases
   -  Eukaryotic porin
   -  Ferrochelatase
   -  Fibrillarin
   -  FMN-dependent alpha-hydroxy acid dehydrogenases
   -  Hemerythrins
   -  HlyD family secretion proteins
   -  Hok/gef family cell toxic proteins
   -  Matrixins
   -  Methylmalonyl-CoA mutase
   -  Phosphoribulokinase
   -  Plant viruses icosahedral capsid proteins
   -  Porphobilinogen deaminase
   -  Pyrokinins
   -  Ribonuclease T2 family
   -  Stathmin family
   -  Transglutaminases
   -  Ubiquitin-activating enzyme
   -  Zinc finger RFP/RPT-1 family


   2.2  New line types: RP and RC

   The following change has been implemented in release 19: the RN line has
   been replaced by three line types: a modified RN (Reference Number) line
   type containing just the reference number, a new RP (Reference Position)
   line type  containing the  extent of the work carried out by the authors
   of the  reference, and a new RC (Reference Comment) line type containing
   comments  relevant  to  the  reference  (strain,  tissue,  etc.).  Three
   examples of the usage of these new line types are given below.

      RN   [1]
      RP   SEQUENCE FROM N.A., AND SEQUENCE OF 1-23.
      RC   STRAIN=K12;

      RN   [1]
      RP   SEQUENCE OF 24-56 AND 67-89.
      RC   STRAIN=BALB/C; TISSUE=BRAIN;

      RN   [1]
      RP   X-RAY CRYSTALLOGRAPHY, 1.8 ANGSTROMS.
      RC   MEDLINE=91002678;

   Within a reference block the RN and RP lines each occur once; the RC
   line occurs zero or more times.

   The format of the RC line is:

      RC   TOKEN1=Text; TOKEN2=Text;....

   Where the  following tokens  are currently  defined:  MEDLINE,  PLASMID,
   SPECIES, STRAIN, TISSUE, and TRANSPOSON.

   The `SPECIES' token is only used when an entry describes a sequence
   which is identical in more than one species; similarly, the `PLASMID'
   token is only used if an entry describes a sequence identical in more
   than one plasmid.
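
   As an illustration only, the RC syntax described above can be read with
   a few lines of Python. This is a minimal sketch, not an official parser:
   the function name parse_rc_line is hypothetical, and the code assumes
   that the text of a token never contains a `;' character.

```python
def parse_rc_line(line):
    """Return the TOKEN=Text pairs of one RC line as a dict.

    Hypothetical helper: assumes the text of a token contains no ';'.
    """
    if not line.startswith("RC"):
        raise ValueError("not an RC line: %r" % line)
    tokens = {}
    # Drop the two-letter line code, then split on the ';' terminators.
    for item in line[2:].split(";"):
        item = item.strip()
        if not item:
            continue  # empty piece after the final ';'
        name, _, text = item.partition("=")
        tokens[name.strip()] = text.strip()
    return tokens


print(parse_rc_line("RC   STRAIN=BALB/C; TISSUE=BRAIN;"))
```

   A dict is used because each token appears to occur at most once per
   line in the examples; a list of pairs would be safer if that assumption
   does not hold.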


   2.3  MEDLINE unique identifiers

   Starting with  release 19  each journal  reference listed  in SWISS-PROT
   which  exists   in  the  National  Library  of  Medicine  (NLM)  MEDLINE
   bibliographic data  bank includes  the `Unique  Identifier' (UI) of that
   reference. This  information is stored in the new RC line type using the
   `MEDLINE' token. Example:

      RC   MEDLINE=90205618;

   It is  planned that,  in a few months, MEDLINE will add cross-references
   to SWISS-PROT.


   2.4  New cross-references

   We have added cross-references to the Transcription Factors Database
   (TFD) of David Ghosh (for a description see: Nucleic Acids Res. (1990)
   18:1749-1756), as well as to the Drosophila Genetic Maps database (DMAP)
   prepared by Michael Ashburner at the Department of Genetics in
   Cambridge, England. These cross-references are present in the DR lines:

   Data bank identifier: DMAP
   Primary identifier  : Gene unique identifier number (UID).
   Secondary identifier: Latest release of DMAP that was used to derive
                         the cross-references.
   Example             : DR   DMAP; 00055; RELEASE 9107.

   Data bank identifier: TFD
   Primary identifier  : Unique identifier for the corresponding TFD
                         POLYPEPTIDES table entry (the TFD_ID field).
   Secondary identifier: Latest release of TFD that was used to derive
                         the cross-references.
   Example             : DR   TFD; P00040; RELEASE 3.0.
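
   The two DR examples above share the same layout: a data bank identifier,
   a primary identifier and a secondary identifier, separated by `;' and
   closed by a full stop. A minimal Python sketch that splits such a line
   is given here for illustration. The function name parse_dr_line is
   hypothetical, and the code assumes exactly three fields per line, as in
   the examples.

```python
def parse_dr_line(line):
    """Split a DR line into (data bank, primary id, secondary id)."""
    if not line.startswith("DR"):
        raise ValueError("not a DR line: %r" % line)
    body = line[2:].strip()
    # The line ends with a full stop; strip only that final character so
    # that dots inside a field (e.g. "RELEASE 3.0") are preserved.
    if body.endswith("."):
        body = body[:-1]
    fields = [field.strip() for field in body.split(";")]
    if len(fields) != 3:
        raise ValueError("expected three fields, got %d" % len(fields))
    return tuple(fields)


print(parse_dr_line("DR   DMAP; 00055; RELEASE 9107."))
print(parse_dr_line("DR   TFD; P00040; RELEASE 3.0."))
```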


   2.5  Minor change in the DT line format

   There is  now a  single space character between the date and the comment
   part of  a DT  line instead  of the  two spaces  that used  to exist  in
   previous releases. Example:

      DT   01-MAY-1991  (REL. 18, CREATED)

   has been changed to:

      DT   01-MAY-1991 (REL. 18, CREATED)
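
   Programs that still hold DT lines in the previous layout could convert
   them along the following lines. This is only a sketch: the function name
   normalize_dt_line is hypothetical, and the pattern assumes dates of the
   form DD-MMM-YYYY followed by a parenthesised comment, as in the example
   above.

```python
import re

# Date (DD-MMM-YYYY) followed by two or more spaces and the comment.
_OLD_DT = re.compile(r"^(DT   \d{2}-[A-Z]{3}-\d{4}) {2,}(\(.*)$")


def normalize_dt_line(line):
    """Rewrite an old-style DT line with a single space after the date."""
    match = _OLD_DT.match(line)
    if match is None:
        return line  # already in the new format, or not a DT line
    return match.group(1) + " " + match.group(2)


print(normalize_dt_line("DT   01-MAY-1991  (REL. 18, CREATED)"))
```

   Lines that do not match the old layout are returned unchanged, so the
   function can be applied to every line of an entry.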


   2.6  Minor change in the RL lines for submissions

   References for sequence information submitted to the international
   nucleic acid databases (DDBJ, EMBL, GenBank) were previously represented
   by the following subtype of RL lines:

      RL   SUBMITTED (JAN-1991) TO EMBL/GENBANK DATA BANKS.

   Starting with release 19, these RL lines use the following format:

      RL   SUBMITTED (JAN-1991) TO EMBL/GENBANK/DDBJ DATA BANKS.


   2.7  Status of cross-references to PIR

   We have  continued adding cross-references to entries in the unannotated
   sections of  PIR (known  as PIR2  and PIR3);  currently we  have  cross-
   references to  14078 sequence  entries in PIR2/3 out of a total of 20265
   entries in those sections in release 28 of PIR.


                            3. FORTHCOMING CHANGES

   3.1  Change in the format of the entry names

   Starting with release 21 we will replace the dollar sign `$' in entry
   names by the underscore character `_'. This change is made on behalf of
   users of sequence analysis software running under the Unix operating
   system, where the dollar sign is a reserved symbol. Example: the entry
   name `CYC$HUMAN' will be changed to `CYC_HUMAN'.
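
   In Unix shells the `$' character introduces variable expansion, so an
   unquoted name such as CYC$HUMAN may be altered before a program sees it.
   Software that stores entry names in the current format could prepare for
   the change with a conversion such as the one sketched below. The
   function name convert_entry_name is hypothetical.

```python
def convert_entry_name(name):
    """Replace the `$' separator of an entry name by `_'."""
    return name.replace("$", "_")


for old_name in ("CYC$HUMAN", "RYNR$RABIT", "DMD$HUMAN"):
    print(old_name, "->", convert_entry_name(old_name))
```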


                            4. ENZYME AND PROSITE

   4.1  The ENZYME data bank

   Release 6.0 of the ENZYME data bank is distributed along with release 19
   of SWISS-PROT. ENZYME release 6.0 contains information on 3072 enzymes.
   The data bank is complete and up to date. Until new enzyme nomenclature
   data is published, we only plan to update the SWISS-PROT pointers at
   each release of the protein sequence data bank, correct any errors, and
   complete the information concerning synonyms and cofactors using the
   literature.

   4.2  The PROSITE data bank

   Release 7.10 of the PROSITE data bank is distributed along with release
   19 of SWISS-PROT. Release 7.10 contains 441 documentation chapters that
   describe 508 different patterns. Release 7.10 is not really a new
   release; the only change between releases 7.0 and 7.10 is the updating
   of the pointers to the SWISS-PROT entries whose names have been modified
   between releases 18 and 19. The next release of PROSITE (8.0) will be
   distributed with release 20 of SWISS-PROT.


                            5. WE NEED YOUR HELP !

   We welcome feedback from our users. We would especially appreciate being
   notified if you find that sequences belonging to your field of expertise
   are missing from the data bank. We would also like to be notified about
   annotations that need updating, for example if the function of a protein
   has been clarified or if new post-translational information has become
   available.




                         APPENDIX A: SOME STATISTICS



   A.1  Amino acid composition

        A.1.1  Composition in percent for the complete data bank

   Ala (A) 7.64   Gln (Q) 4.09   Leu (L) 9.11   Ser (S) 7.10
   Arg (R) 5.23   Glu (E) 6.27   Lys (K) 5.85   Thr (T) 5.85
   Asn (N) 4.46   Gly (G) 7.11   Met (M) 2.32   Trp (W) 1.30
   Asp (D) 5.24   His (H) 2.27   Phe (F) 3.96   Tyr (Y) 3.21
   Cys (C) 1.82   Ile (I) 5.45   Pro (P) 5.09   Val (V) 6.49

   Asx (B) 0.01   Glx (Z) 0.01   Xaa (X) 0.03


        A.1.2  Classification of the amino acids by their frequency

   Leu, Ala, Gly, Ser, Val, Glu, Thr, Lys, Ile, Asp, Arg, Pro, Asn, Gln,
   Phe, Tyr, Met, His, Cys, Trp
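
   The classification above follows from sorting the percentages of table
   A.1.1 in decreasing order. A short Python sketch of that step is given
   for illustration; the variable names are arbitrary. Note that Thr and
   Lys are tied at two decimal places, so their relative order cannot be
   recovered from the printed figures alone.

```python
# Percentages copied from table A.1.1 (20 standard amino acids only).
composition = {
    "Ala": 7.64, "Gln": 4.09, "Leu": 9.11, "Ser": 7.10,
    "Arg": 5.23, "Glu": 6.27, "Lys": 5.85, "Thr": 5.85,
    "Asn": 4.46, "Gly": 7.11, "Met": 2.32, "Trp": 1.30,
    "Asp": 5.24, "His": 2.27, "Phe": 3.96, "Tyr": 3.21,
    "Cys": 1.82, "Ile": 5.45, "Pro": 5.09, "Val": 6.49,
}

# Most frequent first; ties keep the order in which they were entered.
ranking = sorted(composition, key=composition.get, reverse=True)
print(", ".join(ranking))
```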



   A.2  Distribution of the sequences by their organism of origin

   Total number of species represented in this release of SWISS-PROT: 2986

        A.2.1 Table of the frequency of occurrence of species

        Species represented 1x: 1302
                            2x:  543
                            3x:  303
                            4x:  184
                            5x:  128
                            6x:   93
                            7x:   78
                            8x:   40
                            9x:   60
                           10x:   32
                       11- 20x:  111
                       21-100x:   88
                         >100x:   24



        A.2.2  Table of the most represented species

    Number   Frequency          Species
         1        1790          Human
         2        1512          Escherichia coli
         3        1043          Mouse
         4         972          Rat
         5         735          Baker's yeast (Saccharomyces cerevisiae)
         6         499          Bovine
         7         414          Fruit fly (Drosophila melanogaster)
         8         362          Chicken
         9         286          Bacillus subtilis
        10         261          Rabbit
        11         252          African clawed frog (Xenopus laevis)
        12         251          Vaccinia virus (strain Copenhagen)
        13         238          Pig
        14         201          Salmonella typhimurium
        15         193          Human cytomegalovirus (strain AD169)
        16         166          Bacteriophage T4
        17         154          Maize
        18         132          Rice
        19         113          Vaccinia virus (strain WR)
        20         112          Tobacco
                                Pea
        22         110          Wheat
        23         105          Staphylococcus aureus
        24         101          Slime mold (Dictyostelium discoideum)
        25          98          Sheep
        26          97          Barley
        27          90          Fission yeast (Schizosaccharomyces pombe)
        28          89          Spinach
        29          87          Pseudomonas aeruginosa
        30          85          Soybean
        31          84          Liverwort (Marchantia polymorpha)
        32          81          Agrobacterium tumefaciens
        33          80          Dog
                                Klebsiella pneumoniae
        35          79          Neurospora crassa



   A.3  Distribution of the sequences by size



               From   To  Number             From   To   Number
                  1-  50    1458             1001-1100      203
                 51- 100    2439             1101-1200      124
                101- 150    3466             1201-1300       97
                151- 200    2112             1301-1400       63
                201- 250    1757             1401-1500       49
                251- 300    1558             1501-1600       27
                301- 350    1395             1601-1700       24
                351- 400    1365             1701-1800       26
                401- 450    1052             1801-1900       29
                451- 500    1121             1901-2000       22
                501- 550     855             2001-2100        9
                551- 600     583             2101-2200       24
                601- 650     422             2201-2300       30
                651- 700     304             2301-2400       11
                701- 750     284             2401-2500       13
                751- 800     226             >2500           51
                801- 850     178
                851- 900     190
                901- 950     116
                951-1000     112


   Currently the ten largest sequences are:


                            RYNR$RABIT  5037 a.a.
                            RYNR$HUMAN  5032 a.a.
                            APB$HUMAN   4563 a.a.
                            APOA$HUMAN  4548 a.a.
                            POLG$BVDV   3988 a.a.
                            POLG$HCVA   3898 a.a.
                            POLG$HCVB   3898 a.a.
                            TRX$DROME   3759 a.a.
                            ACVA$PENCH  3746 a.a.
                            DMD$HUMAN   3685 a.a.



                         APPENDIX B: ON-LINE EXPERTS



   B.1  List of on-line experts for PROSITE and SWISS-PROT

Field of expertise            Name                Email address
----------------------------- ------------------  --------------------------
African swine fever virus     Yanez R.J.          ryanez@cbm2.uam.es
Alcohol dehydrogenases        Bengt P.            bengt@medfys.ki.se
Aldehyde dehydrogenases       Bengt P.            bengt@medfys.ki.se
Alpha-crystallins/HSP-20      Leunissen J.A.M.    jackl@caos.caos.kun.nl
Alpha-2-macroglobulins        Van Leuven F.       fred@blekul13.bitnet
Apolipoproteins               Boguski M.S.        boguski@ncbi.nlm.nih.gov
Arrestins                     Kolakowski L.F.Jr.  lfk@athena.mit.edu
Bacteriophage P4 proteins     Halling C.          chh9@midway.uchicago.edu
Beta-lactamases               Brannigan J.        jab5@vaxa.york.ac.uk
Chitinases                    Henrissat B.        cermav@frgren81.bitnet
CTF/NF-I                      Mermod N.           nmermod@clsuni51.bitnet
Cytochromes P450              Holsztynska E.J.    ela@netcom.uucp
                                                  netcom!ela@apple.com
EF-hand calcium-binding       Cox J.A.            cox@cgeuge52.bitnet
                              Kretsinger R.H.     rhk5i@virginia.bitnet
Eryf1-type zinc-fingers       Boguski M.S.        boguski@ncbi.nlm.nih.gov
fruR/lacI family HTH proteins Reizer J.           jreizer@ucsd.edu
Glucanases                    Henrissat B.        cermav@frgren81.bitnet
                              Beguin P.           phycel@pasteur.bitnet
G-protein coupled receptors   Chollet A.          chollet@clients.switch.ch
                              Attwood T.K.        bph6tka@biovax.leeds.ac.uk
GTPase-activating proteins    Boguski M.S.        boguski@ncbi.nlm.nih.gov
HMG1/2 and HMG-14/17          Landsman D.         landsman@ncbi.nlm.nih.gov
Inorganic pyrophosphatases    Kolakowski L.F.Jr.  lfk@athena.mit.edu
Integrases                    Roy P.H.            2020000@lavalvx1.bitnet
Phytochromes                  Partis M.D.         partis@gcri.afrc.ac.uk
Prokaryotic carbohydrate      Reizer J.           jreizer@ucsd.edu
            kinases
Protein kinases               Hanks S.            hanks@vuctrvax.bitnet
Restriction-modification      Bickle T.           bickle@urz.unibas.ch
            enzymes           Roberts R.J.        roberts@cshl.org
Ribosomal protein S3          Hallick R.          hallick%biotec@arizona.edu
Ribosomal protein S15         Ellis S.R.          srelli01@ulkyvm.bitnet
Ring-cleavage dioxygenases    Harayama S.         harayama@cgecmu51.bitnet
Sodium symporters             Reizer J.           jreizer@ucsd.edu
Subtilisin family proteases   Brannigan J.        jab5@vaxa.york.ac.uk
Thiol proteases               Turks B.            turk@ijs.ac.mail.yu
Thiol proteases inhibitors    Turks B.            turk@ijs.ac.mail.yu
TPR repeats                   Boguski M.S.        boguski@ncbi.nlm.nih.gov
Transit peptides              von Heijne G.       gunnar@cbts.sunet.se
Type-II membrane antigens     Levy S.             levy@cellbio.stanford.edu
Xylose isomerase              Jenkins J.          jenkins@frira.afrc.ac.uk


   B.2  Requirements to fulfill to become an on-line expert

   An expert should be a scientist working with a specific family (or
   families) of proteins (or specific domains) who would:

   a) Review the protein sequences in SWISS-PROT and the patterns/matrices
      in PROSITE relevant to their field of research.
   b) Agree to be contacted by people who have obtained new sequence(s)
      which seem to belong to "their" family (or families) of proteins.
   c) Have access to electronic mail and be willing to use it to send and
      receive data.

   If you are willing to be part of this scheme please contact Amos Bairoch
   at one of the following electronic mail addresses:

                           bairoch@cgecmu51.bitnet
                             bairoch@cmu.unige.ch
  

Swiss-Prot release 18.0

Published May 1, 1991


                    SWISS-PROT RELEASE 18.0 RELEASE NOTES


                               1. INTRODUCTION

   1.1  Evolution

   Release 18.0  of SWISS-PROT  contains 20772 sequence entries, comprising
   6'792'034 amino  acids abstracted from 20580 references. This represents
   an increase of 4% over release 17. The recent growth of the data bank is
   summarized below:

   Release    Date   Number of entries     Nb of amino acids

   3.0        11/86               4160               969 641
   4.0        04/87               4387             1 036 010
   5.0        09/87               5205             1 327 683
   6.0        01/88               6102             1 653 982
   7.0        04/88               6821             1 885 771
   8.0        08/88               7724             2 224 465
   9.0        11/88               8702             2 498 140
   10.0       03/89              10008             2 952 613
   11.0       07/89              10856             3 265 966
   12.0       10/89              12305             3 797 482
   13.0       01/90              13837             4 347 336
   14.0       04/90              15409             4 914 264
   15.0       08/90              16941             5 486 399
   16.0       11/90              18364             5 986 949
   17.0       02/91              20024             6 524 504
   18.0       05/91              20772             6 792 034
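
   The growth figure quoted above can be reproduced from the last two rows
   of the table. The following Python lines are a small worked example; the
   function name percent_increase is illustrative. Both the number of
   entries and the number of amino acids grew by about 4% between releases
   17.0 and 18.0.

```python
def percent_increase(previous, current):
    """Percentage growth from one release to the next."""
    return 100.0 * (current - previous) / previous


# Releases 17.0 -> 18.0, figures taken from the table above.
entries_growth = percent_increase(20024, 20772)
residues_growth = percent_increase(6524504, 6792034)
print("entries: %.1f%%, amino acids: %.1f%%"
      % (entries_growth, residues_growth))
```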


   1.2  Source of data

   Release 18.0  has been  updated using protein sequence data from release
   27.0 of  the PIR (Protein Identification Resource) protein data bank, as
   well as translation of nucleotide sequence data from release 26.0 of the
   EMBL Nucleotide Sequence Data Library.

   As an indication of the source of the sequence data in the SWISS-PROT
   data bank we list here the statistics concerning the DR (Databank
   Reference) pointer lines:

   Entries with pointer(s) to only PIR entri(es):           3863
   Entries with pointer(s) to only EMBL entri(es):          3435
   Entries with pointer(s) to both EMBL and PIR entri(es): 13001
   Entries with no pointer lines:                            473



      2. DESCRIPTION OF THE CHANGES MADE TO SWISS-PROT SINCE RELEASE 17


   2.1  Sequences and annotations

   About 780 sequences have been added since release 17, the sequence data
   of 139 existing entries has been updated, and the annotations of 3010
   entries have been revised. In particular we have used review articles
   to update the annotations of the following groups or families of
   proteins:

   -  Bacterial luciferase subunits
   -  Beta-amylases
   -  Carboxylesterases type-B
   -  Class-I aminoacyl-tRNA synthetases
   -  Clusterins
   -  Fumarate reductases / succinate dehydrogenases
   -  G-protein coupled receptors
   -  GTPase-activating proteins
   -  IMP dehydrogenase / GMP reductase
   -  Insulin-like growth factor binding proteins
   -  Leucine/isoleucine-binding proteins
   -  Lipocalins
   -  Manganese-dependent dipeptidases
   -  Nickel-dependent hydrogenases
   -  P-II proteins (glnB)
   -  Pectinesterases
   -  Phenylalanine and histidine ammonia-lyases
   -  Phosphoenolpyruvate carboxykinase (ATP and GTP)
   -  Polygalacturonases
   -  Prokaryotic molybdopterin oxidoreductases
   -  Receptors tyrosine kinase class IV (FGF receptors)
   -  Ribosomal proteins
   -  Signal recognition particle 54 Kd protein
   -  Somatostatins
   -  Thymosin beta-4 family
   -  Tyrosinases
   -  Wnt-1 family


   2.2  Change in the OS line

   As previously announced we have inverted the order of the information in
   the OS line. We switched from `English common name (Latin name)' to
   `Latin name (English common name)'. Example:

      OS   HUMAN (HOMO SAPIENS).

   has been changed to:

      OS   HOMO SAPIENS (HUMAN).
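
   Software holding OS lines in the previous layout could convert them as
   sketched below. The function name invert_os_line is hypothetical, and
   the pattern assumes a single pair of parentheses followed by a full
   stop, as in the example above. The two layouts cannot be told apart
   from their syntax, so the function must only be applied to lines known
   to be in the old format; applying it twice restores the original line.

```python
import re

# Old layout: "OS   COMMON NAME (LATIN NAME)."
_OLD_OS = re.compile(r"^OS   (?P<common>[^()]+) \((?P<latin>[^()]+)\)\.$")


def invert_os_line(line):
    """Swap the common and Latin names of an old-format OS line."""
    match = _OLD_OS.match(line)
    if match is None:
        return line  # no parenthesised name: nothing to swap
    return "OS   %s (%s)." % (match.group("latin"), match.group("common"))


print(invert_os_line("OS   HUMAN (HOMO SAPIENS)."))
```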


   2.3  Cross-references to the Escherichia coli gene-protein database

   We have  added cross-references  to the  Escherichia  coli  gene-protein
   database (2D gels spots) (for a description see: VanBogelen R.A., Hutton
   M.E., and Neidhardt F.C., in Electrophoresis (1990), 12:1131-1166).

   These cross-references  are present  in the  DR  lines.  The  data  bank
   identifier is  EC-2D-GEL, the  primary identifier  is the  2D  gel  spot
   alphanumeric designation,  and the  secondary identifier  is the  latest
   edition of  the data  bank that  we  have  used  to  derive  the  cross-
   reference. Example  of a  DR line  for the Escherichia coli gene-protein
   database:

      DR   EC-2D-GEL; G052.0; 3RD EDITION.


   2.4  Status of cross-references to PIR

   We have  continued adding cross-references to entries in the unannotated
   sections of  PIR (known  as PIR2  and PIR3);  currently we  have  cross-
   references to  13118 sequence  entries in PIR2/3 out of a total of 19051
   entries in those sections in release 27 of PIR.


   2.5  Documentation changes

   -  The EC2DTOSP.TXT  document is  an index  of  Escherichia  coli  Gene-
      protein database entries referenced in SWISS-PROT (see section 2.3).

   -  The SPEINDEX.TXT document is a species index.

   -  The JOURLIST.TXT document now indicates, when it exists, the
      six-character CODEN designation of the journals cited in SWISS-PROT
      and in PROSITE. Example of an entry in the JOURLIST.TXT file:

             Abbrev: EMBO J.
             Title : EMBO Journal
             ISSN  : 0261-4189
             Coden : EMJODG

   -  The SPECODES.TXT document is no longer distributed. The information
      contained in this document duplicated that found in the species
      index.


   2.6  Absence of the line-types: CA and CF

   We announced  in the  last two release notes that, starting with release
   18, the enzyme entries in SWISS-PROT would have two new line-types:

      CA   Description_of_catalytic_activity.
      CF   Description_of_cofactor.


   We ultimately decided not to implement this change, as it would create
   line-types specific to a subset of entries (enzymes) and would open the
   door to the creation of too many types of lines. We believe that the use
   of topics in the comment line is a better approach for the storage of
   such information.



                            3. FORTHCOMING CHANGES

   3.1  New line-types: RP and RC

   We plan to implement the following change in release 19. The current RN
   line will be replaced by three line types: a modified RN (Reference
   Number) line type containing just the reference number, a new RP
   (Reference Position) line type containing the extent of the work carried
   out by the authors of the reference, and a new RC (Reference Comment)
   line type containing comments relevant to the reference (strain, tissue,
   etc.). Three examples of the usage of these new lines are given below.

      RN   [1]
      RP   SEQUENCE FROM N.A., AND SEQUENCE OF 1-23.
      RC   STRAIN=K12;

      RN   [1]
      RP   SEQUENCE OF 24-56 AND 67-89.
      RC   STRAIN=BALB/C; TISSUE=BRAIN;

      RN   [2]
      RP   X-RAY CRYSTALLOGRAPHY 1.8 ANGSTROMS.

   -  Each reference block will continue to have exactly one RN line.
   -  There will  always be  a single  RP line  which will  be in free text
      format.
   -  As many  RC lines  as are needed to display the comments will appear;
      if a reference has no comment then the RC line will not appear.
   -  A precise syntax will be used to display the information that appears
      on the RC line.

   The syntax of the RC line is:

      RC   TOKEN=Text; TOKEN=Text;....

   Where the following tokens are already defined: MEDLINE, PLASMID,
   SPECIES, STRAIN, and TISSUE. Additional tokens will probably be added to
   this list.
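
   The rules listed above (exactly one RN line and one RP line per
   reference block, and any number of RC lines) can be checked
   mechanically. The Python sketch below is an illustration only. The
   function names are hypothetical, and the code assumes that every
   reference block starts with its RN line, as in the examples.

```python
def split_reference_blocks(lines):
    """Group the reference lines of an entry into blocks, one per RN."""
    blocks = []
    for line in lines:
        if line.startswith("RN"):
            blocks.append([line])      # an RN line opens a new block
        elif line.startswith("R") and blocks:
            blocks[-1].append(line)    # RP, RC, ... belong to that block
    return blocks


def is_valid_block(block):
    """One RN line, one RP line; RC lines may occur any number of times."""
    codes = [line[:2] for line in block]
    return codes.count("RN") == 1 and codes.count("RP") == 1


entry_lines = [
    "RN   [1]",
    "RP   SEQUENCE OF 24-56 AND 67-89.",
    "RC   STRAIN=BALB/C; TISSUE=BRAIN;",
    "RN   [2]",
    "RP   X-RAY CRYSTALLOGRAPHY 1.8 ANGSTROMS.",
]
for block in split_reference_blocks(entry_lines):
    print(block[0], is_valid_block(block))
```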


   3.2  MEDLINE unique identifiers

   Starting with  release 19  each journal  reference listed  in SWISS-PROT
   which exists  in the  MEDLINE bibliographic  data bank  will include the
   "Unique Identifier"  (UI) of that reference in MEDLINE. This information
   will be stored in the new RC line using the "MEDLINE" token. Example:

        RC   MEDLINE=90205618;

   It is  planned that,  in a few months, MEDLINE will add cross-references
   to SWISS-PROT.


                            4. ENZYME AND PROSITE

   4.1  The ENZYME data bank

   Release 5.0 of the ENZYME data bank is distributed along with release 18
   of SWISS-PROT. ENZYME release 5.0 contains information on 3072 enzymes.
   The data bank is complete and up to date. Until new enzyme nomenclature
   data is published, we only plan to update the SWISS-PROT pointers at
   each release of the protein sequence data bank, correct any errors, and
   complete the information concerning synonyms and cofactors using the
   literature.

   4.2  The PROSITE data bank

   Release 7.0 of the PROSITE data bank is distributed along with release
   18 of SWISS-PROT. Release 7.0 contains 441 documentation chapters that
   describe 508 different patterns. Since the last major release of
   PROSITE (release 6.0 of November 1990), 69 new chapters have been added
   and 163 chapters have been updated.


                            5. WE NEED YOUR HELP !

   We welcome feedback from our users. We would especially appreciate being
   notified if you find that sequences belonging to your field of expertise
   are missing from the data bank. We would also like to be notified about
   annotations that need updating, for example if the function of a protein
   has been clarified or if new post-translational information has become
   available.




                         APPENDIX A: SOME STATISTICS



   A.1  Amino acid composition

        A.1.1  Composition in percent for the complete data bank

   Ala (A) 7.65   Gln (Q) 4.09   Leu (L) 9.11   Ser (S) 7.08
   Arg (R) 5.23   Glu (E) 6.28   Lys (K) 5.85   Thr (T) 5.85
   Asn (N) 4.44   Gly (G) 7.12   Met (M) 2.32   Trp (W) 1.30
   Asp (D) 5.24   His (H) 2.27   Phe (F) 3.95   Tyr (Y) 3.21
   Cys (C) 1.83   Ile (I) 5.44   Pro (P) 5.09   Val (V) 6.49

   Asx (B) 0.01   Glx (Z) 0.01   Xaa (X) 0.03


        A.1.2  Classification of the amino acids by their frequency

   Leu, Ala, Gly, Ser, Val, Glu, Thr, Lys, Ile, Asp, Arg, Pro, Asn, Gln,
   Phe, Tyr, Met, His, Cys, Trp



   A.2  Distribution of the sequences by their organism of origin

   Total number of species represented in this release of SWISS-PROT: 2864

        A.2.1 Table of the frequency of occurrence of species

        Species represented 1x: 1266
                            2x:  520
                            3x:  284
                            4x:  172
                            5x:  121
                            6x:   94
                            7x:   73
                            8x:   37
                            9x:   58
                           10x:   29
                       11- 20x:  101
                       21-100x:   86
                         >100x:   23
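
   Each species falls into exactly one frequency class of table A.2.1, so
   the class counts should add up to the total number of species quoted
   above (2864). A short Python check is sketched here; the variable names
   are illustrative.

```python
# Number of species in each frequency class, copied from table A.2.1.
species_per_class = {
    "1x": 1266, "2x": 520, "3x": 284, "4x": 172, "5x": 121,
    "6x": 94, "7x": 73, "8x": 37, "9x": 58, "10x": 29,
    "11-20x": 101, "21-100x": 86, ">100x": 23,
}

total_species = sum(species_per_class.values())
print(total_species)  # 2864, the total given in section A.2
```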



        A.2.2  Table of the most represented species

    Number   Frequency          Species
         1        1715          Human
         2        1454          Escherichia coli
         3        1001          Mouse
         4         933          Rat
         5         675          Baker's yeast (Saccharomyces cerevisiae)
         6         486          Bovine
         7         388          Fruit fly (Drosophila melanogaster)
         8         354          Chicken
         9         282          Bacillus subtilis
        10         255          Rabbit
        11         251          Vaccinia virus (strain Copenhagen)
        12         246          African clawed frog (Xenopus laevis)
        13         230          Pig
        14         193          Human cytomegalovirus (strain AD169)
        15         191          Salmonella typhimurium
        16         160          Bacteriophage T4
        17         152          Maize
        18         128          Rice
        19         113          Vaccinia virus (strain WR)
        20         112          Tobacco
                                Pea
        22         106          Wheat
        23         102          Staphylococcus aureus
        24          96          Sheep
        25          93          Barley
                                Slime mold (Dictyostelium discoideum)
        27          84          Agrobacterium tumefaciens
                                Liverwort (Marchantia polymorpha)
                                Spinach
        30          83          Soybean
        31          80          Fission yeast (Schizosaccharomyces pombe)
        32          79          Pseudomonas aeruginosa
                                Klebsiella pneumoniae
        34          78          Dog
        35          77          Neurospora crassa



   A.3  Distribution of the sequences by size


               From   To  Number             From   To   Number
                  1-  50    1390             1001-1100      191
                 51- 100    2369             1101-1200      122
                101- 150    3358             1201-1300       94
                151- 200    2034             1301-1400       57
                201- 250    1666             1401-1500       46
                251- 300    1490             1501-1600       24
                301- 350    1311             1601-1700       23
                351- 400    1281             1701-1800       23
                401- 450     979             1801-1900       27
                451- 500    1057             1901-2000       22
                501- 550     811             2001-2100        9
                551- 600     555             2101-2200       23
                601- 650     390             2201-2300       28
                651- 700     288             2301-2400       11
                701- 750     270             2401-2500       13
                751- 800     214             >2500           50
                801- 850     162
                851- 900     172
                901- 950     112
                951-1000     100



   Currently the ten largest sequences are:


                            RYNR$RABIT  5037 a.a.
                            RYNR$HUMAN  5032 a.a.
                            APB$HUMAN   4563 a.a.
                            APOA$HUMAN  4548 a.a.
                            POLG$BVDV   3988 a.a.
                            POLG$HCVA   3898 a.a.
                            POLG$HCVB   3898 a.a.
                            TRX$DROME   3759 a.a.
                            ACVA$PENCH  3746 a.a.
                            DMD$HUMAN   3685 a.a.



                         APPENDIX B: ON-LINE EXPERTS



   B.1  List of on-line experts for PROSITE and SWISS-PROT


Field of expertise           Name                 Email address
---------------------------  -------------------  --------------------------
Alcohol dehydrogenases       Bengt P.             bengt@medfys.ki.se
Aldehyde dehydrogenases      Bengt P.             bengt@medfys.ki.se
Alpha-2-macroglobulins       Van Leuven F.        fred@blekul13.bitnet
Apolipoproteins              Boguski M.S.         boguski@ncbi.nlm.nih.gov
Arrestins                    Kolakowski L.F. Jr.  lfk@athena.mit.edu
Bacteriophage P4             Halling C.           chh9@midway.uchicago.edu
Beta-lactamases              Brannigan J.         jab5@vaxa.york.ac.uk
Chitinases                   Henrissat B.         cermav@frgren81.bitnet
CTF/NF-I                     Mermod N.            nmermod@clsuni51.bitnet
Cytochromes P450             Holsztynska E.J.     ela@netcom.uucp
                                                  netcom!ela@apple.com
EF-hand calcium-binding      Cox J.A.             cox@cgeuge52.bitnet
                             Kretsinger R.H.      rhk5i@virginia.bitnet
Eryf1-type zinc-fingers      Boguski M.S.         boguski@ncbi.nlm.nih.gov
Glucanases                   Henrissat B.         cermav@frgren81.bitnet
                             Beguin P.            phycel@pasteur.bitnet
G-protein coupled receptors  Chollet A.           chollet@clients.switch.ch
                             Attwood T.K.         bph6tka@biovax.leeds.ac.uk
GTPase-activating proteins   Boguski M.S.         boguski@ncbi.nlm.nih.gov
HMG1/2 and HMG-14/17         Landsman D.          landsman@ncbi.nlm.nih.gov
Inorganic pyrophosphatases   Kolakowski L.F. Jr.  lfk@athena.mit.edu
Integrases                   Roy P.H.             2020000@lavalvx1.bitnet
Phytochromes                 Partis M.D.          partis@gcri.afrc.ac.uk
Protein kinases              Hanks S.             hanks@vuctrvax.bitnet
Restriction-modification     Bickle T.            bickle@urz.unibas.ch
                             Roberts R.J.         roberts@cshl.org
Ring-cleavage dioxygenases   Harayama S.          harayama@cgecmu51.bitnet
Subtilisin family proteases  Brannigan J.         jab5@vaxa.york.ac.uk
Thiol proteases              Turks B.             turk@ijs.ac.mail.yu
Thiol proteases inhibitors   Turks B.             turk@ijs.ac.mail.yu
TPR repeats                  Boguski M.S.         boguski@ncbi.nlm.nih.gov
Transit peptides             von Heijne G.        gunnar@cbts.sunet.se
Type-II membrane antigens    Levy S.              levy@cellbio.stanford.edu
Xylose isomerase             Jenkins J.           jenkins@frira.afrc.ac.uk


   B.2  Requirements to fulfill to become an on-line expert

   An expert should be a scientist working with a specific family (or
   families) of proteins (or specific domains) who would:

   a) Review the protein sequences in SWISS-PROT and the patterns/matrices
      in PROSITE relevant to their field of research.
   b) Agree to be contacted by people who have obtained new sequence(s)
      which seem to belong to "their" family (or families) of proteins.
   c) Have access to electronic mail and be willing to use it to send and
      receive data.

   If you are willing to be part of this scheme please contact Amos Bairoch
   at one of the following electronic mail addresses:

                           bairoch@cgecmu51.bitnet
                           bairoch@cmu.unige.ch
  

Swiss-Prot release 17.0

Published February 1, 1991


                    SWISS-PROT RELEASE 17.0 RELEASE NOTES


                               1. INTRODUCTION

   1.1  Evolution

   Release 17.0  of SWISS-PROT  contains 20024 sequence entries, comprising
   6'524'504 amino  acids abstracted from 19591 references. This represents
   an increase of 9% over release 16. The recent growth of the data bank is
   summarized below:

   Release    Date   Number of entries     Nb of amino acids

   3.0        11/86               4160               969 641
   4.0        04/87               4387             1 036 010
   5.0        09/87               5205             1 327 683
   6.0        01/88               6102             1 653 982
   7.0        04/88               6821             1 885 771
   8.0        08/88               7724             2 224 465
   9.0        11/88               8702             2 498 140
   10.0       03/89              10008             2 952 613
   11.0       07/89              10856             3 265 966
   12.0       10/89              12305             3 797 482
   13.0       01/90              13837             4 347 336
   14.0       04/90              15409             4 914 264
   15.0       08/90              16941             5 486 399
   16.0       11/90              18364             5 986 949
   17.0       02/91              20024             6 524 504
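
   The 9% figure quoted above can be rechecked from the last two rows of
   the table. The following is a minimal Python sketch; the helper name
   percent_increase is ours and is not part of any SWISS-PROT software.

```python
def percent_increase(previous: int, current: int) -> float:
    """Return the relative growth from `previous` to `current`, in percent."""
    return 100.0 * (current - previous) / previous


# Figures taken from the growth table (releases 16.0 and 17.0).
entries_growth = percent_increase(18364, 20024)
residues_growth = percent_increase(5986949, 6524504)

print(f"entries:     +{entries_growth:.2f}%")
print(f"amino acids: +{residues_growth:.2f}%")
```

   Both values round to 9%, for the number of entries as well as for the
   number of amino acids.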


   1.2  Source of data

   Release 17.0  has been  updated using protein sequence data from release
   26.0 of  the PIR (Protein Identification Resource) protein data bank, as
   well as translation of nucleotide sequence data from release 25.0 of the
   EMBL Nucleotide Sequence Data Library.

   As an indication of the source of the sequence data in the SWISS-PROT
   data bank, we list here the statistics concerning the DR (Databank
   Reference) pointer lines:

   Entries with pointer(s) to only PIR entri(es):           3752
   Entries with pointer(s) to only EMBL entri(es):          3713
   Entries with pointer(s) to both EMBL and PIR entri(es): 12112
   Entries with no pointer lines:                            447




      2. DESCRIPTION OF THE CHANGES MADE TO SWISS-PROT SINCE RELEASE 16


   2.1  Sequences and annotations

   About 1700 sequences have been added since release 16, the sequence data
   of 312  existing entries  has been  updated and  the annotations of 3750
   entries have  been revised.  In particular we have used review articles
   to update  the annotations  of  the  following  groups  or  families  of
   proteins:

   -  6-phosphogluconate dehydrogenase
   -  Aconitase
   -  Alpha-2 macroglobulin family
   -  ATP synthase a subunit
   -  Catalases
   -  Chalcone resveratrol synthases
   -  Citrate synthase
   -  Dihydroorotase
   -  DNA polymerase family A
   -  Eukaryotic cobalamin-binding proteins
   -  Fatty acid desaturases
   -  Fungal  Zn(2)-Cys(6)   binuclear   cluster   domain   transcriptional
      activators
   -  Gamma-glutamyltranspeptidase
   -  Glutamine amidotransferases class-I
   -  Glutamine amidotransferases class-II
   -  Gonadotropin-releasing hormones
   -  Guanylate cyclases
   -  LIM-1 domain proteins
   -  Myotoxins
   -  Nucleoside diphosphate kinases
   -  Pathogenesis-related proteins BetvI family
   -  Peroxidases
   -  Polyprenyl synthetases
   -  Ribosomal proteins
   -  Rotamases (cyclophilin and FKBP)
   -  Small cytokines (PF4/IL-8 and MCAF/MIP-1 subfamilies)
   -  Sodium symporters
   -  Thiol-activated cytolysins


   2.2 Status of cross-references to PIR

   Older releases of SWISS-PROT contained cross-references only to entries
   present in the annotated section of PIR (currently known as PIR1). We
   have now started adding cross-references to entries in the unannotated
   sections of PIR (known as PIR2 and PIR3).



                            3. FORTHCOMING CHANGES

   3.1  New line-types: RC and RP

   We plan  to implement the following change in release 19; the current RN
   line will  be replaced  by three  line types:  a modified  RN (Reference
   Number) line  type containing  just  the  reference  number,  a  new  RC
   (Reference Comment)  line  type  containing  comments  relevant  to  the
   reference (strain, tissue, etc.), and a new RP (Reference Position) line
   type containing  the extent of the sequencing carried out by the authors
   of the  reference. Three  examples of  the usage  of these new lines are
   given below.

      RN   [1]
      RC   STRAIN=K12;
      RP   SEQUENCE FROM N.A., AND SEQUENCE OF 1-23.

      RN   [1]
      RC   STRAIN=BALB/C; TISSUE=BRAIN;
      RP   SEQUENCE OF 24-56 AND 67-89.

      RN   [2]
      RC   X-RAY CRYSTALLOGRAPHY=1.8 ANGSTROMS;


   Each reference block will continue to have exactly one RN line. As many
   RC lines as are needed to display the reference's comment will appear.
   If a reference has no comment then the RC line will not appear. As many
   RP lines as are needed to display the extent of sequencing carried out
   by the authors of the reference will appear. If a reference does not
   pertain directly to sequencing data then the RP line will not appear.
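
   For software developers, the planned layout could be read along the
   following lines. This is only a Python sketch under our own
   assumptions: the Reference class and parse_references function are
   hypothetical names, the line code is taken from columns 1-2 and the
   data from column 6 onwards, and the "KEY=VALUE;" structure of RC lines
   is inferred from the three examples above rather than from a formal
   specification.

```python
from dataclasses import dataclass, field


@dataclass
class Reference:
    number: int
    comments: dict = field(default_factory=dict)   # from RC lines
    positions: list = field(default_factory=list)  # from RP lines


def parse_references(lines):
    """Group RN/RC/RP lines into one Reference per RN line."""
    refs = []
    for line in lines:
        code, text = line[:2], line[5:].strip()
        if code == "RN":
            # "[1]" -> 1; every reference block starts with exactly one RN line.
            refs.append(Reference(number=int(text.strip("[]"))))
        elif code == "RC" and refs:
            # Assumed token format: KEY=VALUE; KEY=VALUE;
            for token in text.split(";"):
                token = token.strip()
                if token:
                    key, _, value = token.partition("=")
                    refs[-1].comments[key.strip()] = value.strip()
        elif code == "RP" and refs:
            refs[-1].positions.append(text)
    return refs


example = [
    "RN   [1]",
    "RC   STRAIN=BALB/C; TISSUE=BRAIN;",
    "RP   SEQUENCE OF 24-56 AND 67-89.",
    "RN   [2]",
    "RC   X-RAY CRYSTALLOGRAPHY=1.8 ANGSTROMS;",
]
references = parse_references(example)
```

   A reference without RC or RP lines simply ends up with an empty
   comment dictionary or an empty position list.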

   3.2  New line-types: CA and CF

   As we announced in the last two release notes, starting with release 18,
   the enzyme entries in SWISS-PROT will have two new line-types:

      CA   Description_of_catalytic_activity.
      CF   Description_of_cofactor.

   These lines will be automatically generated at each release of
   SWISS-PROT from the information stored in the ENZYME data bank. They
   will replace the 'CATALYTIC ACTIVITY' and 'COFACTOR' comment line (CC)
   topics. Example:

      CC   -!- CATALYTIC ACTIVITY: L-ASPARTATE + 2-OXOGLUTARATE =
               OXALOACETATE + L-GLUTAMATE.
      CC   -!- COFACTOR: PYRIDOXAL PHOSPHATE.

   will be changed to:

      CA   L-ASPARTATE + 2-OXOGLUTARATE = OXALOACETATE + L-GLUTAMATE.
      CF   PYRIDOXAL PHOSPHATE.
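
   The mapping between the old and new layout can be illustrated with a
   short Python sketch. It is an illustration only: in SWISS-PROT the CA
   and CF lines are to be generated from the ENZYME data bank, not by
   rewriting CC lines, and the function name convert_cc_topics is
   hypothetical. The sketch joins continuation lines without re-wrapping
   them.

```python
# Old CC topic name -> new line type (both spellings of the cofactor topic).
TOPIC_TO_LINE_TYPE = {
    "CATALYTIC ACTIVITY": "CA",
    "COFACTOR": "CF",
    "COFACTORS": "CF",
}


def convert_cc_topics(lines):
    """Rewrite CATALYTIC ACTIVITY / COFACTOR comment topics as CA / CF lines."""
    out = []
    current = None  # index in `out` of the CA/CF line being continued
    for line in lines:
        is_comment = line.startswith("CC") or line[:1].isspace()
        body = line[2:].strip() if line.startswith("CC") else line.strip()
        if is_comment and body.startswith("-!-"):
            topic, _, text = body[3:].partition(":")
            code = TOPIC_TO_LINE_TYPE.get(topic.strip())
            if code:
                out.append(f"{code}   {text.strip()}")
                current = len(out) - 1
            else:
                out.append(line)  # some other comment topic: keep unchanged
                current = None
        elif is_comment and current is not None:
            out[current] += " " + body  # continuation of a converted topic
        else:
            out.append(line)
            current = None
    return out


old_lines = [
    "CC   -!- CATALYTIC ACTIVITY: L-ASPARTATE + 2-OXOGLUTARATE =",
    "         OXALOACETATE + L-GLUTAMATE.",
    "CC   -!- COFACTOR: PYRIDOXAL PHOSPHATE.",
]
new_lines = convert_cc_topics(old_lines)
```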



   3.3  Change in the OS line

   As we announced in the last two release notes, starting with release 18,
   we will invert the order of the information in the OS line. Currently we
   have 'English common name (Latin name)'; we will switch to 'Latin name
   (English common name)'. Example:

      OS   HUMAN (HOMO SAPIENS).

   will be changed to:

      OS   HOMO SAPIENS (HUMAN).
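
   Programs that need to accept both layouts during the transition could
   apply a conversion such as the Python sketch below. The function name
   invert_os_line and the regular expression are our own assumptions; the
   sketch only handles a single-line OS entry with one pair of
   parentheses and returns any other line unchanged.

```python
import re

# "OS   <common name> (<Latin name>)." with exactly one pair of parentheses.
OS_PATTERN = re.compile(r"^OS   (?P<common>[^()]+?) \((?P<latin>[^()]+)\)\.$")


def invert_os_line(line):
    """Swap the common and Latin names of an old-style OS line."""
    match = OS_PATTERN.match(line)
    if match is None:
        return line  # no parenthesised name, or a layout we do not recognise
    return f"OS   {match['latin']} ({match['common']})."


converted = invert_os_line("OS   HUMAN (HOMO SAPIENS).")
```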


                            4. ENZYME AND PROSITE

   4.1  The ENZYME data bank

   Release 4.0 of the ENZYME data bank is distributed along with release 17
   of SWISS-PROT.  ENZYME release 4.0 contains information relative to 3072
   enzymes. The  data bank  is complete  and up  to date.  Until new enzyme
   nomenclature data is published, we only plan to update the SWISS-PROT
   pointers at each release of the protein sequence data bank, correct
   any errors, and complete the information concerning synonyms and
   cofactors using the literature.

   4.2  The PROSITE data bank

   Release 6.1  of the  PROSITE data bank is distributed along with release
   17 of SWISS-PROT. PROSITE release 6.1 does not really represent a new
   release; the only changes between releases 6.0 and 6.1 are updates to
   the pointers to the SWISS-PROT entries whose names were modified
   between releases 16 and 17. The next release of PROSITE (7.0) will be
   distributed with release 18.0 of SWISS-PROT.


                            5. WE NEED YOUR HELP !

   We welcome feedback from our users. We would especially appreciate it
   if you notified us of any sequences belonging to your field of
   expertise that are missing from the data bank. We would also like to be
   notified about annotations that need updating, for example when the
   function of a protein has been clarified or when new information on
   post-translational modifications has become available.



                         APPENDIX A: SOME STATISTICS



   A.1  Amino acid composition

        A.1.1  Composition in percent for the complete data bank

   Ala (A) 7.64   Gln (Q) 4.09   Leu (L) 9.09   Ser (S) 7.10
   Arg (R) 5.24   Glu (E) 6.28   Lys (K) 5.86   Thr (T) 5.86
   Asn (N) 4.44   Gly (G) 7.12   Met (M) 2.31   Trp (W) 1.30
   Asp (D) 5.24   His (H) 2.27   Phe (F) 3.95   Tyr (Y) 3.21
   Cys (C) 1.83   Ile (I) 5.44   Pro (P) 5.10   Val (V) 6.48

   Asx (B) 0.01   Glx (Z) 0.01   Xaa (X) 0.03


        A.1.2  Classification of the amino acids by their frequency

   Leu, Ala, Gly, Ser, Val, Glu, Thr, Lys, Ile, Asp, Arg, Pro, Asn, Gln,
   Phe, Tyr, Met, His, Cys, Trp



   A.2  Distribution of the sequences by their organism of origin

   Total number of species represented in this release of SWISS-PROT: 2630

        A.2.1 Table of the frequency of occurrence of species

        Species represented 1x: 1138
                            2x:  476
                            3x:  257
                            4x:  166
                            5x:  116
                            6x:   87
                            7x:   69
                            8x:   43
                            9x:   61
                           10x:   16
                       11- 20x:   98
                       21-100x:   81
                         >100x:   22



         A.2.2  Table of the most represented species

    Number   Frequency          Species
         1        1659          Human
         2        1376          Escherichia coli
         3         940          Mouse
         4         871          Rat
         5         643          Baker's yeast (Saccharomyces cerevisiae)
         6         453          Bovine
         7         376          Fruit fly (Drosophila melanogaster)
         8         331          Chicken
         9         252          Bacillus subtilis
        10         246          Rabbit
        11         236          African clawed frog (Xenopus laevis)
        12         232          Vaccinia virus (strain Copenhagen)
        13         220          Pig
        14         190          Human cytomegalovirus (strain AD169)
        15         176          Salmonella typhimurium
        16         160          Bacteriophage T4
        17         142          Maize
        18         124          Rice
        19         111          Tobacco
        20         108          Vaccinia virus (strain WR)
        21         104          Pea
        22         102          Wheat
        23          96          Staphylococcus aureus
        24          91          Slime mold (Dictyostelium discoideum)
        25          86          Barley
        26          85          Sheep
        27          84          Liverwort (Marchantia polymorpha)
        28          83          Soybean
        29          82          Spinach
        30          73          Caenorhabditis elegans
                    73          Neurospora crassa



   A.3  Distribution of the sequences by size

               From   To  Number             From   To   Number
                  1-  50    1174             1001-1100      162
                 51- 100    2099             1101-1200      105
                101- 150    3021             1201-1300       88
                151- 200    1820             1301-1400       52
                201- 250    1468             1401-1500       45
                251- 300    1316             1501-1600       20
                301- 350    1147             1601-1700       22
                351- 400    1131             1701-1800       20
                401- 450     867             1801-1900       22
                451- 500     959             1901-2000       21
                501- 550     718             2001-2100        9
                551- 600     475             2101-2200       22
                601- 650     336             2201-2300       24
                651- 700     258             2301-2400       11
                701- 750     243             2401-2500       11
                751- 800     183             >2500           37
                801- 850     144
                851- 900     150
                901- 950      95
                951-1000      89



   Currently the ten largest sequences are:

                            RYNR$RABIT  5037 a.a.
                            APB$HUMAN   4563 a.a.
                            APOA$HUMAN  4548 a.a.
                            POLG$BVDV   3988 a.a.
                            POLG$HCVA   3898 a.a.
                            TRX$DROME   3759 a.a.
                            ACVA$PENCH  3746 a.a.
                            DMD$HUMAN   3685 a.a.
                            DMD$CHICK   3660 a.a.
                            POLG$KUNJM  3433 a.a.

  

Swiss-Prot release 16.0

Published November 1, 1990


                    SWISS-PROT RELEASE 16.0 RELEASE NOTES


                               1. INTRODUCTION

   1.1  Evolution

Release 16.0 of SWISS-PROT contains 18364 sequence entries, comprising 5'986'949
amino acids abstracted from 17763 references.  This represents an increase of 9%
over release 15.  The recent growth of the data bank is summarized below:

   Release    Date   Number of entries     Nb of amino acids

   3.0        11/86               4160               969 641
   4.0        04/87               4387             1 036 010
   5.0        09/87               5205             1 327 683
   6.0        01/88               6102             1 653 982
   7.0        04/88               6821             1 885 771
   8.0        08/88               7724             2 224 465
   9.0        11/88               8702             2 498 140
   10.0       03/89              10008             2 952 613
   11.0       07/89              10856             3 265 966
   12.0       10/89              12305             3 797 482
   13.0       01/90              13837             4 347 336
   14.0       04/90              15409             4 914 264
   15.0       08/90              16941             5 486 399
   16.0       11/90              18364             5 986 949


More than 1400 sequences have been added since release 15, the sequence data  of
271  existing  entries has been updated and the annotations of 3500 entries have
been revised.  In particular  we  have  used  review  articles  to  update  the
annotations of the following groups or families of proteins:

   -  Alpha and beta adrenergic receptors
   -  Arrestins
   -  Chromogranins / secretogranins
   -  CTF/NF-I family
   -  ClpP proteases
   -  ets family
   -  GABA(A) receptors
   -  Gram-positive cocci surface proteins
   -  Hexokinases
   -  Integrins alpha and beta chains
   -  NMePhe pili proteins
   -  p53 proteins
   -  Poly(ADP-ribose) polymerase
   -  Profilins
   -  S-Adenosylmethionine synthetases
   -  Site-specific recombinases
   -  Synaptobrevins
   -  Type-II membrane antigens
   -  UDP-glucuronosyl transferases
   -  Uteroglobin family
   -  LBP / BPI / CETP family



2  DATA SOURCES

Release 16.0 has been updated using protein sequence data from release  25.0  of
the  PIR  (Protein  Identification  Resource)  protein  data  bank,  as  well as
translation of nucleotide sequence data from release 24.0 of the EMBL Nucleotide
Sequence Data Library.

As an indication of the source of the sequence data in the SWISS-PROT data bank,
we list here the statistics concerning the DR (Databank Reference) pointer
lines:

   Entries with pointer(s) to only PIR entri(es):            3335
   Entries with pointer(s) to only EMBL entri(es):           5468
   Entries with pointer(s) to both EMBL and PIR entri(es):   8908
   Entries with no pointer lines:                             653



3  CHANGES AT THIS RELEASE

3.1  Cross-References To MIM

We have finished adding cross-references to all human protein  sequence  entries
which are represented in the latest edition of the MIM (Mendelian Inheritance in
Man) book [1].

There are currently 842 SWISS-PROT entries that have cross-references to one or
more MIM catalog numbers.

A new document file, called MIMTOSP.DOC, is provided with SWISS-PROT.  It is a
sorted list of the MIM catalog entries cross-referenced in SWISS-PROT and the
corresponding protein sequence entry names.



4  FORTHCOMING CHANGES

We plan to implement the following changes in release 18.  These changes were
announced for release 16, but we are postponing their application so as to give
sequence analysis software developers more time to update their packages.



4.1  New Linetypes:  CA and CF

As we announced in the last release notes, the enzyme entries in SWISS-PROT
will have two new line-types:

      CA   Description_of_catalytic_activity.
      CF   Description_of_cofactor.

These lines will be automatically generated at each release of SWISS-PROT from
the information stored in the ENZYME data bank.  They will replace the
'CATALYTIC ACTIVITY' and 'COFACTOR' comment line (CC) topics.  For example:

      CC   -!- CATALYTIC ACTIVITY: L-ASPARTATE + 2-OXOGLUTARATE =
               OXALOACETATE + L-GLUTAMATE.
      CC   -!- COFACTOR: PYRIDOXAL PHOSPHATE.

will be changed to:

      CA   L-ASPARTATE + 2-OXOGLUTARATE = OXALOACETATE + L-GLUTAMATE.
      CF   PYRIDOXAL PHOSPHATE.

--------------------------------------------------------------------------------
[1]  McKusick Victor A., Mendelian Inheritance in Man, Catalogs of autosomal
     dominant, autosomal recessive, and X-linked phenotypes, Ninth edition,
     Johns Hopkins University Press, Baltimore, (1990).



4.2  OS Line Format

We will invert the order of the information in the OS line.  Currently  we  have
"English  common  name  (Latin  name)";  we  will switch to "Latin name (English
common name)".  For example:

     OS   HUMAN (HOMO SAPIENS).

will be changed to:

     OS   HOMO SAPIENS (HUMAN).



5  ENZYME AND PROSITE DATABASES

Release 3.0 of the ENZYME data bank is distributed  along  with  release  16  of
SWISS-PROT.   ENZYME  release  3.0  contains information relative to 3071 enzyme
entries.  The data bank is complete and up to date.  Until new enzyme
nomenclature data is published, we only plan to update the SWISS-PROT pointers
at each release of the protein sequence data bank, correct any errors, and
complete the information concerning synonyms and cofactors using the literature.

Release 6.0 of the PROSITE data bank is distributed along with release 16 of
SWISS-PROT.  PROSITE release 6.0 contains 375 documentation chapters that
describe 433 different patterns.  Since release 5.1, 77 new chapters have been
added and 131 have been updated.



6  DISTRIBUTION MEDIA

Data is available on magnetic tape, TK50 cassette and CD-ROM.  This  section  of
the  release  notes  applies to tape and TK50 cassette only; CD-ROM releases are
accompanied by their own release notes which detail the file  organisation  used
on CD.



6.1  Tape Formats

The distribution tapes are 9-track industry standard magnetic tapes.  Each  file
consists  of  fixed-length  80  byte  records,  padded  with  trailing blanks as
appropriate (except for VMS Backup  format  tapes  which  have  variable  length
records).   Tape  format details (density, blocksize, label type, character set)
are attached to each tape.

In many formats, a release requires more than one  tape  volume.   In  order  to
support  sequential volume serial numbers for multi-volume tape sets, the volume
labels are EMBL01 for the first tape, EMBL02 for the second tape, and so on.

VMS Backup format tapes (and all TK50 cassettes) contain the files listed below,
in the order shown, as a single save set called SWISS15.BCK.



6.2  Documentation

The documentation files on tape (those ending with a file extension of .DOC) are
designed to be easily printable.  As with all other tape files they have a fixed
record length of 80 bytes.  The page length of 63 lines per page was  chosen  so
that the pages will fit both on DIN A4 paper and on American 8-1/2" x 11" paper.

Page throws are indicated by lines with  the  six  character  string  <PAGE>  in
positions  1-6,  and  nothing  else.  If you wish to print any of these files we
suggest you copy them down onto disk, use your local  editor  to  replace  every
occurrence of <PAGE> in columns 1-6 by a formfeed (or whatever is appropriate to
force a page throw on your printer), and then print them.
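
Instead of using an editor, the replacement can be scripted.  The Python sketch
below is one possible approach; the function name replace_page_throws is
hypothetical, and the lines are assumed to have been read without their line
terminators.  Trailing blanks are ignored because tape records are padded to 80
bytes.

```python
def replace_page_throws(lines, page_break="\f"):
    """Replace records consisting solely of <PAGE> by a page-break character.

    A record counts as a page throw only if <PAGE> sits in columns 1-6 and
    nothing but padding blanks follows it.
    """
    return [page_break if line.rstrip() == "<PAGE>" else line for line in lines]


records = [
    "SWISS-PROT RELEASE NOTES",
    "<PAGE>" + " " * 74,          # 80-byte padded record
    "text mentioning <PAGE> in the middle of a line",
]
printable = replace_page_throws(records)
```

Pass a different page_break value if your printer expects something other than
a formfeed.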



6.3  Release 16 Files

The distribution tape(s) contain the files shown below,  in  the  order  listed.
Where more than one tape is required, subsequent volumes will continue where the
preceding volume left off.

   File Number    File Name       Description                      #Records
   -----------    ------------    -----------------------------    --------
             1    CONTENTS.DOC    Tape Contents (this table)             64
             2    SWISSPRT.USR    User Manual                          1830
             3    RELNOTES.DOC    Release Notes (this document)         895
             4    SPECIES.NDX     Species Index                       10297
             5    KEYWORD.NDX     Keyword Index                       20291
             6    AUTHOR.NDX      Author Index                        88407
             7    SHORTDIR.NDX    Short Directory Index               38209
             8    ACNUMBER.NDX    Accession Number Index              18543
             9    CITATION.NDX    Citation Index                      36092
            10    SPIDCODE.NDX    Species ID Code Index                3285
            11    EMBLTOSP.DOC    EMBL/SWISS-PROT Xreferences         17022
            12    ORGCODES.DOC    Organism Code List                   3518
            13    MIMTOSP.DOC     MIM/SWISS-PROT Xreferences           1022
            14    PDBTOSP.DOC     PDB/SWISS-PROT Xreferences            574
            15    JOURLIST.DOC    Journal Abbreviation List            1470
            16    DATASUB.TXT     Data Submission Form                  315
            17    SEQ.DAT         SWISS-PROT Sequence Entries        602162
            18    PROSITE.USR     PROSITE Database User Manual          915
            19    PROSITE.LIS     PROSITE Entry List                    508
            20    PROSITE.DOC     PROSITE Entry Documentation         15669
            21    PROSITE.DAT     PROSITE Database Entries             8229
            22    ENZYME.USR      ENZYME Database User Manual           487
            23    ENZYME.DAT      ENZYME Database                     20871


7  INDEX FILE FORMATS

The index key of each index file (keywords, authors, citations, etc.) is  sorted
alphabetically;  the  names  of  all entries containing the index key are listed
alphabetically after the key.  Each entry name is  accompanied  by  its  primary
accession number.

Except for the short directory, accession number and species  id  code  indices,
all  index  files have the same layout:  each value of the index key begins on a
new line in column 1, and the associated entry names begin  on  the  next  line.
Lines containing entry names are in fixed-format, laid out as follows:

                     Columns   Description
                     -------   ---------------------------
                     14-23     entry name (left-justified)
                     29-34     primary accession number

                     36-45     entry name (left-justified)
                     51-56     primary accession number

                     58-67     entry name (left-justified)
                     73-78     primary accession number


Up to three entry names fit on each such line; if a given  index  key  has  more
than  three  entries associated with it, additional lines are used (with exactly
the same layout).  This index file format is  identical  to  that  of  the  EMBL
Nucleotide Sequence database.
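
The column layout above can be read with simple string slicing.  The following
Python sketch is an illustration under our own assumptions: the names
ENTRY_SLOTS and parse_index are hypothetical, and the input is assumed to be
the index lines without the ruler.  The 1-based columns of the table are
converted to 0-based slice bounds.

```python
# (name start, name end, accession start, accession end) as 0-based slices
# for columns 14-23/29-34, 36-45/51-56 and 58-67/73-78.
ENTRY_SLOTS = ((13, 23, 28, 34), (35, 45, 50, 56), (57, 67, 72, 78))


def parse_index(lines):
    """Map each index key to a list of (entry name, primary accession)."""
    index = {}
    key = None
    for raw in lines:
        line = raw.rstrip()
        if not line:
            continue
        if not line[0].isspace():
            key = line  # index keys begin in column 1
            index.setdefault(key, [])
        elif key is not None:
            for n0, n1, a0, a1 in ENTRY_SLOTS:
                name = line[n0:n1].strip()
                if name:
                    index[key].append((name, line[a0:a1].strip()))
    return index


sample = [
    "CHILEAN POTATO-TREE (SOLANUM CRISPUM)",
    "             PLAS$SOLCR     P00297",
    "CHIMPANZEE (PAN TROGLODYTES)",
    "             CD4$PANTR      P16004 HA1A$PANTR     P13748 HA1B$PANTR     P13749",
]
species_index = parse_index(sample)
```

Because continuation lines use exactly the same layout, a key with more than
three entries needs no special handling.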



7.1  Species Index

This file lists all species which appear in the database.  It is sorted
alphabetically on (English) common name.  The Latin genus and species are
listed if present in the database entries.  Mitochondrion and chloroplast
sequences appear under separate index keys, immediately after the related
nuclear sequences.  An excerpt from the species index file is given below (the
ruler is presented for your convenience - it does not appear in the index file):

1       10        20        30        40        50        60        70        80
+--------+---------+---------+---------+---------+---------+---------+---------+
CHILEAN POTATO-TREE (SOLANUM CRISPUM)
             PLAS$SOLCR     P00297
CHIMPANZEE IMMUNODEFICIENCY VIRUS (CIV) (SIV(CPZ))
             ENV$SIVCZ      P17281 GAG$SIVCZ      P17282 NEF$SIVCZ      P17664
             POL$SIVCZ      P17283 REV$SIVCZ      P17280 TAT$SIVCZ      P17285
             VIF$SIVCZ      P17284 VPR$SIVCZ      P17287 VPU$SIVCZ      P17286
CHIMPANZEE (PAN SATYRUS)
             HBA3$PANSA     P01935
CHIMPANZEE (PAN TROGLODYTES)
             CD4$PANTR      P16004 HA1A$PANTR     P13748 HA1B$PANTR     P13749
             HA1C$PANTR     P16209 HA1D$PANTR     P16210 HA1E$PANTR     P16215
+--------+---------+---------+---------+---------+---------+---------+---------+
1       10        20        30        40        50        60        70        80



7.2  Keyword Index

This file lists all keywords which appear in the database (on the KW lines).  It
is  sorted alphabetically on keyword.  An excerpt from the keyword index file is
given below (the ruler is presented for your convenience - it does not appear in
the index file):

1       10        20        30        40        50        60        70        80
+--------+---------+---------+---------+---------+---------+---------+---------+
ACETYLCHOLINE RECEPTOR INHIBITOR
             CXA1$CONGE     P01519 CXA1$CONMA     P01521 CXA1$CONST     P15471
             CXA2$CONGE     P01520
ACIDIC PROTEIN
             143E$BOVIN     P11576 B23$RAT        P13084 B231$HUMAN     P08693
             B232$HUMAN     P06748 BAT$HALHA      P13260 CALQ$CANFA     P12637
             CALQ$RABIT     P07221 CENB$HUMAN     P07199 CMGA$HUMAN     P10645
             CMGA$RAT       P10354 GRPE$ECOLI     P09372 MK16$YEAST     P10962
             NFL$BOVIN      P02548 NO38$CHICK     P16039 NU38$XENLA     P07222
+--------+---------+---------+---------+---------+---------+---------+---------+
1       10        20        30        40        50        60        70        80



7.3  Author Index

This file lists all author names which appear in citations (on the RA lines)  in
the database.  It is sorted alphabetically on name.  Names are presented as they
appear in the database entries (i.e.  as cited in  publications)  -  we  do  not
attempt  to  handle  multiple  surname spellings, or different initials, for the
same author.  An excerpt from the author index file is given below (the ruler is
presented for your convenience - it does not appear in the index file):

1       10        20        30        40        50        60        70        80
+--------+---------+---------+---------+---------+---------+---------+---------+
AAN F.
             PT3G$SALTY     P02908
AARDEN L.A.
             IL6$HUMAN      P05231
AARONSON R.P.
             HEMA$INCCA     P03465
AARONSON S.
             PGDS$HUMAN     P16234
AARONSON S.A.
             3ORF$EIAV1     P11305 DBL$HUMAN      P10911 ENV$EIAV1      P11306
             ENV$SMSAV      P03384 GAG$AVISN      P03342 GAG$MSVMO      P03334
+--------+---------+---------+---------+---------+---------+---------+---------+
1       10        20        30        40        50        60        70        80



7.4  Short Directory Index

This file contains summary  information  about  all  entries  in  the  database,
including  a  brief description of the entry, its sequence length, molecule type
and data management class.  The file is sorted  alphabetically  on  entry  name.
The lines are fixed-format, laid out as follows:


     Columns   Field Name          Description
     -------   ---------------     -------------------------------------------
     01-10     entry name          left-justified
     14-14     data class          s = standard
                                   u = unannotated
                                   p = preliminary
                                   r = unreviewed
     16-18     molecule type       PRT (protein)
     20-25     sequence length     right-justified
     27-80     description         left-justified


If an entry's description occupies more than 54 characters (cols 27-80), it will
be  continued  onto  one or more continuation lines.  Continuation lines contain
description text (left-justified) in cols  27-80;  cols  01-26  are  blank.   An
excerpt  from  the  short  directory  index  file is given below (the ruler is
presented for your convenience - it does not appear in the index file):

1       10        20        30        40        50        60        70        80
+--------+---------+---------+---------+---------+---------+---------+---------+
104K$THEPA   s PRT    924 104 KD MICRONEME-RHOPTRY ANTIGEN - THEILERIA PARVA
10KA$MYCTU   s PRT    100 BCG-A HEATSHOCK PROTEIN (10 KD ANTIGEN) -
                          MYCOBACTERIUM TUBERCULOSIS
10KS$HUMAN   s PRT     91 10 KD SECRETORY PROTEIN PRECURSOR - HUMAN (HOMO
                          SAPIENS)
10KS$RAT     s PRT     18 10 KD SECRETORY PROTEIN (CC10) (FRAGMENT) - RAT
                          (RATTUS NORVEGICUS)
110K$PLAKN   s PRT    296 110 KD ANTIGEN (PK110) (FRAGMENT) - PLASMODIUM
                          KNOWLESI
11KD$ADE02   s PRT     79 11 KD CORE PROTEIN PRECURSOR (LATE L2 MU CORE PROTEIN)
                          (PROTEIN X) - ADENOVIRUS TYPE 2
11SB$CUCMA   s PRT    480 11-S GLOBULIN BETA SUBUNIT PRECURSOR - PUMPKIN
                          (CUCURBITA MAXIMA)
+--------+---------+---------+---------+---------+---------+---------+---------+
1       10        20        30        40        50        60        70        80
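
The short directory layout, including continuation lines, could be read as in
the Python sketch below.  The class DirectoryEntry and the function
parse_short_directory are hypothetical names chosen for this illustration; the
input is assumed to exclude the ruler lines.

```python
from dataclasses import dataclass


@dataclass
class DirectoryEntry:
    name: str           # cols 01-10
    data_class: str     # col  14
    molecule_type: str  # cols 16-18
    length: int         # cols 20-25, right-justified
    description: str    # cols 27-80, plus any continuation lines


def parse_short_directory(lines):
    """Parse short directory lines, folding continuation lines together."""
    entries = []
    for raw in lines:
        line = raw.rstrip()
        if not line:
            continue
        if line[:26].strip() == "":
            # Continuation line: cols 01-26 are blank, text is in cols 27-80.
            if entries:
                entries[-1].description += " " + line[26:80].strip()
            continue
        entries.append(DirectoryEntry(
            name=line[0:10].strip(),
            data_class=line[13:14],
            molecule_type=line[15:18],
            length=int(line[19:25]),
            description=line[26:80].strip(),
        ))
    return entries


sample = [
    "10KS$HUMAN   s PRT     91 10 KD SECRETORY PROTEIN PRECURSOR - HUMAN (HOMO",
    "                          SAPIENS)",
]
directory = parse_short_directory(sample)
```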



7.5  Accession Number Index

This  file  lists  all  accession  numbers  in  the  database.   It  is   sorted
alphabetically  on  accession  number.  Each accession number is followed by the
name and primary accession number of every entry in which it occurs.

The lines are fixed-format, laid out exactly the same as the other index files;
the only difference is that the index key (accession number) appears on the same
line (in cols 1-6) as the list of entries which contain the key.  In  the  other
index  files,  the  index  key  appears  on  a  line by itself.  This index key,
however, is short enough to fit on the same line as the  entries,  and  we  have
done this to save space.

Accession numbers which have been deleted from the database also appear in  this
index, containing the word DELETED (left-justified) in the entry name field.

An excerpt from the accession number index file is given  below  (the  ruler  is
presented for your convenience - it does not appear in the index file):

1       10        20        30        40        50        60        70        80
+--------+---------+---------+---------+---------+---------+---------+---------+
P00836       ATPE$HORVU     P00836
P00837       ATPG$ECOLI     P00837
P00838       ATPG$ECOLI     P00837
P00839       ATPL$BOVIN     P00839 ATPM$BOVIN     P07926
P00840       ATP9$MAIZE     P00840
P00841       ATP9$YEAST     P00841
P00842       ATPL$NEUCR     P00842
+--------+---------+---------+---------+---------+---------+---------+---------+
1       10        20        30        40        50        60        70        80
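
One use of this index is to resolve any accession number, primary or secondary,
to the entries that carry it.  The Python sketch below assumes the layout
described above (key in cols 1-6, entries in the usual three slots); the names
parse_accession_index and ENTRY_SLOTS are hypothetical, and the accession
number P99999 in the example is invented to show a deleted record.

```python
# (name start, name end, accession start, accession end) as 0-based slices.
ENTRY_SLOTS = ((13, 23, 28, 34), (35, 45, 50, 56), (57, 67, 72, 78))


def parse_accession_index(lines):
    """Return (index, deleted): entries per accession number, and deleted numbers."""
    index = {}
    deleted = set()
    key = None
    for raw in lines:
        line = raw.rstrip()
        if not line:
            continue
        if line[:6].strip():
            key = line[:6]  # the key shares the line with its entries
            index.setdefault(key, [])
        if key is None:
            continue
        for n0, n1, a0, a1 in ENTRY_SLOTS:
            name = line[n0:n1].strip()
            if not name:
                continue
            if name == "DELETED":
                deleted.add(key)
            else:
                index[key].append((name, line[a0:a1].strip()))
    return index, deleted


sample = [
    "P00837       ATPG$ECOLI     P00837",
    "P00838       ATPG$ECOLI     P00837",
    "P00839       ATPL$BOVIN     P00839 ATPM$BOVIN     P07926",
    "P99999       DELETED",  # invented example of a deleted accession number
]
accessions, removed = parse_accession_index(sample)
```

In the sample, P00838 is a secondary accession number that resolves to the
entry ATPG$ECOLI, whose primary accession number is P00837.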



7.6  Citation Index

This file lists all journal citations which  appear  in  the  database.   It  is
sorted  alphabetically  on citation.  An excerpt from the citation index file is
given below (the ruler is presented for your convenience - it does not appear in
the index file):

1       10        20        30        40        50        60        70        80
+--------+---------+---------+---------+---------+---------+---------+---------+
ANN. GENET. SEL. ANIM. 4:515-521(1972)
             CASK$BOVIN     P02668
ANN. INST. PASTEUR IMMUNOL. 127C:261-271(1976)
             KV3F$HUMAN     P01624
ANN. INST. PASTEUR IMMUNOL. 132D:77-88(1981)
             HV41$MOUSE     P01811
ANN. N.Y. ACAD. SCI. 165:360-377(1969)
             HBB$ATEGE      P02034
ANN. N.Y. ACAD. SCI. 241:436-438(1974)
             HBB$RABIT      P02057
+--------+---------+---------+---------+---------+---------+---------+---------+
1       10        20        30        40        50        60        70        80



7.7  Species ID Code Index

This file lists all species id codes which appear in the database.  It is sorted
alphabetically  on species id code.  Each code is followed by a (sorted) list of
all sequence entry codes which are associated with the  species  id  code.   The
sequence  entry  code is the first component of each entry name (before the "$")
and the species id code is the second component of each entry  name  (after  the
"$").   For  example, the entry called COA3$ADEA2 has a species id code of ADEA2
and a sequence entry code of COA3.

Each species id code starts on a  new  line,  occupying  columns  1-5,  and  the
associated  sequence entry codes (each up to 4 characters long) start in columns
10, 15, 20, 25, 30, 35, ...  70, 75.  If there are more than 14  sequence  entry
codes  for  a  given species id, as many continuation lines as necessary will be
used, with columns 1-9 left blank.

An excerpt from the species id code index file is  given  below  (the  ruler  is
presented for your convenience - it does not appear in the index file):

1       10        20        30        40        50        60        70        80
+--------+---------+---------+---------+---------+---------+---------+---------+
ACHKL    CALM
ACHLY    API
ACIBA    KKA
ACICA    BEND CATA CATM DHGA DHGB ELH2 MURO PQQ1 PQQ2 PQQ3 PQQ5 PQQL PQQR TRPB
         TRPC TRPD TRPF TRPG
ACIFE    HGDA HGDB YHGD
ACIGL    ASPQ
ACIGU    PRT1
ACISP    CYMO
+--------+---------+---------+---------+---------+---------+---------+---------+
1       10        20        30        40        50        60        70        80
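
The layout described above can be read with a few lines of Python.  This is a
minimal sketch, not a supported tool: the function names are hypothetical, and
it assumes that the lines have already been read from the index file without
the ruler.

```python
# Illustrative sketch; function names are hypothetical.
def parse_spidcode_index(lines):
    """Map each species id code to its list of sequence entry codes."""
    index = {}
    current = None
    for raw in lines:
        line = raw.rstrip("\n")
        if not line.strip():
            continue
        species = line[0:5].strip()          # cols 1-5: species id code
        if species:                          # a new species id code starts here
            current = species
            index[current] = []
        elif current is None:
            raise ValueError("continuation line before any species id code")
        # Entry codes start in cols 10, 15, ..., 75; each is up to 4 chars wide.
        for start in range(9, 75, 5):
            code = line[start:start + 4].strip()
            if code:
                index[current].append(code)
    return index


def entry_names(index):
    """Rebuild full entry names such as COA3$ADEA2 from the two components."""
    return [f"{seq}${species}" for species, codes in index.items() for seq in codes]


sample = [
    "ACIBA".ljust(9) + "KKA",
    "ACICA".ljust(9) + "BEND CATA CATM DHGA DHGB ELH2 MURO PQQ1 PQQ2 PQQ3 "
                       "PQQ5 PQQL PQQR TRPB",
    " " * 9 + "TRPC TRPD TRPF TRPG",         # continuation line, cols 1-9 blank
    "ACIFE".ljust(9) + "HGDA HGDB YHGD",
]
print(entry_names(parse_spidcode_index(sample)))
```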



8  WE NEED YOUR HELP !

We welcome  any  feedback  from  our  users.   We  would  especially  appreciate
information  about  any sequences belonging to your field of expertise which are
missing from the database.  We also would like to be notified about  annotations
which  should be updated (e.g.  if the function of a protein has been clarified,
or if new post-translational information has become available).



                                   APPENDIX A

                                SOME STATISTICS



A.1  AMINO ACID COMPOSITION

Composition in percent for the complete database:

   Ala (A) 7.67   Gln (Q) 4.09   Leu (L) 9.09   Ser (S) 7.09
   Arg (R) 5.25   Glu (E) 6.29   Lys (K) 5.85   Thr (T) 5.86
   Asn (N) 4.42   Gly (G) 7.14   Met (M) 2.31   Trp (W) 1.31
   Asp (D) 5.23   His (H) 2.27   Phe (F) 3.95   Tyr (Y) 3.20
   Cys (C) 1.83   Ile (I) 5.40   Pro (P) 5.11   Val (V) 6.48
   Asx (B) 0.01   Glx (Z) 0.01   Xaa (X) 0.03

Classification of the amino acids by their frequency:

   Leu, Ala, Gly, Ser, Val, Glu, Thr, Lys, Ile, Arg, Asp, Pro, Asn, Gln,
   Phe, Tyr, Met, His, Cys, Trp



A.2  DISTRIBUTION OF SEQUENCES BY SPECIES

Total number of species represented in this release of the database:  2492.

        Species represented 1x: 1097
                            2x:  459
                            3x:  238
                            4x:  153
                            5x:  114
                            6x:   86
                            7x:   52
                            8x:   37
                            9x:   51
                           10x:   17
                       11- 20x:   97
                       21-100x:   72
                         >100x:   19


Table of the most common species:

    Number   Frequency          Species
         1        1550          Human
         2        1326          Escherichia coli
         3         886          Mouse
         4         791          Rat
         5         591          Baker's yeast (Saccharomyces cerevisiae)
         6         422          Bovine
         7         342          Fruit fly (Drosophila melanogaster)
         8         311          Chicken
         9         229          Rabbit
        10         226          Bacillus subtilis
        11         220          African clawed frog (Xenopus laevis)
        12         205          Pig
        13         189          Human cytomegalovirus (strain AD169)
        14         168          Salmonella typhimurium
        15         154          Bacteriophage T4
        16         133          Maize
        17         118          Rice
        18         108          Tobacco
        19         105          Vaccinia virus
        20          95          Wheat
        21          94          Pea
        22          88          Staphylococcus aureus
        23          86          Slime mold (Dictyostelium discoideum)
        24          84          Liverwort (Marchantia polymorpha)
        25          83          Sheep
        26          81          Spinach
        27          80          Barley
                    80          Soybean
        29          70          Herpes simplex virus type 1 (strain 17)
                    70          Fission yeast (Schizosaccharomyces pombe)



A.3  DISTRIBUTION OF SEQUENCES BY LENGTH


               From   To  Number             From   To   Number
                  1-  50    1174             1001-1100      162
                 51- 100    2099             1101-1200      105
                101- 150    3021             1201-1300       88
                151- 200    1820             1301-1400       52
                201- 250    1468             1401-1500       45
                251- 300    1316             1501-1600       20
                301- 350    1147             1601-1700       22
                351- 400    1131             1701-1800       20
                401- 450     867             1801-1900       22
                451- 500     959             1901-2000       21
                501- 550     718             2001-2100        9
                551- 600     475             2101-2200       22
                601- 650     336             2201-2300       24
                651- 700     258             2301-2400       11
                701- 750     243             2401-2500       11
                751- 800     183             >2500           37
                801- 850     144
                851- 900     150
                901- 950      95
                951-1000      89


Currently the five largest sequences are:

                            RYNR$RABIT  5037 a.a.
                            APB$HUMAN   4563 a.a.
                            APOA$HUMAN  4548 a.a.
                            DMD$HUMAN   3685 a.a.
                            DMD$CHICK   3660 a.a.
  

Swiss-Prot release 15.0

Published August 1, 1990


                    SWISS-PROT RELEASE 15.0 RELEASE NOTES


                               1. INTRODUCTION

   1.1  Evolution

Release 15.0 of SWISS-PROT contains 16941 sequence entries, comprising 5'486'399
amino  acids  abstracted  from 16223 references.  This represents an increase of
12% over release 14.  The recent growth of the data bank is summarized below:

   Release    Date   Number of entries     Nb of amino acids

   3.0        11/86               4160               969 641
   4.0        04/87               4387             1 036 010
   5.0        09/87               5205             1 327 683
   6.0        01/88               6102             1 653 982
   7.0        04/88               6821             1 885 771
   8.0        08/88               7724             2 224 465
   9.0        11/88               8702             2 498 140
   10.0       03/89              10008             2 952 613
   11.0       07/89              10856             3 265 966
   12.0       10/89              12305             3 797 482
   13.0       01/90              13837             4 347 336
   14.0       04/90              15409             4 914 264
   15.0       08/90              16941             5 486 399


Almost 1600 sequences have been added since release 14, the sequence data of 159
existing  entries has been updated and the annotations of 2660 entries have been
revised.  In particular we have used review articles to update the  annotations
of the following groups or families of proteins:

   -  43 Kd postsynaptic proteins
   -  Alanine racemases
   -  Antifreeze proteins
   -  Aplysia neuropeptides
   -  Biopterin-dependent aromatic amino acid hydroxylases
   -  Cell shape bacterial proteins
   -  Connexins
   -  Cytomegalovirus proteins
   -  Disintegrins
   -  Eryf1-type proteins
   -  Flagellin and other flagellar proteins
   -  Fucosidases
   -  HMG proteins
   -  Inorganic pyrophosphatases
   -  Kinesin heavy chain family proteins
   -  Methylated-DNA--protein-cysteine methyltransferases
   -  Neuromodulins (GAP-43)
   -  Pectate lyases
   -  Peroxisomal proteins
   -  Potexvirus sequences
   -  PTS proteins
   -  Pyruvoyl-dependent enzymes
   -  Ribonucleotide reductases small subunit
   -  SRF-type transcription factors
   -  Synapsins
   -  TonB-dependent receptor proteins
   -  TPR repeat proteins
   -  Transcription factor TFIID
   -  Tyrosine specific protein phosphatases
   -  Uricases



2  DATA SOURCES

Release 15.0 has been updated using protein sequence data from release  24.0  of
the  PIR  (Protein  Identification  Resource)  protein  data  bank,  as  well as
translation of nucleotide sequence data from release 23.0 of the EMBL Nucleotide
Sequence Data Library.

As an indication of the source of the sequence data in the SWISS-PROT data  bank
we  list  here  the  statistics  concerning  the DR (Databank Reference) pointer
lines:

   Entries with pointer(s) to only PIR entri(es):            2868
   Entries with pointer(s) to only EMBL entri(es):           8049
   Entries with pointer(s) to both EMBL and PIR entri(es):   5083
   Entries with no pointer lines (entered in house):          941



3  CHANGES AT THIS RELEASE

3.1  DR Line Format

The DR line format has been extended  to  accept  cross-references  to  MIM  and
REBASE.   MIM  (Mendelian Inheritance in Man) is a data bank that holds clinical
data on a range of human genetic diseases (1); REBASE is a data bank that  holds
data concerning type 2 restriction enzymes (2).

For a MIM reference the primary identifier is the catalog number of the  disease
(or  phenotype)  and  the  secondary identifier is the last revision of the data
bank that we have used to derive the cross-reference.

For a REBASE reference  the  primary  identifier  is  the  entry  name  and  the
secondary  identifier  is the latest release number of the REBASE data bank that
we have used to derive the cross-reference.

Examples:

      DR   MIM; 24990; EIGHTH EDITION.
      DR   REBASE; BSURI; RELEASE 9007.
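
For readers who process the flat file with their own programs, the sketch
below splits a DR line into its three components.  It is an illustration
rather than an official parser: the function name is hypothetical, and it
assumes that every DR line holds exactly three fields separated by semicolons
and ends with a period, as in the two examples above.

```python
# Illustrative sketch; parse_dr_line is a hypothetical helper name.
def parse_dr_line(line):
    """Split a DR line into (data bank, primary id, secondary id)."""
    text = line.strip()
    if not text.startswith("DR "):
        raise ValueError(f"not a DR line: {line!r}")
    body = text[2:].strip()
    if body.endswith("."):
        body = body[:-1]                 # drop the terminating period
    fields = [field.strip() for field in body.split(";")]
    if len(fields) != 3:
        raise ValueError(f"expected three fields: {line!r}")
    return tuple(fields)


for example in ("DR   MIM; 24990; EIGHTH EDITION.",
                "DR   REBASE; BSURI; RELEASE 9007."):
    print(parse_dr_line(example))
```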


(1)  McKusick Victor A., Mendelian Inheritance in Man, Catalogs of autosomal
     dominant, autosomal recessive, and X-linked phenotypes, Eighth edition,
     Johns Hopkins University Press, Baltimore, (1988).

(2)  Roberts Rich, Cold Spring Harbor Laboratory, Box 100, Cold Spring Harbor,
     NY 11724, USA.


4  FORTHCOMING CHANGES

We plan to implement the following changes in release 16:


4.1  New Linetypes:  CA and CF

As we announced in the last release notes, the enzyme  entries  in  SWISS-PROT
will have two new line-types:

      CA   Description_of_catalytic_activity.
      CF   Description_of_cofactor.

These lines will be automatically generated at each release of SWISS-PROT  from
the  information  stored  in  the  ENZYME  data  bank.   They  will  replace the
'CATALYTIC ACTIVITY' and 'COFACTOR' comment line (CC) topics.  For example:

      CC   -!- CATALYTIC ACTIVITY: L-ASPARTATE + 2-OXOGLUTARATE =
               OXALOACETATE + L-GLUTAMATE.
      CC   -!- COFACTOR: PYRIDOXAL PHOSPHATE.

will be changed to:

      CA   L-ASPARTATE + 2-OXOGLUTARATE = OXALOACETATE + L-GLUTAMATE.
      CF   PYRIDOXAL PHOSPHATE.
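
The correspondence between the two notations can be sketched in Python as
follows.  Note that the CA and CF lines will be generated from the ENZYME data
bank, not converted from existing comments; the sketch only illustrates how
the old topics map onto the new line types.  The function name is
hypothetical, and long lines are not re-wrapped.

```python
# Illustrative sketch; names are hypothetical and output is not re-wrapped.
TOPIC_TO_LINE_CODE = {"CATALYTIC ACTIVITY": "CA", "COFACTOR": "CF"}


def convert_cc_topics(cc_lines):
    """Rewrite CATALYTIC ACTIVITY / COFACTOR comment topics as CA / CF lines."""
    topics = []
    for raw in cc_lines:
        text = raw.strip()
        if text.startswith("CC "):
            text = text[2:].strip()          # drop the line code
        if text.startswith("-!-"):
            topics.append(text[3:].strip())  # a new topic starts here
        elif topics:
            topics[-1] += " " + text         # continuation of the last topic
    converted = []
    for topic in topics:
        name, _, value = topic.partition(":")
        code = TOPIC_TO_LINE_CODE.get(name.strip())
        if code:
            converted.append(f"{code}   {value.strip()}")
    return converted


old_lines = [
    "CC   -!- CATALYTIC ACTIVITY: L-ASPARTATE + 2-OXOGLUTARATE =",
    "         OXALOACETATE + L-GLUTAMATE.",
    "CC   -!- COFACTOR: PYRIDOXAL PHOSPHATE.",
]
for new_line in convert_cc_topics(old_lines):
    print(new_line)
```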



4.2  OS Line Format

We will invert the order of the information in the OS line.  Currently  we  have
"English  common  name  (Latin  name)";  we  will switch to "Latin name (English
common name)".  For example:

     OS   HUMAN (HOMO SAPIENS).

will be changed to:

     OS   HOMO SAPIENS (HUMAN).
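
Programs that read the OS line may need to handle both orders during the
transition.  The following Python sketch performs the inversion for the simple
case shown above.  It is an assumption-laden illustration: the function name
is hypothetical, and lines without a parenthesised name, or with more than one
parenthesised group, are deliberately left unchanged because these notes do
not say how they will be treated.

```python
import re

# Illustrative sketch; invert_os_line is a hypothetical helper name.
# The pattern accepts exactly one trailing "(...)" group and no other
# parentheses anywhere in the line.
OS_PATTERN = re.compile(r"^OS   ([^()]+) \(([^()]+)\)\.$")


def invert_os_line(line):
    """Turn 'OS   COMMON (LATIN).' into 'OS   LATIN (COMMON).'."""
    text = line.strip()
    match = OS_PATTERN.match(text)
    if match is None:
        return text                      # leave anything unexpected untouched
    common_name, latin_name = match.groups()
    return f"OS   {latin_name} ({common_name})."


print(invert_os_line("OS   HUMAN (HOMO SAPIENS)."))
```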



5  ENZYME AND PROSITE DATABASES

Release 2.0 of the ENZYME data bank is distributed  along  with  release  15  of
SWISS-PROT.   ENZYME  release  2.0  contains information relative to 3071 enzyme
entries.  As is the case for SWISS-PROT, cross-references were  added  to  MIM.
See  the  User's  manual  of  ENZYME for a complete description of the syntax of
these cross-references.

Release 5.1 of the PROSITE data bank is distributed along  with  release  15  of
SWISS-PROT.   PROSITE  release  5.1 does not really represent a new release; the
only changes between release 5.0 and 5.1 are corrections of  two  format  errors
and  updating  of  the  pointers to the SWISS-PROT entries whose names have been
modified between release 14 and 15.  The next release of PROSITE (6.0)  will  be
distributed with release 16.0 of SWISS-PROT.



6  DISTRIBUTION MEDIA

Data is available on magnetic tape, TK50 cassette and CD-ROM.  This  section  of
the  release  notes  applies to tape and TK50 cassette only; CD-ROM releases are
accompanied by their own release notes which detail the file  organisation  used
on CD.



6.1  Tape Formats

The distribution tapes are 9-track industry standard magnetic tapes.  Each  file
consists  of  fixed-length  80  byte  records,  padded  with  trailing blanks as
appropriate (except for VMS Backup  format  tapes  which  have  variable  length
records).   Tape  format details (density, blocksize, label type, character set)
are attached to each tape.
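
If a tape file has been copied to disk as a raw stream of records, the
fixed-length layout can be undone as sketched below.  This is only an
illustration under stated assumptions: the data is assumed to contain no
record separators, the character set is assumed to be ASCII (check the details
attached to your tape), and the function name is hypothetical.

```python
# Illustrative sketch; split_records is a hypothetical helper name.
RECORD_LENGTH = 80


def split_records(data, encoding="ascii"):
    """Cut raw bytes into 80-byte records and strip the trailing blanks."""
    if len(data) % RECORD_LENGTH != 0:
        raise ValueError("data is not a whole number of 80-byte records")
    return [
        data[start:start + RECORD_LENGTH].decode(encoding).rstrip(" ")
        for start in range(0, len(data), RECORD_LENGTH)
    ]


# Two made-up records, each padded with blanks to 80 bytes.
raw = b"FIRST RECORD".ljust(RECORD_LENGTH) + b"SECOND".ljust(RECORD_LENGTH)
print(split_records(raw))
```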

In many formats, a release requires more than one  tape  volume.   In  order  to
support  sequential volume serial numbers for multi-volume tape sets, the volume
labels are EMBL01 for the first tape, EMBL02 for the second tape, and so on.

VMS Backup format tapes (and all TK50 cassettes) contain the files listed below,
in the order shown, as a single save set called SWISS15.BCK.



6.2  Documentation

The documentation files on tape (those ending with a file extension of .DOC) are
designed to be easily printable.  As with all other tape files they have a fixed
record length of 80 bytes.  The page length of 63 lines per page was  chosen  so
that the pages will fit both on DIN A4 paper and on American 8-1/2" x 11" paper.

Page throws are indicated by lines with  the  six  character  string  <PAGE>  in
positions  1-6,  and  nothing  else.  If you wish to print any of these files we
suggest you copy them down onto disk, use your local  editor  to  replace  every
occurrence of <PAGE> in columns 1-6 by a formfeed (or whatever is appropriate to
force a page throw on your printer), and then print them.
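
Where no suitable editor is at hand, the replacement can be made with a short
program.  The Python sketch below is one possible way of doing it, assuming
the file has been copied to disk as ordinary text lines; the function name and
the default form feed character are assumptions to adapt to your printer.

```python
# Illustrative sketch; replace_page_throws is a hypothetical helper name.
def replace_page_throws(lines, page_break="\f"):
    """Replace lines holding only <PAGE> in positions 1-6 by a page break."""
    result = []
    for raw in lines:
        line = raw.rstrip("\n")
        # Tape records are blank-padded, so ignore trailing blanks when testing.
        if line.rstrip(" ") == "<PAGE>":
            result.append(page_break)
        else:
            result.append(line)
    return result


document = ["last line of page 1", "<PAGE>".ljust(80), "first line of page 2"]
print(replace_page_throws(document))
```

Lines that merely mention <PAGE> somewhere in the text, or that do not start
in position 1, are left as they are.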



6.3  Release 15 Files

The distribution tape(s) contain the files shown below,  in  the  order  listed.
Where more than one tape is required, subsequent volumes will continue where the
preceding volume left off.


   File Number    File Name       Description                      #Records
   -----------    ------------    -----------------------------    --------
             1    CONTENTS.DOC    Tape Contents (this table)             63
             2    SWISSPRT.USR    User Manual                          1861
             3    RELNOTES.DOC    Release Notes (this document)         934
             4    SPECIES.NDX     Species Index                        9588
             5    KEYWORD.NDX     Keyword Index                       18482
             6    AUTHOR.NDX      Author Index                        81319
             7    SHORTDIR.NDX    Short Directory Index               34794
             8    ACNUMBER.NDX    Accession Number Index              17088
             9    CITATION.NDX    Citation Index                      32937
            10    SPIDCODE.NDX    Species ID Code Index                3091
            11    EMBLTOSP.DOC    EMBL/SWISS-PROT Xreferences         15057
            12    ORGCODES.DOC    Organism Code List                   3223
            13    PDBTOSP.DOC     PDB/SWISS-PROT Xreferences            567
            14    JOURLIST.DOC    Journal Abbreviation List            1383
            15    DATASUB.TXT     Data Submission Form                  312
            16    SEQ.DAT         SWISS-PROT Sequence Entries        542448
            17    PROSITE.USR     PROSITE Database User Manual          915
            18    PROSITE.LIS     PROSITE Entry List                    424
            19    PROSITE.DOC     PROSITE Entry Documentation         12313
            20    PROSITE.DAT     PROSITE Database Entries             6399
            21    ENZYME.USR      ENZYME Database User Manual           487
            22    ENZYME.DAT      ENZYME Database                     17561



7  INDEX FILE FORMATS

The index key of each index file (keywords, authors, citations, etc.) is  sorted
alphabetically;  the  names  of  all entries containing the index key are listed
alphabetically after the key.  Each entry name is  accompanied  by  its  primary
accession number.

Except for the short directory, accession number and species  id  code  indices,
all  index  files have the same layout:  each value of the index key begins on a
new line in column 1, and the associated entry names begin  on  the  next  line.
Lines containing entry names are in fixed-format, laid out as follows:

                     Columns   Description
                     -------   ---------------------------
                     14-23     entry name (left-justified)
                     29-34     primary accession number

                     36-45     entry name (left-justified)
                     51-56     primary accession number

                     58-67     entry name (left-justified)
                     73-78     primary accession number


Up to three entry names fit on each such line; if a given  index  key  has  more
than  three  entries associated with it, additional lines are used (with exactly
the same layout).  This index file format is  identical  to  that  of  the  EMBL
Nucleotide Sequence database.
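
The common layout lends itself to a simple reader.  The Python sketch below is
a minimal illustration, not a supported program: the function name is
hypothetical, and it assumes that a line is an entry line whenever columns
1-13 are blank and an index key line otherwise.

```python
# Illustrative sketch; parse_index is a hypothetical helper name.
# Each tuple gives 0-based slice bounds for (entry name, accession number).
ENTRY_SLOTS = [
    (13, 23, 28, 34),                    # cols 14-23 and 29-34
    (35, 45, 50, 56),                    # cols 36-45 and 51-56
    (57, 67, 72, 78),                    # cols 58-67 and 73-78
]


def parse_index(lines):
    """Map each index key to a list of (entry name, primary accession)."""
    index = {}
    key = None
    for raw in lines:
        line = raw.rstrip()
        if not line:
            continue
        if line[:13].strip():            # text in cols 1-13: an index key line
            key = line
            index.setdefault(key, [])
            continue
        if key is None:
            raise ValueError("entry line before any index key")
        padded = line.ljust(78)
        for n0, n1, a0, a1 in ENTRY_SLOTS:
            name = padded[n0:n1].strip()
            if name:
                index[key].append((name, padded[a0:a1].strip()))
    return index


# Keyword index lines, rebuilt with explicit field widths.
sample = [
    "ACETYLCHOLINE RECEPTOR INHIBITOR",
    f"{'':13}{'CXA1$CONGE':<15}{'P01519':<7}{'CXA1$CONMA':<15}{'P01521':<7}"
    f"{'CXA1$CONST':<15}P15471",
    f"{'':13}{'CXA2$CONGE':<15}P01520",
]
print(parse_index(sample))
```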


7.1  Species Index

This file lists all  species  which  appear  in  the  database.   It  is  sorted
alphabetically  on  (English)  common name.  The Latin genus and species will be
listed, if present in  the  database  entries.   Mitochondrion  and  chloroplast
sequences  appear  under  separate  index  keys,  immediately  after the related
nuclear sequences.  An excerpt from the species index file is given  below  (the
ruler is presented for your convenience - it does not appear in the index file):

1       10        20        30        40        50        60        70        80
+--------+---------+---------+---------+---------+---------+---------+---------+
CHILEAN POTATO-TREE (SOLANUM CRISPUM)
             PLAS$SOLCR     P00297
CHIMPANZEE IMMUNODEFICIENCY VIRUS (CIV) (SIV(CPZ))
             ENV$SIVCZ      P17281 GAG$SIVCZ      P17282 NEF$SIVCZ      P17664
             POL$SIVCZ      P17283 REV$SIVCZ      P17280 TAT$SIVCZ      P17285
             VIF$SIVCZ      P17284 VPR$SIVCZ      P17287 VPU$SIVCZ      P17286
CHIMPANZEE (PAN SATYRUS)
             HBA3$PANSA     P01935
CHIMPANZEE (PAN TROGLODYTES)
             CD4$PANTR      P16004 HA1A$PANTR     P13748 HA1B$PANTR     P13749
             HA1C$PANTR     P16209 HA1D$PANTR     P16210 HA1E$PANTR     P16215
+--------+---------+---------+---------+---------+---------+---------+---------+
1       10        20        30        40        50        60        70        80



7.2  Keyword Index

This file lists all keywords which appear in the database (on the KW lines).  It
is  sorted alphabetically on keyword.  An excerpt from the keyword index file is
given below (the ruler is presented for your convenience - it does not appear in
the index file):

1       10        20        30        40        50        60        70        80
+--------+---------+---------+---------+---------+---------+---------+---------+
ACETYLCHOLINE RECEPTOR INHIBITOR
             CXA1$CONGE     P01519 CXA1$CONMA     P01521 CXA1$CONST     P15471
             CXA2$CONGE     P01520
ACIDIC PROTEIN
             143E$BOVIN     P11576 B23$RAT        P13084 B231$HUMAN     P08693
             B232$HUMAN     P06748 BAT$HALHA      P13260 CALQ$CANFA     P12637
             CALQ$RABIT     P07221 CENB$HUMAN     P07199 CMGA$HUMAN     P10645
             CMGA$RAT       P10354 GRPE$ECOLI     P09372 MK16$YEAST     P10962
             NFL$BOVIN      P02548 NO38$CHICK     P16039 NU38$XENLA     P07222
+--------+---------+---------+---------+---------+---------+---------+---------+
1       10        20        30        40        50        60        70        80



7.3  Author Index

This file lists all author names which appear in citations (on the RA lines)  in
the database.  It is sorted alphabetically on name.  Names are presented as they
appear in the database entries (i.e.  as cited in  publications)  -  we  do  not
attempt  to  handle  multiple  surname spellings, or different initials, for the
same author.  An excerpt from the author index file is given below (the ruler is
presented for your convenience - it does not appear in the index file):

1       10        20        30        40        50        60        70        80
+--------+---------+---------+---------+---------+---------+---------+---------+
AAN F.
             PT3G$SALTY     P02908
AARDEN L.A.
             IL6$HUMAN      P05231
AARONSON R.P.
             HEMA$INCCA     P03465
AARONSON S.
             PGDS$HUMAN     P16234
AARONSON S.A.
             3ORF$EIAV1     P11305 DBL$HUMAN      P10911 ENV$EIAV1      P11306
             ENV$SMSAV      P03384 GAG$AVISN      P03342 GAG$MSVMO      P03334
+--------+---------+---------+---------+---------+---------+---------+---------+
1       10        20        30        40        50        60        70        80



7.4  Short Directory Index

This file contains summary  information  about  all  entries  in  the  database,
including  a  brief description of the entry, its sequence length, molecule type
and data management class.  The file is sorted  alphabetically  on  entry  name.
The lines are fixed-format, laid out as follows:

     Columns   Field Name          Description
     -------   ---------------     -------------------------------------------
     01-10     entry name          left-justified
     14-14     data class          s = standard
                                   u = unannotated
                                   p = preliminary
                                   r = unreviewed
     16-18     molecule type       PRT (protein)
     20-25     sequence length     right-justified
     27-80     description         left-justified


If an entry's description occupies more than 54 characters (cols 27-80), it will
be  continued  onto  one or more continuation lines.  Continuation lines contain
description text (left-justified) in cols  27-80;  cols  01-26  are  blank.   An
excerpt  from  the  short  directory  index  file  is given below (the ruler is
presented for your convenience - it does not appear in the index file):


1       10        20        30        40        50        60        70        80
+--------+---------+---------+---------+---------+---------+---------+---------+
104K$THEPA   s PRT    924 104 KD MICRONEME-RHOPTRY ANTIGEN - THEILERIA PARVA
10KA$MYCTU   s PRT    100 BCG-A HEATSHOCK PROTEIN (10 KD ANTIGEN) -
                          MYCOBACTERIUM TUBERCULOSIS
10KS$HUMAN   s PRT     91 10 KD SECRETORY PROTEIN PRECURSOR - HUMAN (HOMO
                          SAPIENS)
10KS$RAT     s PRT     18 10 KD SECRETORY PROTEIN (CC10) (FRAGMENT) - RAT
                          (RATTUS NORVEGICUS)
110K$PLAKN   s PRT    296 110 KD ANTIGEN (PK110) (FRAGMENT) - PLASMODIUM
                          KNOWLESI
11KD$ADE02   s PRT     79 11 KD CORE PROTEIN PRECURSOR (LATE L2 MU CORE PROTEIN)
                          (PROTEIN X) - ADENOVIRUS TYPE 2
11SB$CUCMA   s PRT    480 11-S GLOBULIN BETA SUBUNIT PRECURSOR - PUMPKIN
                          (CUCURBITA MAXIMA)
+--------+---------+---------+---------+---------+---------+---------+---------+
1       10        20        30        40        50        60        70        80
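
The column table above translates directly into a small reader.  The Python
sketch below is an illustration only: the function name and the dictionary
keys are hypothetical, and continuation lines are joined to the description
with a single blank, which is an assumption about how the text was wrapped.

```python
# Illustrative sketch; parse_shortdir and its dictionary keys are hypothetical.
def parse_shortdir(lines):
    """Return one dictionary per entry of the short directory index."""
    records = []
    for raw in lines:
        line = raw.rstrip()
        if not line:
            continue
        padded = line.ljust(80)
        description = padded[26:80].strip()           # cols 27-80
        if padded[:26].strip():                       # cols 1-26 used: new entry
            records.append({
                "entry_name": padded[0:10].strip(),   # cols 1-10
                "data_class": padded[13],             # col 14
                "molecule_type": padded[15:18],       # cols 16-18
                "length": int(padded[19:25]),         # cols 20-25
                "description": description,
            })
        elif records:                                 # continuation line
            records[-1]["description"] += " " + description
        else:
            raise ValueError("continuation line before any entry")
    return records


# Two lines from the excerpt, rebuilt with explicit field widths.
sample = [
    f"{'10KA$MYCTU':<13}s PRT {100:>6} BCG-A HEATSHOCK PROTEIN (10 KD ANTIGEN) -",
    f"{'':26}MYCOBACTERIUM TUBERCULOSIS",
]
print(parse_shortdir(sample))
```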



7.5  Accession Number Index

This  file  lists  all  accession  numbers  in  the  database.   It  is   sorted
alphabetically  on  accession  number.  Each accession number is followed by the
name and primary accession number of every entry in which it occurs.

The lines are fixed-format, laid out exactly the same as the other index files;
the only difference is that the index key (accession number) appears on the same
line (in cols 1-6) as the list of entries which contain the key.  In  the  other
index  files,  the  index  key  appears  on  a  line by itself.  This index key,
however, is short enough to fit on the same line as the  entries,  and  we  have
done this to save space.

Accession numbers which have been deleted from the database also appear in  this
index, containing the word DELETED (left-justified) in the entry name field.

An excerpt from the accession number index file is given  below  (the  ruler  is
presented for your convenience - it does not appear in the index file):

1       10        20        30        40        50        60        70        80
+--------+---------+---------+---------+---------+---------+---------+---------+
P00836       ATPE$HORVU     P00836
P00837       ATPG$ECOLI     P00837
P00838       ATPG$ECOLI     P00837
P00839       ATPL$BOVIN     P00839 ATPM$BOVIN     P07926
P00840       ATP9$MAIZE     P00840
P00841       ATP9$YEAST     P00841
P00842       ATPL$NEUCR     P00842
+--------+---------+---------+---------+---------+---------+---------+---------+
1       10        20        30        40        50        60        70        80


7.6  Citation Index

This file lists all journal citations which  appear  in  the  database.   It  is
sorted  alphabetically  on citation.  An excerpt from the citation index file is
given below (the ruler is presented for your convenience - it does not appear in
the index file):

1       10        20        30        40        50        60        70        80
+--------+---------+---------+---------+---------+---------+---------+---------+
ANN. GENET. SEL. ANIM. 4:515-521(1972)
             CASK$BOVIN     P02668
ANN. INST. PASTEUR IMMUNOL. 127C:261-271(1976)
             KV3F$HUMAN     P01624
ANN. INST. PASTEUR IMMUNOL. 132D:77-88(1981)
             HV41$MOUSE     P01811
ANN. N.Y. ACAD. SCI. 165:360-377(1969)
             HBB$ATEGE      P02034
ANN. N.Y. ACAD. SCI. 241:436-438(1974)
             HBB$RABIT      P02057
+--------+---------+---------+---------+---------+---------+---------+---------+
1       10        20        30        40        50        60        70        80


7.7  Species ID Code Index

This file lists all species id codes which appear in the database.  It is sorted
alphabetically  on species id code.  Each code is followed by a (sorted) list of
all sequence entry codes which are associated with the  species  id  code.   The
sequence  entry  code is the first component of each entry name (before the "$")
and the species id code is the second component of each entry  name  (after  the
"$").   For  example, the entry called COA3$ADEA2 has a species id code of ADEA2
and a sequence entry code of COA3.

Each species id code starts on a  new  line,  occupying  columns  1-5,  and  the
associated  sequence entry codes (each up to 4 characters long) start in columns
10, 15, 20, 25, 30, 35, ...  70, 75.  If there are more than 14  sequence  entry
codes  for  a  given species id, as many continuation lines as necessary will be
used, with columns 1-9 left blank.

An excerpt from the species id code index file is  given  below  (the  ruler  is
presented for your convenience - it does not appear in the index file):

1       10        20        30        40        50        60        70        80
+--------+---------+---------+---------+---------+---------+---------+---------+
ACHKL    CALM
ACHLY    API
ACIBA    KKA
ACICA    BEND CATA CATM DHGA DHGB ELH2 MURO PQQ1 PQQ2 PQQ3 PQQ5 PQQL PQQR TRPB
         TRPC TRPD TRPF TRPG
ACIFE    HGDA HGDB YHGD
ACIGL    ASPQ
ACIGU    PRT1
ACISP    CYMO
+--------+---------+---------+---------+---------+---------+---------+---------+
1       10        20        30        40        50        60        70        80


8  WE NEED YOUR HELP !

We welcome  any  feedback  from  our  users.   We  would  especially  appreciate
information  about  any sequences belonging to your field of expertise which are
missing from the database.  We also would like to be notified about  annotations
which  should be updated (e.g.  if the function of a protein has been clarified,
or if new post-translational information has become available).


                                   APPENDIX A

                                SOME STATISTICS



A.1  AMINO ACID COMPOSITION

Composition in percent for the complete database:

   Ala (A) 7.70   Gln (Q) 4.09   Leu (L) 9.09   Ser (S) 7.07
   Arg (R) 5.25   Glu (E) 6.28   Lys (K) 5.84   Thr (T) 5.86
   Asn (N) 4.42   Gly (G) 7.15   Met (M) 2.30   Trp (W) 1.31
   Asp (D) 5.23   His (H) 2.27   Phe (F) 3.95   Tyr (Y) 3.20
   Cys (C) 1.83   Ile (I) 5.39   Pro (P) 5.11   Val (V) 6.49
   Asx (B) 0.01   Glx (Z) 0.01   Xaa (X) 0.04

Classification of the amino acids by their frequency:

   Leu, Ala, Gly, Ser, Val, Glu, Thr, Lys, Ile, Arg, Asp, Pro, Asn, Gln,
   Phe, Tyr, Met, His, Cys, Trp



A.2  DISTRIBUTION OF SEQUENCES BY SPECIES

Total number of species represented in this release of the database:  2363.

        Species represented 1x: 1051
                            2x:  435
                            3x:  237
                            4x:  135
                            5x:  108
                            6x:   84
                            7x:   46
                            8x:   31
                            9x:   48
                           10x:   20
                       11- 20x:   84
                       21-100x:   66
                         >100x:   18

Table of the most common species:

    Number   Frequency          Species

         1        1462          Human
         2        1246          Escherichia coli
         3         829          Mouse
         4         712          Rat
         5         549          Baker's yeast (Saccharomyces cerevisiae)
         6         404          Bovine
         7         297          Fruit fly (Drosophila melanogaster)
         8         267          Chicken
         9         220          Rabbit
        10         200          Bacillus subtilis
        11         191          Pig
        12         189          Human cytomegalovirus (strain AD169)
        13         159          African clawed frog (Xenopus laevis)
        14         157          Salmonella typhimurium
        15         142          Bacteriophage T4
        16         126          Maize
        17         113          Rice
        18         106          Tobacco
        19          98          Vaccinia virus
        20          92          Wheat
        21          86          Pea
        22          84          Liverwort (Marchantia polymorpha)
        23          80          Staphylococcus aureus
        24          78          Slime mold (Dictyostelium discoideum)
                    78          Spinach
        26          77          Soybean
        27          76          Barley
                    76          Sheep
        29          70          Herpes simplex virus type 1 (strain 17)
        30          67          Varicella-Zoster virus (strain Dumas)


A.3  DISTRIBUTION OF SEQUENCES BY LENGTH


          From   To  Number             From   To   Number

             1-  50    1079             1001-1100      145
            51- 100    1962             1101-1200       98
           101- 150    2833             1201-1300       79
           151- 200    1690             1301-1400       50
           201- 250    1345             1401-1500       41
           251- 300    1205             1501-1600       18
           301- 350    1039             1601-1700       21
           351- 400    1056             1701-1800       18
           401- 450     785             1801-1900       16
           451- 500     860             1901-2000       19
           501- 550     665             2001-2100        8
           551- 600     435             2101-2200       22
           601- 650     315             2201-2300       22
           651- 700     236             2301-2400       10
           701- 750     225             2401-2500       10
           751- 800     158             >2500           32
           801- 850     131
           851- 900     144
           901- 950      85
           951-1000      84

Currently the five largest sequences are:

     RYNR$RABIT  5037 a.a.
     APB$HUMAN   4563 a.a.
     APOA$HUMAN  4548 a.a.
     DMD$HUMAN   3685 a.a.
     DMD$CHICK   3660 a.a.
  

Swiss-Prot release 14.0

Published April 1, 1990


                    SWISS-PROT RELEASE 14.0 RELEASE NOTES


                               1. INTRODUCTION

   1.1  Evolution

Release 14.0 of SWISS-PROT contains 15409 sequence entries,  comprising  4914264
amino  acids  abstracted  from 15054 references.  This represents an increase of
13% over release 13.0.  The recent growth of the database is summarized below:

     Release    Date   Number of entries     Nb of amino acids

     3.0        11/86               4160               969 641
     4.0        04/87               4387             1 036 010
     5.0        09/87               5205             1 327 683
     6.0        01/88               6102             1 653 982
     7.0        04/88               6821             1 885 771
     8.0        09/88               7724             2 224 465
     9.0        12/88               8702             2 498 140
     10.0       03/89              10008             2 952 613
     11.0       06/89              10856             3 265 966
     12.0       10/89              12305             3 797 482
     13.0       01/90              13837             4 347 336
     14.0       04/90              15409             4 914 264
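
The 13% figure quoted above appears to correspond to the growth in amino acid
count rather than in entry count.  The short calculation below is one way to
check this from the two last rows of the table; the variable and function names
are illustrative only.

```python
# Figures for releases 13.0 and 14.0, copied from the growth table above.
entries = {"13.0": 13837, "14.0": 15409}
amino_acids = {"13.0": 4347336, "14.0": 4914264}


def percent_increase(old: int, new: int) -> float:
    """Relative growth from old to new, expressed in percent."""
    return 100.0 * (new - old) / old


new_entries = entries["14.0"] - entries["13.0"]
entry_growth = percent_increase(entries["13.0"], entries["14.0"])
residue_growth = percent_increase(amino_acids["13.0"], amino_acids["14.0"])

print(f"new entries: {new_entries} ({entry_growth:.1f}% more entries)")
print(f"amino acids: {residue_growth:.1f}% more residues")
```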

Almost 1600 sequences have been added since release 13, the sequence data of 191
existing  entries has been updated and the annotations of 2650 entries have been
revised.  In particular we have used review articles to update the  annotations
of the following groups or families of proteins:

     2-S seed storage proteins
     3-hydroxyacyl-CoA dehydrogenases
     Acyl-CoA dehydrogenases
     Acylphosphatases
     Albumin / AFP / VDBP family
     Aldo/keto reductases
     Alkaline phosphatases
     Amino acid permeases
     Arthropod hemocyanins / insect LSPs
     Bacterial activator proteins, gntR family
     Calcitonin / CGRP / IAPP family
     Cecropin family
     Copper type II, ascorbate-dependent monooxygenases
     Cytochromes b/b6
     DNA ligases
     Enoyl-CoA hydratases
     Glutamine synthetases
     Growth factor and cytokine receptors
     Isocitrate lyases
     Lamins
     Leguminous lectins
     Nerve growth factor family
     Nitrogenase component 1 subunits
     Phosphoglycerate mutases
     Phosphoribosyl pyrophosphate synthetases
     Phosphoribosylglycinamide synthetases
     Phytochromes
     Platelet-derived growth factor (PDGF) family
     Proliferating cell nuclear antigen (PCNA)
     Protamine P1
     Rieske iron-sulfur proteins
     Rotaviruses proteins
     Sulfatases
     Thiolases
     Tryptophan synthetases
     Ubiquitin carboxyl-terminal hydrolases
     Ubiquitin-conjugating enzymes
     Ureases
     Urokinases and tissue plasminogen activators
     Zinc carboxypeptidases



2  DATA SOURCES

Release 14.0 has been updated using protein sequence data from release  23.0  of
the  PIR  (Protein  Identification  Resource)  protein  data  bank,  as  well as
translation of nucleotide sequence data from release 22.0 of the EMBL Nucleotide
Sequence Database.

As an indication of the source of the sequence data in the SWISS-PROT data  bank
we  list  here  the  statistics  concerning  the DR (Databank Reference) pointer
lines:

Entries with pointer(s) to only PIR entri(es):                  3079
Entries with pointer(s) to only EMBL entri(es):                 6896
Entries with pointer(s) to both EMBL and PIR entri(es):         4557
Entries with no pointer lines (entered in house):                877



3  CHANGES AT THIS RELEASE

3.1  OG Line Format

The OG (OrGanelle) line format has been extended to take  into  account  protein
sequences  from  genes  originating  from  the  cyanelle  of   algae   such   as
Cyanophora paradoxa.  The valid syntax is:

     OG   CYANELLE.



3.2  DR Line Format

The DR line format has been extended to accept cross-references to PROSITE,  the
data  bank of sites and patterns in proteins which is now being distributed with
SWISS-PROT.  The primary identifier is the PROSITE  accession  number,  and  the
secondary identifier is the PROSITE entry name.  Examples:

     DR   PROSITE; PS00088; SOD_MN.
     DR   PROSITE; PS00021; KRINGLE.
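
A minimal sketch of how such a cross-reference line could be split into its
three fields is given below.  It assumes only the layout visible in the two
examples above (line code, database name, then two identifiers separated by
semicolons and closed by a period); the class and function names are
illustrative and not part of any SWISS-PROT software.

```python
from typing import NamedTuple


class CrossReference(NamedTuple):
    database: str
    primary_id: str    # for PROSITE: the accession number
    secondary_id: str  # for PROSITE: the entry name


def parse_dr_line(line: str) -> CrossReference:
    """Split one DR line into database, primary and secondary identifier."""
    line = line.strip()
    if not line.startswith("DR   "):
        raise ValueError(f"not a DR line: {line!r}")
    body = line[5:].strip().rstrip(".")
    database, primary_id, secondary_id = (part.strip() for part in body.split(";"))
    return CrossReference(database, primary_id, secondary_id)


xref = parse_dr_line("DR   PROSITE; PS00088; SOD_MN.")
print(xref.database, xref.primary_id, xref.secondary_id)
```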


3.3  New CC Line Topics

As of release 14 we have added two new topics for the  comments  (CC)  linetype:
ENZYME REGULATION, and TISSUE SPECIFICITY.  Example of their usage:

     CC   -!- ENZYME REGULATION: THE ACTIVITY OF THIS ENZYME IS CONTROLLED
     CC       BY ADENYLATION. THE FULLY ADENYLATED ENZYME COMPLEX IS
     CC       INACTIVE.

     CC   -!- TISSUE SPECIFICITY: KIDNEY; SUBMAXILLARY GLAND; URINE.
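
The following sketch shows one way the comment blocks above could be grouped by
topic.  It assumes, from the examples only, that a new comment starts with
"-!-" followed by the topic and a colon, and that any other CC line continues
the previous comment.  The function name is illustrative.

```python
def parse_cc_lines(lines):
    """Group CC lines into (topic, text) pairs."""
    comments = []
    for raw in lines:
        line = raw.strip()
        if not line.startswith("CC"):
            continue  # skip blank lines and other line types
        content = line[2:].strip()
        if content.startswith("-!-"):
            # A new comment block: the topic is everything before the first colon.
            topic, _, text = content[3:].partition(":")
            comments.append([topic.strip(), text.strip()])
        elif comments and content:
            # A continuation line: append it to the current comment.
            comments[-1][1] = (comments[-1][1] + " " + content).strip()
    return [tuple(comment) for comment in comments]


sample = [
    "CC   -!- ENZYME REGULATION: THE ACTIVITY OF THIS ENZYME IS CONTROLLED",
    "CC       BY ADENYLATION. THE FULLY ADENYLATED ENZYME COMPLEX IS",
    "CC       INACTIVE.",
    "CC   -!- TISSUE SPECIFICITY: KIDNEY; SUBMAXILLARY GLAND; URINE.",
]
comments = parse_cc_lines(sample)
for topic, text in comments:
    print(f"{topic}: {text}")
```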



3.4  Documentation Files

The file ECNUMBER.DOC, which contained an index  of  SWISS-PROT  enzyme  entries
classified  by  EC  number,  is  no longer distributed.  This information can be
found in the new database ENZYME which is now distributed with SWISS-PROT.

The JOURLIST.DOC file now includes the ISSN numbers for all  journals  cited  in
SWISS-PROT and PROSITE.

For the sake of  consistency  with  the  newly  introduced  PROSITE  and  ENZYME
databases,  the  name  of  the SWISS-PROT User Manual file has been changed from
USRMAN.DOC to SWISSPRT.USR.  All three databases  now  have  .USR  as  the  file
extension for their User Manual files.



4  NEW PROSITE DATABASE

PROSITE is a compilation of sites and patterns found in protein sequences.  This
database consists of two files:  the first file contains the patterns as well as
the results of the scan of  SWISS-PROT  for  these  patterns,  the  second  file
contains  the documentation that fully describes each pattern.  A sample pattern
entry and its corresponding documentation entry are shown below.

The use of protein sequence patterns (or motifs) to determine the function(s) of
proteins is rapidly becoming one of the essential tools of sequence analysis.
PROSITE, as a stand-alone database, is pertinent for such purposes.
But  we  also  believe that PROSITE is an important addition to SWISS-PROT as it
allows the flexible classification of proteins into families.

PROSITE is distributed with  SWISS-PROT;  for  a  complete  description  of  the
content  and  format  of this database you should refer to the User Manual (file
PROSITE.USR).


------------------------ Start of Sample PROSITE Entry -------------------------

ID   CARBOXYPEPTIDASE_SER; PATTERN.
AC   PS00131;
DT   APR-1990 (CREATED); APR-1990 (DATA UPDATE); APR-1990 (INFO UPDATE).
DE   Serine carboxypeptidases, serine active site.
PA   G-E-S-Y-A-G.
NR   /RELEASE=14,15409;
NR   /TOTAL=7(7); /POSITIVE=7(7); /UNKNOWN=0(0); /FALSE_POS=0(0);
NR   /FALSE_NEG=0(0);
CC   /TAXO-RANGE=??E??; /MAX-REPEAT=1;
CC   /SITE=3,active_site;
DR   P10619, PRTP$HUMAN, T; P09620, KEX1$YEAST, T; P00729, CBPY$YEAST, T;
DR   P07519, CBP1$HORVU, T; P08818, CBP2$HORVU, T; P08819, CBP2$WHEAT, T;
DR   P11515, CBPG$WHEAT, T;
DO   PDOC00122;
//

{PDOC00122}
{PS00131; CARBOXYPEPTIDASE_SER}
{BEGIN}
************************************************
* Serine carboxypeptidases, serine active site *
************************************************

All  known  carboxypeptidases  are either  metallo carboxypeptidases or serine
carboxypeptidases (EC 3.4.16.-). The catalytic activity of the serine carboxy-
peptidases, like  that  of  the  serine  proteases  of  the trypsin family, is
provided by a charge relay system involving an aspartic acid residue hydrogen-
bonded to an histidine, which itself is hydrogen-bonded to a serine.  Proteins
known or proposed to be serine carboxypeptidases are:

   - Barley serine carboxypeptidases I and II [1,2].
   - Wheat serine carboxypeptidase II [3].
   - A probable wheat serine carboxypeptidase induced by gibberellin [4].
   - Yeast carboxypeptidase Y [5], a  vacuolar protease involved in the degra-
     dation of small peptides.
   - Yeast KEX1 protease [6], which  is  involved  in  killer toxin and alpha-
     factor precursor processing.
   - Human 'protective protein' [7], a  lysosomal  protein which appears to be
     essential for both the activity of beta-galactosidase and neuraminidase.

The sequence in the  vicinity of the  active site  serine residue is perfectly
conserved in all these serine carboxypeptidases.

-Consensus pattern: G-E-S-Y-A-G
                    [S is the active site residue]
-Sequences known to belong to this class detected by the pattern: ALL.
-Other sequence(s) detected in SWISS-PROT: NONE.
-Last update: October 1989 / Text revised.

[ 1] Sorensen S.B., Breddam K., Svendsen I.
     Carlsberg Res. Commun. 51:475-485(1986).
[ 2] Sorensen S.B., Svendsen I., Breddam K.
     Carlsberg Res. Commun. 52:285-295(1987).
[ 3] Breddam K., Sorensen S.B., Svendsen I.
     Carlsberg Res. Commun. 52:297-311(1987).
[ 4] Baulcombe D.C., Barker R.F., Jarvis M.G.
     J. Biol. Chem. 262:13726-13735(1987).
[ 5] Valls L.A., Hunter C.P., Rothman J.H., Stevens T.H.
     Cell 48:887-897(1987).
[ 6] Dmochowska A., Dignard D., Henning D., Thomas D.Y., Bussey H.
     Cell 50:573-584(1987).
[ 7] Galjart N.J., Gillemans N., Harris A., van de Horst G.T.J.,
     Verheijen F.W., Galjaard H., D'Azzo A.
     Cell 54:755-764(1988).
{END}

------------------------- End of Sample PROSITE Entry --------------------------
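
To illustrate how the PA and /SITE lines of the sample entry relate to a
sequence, here is a minimal sketch that scans a sequence for the pattern
G-E-S-Y-A-G and reports the position of the active site residue (the third
pattern element).  It handles only patterns made of single residues joined by
hyphens, as in this sample; the full pattern syntax is described in
PROSITE.USR and is not covered here.  The function name and the sequence
fragment are invented for illustration.

```python
def find_simple_pattern(pattern: str, sequence: str, site_index: int):
    """Yield (match_start, site_position), both 1-based, for each occurrence.

    Only patterns of single residues joined by '-' are supported.
    """
    elements = pattern.rstrip(".").split("-")
    if not all(len(e) == 1 and e.isalpha() for e in elements):
        raise ValueError("only literal one-letter elements are supported here")
    motif = "".join(elements)
    start = sequence.find(motif)
    while start != -1:
        # start is 0-based; the site is the site_index-th element of the match.
        yield start + 1, start + site_index
        start = sequence.find(motif, start + 1)


# Hypothetical sequence fragment, not taken from any database entry.
fragment = "MKTLLGESYAGHYV"
hits = list(find_simple_pattern("G-E-S-Y-A-G.", fragment, site_index=3))
for match_start, site in hits:
    print(f"match at {match_start}, active site {fragment[site - 1]}{site}")
```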



5  NEW ENZYME DATABASE

A new secondary database called ENZYME has been established.   It  contains  the
following  data for each type of characterized enzyme for which an EC number has
been assigned:

      o  EC number

      o  Recommended name

      o  Alternative names (if any)

      o  Catalytic activity

      o  Cofactors (if any)

      o  Pointers to the SWISS-PROT entry(ies) that correspond to the enzyme
         (if any)


We believe that the ENZYME database will  be  useful  to  anybody  working  with
enzymes  and will allow programs to be developed that can help with the creation
of new metabolic pathways.

A sample entry is shown below:

     ID   1.14.17.3
     DE   PEPTIDYLGLYCINE MONOOXYGENASE.
     AN   PEPTIDYL ALPHA-AMIDATING ENZYME.
     CA   PEPTIDYLGLYCINE + ASCORBATE + O(2) = PEPTIDYL(2-HYDROXYGLYCINE) +
     CA   DEHYDROASCORBATE + H(2)O.
     CC   THE PRODUCT IS UNSTABLE AND DISMUTATES TO GLYOXYLATE AND THE
     CC   CORRESPONDING DESGLYCINE PEPTIDE AMIDE.
     CF   COPPER.
     DR   P10731, AMD$BOVIN ;  P14925, AMD$RAT   ;  P08478, AMD1$XENLA;
     DR   P12890, AMD2$XENLA;
     //
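
As a rough sketch of how such an entry could be read by a program, the code
below collects the lines of the sample entry by line code, joins continuation
lines, and splits the DR lines into (accession number, entry name) pairs.  It
relies only on the layout visible in the sample; the authoritative format
description is in ENZYME.USR.  The function name is illustrative.

```python
import re
from collections import defaultdict

SAMPLE = """\
ID   1.14.17.3
DE   PEPTIDYLGLYCINE MONOOXYGENASE.
AN   PEPTIDYL ALPHA-AMIDATING ENZYME.
CA   PEPTIDYLGLYCINE + ASCORBATE + O(2) = PEPTIDYL(2-HYDROXYGLYCINE) +
CA   DEHYDROASCORBATE + H(2)O.
CC   THE PRODUCT IS UNSTABLE AND DISMUTATES TO GLYOXYLATE AND THE
CC   CORRESPONDING DESGLYCINE PEPTIDE AMIDE.
CF   COPPER.
DR   P10731, AMD$BOVIN ;  P14925, AMD$RAT   ;  P08478, AMD1$XENLA;
DR   P12890, AMD2$XENLA;
//
"""


def parse_enzyme_entry(text):
    """Return a dict mapping each line code to its joined content."""
    fields = defaultdict(list)
    for line in text.splitlines():
        if not line.strip():
            continue
        if line.startswith("//"):
            break  # end-of-entry marker
        code, content = line[:2], line[5:].strip()
        fields[code].append(content)
    entry = {code: " ".join(parts) for code, parts in fields.items()}
    # Each cross-reference is "accession, entry_name" terminated by ';'.
    entry["DR"] = re.findall(r"(\w+)\s*,\s*([\w$]+)", entry.get("DR", ""))
    return entry


entry = parse_enzyme_entry(SAMPLE)
print(entry["ID"], entry["DE"])
print(entry["DR"])
```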

The impact of this new database on SWISS-PROT is the following:

      o  The ECNUMBER.DOC file is now obsolete and is no longer generated.

      o  Instead of having CC (comments) lines with the topics:

         CC   -!- CATALYTIC ACTIVITY:  description_of_catalytic_activity.
         CC   -!- COFACTOR:  description_of_cofactor.

         the enzyme entries in SWISS-PROT will, in future releases, have two new
         linetypes:

         CA   Description_of_catalytic_activity
         CF   Description_of_cofactor

         These  lines  will  be  automatically  generated  at  each  release  of
         SWISS-PROT  from  the  information  stored in the ENZYME database.  The
         introduction of these new  linetypes  is  planned  for  release  16  of
         SWISS-PROT.


ENZYME is distributed with SWISS-PROT; for a complete description of the content
and  format  of  this  database  you  should  refer  to  the  User  Manual (file
ENZYME.USR).



6  DISTRIBUTION MEDIA

Data is available on magnetic tape, TK50 cassette and CD-ROM.  This  section  of
the  release  notes  applies to tape and TK50 cassette only; CD-ROM releases are
accompanied by their own release notes which detail the file  organisation  used
on CD.



6.1  Tape Formats

The distribution tapes are 9-track industry standard magnetic tapes.  Each  file
consists  of  fixed-length  80  byte  records,  padded  with  trailing blanks as
appropriate (except for VMS Backup  format  tapes  which  have  variable  length
records).   Tape  format details (density, blocksize, label type, character set)
are attached to each tape.
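
Once a tape file has been copied to disk as a plain byte stream, the fixed
80-byte records described above could be recovered along the following lines.
This is only a sketch: it assumes an ASCII character set (the actual character
set is stated on each tape), and the function name and demonstration records
are invented.

```python
import io

RECORD_LENGTH = 80  # fixed record length stated above


def read_records(stream, encoding="ascii"):
    """Yield each 80-byte record with its trailing blank padding removed."""
    while True:
        record = stream.read(RECORD_LENGTH)
        if not record:
            break
        if len(record) != RECORD_LENGTH:
            raise ValueError("truncated record at end of file")
        yield record.decode(encoding).rstrip(" ")


# Two invented records, padded to 80 bytes, stand in for a tape file.
tape = io.BytesIO(b"".join(t.ljust(RECORD_LENGTH).encode("ascii")
                           for t in ("ID   EXAMPLE", "//")))
records = list(read_records(tape))
print(records)
```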

In many formats, a release requires more than one  tape  volume.   In  order  to
support  sequential volume serial numbers for multi-volume tape sets, the volume
labels are EMBL01 for the first tape, EMBL02 for the second tape, and so on.

VMS Backup format tapes (and all TK50 cassettes) contain the files listed below,
in the order shown, as a single save set called SWISS14.BCK.



6.2  Documentation

The documentation files on tape (those ending with a file extension of .DOC) are
designed to be easily printable.  As with all other tape files they have a fixed
record length of 80 bytes.  The page length of 63 lines per page was  chosen  so
that the pages will fit both on DIN A4 paper and on American 8-1/2" x 11" paper.
Page throws are indicated by lines with  the  six  character  string  <PAGE>  in
positions  1-6,  and  nothing  else.  If you wish to print any of these files we
suggest you copy them down onto disk, use your local  editor  to  replace  every
occurrence of <PAGE> in columns 1-6 by a formfeed (or whatever is appropriate to
force a page throw on your printer), and then print them.
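
For sites that prefer a small program to an editor, the replacement described
above could be sketched as follows.  The function name is illustrative; a form
feed is used as the page throw, and trailing blanks from the 80-byte record
padding are dropped.

```python
def replace_page_markers(lines):
    """Replace lines consisting solely of '<PAGE>' with a form feed."""
    output = []
    for line in lines:
        stripped = line.rstrip()  # remove record padding and line ends
        # Only a marker that starts in column 1 and stands alone counts.
        output.append("\f" if stripped == "<PAGE>" else stripped)
    return output


demo = [
    "RELEASE NOTES".ljust(80),
    "<PAGE>".ljust(80),
    "text mentioning <PAGE> inside a line",
]
printable = replace_page_markers(demo)
print(printable)
```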



6.3  Release 14 Files

The distribution tape(s) contain the files shown below,  in  the  order  listed.
Where more than one tape is required, subsequent volumes will continue where the
preceding volume left off.

   File Number    File Name       Description                      #Records
   -----------    ------------    -----------------------------    --------
             1    CONTENTS.DOC    Tape Contents (this table)             63
             2    SWISSPRT.USR    User Manual                          1829
             3    RELNOTES.DOC    Release Notes (this document)        1063

             4    SPECIES.NDX     Species Index                        8779
             5    KEYWORD.NDX     Keyword Index                       16830
             6    AUTHOR.NDX      Author Index                        74268
             7    SHORTDIR.NDX    Short Directory Index               31363
             8    ACNUMBER.NDX    Accession Number Index              15535
             9    CITATION.NDX    Citation Index                      30411
            10    SPIDCODE.NDX    Species ID Code Index                2842

            11    EMBLTOSP.DOC    EMBL/SWISS-PROT Xreferences         13567
            12    ORGCODES.DOC    Organism Code List                   3199
            13    PDBTOSP.DOC     PDB/SWISS-PROT Xreferences            575
            14    JOURLIST.DOC    Journal Abbreviation List            1343
            15    DATASUB.TXT     Data Submission Form                  323

            16    SEQ.DAT         SWISS-PROT Sequence Entries        490281

            17    PROSITE.USR     PROSITE Database User Manual          915
            18    PROSITE.LIS     PROSITE Entry List                    424
            19    PROSITE.DOC     PROSITE Entry Documentation         12313
            20    PROSITE.DAT     PROSITE Database Entries             6401

            21    ENZYME.USR      ENZYME Database User Manual           427
            22    ENZYME.DAT      ENZYME Database                     12008



7  INDEX FILE FORMATS

The index key of each index file (keywords, authors, citations, etc.) is  sorted
alphabetically;  the  names  of  all entries containing the index key are listed
alphabetically after the key.  Each entry name is  accompanied  by  its  primary
accession number.

Except for the short directory, accession number and species  id  code  indices,
all  index  files have the same layout:  each value of the index key begins on a
new line in column 1, and the associated entry names begin  on  the  next  line.
Lines containing entry names are in fixed-format, laid out as follows:


                     Columns   Description
                     -------   ---------------------------
                     14-23     entry name (left-justified)
                     29-34     primary accession number

                     36-45     entry name (left-justified)
                     51-56     primary accession number

                     58-67     entry name (left-justified)
                     73-78     primary accession number


Up to three entry names fit on each such line; if a given  index  key  has  more
than  three  entries associated with it, additional lines are used (with exactly
the same layout).  This index file format is  identical  to  that  of  the  EMBL
Nucleotide Sequence database.
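
The column layout above translates fairly directly into code.  The sketch below
reads an index file of this kind into a dictionary; the column numbers come
from the table, while the function names are illustrative.  The sample lines
are taken from the species index excerpt in section 7.1.

```python
# (first, last) columns, 1-based and inclusive, as given in the table above.
NAME_COLUMNS = [(14, 23), (36, 45), (58, 67)]
ACCESSION_COLUMNS = [(29, 34), (51, 56), (73, 78)]


def cut(line, first, last):
    """Return the text in columns first..last (1-based, inclusive)."""
    return line[first - 1:last].strip()


def parse_entry_line(line):
    """Return up to three (entry name, primary accession number) pairs."""
    pairs = []
    for name_cols, acc_cols in zip(NAME_COLUMNS, ACCESSION_COLUMNS):
        name = cut(line, *name_cols)
        if name:
            pairs.append((name, cut(line, *acc_cols)))
    return pairs


def parse_index(lines):
    """Map each index key to the list of entries listed under it."""
    index, key = {}, None
    for line in lines:
        if not line.strip():
            continue
        if line[0] != " ":          # an index key starts in column 1
            key = line.rstrip()
            index[key] = []
        elif key is not None:       # entry lines follow their key
            index[key].extend(parse_entry_line(line))
    return index


SAMPLE = [
    "CHIMPANZEE (PAN SATYRUS)",
    "             HBA3$PANSA     P01935",
    "CHIMPANZEE (PAN TROGLODYTES)",
    "             CD4$PANTR      P16004 HA1A$PANTR     P13748 HA1B$PANTR     P13749",
    "             NUO5$PANTR     P03916",
]
index = parse_index(SAMPLE)
print(index["CHIMPANZEE (PAN TROGLODYTES)"])
```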



7.1  Species Index

This file lists all  species  which  appear  in  the  database.   It  is  sorted
alphabetically  on  (English)  common name.  The Latin genus and species will be
listed, if present in  the  database  entries.   Mitochondrion  and  chloroplast
sequences  appear  under  separate  index  keys,  immediately  after the related
nuclear sequences.  An excerpt from the species index file is given  below  (the
ruler is presented for your convenience - it does not appear in the index file):

1       10        20        30        40        50        60        70        80
+--------+---------+---------+---------+---------+---------+---------+---------+
CHILEAN POTATO-TREE (SOLANUM CRISPUM)
             PLAS$SOLCR     P00297
CHIMPANZEE (PAN SATYRUS)
             HBA3$PANSA     P01935
CHIMPANZEE (PAN TROGLODYTES)
             CD4$PANTR      P16004 HA1A$PANTR     P13748 HA1B$PANTR     P13749
             HA1C$PANTR     P16209 HA1D$PANTR     P16210 HA1E$PANTR     P16215
             HA1M$PANTR     P13750 HA1N$PANTR     P13751 HBAZ$PANTR     P06347
             MBP$PANTR      P06906 MYG$PANTR      P02145 NUO4$PANTR     P03906
             NUO5$PANTR     P03916
+--------+---------+---------+---------+---------+---------+---------+---------+
1       10        20        30        40        50        60        70        80



7.2  Keyword Index

This file lists all keywords which appear in the database (on the KW lines).  It
is  sorted alphabetically on keyword.  An excerpt from the keyword index file is
given below (the ruler is presented for your convenience - it does not appear in
the index file):


1       10        20        30        40        50        60        70        80
+--------+---------+---------+---------+---------+---------+---------+---------+
ACETYLCHOLINE RECEPTOR INHIBITOR
             CXA1$CONGE     P01519 CXA1$CONMA     P01521 CXA1$CONST     P15471
             CXA2$CONGE     P01520
ACIDIC PROTEIN
             143E$BOVIN     P11576 B23$RAT        P13084 B231$HUMAN     P08693
             B232$HUMAN     P06748 BAT$HALHA      P13260 CALQ$CANFA     P12637
             CALQ$RABIT     P07221 CENB$HUMAN     P07199 CMGA$HUMAN     P10645
             CMGA$RAT       P10354 GRPE$ECOLI     P09372 MK16$YEAST     P10962
             NFL$BOVIN      P02548 NO38$CHICK     P16039 NU38$XENLA     P07222
+--------+---------+---------+---------+---------+---------+---------+---------+
1       10        20        30        40        50        60        70        80



7.3  Author Index

This file lists all author names which appear in citations (on the RA lines)  in
the database.  It is sorted alphabetically on name.  Names are presented as they
appear in the database entries (i.e.  as cited in  publications)  -  we  do  not
attempt  to  handle  multiple  surname spellings, or different initials, for the
same author.  An excerpt from the author index file is given below (the ruler is
presented for your convenience - it does not appear in the index file):

1       10        20        30        40        50        60        70        80
+--------+---------+---------+---------+---------+---------+---------+---------+
AAN F.
             PT3G$SALTY     P02908
AARDEN L.A.
             IL6$HUMAN      P05231
AARONSON R.P.
             HEMA$INCCA     P03465
AARONSON S.
             PGDS$HUMAN     P16234
AARONSON S.A.
             3ORF$EIAV1     P11305 DBL$HUMAN      P10911 ENV$EIAV1      P11306
             ENV$SMSAV      P03384 GAG$AVISN      P03342 GAG$MSVMO      P03334
             GAG$SMSAV      P03330 KMOS$MSVMO     P00538 PDGB$HUMAN     P01127
             POL$EIAV1      P11204 POL$MMTVB      P03365 POL$SMSAV      P03359
+--------+---------+---------+---------+---------+---------+---------+---------+
1       10        20        30        40        50        60        70        80



7.4  Short Directory Index

This file contains summary  information  about  all  entries  in  the  database,
including  a  brief description of the entry, its sequence length, molecule type
and data management class.  The file is sorted  alphabetically  on  entry  name.
The lines are fixed-format, laid out as follows:

     Columns   Field Name          Description
     -------   ---------------     -------------------------------------------
     01-10     entry name          left-justified
     14-14     data class          s = standard
                                   u = unannotated
                                   p = preliminary
                                   r = unreviewed
     16-18     molecule type       PRT (protein)
     20-25     sequence length     right-justified
     27-80     description         left-justified


If an entry's description occupies more than 54 characters (cols 27-80), it will
be  continued  onto  one or more continuation lines.  Continuation lines contain
description text (left-justified) in cols  27-80;  cols  01-26  are  blank.   An
excerpt  from  the  short  directory  index  file  is given below (the ruler is
presented for your convenience - it does not appear in the index file):

1       10        20        30        40        50        60        70        80
+--------+---------+---------+---------+---------+---------+---------+---------+
104K$THEPA   p PRT    924 104 KD MICRONEME-RHOPTRY ANTIGEN - THEILERIA PARVA
10KA$MYCTU   s PRT    100 BCG-A HEATSHOCK PROTEIN (10 KD ANTIGEN) -
                          MYCOBACTERIUM TUBERCULOSIS
10KS$HUMAN   s PRT     91 10 KD SECRETORY PROTEIN PRECURSOR - HUMAN (HOMO
                          SAPIENS)
110K$PLAKN   s PRT    296 110 KD ANTIGEN (PK110) (FRAGMENT) - PLASMODIUM
                          KNOWLESI
11KD$ADE02   s PRT     79 11 KD CORE PROTEIN PRECURSOR (LATE L2 MU CORE PROTEIN)
                          (PROTEIN X) - ADENOVIRUS TYPE 2
11SB$CUCMA   s PRT    480 11-S GLOBULIN BETA SUBUNIT PRECURSOR - PUMPKIN
                          (CUCURBITA MAXIMA)
+--------+---------+---------+---------+---------+---------+---------+---------+
1       10        20        30        40        50        60        70        80
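
A minimal sketch of a reader for this layout is shown below.  The column
positions and data class letters are those listed above; the class and function
names are illustrative.  Continuation lines are recognised by columns 01-26
being blank.

```python
from dataclasses import dataclass

DATA_CLASSES = {"s": "standard", "u": "unannotated",
                "p": "preliminary", "r": "unreviewed"}


@dataclass
class DirectoryEntry:
    name: str
    data_class: str
    molecule_type: str
    length: int
    description: str


def parse_short_directory(lines):
    """Parse short directory lines, folding continuation lines together."""
    entries = []
    for line in lines:
        if not line.strip():
            continue
        if line[:26].strip():                      # cols 01-26 in use: new entry
            entries.append(DirectoryEntry(
                name=line[0:10].strip(),           # cols 01-10
                data_class=DATA_CLASSES[line[13]], # col 14
                molecule_type=line[15:18],         # cols 16-18
                length=int(line[19:25]),           # cols 20-25, right-justified
                description=line[26:80].strip(),   # cols 27-80
            ))
        elif entries:                              # continuation of description
            entries[-1].description += " " + line[26:80].strip()
    return entries


SAMPLE = [
    "104K$THEPA   p PRT    924 104 KD MICRONEME-RHOPTRY ANTIGEN - THEILERIA PARVA",
    "10KA$MYCTU   s PRT    100 BCG-A HEATSHOCK PROTEIN (10 KD ANTIGEN) -",
    "                          MYCOBACTERIUM TUBERCULOSIS",
]
entries = parse_short_directory(SAMPLE)
for e in entries:
    print(e.name, e.data_class, e.length, e.description)
```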



7.5  Accession Number Index

This  file  lists  all  accession  numbers  in  the  database.   It  is   sorted
alphabetically  on  accession  number.  Each accession number is followed by the
name and primary accession number of every entry in which it occurs.

The lines are fixed-format, laid out exactly the same as the other index files;
the only difference is that the index key (accession number) appears on the same
line (in cols 1-6) as the list of entries which contain the key.  In  the  other
index  files,  the  index  key  appears  on  a  line by itself.  This index key,
however, is short enough to fit on the same line as the  entries,  and  we  have
done this to save space.

Accession numbers which have been deleted from the database also appear in  this
index, containing the word DELETED (left-justified) in the entry name field.

An excerpt from the accession number index file is given  below  (the  ruler  is
presented for your convenience - it does not appear in the index file):

1       10        20        30        40        50        60        70        80
+--------+---------+---------+---------+---------+---------+---------+---------+
P00835       ATPE$MAIZE     P00835
P00836       ATPE$HORVU     P00836
P00837       ATPG$ECOLI     P00837
P00838       ATPG$ECOLI     P00837
P00839       ATPL$BOVIN     P00839 ATPM$BOVIN     P07926
P00840       ATP9$MAIZE     P00840
P00841       ATP9$YEAST     P00841
P00842       ATPL$NEUCR     P00842
+--------+---------+---------+---------+---------+---------+---------+---------+
1       10        20        30        40        50        60        70        80
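
One use of this index is to resolve any accession number, primary or not, to
the entries that carry it.  The sketch below assumes the layout described
above (key in cols 1-6, then the usual name and accession columns) and uses
lines from the excerpt; the function names are illustrative, and the handling
of the word DELETED follows the description above rather than a sample line.

```python
def parse_accession_line(line):
    """Return (accession key, [(entry name, primary accession), ...])."""
    key = line[0:6]                                  # cols 1-6
    pairs = []
    # 0-based starts of the name (10 wide) and accession (6 wide) fields.
    for name_start, acc_start in ((13, 28), (35, 50), (57, 72)):
        name = line[name_start:name_start + 10].strip()
        if name:
            pairs.append((name, line[acc_start:acc_start + 6].strip()))
    return key, pairs


def build_lookup(lines):
    """Map each accession number to its entries; deleted ones map to []."""
    lookup = {}
    for line in lines:
        if not line.strip():
            continue
        key, pairs = parse_accession_line(line)
        lookup[key] = [p for p in pairs if p[0] != "DELETED"]
    return lookup


SAMPLE = [
    "P00837       ATPG$ECOLI     P00837",
    "P00838       ATPG$ECOLI     P00837",
    "P00839       ATPL$BOVIN     P00839 ATPM$BOVIN     P07926",
]
lookup = build_lookup(SAMPLE)
# P00838 is not a primary accession number: it points to ATPG$ECOLI (P00837).
print(lookup["P00838"])
```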



7.6  Citation Index

This file lists all journal citations which  appear  in  the  database.   It  is
sorted  alphabetically  on citation.  An excerpt from the citation index file is
given below (the ruler is presented for your convenience - it does not appear in
the index file):

1       10        20        30        40        50        60        70        80
+--------+---------+---------+---------+---------+---------+---------+---------+
ANN. GENET. SEL. ANIM. 4:515-521(1972)
             CASK$BOVIN     P02668
ANN. INST. PASTEUR IMMUNOL. 127C:261-271(1976)
             KV3F$HUMAN     P01624
ANN. INST. PASTEUR IMMUNOL. 132D:77-88(1981)
             HV41$MOUSE     P01811
ANN. N.Y. ACAD. SCI. 165:360-377(1969)
             HBB$ATEGE      P02034
ANN. N.Y. ACAD. SCI. 241:436-438(1974)
             HBB$RABIT      P02057
ANN. N.Y. ACAD. SCI. 356:1-13(1980)
             CALM$METSE     P02596
+--------+---------+---------+---------+---------+---------+---------+---------+
1       10        20        30        40        50        60        70        80



7.7  Species ID Code Index

This file lists all species id codes which appear in the database.  It is sorted
alphabetically  on species id code.  Each code is followed by a (sorted) list of
all sequence entry codes which are associated with the  species  id  code.   The
sequence  entry  code is the first component of each entry name (before the "$")
and the species id code is the second component of each entry  name  (after  the
"$").   For  example, the entry called COA3$ADEA2 has a species id code of ADEA2
and a sequence entry code of COA3.

Each species id code starts on a  new  line,  occupying  columns  1-5,  and  the
associated  sequence entry codes (each up to 4 characters long) start in columns
10, 15, 20, 25, 30, 35, ...  70, 75.  If there are more than 14  sequence  entry
codes  for  a  given species id, as many continuation lines as necessary will be
used, with columns 1-9 left blank.

An excerpt from the species id code index file is  given  below  (the  ruler  is
presented for your convenience - it does not appear in the index file):

1       10        20        30        40        50        60        70        80
+--------+---------+---------+---------+---------+---------+---------+---------+
ACEME    PER  RBS1 RBS2 RBS3 RBS4 RBS5
ACENE    CYC
ACHKL    CALM
ACHLY    API
ACIBA    KKA
ACICA    BEND CATA CATM DHGA DHGB ELH2 MURO PQQ1 PQQ2 PQQ3 PQQ5 PQQL PQQR TRPC
         TRPD TRPG
ACIFE    HGDA HGDB YHGD
ACIGL    ASPQ
ACIGU    PRT1
+--------+---------+---------+---------+---------+---------+---------+---------+
1       10        20        30        40        50        60        70        80
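
The slot layout described above (species id code in cols 1-5, then up to 14
sequence entry codes starting in cols 10, 15, ... 75) could be read as
sketched below.  The function names are illustrative; the sample lines come
from the excerpt.

```python
def split_entry_name(entry_name):
    """Split an entry name such as 'COA3$ADEA2' into (entry code, species code)."""
    entry_code, _, species_code = entry_name.partition("$")
    return entry_code, species_code


def parse_species_code_index(lines):
    """Map each species id code to its list of sequence entry codes."""
    index, current = {}, None
    for line in lines:
        if not line.strip():
            continue
        code = line[0:5].strip()          # cols 1-5; blank on continuation lines
        if code:
            current = code
            index[current] = []
        if current is None:
            continue
        # 14 slots, 5 columns apart, starting at column 10 (0-based index 9).
        for start in range(9, 79, 5):
            entry_code = line[start:start + 4].strip()
            if entry_code:
                index[current].append(entry_code)
    return index


SAMPLE = [
    "ACENE    CYC",
    "ACICA    BEND CATA CATM DHGA DHGB ELH2 MURO PQQ1 PQQ2 PQQ3 PQQ5 PQQL PQQR TRPC",
    "         TRPD TRPG",
    "ACIFE    HGDA HGDB YHGD",
]
index = parse_species_code_index(SAMPLE)
# Rebuild full entry names from the two components.
names = [f"{code}${species}" for species, codes in index.items() for code in codes]
print(len(index["ACICA"]), names[:3])
```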



8  WE NEED YOUR HELP !

We welcome  any  feedback  from  our  users.   We  would  especially  appreciate
information  about  any sequences belonging to your field of expertise which are
missing from the database.  We also would like to be notified about  annotations
which  should be updated (e.g.  if the function of a protein has been clarified,
or if new post-translational information has become available).



                                   APPENDIX A

                                SOME STATISTICS



A.1  AMINO ACID COMPOSITION

Composition in percent for the complete database:

     Ala (A) 7.69   Gln (Q) 4.11   Leu (L) 9.10   Ser (S) 7.06
     Arg (R) 5.21   Glu (E) 6.30   Lys (K) 5.86   Thr (T) 5.84
     Asn (N) 4.43   Gly (G) 7.17   Met (M) 2.30   Trp (W) 1.32
     Asp (D) 5.24   His (H) 2.26   Phe (F) 3.95   Tyr (Y) 3.20
     Cys (C) 1.83   Ile (I) 5.39   Pro (P) 5.12   Val (V) 6.48
     Asx (B) 0.01   Glx (Z) 0.01   Xaa (X) 0.04


Classification of the amino acids by their frequency:

     Leu, Ala, Gly, Ser, Val, Glu, Lys, Thr, Ile, Asp, Arg, Pro, Asn, Gln,
     Phe, Tyr, Met, His, Cys, Trp
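
The classification above is simply the composition table sorted by decreasing
percentage.  The sketch below reproduces it from the figures given for the 20
standard amino acids (Asx, Glx and Xaa are left out); the variable names are
illustrative.

```python
# Percent composition for the 20 standard amino acids, from the table above.
composition = {
    "Ala": 7.69, "Gln": 4.11, "Leu": 9.10, "Ser": 7.06,
    "Arg": 5.21, "Glu": 6.30, "Lys": 5.86, "Thr": 5.84,
    "Asn": 4.43, "Gly": 7.17, "Met": 2.30, "Trp": 1.32,
    "Asp": 5.24, "His": 2.26, "Phe": 3.95, "Tyr": 3.20,
    "Cys": 1.83, "Ile": 5.39, "Pro": 5.12, "Val": 6.48,
}

# Sort residue names by decreasing frequency.
ranking = sorted(composition, key=composition.get, reverse=True)
print(", ".join(ranking))
```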


A.2  DISTRIBUTION OF SEQUENCES BY SPECIES

Total number of species represented in this release of the database:  2192.

     Species represented 1x: 964
                         2x: 413
                         3x: 224
                         4x: 127
                         5x:  96
                         6x:  76
                         7x:  40
                         8x:  29
                         9x:  47
                        10x:  24
                    11- 20x:  75
                    21-100x:  60
                      >100x:  17

Table of the most common species:

     Number   Frequency          Species

          1        1336          Human
          2        1150          Escherichia coli
          3         763          Mouse
          4         652          Rat
          5         501          Baker's yeast (Saccharomyces cerevisiae)
          6         387          Bovine
          7         263          Fruit fly (Drosophila melanogaster)
          8         253          Chicken
          9         210          Rabbit
         10         176          Pig
         11         169          Bacillus subtilis
         12         146          African clawed frog (Xenopus laevis)
         13         134          Salmonella typhimurium
         14         132          Bacteriophage T4
         15         118          Maize
         16         104          Rice
                    104          Tobacco
         18          85          Wheat
         19          84          Liverwort (Marchantia polymorpha)
         20          77          Pea
         21          75          Slime mold (Dictyostelium discoideum)
                     75          Spinach
                     75          Staphylococcus aureus
         24          74          Soybean
         25          73          Vaccinia virus



A.3  DISTRIBUTION OF SEQUENCES BY LENGTH


     From   To  Number             From   To   Number

        1-  50    1012             1001-1100      131
       51- 100    1810             1101-1200       84
      101- 150    2648             1201-1300       68
      151- 200    1546             1301-1400       46
      201- 250    1233             1401-1500       35
      251- 300    1093             1501-1600       17
      301- 350     929             1601-1700       20
      351- 400     939             1701-1800       16
      401- 450     699             1801-1900       13
      451- 500     786             1901-2000       18
      501- 550     601             2001-2100        7
      551- 600     381             2101-2200       19
      601- 650     279             2201-2300       19
      651- 700     199             2301-2400       10
      701- 750     196             2401-2500        9
      751- 800     125             >2500           29
      801- 850     115
      851- 900     134
      901- 950      71
      951-1000      72

Currently the five largest sequences are:

     RYNR$RABIT  5037 a.a.
     APB$HUMAN   4563 a.a.
     APOA$HUMAN  4548 a.a.
     DMD$HUMAN   3685 a.a.
     DMD$CHICK   3660 a.a.
  

Swiss-Prot release 13.0

Published January 18, 1990

             SWISS-PROT RELEASE 13.0 RELEASE NOTES


   Date:     January 18, 1990
   Author:   A. Bairoch


                               1. INTRODUCTION
   1.1  Evolution

   Release 13.0  of SWISS-PROT  contains 13837 sequence entries, comprising
   4'347'336 amino  acids abstracted from 13560 references. This represents
   an increase of 14% over release 12.0. The recent growth of the data bank
   is summarized below:

   Release    Date   Number of entries     Nb of amino acids

   3.0        11/86               4160               969 641
   4.0        04/87               4387             1 036 010
   5.0        09/87               5205             1 327 683
   6.0        01/88               6102             1 653 982
   7.0        04/88               6821             1 885 771
   8.0        08/88               7724             2 224 465
   9.0        11/88               8702             2 498 140
   10.0       03/89              10008             2 952 613
   11.0       07/89              10856             3 265 966
   12.0       10/89              12305             3 797 482
   13.0       01/90              13837             4 347 336
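
   As a worked example of the growth figure quoted above, the following
   Python sketch recomputes the increase between releases 12.0 and 13.0
   from the last two rows of the table. The function name percent_increase
   is illustrative only. The notes do not say which column the quoted
   percentage is based on; it appears to match the amino acid count.

```python
# Figures copied from the growth table above (releases 12.0 and 13.0).
ENTRIES = {"12.0": 12305, "13.0": 13837}
AMINO_ACIDS = {"12.0": 3797482, "13.0": 4347336}


def percent_increase(old, new):
    """Return the relative increase from old to new, in percent."""
    return (new - old) / old * 100.0


entries_growth = percent_increase(ENTRIES["12.0"], ENTRIES["13.0"])
residues_growth = percent_increase(AMINO_ACIDS["12.0"], AMINO_ACIDS["13.0"])

print(f"entries:     {entries_growth:.1f}%")
print(f"amino acids: {residues_growth:.1f}%")
```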


   1.2  Source of data

   Release 13.0  has been  updated using protein sequence data from release
   22.0 of  the PIR (Protein Identification Resource) protein data bank, as
   well as translation of nucleotide sequence data from release 21.0 of the
   EMBL Nucleotide Sequence Data Library.

   As an indication of the source of the sequence data in the SWISS-PROT
   data bank, we list here the statistics concerning the DR (Databank
   Reference) pointer lines:

   Entries with pointer(s) to only PIR entri(es):           3104
   Entries with pointer(s) to only EMBL entri(es):          5894
   Entries with pointer(s) to both EMBL and PIR entri(es):  3989
   Entries with no pointer lines (entered in house):         850
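
   The four categories above can be derived mechanically from the DR lines
   of each entry. The Python sketch below shows one way to do so. It
   assumes that a DR line names the database in its first field, followed
   by a semicolon; the sample entries and their identifiers are invented
   for illustration, and pointers to other databases are ignored.

```python
from collections import Counter


def classify_entry(dr_lines):
    """Classify one entry by the databases its DR lines point to."""
    # Assumed layout: 'DR   <DATABASE>; <identifier>; ...'
    databases = {line[5:].split(";")[0].strip() for line in dr_lines}
    has_pir = "PIR" in databases
    has_embl = "EMBL" in databases
    if has_pir and has_embl:
        return "both"
    if has_pir:
        return "PIR only"
    if has_embl:
        return "EMBL only"
    return "none"


# Hypothetical entries, invented for illustration only.
sample_entries = [
    ["DR   EMBL; X00001; EXAMPLE1."],
    ["DR   PIR; A00001; EXAMPLE2."],
    ["DR   EMBL; X00002; EXAMPLE3.", "DR   PIR; A00002; EXAMPLE3."],
    [],
]
tally = Counter(classify_entry(lines) for lines in sample_entries)

# The four published categories add up to the number of entries in 1.1.
published = {"PIR only": 3104, "EMBL only": 5894, "both": 3989, "none": 850}
total = sum(published.values())
```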


      2. DESCRIPTION OF THE CHANGES MADE TO SWISS-PROT SINCE RELEASE 11

   As the last SWISS-PROT release to be distributed to PC/Gene users was
   release 11, we list here the changes that were made in both release 12
   and release 13.


   2.1  Sequences and annotations

   Almost 3000  sequences have  been added  since release  11, the sequence
   data of  380 existing  entries has  been updated  and the annotations of
   4900 entries have been revised. In particular, we have used review
   articles to update the annotations of the following groups or families
   of proteins:

   -  11-S plant seed storage proteins
   -  3'5'-cyclic nucleotide phosphodiesterases
   -  Acyl carrier proteins
   -  Aldehyde dehydrogenases
   -  Aminoacyl-transfer RNA synthetases
   -  AraC family bacterial transcription regulation proteins
   -  Arginases
   -  Bacteriophage P22 proteins
   -  Bacteriophage T4 proteins
   -  Band 3 protein family
   -  Biotin-requiring enzymes
   -  Chloroplast photosystems I and II proteins
   -  Creatine kinases
   -  Crp family bacterial activator proteins
   -  Cyclins
   -  Dehydrins
   -  DNA mismatch repair proteins
   -  Endothelins / sarafotoxins
   -  Enolases
   -  Eukaryotic thiol proteases
   -  Extradiol ring-cleavage dioxygenases
   -  Flavodoxins
   -  Globins, annelids
   -  Globins, molluscs
   -  Glucose-6-phosphate dehydrogenases
   -  Glutaredoxins
   -  Glutamate dehydrogenases
   -  Glycerate and 3-phosphoglycerate dehydrogenases
   -  Glycophorins
   -  Granzymes
   -  GTP-binding elongation factors
   -  Heat-labile enterotoxins
   -  Heat shock hsp90 proteins
   -  Insulin family proteins
   -  Insulin-like growth factor binding proteins
   -  Insect larval cuticle proteins
   -  Insect-type alcohol dehydrogenases / ribitol dehydrogenase family
   -  Integrins
   -  Interleukin 7
   -  Iron-containing alcohol dehydrogenases
   -  Lysosome-associated membrane glycoproteins
   -  LysR family bacterial activator proteins
   -  L-lactate dehydrogenases
   -  Macrolide-lincosamide-streptogramin B resistance proteins
   -  Malate dehydrogenase
   -  Mammalian defensins
   -  MHC class II proteins
   -  Mitochondrial energy transfer proteins
   -  Myc-type proteins
   -  Nerve growth factors
   -  N-4 cytosine-specific DNA methylases
   -  Peroxidases
   -  Phosphoglucose isomerases
   -  Phosphoglycerate kinases
   -  Phospholipases A2
   -  Picornaviruses genome polyproteins
   -  Ribosomal proteins
   -  Rubredoxins
   -  Serine hydroxymethyltransferases
   -  Serine/threonine specific protein phosphatases
   -  Staphylococcal enterotoxins / Streptococcal pyrogenic exotoxins
   -  Sugar transporters
   -  Thaumatin family proteins
   -  Transferrins
   -  Tryptophan synthase alpha and beta chains
   -  Tyrosine protein kinases
   -  Uracil-DNA glycosylases
   -  Vertebrate galactoside-binding lectins
   -  Zinc-containing alcohol dehydrogenases


   2.2  New line-type

   Release 12  introduced a  new type  of data  line, the  OG line.  The OG
   (OrGanelle) lines  indicate  whether  the  gene  coding  for  a  protein
   originates from  the mitochondria,  the chloroplast,  or a  plasmid. The
   format for the OG line is:

   OG   CHLOROPLAST.
   OG   MITOCHONDRION.
   OG   PLASMID name.

   Where 'name' is the name of the plasmid.

   Previously this  information was  stored in the OS line, as shown in the
   example below.

   OS   WHEAT (TRITICUM AESTIVUM) CHLOROPLAST.

   The above example is now stored as:

   OS   WHEAT (TRITICUM AESTIVUM).
   OG   CHLOROPLAST.
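
   The conversion illustrated above can be sketched in Python as follows.
   The function name split_os_line is hypothetical, and the sketch assumes
   that the organelle was always the last word of the old OS line. Only
   the chloroplast form is shown in these notes; the mitochondrion form is
   assumed to be analogous, and plasmids are left out because their old
   representation is not described here.

```python
import re

# Organelle words assumed to have been appended to the old OS line.
_OLD_OS = re.compile(
    r"^OS   (?P<species>.+?) (?P<organelle>CHLOROPLAST|MITOCHONDRION)\.$"
)


def split_os_line(line):
    """Convert an old-style OS line into the OS + OG pair."""
    match = _OLD_OS.match(line)
    if match is None:
        return [line]  # nothing to move; keep the line unchanged
    return [f"OS   {match['species']}.", f"OG   {match['organelle']}."]


converted = split_os_line("OS   WHEAT (TRITICUM AESTIVUM) CHLOROPLAST.")
for new_line in converted:
    print(new_line)
```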


   2.3  New topic for the comments (CC) line type

   As of release 12 we have added a new 'topic' for the comments (CC) line-
   type: CAUTION.  It is  used to  indicate  that  possible  errors  and/or
   grounds for confusion may exist. Example of its usage:

   CC   -!- CAUTION: ALSO SEE VERSION 2 OF THIS PROTEIN THAT DIFFERS DUE
   CC       TO A FRAMESHIFT.
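
   A program reading the CC lines can pick out comments by topic. The
   Python sketch below is one possible approach, based only on the layout
   of the example above: a comment block starts with '-!- TOPIC:' and any
   further CC lines continue it. The function name parse_comments is
   hypothetical.

```python
def parse_comments(cc_lines):
    """Group CC lines into (topic, text) pairs."""
    comments = []
    for line in cc_lines:
        body = line[5:]                  # drop the 'CC   ' line code
        if body.startswith("-!- "):      # start of a new comment block
            topic, _, text = body[4:].partition(": ")
            comments.append([topic, text.strip()])
        elif comments:                   # continuation of the current block
            comments[-1][1] += " " + body.strip()
    return [tuple(comment) for comment in comments]


example = [
    "CC   -!- CAUTION: ALSO SEE VERSION 2 OF THIS PROTEIN THAT DIFFERS DUE",
    "CC       TO A FRAMESHIFT.",
]
cautions = [text for topic, text in parse_comments(example) if topic == "CAUTION"]
```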


   2.4  Small change in the RL lines for submissions

   Previously, RL lines for data submitted to EMBL or GenBank were
   represented by two subtypes, as illustrated in the following examples:

   RL   SUBMITTED (OCT-1989) TO THE EMBL DATA LIBRARY.
   RL   SUBMITTED (OCT-1989) TO GENBANK.

   Starting with  release 13,  all these  lines are  now in  the  following
   format:

   RL   SUBMITTED (OCT-1989) TO EMBL/GENBANK DATA BANKS.
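
   Software that still holds RL lines in either of the two older forms
   could bring them into the unified format along the following lines.
   This is a minimal Python sketch; the function name normalize_rl is
   hypothetical, and the date pattern (three letters, a hyphen and four
   digits) is inferred from the examples above.

```python
import re

# Matches the two older forms and captures the submission date.
_OLD_RL = re.compile(
    r"^RL   SUBMITTED \((?P<date>[A-Z]{3}-\d{4})\) TO "
    r"(?:THE EMBL DATA LIBRARY|GENBANK)\.$"
)


def normalize_rl(line):
    """Rewrite an old-style submission RL line in the unified format."""
    match = _OLD_RL.match(line)
    if match is None:
        return line  # other RL lines pass through unchanged
    return f"RL   SUBMITTED ({match['date']}) TO EMBL/GENBANK DATA BANKS."


old_lines = [
    "RL   SUBMITTED (OCT-1989) TO THE EMBL DATA LIBRARY.",
    "RL   SUBMITTED (OCT-1989) TO GENBANK.",
]
new_lines = [normalize_rl(line) for line in old_lines]
```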


   2.5  Documentation changes

   -  ACINDEX.TXT is  a new  document file  which is  an index  of all  the
      accession numbers  which appear  in SWISS-PROT  and the  name of  the
      entries in which they occur.
   -  PDBTOSP.TXT is  a new  document file  which is  an index  of all  the
      Brookhaven PDB entries referenced in SWISS-PROT.
   -  The JOURLIST.TXT document now indicates the abbreviation and the full
      names of all journals cited in SWISS-PROT.

   Important: for more detailed information concerning the SWISS-PROT
   documentation, please consult appendix B of these release notes.



     3. IMPORTANT NOTES CONCERNING SWISS-PROT RELEASE 13 AND PC/GENE 6.01

   3.1  The ryanodine receptor

   The rabbit skeletal muscle ryanodine receptor (RYNR$RABIT) is a protein
   of 5037 amino acid residues. PC/Gene release 6.01 can only analyze
   proteins of up to 5000 residues. This limit will be raised in the next
   major version (6.50). Until that version is released, we deal with this
   protein in the following way: the sequence entry RYNR$RABIT has been
   split into two parts. RYN1$RABIT contains the first 4563 residues (which
   correspond to the cytoplasmic domain), and RYN2$RABIT contains residues
   4561 to 5037.

   Note: due  to this  modification there are 13838 sequence entries in the
   PC/Gene version  of the SWISS-PROT data bank (instead of 13837 as listed
   in section 1.1 of these release notes).

   3.2  The OG line

   The OG  line-type introduced  in release  12 (see  section 2.2)  is  not
   supported by  release 6.01  of PC/Gene.  This means  that although these
   lines are present in the SWISS-PROT data base (either on the CD-ROM disk
   or on the bulk files on the floppy disks), you cannot make use of them.
   Release 6.50 will fully support OG lines.


   Note to  CD-ROM users:  library files  containing the  names of  all the
   sequences  which   originate  from   either  the   mitochondria  or  the
   chloroplast are available on the CD-ROM (for more details see the CD-ROM
   release notes).



                            4. WE NEED YOUR HELP !

   We welcome feedback from our users. We would especially appreciate it if
   you notified us of sequences belonging to your field of expertise that
   are missing from the data bank. We would also like to be notified about
   annotations that need updating, for example if the function of a protein
   has been clarified or if new post-translational information has become
   available.



                         APPENDIX A: SOME STATISTICS


   A.1  Amino acid composition

        A.1.1  Composition in percent for the complete data bank

   Ala (A) 7.69   Gln (Q) 4.10   Leu (L) 9.11   Ser (S) 7.03
   Arg (R) 5.22   Glu (E) 6.29   Lys (K) 5.87   Thr (T) 5.84
   Asn (N) 4.42   Gly (G) 7.18   Met (M) 2.30   Trp (W) 1.33
   Asp (D) 5.23   His (H) 2.27   Phe (F) 3.93   Tyr (Y) 3.20
   Cys (C) 1.83   Ile (I) 5.37   Pro (P) 5.13   Val (V) 6.49

   Asx (B) 0.01   Glx (Z) 0.01   Xaa (X) 0.04


        A.1.2  Classification of the amino acids by their frequency

   Leu, Ala, Gly, Ser, Val, Glu, Lys, Thr, Ile, Asp, Arg, Pro, Asn, Gln,
   Phe, Tyr, Met, His, Cys, Trp
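
   The classification above follows directly from the percentages in
   A.1.1. As a small Python sketch, sorting the twenty standard amino
   acids by decreasing percentage reproduces the same order (the
   dictionary name COMPOSITION is illustrative only):

```python
# Percentages copied from table A.1.1 (standard amino acids only).
COMPOSITION = {
    "Ala": 7.69, "Gln": 4.10, "Leu": 9.11, "Ser": 7.03,
    "Arg": 5.22, "Glu": 6.29, "Lys": 5.87, "Thr": 5.84,
    "Asn": 4.42, "Gly": 7.18, "Met": 2.30, "Trp": 1.33,
    "Asp": 5.23, "His": 2.27, "Phe": 3.93, "Tyr": 3.20,
    "Cys": 1.83, "Ile": 5.37, "Pro": 5.13, "Val": 6.49,
}

# Most frequent residue first.
ranking = sorted(COMPOSITION, key=COMPOSITION.get, reverse=True)
print(", ".join(ranking))
```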



   A.2  Distribution of the sequences by their organism of origin

   Total number of species represented in this release of SWISS-PROT: 2032

        A.2.1 Table of the frequency of occurrence of species

        Species represented 1x: 916
                            2x: 389
                            3x: 187
                            4x: 127
                            5x:  87
                            6x:  67
                            7x:  32
                            8x:  28
                            9x:  44
                           10x:  20
                       11- 20x:  66
                       21-100x:  53
                         >100x:  16
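
   A table of this kind can be built by assigning each species to a row
   according to its number of entries. The Python sketch below shows the
   row boundaries used above and checks that the published rows add up to
   the total number of species. The function names are hypothetical; in
   the sample data, the Human and Soybean counts are taken from table
   A.2.2 and the other two species are invented.

```python
from collections import Counter

# Counts published in table A.2.1 for this release.
PUBLISHED = {
    "1x": 916, "2x": 389, "3x": 187, "4x": 127, "5x": 87,
    "6x": 67, "7x": 32, "8x": 28, "9x": 44, "10x": 20,
    "11-20x": 66, "21-100x": 53, ">100x": 16,
}


def bucket(entry_count):
    """Return the A.2.1 row label for a species with entry_count entries."""
    if entry_count < 1:
        raise ValueError("a represented species has at least one entry")
    if entry_count <= 10:
        return f"{entry_count}x"
    if entry_count <= 20:
        return "11-20x"
    if entry_count <= 100:
        return "21-100x"
    return ">100x"


def frequency_table(entries_per_species):
    """Count how many species fall into each row of the table."""
    return Counter(bucket(n) for n in entries_per_species.values())


# 'Species A' and 'Species B' are hypothetical.
sample = {"Human": 1200, "Soybean": 69, "Species A": 1, "Species B": 15}
table = frequency_table(sample)
species_total = sum(PUBLISHED.values())
```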


        A.2.2  Table of the most represented species

    Number   Frequency          Species
         1        1200          Human
         2        1072          Escherichia coli
         3         685          Mouse
         4         576          Rat
         5         443          Baker's yeast (Saccharomyces cerevisiae)
         6         366          Bovine
         7         239          Fruit fly (Drosophila melanogaster)
         8         227          Chicken
         9         195          Rabbit
        10         160          Bacillus subtilis
        11         159          Pig
        12         131          African clawed frog (Xenopus laevis)
        13         122          Bacteriophage T4
        14         112          Salmonella typhimurium
        15         103          Tobacco
        16         102          Maize
        17          92          Rice
        18          84          Liverwort (Marchantia polymorpha)
        19          78          Wheat
        20          74          Staphylococcus aureus
        21          71          Vaccinia Virus
        22          70          Herpes virus (Type 1, strain 17)
                    70          Pea
                    70          Spinach
        25          69          Soybean


   A.3  Distribution of the sequences by size

      From   To  Number             From   To   Number
         1-  50     923             1001-1100      112
        51- 100    1635             1101-1200       72
       101- 150    2474             1201-1300       59
       151- 200    1405             1301-1400       39
       201- 250    1113             1401-1500       33
       251- 300     960             1501-1600       16
       301- 350     842             1601-1700       15
       351- 400     815             1701-1800       13
       401- 450     618             1801-1900        9
       451- 500     678             1901-2000       15
       501- 550     532             2001-2100        6
       551- 600     329             2101-2200       18
       601- 650     243             2201-2300       18
       651- 700     173             2301-2400       10
       701- 750     174             2401-2500        8
       751- 800     110             >2500           26
       801- 850     101
       851- 900     121
       901- 950      60
       951-1000      62

   Currently the five largest sequences are:   RYNR$RABIT  5037 a.a.
                                               APB$HUMAN   4563 a.a.
                                               APOA$HUMAN  4548 a.a.
                                               DMD$HUMAN   3685 a.a.
                                               DMD$CHICK   3660 a.a.
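
   The size classes of table A.3 are 50 residues wide up to 1000 residues,
   100 residues wide from 1001 to 2500, and open-ended above that. The
   Python sketch below maps a sequence length to its class and checks that
   the published counts add up to the number of entries given in section
   1.1. The function name length_bin is hypothetical.

```python
# Counts published in table A.3, in table order.
LEFT_COLUMN = [923, 1635, 2474, 1405, 1113, 960, 842, 815, 618, 678,
               532, 329, 243, 173, 174, 110, 101, 121, 60, 62]
RIGHT_COLUMN = [112, 72, 59, 39, 33, 16, 15, 13, 9, 15,
                6, 18, 18, 10, 8, 26]


def length_bin(length):
    """Return the A.3 size class for a sequence of the given length."""
    if length < 1:
        raise ValueError("sequence length must be positive")
    if length > 2500:
        return ">2500"
    width = 50 if length <= 1000 else 100
    upper = -(-length // width) * width   # round up to the class upper bound
    return f"{upper - width + 1}-{upper}"


sequence_total = sum(LEFT_COLUMN) + sum(RIGHT_COLUMN)
largest_class = length_bin(5037)          # RYNR$RABIT
```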


                          APPENDIX B: DOCUMENTATION

   SWISS-PROT documentation consists of the following items:

   USERMAN .TXT   SWISS-PROT user manual
   SP13_REL.TXT   Release notes (this document)
   SHORTDES.TXT   Short description of entries in  SWISS-PROT (this document
                  contains  the  same  information as  that available  in the
                  catalog file (PROT_CAT.TXT) but it is formatted differently)
   JOURLIST.TXT   List of abbreviations for journals cited
   KEYWLIST.TXT   List of keywords in use
   SPECLIST.TXT   List of organism (species) identification codes
   SPECODES.TXT   List of sequence entry codes classified by species
   ACINDEX .TXT   Accession number index
   AUTINDEX.TXT   Author index
   CITINDEX.TXT   Citation index
   ECINDEX .TXT   Index of enzymes classified by their EC number
   EMBLTOSP.TXT   Index  of the  EMBL Data Library sequences referenced in
                  SWISS-PROT
   PDBTOSP .TXT   Index of Brookhaven PDB entries referenced in SWISS-PROT

   -  All these  document files  are available on the CD-ROM disk. They are
      stored in the '\DOC_DBAS\SPROT' directory.
   -  Except for AUTINDEX.TXT and SHORTDES.TXT, all the document files are
      also stored on the two SWISS-PROT documentation floppy disks.
   -  Some of  these documents  are also distributed in a printed form (see
      table below).

             Document         Documentation   Printed
                              Disk N#         copy
             ----------------------------------------
             USERMAN .TXT     1               Yes
             SP13_REL.TXT     1               Yes
             SHORTDES.TXT     N.A.            [*]
             JOURLIST.TXT     1               Yes
             KEYWLIST.TXT     1               Yes
             SPECLIST.TXT     1               Yes
             SPECODES.TXT     1               No
             ACINDEX .TXT     1               No
             AUTINDEX.TXT     N.A.            No
             CITINDEX.TXT     2               No
             ECINDEX .TXT     1               Yes
             EMBLTOSP.TXT     1               No
             PDBTOSP .TXT     1               Yes

   [*] The content  of the   catalog file  (PROT_CAT.TXT) is  provided in a
       printed form. It contains  the same information as that available in
       SHORTDES.TXT, but it is formatted differently.



                       APPENDIX C: FLOPPY DISK VERSION

   C.1  IBM PC/AT 1.2 Mb disks

   SWISS-PROT release  13 is  stored on  eighteen 1.2  Mb disks.  Each  one
   contains a single bulk file (PRT13_01.BLK to PRT13_18.BLK):

   Disk     First sequence        Last Sequence
    1       10KA$MYCTU            ATCD$RAT
    2       ATCE$PIG              CATL$CHICK
    3       CATL$HUMAN            CRAM$CRAAB
    4       CRB$DROME             DRN1$SHEEP
    5       DRNE$VIBCH            FM19$BACNO
    6       FM1A$ECOLI            H3$CAEEL
    7       H3$CHICK              HPIS$RHOTE
    8       HPIS$THIPF            KABL$MOUSE
    9       KAC$HUMAN             LPID$EDWTA
   10       LPIV$ECOLI            NEF$HIV1R
   11       NEF$HIV1S             PEPA$HUMAN
   12       PEPA$MACFU            PRTS$HUMAN
   13       PRTS$SERMA            RL7$DICDI
   14       RL7$ECOLI             SST2$YEAST
   15       ST12$YEAST            TRY1$ECOLI
   16       TRY1$HUMAN            VIP$GADMO
   17       VIP$GOAT              YP3$CHLTR
   18       YP4$CHLTR             ZP3$MOUSE
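
   Given the table above, the bulk file holding a particular entry can be
   found with a binary search on the first entry name of each disk. The
   Python sketch below assumes that entries are sorted in plain character
   code order, which is consistent with the boundaries listed; the
   function name bulk_file_for is hypothetical.

```python
from bisect import bisect_right

# First entry of each disk, copied from the table above (disks 1 to 18).
FIRST_ENTRY = [
    "10KA$MYCTU", "ATCE$PIG", "CATL$HUMAN", "CRB$DROME", "DRNE$VIBCH",
    "FM1A$ECOLI", "H3$CHICK", "HPIS$THIPF", "KAC$HUMAN", "LPIV$ECOLI",
    "NEF$HIV1S", "PEPA$MACFU", "PRTS$SERMA", "RL7$ECOLI", "ST12$YEAST",
    "TRY1$HUMAN", "VIP$GOAT", "YP4$CHLTR",
]
LAST_ENTRY = "ZP3$MOUSE"


def bulk_file_for(entry_name):
    """Return the name of the bulk file expected to hold entry_name."""
    if not FIRST_ENTRY[0] <= entry_name <= LAST_ENTRY:
        raise ValueError(f"{entry_name} is outside the range of the disks")
    disk = bisect_right(FIRST_ENTRY, entry_name)   # 1-based disk number
    return f"PRT13_{disk:02d}.BLK"


print(bulk_file_for("DMD$HUMAN"))
```

   The lookup only tells where an entry would be stored; it does not
   check that the entry actually exists in the data bank.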


   C.2  IBM PS/2 1.4 Mb disks

   The number and content of the 1.4 Mb disks for the PS/2 systems are
   identical to those of the 1.2 Mb disks (see above).

   C.3  Catalog file

   The SWISS-PROT  catalog file for PC/Gene (PROT_CAT.TXT) is stored on two
   disks (CATALOG  disks 1  and 2).  Insert the  first disk  in your floppy
   drive and type: INSCAT. Follow the program instructions; you will be
   prompted to insert the second disk once the contents of the first one
   have been copied.

   C.4  Documentation disks

   There are  two documentation  disks. The  content of  these two disks is
   described in appendix B.
  

Swiss-Prot release 12.0

Published October 14, 1989

             SWISS-PROT RELEASE 12.0 RELEASE NOTES


   Date:     October 14, 1989
   Author:   A. Bairoch


                         1. INTRODUCTION

   1.1  Evolution

   Release 12.0  of SWISS-PROT  contains 12305 sequence entries, comprising
   3'797'482 amino  acids abstracted from 12147 references. This represents
   an increase of 16% over release 11.0. The recent growth of the data bank
   is summarized below:

   Release    Date   Number of entries     Nb of amino acids

   3.0        11/86               4160               969 641
   4.0        04/87               4387             1 036 010
   5.0        09/87               5205             1 327 683
   6.0        01/88               6102             1 653 982
   7.0        04/88               6821             1 885 771
   8.0        08/88               7724             2 224 465
   9.0        11/88               8702             2 498 140
   10.0       03/89              10008             2 952 613
   11.0       07/89              10856             3 265 966
   12.0       10/89              12305             3 797 482


   1.2  Source of data

   Release 12.0  has been  updated using protein sequence data from release
   21.0 of  the PIR (Protein Identification Resource) protein data bank, as
   well as translation of nucleotide sequence data from release 20.0 of the
   EMBL Nucleotide Sequence Data Library.

   As an indication of the source of the sequence data in the SWISS-PROT
   data bank, we list here the statistics concerning the DR (Databank
   Reference) pointer lines:

   Entries with pointer(s) to only PIR entri(es):           3125
   Entries with pointer(s) to only EMBL entri(es):          4873
   Entries with pointer(s) to both EMBL and PIR entri(es):  3575
   Entries with no pointer lines (entered in house):         732



      2. DESCRIPTION OF THE CHANGES MADE TO SWISS-PROT SINCE RELEASE 11

   2.1  Sequences and annotations

   Some 1466  new sequences  have been  added since  the last  release, the
   sequence  data  of  173  existing  entries  has  been  updated  and  the
   annotations of  2400 entries  have been  revised. In  particular we have
   used review articles to update the annotations of the following groups
   or families of proteins:

      Acyl carrier proteins
      Aminoacyl-transfer RNA synthetases
      Biotin-requiring enzymes
      Chloroplast photosystems I and II proteins
      Creatine kinases
      Crp bacterial activator proteins
      Enolases
      Glucose-6-phosphate dehydrogenases
      Glutaredoxins
      GTP-binding elongation factors
      Heat shock hsp90 proteins
      Insulin family proteins
      Insulin-like growth factor binding proteins
      Insect-type alcohol dehydrogenases / ribitol dehydrogenase family
      Integrins
      Iron-containing alcohol dehydrogenases
      LysR bacterial activator proteins
      Malate dehydrogenase
      Mammalian defensins
      Mitochondrial energy transfer proteins
      Myc-type proteins
      Phosphoglucose isomerases
      Phosphoglycerate kinases
      Serine/threonine specific protein phosphatases
      Sugar transporters
      Uracil-DNA glycosylases
      Vertebrate galactoside-binding lectins
      Zinc-containing alcohol dehydrogenases


   2.2  New line-type

   This release introduces a new type of data line, the OG line. The OG
   (OrGanelle) lines indicate whether the gene coding for a protein
   originates from the mitochondria, the chloroplast, or a plasmid. The
   format for the OG line is:

   OG   CHLOROPLAST.
   OG   MITOCHONDRION.
   OG   PLASMID name.

   Where 'name' is the name of the plasmid.

   Previously this  information was  stored in the OS line, as shown in the
   example below.

   OS   WHEAT (TRITICUM AESTIVUM) CHLOROPLAST.

   The above example will now be stored as:

   OS   WHEAT (TRITICUM AESTIVUM).
   OG   CHLOROPLAST.


   2.3  New topic for the comments (CC) line type

   As of release 12 we have added a new 'topic' for the comments (CC) line-
   type: CAUTION,  which is  used to  warn  about  possible  errors  and/or
   grounds for confusion. Example of its usage:

     CC   -!- CAUTION: ALSO SEE VERSION 2 OF THIS PROTEIN THAT DIFFERS DUE
     CC       TO A FRAMESHIFT.


   2.4  Documentation changes

   -  ACINDEX.TXT is  a new  document file  which is  an index  of all  the
      accession numbers  which appear  in SWISS-PROT  and the  name of  the
      entries in which they occur.
   -  PDBTOSP.TXT is  a new  document file  which is  an index  of all  the
      Brookhaven PDB entries referenced in SWISS-PROT.
   -  The JOURLIST.TXT document now indicates the abbreviation and the full
      names of all journals cited in SWISS-PROT.


                             3. THE NEXT RELEASE

   SWISS-PROT release 13.0 will be available in January 1990.

   Starting with  release 13 SWISS-PROT will be distributed with PROSITE, a
   data bank  of sites  and patterns  in proteins.  Both data banks will be
   fully cross-referenced.



                            4. WE NEED YOUR HELP !

   We welcome any feedback from our users. We would especially appreciate
   it if you notified us of sequences belonging to your field of expertise
   that are missing from the data bank. We would also like to be notified
   about annotations that need updating, for example if the function of a
   protein has been clarified or if new post-translational information has
   become available.



                         APPENDIX A: SOME STATISTICS

   A.1  Amino acid composition

        A.1.1  Composition in percent for the complete data bank

   Ala (A) 7.70   Gln (Q) 4.10   Leu (L) 9.12   Ser (S) 7.03
   Arg (R) 5.21   Glu (E) 6.23   Lys (K) 5.85   Thr (T) 5.85
   Asn (N) 4.39   Gly (G) 7.21   Met (M) 2.29   Trp (W) 1.34
   Asp (D) 5.21   His (H) 2.27   Phe (F) 3.95   Tyr (Y) 3.21
   Cys (C) 1.85   Ile (I) 5.38   Pro (P) 5.14   Val (V) 6.50

   Asx (B) 0.01   Glx (Z) 0.01   Xaa (X) 0.03


        A.1.2  Classification of the amino acids by their frequency

   Leu, Ala, Gly, Ser, Val, Glu, Thr = Lys, Ile, Arg = Asp, Pro, Asn, Gln,
   Phe, Tyr, Met, His, Cys, Trp



   A.2  Distribution of the sequences by their organism of origin

   Total number of species represented in this release of SWISS-PROT: 1841

        A.2.1 Table of the frequency of occurrence of species

        Species represented 1x: 834
                            2x: 345
                            3x: 180
                            4x: 117
                            5x:  73
                            6x:  60
                            7x:  27
                            8x:  29
                            9x:  38
                           10x:  16
                       11- 20x:  61
                       21-100x:  49
                         >100x:  12



        A.2.2  Table of the most represented species

    Number   Frequency          Species
         1        1093          Human
         2         951          Escherichia coli
         3         621          Mouse
         4         519          Rat
         5         397          Baker's yeast (Saccharomyces cerevisiae)
         6         337          Bovine
         7         210          Fruit fly (Drosophila melanogaster)
         8         205          Chicken
         9         174          Rabbit
        10         149          Pig
        11         133          Bacillus subtilis
        12         113          African clawed frog (Xenopus laevis)
        13          98          Tobacco
        14          96          Salmonella typhimurium
        15          94          Maize
        16          89          Rice
        17          84          Liverwort (Marchantia polymorpha)
        18          80          Bacteriophage T4
        19          77          Wheat
        20          70          Herpes virus (Type 1, Strain 17)
        21          69          Spinach
        22          68          Vaccinia Virus
        23          67          Varicella-Zoster virus (Strain Dumas)
        24          63          Soybean
        25          62          Bacteriophage Lambda



   A.3  Distribution of the sequences by size

      From   To  Number             From   To   Number
         1-  50     782             1001-1100       96
        51- 100    1520             1101-1200       63
       101- 150    2297             1201-1300       51
       151- 200    1257             1301-1400       31
       201- 250     978             1401-1500       22
       251- 300     827             1501-1600       13
       301- 350     728             1601-1700       15
       351- 400     708             1701-1800       11
       401- 450     540             1801-1900        9
       451- 500     615             1901-2000       14
       501- 550     476                 >2000       68
       551- 600     282
       601- 650     215
       651- 700     156
       701- 750     130
       751- 800      93
       801- 850      93
       851- 900     112
       901- 950      51
       951-1000      52


   Currently the three largest sequences are:

   RYNR$RABIT  5037 a.a.
   APB$HUMAN   4563 a.a.
   APOA$HUMAN  4548 a.a.



                       APPENDIX B: DISKS FOR SWISS-PROT

   B.1  IBM PC/AT 1.2 Mb disks

   SWISS-PROT release 12 is stored on sixteen 1.2 Mb disks. Each of these
   disks contains a single bulk file (PRT12_01.BLK to PRT12_16.BLK):

   Disk     First sequence        Last Sequence
    1       10K5$ECOLI            ATP6$YEAST
    2       ATP8$ASPAM            CHLN$ECOLI
    3       CHOA$STRSP            CYC$MIRLE
    4       CYC$MOUSE             FA10$HUMAN
    5       FA11$HUMAN            GTA1$RAT
    6       GTA2$RAT              HMEN$DROME
    7       HMEN$DROVI            KAD1$HUMAN
    8       KAD1$PIG              M4$DICDI
    9       M5$ECOLI              NRAM$INACR
   10       NRAM$INADA            POL$HIV2I
   11       POL$HIV2N             RBS4$LYCES
   12       RBS4$SOYBN            SMS1$HUMAN
   13       SMS1$ICTPU            TRPC$ACICA
   14       TRPC$ASPNG            VIP$HUMAN
   15       VIP$PIG               YU74$ECOLI
   16       YVL1$HCMVA            ZP3$MOUSE


   B.2  IBM PS/2 1.4 Mb disks

   The number and content of the 1.4 Mb disks for the PS/2 systems are
   identical to those of the 1.2 Mb disks (see above).
  

Swiss-Prot release 11.0

Published July 10, 1989

             SWISS-PROT RELEASE 11.0 RELEASE NOTES


   Date:     July 10, 1989
   Author:   A. Bairoch


                         1. INTRODUCTION

   1.1  Evolution

   Release 11.0 of SWISS-PROT contains 10856 sequence entries,
   comprising 3'265'966  amino  acids  abstracted  from  10775
   references. This  represents an increase of 9% over release
   10.0. The  recent growth  of the  data bank  is  summarised
   below:

   Release    Date   Number of entries     Nb of amino acids

   3.0        11/86               4160               969 641
   4.0        04/87               4387             1 036 010
   5.0        09/87               5205             1 327 683
   6.0        01/88               6102             1 653 982
   7.0        04/88               6821             1 885 771
   8.0        08/88               7724             2 224 465
   9.0        11/88               8702             2 498 140
   10.0       03/89              10008             2 952 613
   11.0       07/89              10856             3 265 966



   1.2  Source of data

   Release 11.0  has been  updated using protein sequence data
   from  release  20.0  of  the  PIR  (Protein  Identification
   Resource) protein  data bank,  as well  as  translation  of
   nucleotide sequence  data from  release 19.0  of  the  EMBL
   nucleotide sequence Data Library.

   As an indication of the source of the sequence data in the
   SWISS-PROT data bank, we list here the statistics concerning
   the DR (Databank Reference) pointer lines:

   Entries with pointer(s) to only PIR entri(es):          2992
   Entries with pointer(s) to only EMBL entri(es):         3875
   Entries with pointer(s) to both EMBL and PIR entri(es): 3272
   Entries with no pointer lines (entered in house):        717



     2. DESCRIPTION OF THE CHANGES MADE TO SWISS-PROT SINCE
                           RELEASE 10


   2.1  Sequences and annotations

   Some 848  new sequences  have been  added  since  the  last
   release, the sequence data of 113 existing entries has been
   updated and  the annotations  of  1366  entries  have  been
   revised. In particular, we have used review articles to
   update the  annotations of the following groups or families
   of proteins:

      Adenylate kinases
      Bacterial restriction systems proteins
      Bacterial transduction systems proteins
      Caseins
      Chitin-binding proteins
      Cutinases
      Cytochromes P450
      DNA polymerases
      DNA topoisomerases type I
      Esterases
      Heat shock hsp70 proteins
      Lipases
      Microtubule-associated proteins
      2-oxo acid dehydrogenases complex components
      Paramyxoviruses proteins
      Protein disulfide isomerases
      Purine/pyrimidine phosphoribosyl transferases
      Rhabdoviruses proteins
      Ribonucleotide reductases
      Rotaviruses proteins
      Serine hydroxymethyltransferases
      Small, acid-soluble spore proteins
      Xylose isomerases


   2.2  Standardized journal abbreviations

   Journal  names   are  now   abbreviated  according  to  the
   conventions  used  by  the  National  Library  of  Medicine
   (Washington D.C.,  USA) and  are based  on the existing ISO
   and ANSI  standards. In  most cases  the changes are small,
   and the  new abbreviations  are at  least as  meaningful as
   the old ones. As in previous releases the abbreviations for
   the journals cited in SWISS-PROT are listed in the document
   file JOURLIST.TXT.


   2.3  New feature key

   A new  feature key  has been  introduced in  this  release:
   THIOETH, which  describes  a  thioether  bond  between  two
   residues.



                       3. THE NEXT RELEASE

   SWISS-PROT release 12.0 will be available in November 1989.




                     4. WE NEED YOUR HELP !

   We welcome any feedback from our users. We would especially
   appreciate it if you notified us of sequences belonging to
   your field of expertise that are missing from the data
   bank. We would also like to be notified about annotations
   that need updating, for example if the function of a
   protein has been clarified or if new post-translational
   information has become available.



                   APPENDIX A: SOME STATISTICS


   A.1  Amino acid composition

        A.1.1  Composition in percent for the complete data
        bank

   Ala (A) 7.74   Gln (Q) 4.11   Leu (L) 9.08   Ser (S) 7.01
   Arg (R) 5.22   Glu (E) 6.19   Lys (K) 5.83   Thr (T) 5.84
   Asn (N) 4.38   Gly (G) 7.27   Met (M) 2.27   Trp (W) 1.34
   Asp (D) 5.22   His (H) 2.29   Phe (F) 3.94   Tyr (Y) 3.23
   Cys (C) 1.88   Ile (I) 5.31   Pro (P) 5.17   Val (V) 6.51

   Asx (B) 0.01   Glx (Z) 0.01   Xaa (X) 0.03


        A.1.2  Classification of the amino acids by their
        frequency

   Leu, Ala, Gly, Ser, Val, Glu, Thr, Lys, Ile, Arg = Asp,
   Pro, Asn, Gln, Phe, Tyr, His, Met, Cys, Trp


   A.2   Repartition of  the sequences  by their  organism  of
   origin

   Total number  of species represented in this release of the
   data bank: 1687

   Species represented 1x: 785
                       2x: 304
                       3x: 169
                       4x: 102
                       5x:  69
                       6x:  47
                       7x:  29
                       8x:  28
                       9x:  34
                      10x:  16
                  11- 20x:  51
                  21-100x:  42
                    >100x:  11


        A.2.2  Table of the most represented species

    Number   Frequency          Species
         1        1003          Human
         2         885          Escherichia coli
         3         555          Mouse
         4         458          Rat
         5         354          Baker's  yeast (Saccharomyces cerevisiae)
         6         313          Bovine
         7         185          Fruit fly (Drosophila melanogaster)
         8         183          Chicken
         9         151          Rabbit
        10         131          Pig
        11         102          African clawed frog (Xenopus laevis)
        12          96          Bacillus subtilis
        13          84          Salmonella typhimurium
        14          83          Maize
        15          79          Bacteriophage T4
        16          70          Herpes virus (Type 1, Strain 17)
                    70          Tobacco
        18          67          Varicella-Zoster virus (Strain Dumas)
        19          62          Bacteriophage Lambda
                    62          Vaccinia Virus
                    62          Wheat


   A.3  Repartition of the sequences by size

        From   To   Number            From   To   Number
           1-  50      626            1001-1100       77
          51- 100     1368            1101-1200       53
         101- 150     2159            1201-1300       44
         151- 200     1130            1301-1400       26
         201- 250      876            1401-1500       17
         251- 300      729            1501-1600       10
         301- 350      643            1601-1700       14
         351- 400      608            1701-1800       12
         401- 450      468            1801-1900        8
         451- 500      527            1901-2000        7
         501- 550      401                >2000       54
         551- 600      244
         601- 650      188
         651- 700      130
         701- 750      108
         751- 800       76
         801- 850       77
         851- 900       94
         901- 950       43
         951-1000       39

   Currently the two largest sequences are:

   APB$HUMAN   4563 a.a.
   APOA$HUMAN  4548 a.a.


                APPENDIX B: DISKS FOR SWISS-PROT


   B.1  IBM PC/AT 1.2 Mb disks

   SWISS-PROT is stored on fourteen 1.2 Mb disks. Each of
   these disks contains a single bulk file (PRT11_01.BLK to
   PRT11_14.BLK):

   Disk     First sequence        Last Sequence
    1       10KA$MYCTU            B1AR$HUMAN
    2       B1AR$MELGA            COLI$SQUAC
    3       COLI$STRCA            DPOL$HPBVY
    4       DPOL$HPBVZ            GC2$HUMAN
    5       GC3$HUMAN             HEMA$INCMI
    6       HEMA$INCP1            K1CS$BOVIN
    7       K1CS$HUMAN            MAP2$HUMAN
    8       MAS1$YEAST            ODB1$BOVIN
    9       ODB2$HUMAN            PRP2$MOUSE
   10       PRP2$RAT              RRPO$BPSP
   11       RRPO$CARMV            TKNG$RAT
   12       TKNK$BOVIN            VGLG$VSVJ
   13       VGLG$VSVO             YVL6$HCMVA
   14       YWL1$HCMVA            ZP3$MOUSE



   B.2  IBM PS/2 1.4 Mb disks

   The number  and content  of the  1.4 Mb  disks for the PS/2
   systems are  exactly identical to those of the 1.2 Mb disks
   (see above).
  

Swiss-Prot release 10.0

Published March 5, 1989

             SWISS-PROT RELEASE 10.0 RELEASE NOTES


   Date:     March 5, 1989
   Author:   A. Bairoch


                         1. INTRODUCTION

   1.1  Evolution

   Release 10.0 of SWISS-PROT contains 10008 sequence entries,
   comprising  2'952'613  amino  acids  abstracted  from  9920
   references. This represents an increase of 15% over release
   9.0. The  recent growth  of the  data  bank  is  summarised
   below:

   Release    Date   Number of entries     Nb of amino acids

   3.0        11/86               4160               969 641
   4.0        04/87               4387             1 036 010
   5.0        09/87               5205             1 327 683
   6.0        01/88               6102             1 653 982
   7.0        04/88               6821             1 885 771
   8.0        08/88               7724             2 224 465
   9.0        11/88               8702             2 498 140
   10.0       03/89              10008             2 952 613
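
   The growth figure quoted above can be checked with a few
   lines of Python. This is only an illustrative sketch: it
   assumes that the 15% refers to the number of entries, and
   the names GROWTH and percent_increase are made up for this
   example.

```python
# (release, number of entries, number of amino acids), copied from the table above.
GROWTH = [
    ("3.0", 4160, 969641),
    ("4.0", 4387, 1036010),
    ("5.0", 5205, 1327683),
    ("6.0", 6102, 1653982),
    ("7.0", 6821, 1885771),
    ("8.0", 7724, 2224465),
    ("9.0", 8702, 2498140),
    ("10.0", 10008, 2952613),
]


def percent_increase(old, new):
    """Relative growth from old to new, in percent."""
    return 100.0 * (new - old) / old


previous, current = GROWTH[-2], GROWTH[-1]
entry_growth = percent_increase(previous[1], current[1])
residue_growth = percent_increase(previous[2], current[2])
print("entries: +%.1f%%, amino acids: +%.1f%%" % (entry_growth, residue_growth))
```

   Under that assumption the entry count grows by about 15%,
   while the number of amino acids grows somewhat faster.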


   1.2  Source of data

   Release 10.0  has been  updated using protein sequence data
   from  release  19.0  of  the  PIR  (Protein  Identification
   Resource) protein  data bank,  as well  as  translation  of
   nucleotide sequence  data from  release 18.0  of  the  EMBL
   nucleotide sequence Data Library.

   As an indication of the source of the sequence data in the
   SWISS-PROT data bank we list here the statistics concerning
   the DR (Databank Reference) pointer lines:

   Entries with pointer(s) to only PIR entri(es):          3009
   Entries with pointer(s) to only EMBL entri(es):         3383
   Entries with pointer(s) to both EMBL and PIR entri(es): 2973
   Entries with no pointer lines (entered in house):        643
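
   As a rough illustration of how such a tally could be
   produced, the Python sketch below classifies entries by the
   data banks named in their DR lines. The function name
   pointer_category and the toy input are assumptions made for
   this example, and only PIR and EMBL pointers are considered.
   The last lines check that the four published figures add up
   to the total number of entries in release 10.0.

```python
from collections import Counter


def pointer_category(databanks):
    """Classify one entry by the set of data banks its DR lines point to."""
    has_pir = "PIR" in databanks
    has_embl = "EMBL" in databanks
    if has_pir and has_embl:
        return "both"
    if has_pir:
        return "PIR only"
    if has_embl:
        return "EMBL only"
    return "none"


# Toy input (made up): one set of DR data bank names per entry.
entries = [{"PIR"}, {"EMBL"}, {"EMBL", "PIR"}, set(), {"EMBL"}]
tally = Counter(pointer_category(databanks) for databanks in entries)

# Figures published above for release 10.0.
published = {"PIR only": 3009, "EMBL only": 3383, "both": 2973, "none": 643}
total = sum(published.values())
```

   The four categories are mutually exclusive, so their sum
   should equal the number of entries in the release.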


   2. DESCRIPTION OF THE CHANGES MADE TO SWISS-PROT SINCE
      RELEASE 9

   2.1  Sequences and annotations

   Some 1306  new sequences  have been  added since  the  last
   release, the sequence data of 159 existing entries has been
   updated and  the annotations  of  1699  entries  have  been
   revised. In particular we have used review articles to
   update the  annotations of the following groups or families
   of proteins:

      Aerolysin type toxins
      Asparaginases / Glutaminases
      Aspartyl proteases
      ATP-binding proteins 'active transport' family
      Bacterial and fungal ribonucleases
      Bowman-Birk serine protease inhibitors
      Cadherins
      Calcitonins
      Clathrin light chains
      Crystallins beta and gamma
      Cytosine-specific DNA methylases
      Colony stimulating factors
      Glucagon / GIP / secretin / VIP family
      E1-E2 ATPases
      Crystalline entomocidal toxin proteins
      Fos/jun proteins family
      Galactose-1-phosphate uridyl transferase
      Glutathione S-transferase
      Herpes and Varicella-Zoster viruses proteins
      Int-1 family
      Interferons alpha and beta
      Interferon-induced proteins
      Kazal serine protease inhibitors
      Lipoxygenases
      LysR bacterial activator proteins family
      Manganese and Iron superoxide dismutases
      Myb family proteins
      Myc family proteins
      Nicotinic acetylcholine receptors
      Pancreatic hormone / Neuropeptide Y family
      Pancreatic ribonucleases
      Peptidyl-prolyl     cis-trans     isomerase     (ppiase)
      (cyclophilin)
      Platelet factor 4 family
      Shiga/Ricin ribosomal inactivating toxins
      Somatotropin, prolactin and related hormones
      Squash family of serine protease inhibitors
      Thymidylate synthase
      TNF alpha and beta
      Topoisomerases type II
      Tropomyosins


   2.2  New format for the date (DT) line type

   The format of the DT line has been changed and is now:

    DT   DD-MMM-YYYY  (REL. XX, COMMENT)

   where DD  is the  day, MMM the month, YYYY the year, and XX
   the SWISS-PROT  release number.  The comment portion of the
   line indicates the action taken on that date. There are
   always three DT lines in each entry, each of which is
   associated with a specific comment:

   -  The  first  DT  line  indicates  when  the  entry  first
      appeared in  the data  bank. The  associated comment  is
      "CREATED".
   -  The second  DT line indicates when the sequence data was
      last modified.  The associated comment is "LAST SEQUENCE
      UPDATE".
   -  The third DT line indicates when any data other than the
      sequence was  last modified.  The associated  comment is
      "LAST ANNOTATION UPDATE".

   Example of a block of DT lines:

   DT   01-JAN-1988  (REL. 06, CREATED)
   DT   01-AUG-1988  (REL. 08, LAST SEQUENCE UPDATE)
   DT   01-MAR-1989  (REL. 10, LAST ANNOTATION UPDATE)
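
   For readers who process entries by program, the DT format
   described above can be parsed along the following lines.
   This Python sketch is not part of the SWISS-PROT
   distribution: the function name parse_dt_line and the
   regular expression are assumptions that simply follow the
   layout "DT   DD-MMM-YYYY  (REL. XX, COMMENT)".

```python
import re
from datetime import date

# Month abbreviations as they appear in DT lines (assumed English, upper case).
MONTHS = {name: number + 1 for number, name in enumerate(
    "JAN FEB MAR APR MAY JUN JUL AUG SEP OCT NOV DEC".split())}

# Hypothetical pattern following: DT   DD-MMM-YYYY  (REL. XX, COMMENT)
DT_PATTERN = re.compile(
    r"^DT\s+(\d{2})-([A-Z]{3})-(\d{4})\s+\(REL\.\s*(\d+),\s*(.+)\)\s*$")


def parse_dt_line(line):
    """Return (date, release number, comment) for one DT line."""
    match = DT_PATTERN.match(line.strip())
    if match is None:
        raise ValueError("not a DT line: %r" % line)
    day, month, year, release, comment = match.groups()
    return date(int(year), MONTHS[month], int(day)), int(release), comment


block = [
    "DT   01-JAN-1988  (REL. 06, CREATED)",
    "DT   01-AUG-1988  (REL. 08, LAST SEQUENCE UPDATE)",
    "DT   01-MAR-1989  (REL. 10, LAST ANNOTATION UPDATE)",
]
parsed = [parse_dt_line(line) for line in block]
```

   Each element of parsed holds the date, the release number
   and the comment of the corresponding DT line, in the order
   created, last sequence update, last annotation update.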


   2.3   Extension of  the taxonomic  classification in the OC
   lines

   In  previous   releases  of  SWISS-PROT  the  OC  (Organism
   Classification) lines  only contained the first node of the
   taxonomic tree (PROKARYOTA, EUKARYOTA or VIRIDAE). Starting
   with release  10  we  are  implementing  a  full  taxonomic
   classification. In  release  10,  164  different  taxonomic
   nodes have  been  defined.  The  list  of  these  nodes  is
   available in the SPECLIST.TXT document file.

   2.4  New topic for the comments (CC) line type

   As of  release 10  we have  added a  new  'topic'  for  the
   comments  (CC)   line-type:  COFACTOR,  which  is  used  to
   describe enzyme cofactor(s). Example of its usage:

     CC   -!- COFACTOR: REQUIRES PYRIDOXAL PHOSPHATE.


   2.5  New feature key

   A new  feature key  has been  introduced in  this  release:
   TRANSIT, which  describes the  extent of  a transit peptide
   (mitochondrial or  chloroplastic). Examples  of TRANSIT key
   feature lines:

     FT   TRANSIT       1     25       MITOCHONDRION.
     FT   TRANSIT       1     42       CHLOROPLAST.
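
   A feature line such as the ones above can be split into its
   parts with a short Python sketch. The names Feature,
   parse_ft_line and feature_length are made up for this
   illustration, and the code assumes whitespace-separated
   fields and inclusive end positions rather than relying on
   the exact column layout.

```python
from typing import NamedTuple


class Feature(NamedTuple):
    key: str
    start: int
    end: int
    description: str


def parse_ft_line(line):
    """Split one FT line into key, start, end and description.

    Assumes whitespace-separated fields, as in the examples above;
    this is a simplification, not the official column layout.
    """
    fields = line.split(None, 4)
    if len(fields) < 4 or fields[0] != "FT":
        raise ValueError("not an FT line: %r" % line)
    description = fields[4].rstrip(".") if len(fields) == 5 else ""
    return Feature(fields[1], int(fields[2]), int(fields[3]), description)


def feature_length(feature):
    """Number of residues covered, assuming both ends are inclusive."""
    return feature.end - feature.start + 1


features = [parse_ft_line(line) for line in (
    "FT   TRANSIT       1     25       MITOCHONDRION.",
    "FT   TRANSIT       1     42       CHLOROPLAST.",
)]
```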


   3. THE NEXT RELEASE

   SWISS-PROT release 11.0 will be available in June 1989.


   4. WE NEED YOUR HELP !

   We welcome any feedback from our users. We would especially
   appreciate being notified if you find that sequences
   belonging to your field of expertise are missing from the
   data bank. We would also like to be notified about
   annotations that need updating, for example if the function
   of a protein has been clarified or if new
   post-translational information has become available.


   APPENDIX A: DISKS FOR SWISS-PROT

   A.1  IBM PC/AT 1.2 Mb disks

   SWISS-PROT is stored on twelve 1.2 Mb disks. Each of these
   disks contains a single bulk file (PRT10_01.BLK to
   PRT10_12.BLK):

   Disk     First sequence        Last Sequence
    1       10KA$MYCTU            C1QC$HUMAN
    2       C1S$HUMAN             CRA$PLAFA
    3       CRA2$MESAU            ENV$HIV1M
    4       ENV$HIV1P             H1$DROME
    5       H1$ECHCR              HYEP$HUMAN
    6       HYEP$RABIT            KV3T$MOUSE
    7       KV3U$MOUSE            NGFR$RAT
    8       NIF1$CLOPA            POLG$FMDVI
    9       POLG$FMDVO            RNP$DAMDA
   10       RNP$DAMKO             TKN$PHYFU
   11       TKN1$HYLMA            VIB1$AGRT9
   12       VIB2$AGRT6            ZIPP$DROME

   A.2  IBM PS/2 1.4 Mb disks

   The number  and content  of the  1.4 Mb  disks for the PS/2
   systems are  exactly identical to those of the 1.2 Mb disks
   (see above).

   APPENDIX B: SOME STATISTICS


   B.1  Amino acid composition

        COMPOSITION IN PERCENT FOR THE COMPLETE DATA BANK

   Ala (A) 7.77   Gln (Q) 4.11   Leu (L) 9.08   Ser (S) 7.00
   Arg (R) 5.23   Glu (E) 6.15   Lys (K) 5.81   Thr (T) 5.84
   Asn (N) 4.36   Gly (G) 7.30   Met (M) 2.26   Trp (W) 1.35
   Asp (D) 5.21   His (H) 2.29   Phe (F) 3.94   Tyr (Y) 3.22
   Cys (C) 1.89   Ile (I) 5.30   Pro (P) 5.21   Val (V) 6.51

   Asx (B) 0.01   Glx (Z) 0.01   Xaa (X) 0.03


      CLASSIFICATION OF THE AMINO ACIDS BY THEIR FREQUENCY

   Leu, Ala, Gly, Ser, Val, Glu, Thr, Lys, Ile, Arg, Asp, Pro,
   Asn, Gln, Phe, Tyr, His, Met, Cys, Trp



   B.2   Repartition of  the sequences  by their  organism  of
         origin

   Total number  of species represented in this release of the
   data bank: 1590

   Species represented 1x: 741
                       2x: 291
                       3x: 163
                       4x:  99
                       5x:  58
                       6x:  45
                       7x:  27
                       8x:  28
                       9x:  30
                      10x:  13
                  11- 20x:  43
                  21-100x:  42
                    >100x:  10



              TABLE OF THE MOST REPRESENTED SPECIES

   918: HUMAN       173: DROME       71: BPT4        59: VACCV
   838: ECOLI       143: RABIT       70: HSV11       57: WHEAT
   497: MOUSE       126: PIG         69: TOBAC       54: BPT7
   414: RAT          93: XENLA       67: VZVD        53: SOYBN
   324: YEAST        84: BACSU       62: LAMBD       53: SHEEP
   282: BOVIN        80: SALTY       60: MARPO
   176: CHICK        74: MAIZE       59: EPBAR


   B.3  Repartition of the sequences by size

     From  To   Number            From   To   Number
        1-  50     582            1001-1100       70
       51- 100    1291            1101-1200       46
      101- 150    2043            1201-1300       35
      151- 200    1066            1301-1400       24
      201- 250     805            1401-1500       17
      251- 300     662            1501-1600        8
      301- 350     588            1601-1700       11
      351- 400     549            1701-1800        9
      401- 450     420            1801-1900        8
      451- 500     461            1901-2000        5
      501- 550     369                >2000       49
      551- 600     216
      601- 650     168
      651- 700     114
      701- 750      99
      751- 800      70
      801- 850      66
      851- 900      85
      901- 950      37
      951-1000      35

   Currently the two largest sequences are:

   APB$HUMAN   4563 a.a.
   APOA$HUMAN  4548 a.a.
  

Swiss-Prot release 9.0

Published December 1, 1988

             SWISS-PROT RELEASE 9.0 RELEASE NOTES


   Date:     December 1, 1988
   Author:   A. Bairoch


                         1. INTRODUCTION

   1.1  Evolution

   Release 9.0  of SWISS-PROT  contains 8702 sequence entries,
   comprising  2'498'140  amino  acids  abstracted  from  8735
   references. This represents an increase of 12% over release
   8.0. The  recent growth  of the  data  bank  is  summarised
   below:

   Release    Date   Number of entries     Nb of amino acids

   3.0        11/86               4160               969 641
   4.0        04/87               4387             1 036 010
   5.0        09/87               5205             1 327 683
   6.0        01/88               6102             1 653 982
   7.0        04/88               6821             1 885 771
   8.0        08/88               7724             2 224 465
   9.0        11/88               8702             2 498 140


   1.2  Source of data

   Release 9.0  has been  updated using  protein sequence data
   from  release  17.0  of  the  PIR  (Protein  Identification
   Resource) protein  data bank,  as well  as  translation  of
   nucleotide sequence  data from  release 17.0  of  the  EMBL
   nucleotide sequence Data Library.

   As an indication of the source of the sequence data in the
   SWISS-PROT data bank we list here the statistics concerning
   the DR (Databank Reference) pointer lines:

   Entries with pointer(s) to only PIR entri(es):          3049
   Entries with pointer(s) to only EMBL entri(es):         2640
   Entries with pointer(s) to both EMBL and PIR entri(es): 2532
   Entries with no pointer lines (entered in house):        481


   2. DESCRIPTION OF THE CHANGES MADE TO SWISS-PROT SINCE
      RELEASE 8

   2.1  Sequences and annotations

   Some 978  new sequences  have been  added  since  the  last
   release, the sequence data of 111 existing entries has been
   updated and  the annotations  of  1622  entries  have  been
   revised. In particular we have used review articles to
   update the  annotations of the following groups or families
   of proteins:

        Acidic ribosomal proteins
        Adipokinetic hormone family
        ATP synthase subunits
        Bacterial histone-like DNA-binding proteins
        Bombesin-like peptides family
        Chaperonins
        Cholecystokinin/gastrin family
        Ferredoxins
        G-protein coupled receptors
        Histones H2A
        Histones H4
        HIV/SIV viruses proteins
        Homeobox proteins
        Integrins alpha chains
        Intermediate filaments
        Myosin heavy chains
        Myosin light chains
        Ovomucoids
        Oxytocin/vasopressin family
        Potato inhibitor I family
        Prion protein
        Receptor tyrosine-protein kinase class II
        Receptor tyrosine-protein kinase class III
        Ribonucleoproteins
        Serine carboxypeptidases
        Silk moth chorion proteins
        Thiol proteases inhibitors


   2.2  Cross-references to the HIV Sequence Database

   Starting with  release 9  we have added cross-references to
   entries in  the Human  Retroviruses and AIDS compilation of
   nucleic   and    amino   acid   sequences   (HIV   Sequence
   Database)(1). An  example of  a DR  line pointing to an HIV
   entry is shown here:

   DR   HIV; M15654; 3ORF$BH102.
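
   The DR line shown above consists of three fields separated
   by semicolons and terminated by a period. A minimal Python
   sketch for splitting it follows; the function name
   parse_dr_line and the field names (data bank, primary
   identifier, secondary identifier) are assumptions made for
   this illustration.

```python
def parse_dr_line(line):
    """Split a DR line into (data bank, primary id, secondary id)."""
    if not line.startswith("DR"):
        raise ValueError("not a DR line: %r" % line)
    # Drop the line code and the terminating period, then split on ';'.
    body = line[2:].strip().rstrip(".")
    fields = [field.strip() for field in body.split(";")]
    if len(fields) != 3:
        raise ValueError("expected three fields: %r" % line)
    return tuple(fields)


databank, primary, secondary = parse_dr_line("DR   HIV; M15654; 3ORF$BH102.")
```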


   2.3  New topic for the comments (CC) line type

   As of  release 9  we have  added  a  new  'topic'  for  the
   comments (CC) line-type: ALTERNATIVE SPLICING which is used
   to describe  the existence  of related  protein sequence(s)
   produced by alternative splicing of the same gene(s).

   Example of its usage:

   CC   -!- ALTERNATIVE SPLICING: SKELETAL MUSCLE AND FIBROBLAST
   CC       TROPOMYOSINS ARE OBTAINED BY ALTERNATIVE MRNA SPLICING.


   3. THE NEXT RELEASE

   SWISS-PROT release 10.0 will be available in February 1989.
   We plan to upgrade the OC lines so as to add at least a few
   levels of taxonomic nodes (currently only the top node,
   viridae, prokaryota or eukaryota, is available).


   4. WE NEED YOUR HELP !

   We welcome any feedback from our users. We would especially
   appreciate being notified if you find that sequences
   belonging to your field of expertise are missing from the
   data bank. We would also like to be notified about
   annotations that need updating, for example if the function
   of a protein has been clarified or if new
   post-translational information has become available.


   ____________________
   1  The HIV  Sequence Database  is edited  by G. Myers, A.B.
      Rabson,  S.F.   Josephs,  T.F.   Smith,  F.  Wong-Staal;
      published by  the  Theoretical  Biology  and  Biophysics
      Group T-10 at Los Alamos National Laboratory; and funded
      by the AIDS program of the National Institute of Allergy
      and Infectious Diseases through an interagency agreement
      with the United States Department of Energy.


   APPENDIX A: DISKS FOR SWISS-PROT

   -  Release 9  is the first release to be available on a CD-
      ROM disk.
   -  Starting with  release 9  we have  stopped  distributing
      SWISS-PROT on 360 Kb disks.

   A.1  IBM PC/AT 1.2 Mb disks

   SWISS-PROT is stored on ten 1.2 Mb disks. Each of these
   disks contains a single bulk file (PROT9_01.BLK to
   PROT9_10.BLK):

   Disk     First sequence        Last Sequence
    1       16K$TRVPS             CAS2$BOVIN
    2       CAS2$SHEEP            CYSE$ECOLI
    3       CYSP$MOUSE            GENK$ECOLI
    4       GENL$ECOLI            IBB3$SOYBN
    5       IBB4$MACAX            LPID$EDWTA
    6       LPIV$ECOLI            PA21$PIG
    7       PA21$PSEAU            RENI$RAT
    8       RENS$MOUSE            TOLC$ECOLI
    9       TOLL$DROME            VSI1$REOV1
   10       VSI1$REOV2            ZEB2$MAIZE

   A.2  IBM PS/2 1.4 Mb disks

   The number  and content  of the  1.4 Mb  disks for the PS/2
   systems are  exactly identical to those of the 1.2 Mb disks
   (see above).



   APPENDIX B: SOME STATISTICS

   B.1  Amino acid composition



        COMPOSITION IN PERCENT FOR THE COMPLETE DATA BANK

   Ala (A) 7.75   Gln (Q) 4.10   Leu (L) 9.06   Ser (S) 7.01
   Arg (R) 5.15   Glu (E) 6.17   Lys (K) 5.89   Thr (T) 5.84
   Asn (N) 4.37   Gly (G) 7.32   Met (M) 2.29   Trp (W) 1.36
   Asp (D) 5.19   His (H) 2.28   Phe (F) 3.94   Tyr (Y) 3.23
   Cys (C) 1.90   Ile (I) 5.32   Pro (P) 5.17   Val (V) 6.51

   Asx (B) 0.02   Glx (Z) 0.01   Xaa (X) 0.03


      CLASSIFICATION OF THE AMINO ACIDS BY THEIR FREQUENCY

   Leu, Ala, Gly, Ser, Val, Glu, Lys, Thr, Ile, Asp, Pro, Arg,
   Asn, Gln, Phe, Tyr, Met, His, Cys, Trp


   B.2   Repartition of  the sequences  by their  organism  of
         origin

   Total number  of species represented in this release of the
   data bank: 1345

   Species represented 1x:  700
                       2x:  276
                       3x:  155
                       4x:   88
                       5x:   48
                       6x:   43
                       7x:   23
                       8x:   28
                       9x:   24
                      10x:   10
                     >10x:   83



              TABLE OF THE MOST REPRESENTED SPECIES

   795: HUMAN       142: DROME       69: SALTY       59: VACCV
   703: ECOLI       141: CHICK       68: BPT4        54: BPT7
   452: MOUSE       129: RABIT       67: MAIZE       52: MARPO
   356: RAT         116: PIG         62: LAMBD       50: WHEAT
   297: YEAST        85: XENLA           TOBAC
   264: BOVIN        81: BACSU       59: EPBAR


   B.3  Repartition of the sequences by size

   From   To  Number            From   To   Number
      1-  50     487            1001-1100       54
     51- 100    1179            1101-1200       34
    101- 150    1893            1201-1300       29
    151- 200     938            1301-1400       21
    201- 250     661            1401-1500       13
    251- 300     554            1501-1600        4
    301- 350     492            1601-1700       10
    351- 400     483            1701-1800        8
    401- 450     354            1801-1900        6
    451- 500     387            1901-2000        3
    501- 550     327            >2000           43
    551- 600     176
    601- 650     135
    651- 700      87
    701- 750      82
    751- 800      61
    801- 850      52
    851- 900      68
    901- 950      30
    951-1000      31

   Currently the two largest sequences are:

   APB$HUMAN   4563 a.a.
   APOA$HUMAN  4548 a.a.
  

Swiss-Prot release 8.0

Published August 4, 1988

             SWISS-PROT RELEASE 8.0 RELEASE NOTES


   Date:     August 4, 1988
   Author:   A. Bairoch


                         1. INTRODUCTION

   1.1  Evolution

   Release 8.0  of SWISS-PROT  contains 7724 sequence entries,
   comprising  2'224'465  amino  acids  abstracted  from  8088
   references. This represents an increase of 13% over release
   7.0. The  recent growth  of the  data  bank  is  summarised
   below:

   Release    Date   Number of entries     Nb of amino acids

   3.0        11/86               4160               969 641
   4.0        04/87               4387             1 036 010
   5.0        09/87               5205             1 327 683
   6.0        01/88               6102             1 653 982
   7.0        04/88               6821             1 885 771
   8.0        08/88               7724             2 224 465



   1.2  Source of data

   Release 8.0  has been  updated using  protein sequence data
   from  release  16.0  of  the  PIR  (Protein  Identification
   Resource) protein  data bank,  as well  as  translation  of
   nucleotide sequence  data from  release 15.0  of  the  EMBL
   nucleotide sequence Data Library.

   As an indication of the source of the sequence data in the
   SWISS-PROT data bank we list here the statistics concerning
   the DR (Databank Reference) pointer lines:

   Entries with pointer(s) to only PIR entri(es):          3385
   Entries with pointer(s) to only EMBL entri(es):         2084
   Entries with pointer(s) to both EMBL and PIR entri(es): 2057
   Entries with no pointer lines (entered in house):        198



     2. DESCRIPTION OF THE CHANGES MADE TO SWISS-PROT SINCE
                            RELEASE 7


   2.1  Sequences and annotations

   Some 903  new sequences  have been  added  since  the  last
   release, the  sequence data of 93 existing entries has been
   updated and  the annotations  of almost  2400 entries  have
   been revised. In particular we have used review articles
   to update  the  annotations  of  the  following  groups  or
   families of proteins:

        1,6-fructose-bisphosphate aldolases.
        Alkaline phosphatases.
        Annexins.
        Endo- and Exo-glucanases.
        Bacterial ferredoxins.
        Beta-adrenergic receptors.
        Chemotaxis proteins.
        Collagens.
        Complement proteins.
        Cytochrome B5.
        Cytosolic lipid-binding proteins.
        Fumarases.
        Malate dehydrogenases.
        Muscarinic acetylcholine receptors.
        Opsins.
        Protamines.
        PTS system proteins.
        RNA polymerase sigma factors.
        Serpins.
        Small, acid-soluble spore proteins (SASP).
        Tachykinins.
        TGF-beta/Inhibins/MIS.
        Zinc-finger proteins.
        Zinc metallo endopeptidases.


   2.2  New feature table keys

   Two  new  feature  keys  have  been  introduced  with  this
   release:

   ZN_FING:   Extent of a "zinc-finger" region.

   NP_BIND:   Extent  of   a   nucleotide   phosphate  binding
              region. The  nature of  the nucleotide phosphate
              is indicated  in the description field. Examples
              of NP_BIND key feature lines:

   FT   NP_BIND      13     25       ATP.
   FT   NP_BIND      45     49       GTP (POTENTIAL).


   2.3  Enhancement in the format of the CC line

   A major proportion of the comment blocks are now arranged
   according to what we designate as 'topics'. The format of a
   comment block which belongs to a 'topic' is:

   CC    -!- TOPIC: FREE TEXT DESCRIPTION.

   The current topics and their definition are:

   CATALYTIC ACTIVITY Description of the reaction(s) catalysed
                      by an enzyme [*].
   DISEASE            Description of the disease(s) associated
                      with a deficiency of a protein.
   FUNCTION           General description  of the  function(s)
                      of a protein.
   INDUCTION          Description  of  the  compound(s)  which
                      stimulate the synthesis of a protein.
   PATHWAY            Description of the metabolic pathway(s)
                      with which a protein is associated.
   SIMILARITY         Description   of    the   similariti(es)
                      (sequence or  structural) of  a  protein
                      with other proteins.
   SUBCELLULAR LOCATION     Description  of   the  subcellular
                      location of a mature protein product.
   SUBUNIT            Description of  the quaternary structure
                      of a protein.

   [*]  Wherever possible we have described the catalytic
      activity of an enzyme using the recommendations of
      the Nomenclature Committee of the International Union of
      Biochemistry (IUB) as published in:

      Enzyme Nomenclature, NC-IUB, Academic Press, New York,
      (1984).
      and
      Supplement  1:   Corrections  and   Additions,  Eur.  J.
      Biochem. 157:1-26(1986).

   Here is,  for each  of the topics defined above, an example
   of its usage:

   CC   -!- CATALYTIC ACTIVITY: ATP + L-GLUTAMATE + NH(3) =
   CC       ADP + GLUTAMINE + ORTHOPHOSPHATE.
   CC   -!- DISEASE: NADH-CYTOCHROME B5 REDUCTASE DEFICIENCY
   CC       CAUSES HEREDITARY METHEMOGLOBINEMIA.
   CC   -!- FUNCTION: PREVENTS THE POLYMERIZATION OF ACTIN.
   CC   -!- INDUCTION: ARGINASE ACTIVITY CAN BE INDUCED BY
   CC       ARGININE OR HOMOARGININE.
   CC   -!- PATHWAY: FIRST STEP IN PROLINE BIOSYNTHESIS.
   CC   -!- SIMILARITY: WITH SUBTILISINS, THERMITASE, AND
   CC       PROTEINASE K.
   CC   -!- SUBCELLULAR LOCATION: MITOCHONDRIAL MATRIX.
   CC   -!- SUBUNIT: TETRAMER OF IDENTICAL CHAINS.
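
   The topic layout shown above lends itself to simple
   processing. The following Python sketch groups CC lines
   into (topic, text) pairs. It assumes that "-!-" opens a new
   comment block and that any other CC line continues the
   current block; the function name parse_cc_block is made up
   for this example. Comment blocks without a topic are
   returned with None in place of the topic.

```python
def parse_cc_block(lines):
    """Group CC lines into (topic, text) pairs.

    A line whose body starts with '-!-' opens a new comment block;
    other CC lines are treated as continuations of the current block.
    """
    comments = []
    for line in lines:
        if not line.startswith("CC"):
            raise ValueError("not a CC line: %r" % line)
        body = line[2:].strip()
        if body.startswith("-!-"):
            topic, separator, text = body[3:].strip().partition(":")
            if separator:
                comments.append([topic.strip(), text.strip()])
            else:
                # Free-text comment that does not belong to a topic.
                comments.append([None, topic.strip()])
        elif comments:
            comments[-1][1] += " " + body
        else:
            raise ValueError("continuation line without a block: %r" % line)
    return [tuple(comment) for comment in comments]


example = [
    "CC   -!- CATALYTIC ACTIVITY: ATP + L-GLUTAMATE + NH(3) =",
    "CC       ADP + GLUTAMINE + ORTHOPHOSPHATE.",
    "CC   -!- SUBUNIT: TETRAMER OF IDENTICAL CHAINS.",
]
topics = parse_cc_block(example)
```

   Splitting on the first colon only keeps any later colons in
   the free text part of the comment.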


   2.4  Small change in the format of the RN line

   The syntax  of RN  lines  for  articles  which  report  the
   sequence of a protein translated from a nucleotide sequence
   used to  be "SEQUENCE  TRANSL. FROM N.A. SEQ.", and has now
   been changed to "SEQUENCE FROM N.A."

   Examples:

   RN   [1] (SEQUENCE FROM N.A.)
   RN   [1] (SEQUENCE OF 12-143 FROM N.A.)
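
   The two forms of the new syntax shown above can be
   recognised with one regular expression. The Python sketch
   below is illustrative only: parse_rn_line and RN_PATTERN
   are assumed names, and the pattern covers just the two
   example forms given here.

```python
import re

# Matches "RN   [n] (SEQUENCE FROM N.A.)" and
# "RN   [n] (SEQUENCE OF a-b FROM N.A.)"; other RN comments are not handled.
RN_PATTERN = re.compile(
    r"^RN\s+\[(\d+)\]\s+\(SEQUENCE(?: OF (\d+)-(\d+))? FROM N\.A\.\)$")


def parse_rn_line(line):
    """Return (reference number, residue range or None)."""
    match = RN_PATTERN.match(line.strip())
    if match is None:
        raise ValueError("unsupported RN line: %r" % line)
    number = int(match.group(1))
    if match.group(2) is None:
        return number, None
    return number, (int(match.group(2)), int(match.group(3)))


full = parse_rn_line("RN   [1] (SEQUENCE FROM N.A.)")
partial = parse_rn_line("RN   [1] (SEQUENCE OF 12-143 FROM N.A.)")
```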


   2.5 Changes in the documentation

   The document  file SPECFREQ.TXT  which was used to list the
   frequency table  of usage  of  the  species  identification
   codes has been discontinued; this information is now
   included in a new appendix of these release notes (Appendix
   B), along  with other  statistics such  as the  amino  acid
   composition in  the data  bank and  the repartition  of the
   sequences by size (number of residues).



                       3. THE NEXT RELEASE

   SWISS-PROT release  9.0 will be available in December 1988.
   We plan to upgrade the OC lines so as to add at least a few
   levels of taxonomic nodes (currently only the top node,
   viridae, prokaryota or eukaryota, is available).



                     4. WE NEED YOUR HELP !

   We welcome any feedback from our users. We would especially
   appreciate being notified if you find that sequences
   belonging to your field of expertise are missing from the
   data bank. We would also like to be notified about
   annotations that need updating, for example if the function
   of a protein has been clarified or if new
   post-translational information has become available.



                APPENDIX A: DISKS FOR SWISS-PROT


   A.1  IBM PC/AT 1.2 Mb disks

   SWISS-PROT is stored on nine 1.2 Mb disks. Each of these
   disks contains a single bulk file (PROT8_01.BLK to
   PROT8_09.BLK):

   Disk     First sequence        Last Sequence
    1       16K$TRVPS             CD8$HUMAN
    2       CD8$MOUSE             DYR$LACCA
    3       DYR$MESAU             HBA$CAVPO
    4       HBA$CEBAP             KNL1$BOVIN
    5       KNL1$HUMAN            NRAM$INATO
    6       NRAM$INATR            RASN$MOUSE
    7       RB1$HUMAN             TRA$MAIZE
    8       TRA$PSEAE             Y8K6$VACCV
    9       Y90K$VACCV            ZEG1$MAIZE


   A.2  IBM PS/2 1.4 Mb disks

   The number  and content  of the  1.4 Mb  disks for the PS/2
   systems are  exactly identical to those of the 1.2 Mb disks
   (see above).


   A.3  IBM PC 360 Kb disks

   SWISS-PROT is  stored on  twenty eight  (28) 360  Kb disks.
   Each  of   these  disks   contains  a   single  bulk   file
   (PROT8_01.BLK to PROT8_28.BLK):

   Disk     First sequence        Last Sequence
    1       16K$TRVPS             AMEG$BOVIN
    2       AMID$XENLA            AZUR$ALCDE
    3       AZUR$ALCFA            CART$CHICK
    4       CAS1$BOVIN            COA1$POVMA
    5       COA1$SV40             CRG2$RANTE
    6       CRG2$RAT              CYS1$DICDI
    7       CYS2$DICDI            ELA2$MOUSE
    8       ELA2$PIG              FIBG$RAT
    9       FIBH$HUMAN            GLEM$HUMAN
   10       GLGA$ECOLI            HB2I$HUMAN
   11       HB2I$MOUSE            HEMA$MEASH
   12       HEMA$NDV              IG1R$HUMAN
   13       IGAO$DACDE            KAG2$MOUSE
   14       KAG3$MOUSE            KV5C$MOUSE
   15       KV5D$MOUSE            MAL3$DROME
   16       MALE$ECOLI            MYSB$RABIT
   17       MYSP$RAT              NUO5$PANPA
   18       NUO5$PONPY            PG40$HUMAN
   19       PGCA$CHICK            POLN$SFV
   20       POLN$SINDV            RADX$YEAST
   21       RAN1$SCHPO            ROB2$HUMAN
   22       ROC$HUMAN             SP0A$BACSU
   23       SP0B$BACSU            THRB$BOVIN
   24       THRB$HUMAN            TX1$ANESU
   25       TX11$NAJHH            VGE$BPS13
   26       VGF$BPG4              VNST$VSVJ
   27       VNST$VSVJM            YGYW$ECOLI
   28       YHL1$EPBAR            ZEG1$MAIZE


   A.4  IBM PS/2 720 Kb disks

   SWISS-PROT is stored on fourteen (14) 720 Kb disks. Each of
   these disks contains two of the 360 Kb format bulk files.



                   APPENDIX B: SOME STATISTICS


   B.1  Amino acid composition

        Composition in percent for the complete data bank

   Ala (A) 7.76   Gln (Q) 4.11   Leu (L) 9.07   Ser (S) 7.04
   Arg (R) 5.12   Glu (E) 6.12   Lys (K) 5.88   Thr (T) 5.86
   Asn (N) 4.36   Gly (G) 7.33   Met (M) 2.28   Trp (W) 1.36
   Asp (D) 5.21   His (H) 2.30   Phe (F) 3.95   Tyr (Y) 3.25
   Cys (C) 1.93   Ile (I) 5.28   Pro (P) 5.21   Val (V) 6.52

   Asx (B) 0.02   Glx (Z) 0.02   Xaa (X) 0.01


        Classification of the amino acids by their frequency

   Leu, Ala, Gly, Ser, Val, Glu, Lys, Thr, Ile, Pro, Asp, Arg,
   Asn, Gln, Phe, Tyr, His, Met, Cys, Trp
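
   The ranking above can be reproduced from the composition
   table. The following is a minimal sketch in Python; the
   names COMPOSITION and rank_by_frequency are illustrative and
   not part of SWISS-PROT. Pro and Asp are tied at 5.21, so
   their relative order in the output is arbitrary.

```python
# Percent composition copied from the table above (20 standard amino acids).
COMPOSITION = {
    "Ala": 7.76, "Gln": 4.11, "Leu": 9.07, "Ser": 7.04,
    "Arg": 5.12, "Glu": 6.12, "Lys": 5.88, "Thr": 5.86,
    "Asn": 4.36, "Gly": 7.33, "Met": 2.28, "Trp": 1.36,
    "Asp": 5.21, "His": 2.30, "Phe": 3.95, "Tyr": 3.25,
    "Cys": 1.93, "Ile": 5.28, "Pro": 5.21, "Val": 6.52,
}


def rank_by_frequency(composition):
    """Return residue names sorted from most to least frequent."""
    return sorted(composition, key=composition.get, reverse=True)


ranking = rank_by_frequency(COMPOSITION)
print(", ".join(ranking))
```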


   B.2   Distribution of the sequences by their organism of
         origin

   Total number  of species represented in this release of the
   data bank: 1345

   Species represented 1x:  636
                       2x:  259
                       3x:  140
                       4x:   88
                       5x:   40
                       6x:   39
                       7x:   25
                       8x:   23
                       9x:   15
                      10x:    8
                     >10x:   72


        Table of the most represented species

   739: HUMAN       133: CHICK       62: BACSU       55: TOBAC
   628: ECOLI       125: RABIT           LAMBD       54: BPT7
   414: MOUSE       117: DROME       59: EPBAR       52: MARPO
   312: RAT         113: PIG             MAIZE
   256: YEAST        68: XENLA       58: SALTY
   250: BOVIN        66: BPT4            VACCV


   B.3  Distribution of the sequences by size

   From   To Number             From   To Number
      1-  50    384             1001-1100     36
     51- 100   1000             1101-1200     30
    101- 150   1743             1201-1300     26
    151- 200    856             1301-1400     18
    201- 250    581             1401-1500     12
    251- 300    494             1501-1600      2
    301- 350    450             1601-1700     10
    351- 400    432             1701-1800      7
    401- 450    309             1801-1900      5
    451- 500    350             1901-2000      2
    501- 550    284             >2000         42
    551- 600    166
    601- 650    123             Currently the two largest sequences are:
    651- 700     74             APB$HUMAN   4563 a.a.
    701- 750     76             APOA$HUMAN  4548 a.a.
    751- 800     55
    801- 850     46
    851- 900     56
    901- 950     36
    951-1000     22
  

Swiss-Prot release 7.0

Published April 7, 1988

             SWISS-PROT RELEASE 7.0 RELEASE NOTES


   Date:     April 7, 1988
   Author:   A. Bairoch


                         1. INTRODUCTION

   1.1  Evolution

   Release 7.0  of SWISS-PROT  contains 6821 sequence entries,
   comprising  1'885'771  amino  acids  abstracted  from  7128
   references. This represents an increase of 11% over release
   6.0. The  recent growth  of the  data  bank  is  summarised
   below:

   Release    Date   Number of entries     Nb of amino acids

   3.0        11/86               4160               969 641
   4.0        04/87               4387             1 036 010
   5.0        09/87               5205             1 327 683
   6.0        01/88               6102             1 653 982
   7.0        04/88               6821             1 885 771


   1.2  Source of data

   Release 7.0  has been  updated using  protein sequence data
   from  release  15.0  of  the  PIR  (Protein  Identification
   Resource) protein  data bank,  as well  as  translation  of
   nucleotide sequence  data from  release 14.0  of  the  EMBL
   nucleotide sequence Data Library.

   As an  indication to the source of the sequence data in the
   SWISS-PROT data bank we list here the statistics concerning
   the DR (Databank Reference) pointer lines:

   Entries with pointer(s) to only PIR entri(es):          3487
   Entries with pointer(s) to only EMBL entri(es):         1541
   Entries with pointer(s) to both EMBL and PIR entri(es): 1683
   Entries with no pointers lines (entered in house):       110



     2. DESCRIPTION OF THE CHANGES MADE TO SWISS-PROT SINCE
                            RELEASE 6


   2.1  Sequences and annotations

   Some 700  new sequences  have been  added  since  the  last
   release, the sequence data of some 100 existing entries has
   been updated  and the  annotations of  almost 1000  entries
   have been revised.

   In particular, we have used review articles to update the
   annotations of the following groups or families of
   proteins: alcohol dehydrogenases, acylphosphatases,
   beta-lactamases, cholinesterases, elastases, lysozymes,
   nitrogenases, peroxidases, superoxide dismutases,
   prokaryotic lipoproteins, trypanosome variant surface
   glycoproteins, thioredoxins, heat-stable enterotoxins,
   pentraxins, azurins, metallothioneins, plastocyanins,
   Bowman-Birk serine protease inhibitors, snake neurotoxins
   and cytotoxins, natriuretic peptides, growth hormones and
   prolactins.


   2.2   Introduction of the Organism Classification (OC) line
         type

   The  OC   (Organism  Classification)   lines  contain   the
   taxonomic  classification   of  the  source  organism.  The
   classification is  listed top-down  as nodes in a taxonomic
   tree in which the most general grouping is given first. The
   individual items  are separated by semi-colons and the list
   is terminated  by a  period. The  general format for the OC
   line is:

   OC   NODE[; NODE...].

   As an  example, the  classification lines for a human (Homo
   sapiens) sequence entry should be:

   OC   EUKARYOTA; METAZOA; CHORDATA; VERTEBRATA; TETRAPODA;
   OC   MAMMALIA; EUTHERIA; PRIMATES.

   Currently, in release 7.0, the OC lines contain only the
   first node of the taxonomic tree. That node is one of
   EUKARYOTA, PROKARYOTA or VIRIDAE. This already helps
   users who want to extract from the data bank the entries
   containing sequences from eukaryotes, prokaryotes or
   viruses. We plan, in future releases, to implement a full
   set of OC lines for at least the most common species.
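
   As an illustration of the OC line format described above,
   here is a minimal Python sketch that joins the OC lines of
   one entry and splits them into taxonomic nodes. The function
   name parse_oc_lines is hypothetical; it assumes only the
   rules stated in this section (nodes separated by
   semi-colons, list terminated by a period).

```python
def parse_oc_lines(lines):
    """Join the OC lines of one entry and return the nodes, most general first."""
    parts = []
    for line in lines:
        if not line.startswith("OC"):
            raise ValueError(f"not an OC line: {line!r}")
        # Drop the two-letter line code and the padding that follows it.
        parts.append(line[2:].strip())
    text = " ".join(parts)
    if not text.endswith("."):
        raise ValueError("the OC list must be terminated by a period")
    # Remove the final period, then split on the semi-colon separators.
    return [node.strip() for node in text[:-1].split(";") if node.strip()]


human = [
    "OC   EUKARYOTA; METAZOA; CHORDATA; VERTEBRATA; TETRAPODA;",
    "OC   MAMMALIA; EUTHERIA; PRIMATES.",
]
nodes = parse_oc_lines(human)
top_node = nodes[0]  # the only node present in release 7.0 entries
print(top_node, len(nodes))
```

   Selecting all eukaryotic entries then amounts to keeping
   the entries whose first node is EUKARYOTA.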


   2.3  Documentation changes

   Starting with this release we have introduced two new
   indices: an index of enzyme sequences classified by their
   EC number (ECINDEX) and an index of EMBL Data Library
   sequences referenced in SWISS-PROT (EMBLTOSP).


                       3. THE NEXT RELEASE

   SWISS-PROT release 8.0 will probably be ready at the
   beginning of August 1988.



                     4. WE NEED YOUR HELP !

   We welcome any feedback from our users. We would especially
   appreciate it if you notified us of any sequences from your
   field of expertise that are missing from the data bank. We
   would also like to be notified about annotations that need
   updating, for example if the function of a protein has been
   clarified or if new post-translational information has
   become available.





                APPENDIX A: DISKS FOR SWISS-PROT


   IBM PC/AT 1.2 Mb disks

   SWISS-PROT is stored on seven 1.2 Mb disks. Each of these
   disks contains a single bulk file (PROT7_01.BLK to
   PROT7_07.BLK):

   Disk     First sequence        Last Sequence
    1       16K$TRVPS             COAT$CAMVC
    2       COAT$CAMVD            FIBA$CANFA
    3       FIBA$HUMAN            IACA$PIG
    4       IATP$BOVIN            MUCB$HUMAN
    5       MUCM$MOUSE            PSBD$CHLRE
    6       PSBD$MARPO            TRPE$SALTY
    7       TRPE$SERMA            ZEIG$MAIZE

   IBM PC 360 Kb disks

   SWISS-PROT is stored on twenty-four (24) 360 Kb disks. Each
   of these disks contains a single bulk file (PROT7_01.BLK to
   PROT7_24.BLK):

   Disk     First sequence        Last Sequence
    1       16K$TRVPS             ANP3$PSEAM
    2       ANP4$PSEAM            CA11$BOVIN
    3       CA11$CHICK            CGPG$BOVIN
    4       CH15$DROME            CRAA$PIG
    5       CRAA$PROCA            CYCP$RHOSH
    6       CYCP$RHOSP            ENOA$RAT
    7       ENOB$CHICK            FTSZ$ECOLI
    8       FTZ$DROME             H2A1$PSAMI
    9       H2A1$SCHPO            HBB4$SALIR
   10       HBBA$BOSJA            HXKA$YEAST
   11       HXKB$YEAST            KABL$HUMAN
   12       KABL$MLVAB            KV5J$MOUSE
   13       KV5K$MOUSE            MERB$SERMA
   14       MERB$STAAU            NFL$BOVIN
   15       NFL$HUMAN             PA2$RAT
   16       PA2$TRIOK             POL2$HPAV
   17       POLG$ENMYV            RAS$SCHPO
   18       RAS1$DROME            RPOD$BACSU
   19       RPOD$ECOLI            TALA$MPOV3
   20       TALA$MPOVA            TRFE$CHICK
   21       TRFE$HUMAN            VE59$LAMBD
   22       VE5A$HPV11            VNCS$PAVBO
   23       VNCY$MUMIM            YG12$BPT3
   24       YG13$BPT3             ZEIG$MAIZE

   IBM PS/2 720 Kb disks

   SWISS-PROT is stored on twelve (12) 720 Kb disks. Each of
   these disks contains two of the 360 Kb format bulk files.
  

Swiss-Prot release 6.0

Published January 2, 1988

             SWISS-PROT RELEASE 6.0 RELEASE NOTES


   Date:     January 2, 1988
   Author:   A. Bairoch


                         1. INTRODUCTION

   1.1  Evolution

   Release 6.0  of SWISS-PROT  contains 6102 sequence entries,
   comprising  1'653'982  amino  acids  abstracted  from  6422
   references. This represents an increase of 25% over release
   5.0. The  recent growth  of the  data  bank  is  summarised
   below:

   Release    Date   Number of entries     Nb of amino acids

   3.0        11/86               4160               969 641
   4.0        04/87               4387             1 036 010
   5.0        09/87               5205             1 327 683
   6.0        01/88               6102             1 653 982

   1.2  Source of data

   Release 6.0  has been  updated using  protein sequence data
   from  release  14.0  of  the  PIR  (Protein  Identification
   Resource) protein  data bank,  as well  as  translation  of
   nucleotide sequence  data from  release 13.0  of  the  EMBL
   nucleotide sequence Data Library.

   As an  indication to the source of the sequence data in the
   SWISS-PROT data bank we list here the statistics concerning
   the DR (Databank Reference) pointer lines:

   Entries with pointer(s) to only PIR entri(es):          3490
   Entries with pointer(s) to only EMBL entri(es):         1173
   Entries with pointer(s) to both EMBL and PIR entri(es): 1367
   Entries with no pointers lines (entered in house):        72



     2. DESCRIPTION OF THE CHANGES MADE TO SWISS-PROT SINCE
                            RELEASE 5


   2.1  New sequence data

   We have  continued  to  scan  the  EMBL  Data  Library  for
   sequences not yet available in the protein data bank and we
   have  translated   and  annotated   some   500   additional
   sequences. In  addition to  those new  sequences, some  100
   existing entries  have  been  updated  with  sequence  data
   translated from EMBL entries.


   2.2  Modifications in the feature keys

   -  Four new  feature keys  have been  introduced with  this
      release:

      DNA_BIND:   extent of a DNA-binding region.
      INIT_MET:   to indicate that there was an initiator
                  methionine in front of the sequence first
                  residue.
      MYRISTYL:   myristylated residue.
      TRANSMEM:   extent of a transmembrane region.

   -  Two feature keys have been renamed:

      PROPEP      replaces  PROP
      SIMILAR     replaces  HOMOLOGY


   2.3  Format modifications

   -  Before release  6.0, if  an entry  had more  than one DT
      line, these  lines were  listed starting with the oldest
      one and ending with the most recent one. So as to follow
      the conventions  used by  the EMBL  Data Library we have
      inverted that  order: the  first DT line is now the most
      recent one.

   -  A DR line pointing to a PIR entry formerly used that
      entry name as the primary identifier and the PIR release
      version as the secondary identifier. Since PIR has
      introduced accession numbers (starting with release 13),
      we now use the first accession number as the primary
      identifier and the entry name as the secondary
      identifier. Example:

      what was:   DR   PIR; R5EC7; RELEASE 11, DECEMBER 1986.
      is now:     DR   PIR; A02768; R5EC7.


   2.4  Organism identification code changes

   -  The organism  code for  guinea-pig  which  formerly  was
      GUPIG was  changed to  CAVPO (CAvia  PORcellus) so as to
      fit with the general naming rules.
   -  We have modified some virus organism identification
      codes to make them compatible with the abbreviations
      commonly used.
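
   The CAVPO example suggests how such a code can be derived
   from a scientific name. The Python sketch below assumes that
   the general rule is "first three letters of the genus plus
   first two letters of the species"; this rule is inferred
   from the example and is not spelled out in these notes. The
   function name make_organism_code is hypothetical, and many
   common organisms (for example HUMAN or ECOLI) use codes that
   do not follow this pattern.

```python
def make_organism_code(scientific_name):
    """Build a five-letter code from a binomial name (assumed 3 + 2 rule)."""
    genus, species = scientific_name.split()[:2]
    # Assumption: three letters of the genus, two letters of the species.
    return (genus[:3] + species[:2]).upper()


guinea_pig = make_organism_code("Cavia porcellus")
print(guinea_pig)
```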


   2.5  Other changes

   -  Keywords have been updated in most entries.
   -  The annotations of a number of protein families have
      been updated and completed using recent reviews.


                       3. THE NEXT RELEASE

   SWISS-PROT release 7.0 will probably be ready at the
   beginning of April 1988. We hope that at that stage we
   will have caught up with the backlog of sequences to
   translate from the EMBL Data Library nucleotide sequences.



                     4. WE NEED YOUR HELP !

   We welcome any feedback from our users. We would especially
   appreciate it if you notified us of any sequences from your
   field of expertise that are missing from the data bank. We
   would also like to be notified about annotations that need
   updating, for example if the function of a protein has been
   clarified or if new post-translational information has
   become available.



                APPENDIX A: DISKS FOR SWISS-PROT


   IBM PC/AT 1.2 Mb disks

   SWISS-PROT is stored on seven 1.2 Mb disks. Each of these
   disks contains a single bulk file (PROT6_01.BLK to
   PROT6_07.BLK):

   Disk     First sequence        Last Sequence
    1       16K$TRVPS             CRAA$LOXAF
    2       CRAA$MACMU            GMCS$MOUSE
    3       GONL$HUMAN            KRAF$MOUSE
    4       KRAF$MSV36            PGMU$RABIT
    5       PGMU$YEAST            TESB$RAT
    6       TETX$CLOTE            YX$NPVGM
    7       YXL2$EPBAR            ZEIG$MAIZE


   IBM PC 360 Kb disks

   SWISS-PROT is stored on twenty-one 360 Kb disks. Each of
   these disks contains a single bulk file (PROT6_01.BLK to
   PROT6_21.BLK):

   Disk     First sequence        Last Sequence
    1       16K$TRVPS             ARA5$AMBEL
    2       ARAA$SALTY            CAPB$HUMAN
    3       CAPP$ECOLI            COX1$BOVIN
    4       COX1$CHLRE            CYC$HELAN
    5       CYC$HELAS             EGF$HUMAN
    6       EGF$MOUSE             FUCO$HUMAN
    7       FULC$MYXFU            H41$TETPY
    8       H42$TETPY             HEMA$INASW
    9       HEMA$INATA            IL3$HYLLA
   10       IL3$MOUSE             KPCA$BOVIN
   11       KPCB$BOVIN            LV1$CHICKL
   12       LV1A$HUMAN            NAC2$RAT
   13       NACH$ELEEL            OSTC$MACFA
   14       OSTC$MOUSE            POLG$FMDVA
   15       POLG$FMDVC            RBS$ANASP
   16       RBS$TOBAC             SCX3$BUTOC
   17       SCX3$CENSC            TINT$MOUSE
   18       TLP1$MOUSE            V55$BPT7
   19       V57$HSV11             VMSA$HPBVZ
   20       VMSA$WHV              YMC4$YEAST
   21       YMCA$ASPNI            ZEIG$MAIZE
  

Swiss-Prot release 5.0

Published September 8, 1987

             SWISS-PROT RELEASE 5.0 RELEASE NOTES


   Date:     September 8, 1987
   Author:   A. Bairoch


                          INTRODUCTION

   Release 5.0  of SWISS-PROT  contains 5205 sequence entries,
   comprising  1'327'683  amino  acids  abstracted  from  5673
   references. This represents an increase of 28% over release
   4.0. The  recent growth  of the  data  bank  is  summarized
   below:

   Release      Date         Number of entries           Nb of amino
                                                         acids

   3.0          11/86        4160                            969 641
   4.0          04/87        4387                          1 036 010
   5.0          09/87        5205                          1 327 683


                         SOURCE OF DATA

   Release 5.0  has been  updated using  protein sequence data
   from release  12.0 of the PIR protein data bank, as well as
   translation of nucleic-acid sequence data from release 12.0
   of the EMBL nucleotide Data Library. The statistics for the
   source of the sequences in release 5.0 of SWISS-PROT are:

   Entries adapted from PIR entries                   4419
   Entries translated from EMBL entries                708
   Entries entered "in house"                           78


   DESCRIPTION OF THE CHANGES MADE TO SWISS-PROT SINCE RELEASE 4


   New sequence data

   To make  SWISS-PROT as complete as possible we have scanned
   the EMBL  Data Library  for sequences  not yet available in
   the protein  data bank and we have translated and annotated
   some 700 additional sequences.

   In addition  to those  new  sequences,  some  200  existing
   entries have  been updated  with sequence  data  translated
   from EMBL entries.


   We have also started to add, to entries which were
   sequenced at the DNA or RNA level, DR lines pointing to
   the primary accession number of the nucleotide sequence
   entry in EMBL. The current statistics for DR lines are the
   following:

   Entries with pointer line(s) to a PIR entry        4419
   Entries with pointer line(s) to an EMBL entry      1512
   Entries with pointer line(s) to a PDB entry         140


   Modifications in the feature keys

   -  A new feature key has been introduced with this release:
      MUTAGEN which  is used to describe sites which have been
      experimentally altered.

   -  The description  part of  the CONFLICT  and VARIANT keys
      now follows a precise format.

      a) If  the position described is missing the description
         part will be similar to those shown in  the following
         examples:

      FT   CONFLICT     33     33       MISSING (IN REF. 2).
      FT   VARIANT      12    245       MISSING (IN SMALL VARIANT).

      b) Positions with amino acid change(s) are now described
         as shown in the following examples:

      FT   CONFLICT     60     60       P -> A (IN REF. 2).
      FT   CONFLICT      1      5       ASTQS -> GAWTL (IN REF. 3).
      FT   VARIANT      39     39       P -> G (IN 50% OF THE MOLECULES).
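
      The two description formats above lend themselves to
      automatic parsing. The following Python sketch is one
      possible reading of the examples, not an official
      grammar; the regular expressions and the function name
      parse_feature are assumptions made for illustration.

```python
import re

# Line code, feature key, start and end positions, then the description.
FT_LINE = re.compile(
    r"^FT\s+(?P<key>\S+)\s+(?P<start>\d+)\s+(?P<end>\d+)\s+(?P<desc>.+)$"
)
# Format a): the position(s) are missing.
MISSING = re.compile(r"^MISSING \((?P<note>.+)\)\.$")
# Format b): one or more residues are changed.
CHANGE = re.compile(r"^(?P<old>[A-Z]+) -> (?P<new>[A-Z]+) \((?P<note>.+)\)\.$")


def parse_feature(line):
    """Parse one CONFLICT or VARIANT feature line into a dictionary."""
    m = FT_LINE.match(line.rstrip())
    if m is None:
        raise ValueError(f"not a feature line: {line!r}")
    result = {"key": m["key"], "start": int(m["start"]), "end": int(m["end"])}
    missing = MISSING.match(m["desc"])
    change = CHANGE.match(m["desc"])
    if missing:
        result.update(kind="missing", note=missing["note"])
    elif change:
        result.update(kind="change", old=change["old"],
                      new=change["new"], note=change["note"])
    else:
        # Descriptions that follow neither format are kept verbatim.
        result.update(kind="other", note=m["desc"])
    return result


conflict = parse_feature("FT   CONFLICT      1      5       ASTQS -> GAWTL (IN REF. 3).")
variant = parse_feature("FT   VARIANT      12    245       MISSING (IN SMALL VARIANT).")
print(conflict["old"], conflict["new"], variant["kind"])
```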


   Format modifications

   We have  made a number of small modifications to the format
   of SWISS-PROT  to make  it more compatible with that of the
   EMBL Data Library. Those format changes are the following:

   -  Journal abbreviations  are now  identical in  both  data
      banks.
   -  The format  of the  RL line  for submitted  data and for
      Ph.D. thesis  has been  modified to  be compatible  with
      that used in the EMBL Data Library.
   -  In the  DT line,  there are  now two  spaces between the
      date and  the comment  part (instead  of one in previous
      releases).


   Organism identification code changes

   We have modified some virus organism identification codes
   to make them compatible with the abbreviations commonly
   used. Among the codes changed are those of the AIDS viruses,
   which now have organism codes starting with HIV1 or HIV2.


   Various changes

   -  Keywords have  been updated  in a  significant number of
      entries.
   -  Cytochrome P450 entry names and descriptions have been
      completely revised to take into account the new
      nomenclature.
   -  The annotations of a number of protein families have
      been updated and completed using recent reviews (among
      those reviewed are: cystatins, ubiquitins, RAS
      oncogenes, "EF-hand" calcium-binding proteins,
      ferritins, serine/threonine protein kinases, GTP-binding
      proteins, etc.).



                        THE NEXT RELEASE

   SWISS-PROT release 6.0 will probably be ready at the end of
   January or the beginning of February 1988. It will include
   new data from PIR release 13 and a significant number of
   additional translated EMBL release 13 sequences.

   At this  stage all the sequences which are stored in SWISS-
   PROT and  which were  sequenced at  the DNA  or  RNA  level
   should have  DR lines  pointing to  the  primary  accession
   number of the nucleotide sequence entry in EMBL.

   We also  plan to  add some new feature keys and to make the
   keyword usage in SWISS-PROT more consistent.



                APPENDIX A: DISKS FOR SWISS-PROT


   IBM PC/AT 1.2Mb disks

   -  SWISS-PROT is stored on six 1.2 Mb disks.
   -  Disks 1 to 5 each contain a single bulk file
      (PROT5_01.BLK to PROT5_05.BLK) which corresponds to the
      sequences adapted from entries in the PIR data bank.
      Those sequences are listed in the order in which they
      are found in the PIR data bank.
   -  Disk 6 contains two bulk files: PROT5_NW.BLK and
      PROT5_PR.BLK. The first bulk file corresponds to new
      sequences translated from EMBL Data Library entries, the
      second one to a few "preliminary" entry sequences.


   IBM PC 360 Kb disks

   -  SWISS-PROT is stored on eighteen 360 Kb disks.
   -  Disks 1 to 15 each contain a single bulk file
      (PROT5_01.BLK to PROT5_15.BLK) which corresponds to the
      sequences adapted from entries in the PIR data bank.
      Those sequences are listed in the order in which they
      are found in the PIR data bank.
   -  Disks 16 and 17 each contain a single bulk file
      (PROT5_N1.BLK and PROT5_N2.BLK) which corresponds to new
      sequences translated from EMBL Data Library entries.
   -  Disk 18 contains two bulk files: PROT5_N3.BLK and
      PROT5_PR.BLK. The first bulk file corresponds to the
      last part of the new sequences translated from EMBL Data
      Library entries, the second one to a few "preliminary"
      entry sequences.
  

Swiss-Prot release 4.0

Published June 10, 1987

                            SWISS-PROT RELEASE 4.0


        Announcing SWISS-PROT release 4.0

   Release 4.0  of the  SWISS-PROT data  bank is  now available.  The total
   number of  sequence entries has grown from 4160 (in release 3.0) to 4387
   and the total number of amino-acids from 969,641 to 1,036,010.

   Release 4.0  has been  updated using  protein sequence data from release
   11.0 of  the P.I.R(1)  protein data  bank, as  well  as  translation  of
   nucleic-acid sequence data from release 10.0 of the EMBL nucleotide data
   library.


        The DR line.

   This release introduces a new type of data line, the DR line, which is
   used as a pointer to information related to the stored sequence that
   exists in data collections external to SWISS-PROT.

   For example: if the X-ray crystallographic atomic coordinates of a
   sequence are stored in the Brookhaven Protein Data Bank (PDB), there
   are now one or more DR lines pointing to the corresponding entries in
   that data bank. All the sequences adapted from the PIR data bank have
   a DR line pointing to the original PIR entry.

   The general format of the DR line is the following:

   DR   DATA_BANK_IDENTIFIER; PRIMARY_IDENTIFIER; SECONDARY_IDENTIFIER.

   Examples of complete DR lines are shown here:

   DR   PIR; KIPGA; RELEASE 11.0, DECEMBER 1986.
   DR   PDB; 2ADK; 30-SEP-83.
   DR   EMBL; X01704; GMNOD23.

   In addition, a new keyword, 3D-STRUCTURE, has been added to all the
   sequence entries which have a DR line pointing to the PDB structural
   data bank.
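
   As an illustration of the general DR line format given above, the
   following Python sketch splits a DR line into its three fields. The
   names DRLine and parse_dr_line are hypothetical; the sketch assumes only
   what is stated here: three fields separated by semi-colons and a
   terminating period.

```python
from typing import NamedTuple


class DRLine(NamedTuple):
    data_bank: str   # DATA_BANK_IDENTIFIER, e.g. PIR, PDB or EMBL
    primary: str     # PRIMARY_IDENTIFIER
    secondary: str   # SECONDARY_IDENTIFIER


def parse_dr_line(line):
    """Split one DR line into data bank, primary and secondary identifiers."""
    if not line.startswith("DR"):
        raise ValueError(f"not a DR line: {line!r}")
    body = line[2:].strip()
    if not body.endswith("."):
        raise ValueError("a DR line must be terminated by a period")
    # Split on the first two semi-colons only; the last field may hold commas.
    fields = [field.strip() for field in body[:-1].split(";", 2)]
    if len(fields) != 3:
        raise ValueError(f"expected three fields in: {line!r}")
    return DRLine(*fields)


examples = [
    "DR   PIR; KIPGA; RELEASE 11.0, DECEMBER 1986.",
    "DR   PDB; 2ADK; 30-SEP-83.",
    "DR   EMBL; X01704; GMNOD23.",
]
parsed = [parse_dr_line(line) for line in examples]
# Entries that should carry the 3D-STRUCTURE keyword have a PDB pointer.
has_structure = any(dr.data_bank == "PDB" for dr in parsed)
print(parsed[1].primary, has_structure)
```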


        Availability of SWISS-PROT.

   -  SWISS-PROT will  soon be  available, on-line,  on the BIONET resource
      computer.
   -  EMBL is distributing tapes of SWISS-PROT.
   -  American  customers   of  PC/GENE   can   acquire   SWISS-PROT   from
      IntelliGenetics on floppy media.
   -  European customers  of PC/GENE  will receive  SWISS-PROT release  4.0
      from GENOFIT.


        The next release.

   SWISS-PROT release 5.0 will probably be ready at the end of July or the
   beginning of August 1987. It will include new data from P.I.R release
   12, a large number of translated EMBL sequences (more than 300 from
   releases 10.0 and 11.0) and sequences entered by the staff of the
   Medical Biochemistry Department of the University of Geneva. A large
   proportion of the sequences which are stored in SWISS-PROT and which
   were sequenced at the DNA or RNA level will have DR lines pointing to
   the primary accession number of the nucleotide sequence entry in EMBL.


        Direct submission of data.

   We accept and encourage direct submission of data to the SWISS-PROT data
   bank. You can either send your sequence(s) on one of a number of
   computer media (tape, 360 Kb or 1.2 Mb IBM PC floppy disks, 720 Kb
   3.5 inch disks, most CP/M disk formats, SoftStrip) to our address in
   Switzerland, or send your sequence(s) by electronic mail to one of the
   following addresses:

        On BIONET:  bairoch(@BIONET-20.ARPA)
        On BITNET:  bairoch@cgecmu51.bitnet


   A. Bairoch / 10 June 1987.
   Medical Biochemistry Department.
   University of Geneva.
   1, Rue Michel Servet
   1211 Geneva 4
   Switzerland.

   ____________________
   (1)  P.I.R (Protein Identification Resource) is supported by the
        Division of Research Resources of the NIH and prepared by the staff
        of the National Biomedical Research Foundation.
  

Swiss-Prot release 3.0

Published November 28, 1986

                            SWISS-PROT RELEASE 3.0


        Announcing SWISS-PROT release 3.0

   Release 3.0  of the  SWISS-PROT data  bank is  now available.  The total
   number of  sequence entries has grown from 3939 (in release 2.0) to 4160
   and the total number of amino-acids from 900'163 to 969'641.

   Release 3.0  has been  updated using  mainly protein  sequence data from
   release 10.0  of the  P.I.R(1) protein data bank, as well as translation
   of nucleic-acid sequence data from release 8.0 of the EMBL data library.


        Availability of SWISS-PROT.

   -  SWISS-PROT will  be available  very  soon,  on-line,  on  the  BIONET
      resource computer.
   -  EMBL will start distributing tapes of SWISS-PROT in the near future.
   -  PC/GENE American customers can acquire from IntelliGenetics both EMBL
      and SWISS-PROT on floppy media.
   -  PC/GENE European  customers will  receive SWISS-PROT  release 3  from
      GENOFIT.


        The next release.

   SWISS-PROT release 4.0 will probably be ready in the spring of 1987. It
   will include new data from P.I.R, translated EMBL sequences (from
   release 9.0) and sequences entered by the staff of the Medical
   Biochemistry Department of the University of Geneva.

   A new feature will be introduced in the next release: sequences whose
   three-dimensional structure has been published will have an extra
   reference data block pointing to the name of the entry in the Brookhaven
   National Laboratory Protein Data Bank, which lists atomic coordinates
   from the crystallographic data. A new keyword, 3D-STRUCTURE, will be
   added to all those sequence entries.



   A. Bairoch / 28 November 1986.
   Medical Biochemistry Department.
   University of Geneva.
   Switzerland.

   ____________________
   1  P.I.R (Protein  Identification Resource) is supported by the Division
      of Research  Resources of  the NIH  and prepared  by the staff of the
      National Biomedical Research Foundation.
  

TrEMBL release 27.0

Published July 6, 2004
                     UniProt/TrEMBL Release Notes
                     Release 27, 5th July 2004

    EMBL Outstation
    European Bioinformatics Institute (EBI)
    Wellcome Trust Genome Campus
    Hinxton
    Cambridge CB10 1SD
    United Kingdom

    Telephone: (+44 1223) 494 444
    Fax: (+44 1223) 494 468
    Electronic mail address: datalib@ebi.ac.uk/swissprot@ebi.ac.uk
    WWW server: http://www.ebi.ac.uk/

    Swiss Institute of Bioinformatics (SIB)
    Centre Medical Universitaire
    1, rue Michel Servet
    1211 Geneva 4
    Switzerland

    Telephone: (+41 22) 702 50 50
    Fax: (+41 22) 702 58 58
    Electronic mail address: swiss-prot@expasy.org
    WWW server: http://www.expasy.org/

    Protein Information Resource (PIR)
    Georgetown University Medical Center
    3900 Reservoir Road, NW
    Box 571455
    Washington, DC 20057-1455
    United States of America

    Telephone: (+1 202) 687 1039
    Fax: (+1 202) 687 0057
    Electronic mail address: pirmail@georgetown.edu
    WWW server: http://pir.georgetown.edu


    Acknowledgements

    UniProt/TrEMBL has been prepared by:

    o  Claire O'Donovan, Maria Jesus Martin, Yasmin Alam-Faruque,
       Nicola Althorpe, Daniel Barrell, Wei mun Chan, Paul Browne,
       Kirill Degtyarenko, Ruth Eberhardt, Gill Fraser, Alexander
       Fedetov, Rodrigo Fernandez, John Garavelli, Andre Hackmann,
       Alan Horne, Julius Jacobsen, Alexander Kanapin, Youla
       Karavidopoulou, Paul Kersey, Ernst Kretschmann, Kati Laiho,
       Minna Lehvaslaiho, Michele Magrane, Virginie Mittard, Nicola
       Mulder, John F. O'Rourke, Sandra Orchard, Astrid Rakow,
       Mark Rynbeek, Sandra van den Broek, Eleanor Whitfield, Allyson
       Williams and Rolf Apweiler at the EMBL Outstation - European
       Bioinformatics Institute (EBI) in Hinxton, UK.
    o  Amos Bairoch, Alexandre Gattiker, Karine Michoud, Catherine
       Rivoire, Nicole Redaschi and Sandrine Pilbout at the Swiss
       Institute of Bioinformatics in Geneva, Switzerland.

    Copyright Notice

    UniProt/TrEMBL copyright (c) 2004 EMBL-EBI
    This manual and the database it accompanies may be copied and
    redistributed freely, without advance permission, provided
    that this copyright statement is reproduced with each copy.

    Citation

    If you want to cite UniProt/TrEMBL in a publication please use
    the following reference:

    Apweiler R., Bairoch A., Wu C.H., Barker W.C., Boeckmann B., Ferro
    S., Gasteiger E., Huang H., Lopez R., Magrane M., Martin M.J.,
    Natale D.A., O'Donovan C., Redaschi N. and Yeh L.L.
    UniProt: the Universal Protein knowledgebase
    Nucleic Acids Res. 32: D115-D119 (2004)

 1. Introduction


UniProt/TrEMBL is a computer-annotated protein sequence database
complementing the UniProt/Swiss-Prot database. Together they constitute
the UniProt Knowledgebase. The DDBJ/EMBL/GenBank nucleotide sequence
databases' CDS translations, the sequences of PDB structures, and
directly sequenced peptides extracted from the literature or submitted
directly to UniProt are used by default as the raw material for the
UniProt Knowledgebase. However, some data from DDBJ/EMBL/GenBank including
most of the Whole Genome Shotgun (WGS) data, CDS translations leading to
small fragments or not coding for real proteins, synthetic sequences,
non-germline Immunoglobulins and T-cell receptors, and most patent
application sequences are actively excluded from the Knowledgebase.
Including these data in the Knowledgebase would pollute the database
with highly unstable and low-quality data. However, we do provide all
publicly available protein sequences in the UniProt archive (UniParc)
(http://www.uniprot.org/). Sequences from other UniParc source records
that the UniProt curators identify as important but missing from the
Knowledgebase are also used to create new UniProt Knowledgebase records.
This process ensures that the UniProt Knowledgebase is not missing any
important sequences available in the protein sequence repositories, while
minimising the amount of unstable and low-quality data in the Knowledgebase.


 2. Why a complement to UniProt/Swiss-Prot?

The ongoing gene sequencing and mapping projects have dramatically
increased the number of protein sequences to be incorporated into
UniProt/Swiss-Prot. We do not want to dilute the quality standards of
UniProt/Swiss-Prot by incorporating sequences without proper sequence
analysis and annotation but we do want to make the sequences available
as quickly as possible. UniProt/TrEMBL achieves this goal and is a major
step in the process of speeding up subsequent upgrading of annotation to
the standard UniProt/Swiss-Prot quality.


 3. The Release

This UniProt/TrEMBL release has been produced in synch with
UniProt/Swiss-Prot release 44 and together they comprise the UniProt
Knowledgebase release 2.0. It was created from the EMBL Nucleotide
Sequence Database release 79 and updates until the 18-June-2004 and
contains 1'333'917 entries and 413'323'560 amino acids.

UniProt/TrEMBL is organized in subsections:

arc.dat (Archaea):                          4245 entries
arp.dat (Complete Archaeal proteomes):     33050 entries
fun.dat (Fungi):                           41959 entries
hum.dat (Human):                           49176 entries
inv.dat (Invertebrates):                  147306 entries
mam.dat (Other Mammals):                   18352 entries
mhc.dat (MHC proteins):                    10528 entries
org.dat (Organelles):                     112691 entries
phg.dat (Bacteriophages):                  13750 entries
pln.dat (Plants):                         116371 entries
pro.dat (Prokaryotes):                    169966 entries
prp.dat (Complete Prokaryotic Proteomes): 330392 entries
rod.dat (Rodents):                         47097 entries
unc.dat (Unclassified):                      963 entries
vrl.dat (Viruses):                        124972 entries
vrt.dat (Other Vertebrates):               30294 entries
vrv.dat (Retroviruses):                   116571 entries

275'585 new entries have been integrated in UniProt/TrEMBL.

More statistics for the UniProt/TrEMBL release are available at
http://www.ebi.ac.uk/trembl/

In the document delac_tr.txt, you will find a list of all accession
numbers which were previously present in UniProt/TrEMBL, but which
have now been deleted from the database.


 4. Format differences between UniProt/Swiss-Prot and UniProt/TrEMBL

The format and conventions used by UniProt/TrEMBL follow as closely
as possible those of UniProt/Swiss-Prot. Hence, it is not necessary
to produce an additional user manual and extensive release notes
for UniProt/TrEMBL. The information given in the UniProt/Swiss-Prot
release notes and user manual is in general valid for UniProt/TrEMBL.
The differences are mentioned below.

The general structure of an entry is identical in both databases.

The data class used in UniProt/TrEMBL (in the ID line) is always
'PRELIMINARY', whereas in UniProt/Swiss-Prot it is always 'STANDARD'.

Differences in line types:

The ID line (IDentification):

The entry name used in UniProt/TrEMBL is the same as the Accession
Number of the entry.
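
As an illustration of the two rules above, the following minimal Python
sketch reads the data class from an ID line and reports which section the
entry belongs to. The sample ID line, the accession Q12345 and the helper
names are hypothetical, and the layout of the fields after the data class
is an assumption, since this document only specifies the entry name and
the data class.

```python
def parse_id_line(line):
    """Return (entry_name, data_class) from a flat-file ID line.

    Assumes whitespace-separated fields: the line code 'ID', the entry
    name, then the data class followed by a semicolon.
    """
    if not line.startswith("ID"):
        raise ValueError("not an ID line: %r" % line)
    fields = line[2:].split()
    if len(fields) < 2:
        raise ValueError("ID line is missing fields: %r" % line)
    entry_name = fields[0]
    data_class = fields[1].rstrip(";")
    return entry_name, data_class


def section_of(line):
    """Map the data class of an ID line to the database section."""
    _, data_class = parse_id_line(line)
    sections = {
        "PRELIMINARY": "UniProt/TrEMBL",
        "STANDARD": "UniProt/Swiss-Prot",
    }
    return sections.get(data_class, "unknown")


# Hypothetical TrEMBL-style ID line: the entry name equals the accession.
sample = "ID   Q12345      PRELIMINARY;      PRT;   250 AA."
print(parse_id_line(sample), section_of(sample))
```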

The DT line (DaTe)

The format of the DT lines, which indicate when an entry was
created and updated, is identical to that defined in UniProt/Swiss-Prot,
but the DT lines in UniProt/TrEMBL refer to the UniProt/TrEMBL release.
The difference is shown in the example below.

    DT lines in a UniProt/Swiss-Prot entry:

    DT   01-JAN-1988 (Rel. 06, Created)
    DT   01-JUL-1989 (Rel. 11, Last sequence update)
    DT   01-AUG-1992 (Rel. 23, Last annotation update)

    DT lines in a UniProt/TrEMBL entry:

    DT   01-NOV-1996 (TrEMBLrel. 01, Created)
    DT   01-NOV-1999 (TrEMBLrel. 12, Last sequence update)
    DT   01-MAR-2004 (TrEMBLrel. 26, Last annotation update)
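
For readers who process these lines programmatically, here is a minimal
Python sketch that parses both DT line styles shown above and records
which database the release number refers to. The regular expression is
derived only from the six example lines, so it is an assumption rather
than a full format specification, and the function name is hypothetical.

```python
import re
from datetime import date

# Month abbreviations as they appear in the example DT lines.
MONTHS = {name: number for number, name in enumerate(
    ["JAN", "FEB", "MAR", "APR", "MAY", "JUN",
     "JUL", "AUG", "SEP", "OCT", "NOV", "DEC"], start=1)}

# 'Rel.' marks a Swiss-Prot release, 'TrEMBLrel.' a TrEMBL release.
DT_PATTERN = re.compile(
    r"^DT\s+(\d{2})-([A-Z]{3})-(\d{4}) \((TrEMBLrel|Rel)\. (\d+), (.+)\)$")


def parse_dt_line(line):
    """Parse one DT line into a dict; raise ValueError if it does not match."""
    match = DT_PATTERN.match(line.strip())
    if match is None:
        raise ValueError("unrecognised DT line: %r" % line)
    day, month, year, prefix, release, event = match.groups()
    if month not in MONTHS:
        raise ValueError("unknown month in DT line: %r" % line)
    return {
        "date": date(int(year), MONTHS[month], int(day)),
        "database": ("UniProt/TrEMBL" if prefix == "TrEMBLrel"
                     else "UniProt/Swiss-Prot"),
        "release": int(release),
        "event": event,
    }


print(parse_dt_line("DT   01-MAR-2004 (TrEMBLrel. 26, Last annotation update)"))
```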


 5. Bi-Weekly incremental UniProt Knowledgebase releases

5.1 UniProt Knowledgebase

In addition to full releases, we also provide two compressed files
biweekly: uniprot.sprot.dat.gz and uniprot.trembl.dat.gz, at
http://www.uniprot.org/database/download.shtml, giving users access
to the latest data.
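
Once downloaded, such a compressed file can be read entry by entry without
unpacking it first. The following Python sketch is one possible approach,
not an official tool. It assumes the usual flat-file convention that each
entry ends with a line containing only "//", which is not stated in this
document. The function name and the local file path are hypothetical.

```python
import gzip


def iter_entries(path):
    """Yield each entry of a gzip-compressed flat file as a list of lines.

    Assumes every entry is terminated by a line consisting of '//'.
    """
    entry = []
    with gzip.open(path, "rt", encoding="ascii", errors="replace") as handle:
        for raw_line in handle:
            line = raw_line.rstrip("\n")
            if line == "//":
                if entry:
                    yield entry
                entry = []
            else:
                entry.append(line)
    # Tolerate a final entry that lacks the terminator.
    if entry:
        yield entry


def count_entries(path):
    """Count the entries in a compressed flat file."""
    return sum(1 for _ in iter_entries(path))
```

Usage, assuming the file has been saved locally:
count_entries("uniprot.trembl.dat.gz") returns the number of entries while
keeping only one entry in memory at a time.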

5.2 XML

A version of the UniProt Knowledgebase in XML format has been developed
and is provided with this release. More information is available at
http://www.uniprot.org/support/documents.shtml and the data can be
downloaded from http://www.uniprot.org/database/download.shtml

We would welcome any feedback from the user community.

5.3 Varsplic Expand

We also provide Varsplic Expand which is a program to generate
"expanded" sequences from UniProt Knowledgebase records i.e. sequences
including the variants specified by the varsplic, variant and conflict
annotations. New records are produced in either pseudo-UniProt/Swiss-Prot
or FASTA format for each specified variant. More information and the data
are available at http://www.uniprot.org/database/download.shtml

 6. Access/Data Distribution


The UniProt/TrEMBL release 27 is available at:
ftp.ebi.ac.uk/pub/databases/trembl


The biweekly UniProt Knowledgebase release is available for searches and
download from http://www.uniprot.org/database/download.shtml

The UniProt Knowledgebase release is also available on CD-ROM from the EBI.


 7. General announcements and Forthcoming changes

7.1 Recent and Forthcoming changes documentation for users

We have introduced two new resources for users to enable us to communicate
effectively between releases about what is new in the UniProt Knowledgebase
and what is planned for the future.
These are available at:
http://www.uniprot.org/support/documents.shtml

7.2 TrEMBL enhancements

This release of TrEMBL has been produced from a new relational
database system. This new system enables the biweekly synchronization of
UniProt/TrEMBL with its source EMBL/DDBJ/GenBank nucleotide sequence
databases. It has also facilitated the integration of various bioinformatic
tools to enhance the UniProt/TrEMBL annotation. As a result, this release of
the database differs significantly in annotation from previous
releases, and we are committed to further raising the annotation standards.
We welcome feedback from the user community.

  

TrEMBL release 26.0

Published March 2, 2004

                              TrEMBL Release Notes
                              Release 26, March 2004


EMBL Outstation
European Bioinformatics Institute (EBI)
Wellcome Trust Genome Campus
Hinxton
Cambridge CB10 1SD
United Kingdom

Telephone: (+44 1223) 494 444
Fax: (+44 1223) 494 468
Electronic mail address: datalib@ebi.ac.uk/swissprot@ebi.ac.uk
WWW server: http://www.ebi.ac.uk/

Swiss Institute of Bioinformatics (SIB)
Centre Medical Universitaire
1, rue Michel Servet
1211 Geneva 4
Switzerland

Telephone: (+41 22) 702 50 50
Fax: (+41 22) 702 58 58
Electronic mail address: Amos.Bairoch@isb-sib.ch
WWW server: http://www.expasy.org/


Acknowledgements

UniProtKB/TrEMBL has been prepared by:

o Maria Jesus Martin, Claire O'Donovan, Nicola Althorpe, Rolf
Apweiler, Daniel Barrell, Kirsty Bates, Paul Browne, Kirill
Degtyarenko, Ruth Eberhardt, Gill Fraser, Alexander Fedetov, Andre
Hackmann, Alexander Kanapin, Youla Karavidopoulou, Paul Kersey, Ernst
Kretschmann, Kati Laiho, Minna Lehvaslaiho, Michele Magrane, Michelle
McHale, Virginie Mittard, Nicola Mulder, John F. O'Rourke, Markiyan
Oliynyk, Sandra Orchard, Astrid Rakow, Sandra van den Broek, Eleanor
Whitfield and Allyson Williams at the EMBL Outstation - European
Bioinformatics Institute (EBI) in Hinxton, UK;
o Amos Bairoch, Alexandre Gattiker, Karine Michoud, Isabelle Phan and
Sandrine Pilbout at the Swiss Institute of Bioinformatics in Geneva,
Switzerland.

Copyright Notice
TrEMBL copyright (c) 2004 EMBL-EBI
This manual and the database it accompanies may be copied and
redistributed freely, without advance permission, provided
that this copyright statement is reproduced with each copy.

Citation

If you want to cite TrEMBL in a publication please use
the following reference:

Apweiler R., Bairoch A., Wu C.H., Barker W.C., Boeckmann B., Ferro
S., Gasteiger E., Huang H., Lopez R., Magrane M., Martin M.J.,
Natale D.A., O'Donovan C., Redaschi N. and Yeh L.L.
UniProt: the Universal Protein knowledgebase
Nucleic Acids Res. 32: D115-D119 (2004)

1. Introduction


UniProtKB/TrEMBL is a computer-annotated protein sequence database
complementing the Swiss-Prot Protein Knowledgebase. UniProtKB/TrEMBL contains
the translations of all coding sequences (CDS) present in the
EMBL/GenBank/DDBJ Nucleotide Sequence Databases and also protein
sequences extracted from the literature or submitted to Swiss-Prot,
which are not yet integrated into UniProtKB/Swiss-Prot. For all
UniProtKB/TrEMBL entries, UniProtKB/Swiss-Prot accession numbers have
been assigned.

2. Why a complement to UniProtKB/Swiss-Prot?

The ongoing gene sequencing and mapping projects have dramatically
increased the number of protein sequences to be incorporated into
UniProtKB/Swiss-Prot. We do not want to dilute the quality standards of UniProtKB/Swiss-Prot by incorporating sequences without proper sequence analysis and
annotation but we do want to make the sequences available as quickly as
possible. UniProtKB/TrEMBL achieves this second goal, and is a major step in the
process of speeding up subsequent upgrading of annotation to the standard
UniProtKB/Swiss-Prot quality. To address the problem of redundancy, the translations
of all coding sequences (CDS) in the EMBL Nucleotide Sequence Database
already included in Swiss-Prot have been removed from UniProtKB/TrEMBL.


3. The Release

This UniProtKB/TrEMBL release has been produced in synch with Swiss-Prot
release 43. It was created from the EMBL Nucleotide Sequence Database
release 77 and updates until the 26-January-2004 and contains 1'069'649
entries and 335'331'748 amino acids.

UniProtKB/TrEMBL is organized in subsections:

arc.dat (Archaea): 1850 entries
arp.dat (Complete Archaeal proteomes): 30436 entries
fun.dat (Fungi): 26545 entries
hum.dat (Human): 33713 entries
inv.dat (Invertebrates): 116930 entries
mam.dat (Other Mammals): 14217 entries
mhc.dat (MHC proteins): 9401 entries
org.dat (Organelles): 87676 entries
phg.dat (Bacteriophages): 9349 entries
pln.dat (Plants): 95077 entries
pro.dat (Prokaryotes): 94217 entries
prp.dat (Complete Prokaryotic Proteomes): 293602 entries
rod.dat (Rodents): 42282 entries
unc.dat (Unclassified): 456 entries
vrl.dat (Viruses): 96447 entries
vrt.dat (Other Vertebrates): 20986 entries
vrv.dat (Retroviruses): 96465 entries

67'168 new entries have been integrated in UniProtKB/TrEMBL. The sequences of
2443 UniProtKB/TrEMBL entries have been updated and the annotation has been
updated in 441'151 entries.
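
The per-subsection counts listed above can be cross-checked against the
release total of 1'069'649 entries. The short Python sketch below does
this. The table text is copied from this section, while the function
names and the regular expression are illustrative assumptions.

```python
import re

SUBSECTIONS = """\
arc.dat (Archaea): 1850 entries
arp.dat (Complete Archaeal proteomes): 30436 entries
fun.dat (Fungi): 26545 entries
hum.dat (Human): 33713 entries
inv.dat (Invertebrates): 116930 entries
mam.dat (Other Mammals): 14217 entries
mhc.dat (MHC proteins): 9401 entries
org.dat (Organelles): 87676 entries
phg.dat (Bacteriophages): 9349 entries
pln.dat (Plants): 95077 entries
pro.dat (Prokaryotes): 94217 entries
prp.dat (Complete Prokaryotic Proteomes): 293602 entries
rod.dat (Rodents): 42282 entries
unc.dat (Unclassified): 456 entries
vrl.dat (Viruses): 96447 entries
vrt.dat (Other Vertebrates): 20986 entries
vrv.dat (Retroviruses): 96465 entries
"""

# One table row: file name, description in parentheses, entry count.
ROW = re.compile(r"^(\w+\.dat) \((.+)\):\s+(\d+) entries$")


def parse_subsections(text):
    """Return a dict mapping each subsection file name to its entry count."""
    counts = {}
    for line in text.splitlines():
        match = ROW.match(line.strip())
        if match:
            counts[match.group(1)] = int(match.group(3))
    return counts


def parse_grouped_number(text):
    """Convert an apostrophe-grouped number such as 1'069'649 to an int."""
    return int(text.replace("'", ""))


counts = parse_subsections(SUBSECTIONS)
total = sum(counts.values())
print(total, total == parse_grouped_number("1'069'649"))
```

The same check can be applied to the tables of other releases by swapping
in their rows and stated totals.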

In the document deleteac.txt, you will find a list of all accession numbers
which were previously present in UniProtKB/TrEMBL, but which have now been deleted from
the database.


4. Format Differences Between UniProtKB/Swiss-Prot and UniProtKB/TrEMBL

The format and conventions used by UniProtKB/TrEMBL follow as closely as possible
those of UniProtKB/Swiss-Prot. Hence, it is not necessary to produce an additional
user manual and extensive release notes for UniProtKB/TrEMBL. The information given
in the UniProtKB/Swiss-Prot release notes and user manual is in general valid for
UniProtKB/TrEMBL. The differences are mentioned below.

The general structure of an entry is identical in UniProtKB/Swiss-Prot and UniProtKB/TrEMBL. The data class used in UniProtKB/TrEMBL (in the ID line) is always 'PRELIMINARY', whereas in UniProtKB/Swiss-Prot it is always 'STANDARD'.

Differences in line types present in UniProtKB/Swiss-Prot and TrEMBL:

The ID line (IDentification):

The entry name used in UniProtKB/TrEMBL is the same as the Accession Number
of the entry.


The DT line (DaTe)

The format of the DT lines, which indicate when an entry was
created and updated, is identical to that defined in UniProtKB/Swiss-Prot, but the
DT lines in UniProtKB/TrEMBL refer to the UniProtKB/TrEMBL release. The difference is
shown in the example below.

DT lines in a UniProtKB/Swiss-Prot entry:

DT 01-JAN-1988 (Rel. 06, Created)
DT 01-JUL-1989 (Rel. 11, Last sequence update)
DT 01-AUG-1992 (Rel. 23, Last annotation update)

DT lines in a UniProtKB/TrEMBL entry:

DT 01-NOV-1996 (TrEMBLrel. 01, Created)
DT 01-NOV-1999 (TrEMBLrel. 12, Last sequence update)
DT 01-MAR-2004 (TrEMBLrel. 26, Last annotation update)

5. Biweekly updates of UniProtKB/TrEMBL and non-redundant data sets

5.1 UniProtKB/TrEMBL updates

Biweekly cumulative updates of TrEMBL are available by anonymous FTP and
from the EBI SRS server.

5.2 UniProtKB

We also produce a complete non-redundant protein sequence collection
biweekly, provided as three compressed files: uniprot.sprot.dat.gz and
uniprot.trembl.dat.gz are in /pub/databases/uniprot/knowledgebase and
uniprot.trembl_new.dat.gz is in /pub/databases/uniprot/knowledgebase/new
on the EBI ftp server.

This set of non-redundant files is especially important for two types of
users:
(i) Managers of similarity search services. They can now provide what is
currently the most comprehensive and non-redundant data set of protein
sequences.
(ii) Anybody wanting to update their full copy of UniProtKB/Swiss-Prot +
UniProtKB/TrEMBL to their own schedule without having to wait for full
releases of UniProtKB/Swiss-Prot or UniProtKB/TrEMBL (UniProtKB).

5.3 XML

A version of UniProtKB/Swiss-Prot and UniProtKB/TrEMBL in XML format has been developed and is provided with this release. More information is available at
http://www.ebi.uniprot.org/support/documents.shtml and the data can be
downloaded from
ftp://ftp.ebi.ac.uk/pub/databases/uniprot/knowledgebase

We would welcome any feedback from the user community.

5.4 Varsplic Expand

We also provide Varsplic Expand, which is a program to generate
"expanded" sequences from Swiss-Prot and TrEMBL records, i.e. sequences
including the variants specified by the varsplic, variant and conflict
annotations. New records are produced in either pseudo-Swiss-Prot or
FASTA format for each specified variant. More information and the data are
available at ftp://ftp.ebi.ac.uk/pub/databases/uniprot/knowledgebase

6. Access/Data Distribution

FTP server: ftp.ebi.ac.uk/pub/databases/trembl
SRS server: http://srs.ebi.ac.uk/

UniProtKB/TrEMBL is also available on the UniProtKB/Swiss-Prot CD-ROM.
UniProtKB/Swiss-Prot + UniProtKB/TrEMBL is searchable on the following servers at the EBI:

FASTA3 (http://www.ebi.ac.uk/fasta33/)
BLAST2 (http://www.ebi.ac.uk/blast2/)
Scanps (http://www.ebi.ac.uk/scanps/)
MPSrch (http://www.ebi.ac.uk/MPsrch/)

For each UniProtKB/TrEMBL release, a synchronized version of the concurrent UniProtKB/Swiss-Prot release is distributed at ftp.ebi.ac.uk/pub/databases/trembl/swissprot/


7. General announcements and Forthcoming changes

7.1 What's new

We have introduced new resources for users to enable us to communicate
effectively between releases about what is new in UniProtKB/Swiss-Prot and UniProtKB/TrEMBL and what is planned for the future.
These are available at:
http://us.expasy.org/sprot/relnotes/sp_news.html
http://us.expasy.org/sprot/relnotes/sp_soon.html
  

TrEMBL release 25.0

Published October 2, 2003

                              TrEMBL Release Notes
                              Release 25, October 2003

    EMBL Outstation
    European Bioinformatics Institute (EBI)
    Wellcome Trust Genome Campus
    Hinxton
    Cambridge CB10 1SD
    United Kingdom

    Telephone: (+44 1223) 494 444
    Fax: (+44 1223) 494 468
    Electronic mail address: datalib@ebi.ac.uk/swissprot@ebi.ac.uk
    WWW server: http://www.ebi.ac.uk/

    Swiss Institute of Bioinformatics (SIB)
    Centre Medical Universitaire
    1, rue Michel Servet
    1211 Geneva 4
    Switzerland

    Telephone: (+41 22) 702 50 50
    Fax: (+41 22) 702 58 58
    Electronic mail address: Amos.Bairoch@isb-sib.ch
    WWW server: http://www.expasy.org/


    Acknowledgements

    TrEMBL has been prepared by:

    o  Claire O'Donovan, Maria Jesus Martin, Nicola Althorpe, Rolf
       Apweiler, Daniel Barrell, Kirsty Bates, Paul Browne, Kirill
       Degtyarenko, Ruth Eberhardt, Gill Fraser, Alexander Fedetov, Andre
       Hackmann, Alexander Kanapin, Youla Karavidopoulou, Paul Kersey, Ernst
       Kretschmann, Kati Laiho, Minna Lehvaslaiho, Michele Magrane, Michelle
       McHale, Virginie Mittard, Nicola Mulder, John F. O'Rourke, Markiyan
       Oliynyk, Sandra Orchard, Astrid Rakow, Sandra van den Broek, Eleanor
       Whitfield and Allyson Williams at the EMBL Outstation - European
       Bioinformatics Institute (EBI) in Hinxton, UK;
    o  Amos Bairoch, Alexandre Gattiker, Karine Michoud, Isabelle Phan and
       Sandrine Pilbout at the Swiss Institute of Bioinformatics in Geneva,
       Switzerland.

    Copyright Notice
    TrEMBL copyright (c) 2003 EMBL-EBI
    This manual and the database it accompanies may be copied and
    redistributed freely, without advance permission, provided
    that this copyright statement is reproduced with each copy.

    Citation

    If you want to cite TrEMBL in a publication please use
    the following reference:

    Boeckmann B., Bairoch A., Apweiler R., Blatter M., Estreicher A.,
    Gasteiger E., Martin M.J., Michoud K., O'Donovan C., Phan I.,
    Pilbout S., and Schneider M. (2003)
    The Swiss-Prot protein knowledgebase and its supplement TrEMBL in
    2003.
    Nucleic Acids Res. 31:365-370.


                         1. Introduction


TrEMBL is a computer-annotated protein sequence database
complementing the Swiss-Prot Protein Knowledgebase. TrEMBL contains
the translations of all coding sequences (CDS) present in the
EMBL/GenBank/DDBJ Nucleotide Sequence Databases and also protein
sequences extracted from the literature or submitted to Swiss-Prot,
which are not yet integrated into Swiss-Prot. For all TrEMBL entries
which should finally be upgraded to the standard Swiss-Prot quality,
Swiss-Prot accession numbers have been assigned.

                        2. Why a complement to Swiss-Prot?

The ongoing gene sequencing and mapping projects have dramatically
increased the number of protein sequences to be incorporated into
Swiss-Prot. We do not want to dilute the quality standards of Swiss-Prot
by incorporating sequences without proper sequence analysis and
annotation, but we do want to make the sequences available as quickly as
possible. TrEMBL achieves this second goal, and is a major step in the
process of speeding up subsequent upgrading of annotation to the standard
Swiss-Prot quality. To address the problem of redundancy, the translations
of all coding sequences (CDS) in the EMBL Nucleotide Sequence Database
already included in Swiss-Prot have been removed from TrEMBL.


                        3. The Release

This TrEMBL release has been produced in synch with Swiss-Prot release 42.
It was created from the EMBL Nucleotide Sequence Database release 75 and
updates until the 20-August-2003 and contains 1'117'376 entries and
332'901'127 amino acids.

TrEMBL is split into two main sections, SP-TrEMBL and REM-TrEMBL.
SP-TrEMBL (Swiss-Prot TrEMBL) contains the entries (1'017'041) which should
eventually be incorporated into Swiss-Prot. Swiss-Prot accession numbers
have been assigned for all SP-TrEMBL entries.

SP-TrEMBL is organized in subsections:

arc.dat (Archaea):                          1878 entries
arp.dat (Complete Archaeal proteomes):     30990 entries
fun.dat (Fungi):                           18453 entries
hum.dat (Human):                           35244 entries
inv.dat (Invertebrates):                   87444 entries
mam.dat (Other Mammals):                   14170 entries
mhc.dat (MHC proteins):                     9375 entries
org.dat (Organelles):                      87849 entries
phg.dat (Bacteriophages):                   9385 entries
pln.dat (Plants):                          96577 entries
pro.dat (Prokaryotes):                     90825 entries
prp.dat (Complete Prokaryotic Proteomes): 277839 entries
rod.dat (Rodents):                         43265 entries
unc.dat (Unclassified):                      453 entries
vrl.dat (Viruses):                         95980 entries
vrt.dat (Other Vertebrates):               20862 entries
vrv.dat (Retroviruses):                    96452 entries

83'033 new entries have been integrated in SP-TrEMBL. The sequences of
1046 SP-TrEMBL entries have been updated and the annotation has been
updated in 447'504 entries.

In the document deleteac.txt, you will find a list of all accession numbers
which were previously present in TrEMBL, but which have now been deleted from
the database.

REM-TrEMBL (REMaining TrEMBL) contains the entries (100'335) that we do
not want to include in Swiss-Prot. REM-TrEMBL entries do not have Swiss-Prot
accession numbers. Instead the stable ID portion of the protein_id present
in the source EMBL/DDBJ/GenBank nucleotide sequence database entries is
used as the ID and accession number. This section is organized in six
subsections:

   1) Immunoglobulins and T-cell receptors (Immuno.dat)
      Most REM-TrEMBL entries are immunoglobulins and T-cell receptors. We
      stopped entering immunoglobulins and T-cell receptors into Swiss-Prot
      because we only want to keep the germline gene-derived translations
      of these proteins in Swiss-Prot, not all known somatically recombined
      variants. We would like to create a specialized database dealing
      with these sequences as a further supplement to Swiss-Prot and keep
      only a representative cross-section of these proteins in Swiss-Prot.

   2) Synthetic sequences (Synth.dat)
      Another category of data that will not be included in Swiss-Prot is
      synthetic sequences. Again, we do not want to leave these entries in
      TrEMBL. Ideally one should build a specialized database for artificial
      sequences as a further supplement to Swiss-Prot.

   3) Patent application sequences (Patent.dat)
      A third subsection consists of coding sequences captured from patent
      applications. A thorough survey of these entries has shown that,
      apart from a rather small minority (which in most cases have already
      been integrated in Swiss-Prot), most of these sequences contain either
      erroneous data or concern artificially generated sequences outside the
      scope of Swiss-Prot.

   4) Small fragments (Smalls.dat)
      Another subsection consists of fragments with fewer than eight amino
      acids.

   5) CDS not coding for real proteins (Pseudo.dat)
      This subsection consists of CDS translations where we have strong
      evidence to believe that these CDS are not coding for real proteins.

   6) Truncated proteins (Truncated.dat)
      The last subsection consists of truncated proteins which result from
      events like mutations introducing a stop codon leading to the truncation
      of the protein product.

                4. Format Differences Between Swiss-Prot and TrEMBL

The format and conventions used by TrEMBL follow as closely as possible
those of Swiss-Prot. Hence, it is not necessary to produce an additional
user manual and extensive release notes for TrEMBL. The information given
in the Swiss-Prot release notes and user manual is in general valid for
TrEMBL. The differences are mentioned below.

The general structure of an entry is identical in Swiss-Prot and TrEMBL.
The data class used in TrEMBL (in the ID line) is always 'PRELIMINARY',
whereas in Swiss-Prot it is always 'STANDARD'.

Differences in line types present in Swiss-Prot and TrEMBL:

The ID line (IDentification):

The entry name used in SP-TrEMBL is the same as the Accession Number of the
entry. The entry name used in REM-TrEMBL is the stable part of the protein_id
tagged to the corresponding CDS in the EMBL Nucleotide Sequence Database.
'protein_id' stands for the "Protein Identification" number. It is a number
that you will find in the feature table of the EMBL nucleotide sequence
entries in a qualifier called "/protein_id" which is tagged to every CDS.

Example:

FT   CDS             339..1514
FT                   /codon_start=1
FT                   /db_xref="PID:g1256015"
FT                   /product="dystrobrevin-epsilon"
FT                   /protein_id="AAC50431.1"

The protein_id is defined as follows in the DDBJ/EMBL/GenBank Feature Table
Definition documentation:
Qualifier          /protein_id
Definition         Protein Identifier, issued by International collaborators.
                   This qualifier consists of a stable ID portion (3+5 format
                   with 3 position letters and 5 numbers) plus a version
                   number after the decimal point.

Value format       <identifier>
Example            /protein_id="AAA12345.1"
Comment            When the protein sequence encoded by the CDS changes, only
                   the version number of the /protein_id value is incremented.
                   The stable part of the /protein_id remains unchanged and
                   as a result will permanently be associated with a given
                   protein. This qualifier is valid only on CDS features
                   which translate into a valid protein.
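
The definition above can be turned into a small validation routine. The
Python sketch below splits a /protein_id value into its stable ID portion
(used as the REM-TrEMBL entry name) and its version number. The pattern
encodes only the "3+5" format quoted above, and the function name is a
hypothetical choice.

```python
import re

# Stable ID: 3 letters followed by 5 digits, then '.' and a version number.
PROTEIN_ID = re.compile(r"([A-Z]{3}\d{5})\.(\d+)")


def split_protein_id(value):
    """Return (stable_id, version) for a /protein_id value like 'AAC50431.1'."""
    match = PROTEIN_ID.fullmatch(value)
    if match is None:
        raise ValueError("not a 3+5 protein_id with version: %r" % value)
    return match.group(1), int(match.group(2))


stable_id, version = split_protein_id("AAC50431.1")
print(stable_id, version)
```

When the protein sequence changes, only the second element of the
returned pair is expected to differ; the first stays associated with the
same protein.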


The DT line (DaTe)

The format of the DT lines, which indicate when an entry was
created and updated, is identical to that defined in Swiss-Prot, but the
DT lines in TrEMBL refer to the TrEMBL release. The difference is
shown in the example below.

    DT lines in a Swiss-Prot entry:

    DT   01-JAN-1988 (Rel. 06, Created)
    DT   01-JUL-1989 (Rel. 11, Last sequence update)
    DT   01-AUG-1992 (Rel. 23, Last annotation update)

    DT lines in a TrEMBL entry:

    DT   01-NOV-1996 (TrEMBLrel. 01, Created)
    DT   01-NOV-1999 (TrEMBLrel. 12, Last sequence update)
    DT   28-FEB-2003 (TrEMBLrel. 23, Last annotation update)

                5. Weekly updates of TrEMBL and non-redundant data sets

5.1 TrEMBL updates

Weekly cumulative updates of TrEMBL are available by anonymous FTP and
from the EBI SRS server.

5.2 SPTr

We also produce every week a complete non-redundant protein sequence
collection by providing three compressed files (these are in the directory
/pub/databases/sp_tr_nrdb on the EBI FTP server and in databases/sp_tr_nrdb
on the ExPASy server): sprot.dat.gz, trembl.dat.gz and trembl_new.dat.gz.
This set of non-redundant files is especially important for two types of
users:
(i) Managers of similarity search services. They can now provide what is
currently the most comprehensive and non-redundant data set of protein
sequences.
(ii) Anybody wanting to update their full copy of Swiss-Prot + TrEMBL to
their own schedule without having to wait for full releases of Swiss-Prot
or TrEMBL.

5.3 XML

A version of Swiss-Prot and TrEMBL in XML format has been developed and is
provided with this release. More information is available at
http://www.ebi.ac.uk/swissprot/SP-ML and the data can be downloaded
from ftp://ftp.ebi.ac.uk/pub/databases/trembl/xml and
ftp://ftp.ebi.ac.uk/pub/databases/sp_tr_nrdb/xml

We would welcome any feedback from the user community.

5.4 Varsplic Expand

We also provide Varsplic Expand which is a program to generate
"expanded" sequences from Swiss-Prot and TrEMBL records i.e. sequences
including the variants specified by the varsplic, variant and conflict
annotations. New records are produced in either pseudo-Swiss-Prot or
FASTA format for each specified variant. More information and the data are
available at ftp://ftp.ebi.ac.uk/pub/databases/sp_tr_nrdb/

            6. Access/Data Distribution

FTP server:     ftp.ebi.ac.uk/pub/databases/trembl
SRS server:     http://srs.ebi.ac.uk/

TrEMBL is also available on the Swiss-Prot CD-ROM.
Swiss-Prot + TrEMBL is searchable on the following servers at the EBI:

FASTA3  (http://www.ebi.ac.uk/fasta33/)
BLAST2  (http://www.ebi.ac.uk/blast2/)
Scanps  (http://www.ebi.ac.uk/scanps/)
MPSrch  (http://www.ebi.ac.uk/MPsrch/)

For each TrEMBL release, a synchronized version of the concurrent Swiss-Prot
release is distributed at ftp.ebi.ac.uk/pub/databases/trembl/swissprot/


                  7. Forthcoming changes

7.1 Evidence tags

We are continuing with the introduction of evidence tags to Swiss-Prot and
TrEMBL entries. The aim of this is to allow users to see where data items
came from and to enable Swiss-Prot staff to automatically update data if
the underlying evidence changes. This is ongoing internally and the evidence
tags are visible in the XML version of Swiss-Prot and TrEMBL.
For more information, please see
http://www.ebi.ac.uk/swissprot/SP-ML/evidence.html
We would welcome any feedback from the user community.

7.2 Conversion of TrEMBL to mixed case

Most of the DE (DEscription), GN (Gene Name), RC (Reference Comment)
and CC (Comment) lines have been converted to mixed case internally. The
conversion is ongoing and will be made public as the conversion of each
line type reaches a satisfactory stage. A mixed case version of the DE
line was made public in release 22 and the RC line in release 23.


7.3  Extension of the entry name format

Currently TrEMBL has its accession number as the entry name. It is
intended to extend this so that the accession number becomes the protein
name component of the entry name (the mnemonic code having been elongated
from 4 characters to 6) and to assign the mnemonic species identification
code of at most 5 alphanumeric characters, as Swiss-Prot currently does.
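
To make the planned format concrete, the Python sketch below assembles an
entry name from an accession number and a species code under the limits
stated above (at most 6 and at most 5 alphanumeric characters). Joining
the two parts with an underscore follows the existing Swiss-Prot entry
name convention and is an assumption here, as are the function name and
the sample accession Q9XYZ1.

```python
import re

# Length limits taken from the paragraph above; upper case is assumed.
PROTEIN_PART = re.compile(r"[A-Z0-9]{1,6}")
SPECIES_PART = re.compile(r"[A-Z0-9]{1,5}")


def build_entry_name(accession, species_code):
    """Combine an accession number and a species code into an entry name."""
    if PROTEIN_PART.fullmatch(accession) is None:
        raise ValueError("protein name component too long or invalid: %r"
                         % accession)
    if SPECIES_PART.fullmatch(species_code) is None:
        raise ValueError("species code too long or invalid: %r"
                         % species_code)
    # Underscore separator assumed, as in Swiss-Prot entry names.
    return accession + "_" + species_code


print(build_entry_name("Q9XYZ1", "HUMAN"))
```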


                   8. General announcements

8.1 REM-TrEMBL

This is the last release of REM-TrEMBL. It has been replaced by our new
database resource UniParc (http://www.ebi.ac.uk/uniparc). We will continue
to produce releases of SP-TrEMBL quarterly.

8.2 What's new?

We have introduced a new resource for users to enable us to communicate
effectively between releases about what is new in Swiss-Prot and TrEMBL.
This is available at: http://us.expasy.org/sprot/relnotes/sp_news.html
  

TrEMBL release 24.0

Published June 1, 2003

                              TrEMBL Release Notes
                              Release 24, June 2003


EMBL Outstation
European Bioinformatics Institute (EBI)
Wellcome Trust Genome Campus
Hinxton
Cambridge CB10 1SD
United Kingdom

Telephone: (+44 1223) 494 444
Fax: (+44 1223) 494 468
Electronic mail address: datalib@ebi.ac.uk/swissprot@ebi.ac.uk
WWW server: http://www.ebi.ac.uk/

Swiss Institute of Bioinformatics (SIB)
Centre Medical Universitaire
1, rue Michel Servet
1211 Geneva 4
Switzerland

Telephone: (+41 22) 702 50 50
Fax: (+41 22) 702 58 58
Electronic mail address: Amos.Bairoch@isb-sib.ch
WWW server: http://www.expasy.org/


Acknowledgements

UniProtKB/TrEMBL has been prepared by:

o Maria Jesus Martin, Claire O'Donovan, Nicola Althorpe, Rolf
Apweiler, Daniel Barrell, Kirsty Bates, Paul Browne, Kirill
Degtyarenko, Ruth Eberhardt, Gill Fraser, Alexander Fedetov, Andre
Hackmann, Alexander Kanapin, Youla Karavidopoulou, Paul Kersey, Ernst
Kretschmann, Kati Laiho, Minna Lehvaslaiho, Michele Magrane, Michelle
McHale, Virginie Mittard, Nicola Mulder, John F. O'Rourke, Markiyan
Oliynyk, Sandra Orchard, Astrid Rakow, Sandra van den Broek, Eleanor
Whitfield and Allyson Williams at the EMBL Outstation - European
Bioinformatics Institute (EBI) in Hinxton, UK;
o Amos Bairoch, Alexandre Gattiker, Karine Michoud, Isabelle Phan and
Sandrine Pilbout at the Swiss Institute of Bioinformatics in Geneva,
Switzerland.

Copyright Notice
TrEMBL copyright (c) 2003 EMBL-EBI
This manual and the database it accompanies may be copied and
redistributed freely, without advance permission, provided
that this copyright statement is reproduced with each copy.

Citation

If you want to cite TrEMBL in a publication please use
the following reference:

Boeckmann B., Bairoch A., Apweiler R., Blatter M., Estreicher A.,
Gasteiger E., Martin M.J., Michoud K., O'Donovan C., Phan I.,
Pilbout S., and Schneider M. (2003)
The Swiss-Prot protein knowledgebase and its supplement TrEMBL in
2003.
Nucleic Acids Res. 31:365-370.


1. Introduction


UniProtKB/TrEMBL is a computer-annotated protein sequence database
complementing the UniProtKB/Swiss-Prot Protein Knowledgebase.
UniProtKB/TrEMBL contains the translations of all coding sequences (CDS)
present in the EMBL/GenBank/DDBJ Nucleotide Sequence Databases and also
protein sequences extracted from the literature or submitted to
Swiss-Prot, which are not yet integrated into UniProtKB/Swiss-Prot. For
all UniProtKB/TrEMBL entries which should finally be upgraded to the
standard UniProtKB/Swiss-Prot quality, UniProtKB/Swiss-Prot accession
numbers have been assigned.

2. Why a complement to UniProtKB/Swiss-Prot?

The ongoing gene sequencing and mapping projects have dramatically
increased the number of protein sequences to be incorporated into
UniProtKB/Swiss-Prot. We do not want to dilute the quality standards of UniProtKB/Swiss-Prot by incorporating sequences without proper sequence analysis and
annotation, but we do want to make the sequences available as quickly as
possible. UniProtKB/TrEMBL achieves this second goal, and is a major step in the
process of speeding up subsequent upgrading of annotation to the standard
UniProtKB/Swiss-Prot quality. To address the problem of redundancy, the translations
of all coding sequences (CDS) in the EMBL Nucleotide Sequence Database
already included in Swiss-Prot have been removed from UniProtKB/TrEMBL.


3. The Release

This UniProtKB/TrEMBL release has been produced in synch with
UniProtKB/Swiss-Prot release 41.13. It was created from the EMBL
Nucleotide Sequence Database release 74 and updates up to 16-May-2003,
and contains 1'043'240 entries and 310'062'472 amino acids.

TrEMBL is split in two main sections: SP-TrEMBL and REM-TrEMBL:
SP-TrEMBL (Swiss-Prot TrEMBL) contains the entries (944'868) which should
be eventually incorporated into UniProtKB/Swiss-Prot. UniProtKB/Swiss-Prot accession numbers have been assigned for all SP-TrEMBL entries.

SP-TrEMBL is organized in subsections:

arc.dat (Archaea): 1796 entries
arp.dat (Complete Archaeal proteomes): 31367 entries
fun.dat (Fungi): 17715 entries
hum.dat (Human): 35721 entries
inv.dat (Invertebrates): 84696 entries
mam.dat (Other Mammals): 13529 entries
mhc.dat (MHC proteins): 9252 entries
org.dat (Organelles): 81685 entries
phg.dat (Bacteriophages): 8188 entries
pln.dat (Plants): 86647 entries
pro.dat (Prokaryotes): 86373 entries
prp.dat (Complete Prokaryotic Proteomes): 245103 entries
rod.dat (Rodents): 42386 entries
unc.dat (Unclassified): 341 entries
vrl.dat (Viruses): 89781 entries
vrt.dat (Other Vertebrates): 18607 entries
vrv.dat (Retroviruses): 91681 entries
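
As an illustration, the subsection counts listed above can be checked
against the SP-TrEMBL total quoted in this section (944'868). The script
below is a minimal sketch and is not part of the TrEMBL distribution; the
function name parse_subsections and the regular expression are assumptions
made for this example only.

```python
import re

# Subsection listing copied from the release 24 notes.
LISTING = """\
arc.dat (Archaea): 1796 entries
arp.dat (Complete Archaeal proteomes): 31367 entries
fun.dat (Fungi): 17715 entries
hum.dat (Human): 35721 entries
inv.dat (Invertebrates): 84696 entries
mam.dat (Other Mammals): 13529 entries
mhc.dat (MHC proteins): 9252 entries
org.dat (Organelles): 81685 entries
phg.dat (Bacteriophages): 8188 entries
pln.dat (Plants): 86647 entries
pro.dat (Prokaryotes): 86373 entries
prp.dat (Complete Prokaryotic Proteomes): 245103 entries
rod.dat (Rodents): 42386 entries
unc.dat (Unclassified): 341 entries
vrl.dat (Viruses): 89781 entries
vrt.dat (Other Vertebrates): 18607 entries
vrv.dat (Retroviruses): 91681 entries
"""

# One listing line: file name, label in parentheses, entry count.
LINE_RE = re.compile(r"^(\w+\.dat) \((.+)\):\s+(\d+) entries$")


def parse_subsections(text):
    """Return {file name: entry count} for each listing line."""
    counts = {}
    for line in text.splitlines():
        match = LINE_RE.match(line.strip())
        if match:
            counts[match.group(1)] = int(match.group(3))
    return counts


counts = parse_subsections(LISTING)
total = sum(counts.values())
print(len(counts), total)  # 17 944868
```

The 17 subsections add up to the 944'868 SP-TrEMBL entries stated above.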

124'197 new entries have been integrated in SP-TrEMBL. The sequences of
1509 SP-TrEMBL entries have been updated and the annotation has been
updated in 651'912 entries.

In the document deleteac.txt, you will find a list of all accession numbers
which were previously present in TrEMBL, but which have now been deleted from
the database.

REM-TrEMBL (REMaining TrEMBL) contains the entries (98'372) that we do
not want to include in UniProtKB/Swiss-Prot. REM-TrEMBL entries do not
have UniProtKB/Swiss-Prot accession numbers. Instead, the stable ID
portion of the protein_id present in the source EMBL/DDBJ/GenBank
nucleotide sequence database entries is used as the ID and accession
number. This section is organized in six subsections:

1) Immunoglobulins and T-cell receptors (Immuno.dat)
Most REM-TrEMBL entries are immunoglobulins and T-cell receptors. We
stopped entering immunoglobulins and T-cell receptors into Swiss-Prot,
because we only want to keep the germ line gene derived translations
of these proteins in Swiss-Prot and not all known somatic recombinated
variations of these proteins. We would like to create a specialized
database dealing with these sequences as a further supplement to
Swiss-Prot and keep only a representative cross-section of these
proteins in Swiss-Prot.

2) Synthetic sequences (Synth.dat)
Synthetic sequences are another category of data that will not be
included in UniProtKB/Swiss-Prot. Again, we do not want to leave these
entries in UniProtKB/TrEMBL. Ideally one should build a specialized
database for artificial sequences as a further supplement to Swiss-Prot.

3) Patent application sequences (Patent.dat)
A third subsection consists of coding sequences captured from patent
applications. A thorough survey of these entries has shown that
apart from a rather small minority (which in most cases have already
been integrated in UniProtKB/Swiss-Prot), most of these sequences contain either
erroneous data or concern artificially generated sequences outside the
scope of UniProtKB/Swiss-Prot.

4) Small fragments (Smalls.dat)
Another subsection consists of fragments with fewer than eight amino
acids.

5) CDS not coding for real proteins (Pseudo.dat)
This subsection consists of CDS translations where we have strong
evidence to believe that these CDS are not coding for real proteins.

6) Truncated proteins (Truncated.dat)
The last subsection consists of truncated proteins which result from
events like mutations introducing a stop codon leading to the truncation
of the protein product.

4. Format Differences Between UniProtKB/Swiss-Prot and UniProtKB/TrEMBL

The format and conventions used by UniProtKB/TrEMBL follow as closely as
possible those of UniProtKB/Swiss-Prot. Hence, it is not necessary to
produce an additional user manual and extensive release notes for
UniProtKB/TrEMBL. The information given in the UniProtKB/Swiss-Prot
release notes and user manual is in general valid for UniProtKB/TrEMBL.
The differences are mentioned below.

The general structure of an entry is identical in UniProtKB/Swiss-Prot and UniProtKB/TrEMBL. The data class used in TrEMBL (in the ID line) is always 'PRELIMINARY', whereas in UniProtKB/Swiss-Prot it is always 'STANDARD'.

Differences in line types present in UniProtKB/Swiss-Prot and UniProtKB/TrEMBL:

The ID line (IDentification):

The entry name used in SP-TrEMBL is the same as the Accession Number of the
entry. The entry name used in REM-TrEMBL is the stable part of the protein_id
tagged to the corresponding CDS in the EMBL Nucleotide Sequence Database.
'protein_id' stands for the "Protein Identification" number. It is a number
that you will find in the feature table of the EMBL nucleotide sequence
entries in a qualifier called "/protein_id" which is tagged to every CDS.

Example:

FT   CDS             339..1514
FT                   /codon_start=1
FT                   /db_xref="PID:g1256015"
FT                   /product="dystrobrevin-epsilon"
FT                   /protein_id="AAC50431.1"

The protein_id is defined as follows in the DDBJ/EMBL/GenBank Feature Table
Definition documentation:

Qualifier          /protein_id
Definition         Protein Identifier, issued by International collaborators.
                   This qualifier consists of a stable ID portion (3+5 format
                   with 3 position letters and 5 numbers) plus a version
                   number after the decimal point.

Value format       <identifier>
Example            /protein_id="AAA12345.1"
Comment            When the protein sequence encoded by the CDS changes, only
                   the version number of the /protein_id value is incremented.
                   The stable part of the /protein_id remains unchanged and
                   as a result will permanently be associated with a given
                   protein. This qualifier is valid only on CDS features
                   which translate into a valid protein.
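
To make the split between the stable ID portion and the version number
concrete, here is a minimal sketch that separates the two parts of a
/protein_id value. It assumes the "3+5" layout quoted above (three letters,
five digits, a decimal point, then the version); the function name
split_protein_id is hypothetical and is not part of any EMBL or Swiss-Prot
tool.

```python
import re

# "3+5" layout: three letters, five digits, then ".<version>".
PROTEIN_ID_RE = re.compile(r"^([A-Z]{3}\d{5})\.(\d+)$")


def split_protein_id(protein_id):
    """Split a /protein_id value into (stable_id, version)."""
    match = PROTEIN_ID_RE.match(protein_id)
    if match is None:
        raise ValueError(f"not a 3+5 protein_id: {protein_id!r}")
    return match.group(1), int(match.group(2))


# The stable part is what REM-TrEMBL uses as ID and accession number.
stable_id, version = split_protein_id("AAC50431.1")
print(stable_id, version)  # AAC50431 1
```

Only the version changes when the encoded protein sequence changes, so the
stable part can serve as a permanent identifier.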


The DT line (DaTe)

The format of the DT lines that serve to indicate when an entry was
created and updated is identical to that defined in Swiss-Prot; but the
DT lines in UniProtKB/TrEMBL refer to the UniProtKB/TrEMBL release. The
difference is shown in the example below.

DT lines in a UniProtKB/Swiss-Prot entry:

DT   01-JAN-1988 (Rel. 06, Created)
DT   01-JUL-1989 (Rel. 11, Last sequence update)
DT   01-AUG-1992 (Rel. 23, Last annotation update)

DT lines in a TrEMBL entry:

DT   01-NOV-1996 (TrEMBLrel. 01, Created)
DT   01-NOV-1999 (TrEMBLrel. 12, Last sequence update)
DT   28-FEB-2003 (TrEMBLrel. 23, Last annotation update)
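
A parser that reads both databases has to accept either release prefix.
The sketch below shows one way to do this, assuming the DT lines look
exactly like the examples above; the function name parse_dt_line and the
returned tuple layout are choices made for this illustration, not an
official API.

```python
import re
from datetime import date

# Month abbreviations as they appear in DT lines (locale-independent lookup).
MONTHS = {name: i for i, name in enumerate(
    "JAN FEB MAR APR MAY JUN JUL AUG SEP OCT NOV DEC".split(), start=1)}

# "Rel." marks a Swiss-Prot release, "TrEMBLrel." a TrEMBL release.
DT_RE = re.compile(
    r"^DT\s+(\d{2})-([A-Z]{3})-(\d{4}) \((TrEMBLrel|Rel)\. (\d+), (.+)\)$"
)


def parse_dt_line(line):
    """Return (database, release, date, event) for one DT line."""
    match = DT_RE.match(line.strip())
    if match is None:
        raise ValueError(f"unrecognised DT line: {line!r}")
    day, month, year, prefix, release, event = match.groups()
    database = "TrEMBL" if prefix == "TrEMBLrel" else "Swiss-Prot"
    return database, int(release), date(int(year), MONTHS[month], int(day)), event


print(parse_dt_line("DT   01-JAN-1988 (Rel. 06, Created)"))
print(parse_dt_line("DT   28-FEB-2003 (TrEMBLrel. 23, Last annotation update)"))
```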

5. Weekly updates of UniProtKB/TrEMBL and non-redundant data sets

5.1 TrEMBL updates

Weekly cumulative updates of UniProtKB/TrEMBL are available by anonymous FTP and
from the EBI SRS server.

5.2 SPTr

We also produce every week a complete non-redundant protein sequence
collection by providing three compressed files (these are in the directory
/pub/databases/sp_tr_nrdb on the EBI FTP server and in databases/sp_tr_nrdb
on the ExPASy server): sprot.dat.gz, trembl.dat.gz and trembl_new.dat.gz.
This set of non-redundant files is especially important for two types of
users:
(i) Managers of similarity search services. They can now provide what is
currently the most comprehensive and non-redundant data set of protein
sequences.
(ii) Anybody wanting to update their full copy of Swiss-Prot + TrEMBL to
their own schedule without having to wait for full releases of Swiss-Prot
or TrEMBL.

5.3 XML

A version of UniProtKB/Swiss-Prot and UniProtKB/TrEMBL in XML format has been developed and is provided with this release. More information is available at
http://www.ebi.ac.uk/swissprot/SP-ML and the data can be downloaded
from ftp://ftp.ebi.ac.uk/pub/databases/trembl/xml and
ftp://ftp.ebi.ac.uk/pub/databases/sp_tr_nrdb/xml

We would welcome any feedback from the user community.

5.4 Varsplic Expand

We also provide Varsplic Expand, a program that generates "expanded"
sequences from UniProtKB/Swiss-Prot and UniProtKB/TrEMBL records, i.e.
sequences including the variants specified by the varsplic, variant and
conflict annotations. New records are produced in either pseudo-Swiss-Prot
or FASTA format for each specified variant. More information and the data
are available at ftp://ftp.ebi.ac.uk/pub/databases/sp_tr_nrdb/

6. Access/Data Distribution

FTP server: ftp.ebi.ac.uk/pub/databases/trembl
SRS server: http://srs.ebi.ac.uk/

UniProtKB/TrEMBL is also available on the UniProtKB/Swiss-Prot CD-ROM.
UniProtKB/Swiss-Prot + UniProtKB/TrEMBL is searchable on the following servers at the EBI:

FASTA3 (http://www.ebi.ac.uk/fasta33/)
BLAST2 (http://www.ebi.ac.uk/blast2/)
Scanps (http://www.ebi.ac.uk/scanps/)
MPSrch (http://www.ebi.ac.uk/MPsrch/)

For each UniProtKB/TrEMBL release, a synchronized version of the concurrent UniProtKB/Swiss-Prot release is distributed at ftp.ebi.ac.uk/pub/databases/trembl/swissprot/

7. Description of changes made to TrEMBL since release 23.

7.1 Changes concerning cross-references (DR line)

We have added cross-references from UniProtKB/TrEMBL to three new databases.

7.1.1 GK

We have added cross-references to the Genome Knowledgebase (GK)
(available at http://www.genomeknowledgebase.org)

7.1.2 GO

We have added cross-references to the Gene Ontology database (GO)
(available at http://www.geneontology.org).

7.1.3 PIRSF

We have added cross-references to PIR superfamilies, an integrated
protein classification database
(available at http://pir.georgetown.edu/iproclass)


8. Forthcoming changes

8.1 Evidence tags

We are continuing with the introduction of evidence tags to UniProtKB/Swiss-Prot and
UniProtKB/TrEMBL entries. The aim of this is to allow users to see where data items
came from and to enable Swiss-Prot staff to automatically update data if
the underlying evidence changes. This is ongoing internally and the evidence
tags are visible in the XML version of UniProtKB/Swiss-Prot and UniProtKB/TrEMBL.
For more information, please see
http://www.ebi.ac.uk/swissprot/SP-ML/evidence.html
We would welcome any feedback from the user community.

8.2 Conversion of TrEMBL to mixed case

Most of the DE (DEscription), GN (Gene Name), RC (Reference Comment)
and CC (Comment) lines have been converted to mixed case internally. The
conversion is ongoing and will be made public as the conversion of each
line type reaches a satisfactory stage. A mixed case version of the DE
line was made public in release 22 and the RC line in release 23.

8.3 Version of SPTr in XML format

We intend to provide an XML version of SPTr, updated monthly.

8.4 New format of comment line (CC) topics

We are continuing a major overhaul of various comment line topics. We would
like the majority of the information stored to be usable by computer
programs (while remaining human-readable). We are therefore standardizing
the format of the topics.
'ALTERNATIVE PRODUCTS' was the first topic to be modified.

For more information, please see
http://us.expasy.org/sprot/relnotes/sp_news.html

8.5 Modifications concerning the feature table (FT line)

We are investing a major effort in the annotation of posttranslational
modifications, which has an effect on various feature keys and feature
descriptions.
In this release, the feature key 'CROSSLNK' has been
introduced to describe bonds between amino acids, which are formed
posttranslationally within a peptide or between peptides, such as isopeptidic
bonds, carbon-carbon linkages, carbon-nitrogen linkages, thioether bonds,
thiolester bonds, and backbone condensations.

For more information, please see
http://us.expasy.org/sprot/relnotes/sp_news.html

8.6 Extension of the entry name format

Currently TrEMBL has its accession number as the entry name. It is
intended to extend this to have the accession number as the protein
name component of the entry name (having elongated the mnemonic code
from 4 characters to 6) and to assign the mnemonic species identification
code of at most 5 alphanumeric characters as Swiss-Prot currently does.
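
The intended entry name can be sketched as follows. This is an
illustration only: it assumes the two components are joined with an
underscore, as in existing Swiss-Prot entry names, and the function name
make_entry_name, the accession Q12345 and the species code HUMAN are
hypothetical examples rather than values taken from this release.

```python
import re


def make_entry_name(accession, species_code):
    """Build an entry name from an accession number and a species code."""
    # Protein name component: the accession number, at most 6 characters.
    if not re.fullmatch(r"[A-Z0-9]{1,6}", accession):
        raise ValueError(f"bad accession: {accession!r}")
    # Species identification code: at most 5 alphanumeric characters.
    if not re.fullmatch(r"[A-Z0-9]{1,5}", species_code):
        raise ValueError(f"bad species code: {species_code!r}")
    # Assumption: underscore separator, as in Swiss-Prot entry names.
    return f"{accession}_{species_code}"


print(make_entry_name("Q12345", "HUMAN"))  # Q12345_HUMAN
```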


9. General announcements

9.1 REM-TrEMBL

It has been decided to discontinue the production of REM-TrEMBL. The next
release will be the last one. This is because REM-TrEMBL has been fully
integrated into our new database resource UniParc which will be made
public in the near future. We will continue to produce releases of
SP-TrEMBL quarterly.

9.2 What's new?

We have introduced a new resource for users to enable us to communicate
effectively between releases about what is new in UniProtKB/Swiss-Prot and UniProtKB/TrEMBL.
This is available at: http://us.expasy.org/sprot/relnotes/sp_news.html

  

TrEMBL release 23.0

Published March 1, 2003

                              TrEMBL Release Notes
                              Release 23, March 2003

    EMBL Outstation
    European Bioinformatics Institute (EBI)
    Wellcome Trust Genome Campus
    Hinxton
    Cambridge CB10 1SD
    United Kingdom

    Telephone: (+44 1223) 494 444
    Fax: (+44 1223) 494 468
    Electronic mail address: datalib@ebi.ac.uk
    WWW server: http://www.ebi.ac.uk/

    Swiss Institute of Bioinformatics (SIB)
    Centre Medical Universitaire
    1, rue Michel Servet
    1211 Geneva 4
    Switzerland

    Telephone: (+41 22) 702 50 50
    Fax: (+41 22) 702 58 58
    Electronic mail address: Amos.Bairoch@isb-sib.ch
    WWW server: http://www.expasy.org/


    Acknowledgements

    TrEMBL has been prepared by:

    o  Maria Jesus Martin, Claire O'Donovan, Nicola Althorpe, Rolf
       Apweiler, Daniel Barrell, Kirsty Bates, Paul Browne,
       Kirill Degtyarenko, Gill Fraser, Alexander Fedetov, Andre Hackmann,
       Alexander Kanapin, Youla Karavidopoulou, Paul Kersey, Ernst
       Kretschmann, Kati Laiho, Minna Lehvaslaiho, Michele Magrane, Michelle
       McHale, Virginie Mittard, Nicola Mulder, John F. O'Rourke, Sandra
       Orchard, Astrid Rakow, Sandra van den Broek, Eleanor Whitfield and
       Allyson Williams at the EMBL Outstation - European Bioinformatics
       Institute (EBI) in Hinxton, UK;
    o  Amos Bairoch, Alexandre Gattiker, Karine Michoud, Isabelle Phan and
       Sandrine Pilbout at the Swiss Institute of Bioinformatics in Geneva,
       Switzerland.

    Copyright Notice
    TrEMBL copyright (c) 2003 EMBL-EBI
    This manual and the database it accompanies may be copied and
    redistributed freely, without advance permission, provided
    that this copyright statement is reproduced with each copy.

    Citation

    If you want to cite TrEMBL in a publication please use
    the following reference:

    Boeckmann B., Bairoch A., Apweiler R., Blatter M., Estreicher A.,
    Gasteiger E., Martin M.J., Michoud K., O'Donovan C., Phan I.,
    Pilbout S., and Schneider M. (2003)
    The Swiss-Prot protein knowledgebase and its supplement TrEMBL in
    2003.
    Nucleic Acids Res. 31:365-370.


                         1. Introduction


TrEMBL is a computer-annotated protein sequence database
complementing the Swiss-Prot Protein Knowledgebase. TrEMBL contains
the translations of all coding sequences (CDS) present in the
EMBL/GenBank/DDBJ Nucleotide Sequence Databases and also protein
sequences extracted from the literature or submitted to Swiss-Prot,
which are not yet integrated into Swiss-Prot. For all TrEMBL entries
which should finally be upgraded to the standard Swiss-Prot quality,
Swiss-Prot accession numbers have been assigned.

                        2. Why a complement to Swiss-Prot?

The ongoing gene sequencing and mapping projects have dramatically
increased the number of protein sequences to be incorporated into
Swiss-Prot. We do not want to dilute the quality standards of Swiss-Prot
by incorporating sequences without proper sequence analysis and
annotation, but we do want to make the sequences available as quickly as
possible. TrEMBL achieves this second goal, and is a major step in the
process of speeding up subsequent upgrading of annotation to the standard
Swiss-Prot quality. To address the problem of redundancy, the translations
of all coding sequences (CDS) in the EMBL Nucleotide Sequence Database
already included in Swiss-Prot have been removed from TrEMBL.


                        3. The Release

This TrEMBL release has been produced in synch with Swiss-Prot release 41. It
was created from the EMBL Nucleotide Sequence Database release 73 and contains
921'952 entries and 40'914'860 amino acids.

TrEMBL is split in two main sections: SP-TrEMBL and REM-TrEMBL:
SP-TrEMBL (Swiss-Prot TrEMBL) contains the entries (830'525) which should
be eventually incorporated into Swiss-Prot. Swiss-Prot accession numbers
have been assigned for all SP-TrEMBL entries.

SP-TrEMBL is organized in subsections:

arc.dat (Archaea):                          1736 entries
arp.dat (Complete Archaeal proteomes):     31625 entries
fun.dat (Fungi):                           15977 entries
hum.dat (Human):                           34880 entries
inv.dat (Invertebrates):                   79680 entries
mam.dat (Other Mammals):                   12223 entries
mhc.dat (MHC proteins):                     8813 entries
org.dat (Organelles):                      73538 entries
phg.dat (Bacteriophages):                   6448 entries
pln.dat (Plants):                          80929 entries
pro.dat (Prokaryotes):                     79736 entries
prp.dat (Complete Prokaryotic Proteomes): 181432 entries
rod.dat (Rodents):                         40143 entries
unc.dat (Unclassified):                      331 entries
vrl.dat (Viruses):                         82490 entries
vrt.dat (Other Vertebrates):               14889 entries
vrv.dat (Retroviruses):                    85655 entries

107'123 new entries have been integrated in SP-TrEMBL. The sequences of
1713 SP-TrEMBL entries have been updated and the annotation has been
updated in 252'549 entries.

In the document deleteac.txt, you will find a list of all accession numbers
which were previously present in TrEMBL, but which have now been deleted from
the database.

REM-TrEMBL (REMaining TrEMBL) contains the entries (91'427) that we do
not want to include in Swiss-Prot. REM-TrEMBL entries do not have Swiss-Prot
accession numbers. Instead the stable ID portion of the protein_id present
in the source EMBL/DDBJ/GenBank nucleotide sequence database entries is
used as the ID and accession number. This section is organized in six
subsections:

   1) Immunoglobulins and T-cell receptors (Immuno.dat)
      Most REM-TrEMBL entries are immunoglobulins and T-cell receptors. We
      stopped entering immunoglobulins and T-cell receptors into Swiss-Prot,
      because we only want to keep the germ line gene derived translations
      of these proteins in Swiss-Prot and not all known somatic recombinated
      variations of these proteins. We would like to create a specialized
      database dealing with these sequences as a further supplement to
      Swiss-Prot and keep only a representative cross-section of these
      proteins in Swiss-Prot.

   2) Synthetic sequences (Synth.dat)
      Synthetic sequences are another category of data that will not be
      included in Swiss-Prot. Again, we do not want to leave these entries
      in TrEMBL. Ideally one should build a specialized database for
      artificial sequences as a further supplement to Swiss-Prot.

   3) Patent application sequences (Patent.dat)
      A third subsection consists of coding sequences captured from patent
      applications. A thorough survey of these entries has shown that
      apart from a rather small minority (which in most cases have already
      been integrated in Swiss-Prot), most of these sequences contain either
      erroneous data or concern artificially generated sequences outside the
      scope of Swiss-Prot.

   4) Small fragments (Smalls.dat)
      Another subsection consists of fragments with fewer than eight amino
      acids.

   5) CDS not coding for real proteins (Pseudo.dat)
      This subsection consists of CDS translations where we have strong
      evidence to believe that these CDS are not coding for real proteins.

   6) Truncated proteins (Truncated.dat)
      The last subsection consists of truncated proteins which result from
      events like mutations introducing a stop codon leading to the truncation
      of the protein product.

                4. Format Differences Between Swiss-Prot and TrEMBL

The format and conventions used by TrEMBL follow as closely as possible
those of Swiss-Prot. Hence, it is not necessary to produce an additional
user manual and extensive release notes for TrEMBL. The information given
in the Swiss-Prot release notes and user manual is in general valid for
TrEMBL. The differences are mentioned below.

The general structure of an entry is identical in Swiss-Prot and TrEMBL.
The data class used in TrEMBL (in the ID line) is always 'PRELIMINARY',
whereas in Swiss-Prot it is always 'STANDARD'.

Differences in line types present in Swiss-Prot and TrEMBL:

The ID line (IDentification):

The entry name used in SP-TrEMBL is the same as the Accession Number of the
entry. The entry name used in REM-TrEMBL is the stable part of the protein_id
tagged to the corresponding CDS in the EMBL Nucleotide Sequence Database.
'protein_id' stands for the "Protein Identification" number. It is a number
that you will find in the feature table of the EMBL nucleotide sequence
entries in a qualifier called "/protein_id" which is tagged to every CDS.

Example:

FT   CDS             339..1514
FT                   /codon_start=1
FT                   /db_xref="PID:g1256015"
FT                   /product="dystrobrevin-epsilon"
FT                   /protein_id="AAC50431.1"

The protein_id is defined as follows in the DDBJ/EMBL/GenBank Feature Table
Definition documentation:
Qualifier          /protein_id
Definition         Protein Identifier, issued by International collaborators.
                   This qualifier consists of a stable ID portion (3+5 format
                   with 3 position letters and 5 numbers) plus a version
                   number after the decimal point.

Value format       <identifier>
Example            /protein_id="AAA12345.1"
Comment            When the protein sequence encoded by the CDS changes, only
                   the version number of the /protein_id value is incremented.
                   The stable part of the /protein_id remains unchanged and
                   as a result will permanently be associated with a given
                   protein. This qualifier is valid only on CDS features
                   which translate into a valid protein.



The DT line (DaTe)

The format of the DT lines that serve to indicate when an entry was
created and updated is identical to that defined in Swiss-Prot; but the
DT lines in TrEMBL refer to the TrEMBL release. The difference is
shown in the example below.

    DT lines in a Swiss-Prot entry:

    DT   01-JAN-1988 (Rel. 06, Created)
    DT   01-JUL-1989 (Rel. 11, Last sequence update)
    DT   01-AUG-1992 (Rel. 23, Last annotation update)

    DT lines in a TrEMBL entry:

    DT   01-NOV-1996 (TrEMBLrel. 01, Created)
    DT   01-NOV-1999 (TrEMBLrel. 12, Last sequence update)
    DT   28-FEB-2003 (TrEMBLrel. 23, Last annotation update)

                5. Weekly updates of TrEMBL and non-redundant data sets

5.1 TrEMBL updates

Weekly cumulative updates of TrEMBL are available by anonymous FTP and
from the EBI SRS server.

5.2 SPTr

We also produce every week a complete non-redundant protein sequence
collection by providing three compressed files (these are in the directory
/pub/databases/sp_tr_nrdb on the EBI FTP server and in databases/sp_tr_nrdb
on the ExPASy server): sprot.dat.gz, trembl.dat.gz and trembl_new.dat.gz.
This set of non-redundant files is especially important for two types of
users:
(i) Managers of similarity search services. They can now provide what is
currently the most comprehensive and non-redundant data set of protein
sequences.
(ii) Anybody wanting to update their full copy of Swiss-Prot + TrEMBL to
their own schedule without having to wait for full releases of Swiss-Prot
or TrEMBL.

5.3 XML

A version of Swiss-Prot and TrEMBL in XML format has been developed and is
provided with this release. More information is available at
http://www.ebi.ac.uk/swissprot/SP-ML and the data can be downloaded
from ftp://ftp.ebi.ac.uk/pub/databases/trembl/xml and
ftp://ftp.ebi.ac.uk/pub/databases/sp_tr_nrdb/xml

We would welcome any feedback from the user community.

5.4 Varsplic Expand

We also provide Varsplic Expand, a program that generates
"expanded" sequences from Swiss-Prot and TrEMBL records, i.e. sequences
including the variants specified by the varsplic, variant and conflict
annotations. New records are produced in either pseudo-Swiss-Prot or
FASTA format for each specified variant. More information and the data are
available at ftp://ftp.ebi.ac.uk/pub/databases/sp_tr_nrdb/

            6. Access/Data Distribution

FTP server:     ftp.ebi.ac.uk/pub/databases/trembl
SRS server:     http://srs.ebi.ac.uk/

TrEMBL is also available on the Swiss-Prot CD-ROM.
Swiss-Prot + TrEMBL is searchable on the following servers at the EBI:

FASTA3  (http://www.ebi.ac.uk/fasta33/)
BLAST2  (http://www.ebi.ac.uk/blast2/)
Scanps  (http://www.ebi.ac.uk/scanps/)
MPSrch  (http://www.ebi.ac.uk/MPsrch/)

For each TrEMBL release, a synchronized version of the concurrent Swiss-Prot
release is distributed at ftp.ebi.ac.uk/pub/databases/trembl/swissprot/

    7. Description of changes made to TrEMBL since release 22.

7.1 Changes concerning cross-references (DR line)

  We have added cross-references from TrEMBL to a number of new databases.

  7.1.1 Schizosaccharomyces pombe GeneDB Prototype

  We have added cross-references to the Schizosaccharomyces pombe GeneDB
  Prototype (available at http://www.genedb.org/genedb/pombe/index.jsp)

  7.1.2 Genew

  We have added cross-references to the Human Gene Nomenclature Database Genew
  (available at http://www.gene.ucl.ac.uk/nomenclature/searchgenes.pl).


7.2 Changes concerning the Organelle (OG) line

The term 'Nucleomorph' has been added. A nucleomorph is the residual
nucleus of an algal endosymbiont that resides inside its host cell.


                      8. Planned changes

8.1 Evidence tags

We are continuing with the introduction of evidence tags to Swiss-Prot and
TrEMBL entries. The aim of this is to allow users to see where data items
came from and to enable Swiss-Prot staff to automatically update data if
the underlying evidence changes. This is ongoing internally and the evidence
tags are visible in the XML version of Swiss-Prot and TrEMBL.
For more information, please see
http://www.ebi.ac.uk/swissprot/SP-ML/evidence.html
We would welcome any feedback from the user community.

8.2 Conversion of TrEMBL to mixed case

Most of the DE (DEscription), GN (Gene Name), RC (Reference Comment)
and CC (Comment) lines have been converted to mixed case internally. The
conversion is ongoing and will be made public as the conversion of each
line type reaches a satisfactory stage. A mixed case version of the DE
line was made public in release 21. The RC line is mixed case in this
release.

8.3 Version of SPTr in XML format

We intend to provide an XML version of SPTr, updated monthly.

8.4 Reference Comment (RC) line topics may span lines

The RC (Reference Comment) line stores comments relevant to the reference
cited, currently under five distinct topics: PLASMID, SPECIES, STRAIN,
TISSUE and TRANSPOSON. It is not always possible to list all information
within one line. Therefore we will allow multiple RC lines, so that a
single topic may span more than one line. Example:

RC   STRAIN=Various strains;

could become

RC   STRAIN=AZ.026, DC.005, GA.039, GA2181, IL.014, IN.018, KY.172, KY2.37,
RC   LA.013, MN.001, MNb027, MS.040, NY.016, OH.036, TN.173, TN2.38,
RC   UT.002, AL.012, AZ.180, MI.035, VA.015, and IL2.17;
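
Parsers that currently read one RC line at a time will need to merge
continuation lines before splitting on topics. The following is a minimal
sketch of that step; it assumes that topics are terminated by a semicolon
and that topic values themselves contain no semicolon, and the function
name parse_rc_lines is hypothetical.

```python
def parse_rc_lines(lines):
    """Merge consecutive RC lines and return a {topic: value} dict."""
    # Strip the two-letter line code, then join continuation lines
    # with a single space so a wrapped topic becomes one string again.
    text = " ".join(line[2:].strip() for line in lines if line.startswith("RC"))
    topics = {}
    for item in text.split(";"):
        item = item.strip()
        if item:
            topic, _, value = item.partition("=")
            topics[topic] = value
    return topics


rc_lines = [
    "RC   STRAIN=AZ.026, DC.005, GA.039, GA2181, IL.014, IN.018, KY.172, KY2.37,",
    "RC   LA.013, MN.001, MNb027, MS.040, NY.016, OH.036, TN.173, TN2.38,",
    "RC   UT.002, AL.012, AZ.180, MI.035, VA.015, and IL2.17;",
]
topics = parse_rc_lines(rc_lines)
print(topics["STRAIN"])
```

The three RC lines above yield a single STRAIN topic whose value is the
complete strain list.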

8.5  New format of comment line (CC) topics

We are continuing a major overhaul of various comment line topics. We would
like the majority of the information stored to be usable by computer
programs (while remaining human-readable). We are therefore standardizing
the format of the topics. Please see Swiss-Prot release 41 relnotes for
further details.

8.6 Modifications concerning the feature table (FT line)

We are investing a major effort in the annotation of posttranslational
modifications, which has an effect on various feature keys and feature
descriptions. Please see Swiss-Prot release 41 relnotes for further details.

8.7  Extension of the entry name format

Currently TrEMBL has its accession number as the entry name. It is
intended to extend this to have the accession number as the protein
name component of the entry name (having elongated the mnemonic code
from 4 characters to 6) and to assign the mnemonic species identification
code of at most 5 alphanumeric characters as Swiss-Prot currently does.

  

TrEMBL release 22.0

Published October 1, 2002

                              TrEMBL Release Notes
                              Release 22, October 2002

    EMBL Outstation
    European Bioinformatics Institute (EBI)
    Wellcome Trust Genome Campus
    Hinxton
    Cambridge CB10 1SD
    United Kingdom

    Telephone: (+44 1223) 494 444
    Fax: (+44 1223) 494 468
    Electronic mail address: DATALIB@EBI.AC.UK
    WWW server: http://www.ebi.ac.uk/

    Swiss Institute of Bioinformatics (SIB)
    Centre Medical Universitaire
    1, rue Michel Servet
    1211 Geneva 4
    Switzerland

    Telephone: (+41 22) 702 50 50
    Fax: (+41 22) 702 58 58
    Electronic mail address: Amos.Bairoch@isb-sib.ch
    WWW server: http://www.expasy.org/


    Acknowledgements

    TrEMBL has been prepared by:

    o  Philippe Aldebert, Nicola Althorpe, Rolf Apweiler, Daniel Barrell,
       Kirsty Bates, Paul Browne, Kirill Degtyarenko,
       Gill Fraser, Alexander Fedetov, Andre Hackmann, Henning Hermjakob,
       Alexander Kanapin, Youla Karavidopoulou, Paul Kersey, Ernst Kretschmann,
       Kati Laiho, Minna Lehvaslaiho, Michele Magrane, Maria Jesus Martin,
       Michelle McHale, Virginie Mittard, Nicola Mulder, Claire O'Donovan,
       John F. O'Rourke, Sandra Orchard, Astrid Rakow, Sandra van den Broek,
       Eleanor Whitfield and Allyson Williams at the EMBL Outstation -
       European Bioinformatics Institute (EBI) in Hinxton, UK;
    o  Amos Bairoch, Alexandre Gattiker, Isabelle Phan and Sandrine Pilbout
       at the Swiss Institute of Bioinformatics in Geneva, Switzerland.

    Copyright Notice
    TrEMBL copyright (c) 2002 EMBL-EBI
    This manual and the database it accompanies may be copied and
    redistributed freely, without advance permission, provided
    that this copyright statement is reproduced with each copy.

    Citation

    If you  want to  cite  TrEMBL  in  a  publication  please  use  the
    following reference:


              Bairoch A., and Apweiler R.
              The SWISS-PROT protein sequence data bank and its supplement
              TrEMBL in 2000.
              Nucl. Acids Res. 28:45-48(2000).


                         1. Introduction


TrEMBL is a computer-annotated protein sequence database supplementing the
SWISS-PROT Protein Knowledgebase. TrEMBL contains the translations of
all coding sequences (CDS) present in the EMBL/GenBank/DDBJ Nucleotide
Sequence Databases and also protein sequences extracted from the literature
or submitted to SWISS-PROT, which are not yet integrated into SWISS-PROT.
TrEMBL can be considered as a preliminary section of SWISS-PROT. For all
TrEMBL entries which should finally be upgraded to the standard SWISS-PROT
quality, SWISS-PROT accession numbers have been assigned.

                        2. Why a supplement to SWISS-PROT?

The ongoing gene sequencing and mapping projects have dramatically
increased the number of protein sequences to be incorporated into SWISS-PROT.
We do not want to dilute the quality standards of SWISS-PROT by incorporating
sequences without proper sequence analysis and annotation, but we do want to
make the sequences available as quickly as possible. TrEMBL achieves this
second goal, and is a major step in the process of speeding up subsequent
upgrading of annotation to the standard SWISS-PROT quality.
To address the problem of redundancy, the translations of all coding
sequences (CDS) in the EMBL Nucleotide Sequence Database already included
in SWISS-PROT have been removed from TrEMBL.


                             3. The Release

This TrEMBL release was created from the EMBL Nucleotide Sequence Database
release 72 and contains 821'014 entries and 36'790'365 amino acids. To
minimize redundancy, the translations of all coding sequences (CDS) in the
EMBL Nucleotide Sequence Database already included in SWISS-PROT release 40
and updates until 25.10.02 have been removed from TrEMBL release 22.

TrEMBL is split in two main sections: SP-TrEMBL and REM-TrEMBL:
SP-TrEMBL (SWISS-PROT TrEMBL) contains the entries (734'427) which should be
eventually incorporated into SWISS-PROT. SWISS-PROT accession numbers have
been assigned for all SP-TrEMBL entries.

SP-TrEMBL is organized in subsections:

arc.dat (Archaea):                          1694 entries
arp.dat (Complete Archaeal proteomes):     32840 entries
fun.dat (Fungi):                           19843 entries
hum.dat (Human):                           39753 entries
inv.dat (Invertebrates):                   84525 entries
mam.dat (Other Mammals):                   11880 entries
mhc.dat (MHC proteins):                     8701 entries
org.dat (Organelles):                      89635 entries
phg.dat (Bacteriophages):                   6585 entries
pln.dat (Plants):                          98105 entries
pro.dat (Prokaryotes):                     86915 entries
prp.dat (Complete Prokaryotic Proteomes): 161638 entries
rod.dat (Rodents):                         32982 entries
unc.dat (Unclassified):                      149 entries
vrl.dat (Viruses):                         85797 entries
vrt.dat (Other Vertebrates):               14095 entries
vrv.dat (Retroviruses):                    82256 entries

72'120 new entries have been integrated in SP-TrEMBL. The sequences of
4357 SP-TrEMBL entries have been updated and the annotation has been
updated in 334'435 entries.

In the document deleteac.txt, you will find a list of all accession numbers
which were previously present in TrEMBL, but which have now been deleted from
the database.

REM-TrEMBL (REMaining TrEMBL) contains the entries (86'587) that we do
not want to include in SWISS-PROT. REM-TrEMBL entries do not have SWISS-PROT
accession numbers. Instead the stable ID portion of the protein_id present
in the source EMBL/DDBJ/GenBank nucleotide sequence database entries is
used as the ID and accession number. This section is organized in six
subsections:

   1) Immunoglobulins and T-cell receptors (Immuno.dat)
      Most REM-TrEMBL entries are  immunoglobulins and  T-cell receptors. We
      stopped entering immunoglobulins and T-cell receptors into SWISS-PROT,
      because we only want to keep  the  germ line gene derived translations
      of these proteins in SWISS-PROT and not all known somatic recombinated
      variations of  these proteins.  We would like to  create a specialized
      database  dealing  with  these  sequences  as a further  supplement to
      SWISS-PROT  and  keep  only a  representative  cross-section of  these
      proteins in SWISS-PROT.

   2) Synthetic sequences (Synth.dat)
      Synthetic sequences are another category of data that will not be
      included in SWISS-PROT. Again, we do not want to leave these entries in
      TrEMBL. Ideally one should build a specialized database for artificial
      sequences as a further supplement to SWISS-PROT.

   3) Patent application sequences (Patent.dat)
      A third  subsection consists of  coding sequences captured from patent
      applications.  A thorough  survey of  these  entries  has shown  that
      apart from a rather small minority  (which in  most cases have already
      been integrated in SWISS-PROT), most of these sequences contain either
      erroneous data or concern artificially generated sequences outside the
      scope of SWISS-PROT.

   4) Small fragments (Smalls.dat)
      Another subsection consists of fragments with fewer than eight amino
      acids.

   5) CDS not coding for real proteins (Pseudo.dat)
      This subsection consists of CDS translations where we have strong
      evidence to believe that these CDS are not coding for real proteins.

   6) Truncated proteins (Truncated.dat)
      The last subsection consists of truncated proteins which result from
      events like mutations introducing a stop codon leading to the truncation
      of the protein product.
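
The entry counts quoted in this section can be cross-checked with a few lines
of Python. This is only a sketch: the helper name parse_count is hypothetical,
and the three figures are copied from the text above.

```python
def parse_count(text):
    """Convert a count written with apostrophe separators, e.g. 821'014."""
    return int(text.replace("'", ""))


# Figures quoted in section 3 of these release notes.
total_entries = parse_count("821'014")
sp_trembl = parse_count("734'427")
rem_trembl = parse_count("86'587")

# The two main sections should account for every entry in the release.
difference = total_entries - (sp_trembl + rem_trembl)
print(difference)  # prints 0
```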

                4. Format Differences Between SWISS-PROT and TrEMBL

The format and conventions used by TrEMBL follow as closely as possible
those of SWISS-PROT. Hence, it is not necessary to produce an additional
user manual and extensive release notes for TrEMBL. The information given
in the SWISS-PROT release notes and user manual is in general valid for
TrEMBL. The differences are mentioned below.

The general structure of an entry is identical in SWISS-PROT and TrEMBL.
The data class used in TrEMBL (in the ID line) is always 'PRELIMINARY',
whereas in SWISS-PROT it is always 'STANDARD'.

Differences in line types present in SWISS-PROT and TrEMBL:

The ID line (IDentification):

The entry name used in SP-TrEMBL is the same as the Accession Number of the
entry. The entry name used in REM-TrEMBL is the stable part of the protein_id
tagged to the corresponding CDS in the EMBL Nucleotide Sequence Database.
'protein_id' stands for the "Protein Identification" number. It is a number
that you will find in the feature table of the EMBL nucleotide sequence
entries in a qualifier called "/protein_id" which is tagged to every CDS.

Example:

FT   CDS             339..1514
FT                   /codon_start=1
FT                   /db_xref="PID:g1256015"
FT                   /product="dystrobrevin-epsilon"
FT                   /protein_id="AAC50431.1"

The protein_id is defined as follows in the DDBJ/EMBL/GenBank Feature Table
Definition documentation:
Qualifier          /protein_id
Definition         Protein Identifier, issued by International collaborators.
                   This qualifier consists of a stable ID portion (3+5 format
                   with 3 position letters and 5 numbers) plus a version
                   number after the decimal point.

Value format       <identifier>
Example            /protein_id="AAA12345.1"
Comment            When the protein sequence encoded by the CDS changes, only
                   the version number of the /protein_id value is incremented.
                   The stable part of the /protein_id remains unchanged and
                   as a result will permanently be associated with a given
                   protein. This qualifier is valid only on CDS features
                   which translate into a valid protein.
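
Since REM-TrEMBL uses the stable part of the protein_id as entry name, it can
be useful to separate that part from the version number. The Python sketch
below does this with a regular expression based on the "3+5" description
above. The function name split_protein_id is hypothetical.

```python
import re

# Three letters and five digits (the stable ID), a dot, then the version.
PROTEIN_ID = re.compile(r"^([A-Z]{3}\d{5})\.(\d+)$")


def split_protein_id(protein_id):
    """Return (stable_id, version) for a value such as 'AAC50431.1'."""
    m = PROTEIN_ID.match(protein_id)
    if m is None:
        raise ValueError(f"unexpected protein_id format: {protein_id!r}")
    return m.group(1), int(m.group(2))


# Value taken from the CDS feature shown in the example above.
stable_id, version = split_protein_id("AAC50431.1")
```

Under the rule quoted above, a change in the protein sequence would alter
only the version, so the stable part stays usable as an entry name.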


The DT line (DaTe)

The format of the DT lines, which indicate when an entry was created and
updated, is identical to that defined in SWISS-PROT, but the DT lines in
TrEMBL refer to the TrEMBL release. The difference is shown in the example
below.

    DT lines in a SWISS-PROT entry:

    DT   01-JAN-1988 (Rel. 06, Created)
    DT   01-JUL-1989 (Rel. 11, Last sequence update)
    DT   01-AUG-1992 (Rel. 23, Last annotation update)

    DT lines in a TrEMBL entry:

    DT   01-NOV-1996 (TrEMBLrel. 01, Created)
    DT   01-NOV-1996 (TrEMBLrel. 01, Last sequence update)
    DT   01-FEB-1997 (TrEMBLrel. 02, Last annotation update)
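
A reader of both databases can tell the two DT variants apart by the release
prefix. The Python sketch below parses either form. It is based only on the
six example lines above. The function name parse_dt_line and the returned
dictionary keys are hypothetical.

```python
import re
from datetime import date

MONTHS = {
    name: number
    for number, name in enumerate(
        ["JAN", "FEB", "MAR", "APR", "MAY", "JUN",
         "JUL", "AUG", "SEP", "OCT", "NOV", "DEC"],
        start=1,
    )
}

# 'Rel.' marks a SWISS-PROT release, 'TrEMBLrel.' a TrEMBL release.
DT_LINE = re.compile(
    r"^DT\s+(\d{2})-([A-Z]{3})-(\d{4}) \((TrEMBLrel|Rel)\. (\d+), ([^)]+)\)$"
)


def parse_dt_line(line):
    """Parse one DT line into its date, database, release and event."""
    m = DT_LINE.match(line.strip())
    if m is None:
        raise ValueError(f"not a DT line: {line!r}")
    day, month, year, prefix, release, event = m.groups()
    return {
        "date": date(int(year), MONTHS[month], int(day)),
        "database": "TrEMBL" if prefix == "TrEMBLrel" else "SWISS-PROT",
        "release": int(release),
        "event": event,
    }


parsed = parse_dt_line("DT   01-FEB-1997 (TrEMBLrel. 02, Last annotation update)")
```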

                5. Weekly updates of TrEMBL and non-redundant data sets

Weekly cumulative updates of TrEMBL are available by anonymous FTP and
from the EBI SRS server.

We also produce every week a complete non-redundant protein sequence
collection by providing three compressed files (these are in the directory
/pub/databases/sp_tr_nrdb on the EBI FTP server and in databases/sp_tr_nrdb
on the ExPASy server): sprot.dat.gz, trembl.dat.gz and trembl_new.dat.gz.
This set of non-redundant files is especially important for two types of
users:
(i) Managers of similarity search services. They can now provide what is
currently the most comprehensive and non-redundant data set of protein
sequences.
(ii) Anybody wanting to update their full copy of SWISS-PROT + TrEMBL to
their own schedule without having to wait for full releases of SWISS-PROT
or TrEMBL.

We have also introduced Varsplic Expand, a program that generates
"expanded" sequences from SWISS-PROT and TrEMBL records, i.e. sequences
including the variants specified by the varsplic, variant and conflict
annotations. New records are produced in either pseudo-SWISS-PROT or
FASTA format for each specified variant. More information and the data are
available at ftp://ftp.ebi.ac.uk/pub/databases/sp_tr_nrdb/

                6. Access/Data Distribution

FTP server:     ftp.ebi.ac.uk/pub/databases/trembl
SRS server:     http://srs.ebi.ac.uk/

TrEMBL is also available on the SWISS-PROT CD-ROM.
SWISS-PROT + TrEMBL is searchable on the following servers at the EBI:

FASTA3  (http://www.ebi.ac.uk/fasta33/)
BLAST2  (http://www.ebi.ac.uk/blast2/)
Bic_sw  (http://www.ebi.ac.uk/bic_sw/)
Scanps  (http://www.ebi.ac.uk/scanps/)
MPSrch  (http://www.ebi.ac.uk/MPsrch/)

For each TrEMBL release, a synchronized version of the concurrent SWISS-PROT
release is distributed at ftp.ebi.ac.uk/pub/databases/trembl/swissprot/

 7. Planned changes


7.1 Evidence tags:

    We are continuing with the introduction of evidence tags to SWISS-PROT and
    TrEMBL entries. The aim of this is to allow users to see where data items
    came from and to enable SWISS-PROT staff to automatically update data if
    the underlying evidence changes. This is ongoing internally and we hope
    to provide a public version in 2002. For more information,
    please see
    ftp://ftp.ebi.ac.uk/pub/databases/trembl/evidenceDocumentation.html
    We would welcome any feedback from the user community.

7.2 Conversion of TrEMBL to mixed case:

    Most of the DE (DEscription), GN (Gene Name), RC (Reference Comment)
    and CC (Comment) lines have been converted to mixed case internally. The
    conversion is ongoing and will be made public as the conversion of each
    line type reaches a satisfactory stage. A mixed case version of the DE
    line was made public in the last release.

7.3 Version of TrEMBL in XML format:

    A pre-release version of TrEMBL in XML format has been developed and is
    provided with this release of TrEMBL. This will be provided in the future
    for SWISS-PROT as well. More information is available at
    http://www.ebi.ac.uk/swissprot/SP-ML and the data can be downloaded
    from ftp://ftp.ebi.ac.uk/pub/databases/trembl/xml
    We would welcome any feedback from the user community.
  

TrEMBL release 21.0

Published June 1, 2002

                              TrEMBL Release Notes
                              Release 21, June 2002

    EMBL Outstation
    European Bioinformatics Institute (EBI)
    Wellcome Trust Genome Campus
    Hinxton
    Cambridge CB10 1SD
    United Kingdom

    Telephone: (+44 1223) 494 444
    Fax: (+44 1223) 494 468
    Electronic mail address: DATALIB@EBI.AC.UK
    WWW server: http://www.ebi.ac.uk/

    Swiss Institute of Bioinformatics (SIB)
    Centre Medical Universitaire
    1, rue Michel Servet
    1211 Geneva 4
    Switzerland

    Telephone: (+41 22) 702 50 50
    Fax: (+41 22) 702 58 58
    Electronic mail address: Amos.Bairoch@isb-sib.ch
    WWW server: http://www.expasy.org/


    Acknowledgements

    TrEMBL has been prepared by:

    o  Philippe Aldebert, Rolf Apweiler, Daniel Barrell, Kirsty Bates,
       Margaret Biswas, Paul Browne, Sergio Contrino,
       Kirill Degtyarenko, Gill Fraser, Henning Hermjakob, Kati Laiho,
       Alexander Kanapin, Youla Karavidopoulou, Paul Kersey, Minna
       Lehvaslaiho, Michele Magrane, Maria Jesus Martin, Virginie Mittard,
       Nicola Mulder, Claire O'Donovan, John F. O'Rourke, Sandra Orchard,
       Sandra van den Broek, Eleanor Whitfield and Allyson Williams at the
       EMBL Outstation - European Bioinformatics Institute (EBI) in Hinxton,
       UK;
    o  Amos Bairoch, Alain Gateau, Alexandre Gattiker, Isabelle Phan
       and Sandrine Pilbout at the Swiss Institute of Bioinformatics in
       Geneva, Switzerland.

    Copyright Notice
    TrEMBL copyright (c) 2002 EMBL-EBI
    This manual and the database it accompanies may be copied and
    redistributed freely, without advance permission, provided
    that this copyright statement is reproduced with each copy.

    Citation

    If you  want to  cite  TrEMBL  in  a  publication  please  use  the
    following reference:


              Bairoch A., and Apweiler R.
              The SWISS-PROT protein sequence data bank and its supplement
              TrEMBL in 2000.
              Nucl. Acids Res. 28:45-48(2000).


                         1. Introduction


TrEMBL is a computer-annotated protein sequence database supplementing the
SWISS-PROT Protein Knowledgebase. TrEMBL contains the translations of
all coding sequences (CDS) present in the EMBL/GenBank/DDBJ Nucleotide
Sequence Databases and also protein sequences extracted from the literature
or submitted to SWISS-PROT, which are not yet integrated into SWISS-PROT.
TrEMBL can be considered as a preliminary section of SWISS-PROT. For all
TrEMBL entries which should finally be upgraded to the standard SWISS-PROT
quality, SWISS-PROT accession numbers have been assigned.

                        2. Why a supplement to SWISS-PROT?

The ongoing gene sequencing and mapping projects have dramatically
increased the number of protein sequences to be incorporated into SWISS-PROT.
We do not want to dilute the quality standards of SWISS-PROT by incorporating
sequences without proper sequence analysis and annotation, but we do want to
make the sequences available as quickly as possible. TrEMBL achieves this
second goal, and is a major step in the process of speeding up subsequent
upgrading of annotation to the standard SWISS-PROT quality.
To address the problem of redundancy, the translations of all coding
sequences (CDS) in the EMBL Nucleotide Sequence Database already included
in SWISS-PROT have been removed from TrEMBL.


                             3. The Release

This TrEMBL release was created from the EMBL Nucleotide Sequence Database
release 70 and updates until 07.05.02 and contains 751'148 entries
and 218'504'701 amino acids. To minimize redundancy, the translations of all
coding sequences (CDS) in the EMBL Nucleotide Sequence Database already
included in SWISS-PROT release 40 and updates until 21.06.02 have been
removed from TrEMBL release 21.

TrEMBL is split in two main sections: SP-TrEMBL and REM-TrEMBL:
SP-TrEMBL (SWISS-PROT TrEMBL) contains the entries (671'580) which should be
eventually incorporated into SWISS-PROT. SWISS-PROT accession numbers have
been assigned for all SP-TrEMBL entries.

SP-TrEMBL is organized in subsections:

arc.dat (Archaea):                          1644 entries
arp.dat (Complete Archaeal proteomes):     29757 entries
fun.dat (Fungi):                           14606 entries
hum.dat (Human):                           29766 entries
inv.dat (Invertebrates):                   68301 entries
mam.dat (Other Mammals):                   10511 entries
mhc.dat (MHC proteins):                     8069 entries
org.dat (Organelles):                      58906 entries
phg.dat (Bacteriophages):                   5676 entries
pln.dat (Plants):                          67339 entries
pro.dat (Prokaryotes):                     66680 entries
prp.dat (Complete Prokaryotic Proteomes): 123685 entries
rod.dat (Rodents):                         27467 entries
unc.dat (Unclassified):                      143 entries
vrl.dat (Viruses):                         71522 entries
vrt.dat (Other Vertebrates):               12279 entries
vrv.dat (Retroviruses):                    75229 entries

56'042 new entries have been integrated in SP-TrEMBL. The sequences of
810 SP-TrEMBL entries have been updated and the annotation has been
updated in 189'983 entries.

In the document deleteac.txt, you will find a list of all accession numbers
which were previously present in TrEMBL, but which have now been deleted from
the database.

REM-TrEMBL (REMaining TrEMBL) contains the entries (79'568) that we do
not want to include in SWISS-PROT. REM-TrEMBL entries have no accession
numbers. This section is organized in six subsections:

   1) Immunoglobulins and T-cell receptors (Immuno.dat)
      Most REM-TrEMBL entries are  immunoglobulins and  T-cell receptors. We
      stopped entering immunoglobulins and T-cell receptors into SWISS-PROT,
      because we only want to keep  the  germ line gene derived translations
      of these proteins in SWISS-PROT and not all known somatic recombinated
      variations of  these proteins.  We would like to  create a specialized
      database  dealing  with  these  sequences  as a further  supplement to
      SWISS-PROT  and  keep  only a  representative  cross-section of  these
      proteins in SWISS-PROT.

   2) Synthetic sequences (Synth.dat)
      Synthetic sequences are another category of data that will not be
      included in SWISS-PROT. Again, we do not want to leave these entries in
      TrEMBL. Ideally one should build a specialized database for artificial
      sequences as a further supplement to SWISS-PROT.

   3) Patent application sequences (Patent.dat)
      A third  subsection consists of  coding sequences captured from patent
      applications.  A thorough  survey of  these  entries  has shown  that
      apart from a rather small minority  (which in  most cases have already
      been integrated in SWISS-PROT), most of these sequences contain either
      erroneous data or concern artificially generated sequences outside the
      scope of SWISS-PROT.

   4) Small fragments (Smalls.dat)
      Another subsection consists of fragments with fewer than eight amino
      acids.

   5) CDS not coding for real proteins (Pseudo.dat)
      This subsection consists of CDS translations where we have strong
      evidence to believe that these CDS are not coding for real proteins.

   6) Truncated proteins (Truncated.dat)
      The last subsection consists of truncated proteins which result from
      events like mutations introducing a stop codon leading to the truncation
      of the protein product.

                4. Format Differences Between SWISS-PROT and TrEMBL

The format and conventions used by TrEMBL follow as closely as possible
those of SWISS-PROT. Hence, it is not necessary to produce an additional
user manual and extensive release notes for TrEMBL. The information given
in the SWISS-PROT release notes and user manual is in general valid for
TrEMBL. The differences are mentioned below.

The general structure of an entry is identical in SWISS-PROT and TrEMBL.
The data class used in TrEMBL (in the ID line) is always 'PRELIMINARY',
whereas in SWISS-PROT it is always 'STANDARD'.
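
Because the data class is the only structural difference in the ID line, a
program can use it to decide which database an entry comes from. The Python
sketch below assumes an ID line of the form "ID   entry_name   data_class;
molecule_type;   length AA." with whitespace-separated fields. The function
names and the sample line are hypothetical.

```python
def data_class(id_line):
    """Return the data class token (third field) of an ID line."""
    fields = id_line.split()
    if len(fields) < 3 or fields[0] != "ID":
        raise ValueError(f"not an ID line: {id_line!r}")
    return fields[2].rstrip(";")


def source_database(id_line):
    """Map the data class to the database named in these release notes."""
    databases = {"PRELIMINARY": "TrEMBL", "STANDARD": "SWISS-PROT"}
    return databases[data_class(id_line)]


# Hypothetical ID line, shown only to illustrate the assumed layout.
origin = source_database("ID   Q9XYZ1      PRELIMINARY;      PRT;   123 AA.")
```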

Differences in line types present in SWISS-PROT and TrEMBL:

The ID line (IDentification):

The entry name used in SP-TrEMBL is the same as the Accession Number of the
entry. The entry name used in REM-TrEMBL is the stable part of the protein_id
tagged to the corresponding CDS in the EMBL Nucleotide Sequence Database.
'protein_id' stands for the "Protein Identification" number. It is a number
that you will find in the feature table of the EMBL nucleotide sequence
entries in a qualifier called "/protein_id" which is tagged to every CDS.

Example:

FT   CDS             339..1514
FT                   /codon_start=1
FT                   /db_xref="PID:g1256015"
FT                   /product="dystrobrevin-epsilon"
FT                   /protein_id="AAC50431.1"

The protein_id is defined as follows in the DDBJ/EMBL/GenBank Feature Table
Definition documentation:
Qualifier          /protein_id
Definition         Protein Identifier, issued by International collaborators.
                   This qualifier consists of a stable ID portion (3+5 format
                   with 3 position letters and 5 numbers) plus a version
                   number after the decimal point.

Value format       <identifier>
Example            /protein_id="AAA12345.1"
Comment            When the protein sequence encoded by the CDS changes, only
                   the version number of the /protein_id value is incremented.
                   The stable part of the /protein_id remains unchanged and
                   as a result will permanently be associated with a given
                   protein. This qualifier is valid only on CDS features
                   which translate into a valid protein.


The DT line (DaTe)

The format of the DT lines, which indicate when an entry was created and
updated, is identical to that defined in SWISS-PROT, but the DT lines in
TrEMBL refer to the TrEMBL release. The difference is shown in the example
below.

    DT lines in a SWISS-PROT entry:

    DT   01-JAN-1988 (Rel. 06, Created)
    DT   01-JUL-1989 (Rel. 11, Last sequence update)
    DT   01-AUG-1992 (Rel. 23, Last annotation update)

    DT lines in a TrEMBL entry:

    DT   01-NOV-1996 (TrEMBLrel. 01, Created)
    DT   01-NOV-1996 (TrEMBLrel. 01, Last sequence update)
    DT   01-FEB-1997 (TrEMBLrel. 02, Last annotation update)

                5. Weekly updates of TrEMBL and non-redundant data sets

Weekly cumulative updates of TrEMBL are available by anonymous FTP and
from the EBI SRS server.

We also produce every week a complete non-redundant protein sequence
collection by providing three compressed files (these are in the directory
/pub/databases/sp_tr_nrdb on the EBI FTP server and in databases/sp_tr_nrdb
on the ExPASy server): sprot.dat.gz, trembl.dat.gz and trembl_new.dat.gz.
This set of non-redundant files is especially important for two types of
users:
(i) Managers of similarity search services. They can now provide what is
currently the most comprehensive and non-redundant data set of protein
sequences.
(ii) Anybody wanting to update their full copy of SWISS-PROT + TrEMBL to
their own schedule without having to wait for full releases of SWISS-PROT
or of TrEMBL.
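
After downloading one of the three compressed files, a quick sanity check is
to count the entries it contains. The Python sketch below does this without
decompressing the file to disk. It assumes that every entry in the flat file
ends with a line consisting only of "//", and the function name count_entries
is hypothetical.

```python
import gzip


def count_entries(path):
    """Count entries in a gzip-compressed SWISS-PROT/TrEMBL flat file.

    Assumption: each entry is terminated by a line containing only '//'.
    """
    count = 0
    # Read as text, line by line, so large files are never held in memory.
    with gzip.open(path, "rt", encoding="latin-1") as handle:
        for line in handle:
            if line.rstrip("\r\n") == "//":
                count += 1
    return count
```

Calling count_entries("trembl_new.dat.gz") on a local copy would then give
the number of entries in that weekly file.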

We also recently introduced Varsplic Expand, a program that generates
"expanded" sequences from SWISS-PROT records, i.e. sequences including the
variants specified by the varsplic, variant and conflict annotations. New
records are produced in either pseudo-SWISS-PROT or FASTA format for each
specified variant. More information and the data are available at
ftp://ftp.ebi.ac.uk/pub/databases/sp_tr_nrdb/

                6. Access/Data Distribution

FTP server:     ftp.ebi.ac.uk/pub/databases/trembl
SRS server:     http://srs.ebi.ac.uk/

TrEMBL is also available on the SWISS-PROT CD-ROM.
SWISS-PROT + TrEMBL is searchable on the following servers at the EBI:

FASTA3  (http://www.ebi.ac.uk/fasta33/)
BLAST2  (http://www.ebi.ac.uk/blast2/)
Bic_sw  (http://www.ebi.ac.uk/bic_sw/)
Scanps  (http://www.ebi.ac.uk/scanps/)
MPSrch  (http://www.ebi.ac.uk/MPsrch/)

For each TrEMBL release, a synchronized version of the concurrent SWISS-PROT
release is distributed at ftp.ebi.ac.uk/pub/databases/trembl/swissprot/

 7. Planned changes


7.1 Evidence tags:

    We are continuing with the introduction of evidence tags to SWISS-PROT and
    TrEMBL entries. The aim of this is to allow users to see where data items
    came from and to enable SWISS-PROT staff to automatically update data if
    the underlying evidence changes. This is ongoing internally and we hope
    to provide a public version in 2002. For more information,
    please see
    ftp://ftp.ebi.ac.uk/pub/databases/trembl/evidenceDocumentation.html
    We would welcome any feedback from the user community.

7.2 Conversion of TrEMBL to mixed case:

    Most of the DE (DEscription), GN (Gene Name), RC (Reference Comment)
    and CC (Comment) lines have been converted to mixed case internally. The
    conversion is ongoing and will be made public as the conversion of each
    line type reaches a satisfactory stage. A mixed case version of the DE
    line has been made public in this release.

7.3 Version of SWISS-PROT/TrEMBL in XML format:

    A distribution version of SWISS-PROT and TrEMBL in XML format is
    being developed. The first draft of the new XML format, SP-ML
    (SWISS-PROT Markup Language) can be found at
    http://www.ebi.ac.uk/swissprot/SP-ML.
    We would welcome any feedback from the user community.
  

TrEMBL release 20.0

Published March 1, 2002

                              TrEMBL Release Notes
                              Release 20, March 2002

    EMBL Outstation
    European Bioinformatics Institute (EBI)
    Wellcome Trust Genome Campus
    Hinxton
    Cambridge CB10 1SD
    United Kingdom

    Telephone: (+44 1223) 494 444
    Fax: (+44 1223) 494 468
    Electronic mail address: DATALIB@EBI.AC.UK
    WWW server: http://www.ebi.ac.uk/

    Swiss Institute of Bioinformatics (SIB)
    Centre Medical Universitaire
    1, rue Michel Servet
    1211 Geneva 4
    Switzerland

    Telephone: (+41 22) 702 50 50
    Fax: (+41 22) 702 58 58
    Electronic mail address: Amos.Bairoch@isb-sib.ch
    WWW server: http://www.expasy.org/


    Acknowledgements

    TrEMBL has been prepared by:

    o  Rolf Apweiler, Kirsty Bates, Margaret Biswas, Sergio Contrino,
       Daniel Barrell, Kirill Degtyarenko, Wolfgang Fleischmann, Gill Fraser,
       Henning Hermjakob, Kati Laiho, Alexander Kanapin, Youla Karavidopoulou,
       Paul Kersey, Minna Lehvaslaiho, Michele Magrane, Maria Jesus Martin,
       Virginie Mittard, Nicola Mulder, Claire O'Donovan, John F. O'Rourke,
       Eleanor Whitfield and Allyson Williams at the EMBL Outstation -
       European Bioinformatics Institute (EBI) in Hinxton, UK;
    o  Amos Bairoch, Alain Gateau, Alexandre Gattiker, Isabelle Phan
       and Sandrine Pilbout at the
       Swiss Institute of Bioinformatics in Geneva, Switzerland.

    Copyright Notice
    TrEMBL copyright (c) 2002 EMBL-EBI
    This manual and the database it accompanies may be copied and
    redistributed freely, without advance permission, provided
    that this copyright statement is reproduced with each copy.

    Citation

    If you  want to  cite  TrEMBL  in  a  publication  please  use  the
    following reference:


              Bairoch A., and Apweiler R.
              The SWISS-PROT protein sequence data bank and its supplement
              TrEMBL in 2000.
              Nucl. Acids Res. 28:45-48(2000).


                         1. Introduction


TrEMBL is a computer-annotated protein sequence database supplementing the
SWISS-PROT Protein Knowledgebase. TrEMBL contains the translations of
all coding sequences (CDS) present in the EMBL/GenBank/DDBJ Nucleotide
Sequence Databases and also protein sequences extracted from the literature
or submitted to SWISS-PROT, which are not yet integrated into SWISS-PROT.
TrEMBL can be considered as a preliminary section of SWISS-PROT. For all
TrEMBL entries which should finally be upgraded to the standard SWISS-PROT
quality, SWISS-PROT accession numbers have been assigned.

                        2. Why a supplement to SWISS-PROT?

The ongoing gene sequencing and mapping projects have dramatically
increased the number of protein sequences to be incorporated into SWISS-PROT.
We do not want to dilute the quality standards of SWISS-PROT by incorporating
sequences without proper sequence analysis and annotation, but we do want to
make the sequences available as fast as possible. TrEMBL achieves this second
goal, and is a major step in the process of speeding up subsequent
upgrading of annotation to the standard SWISS-PROT quality.
To address the problem of redundancy, the translations of all coding
sequences (CDS) in the EMBL Nucleotide Sequence Database already included
in SWISS-PROT have been removed from TrEMBL.


                             3. The Release

This TrEMBL release was created from the EMBL Nucleotide Sequence Database
release 69 and updates until 08.02.02 and contains 700'753 entries
and 203'489'769 amino acids. To minimize redundancy, the translations of all
coding sequences (CDS) in the EMBL Nucleotide Sequence Database already
included in SWISS-PROT release 40 and updates until 27.03.02 have been
removed from TrEMBL release 20.

TrEMBL is split in two main sections: SP-TrEMBL and REM-TrEMBL:
SP-TrEMBL (SWISS-PROT TrEMBL) contains the entries (623'159) which should be
eventually incorporated into SWISS-PROT. SWISS-PROT accession numbers have
been assigned for all SP-TrEMBL entries.

SP-TrEMBL is organized in subsections:

arc.dat (Archaea):                        1721 entries
arp.dat (Complete Archaeal proteomes):   22019 entries
fun.dat (Fungi):                         14172 entries
hum.dat (Human):                         29751 entries
inv.dat (Invertebrates):                 61859 entries
mam.dat (Other Mammals):                 10260 entries
mhc.dat (MHC proteins):                   7673 entries
org.dat (Organelles):                    55796 entries
phg.dat (Bacteriophages):                 4793 entries
pln.dat (Plants):                        61896 entries
pro.dat (Prokaryotes):                   71431 entries
prp.dat (Complete Prokaryote Proteomes):108287 entries
rod.dat (Rodents):                       25972 entries
unc.dat (Unclassified):                    143 entries
vrl.dat (Viruses):                       65583 entries
vrt.dat (Other Vertebrates):             11702 entries
vrv.dat (Retroviruses):                  70101 entries

65'161 new entries have been integrated in SP-TrEMBL. The sequences of
840 SP-TrEMBL entries have been updated and the annotation has been
updated in 231'260 entries.
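
As a quick consistency check, the subsection counts listed above can be
summed and compared with the SP-TrEMBL total quoted for this release. The
Python sketch below does this. The names parse_subsections and parse_count
are illustrative helpers introduced here and are not part of any TrEMBL
distribution.

```python
import re

# Subsection listing copied from the release 20 notes.
LISTING = """\
arc.dat (Archaea):                        1721 entries
arp.dat (Complete Archaeal proteomes):   22019 entries
fun.dat (Fungi):                         14172 entries
hum.dat (Human):                         29751 entries
inv.dat (Invertebrates):                 61859 entries
mam.dat (Other Mammals):                 10260 entries
mhc.dat (MHC proteins):                   7673 entries
org.dat (Organelles):                    55796 entries
phg.dat (Bacteriophages):                 4793 entries
pln.dat (Plants):                        61896 entries
pro.dat (Prokaryotes):                   71431 entries
prp.dat (Complete Prokaryote Proteomes):108287 entries
rod.dat (Rodents):                       25972 entries
unc.dat (Unclassified):                    143 entries
vrl.dat (Viruses):                       65583 entries
vrt.dat (Other Vertebrates):             11702 entries
vrv.dat (Retroviruses):                  70101 entries
"""

# One line looks like: "<file>.dat (<description>): <count> entries"
LINE_RE = re.compile(r"^(\w+\.dat) \((.+)\):\s*(\d+) entries$")


def parse_subsections(text):
    """Return a dict mapping each subsection file name to its entry count."""
    counts = {}
    for line in text.splitlines():
        match = LINE_RE.match(line.strip())
        if match:
            counts[match.group(1)] = int(match.group(3))
    return counts


def parse_count(text):
    """Convert a count written with apostrophe separators, e.g. 623'159."""
    return int(text.replace("'", ""))


counts = parse_subsections(LISTING)
total = sum(counts.values())
# Compare with the SP-TrEMBL total stated in the release notes.
print(total == parse_count("623'159"))
```

The same helpers can be pointed at the listing of any other release by
replacing LISTING and the quoted total.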

In the document deleteac.txt, you will find a list of all accession numbers
which were previously present in TrEMBL, but which have now been deleted from
the database.

REM-TrEMBL (REMaining TrEMBL) contains the entries (77'594) that we do
not want to include in SWISS-PROT. REM-TrEMBL entries have no accession
numbers. This section is organized in six subsections:

   1) Immunoglobulins and T-cell receptors (Immuno.dat)
      Most REM-TrEMBL entries are  immunoglobulins and  T-cell receptors. We
      stopped entering immunoglobulins and T-cell receptors into SWISS-PROT,
      because we only want to keep  the  germ line gene derived translations
      of these proteins in SWISS-PROT and not all known somatic recombinated
      variations of  these proteins.  We would like to  create a specialized
      database  dealing  with  these  sequences  as a further  supplement to
      SWISS-PROT  and  keep  only a  representative  cross-section of  these
      proteins in SWISS-PROT.

   2) Synthetic sequences (Synth.dat)
      Another category of data that will not be included in SWISS-PROT is
      synthetic sequences.  Again, we do not want to  leave these entries in
      TrEMBL. Ideally one should build a specialized database for artificial
      sequences as a further supplement to SWISS-PROT.

   3) Patent application sequences (Patent.dat)
      A third  subsection consists of  coding sequences captured from patent
      applications.  A thorough  survey of  these  entries  has  shown  that
      apart from a rather small minority  (which in  most cases have already
      been integrated in SWISS-PROT), most of these sequences contain either
      erroneous data or concern artificially generated sequences outside the
      scope of SWISS-PROT.

   4) Small fragments (Smalls.dat)
      Another subsection  consists of fragments  with fewer than eight amino
      acids.

   5) CDS not coding for real proteins (Pseudo.dat)
      This subsection consists of CDS translations where we have strong
      evidence to believe that these CDS are not coding for real proteins.

   6) Truncated proteins (Truncated.dat)
      The last subsection consists of truncated proteins which result from
      events like mutations introducing a stop codon leading to the truncation
      of the protein product.

                4. Format Differences Between SWISS-PROT and TrEMBL

The format and conventions used by TrEMBL follow those of SWISS-PROT as
closely as possible. Hence, it is not necessary to produce an additional
user manual and extensive release notes for TrEMBL. The information given
in the SWISS-PROT release notes and user manual is in general valid for
TrEMBL. The differences are mentioned below.

The general structure of an entry is identical in SWISS-PROT and TrEMBL.
The data class used in TrEMBL (in the ID line) is always 'PRELIMINARY',
whereas in SWISS-PROT it is always 'STANDARD'.

Differences in line types present in SWISS-PROT and TrEMBL:

The ID line (IDentification):

The entry name used in SP-TrEMBL is the same as the Accession Number of the
entry. The entry name used in REM-TrEMBL is the stable part of the protein_id
tagged to the corresponding CDS in the EMBL Nucleotide Sequence Database.
'protein_id' stands for the "Protein Identification" number. It is a number
that you will find in the feature table of the EMBL nucleotide sequence
entries in a qualifier called "/protein_id" which is tagged to every CDS.

Example:

FT   CDS             339..1514
FT                   /codon_start=1
FT                   /db_xref="PID:g1256015"
FT                   /product="dystrobrevin-epsilon"
FT                   /protein_id="AAC50431.1"

The protein_id is defined as follows in the DDBJ/EMBL/GenBank Feature Table
Definition documentation:
Qualifier          /protein_id
Definition         Protein Identifier, issued by International collaborators.
                   This qualifier consists of a stable ID portion (3+5 format
                   with 3 position letters and 5 numbers) plus a version
                   number after the decimal point.

Value format       <identifier>
Example            /protein_id="AAA12345.1"
Comment            When the protein sequence encoded by the CDS changes, only
                   the version number of the /protein_id value is incremented.
                   The stable part of the /protein_id remains unchanged and
                   as a result will permanently be associated with a given
                   protein. This qualifier is valid only on CDS features
                   which translate into a valid protein.
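
The rule above (a stable 3+5 portion followed by a version number) can be
sketched in Python as follows. This is only an illustration: the function
names are hypothetical, and the pattern assumes upper-case letters in the
stable portion, as in the examples shown in these notes.

```python
import re

# Stable part: 3 letters followed by 5 digits; version: integer after the dot.
PROTEIN_ID_RE = re.compile(r"^([A-Z]{3}\d{5})\.(\d+)$")


def split_protein_id(protein_id):
    """Split a /protein_id value into (stable_part, version)."""
    match = PROTEIN_ID_RE.match(protein_id)
    if match is None:
        raise ValueError(f"not a 3+5 protein_id: {protein_id!r}")
    return match.group(1), int(match.group(2))


def rem_trembl_entry_name(protein_id):
    """Entry name used in REM-TrEMBL: the stable part of the protein_id."""
    stable_part, _version = split_protein_id(protein_id)
    return stable_part


print(split_protein_id("AAC50431.1"))
print(rem_trembl_entry_name("AAC50431.1"))
```

Because only the version number changes when the sequence changes, the
derived entry name stays the same across sequence updates.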


The DT line (DaTe)

The format of the DT lines, which indicate when an entry was created and
updated, is identical to that defined in SWISS-PROT, but the DT lines in
TrEMBL refer to the TrEMBL release. The difference is shown in the example
below.

    DT lines in a SWISS-PROT entry:

    DT   01-JAN-1988 (Rel. 06, Created)
    DT   01-JUL-1989 (Rel. 11, Last sequence update)
    DT   01-AUG-1992 (Rel. 23, Last annotation update)

    DT lines in a TrEMBL entry:

    DT   01-NOV-1996 (TrEMBLrel. 01, Created)
    DT   01-NOV-1996 (TrEMBLrel. 01, Last sequence update)
    DT   01-FEB-1997 (TrEMBLrel. 02, Last annotation update)
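
A program reading both databases can tell the two kinds of DT line apart by
the release prefix. The Python sketch below is based only on the example
lines shown above; the function name parse_dt_line and the returned
dictionary layout are assumptions made for illustration.

```python
import re
from datetime import date

# Month abbreviations as they appear in DT lines.
MONTHS = {
    name: number
    for number, name in enumerate(
        "JAN FEB MAR APR MAY JUN JUL AUG SEP OCT NOV DEC".split(), start=1
    )
}

# "Rel." marks a SWISS-PROT release, "TrEMBLrel." a TrEMBL release.
DT_RE = re.compile(
    r"^DT\s+(\d{2})-([A-Z]{3})-(\d{4}) \((TrEMBLrel|Rel)\. (\d+), (.+)\)$"
)


def parse_dt_line(line):
    """Parse one DT line into its date, source database, release and event."""
    match = DT_RE.match(line.strip())
    if match is None:
        raise ValueError(f"unrecognised DT line: {line!r}")
    day, month, year, prefix, release, event = match.groups()
    return {
        "date": date(int(year), MONTHS[month], int(day)),
        "database": "TrEMBL" if prefix == "TrEMBLrel" else "SWISS-PROT",
        "release": int(release),
        "event": event,
    }


for example in (
    "DT   01-AUG-1992 (Rel. 23, Last annotation update)",
    "DT   01-FEB-1997 (TrEMBLrel. 02, Last annotation update)",
):
    parsed = parse_dt_line(example)
    print(parsed["database"], parsed["release"], parsed["date"].isoformat())
```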

                5. Weekly updates of TrEMBL and non-redundant data sets

Weekly cumulative updates of TrEMBL are available by anonymous FTP and
from the EBI SRS server.

We also produce every week a complete non-redundant protein sequence
collection by providing three compressed files (these are in the directory
/pub/databases/sp_tr_nrdb on the EBI FTP server and in databases/sp_tr_nrdb
on the ExPASy server): sprot.dat.gz, trembl.dat.gz and trembl_new.dat.gz.
This set of non-redundant files is especially important for two types of
users:
(i) Managers of similarity search services. They can now provide what is
currently the most comprehensive and non-redundant data set of protein
sequences.
(ii) Anybody wanting to update their full copy of SWISS-PROT + TrEMBL to
their own schedule without having to wait for full releases of SWISS-PROT
or of TrEMBL.
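
Users who mirror the weekly files may want to check how many entries each
file holds. The Python sketch below counts entries by counting ID lines,
assuming that every entry starts with exactly one ID line as in the
SWISS-PROT flat-file format. The function names and the assumption that
the three files sit in the current directory are illustrative only.

```python
import gzip
import os

# File names as given in the release notes.
WEEKLY_FILES = ("sprot.dat.gz", "trembl.dat.gz", "trembl_new.dat.gz")


def count_entries(lines):
    """Count entries in an iterable of flat-file lines (one ID line each)."""
    return sum(1 for line in lines if line.startswith("ID   "))


def count_entries_in_gzip(path):
    """Count entries in a gzip-compressed flat file without unpacking it."""
    with gzip.open(path, "rt", encoding="latin-1") as handle:
        return count_entries(handle)


# Assumes the weekly files were downloaded into the current directory.
for name in WEEKLY_FILES:
    if os.path.exists(name):
        print(name, count_entries_in_gzip(name))
    else:
        print(name, "not found locally")
```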

We also recently introduced Varsplic Expand, a program that generates
"expanded" sequences from SWISS-PROT records, i.e. sequences including the
variants specified by the varsplic, variant and conflict annotations. New
records are produced in either pseudo-SWISS-PROT or FASTA format for each
specified variant. More information and the data are available at
ftp://ftp.ebi.ac.uk/pub/databases/sp_tr_nrdb/

                6. Access/Data Distribution

FTP server:     ftp.ebi.ac.uk/pub/databases/trembl
SRS server:     http://srs.ebi.ac.uk/

TrEMBL is also available on the SWISS-PROT CD-ROM.
SWISS-PROT + TrEMBL is searchable on the following servers at the EBI:

FASTA3  (http://www.ebi.ac.uk/fasta33/)
BLAST2  (http://www.ebi.ac.uk/blast2/)
Bic_sw  (http://www.ebi.ac.uk/bic_sw/)
Scanps  (http://www.ebi.ac.uk/scanps/)
MPSrch  (http://www.ebi.ac.uk/MPsrch/)

For each TrEMBL release, a synchronized version of the concurrent SWISS-PROT
release is distributed at ftp.ebi.ac.uk/pub/databases/trembl/swissprot/

 7. Planned changes


7.1 Evidence tags:

    We are continuing with the introduction of evidence tags to SWISS-PROT and
    TrEMBL entries. The aim of this is to allow users to see where data items
    came from and to enable SWISS-PROT staff to automatically update data if
    the underlying evidence changes. This is ongoing internally and we hope
    to provide a public version early in 2002. For more information,
    please see
    ftp://ftp.ebi.ac.uk/pub/databases/trembl/evidenceDocumentation.html
    We would welcome any feedback from the user community.

7.2 Conversion of TrEMBL to mixed case:

    Most of the DE (DEscription), GN (Gene Name), RC (Reference Comment)
    and CC (Comment) lines have been converted to mixed case internally. The
    conversion is ongoing and will be made public when ready.

7.3 Version of SWISS-PROT/TrEMBL in XML format:

    A distribution version of SWISS-PROT and TrEMBL in XML format is
    being developed. The first draft of the new XML format, SP-ML
    (SWISS-PROT Markup Language) can be found at
    http://www.ebi.ac.uk/swissprot/SP-ML.
    We would welcome any feedback from the user community.

  

TrEMBL release 19.0

Published December 1, 2001

                              TrEMBL Release Notes
                              Release 19, December 2001

    EMBL Outstation
    European Bioinformatics Institute (EBI)
    Wellcome Trust Genome Campus
    Hinxton
    Cambridge CB10 1SD
    United Kingdom

    Telephone: (+44 1223) 494 444
    Fax: (+44 1223) 494 468
    Electronic mail address: DATALIB@EBI.AC.UK
    WWW server: http://www.ebi.ac.uk/

    Swiss Institute of Bioinformatics (SIB)
    Centre Medical Universitaire
    1, rue Michel Servet
    1211 Geneva 4
    Switzerland

    Telephone: (+41 22) 702 50 50
    Fax: (+41 22) 702 58 58
    Electronic mail address: Amos.Bairoch@isb-sib.ch
    WWW server: http://www.expasy.org/


    Acknowledgements

    TrEMBL has been prepared by:

    o  Rolf Apweiler, Kirsty Bates, Margaret Biswas, Sergio Contrino,
       Daniel Barrell, Kirill Degtyarenko, Wolfgang Fleischmann, Gill Fraser,
       Henning Hermjakob, Kati Laiho, Alexander Kanapin, Youla Karavidopoulou,
       Paul Kersey, Minna Lehvaslaiho, Michele Magrane, Maria Jesus Martin,
       Virginie Mittard, Nicola Mulder, Claire O'Donovan, John F. O'Rourke,
       Eleanor Whitfield and Allyson Williams at the EMBL Outstation -
       European Bioinformatics Institute (EBI) in Hinxton, UK;
    o  Amos Bairoch, Alain Gateau, Isabelle Phan and Sandrine Pilbout at the
       Swiss Institute of Bioinformatics in Geneva, Switzerland.

    Copyright Notice
    TrEMBL copyright (c) 2000 EMBL-EBI
    This manual and the database it accompanies may be copied and
    redistributed freely, without advance permission, provided
    that this copyright statement is reproduced with each copy.

    Citation

    If you  want to  cite  TrEMBL  in  a  publication  please  use  the
    following reference:


              Bairoch A., and Apweiler R.
              The SWISS-PROT protein sequence data bank and its supplement
              TrEMBL in 2000.
              Nucl. Acids Res. 28:45-48(2000).


                         1. Introduction


TrEMBL is a computer-annotated protein sequence database supplementing the
SWISS-PROT Protein Knowledgebase. TrEMBL contains the translations of
all coding sequences (CDS) present in the EMBL Nucleotide Sequence Database
not yet integrated in SWISS-PROT. TrEMBL can be considered as a preliminary
section of SWISS-PROT. For all TrEMBL entries which should finally be
upgraded to the standard SWISS-PROT quality, SWISS-PROT accession numbers
have been assigned.

                        2. Why a supplement to SWISS-PROT?

The ongoing gene sequencing and mapping projects have dramatically
increased the number of protein sequences to be incorporated into SWISS-PROT.
We do not want to dilute the quality standards of SWISS-PROT by incorporating
sequences without proper sequence analysis and annotation, but we do want to
make the sequences available as fast as possible. TrEMBL achieves this second
goal, and is a major step in the process of speeding up subsequent
upgrading of annotation to the standard SWISS-PROT quality.
To address the problem of redundancy, the translations of all coding
sequences (CDS) in the EMBL Nucleotide Sequence Database already included
in SWISS-PROT have been removed from TrEMBL.


                             3. The Release

This TrEMBL release was created from the EMBL Nucleotide Sequence Database
release 68 and updates until 16.11.01 and contains 636'825 entries
and 184'332'036 amino acids. To minimize redundancy, the translations of all
coding sequences (CDS) in the EMBL Nucleotide Sequence Database already
included in SWISS-PROT release 40 and updates until 13.12.01 have been
removed from TrEMBL release 19.

TrEMBL is split in two main sections: SP-TrEMBL and REM-TrEMBL:
SP-TrEMBL (SWISS-PROT TrEMBL) contains the entries (562'222) which should be
eventually incorporated into SWISS-PROT. SWISS-PROT accession numbers have
been assigned for all SP-TrEMBL entries.

SP-TrEMBL is organized in subsections:

arc.dat (Archaea):                        1692 entries
arp.dat (Complete Archaeal proteomes):   19590 entries
fun.dat (Fungi):                         13339 entries
hum.dat (Human):                         29064 entries
inv.dat (Invertebrates):                 60611 entries
mam.dat (Other Mammals):                  9724 entries
mhc.dat (MHC proteins):                   7434 entries
org.dat (Organelles):                    50792 entries
phg.dat (Bacteriophages):                 4368 entries
pln.dat (Plants):                        58841 entries
pro.dat (Prokaryotes):                   69426 entries
prp.dat (Complete Prokaryote Proteomes): 74477 entries
rod.dat (Rodents):                       24185 entries
unc.dat (Unclassified):                    252 entries
vrl.dat (Viruses):                       60309 entries
vrt.dat (Other Vertebrates):             11003 entries
vrv.dat (Retroviruses):                  67115 entries

80'772 new entries have been integrated in SP-TrEMBL. The sequences of
1388 SP-TrEMBL entries have been updated and the annotation has been
updated in 321'110 entries.

In the document deleteac.txt, you will find a list of all accession numbers
which were previously present in TrEMBL, but which have now been deleted from
the database.

REM-TrEMBL (REMaining TrEMBL) contains the entries (74'603) that we do
not want to include in SWISS-PROT. REM-TrEMBL entries have no accession
numbers. This section is organized in six subsections:

   1) Immunoglobulins and T-cell receptors (Immuno.dat)
      Most REM-TrEMBL entries are  immunoglobulins and  T-cell receptors. We
      stopped entering immunoglobulins and T-cell receptors into SWISS-PROT,
      because we only want to keep  the  germ line gene derived translations
      of these proteins in SWISS-PROT and not all known somatic recombinated
      variations of  these proteins.  We would like to  create a specialized
      database  dealing  with  these  sequences  as a further  supplement to
      SWISS-PROT  and  keep  only a  representative  cross-section of  these
      proteins in SWISS-PROT.

   2) Synthetic sequences (Synth.dat)
      Another category of data that will not be included in SWISS-PROT is
      synthetic sequences.  Again, we do not want to  leave these entries in
      TrEMBL. Ideally one should build a specialized database for artificial
      sequences as a further supplement to SWISS-PROT.

   3) Patent application sequences (Patent.dat)
      A third  subsection consists of  coding sequences captured from patent
      applications.  A thorough  survey of  these  entries  has  shown  that
      apart from a rather small minority  (which in  most cases have already
      been integrated in SWISS-PROT), most of these sequences contain either
      erroneous data or concern artificially generated sequences outside the
      scope of SWISS-PROT.

   4) Small fragments (Smalls.dat)
      Another subsection  consists of fragments  with fewer than eight amino
      acids.

   5) CDS not coding for real proteins (Pseudo.dat)
      This subsection consists of CDS translations where we have strong
      evidence to believe that these CDS are not coding for real proteins.

   6) Truncated proteins (Truncated.dat)
      The last subsection consists of truncated proteins which result from
      events like mutations introducing a stop codon leading to the truncation
      of the protein product.


                4. Format Differences Between SWISS-PROT and TrEMBL

The format and conventions used by TrEMBL follow those of SWISS-PROT as
closely as possible. Hence, it is not necessary to produce an additional
user manual and extensive release notes for TrEMBL. The information given
in the SWISS-PROT release notes and user manual is in general valid for
TrEMBL. The differences are mentioned below.

The general structure of an entry is identical in SWISS-PROT and TrEMBL.
The data class used in TrEMBL (in the ID line) is always 'PRELIMINARY',
whereas in SWISS-PROT it is always 'STANDARD'.
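
The data class therefore identifies which database an entry comes from.
The Python sketch below reads it from an ID line. It assumes the ID line
layout of the SWISS-PROT user manual (entry name, data class, molecule
type, sequence length), which is not reproduced in these notes, and the
two sample lines are made up for illustration.

```python
# Data class as stated in the release notes -> source database.
DATABASE_BY_DATA_CLASS = {
    "STANDARD": "SWISS-PROT",
    "PRELIMINARY": "TrEMBL",
}


def data_class(id_line):
    """Return the data class field of an ID line (assumed to be field 3)."""
    fields = id_line.split()
    if len(fields) < 3 or fields[0] != "ID":
        raise ValueError(f"not an ID line: {id_line!r}")
    return fields[2].rstrip(";")


def source_database(id_line):
    """Map an ID line to SWISS-PROT or TrEMBL using its data class."""
    return DATABASE_BY_DATA_CLASS[data_class(id_line)]


# Hypothetical ID lines, for illustration only.
print(source_database("ID   CYC_BOVIN      STANDARD;      PRT;   104 AA."))
print(source_database("ID   Q12345         PRELIMINARY;   PRT;   350 AA."))
```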

Differences in line types present in SWISS-PROT and TrEMBL:

The ID line (IDentification):

The entry name used in SP-TrEMBL is the same as the Accession Number of the
entry. The entry name used in REM-TrEMBL is the stable part of the protein_id
tagged to the corresponding CDS in the EMBL Nucleotide Sequence Database.
'protein_id' stands for the "Protein Identification" number. It is a number
that you will find in the feature table of the EMBL nucleotide sequence
entries in a qualifier called "/protein_id" which is tagged to every CDS.

Example:

FT   CDS             339..1514
FT                   /codon_start=1
FT                   /db_xref="PID:g1256015"
FT                   /product="dystrobrevin-epsilon"
FT                   /protein_id="AAC50431.1"

The protein_id is defined as follows in the DDBJ/EMBL/GenBank Feature Table
Definition documentation:
Qualifier          /protein_id
Definition         Protein Identifier, issued by International collaborators.
                   This qualifier consists of a stable ID portion (3+5 format
                   with 3 position letters and 5 numbers) plus a version
                   number after the decimal point.

Value format       <identifier>
Example            /protein_id="AAA12345.1"
Comment            When the protein sequence encoded by the CDS changes, only
                   the version number of the /protein_id value is incremented.
                   The stable part of the /protein_id remains unchanged and
                   as a result will permanently be associated with a given
                   protein. This qualifier is valid only on CDS features
                   which translate into a valid protein.


The DT line (DaTe)

The format of the DT lines, which indicate when an entry was created and
updated, is identical to that defined in SWISS-PROT, but the DT lines in
TrEMBL refer to the TrEMBL release. The difference is shown in the example
below.

    DT lines in a SWISS-PROT entry:

    DT   01-JAN-1988 (Rel. 06, Created)
    DT   01-JUL-1989 (Rel. 11, Last sequence update)
    DT   01-AUG-1992 (Rel. 23, Last annotation update)

    DT lines in a TrEMBL entry:

    DT   01-NOV-1996 (TrEMBLrel. 01, Created)
    DT   01-NOV-1996 (TrEMBLrel. 01, Last sequence update)
    DT   01-FEB-1997 (TrEMBLrel. 02, Last annotation update)


                5. Weekly updates of TrEMBL and non-redundant data sets

Weekly cumulative updates of TrEMBL are available by anonymous FTP and
from the EBI SRS server.

We also produce every week a complete non-redundant protein sequence
collection by providing three compressed files (these are in the directory
/pub/databases/sp_tr_nrdb on the EBI FTP server and in databases/sp_tr_nrdb
on the ExPASy server): sprot.dat.gz, trembl.dat.gz and trembl_new.dat.gz.
This set of non-redundant files is especially important for two types of
users:
(i) Managers of similarity search services. They can now provide what is
currently the most comprehensive and non-redundant data set of protein
sequences.
(ii) Anybody wanting to update their full copy of SWISS-PROT + TrEMBL to
their own schedule without having to wait for full releases of SWISS-PROT
or of TrEMBL.

We also recently introduced Varsplic Expand, a program that generates
"expanded" sequences from SWISS-PROT records, i.e. sequences including the
variants specified by the varsplic, variant and conflict annotations. New
records are produced in either pseudo-SWISS-PROT or FASTA format for each
specified variant. More information and the data are available at
ftp://ftp.ebi.ac.uk/pub/databases/sp_tr_nrdb/


                6. Access/Data Distribution

FTP server:     ftp.ebi.ac.uk/pub/databases/trembl
SRS server:     http://srs.ebi.ac.uk/

TrEMBL is also available on the SWISS-PROT CD-ROM.
SWISS-PROT + TrEMBL is searchable on the following servers at the EBI:

FASTA3  (http://www.ebi.ac.uk/fasta33/)
BLAST2  (http://www.ebi.ac.uk/blast2/)
Bic_sw  (http://www.ebi.ac.uk/bic_sw/)
Scanps  (http://www.ebi.ac.uk/scanps/)
MPSrch  (http://www.ebi.ac.uk/MPsrch/)


 7. Description of changes made to TrEMBL since release 18

7.1 Changes concerning cross-references (DR line)

    We have added cross-references from TrEMBL to a number of new databases.

    7.1.1 MEROPS, the protease database, available at http://www.merops.co.uk

    7.1.2 ANU-2DPAGE, the Australian National University Two-Dimensional
          Polyacrylamide Gel Electrophoresis Database, available at
          http://semele.anu.edu.au/2d/2d.html

    7.1.3 PHCI-2DPAGE, the two-dimensional polyacrylamide gel electrophoresis
          database at Universidad Complutense de Madrid, available at
          http://babbage.csc.ucm.es/2d/2d.html

    7.1.4 PMMA-2DPAGE, the Purkyne Military Medical Academy 2D-PAGE database,
          available at http://www.pmma.pmfhk.cz/2d/2d.html

    7.1.5 Siena-2DPAGE, the 2D-PAGE database from the Department of Molecular
          Biology, University of Siena, Italy, available at
          http://www.bio-mol.unisi.it/2d/2d.html

A list of all cross-references in TrEMBL to other databases is provided below:
EMBL
PIR
PDB
HSSP
MEROPS
REBASE
TRANSFAC
Aarhus/Ghent-2DPAGE
ANU-2DPAGE
COMPLUYEAST-2DPAGE
PHCI-2DPAGE
PMMA-2DPAGE
Siena-2DPAGE
FlyBase
Leproma
MIM
MGD
MypuList
SGD
SubtiList
TIGR
TubercuList
WormPep
ZFIN
InterPro
Pfam
PRINTS
ProDom
SMART
PROSITE
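
The list above can be used to validate the DR lines of an entry. The Python
sketch below assumes the DR line layout of the SWISS-PROT user manual, in
which the database name is the first field and fields are separated by
semicolons. The function names and the sample lines (including the
database name MadeUpDB) are invented for illustration.

```python
# Databases cross-referenced from TrEMBL, as listed in the release notes.
TREMBL_XREF_DATABASES = {
    "EMBL", "PIR", "PDB", "HSSP", "MEROPS", "REBASE", "TRANSFAC",
    "Aarhus/Ghent-2DPAGE", "ANU-2DPAGE", "COMPLUYEAST-2DPAGE",
    "PHCI-2DPAGE", "PMMA-2DPAGE", "Siena-2DPAGE", "FlyBase", "Leproma",
    "MIM", "MGD", "MypuList", "SGD", "SubtiList", "TIGR", "TubercuList",
    "WormPep", "ZFIN", "InterPro", "Pfam", "PRINTS", "ProDom", "SMART",
    "PROSITE",
}


def dr_database(line):
    """Return the database name of a DR line (assumed to be the first field)."""
    if not line.startswith("DR   "):
        raise ValueError(f"not a DR line: {line!r}")
    return line[5:].split(";", 1)[0].strip()


def unknown_databases(lines):
    """List database names in DR lines that are not in the documented set."""
    seen = {dr_database(line) for line in lines if line.startswith("DR   ")}
    return sorted(seen - TREMBL_XREF_DATABASES)


# Hypothetical DR lines, for illustration only.
SAMPLE = [
    "DR   EMBL; U12345; AAA12345.1; -.",
    "DR   InterPro; IPR000001; -.",
    "DR   MadeUpDB; X0001; -.",
]
print(unknown_databases(SAMPLE))
```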

7.2 New database divisions:

    Three new database divisions were created as previously announced
    to reflect the specialised attention we are giving to complete
    proteomes and the HAMAP project. The new viral division is for
    operational reasons:

    7.2.1 New database division 'arp'
          It will contain archaeal complete proteome entries.
    7.2.2 New database division 'prp'
          It will contain bacterial complete proteome entries.
    7.2.3 New database division 'vrv'
          It will contain retrovirus entries.
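
Because of the new divisions, per-division counts in release 19 are not
directly comparable with those of release 18. The Python sketch below folds
each new division back into an older one before comparing. The pairing used
(arp with arc, prp with pro, vrv with vrl) is an assumption inferred from
the descriptions above, not something these notes state, and the counts are
copied from the subsection listings of the two releases.

```python
# Entry counts copied from the release 18 and release 19 subsection listings.
RELEASE_18 = {"arc": 18384, "pro": 125274, "vrl": 108309}
RELEASE_19 = {
    "arc": 1692, "arp": 19590,
    "pro": 69426, "prp": 74477,
    "vrl": 60309, "vrv": 67115,
}

# Assumed pairing of each new division with the division it was split from.
PARENT_OF = {"arp": "arc", "prp": "pro", "vrv": "vrl"}


def merge_new_divisions(counts):
    """Fold the new divisions back into their assumed parent divisions."""
    merged = {}
    for division, count in counts.items():
        parent = PARENT_OF.get(division, division)
        merged[parent] = merged.get(parent, 0) + count
    return merged


merged_19 = merge_new_divisions(RELEASE_19)
for division in sorted(RELEASE_18):
    before = RELEASE_18[division]
    after = merged_19[division]
    print(f"{division}: {before} -> {after} ({after - before:+d})")
```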


 8. Planned changes


8.1 Evidence tags:

    We are continuing with the introduction of evidence tags to SWISS-PROT and
    TrEMBL entries. The aim of this is to allow users to see where data items
    came from and to enable SWISS-PROT staff to automatically update data if
    the underlying evidence changes. This is ongoing internally and we hope
    to provide a public version early in 2002. For more information,
    please see
    ftp://ftp.ebi.ac.uk/pub/databases/trembl/evidenceDocumentation.html
    We would welcome any feedback from the user community.

8.2 Conversion of TrEMBL to mixed case:

    Most of the DE (DEscription), GN (Gene Name) and RC (Reference Comment)
    lines are being converted to mixed case internally. The conversion is
    ongoing and will be made public when ready.

8.3 Version of SWISS-PROT/TrEMBL in XML format:

    A distribution version of SWISS-PROT and TrEMBL in XML format is
    being developed. The specifications of this new format will be described
    when it is first implemented in TrEMBL.

8.4 Multiple RP lines:

    In the workreleases from release 19 onwards, there can be more than one
    RP (Reference Position) line per reference in a TrEMBL entry. For more
    information on this development, please read SWISS-PROT Protein
    Knowledgebase Release Notes 40 at http://www.expasy.org/sprot/relnotes
  

TrEMBL release 18.0

Published October 2, 2001

                              TrEMBL Release Notes
                              Release 18, October 2001

    EMBL Outstation
    European Bioinformatics Institute (EBI)
    Wellcome Trust Genome Campus
    Hinxton
    Cambridge CB10 1SD
    United Kingdom

    Telephone: (+44 1223) 494 444
    Fax: (+44 1223) 494 468
    Electronic mail address: DATALIB@EBI.AC.UK
    WWW server: http://www.ebi.ac.uk/

    Swiss Institute of Bioinformatics (SIB)
    Centre Medical Universitaire
    1, rue Michel Servet
    1211 Geneva 4
    Switzerland

    Telephone: (+41 22) 702 50 50
    Fax: (+41 22) 702 58 58
    Electronic mail address: Amos.Bairoch@isb-sib.ch
    WWW server: http://www.expasy.ch/


    Acknowledgements

    TrEMBL has been prepared by:

    o  Rolf Apweiler, Kirsty Bates, Margaret Biswas, Sergio Contrino,
       Kirill Degtyarenko, Wolfgang Fleischmann, Gill Fraser,
       Henning Hermjakob, Alexander Kanapin, Youla Karavidopoulou,
       Paul Kersey, Minna Lehvaslaiho, Michele Magrane,
       Maria Jesus Martin, Nicoletta Mitaritonna, Virginie Mittard,
       Steffen Moeller, Nicola Mulder, Claire O'Donovan, John F. O'Rourke,
       Isabelle Phan, Sandrine Pilbout, Eleanor Whitfield and Allyson Williams
       at the EMBL Outstation - European Bioinformatics Institute (EBI)
       in Hinxton, UK;
    o  Amos Bairoch and Alain Gateau at the Swiss Institute of Bioinformatics
       in Geneva, Switzerland.

    Copyright Notice
    TrEMBL copyright (c) 2000 EMBL-EBI
    This manual and the database it accompanies may be copied and
    redistributed freely, without advance permission, provided
    that this copyright statement is reproduced with each copy.

    Citation

    If you  want to  cite  TrEMBL  in  a  publication  please  use  the
    following reference:

              Bairoch A., and Apweiler R.
              The SWISS-PROT protein sequence data bank and its supplement
              TrEMBL in 2000.
              Nucl. Acids Res. 28:45-48(2000).


                         1. Introduction


TrEMBL is a computer-annotated protein sequence database supplementing the
SWISS-PROT Protein Knowledgebase. TrEMBL contains the translations of
all coding sequences (CDS) present in the EMBL Nucleotide Sequence Database
not yet integrated in SWISS-PROT. TrEMBL can be considered as a preliminary
section of SWISS-PROT. For all TrEMBL entries which should finally be
upgraded to the standard SWISS-PROT quality, SWISS-PROT accession numbers
have been assigned.

                        2. Why a supplement to SWISS-PROT?

The ongoing gene sequencing and mapping projects have dramatically
increased the number of protein sequences to be incorporated into SWISS-PROT.
We do not want to dilute the quality standards of SWISS-PROT by incorporating
sequences without proper sequence analysis and annotation, but we do want to
make the sequences available as fast as possible. TrEMBL achieves this second
goal, and is a major step in the process of speeding up subsequent
upgrading of annotation to the standard SWISS-PROT quality.
To address the problem of redundancy, the translations of all coding
sequences (CDS) in the EMBL Nucleotide Sequence Database already included
in SWISS-PROT have been removed from TrEMBL.


                             3. The Release

The goal of this TrEMBL release is to achieve synchronization with the
SWISS-PROT Protein Knowledgebase release 40.0. Therefore all sequence
entries present in SWISS-PROT release 40.0 have been removed from TrEMBL
release 18. In addition, there was further upgrading of existing TrEMBL
entries and some new entries were incorporated. Release 18 contains
558'150 entries and 160'420'778 amino acids.
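
The entry and amino acid totals quoted in these notes give a rough mean
sequence length per entry. The Python sketch below computes it for releases
18, 19 and 20 using the figures stated in the respective release notes. The
helper names are illustrative, and the result is a simple average rather
than an official statistic.

```python
def parse_count(text):
    """Convert a count written with apostrophe separators, e.g. 558'150."""
    return int(text.replace("'", ""))


# (entries, amino acids) as quoted in the release notes.
RELEASE_FIGURES = {
    18: ("558'150", "160'420'778"),
    19: ("636'825", "184'332'036"),
    20: ("700'753", "203'489'769"),
}


def mean_length(entries, residues):
    """Mean number of amino acids per entry."""
    return parse_count(residues) / parse_count(entries)


for release, (entries, residues) in sorted(RELEASE_FIGURES.items()):
    print(f"release {release}: {mean_length(entries, residues):.1f} residues per entry")
```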

TrEMBL is split in two main sections: SP-TrEMBL and REM-TrEMBL:
SP-TrEMBL (SWISS-PROT TrEMBL) contains the entries (484'551) which should be
eventually incorporated into SWISS-PROT. SWISS-PROT accession numbers have
been assigned for all SP-TrEMBL entries.

SP-TrEMBL is organized in subsections:

arc.dat (Archaea):             18384 entries
fun.dat (Fungi):               12481 entries
hum.dat (Human):               22925 entries
inv.dat (Invertebrates):       54665 entries
mam.dat (Other Mammals):        8280 entries
mhc.dat (MHC proteins):         6813 entries
org.dat (Organelles):          41585 entries
phg.dat (Bacteriophages):       3892 entries
pln.dat (Plants):              50806 entries
pro.dat (Prokaryotes):        125274 entries
rod.dat (Rodents):             21335 entries
unc.dat (Unclassified):          135 entries
vrl.dat (Viruses):            108309 entries
vrt.dat (Other Vertebrates):    9667 entries

17'914 new entries have been integrated in SP-TrEMBL. The sequences of
1634 SP-TrEMBL entries have been updated and the annotation has been
updated in 120'529 entries.

In the document deleteac.txt, you will find a list of all accession numbers
which were previously present in TrEMBL, but which have now been deleted from
the database.

REM-TrEMBL (REMaining TrEMBL) contains the entries (73'599) that we do
not want to include in SWISS-PROT. REM-TrEMBL entries have no accession
numbers. This section is organized in six subsections:

   1) Immunoglobulins and T-cell receptors (Immuno.dat)
      Most REM-TrEMBL entries are  immunoglobulins and  T-cell receptors. We
      stopped entering immunoglobulins and T-cell receptors into SWISS-PROT,
      because we only want to keep  the  germ line gene derived translations
      of these proteins in SWISS-PROT and not all known somatic recombinated
      variations of  these proteins.  We would like to  create a specialized
      database  dealing  with  these  sequences  as a further  supplement to
      SWISS-PROT  and  keep  only a  representative  cross-section of  these
      proteins in SWISS-PROT.

   2) Synthetic sequences (Synth.dat)
      Another category of data that will not be included in SWISS-PROT is
      synthetic sequences.  Again, we do not want to  leave these entries in
      TrEMBL. Ideally one should build a specialized database for artificial
      sequences as a further supplement to SWISS-PROT.

   3) Patent application sequences (Patent.dat)
      A third  subsection consists of  coding sequences captured from patent
      applications.  A thorough  survey of  these  entries  has  shown  that
      apart from a rather small minority  (which in  most cases have already
      been integrated in SWISS-PROT), most of these sequences contain either
      erroneous data or concern artificially generated sequences outside the
      scope of SWISS-PROT.

   4) Small fragments (Smalls.dat)
      Another subsection  consists of fragments  with fewer than eight amino
      acids.

   5) CDS not coding for real proteins (Pseudo.dat)
      This subsection consists of CDS translations where we have strong
      evidence to believe that these CDS are not coding for real proteins.

   6) Truncated proteins (Truncated.dat)
      The last subsection consists of truncated proteins which result from
      events like mutations introducing a stop codon leading to the truncation
      of the protein product.


                4. Format Differences Between SWISS-PROT and TrEMBL

The format and conventions used by TrEMBL follow those of SWISS-PROT as
closely as possible. Hence, it is not necessary to produce an additional
user manual and extensive release notes for TrEMBL. The information given
in the SWISS-PROT release notes and user manual is in general valid for
TrEMBL. The differences are mentioned below.

The general structure of an entry is identical in SWISS-PROT and TrEMBL.
The data class used in TrEMBL (in the ID line) is always 'PRELIMINARY',
whereas in SWISS-PROT it is always 'STANDARD'.

Differences in line types present in SWISS-PROT and TrEMBL:

The ID line (IDentification):

The entry name used in SP-TrEMBL is the same as the Accession Number of the
entry. The entry name used in REM-TrEMBL is the stable part of the protein_id
tagged to the corresponding CDS in the EMBL Nucleotide Sequence Database.
'protein_id' stands for the "Protein Identification" number. It is a number
that you will find in the feature table of the EMBL nucleotide sequence
entries in a qualifier called "/protein_id" which is tagged to every CDS.

Example:

FT   CDS             339..1514
FT                   /codon_start=1
FT                   /db_xref="PID:g1256015"
FT                   /product="dystrobrevin-epsilon"
FT                   /protein_id="AAC50431.1"

The protein_id is defined as follows in the DDBJ/EMBL/GenBank Feature Table
Definition documentation:
Qualifier          /protein_id
Definition         Protein Identifier, issued by International collaborators.
                   This qualifier consists of a stable ID portion (3+5 format
                   with 3 position letters and 5 numbers) plus a version
                   number after the decimal point.

Value format       <identifier>
Example            /protein_id="AAA12345.1"
Comment            When the protein sequence encoded by the CDS changes, only
                   the version number of the /protein_id value is incremented.
                   The stable part of the /protein_id remains unchanged and
                   as a result will permanently be associated with a given
                   protein. This qualifier is valid only on CDS features
                   which translate into a valid protein.
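
The "3+5" format described above can be checked mechanically. The following
is a minimal sketch, not part of the TrEMBL distribution: the function name
and the assumption that the three letters are upper case are illustrative
only.

```python
import re

# Assumed reading of the "3+5" format: three upper-case letters, five
# digits, a dot, then an integer version number.
PROTEIN_ID_RE = re.compile(r"([A-Z]{3}\d{5})\.(\d+)")


def split_protein_id(protein_id):
    """Return (stable_part, version) for a /protein_id value."""
    match = PROTEIN_ID_RE.fullmatch(protein_id)
    if match is None:
        raise ValueError(f"not a 3+5 protein_id: {protein_id!r}")
    return match.group(1), int(match.group(2))


# The stable part is what REM-TrEMBL uses as the entry name.
stable, version = split_protein_id("AAC50431.1")
print(stable, version)
```

Only the version number changes when the encoded protein sequence changes,
so two values with the same stable part refer to the same protein.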


The DT line (DaTe)

The format of the DT lines, which indicate when an entry was created and
updated, is identical to that defined in SWISS-PROT, but the DT lines in
TrEMBL refer to the TrEMBL release. The difference is shown in the example
below.

    DT lines in a SWISS-PROT entry:

    DT   01-JAN-1988 (Rel. 06, Created)
    DT   01-JUL-1989 (Rel. 11, Last sequence update)
    DT   01-AUG-1992 (Rel. 23, Last annotation update)

    DT lines in a TrEMBL entry:

    DT   01-NOV-1996 (TrEMBLrel. 01, Created)
    DT   01-NOV-1996 (TrEMBLrel. 01, Last sequence update)
    DT   01-FEB-1997 (TrEMBLrel. 02, Last annotation update)
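
The two DT line flavours shown above differ only in the release prefix
("Rel." versus "TrEMBLrel."). A parser can use that prefix to tell them
apart. This is a hedged sketch: the regular expression and the function
name are assumptions based solely on the example lines above.

```python
import re
from datetime import date

MONTHS = {
    name: number
    for number, name in enumerate(
        ["JAN", "FEB", "MAR", "APR", "MAY", "JUN",
         "JUL", "AUG", "SEP", "OCT", "NOV", "DEC"],
        start=1,
    )
}

# Pattern inferred from the example DT lines; not an official grammar.
DT_RE = re.compile(
    r"DT\s+(\d{2})-([A-Z]{3})-(\d{4}) \((TrEMBLrel|Rel)\. (\d+), ([^)]+)\)"
)


def parse_dt_line(line):
    """Parse one DT line into a dict of date, database, release and event."""
    match = DT_RE.fullmatch(line.strip())
    if match is None:
        raise ValueError(f"unrecognised DT line: {line!r}")
    day, month, year, prefix, release, event = match.groups()
    return {
        "date": date(int(year), MONTHS[month], int(day)),
        "database": "TrEMBL" if prefix == "TrEMBLrel" else "SWISS-PROT",
        "release": int(release),
        "event": event,
    }


print(parse_dt_line("DT   01-FEB-1997 (TrEMBLrel. 02, Last annotation update)"))
```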

                5. Weekly updates of TrEMBL and non-redundant data sets

Weekly cumulative updates of TrEMBL are available by anonymous FTP and
from the EBI SRS server.

We also produce every week a complete non-redundant protein sequence
collection by providing three compressed files (these are in the directory
/pub/databases/sp_tr_nrdb on the EBI FTP server and in databases/sp_tr_nrdb
on the ExPASy server): sprot.dat.gz, trembl.dat.gz and trembl_new.dat.gz.
This set of non-redundant files is especially important for two types of
users:
(i) Managers of similarity search services. They can now provide what is
currently the most comprehensive and non-redundant data set of protein
sequences.
(ii) Anybody wanting to update their full copy of SWISS-PROT + TrEMBL to
their own schedule without having to wait for full releases of SWISS-PROT
or of TrEMBL.

We also recently introduced Varsplic Expand which is a program to generate
"expanded" sequences from SWISS-PROT records i.e. sequences including the
variants specified by the varsplic, variant and conflict annotations. New
records are produced in either pseudo-SWISS-PROT or FASTA format for each
specified variant. More information and the data are available at
ftp://ftp.ebi.ac.uk/pub/databases/sp_tr_nrdb/

                6. Access/Data Distribution

FTP server:     ftp.ebi.ac.uk/pub/databases/trembl
SRS server:     http://srs.ebi.ac.uk/

TrEMBL is also available on the SWISS-PROT CD-ROM.
SWISS-PROT + TrEMBL is searchable on the following servers at the EBI:

FASTA3  (http://www.ebi.ac.uk/fasta33/)
BLAST2  (http://www.ebi.ac.uk/blast2/)
Bic_sw  (http://www.ebi.ac.uk/bic_sw/)
Scanps  (http://www.ebi.ac.uk/scanps/)
MPSrch  (http://www.ebi.ac.uk/MPsrch/)


                7. Planned Changes

7.1 Evidence tags:

    We are continuing with the introduction of evidence tags to SWISS-PROT and
    TrEMBL entries. The aim of this is to allow users to see where data items
    came from and to enable SWISS-PROT staff to automatically update data if
    the underlying evidence changes. This is ongoing internally and we hope
    to provide a public version early in 2002. For more information,
    please see
    ftp://ftp.ebi.ac.uk/pub/databases/trembl/evidenceDocumentation.html
    We would welcome any feedback from the user community.

7.2 New database divisions:

    Three new database divisions will be created starting from Release 19
    to reflect the specialised attention we are giving to complete proteomes
    as a whole and as part of the HAMAP project:
    7.2.1 New database division 'arp'
          It will contain archaeal complete proteome entries.
    7.2.2 New database division 'prp'
          It will contain bacterial complete proteome entries.
    7.2.3 New database division 'vrv'
          It will contain retrovirus entries.

7.3 Conversion of TrEMBL to mixed case:

    Most of the DE (DEscription), GN (Gene Name) and RC (Reference Comment)
    lines will be converted to mixed case. The conversion will be ongoing.

7.4 Changes concerning cross-references (DR line):

    We have added cross-references from TrEMBL to new databases:

    7.4.1 The Mycobacterium leprae genome database Leproma, which is
          available at http://genolist.pasteur.fr/Leproma/.

    7.4.2 The Mycoplasma pulmonis genome database MypuList, available at
          http://genolist.pasteur.fr/MypuList/.

7.5 Version of SWISS-PROT/TrEMBL in XML format:

    A distribution version of SWISS-PROT and TrEMBL in XML format is
    being developed. The specifications of this new format will be described
    when it is first implemented in TrEMBL.

7.6 Multiple RP lines:

    From release 19, there can be more than one RP (Reference Position)
    line per reference in a TrEMBL entry. For more information on this
    development, please read SWISS-PROT Protein Knowledgebase Release Notes
    40 at http://www.expasy.org/sprot/relnotes
  

TrEMBL release 17.0

Published June 1, 2001

                              TrEMBL Release Notes
                              Release 17, June 2001

    EMBL Outstation
    European Bioinformatics Institute (EBI)
    Wellcome Trust Genome Campus
    Hinxton
    Cambridge CB10 1SD
    United Kingdom

    Telephone: (+44 1223) 494 444
    Fax: (+44 1223) 494 468
    Electronic mail address: DATALIB@EBI.AC.UK
    WWW server: http://www.ebi.ac.uk/

    Swiss Institute of Bioinformatics (SIB)
    Centre Medical Universitaire
    1, rue Michel Servet
    1211 Geneva 4
    Switzerland


    Telephone: (+41 22) 702 50 50
    Fax: (+41 22) 702 58 58
    Electronic mail address: Amos.Bairoch@isb-sib.ch
    WWW server: http://www.expasy.ch/


    Acknowledgements

    TrEMBL has been prepared by:

    o  Rolf Apweiler, Kirsty Bates, Margaret Biswas, Sergio Contrino,
       Kirill Degtyarenko, Wolfgang Fleischmann, Gill Fraser,
       Henning Hermjakob, Vivien Junker, Alexander Kanapin, Youla
       Karavidopoulou, Paul Kersey, Minna Lehvaslaiho, Michele Magrane,
       Maria Jesus Martin, Nicoletta Mitaritonna, Virginie Mittard,
       Steffen Moeller, Nicola Mulder, Claire O'Donovan, John F. O'Rourke,
       Isabelle Phan, Sandrine Pilbout, Eleanor Whitfield and Allyson Williams
       at the EMBL Outstation - European Bioinformatics Institute (EBI)
       in Hinxton, UK;
    o  Amos Bairoch and Alain Gateau at the Swiss Institute of Bioinformatics
       in Geneva, Switzerland.

    Copyright Notice
    TrEMBL copyright (c) 2000 EMBL-EBI
    This manual and the database it accompanies may be copied and
    redistributed freely, without advance permission, provided
    that this copyright statement is reproduced with each copy.

    Citation

    If you  want to  cite  TrEMBL  in  a  publication  please  use  the
    following reference:

              Bairoch A, and Apweiler R.
              The SWISS-PROT protein sequence data bank and its supplement
              TrEMBL in 2000.
              Nucl. Acids Res. 28:45-48(2000).


                         1. Introduction

TrEMBL is a computer-annotated protein sequence database supplementing the
SWISS-PROT Protein Sequence Data Bank. TrEMBL contains the translations of
all coding sequences (CDS) present in the EMBL Nucleotide Sequence Database
not yet integrated in SWISS-PROT. TrEMBL can be considered as a preliminary
section of SWISS-PROT. For all TrEMBL entries which should finally be
upgraded to the standard SWISS-PROT quality, SWISS-PROT accession numbers
have been assigned.


                        2. Why a supplement to SWISS-PROT?

The ongoing gene sequencing and mapping projects have dramatically
increased the number of protein sequences to be incorporated into SWISS-PROT.
We do not want to dilute the quality standards of SWISS-PROT by incorporating
sequences without proper sequence analysis and annotation, but we do want to
make the sequences available as fast as possible. TrEMBL achieves this second
goal, and is a major step in the process of speeding up subsequent
upgrading of annotation to the standard SWISS-PROT quality.
To address the problem of redundancy, the translations of all coding
sequences (CDS) in the EMBL Nucleotide Sequence Database already included
in SWISS-PROT have been removed from TrEMBL.


                             3. The Release

This TrEMBL release was created from the EMBL Nucleotide Sequence Database
release 66 and updates up to 01.05.01 and contains 540'195 sequence entries,
comprising 155'771'315 amino acids. To minimize redundancy, the translations
of all coding sequences (CDS) in the EMBL Nucleotide Sequence Database already
included in SWISS-PROT release 39.21 have been removed from TrEMBL release 17.

TrEMBL is split in two main sections: SP-TrEMBL and REM-TrEMBL:
SP-TrEMBL (SWISS-PROT TrEMBL) contains the entries (473'505) which should be
eventually incorporated into SWISS-PROT. SWISS-PROT accession numbers have
been assigned for all SP-TrEMBL entries.

SP-TrEMBL is organized in subsections:

arc.dat (Archaea):             14653 entries
fun.dat (Fungi):               12773 entries
hum.dat (Human):               24037 entries
inv.dat (Invertebrates):       56712 entries
mam.dat (Other Mammals):        8380 entries
mhc.dat (MHC proteins):         6821 entries
org.dat (Organelles):          41900 entries
phg.dat (Bacteriophages):       3895 entries
pln.dat (Plants):              50780 entries
pro.dat (Prokaryotes):        113140 entries
rod.dat (Rodents):             12312 entries
unc.dat (Unclassified):          135 entries
vrl.dat (Viruses):            108495 entries
vrt.dat (Other Vertebrates):    9812 entries

54'960 new entries have been integrated in SP-TrEMBL. The sequences of
609 SP-TrEMBL entries have been updated and the annotation has been
updated in 299'529 entries.

In the document deleteac.txt, you will find a list of all accession numbers
which were previously present in TrEMBL, but which have now been deleted from
the database.

REM-TrEMBL (REMaining TrEMBL) contains the entries (66'690) that we do
not want to include in SWISS-PROT. REM-TrEMBL entries have no accession
numbers. This section is organized in six subsections:

   1) Immunoglobulins and T-cell receptors (Immuno.dat)
      Most REM-TrEMBL entries are  immunoglobulins and  T-cell receptors. We
      stopped entering immunoglobulins and T-cell receptors into SWISS-PROT,
      because we only want to keep  the  germ line gene derived translations
      of these proteins in SWISS-PROT and not all known somatically recombined
      variations of  these proteins.  We would like to  create a specialized
      database  dealing  with  these  sequences  as a further  supplement to
      SWISS-PROT  and  keep  only a  representative  cross-section of  these
      proteins in SWISS-PROT.

   2) Synthetic sequences (Synth.dat)
      Another category of data that will not be included in SWISS-PROT is
      synthetic sequences.  Again, we do not want to  leave these entries in
      TrEMBL. Ideally one should build a specialized database for artificial
      sequences as a further supplement to SWISS-PROT.

   3) Patent application sequences (Patent.dat)
      A third  subsection consists of  coding sequences captured from patent
      applications.  A thorough  survey of  these  entries  has shown   that
      apart from a rather small minority  (which in  most cases have already
      been integrated in SWISS-PROT), most of these sequences contain either
      erroneous data or concern artificially generated sequences outside the
      scope of SWISS-PROT.

   4) Small fragments (Smalls.dat)
      Another subsection consists of fragments with fewer than eight amino
      acids.

   5) CDS not coding for real proteins (Pseudo.dat)
      This subsection consists of CDS translations where we have strong
      evidence to believe that these CDS are not coding for real proteins.

   6) Truncated proteins (Truncated.dat)
      The last subsection consists of truncated proteins which result from
      events like mutations introducing a stop codon leading to the truncation
      of the protein product.


                4. Format Differences Between SWISS-PROT and TrEMBL

The format and conventions used by TrEMBL follow those of SWISS-PROT as
closely as possible. Hence, it is not necessary to produce an additional
user manual and extensive release notes for TrEMBL. The information given
in the SWISS-PROT release notes and user manual is in general valid for
TrEMBL. The differences are described below.

The general structure of an entry is identical in SWISS-PROT and TrEMBL.
The data class used in TrEMBL (in the ID line) is always 'PRELIMINARY',
whereas in SWISS-PROT it is always 'STANDARD'.
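
Because the data class is the only structural difference in the ID line, it
can be used to tell which database an entry came from. The sketch below
assumes the ID line layout of the SWISS-PROT user manual (entry name, data
class, molecule type, sequence length); the sample lines and the function
name are illustrative, not taken from a real release.

```python
def id_line_source(line):
    """Return (entry_name, database) for a flat-file ID line."""
    if not line.startswith("ID "):
        raise ValueError(f"not an ID line: {line!r}")
    fields = line[2:].split()
    entry_name = fields[0]
    data_class = fields[1].rstrip(";")
    if data_class == "STANDARD":
        return entry_name, "SWISS-PROT"
    if data_class == "PRELIMINARY":
        return entry_name, "TrEMBL"
    raise ValueError(f"unexpected data class: {data_class!r}")


# Hypothetical example lines in the assumed layout.
print(id_line_source("ID   CYC_BOVIN      STANDARD;      PRT;   104 AA."))
print(id_line_source("ID   Q12345     PRELIMINARY;      PRT;   104 AA."))
```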

Differences in line types present in SWISS-PROT and TrEMBL:

The ID line (IDentification):

The entry name used in SP-TrEMBL is the same as the Accession Number of the
entry. The entry name used in REM-TrEMBL is the stable part of the protein_id
tagged to the corresponding CDS in the EMBL Nucleotide Sequence Database.
'protein_id' stands for the "Protein Identification" number. It is a number
that you will find in the feature table of the EMBL nucleotide sequence
entries in a qualifier called "/protein_id" which is tagged to every CDS.

Example:

FT   CDS             339..1514
FT                   /codon_start=1
FT                   /db_xref="PID:g1256015"
FT                   /product="dystrobrevin-epsilon"
FT                   /protein_id="AAC50431.1"

The protein_id is defined as follows in the DDBJ/EMBL/GenBank Feature Table
Definition documentation:
Qualifier          /protein_id
Definition         Protein Identifier, issued by International collaborators.
                   This qualifier consists of a stable ID portion (3+5 format
                   with 3 position letters and 5 numbers) plus a version
                   number after the decimal point.

Value format       <identifier>
Example            /protein_id="AAA12345.1"
Comment            When the protein sequence encoded by the CDS changes, only
                   the version number of the /protein_id value is incremented.
                   The stable part of the /protein_id remains unchanged and
                   as a result will permanently be associated with a given
                   protein. This qualifier is valid only on CDS features
                   which translate into a valid protein.

The DT line (DaTe)

The format of the DT lines, which indicate when an entry was created and
updated, is identical to that defined in SWISS-PROT, but the DT lines in
TrEMBL refer to the TrEMBL release. The difference is shown in the example
below.

    DT lines in a SWISS-PROT entry:

    DT   01-JAN-1988 (Rel. 06, Created)
    DT   01-JUL-1989 (Rel. 11, Last sequence update)
    DT   01-AUG-1992 (Rel. 23, Last annotation update)

    DT lines in a TrEMBL entry:

    DT   01-NOV-1996 (TrEMBLrel. 01, Created)
    DT   01-NOV-1996 (TrEMBLrel. 01, Last sequence update)
    DT   01-FEB-1997 (TrEMBLrel. 02, Last annotation update)


                5. Weekly updates of TrEMBL and non-redundant data sets

Weekly cumulative updates of TrEMBL are available by anonymous FTP and
from the EBI SRS server.

We also produce every week a complete non-redundant protein sequence
collection by providing three compressed files (these are in the directory
/pub/databases/sp_tr_nrdb on the EBI FTP server and in databases/sp_tr_nrdb
on the ExPASy server): sprot.dat.Z, trembl.dat.Z and trembl_new.dat.Z.
This set of non-redundant files is especially important for two types of
users:
(i) Managers of similarity search services. They can now provide what is
currently the most comprehensive and non-redundant data set of protein
sequences.
(ii) Anybody wanting to update their full copy of SWISS-PROT + TrEMBL to
their own schedule without having to wait for full releases of SWISS-PROT
or of TrEMBL.

We also recently introduced Varsplic Expand which is a program to generate
"expanded" sequences from SWISS-PROT records i.e. sequences including the
variants specified by the varsplic, variant and conflict annotations. New
records are produced in either pseudo-SWISS-PROT or FASTA format for each
specified variant. More information and the data are available at
ftp://ftp.ebi.ac.uk/pub/databases/sp_tr_nrdb/


                6. Access/Data Distribution

FTP server:     ftp.ebi.ac.uk/pub/databases/trembl
SRS server:     http://srs.ebi.ac.uk/

TrEMBL is also available on the SWISS-PROT CD-ROM.
SWISS-PROT + TrEMBL is searchable on the following servers at the EBI:

FASTA3  (http://www.ebi.ac.uk/fasta3/)
BLAST2  (http://www.ebi.ac.uk/blast2/)
Bic_sw  (http://www.ebi.ac.uk/bic_sw/)
Scanps  (http://www.ebi.ac.uk/scanps/)
MPSrch  (http://www.ebi.ac.uk/MPsrch/)


                7. Planned Changes

7.1 We are introducing evidence tags to SWISS-PROT and TrEMBL entries.
    The aim of this is to allow users to see where data items came from
    and to enable SWISS-PROT staff to automatically update data if the
    underlying evidence changes. This is currently ongoing internally
    and we hope to provide a public version by the end of 2001. For more
    information, please see
    ftp://ftp.ebi.ac.uk/pub/databases/trembl/evidenceDocumentation.html
    We would welcome any feedback from the user community.
  

TrEMBL release 16.0

Published March 1, 2001

                              TrEMBL Release Notes
                              Release 16, March 2001

    EMBL Outstation
    European Bioinformatics Institute (EBI)
    Wellcome Trust Genome Campus
    Hinxton
    Cambridge CB10 1SD
    United Kingdom

    Telephone: (+44 1223) 494 444
    Fax: (+44 1223) 494 468
    Electronic mail address: DATALIB@EBI.AC.UK
    WWW server: http://www.ebi.ac.uk/

    Swiss Institute of Bioinformatics (SIB)
    Centre Medical Universitaire
    1, rue Michel Servet
    1211 Geneva 4
    Switzerland

    Telephone: (+41 22) 702 50 50
    Fax: (+41 22) 702 58 58
    Electronic mail address: Amos.Bairoch@isb-sib.ch
    WWW server: http://www.expasy.ch/


    Acknowledgements

    TrEMBL has been prepared by:

    o  Rolf Apweiler, Kirsty Bates, Margaret Biswas, Sergio Contrino,
       Kirill Degtyarenko, Wolfgang Fleischmann, Gill Fraser, Cathy Gedman,
       Henning Hermjakob, Vivien Junker, Alexander Kanapin, Youla
       Karavidopoulou, Paul Kersey, Minna Lehvaslaiho, Michele Magrane,
       Maria Jesus Martin, Nicoletta Mitaritonna, Virginie Mittard,
       Steffen Moeller, Nicola Mulder, Claire O'Donovan, John F. O'Rourke,
       Isabelle Phan, Sandrine Pilbout, Lucia Rodriguez-Monge,
       Eleanor Whitfield and Allyson Williams
       at the EMBL Outstation - European Bioinformatics Institute (EBI)
       in Hinxton, UK;
    o  Amos Bairoch and Alain Gateau at the Swiss Institute of Bioinformatics
       in Geneva, Switzerland.

    Copyright Notice
    TrEMBL copyright (c) 2000 EMBL-EBI
    This manual and the database it accompanies may be copied and
    redistributed freely, without advance permission, provided
    that this copyright statement is reproduced with each copy.

    Citation

    If you  want to  cite  TrEMBL  in  a  publication  please  use  the
    following reference:


              Bairoch A, and Apweiler R.
              The SWISS-PROT protein sequence data bank and its supplement
              TrEMBL in 2000.
              Nucl. Acids Res. 28:45-48(2000).


                         1. Introduction


TrEMBL is a computer-annotated protein sequence database supplementing the
SWISS-PROT Protein Sequence Data Bank. TrEMBL contains the translations of
all coding sequences (CDS) present in the EMBL Nucleotide Sequence Database
not yet integrated in SWISS-PROT. TrEMBL can be considered as a preliminary
section of SWISS-PROT. For all TrEMBL entries which should finally be
upgraded to the standard SWISS-PROT quality, SWISS-PROT accession numbers
have been assigned.

                        2. Why a supplement to SWISS-PROT?

The ongoing gene sequencing and mapping projects have dramatically
increased the number of protein sequences to be incorporated into SWISS-PROT.
We do not want to dilute the quality standards of SWISS-PROT by incorporating
sequences without proper sequence analysis and annotation, but we do want to
make the sequences available as fast as possible. TrEMBL achieves this second
goal, and is a major step in the process of speeding up subsequent
upgrading of annotation to the standard SWISS-PROT quality.
To address the problem of redundancy, the translations of all coding
sequences (CDS) in the EMBL Nucleotide Sequence Database already included
in SWISS-PROT have been removed from TrEMBL.


                             3. The Release

This TrEMBL release was created from the EMBL Nucleotide Sequence Database
release 65 and updates up to 22.01.01 and contains 489'620 sequence entries,
comprising 141'347'364 amino acids. To minimise redundancy, the translations of
all coding sequences (CDS) in the EMBL Nucleotide Sequence Database already
included in SWISS-PROT release 40 and updates up to 21.2.2001 have been
removed from TrEMBL release 16.

TrEMBL is split in two main sections: SP-TrEMBL and REM-TrEMBL:
SP-TrEMBL (SWISS-PROT TrEMBL) contains the entries (425'026) which should be
eventually incorporated into SWISS-PROT. SWISS-PROT accession numbers have
been assigned for all SP-TrEMBL entries.

SP-TrEMBL is organised in subsections:

arc.dat (Archaea):             15191 entries
fun.dat (Fungi):               11819 entries
hum.dat (Human):               21314 entries
inv.dat (Invertebrates):       54506 entries
mam.dat (Other Mammals):        7281 entries
mhc.dat (MHC proteins):         6568 entries
org.dat (Organelles):          38007 entries
phg.dat (Bacteriophages):       3301 entries
pln.dat (Plants):              46050 entries
pro.dat (Prokaryotes):         98330 entries
rod.dat (Rodents):             12312 entries
unc.dat (Unclassified):           54 entries
vrl.dat (Viruses):            101091 entries
vrt.dat (Other Vertebrates):    9202 entries
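
The subsection counts listed above can be cross-checked against the
SP-TrEMBL total of 425'026 entries quoted for this release. The short
script below is only an arithmetic check. The dictionary name is
illustrative and the figures are copied from the list above.

```python
# Entry counts per SP-TrEMBL subsection, release 16 (from the list above).
SP_TREMBL_REL16 = {
    "arc": 15191, "fun": 11819, "hum": 21314, "inv": 54506,
    "mam": 7281, "mhc": 6568, "org": 38007, "phg": 3301,
    "pln": 46050, "pro": 98330, "rod": 12312, "unc": 54,
    "vrl": 101091, "vrt": 9202,
}

total = sum(SP_TREMBL_REL16.values())
print(total)  # 425026, matching the quoted SP-TrEMBL total
```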

59'565 new entries have been integrated in SP-TrEMBL. The sequences of
1263 SP-TrEMBL entries have been updated and the annotation has been
updated in 198'646 entries.

In the document deleteac.txt, you will find a list of all accession numbers
which were previously present in TrEMBL, but which have now been deleted from
the database.

REM-TrEMBL (REMaining TrEMBL) contains the entries (64'594) that we do
not want to include in SWISS-PROT. REM-TrEMBL entries have no accession
numbers. This section is organised in six subsections:

   1) Immunoglobulins and T-cell receptors (Immuno.dat)
      Most REM-TrEMBL entries are  immunoglobulins and  T-cell receptors. We
      stopped entering immunoglobulins and T-cell receptors into SWISS-PROT,
      because we only want to keep  the  germ line gene derived translations
      of these proteins in SWISS-PROT and not all known somatically recombined
      variations of  these proteins.  We would like to  create a specialised
      database  dealing  with  these  sequences  as a further  supplement to
      SWISS-PROT  and  keep  only a  representative  cross-section of  these
      proteins in SWISS-PROT.

   2) Synthetic sequences (Synth.dat)
      Another category of data that will not be included in SWISS-PROT is
      synthetic sequences.  Again, we do not want to  leave these entries in
      TrEMBL. Ideally one should build a specialised database for artificial
      sequences as a further supplement to SWISS-PROT.

   3) Patent application sequences (Patent.dat)
      A third  subsection consists of  coding sequences captured from patent
      applications.  A thorough  survey of  these  entries  has shown   that
      apart from a rather small minority  (which in  most cases have already
      been integrated in SWISS-PROT), most of these sequences contain either
      erroneous data or concern artificially generated sequences outside the
      scope of SWISS-PROT.

   4) Small fragments (Smalls.dat)
      Another subsection consists of fragments with fewer than eight amino
      acids.

   5) CDS not coding for real proteins (Pseudo.dat)
      This subsection consists of CDS translations where we have strong
      evidence to believe that these CDS are not coding for real proteins.

   6) Truncated proteins (Truncated.dat)
      The last subsection consists of truncated proteins which result from
      events like mutations introducing a stop codon leading to the truncation
      of the protein product.

                4. Format Differences Between SWISS-PROT and TrEMBL

The format and conventions used by TrEMBL follow those of SWISS-PROT as
closely as possible. Hence, it is not necessary to produce an additional
user manual and extensive release notes for TrEMBL. The information given
in the SWISS-PROT release notes and user manual is in general valid for
TrEMBL. The differences are described below.

The general structure of an entry is identical in SWISS-PROT and TrEMBL.
The data class used in TrEMBL (in the ID line) is always 'PRELIMINARY',
whereas in SWISS-PROT it is always 'STANDARD'.

Differences in line types present in SWISS-PROT and TrEMBL:

The ID line (IDentification):

The entry name used in SP-TrEMBL is the same as the Accession Number of the
entry. The entry name used in REM-TrEMBL is the stable part of the protein_id
tagged to the corresponding CDS in the EMBL Nucleotide Sequence Database.
'protein_id' stands for the "Protein Identification" number. It is a number
that you will find in the feature table of the EMBL nucleotide sequence
entries in a qualifier called "/protein_id" which is tagged to every CDS.

Example:

FT   CDS             339..1514
FT                   /codon_start=1
FT                   /db_xref="PID:g1256015"
FT                   /product="dystrobrevin-epsilon"
FT                   /protein_id="AAC50431.1"

The protein_id is defined as follows in the DDBJ/EMBL/GenBank Feature Table
Definition documentation:
Qualifier          /protein_id
Definition         Protein Identifier, issued by International collaborators.
                   This qualifier consists of a stable ID portion (3+5 format
                   with 3 position letters and 5 numbers) plus a version
                   number after the decimal point.

Value format       <identifier>
Example            /protein_id="AAA12345.1"
Comment            When the protein sequence encoded by the CDS changes, only
                   the version number of the /protein_id value is incremented.
                   The stable part of the /protein_id remains unchanged and
                   as a result will permanently be associated with a given
                   protein. This qualifier is valid only on CDS features
                   which translate into a valid protein.
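
To illustrate how a REM-TrEMBL entry name could be derived from the CDS
feature shown in the example above, the sketch below collects the
qualifiers from the FT lines and keeps the part of /protein_id before the
version dot. It assumes one qualifier per line and a single CDS. The
function names are hypothetical and this is not the procedure used to
build TrEMBL.

```python
import re

FT_EXAMPLE = """\
FT   CDS             339..1514
FT                   /codon_start=1
FT                   /db_xref="PID:g1256015"
FT                   /product="dystrobrevin-epsilon"
FT                   /protein_id="AAC50431.1"
"""

# Matches continuation lines of the form: FT   /name=value
QUALIFIER_RE = re.compile(r"^FT\s+/(\w+)=(.*)$")


def cds_qualifiers(ft_text):
    """Collect single-line qualifiers from a block of FT lines."""
    qualifiers = {}
    for line in ft_text.splitlines():
        match = QUALIFIER_RE.match(line)
        if match:
            qualifiers[match.group(1)] = match.group(2).strip('"')
    return qualifiers


def rem_trembl_entry_name(qualifiers):
    """Return the stable part of /protein_id (text before the version)."""
    return qualifiers["protein_id"].split(".", 1)[0]


print(rem_trembl_entry_name(cds_qualifiers(FT_EXAMPLE)))
```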


The DT line (DaTe)

The format of the DT lines, which indicate when an entry was created and
updated, is identical to that defined in SWISS-PROT, but the DT lines in
TrEMBL refer to the TrEMBL release. The difference is shown in the example
below.

    DT lines in a SWISS-PROT entry:

    DT   01-JAN-1988 (Rel. 06, Created)
    DT   01-JUL-1989 (Rel. 11, Last sequence update)
    DT   01-AUG-1992 (Rel. 23, Last annotation update)

    DT lines in a TrEMBL entry:

    DT   01-NOV-1996 (TrEMBLrel. 01, Created)
    DT   01-NOV-1996 (TrEMBLrel. 01, Last sequence update)
    DT   01-FEB-1997 (TrEMBLrel. 02, Last annotation update)

                5. Weekly updates of TrEMBL and non-redundant data sets

Weekly cumulative updates of TrEMBL are available by anonymous FTP and
from the EBI SRS server.

We also produce every week a complete non-redundant protein sequence
collection by providing three compressed files (these are in the directory
/pub/databases/sp_tr_nrdb on the EBI FTP server and in databases/sp_tr_nrdb
on the ExPASy server): sprot.dat.Z, trembl.dat.Z and trembl_new.dat.Z.
This set of non-redundant files is especially important for two types of
users:
(i) Managers of similarity search services. They can now provide what is
currently the most comprehensive and non-redundant data set of protein
sequences.
(ii) Anybody wanting to update their full copy of SWISS-PROT + TrEMBL to
their own schedule without having to wait for full releases of SWISS-PROT
or of TrEMBL.

We also recently introduced Varsplic Expand which is a program to generate
"expanded" sequences from SWISS-PROT records i.e. sequences including the
variants specified by the varsplic, variant and conflict annotations. New
records are produced in either pseudo-SWISS-PROT or FASTA format for each
specified variant. More information and the data are available at
ftp://ftp.ebi.ac.uk/pub/databases/sp_tr_nrdb/

                6. Access/Data Distribution

FTP server:     ftp.ebi.ac.uk/pub/databases/trembl
SRS server:     http://srs.ebi.ac.uk/

TrEMBL is also available on the SWISS-PROT CD-ROM.
SWISS-PROT + TrEMBL is searchable on the following servers at the EBI:

FASTA3  (http://www.ebi.ac.uk/fasta3/)
BLAST2  (http://www.ebi.ac.uk/blast2/)
Bic_sw  (http://www.ebi.ac.uk/bic_sw/)
Scanps  (http://www.ebi.ac.uk/scanps/)
MPSrch  (http://www.ebi.ac.uk/MPsrch/)


                7. Description of changes made to TrEMBL since release 15.

7.1  Changes concerning cross-references (DR line)

We have added cross-references from TrEMBL to the SMART database
available at http://smart.embl-heidelberg.de and converted the database name to
mixed case as appropriate.

A list of all cross-references in TrEMBL to other databases is provided below:
Aarhus/Ghent-2DPAGE
EMBL
FlyBase
GCRDb
HSSP
Mendel
MGD
MIM
PDB
InterPro
Pfam
PIR
PRINTS
ProDom
PROSITE
SMART
SGD
SubtiList
TIGR
TRANSFAC
WormPep
ZFIN

7.2 Please note that the TrEMBL files are distributed from the EBI's ftp
    server as gzip files (*.gz) as of this release instead of compress (*.Z)
    files.
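
    Scripts that previously piped the *.Z files through an external
    uncompress step can read the gzip files directly. The sketch below
    counts the entries in a gzip-compressed flat file. It assumes that each
    entry ends with a "//" terminator line, as in the SWISS-PROT flat-file
    format. The function name is illustrative.

```python
import gzip


def count_entries(path):
    """Count flat-file entries in a gzip-compressed .dat.gz file."""
    count = 0
    # Open in text mode so each line is decoded on the fly.
    with gzip.open(path, "rt", encoding="ascii", errors="replace") as handle:
        for line in handle:
            if line.startswith("//"):  # assumed entry terminator
                count += 1
    return count
```

    For a locally downloaded subsection file, a call such as
    count_entries("arc.dat.gz") would return the number of entries it
    holds (the local path is hypothetical).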

                8. Planned Changes

8.1 We are introducing evidence tags to SWISS-PROT and TrEMBL entries.
    The aim is to allow users to see where data items came from
    and to enable SWISS-PROT staff to automatically update data if the
    underlying evidence changes. This work is currently under way internally
    and we hope to provide a public version by the end of 2001. For more
    information, please see
    ftp://ftp.ebi.ac.uk/pub/databases/trembl/evidenceDocumentation.html
    We would welcome any feedback from the user community.


  

TrEMBL release 15.0

Published October 30, 2000

                              TrEMBL Release Notes
                              Release 15, October 2000

    EMBL Outstation
    European Bioinformatics Institute (EBI)
    Wellcome Trust Genome Campus
    Hinxton
    Cambridge CB10 1SD
    United Kingdom

    Telephone: (+44 1223) 494 444
    Fax: (+44 1223) 494 468
    Electronic mail address: DATALIB@EBI.AC.UK
    WWW server: http://www.ebi.ac.uk/

    Swiss Institute of Bioinformatics (SIB)
    Centre Medical Universitaire
    1, rue Michel Servet
    1211 Geneva 4
    Switzerland

    Telephone: (+41 22) 702 50 50
    Fax: (+41 22) 702 58 58
    Electronic mail address: BAIROCH@CMU.UNIGE.CH
    WWW server: http://www.expasy.ch/


    Acknowledgements

    TrEMBL has been prepared by:

    o  Rolf Apweiler, Kirsty Bates, Margaret Biswas, Sergio Contrino,
       Kirill Degtyarenko, Wolfgang Fleischmann, Gill Fraser, Cathy Gedman,
       Henning Hermjakob, Vivien Junker, Alexander Kanapin, Youla
       Karavidopoulou, Paul Kersey, Fiona Lang, Minna Lehvaslaiho,
       Michele Magrane, Maria Jesus Martin, Nicoletta Mitaritonna, Virginie
       Mittard, Steffen Moeller, Nicola Mulder, Claire O'Donovan, John F.
       O'Rourke, Isabelle Phan, Sandrine Pilbout, Lucia Rodriguez-Monge,
       Eleanor Whitfield and Allyson Williams
       at the EMBL Outstation - European Bioinformatics Institute (EBI)
       in Hinxton, UK;
    o  Amos Bairoch and Alain Gateau at the Swiss Institute of Bioinformatics
       in Geneva, Switzerland.

    Copyright Notice
    TrEMBL copyright (c) 2000 EMBL-EBI
    This manual and the database it accompanies may be copied and
    redistributed freely, without advance permission, provided
    that this copyright statement is reproduced with each copy.

    Citation

    If you  want to  cite  TrEMBL  in  a  publication  please  use  the
    following reference:


              Bairoch A, and Apweiler R.
              The SWISS-PROT protein sequence data bank and its supplement
              TrEMBL in 2000.
              Nucl. Acids Res. 28:45-48(2000).


                         1. Introduction


TrEMBL is a computer-annotated protein sequence database supplementing the
SWISS-PROT Protein Sequence Data Bank. TrEMBL contains the translations of
all coding sequences (CDS) present in the EMBL Nucleotide Sequence Database
not yet integrated in SWISS-PROT. TrEMBL can be considered as a preliminary
section of SWISS-PROT. For all TrEMBL entries which should finally be
upgraded to the standard SWISS-PROT quality, SWISS-PROT accession numbers
have been assigned.

                        2. Why a supplement to SWISS-PROT?

The ongoing gene sequencing and mapping projects have dramatically
increased the number of protein sequences to be incorporated into SWISS-PROT.
We do not want to dilute the quality standards of SWISS-PROT by incorporating
sequences without proper sequence analysis and annotation, but we do want to
make the sequences available as fast as possible. TrEMBL achieves this second
goal, and is a major step in the process of speeding up subsequent
upgrading of annotation to the standard SWISS-PROT quality.
To address the problem of redundancy, the translations of all coding
sequences (CDS) in the EMBL Nucleotide Sequence Database already included
in SWISS-PROT have been removed from TrEMBL.


                             3. The Release

This TrEMBL release was created from the EMBL Nucleotide Sequence Database
release 64 and contains 431'424 sequence entries, comprising 124'294'926 amino
acids. To minimize redundancy, the translations of all coding sequences (CDS)
in the EMBL Nucleotide Sequence Database already included in SWISS-PROT release
39 and updates up to 01.9.2000 have been removed from TrEMBL release 15.

TrEMBL is split in two main sections: SP-TrEMBL and REM-TrEMBL:
SP-TrEMBL (SWISS-PROT TrEMBL) contains the entries (374'700) which should be
eventually incorporated into SWISS-PROT. SWISS-PROT accession numbers have
been assigned for all SP-TrEMBL entries.

SP-TrEMBL is organized in subsections:

arc.dat (Archaea):             11801 entries
fun.dat (Fungi):               10994 entries
hum.dat (Human):               18327 entries
inv.dat (Invertebrates):       51398 entries
mam.dat (Other Mammals):        6332 entries
mhc.dat (MHC proteins):         6111 entries
org.dat (Organelles):          32082 entries
phg.dat (Bacteriophages):       3016 entries
pln.dat (Plants):              40283 entries
pro.dat (Prokaryotes):         88169 entries
rod.dat (Rodents):             11849 entries
unc.dat (Unclassified):           51 entries
vrl.dat (Viruses):             86365 entries
vrt.dat (Other Vertebrates):    7922 entries
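
The subsection counts can be cross-checked against the totals quoted for
this release. The short sketch below sums the fourteen SP-TrEMBL subsections
and subtracts the result from the overall entry count of 431'424; the
variable names are chosen for this example only.

```python
# Entry counts per SP-TrEMBL subsection, copied from the list above.
sp_trembl = {
    "arc": 11801, "fun": 10994, "hum": 18327, "inv": 51398,
    "mam": 6332, "mhc": 6111, "org": 32082, "phg": 3016,
    "pln": 40283, "pro": 88169, "rod": 11849, "unc": 51,
    "vrl": 86365, "vrt": 7922,
}
total_entries = 431424  # all of TrEMBL release 15

sp_total = sum(sp_trembl.values())
rem_total = total_entries - sp_total

print(sp_total)   # 374700, the SP-TrEMBL entry count
print(rem_total)  # 56724, the REM-TrEMBL entry count
```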

79'958 new entries have been integrated in SP-TrEMBL. The sequences of
621 SP-TrEMBL entries have been updated and the annotation has been
updated in 70'559 entries.

In the document deleteac.txt, you will find a list of all accession numbers
which were previously present in TrEMBL, but which have now been deleted from
the database.

REM-TrEMBL (REMaining TrEMBL) contains the entries (56'724) that we do
not want to include in SWISS-PROT. REM-TrEMBL entries have no accession
numbers. This section is organized in six subsections:

   1) Immunoglobulins and T-cell receptors (Immuno.dat)
      Most REM-TrEMBL entries are  immunoglobulins and  T-cell receptors. We
      stopped entering immunoglobulins and T-cell receptors into SWISS-PROT,
      because we only want to keep  the  germ line gene derived translations
      of these proteins in SWISS-PROT and not all known somatic recombinated
      variations of  these proteins.  We would like to  create a specialized
      database  dealing  with  these  sequences  as a further  supplement to
      SWISS-PROT  and  keep  only a  representative  cross-section of  these
      proteins in SWISS-PROT.

   2) Synthetic sequences (Synth.dat)
      Another category of data that will not be included in SWISS-PROT is
      synthetic sequences.  Again, we do not want to  leave these entries in
      TrEMBL. Ideally one should build a specialized database for artificial
      sequences as a further supplement to SWISS-PROT.

   3) Patent application sequences (Patent.dat)
      A third  subsection consists of  coding sequences captured from patent
      applications.  A thorough  survey of  these  entries  has shown  that
      apart from a rather small minority  (which in  most cases have already
      been integrated in SWISS-PROT), most of these sequences contain either
      erroneous data or concern artificially generated sequences outside the
      scope of SWISS-PROT.

   4) Small fragments (Smalls.dat)
      Another  subsection  consists of fragments  with less than eight amino
      acids.

   5) CDS not coding for real proteins (Pseudo.dat)
      This subsection consists of CDS translations where we have strong
      evidence to believe that these CDS are not coding for real proteins.

   6) Truncated proteins (Truncated.dat)
      The last subsection consists of truncated proteins which result from
      events like mutations introducing a stop codon leading to the truncation
      of the protein product.

                4. Format Differences Between SWISS-PROT and TrEMBL

The format and conventions used by TrEMBL follow as closely as possible
those of SWISS-PROT. Hence, it is not necessary to produce an additional
user manual and extensive release notes for TrEMBL. The information given
in the SWISS-PROT release notes and user manual is in general valid for
TrEMBL. The differences are mentioned below.

The general structure of an entry is identical in SWISS-PROT and TrEMBL.
The data class used in TrEMBL (in the ID line) is always 'PRELIMINARY',
whereas in SWISS-PROT it is always 'STANDARD'.
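
A program can use the data class to tell the two databases apart. The sketch
below is illustrative only: it assumes the ID line layout of the SWISS-PROT
user manual (entry name, data class, molecule type, sequence length), which
is not spelled out in these notes, and the function names and the sample
lines in the usage note are hypothetical.

```python
def data_class(id_line):
    """Return the data class token (e.g. 'PRELIMINARY') from an ID line."""
    if not id_line.startswith("ID   "):
        raise ValueError(f"not an ID line: {id_line!r}")
    fields = id_line[5:].split()
    # Assumed layout: fields[0] is the entry name, fields[1] the data class
    # followed by a semicolon.
    return fields[1].rstrip(";")

def is_trembl_entry(id_line):
    """True if the entry carries the TrEMBL data class."""
    return data_class(id_line) == "PRELIMINARY"
```

With a made-up line such as "ID   Q12345         PRELIMINARY;   PRT;   250 AA."
is_trembl_entry returns True; with "ID   CYC_BOVIN      STANDARD;      PRT;   104 AA."
it returns False.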

Differences in line types present in SWISS-PROT and TrEMBL:

The ID line (IDentification):

The entry name used in SP-TrEMBL is the same as the Accession Number of the
entry. The entry name used in REM-TrEMBL is the stable part of the protein_id
tagged to the corresponding CDS in the EMBL Nucleotide Sequence Database.
'protein_id' stands for the "Protein Identification" number. It is a number
that you will find in the feature table of the EMBL nucleotide sequence
entries in a qualifier called "/protein_id" which is tagged to every CDS.

Example:

FT   CDS             339..1514
FT                   /codon_start=1
FT                   /db_xref="PID:g1256015"
FT                   /product="dystrobrevin-epsilon"
FT                   /protein_id="AAC50431.1"

The protein_id is defined as follows in the DDBJ/EMBL/GenBank Feature Table
Definition documentation:
Qualifier          /protein_id
Definition         Protein Identifier, issued by International collaborators.
                   This qualifier consists of a stable ID portion (3+5 format
                   with 3 position letters and 5 numbers) plus a version
                   number after the decimal point.

Value format       <identifier>
Example            /protein_id="AAA12345.1"
Comment            When the protein sequence encoded by the CDS changes, only
                   the version number of the /protein_id value is incremented.
                   The stable part of the /protein_id remains unchanged and
                   as a result will permanently be associated with a given
                   protein. This qualifier is valid only on CDS features
                   which translate into a valid protein.
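
The definition above can be sketched as a small parser that separates the
stable part (used as the REM-TrEMBL entry name) from the version number. The
regular expression encodes only the 3+5 format described here; the function
name split_protein_id is hypothetical.

```python
import re

# Stable part: 3 letters + 5 digits; version: digits after the decimal point.
PROTEIN_ID = re.compile(r"^([A-Z]{3}\d{5})\.(\d+)$")

def split_protein_id(value):
    """Return (stable_id, version) for a /protein_id value."""
    match = PROTEIN_ID.match(value)
    if match is None:
        raise ValueError(f"not in 3+5 protein_id format: {value!r}")
    return match.group(1), int(match.group(2))

stable_id, version = split_protein_id("AAC50431.1")
print(stable_id, version)  # AAC50431 1
```

Because only the version number changes when the sequence changes, the
stable part returned first stays the same across sequence updates.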


The DT line (DaTe)

The format of the DT lines, which indicate when an entry was created and
updated, is identical to that defined in SWISS-PROT, but the DT lines in
TrEMBL refer to the TrEMBL release. The difference is shown in the
example below.

    DT lines in a SWISS-PROT entry:

    DT   01-JAN-1988 (Rel. 06, Created)
    DT   01-JUL-1989 (Rel. 11, Last sequence update)
    DT   01-AUG-1992 (Rel. 23, Last annotation update)

    DT lines in a TrEMBL entry:

    DT   01-NOV-1996 (TrEMBLrel. 01, Created)
    DT   01-NOV-1996 (TrEMBLrel. 01, Last sequence update)
    DT   01-FEB-1997 (TrEMBLrel. 02, Last annotation update)
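
    A parser that accepts both variants might look like the sketch below.
    It is based only on the six example lines shown here, so the pattern is
    an assumption rather than a complete specification, and the name
    parse_dt is hypothetical.

```python
import re
from datetime import date

MONTHS = {name: number for number, name in enumerate(
    "JAN FEB MAR APR MAY JUN JUL AUG SEP OCT NOV DEC".split(), start=1)}

# "Rel." marks a SWISS-PROT release, "TrEMBLrel." a TrEMBL release.
DT_LINE = re.compile(
    r"^DT   (\d{2})-([A-Z]{3})-(\d{4}) \((Rel\.|TrEMBLrel\.) (\d+), (.+)\)$")

def parse_dt(line):
    """Return (date, database, release, event) for one DT line."""
    match = DT_LINE.match(line.strip())
    if match is None:
        raise ValueError(f"not a recognised DT line: {line!r}")
    day, month, year, prefix, release, event = match.groups()
    database = "TrEMBL" if prefix == "TrEMBLrel." else "SWISS-PROT"
    return date(int(year), MONTHS[month], int(day)), database, int(release), event

print(parse_dt("DT   01-FEB-1997 (TrEMBLrel. 02, Last annotation update)"))
```

    The month names are mapped by hand so that the result does not depend on
    the locale settings of the machine running the script.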

                5. Weekly updates of TrEMBL and non-redundant data sets

Weekly cumulative updates of TrEMBL are available by anonymous FTP and
from the EBI SRS server.

We also produce every week a complete non-redundant protein sequence
collection by providing three compressed files (these are in the directory
/pub/databases/sp_tr_nrdb on the EBI FTP server and in databases/sp_tr_nrdb
on the ExPASy server): sprot.dat.Z, trembl.dat.Z and trembl_new.dat.Z.
This set of non-redundant files is especially important for two types of
users:
(i) Managers of similarity search services. They can now provide what is
currently the most comprehensive and non-redundant data set of protein
sequences.
(ii) Anybody wanting to update their full copy of SWISS-PROT + TrEMBL to
their own schedule without having to wait for full releases of SWISS-PROT
or of TrEMBL.

We also recently introduced Varsplic Expand, a program that generates
"expanded" sequences from SWISS-PROT records, i.e. sequences including the
variants specified by the varsplic, variant and conflict annotations. New
records are produced in either pseudo-SWISS-PROT or FASTA format for each
specified variant. More information and the data are available at
ftp://ftp.ebi.ac.uk/pub/databases/sp_tr_nrdb/

                6. Access/Data Distribution

FTP server:     ftp.ebi.ac.uk/pub/databases/trembl
SRS server:     http://srs.ebi.ac.uk/

TrEMBL is also available on the SWISS-PROT CD-ROM.
SWISS-PROT + TrEMBL is searchable on the following servers at the EBI:

FASTA3  (http://www.ebi.ac.uk/fasta3/)
BLAST2  (http://www.ebi.ac.uk/blast2/)
Bic_sw  (http://www.ebi.ac.uk/bic_sw/)
Scanps  (http://www.ebi.ac.uk/scanps/)
MPSrch  (http://www.ebi.ac.uk/MPsrch/)


                7. Description of changes made to TrEMBL since release 14.

7.1  Changes concerning cross-references (DR line)

We have added cross-references from TrEMBL to the REBASE database
available at ftp://ftp.neb.com/pub/rebase and ftp://ftp.ebi.ac.uk/pub/databases/rebase

A list of all cross-references in TrEMBL to other databases is provided below:
AARHUS/GHENT-2DPAGE
EMBL
FLYBASE
GCRDB
HSSP
MENDEL
MGD
MIM
PDB
INTERPRO
PFAM
PIR
PRINTS
PRODOM
PROSITE
SGD
SUBTILIST
TIGR
TRANSFAC
WORMPEP
ZFIN

7.2 Introduction of a new line type and code (OX line)

The OX (Organism taxonomy cross-reference) line contains
cross-references to taxonomy resource(s) for the organism(s)
listed in the OS line(s) of the entry.
The format of the OX line is:

OX   TAXONOMY_RESOURCE_NAME=IDENTIFIER_1[, IDENTIFIER_2,... IDENTIFIER_N];

Where the currently defined taxonomic resources are:

Name:     NCBI_TaxID
Resource: National Center for Biotechnology Information Taxonomy Browser

- The identifiers are listed in the order of the corresponding organisms
in the OS line(s).
- In the rare case where all the information will not fit on a single
line, more than one OX line is used.

Examples:
OX   NCBI_TaxID=256;
OX   NCBI_TaxID=126566, 38, 846, 23412;

The OX line(s) occur after the OC lines in the entry line order.
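
A single OX line in this format can be read as sketched below. The function
name parse_ox is hypothetical, and the rare multi-line case is deliberately
not handled because its exact layout is not shown in these notes.

```python
def parse_ox(ox_line):
    """Return (resource_name, [identifiers]) for a single OX line."""
    if not ox_line.startswith("OX"):
        raise ValueError(f"not an OX line: {ox_line!r}")
    # Drop the two-letter line code, surrounding blanks and the final ';'.
    body = ox_line[2:].strip().rstrip(";")
    resource, _, id_list = body.partition("=")
    identifiers = [int(token) for token in id_list.split(",") if token.strip()]
    return resource, identifiers

print(parse_ox("OX   NCBI_TaxID=126566, 38, 846, 23412;"))
# ('NCBI_TaxID', [126566, 38, 846, 23412])
```

The identifiers are returned in the order given, so they can be paired with
the organisms listed in the OS line(s).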


7.3 Modification of the RX line type

We extended the scope of the RX line to include PubMed
as well as Medline. We also changed the syntax of this
line.
An example is given below:

RX   MEDLINE=99382246; PubMed=10448073;
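
The new syntax is a list of NAME=IDENTIFIER pairs, each terminated by a
semicolon, which makes it straightforward to split. The sketch below is
inferred from the single example above; parse_rx is a hypothetical name.

```python
def parse_rx(rx_line):
    """Return a dict mapping bibliographic database name to identifier."""
    if not rx_line.startswith("RX"):
        raise ValueError(f"not an RX line: {rx_line!r}")
    references = {}
    # Skip the line code, then split the remainder on ';'.
    for item in rx_line[2:].split(";"):
        item = item.strip()
        if item:
            database, _, identifier = item.partition("=")
            references[database] = identifier
    return references

print(parse_rx("RX   MEDLINE=99382246; PubMed=10448073;"))
# {'MEDLINE': '99382246', 'PubMed': '10448073'}
```

Identifiers are kept as strings so that no leading zeros are lost.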

                8. Planned Changes

8.1 Please note that the TrEMBL files will be distributed from the EBI's ftp
    server as gzip files (*.gz) as of the next release (Rel.16) instead of
    compress (*.Z) files. Please contact us if you have any queries
    regarding this.




  

TrEMBL release 14.0

Published June 23, 2000

                              TrEMBL Release Notes
                              Release 14, June 2000

    EMBL Outstation
    European Bioinformatics Institute (EBI)
    Wellcome Trust Genome Campus
    Hinxton
    Cambridge CB10 1SD
    United Kingdom

    Telephone: (+44 1223) 494 444
    Fax: (+44 1223) 494 468
    Electronic mail address: DATALIB@EBI.AC.UK
    WWW server: http://www.ebi.ac.uk/

    Amos Bairoch
    Swiss Institute of Bioinformatics (SIB)
    Centre Medical Universitaire
    1, rue Michel Servet
    1211 Geneva 4
    Switzerland

    Telephone: (+41 22) 702 50 50
    Fax: (+41 22) 702 58 58
    Electronic mail address: BAIROCH@CMU.UNIGE.CH
    WWW server: http://www.expasy.ch/


    Acknowledgements

    TrEMBL has been prepared by:

    o  Rolf Apweiler, Kirsty Bates, Margaret Biswas, Sergio Contrino,
       Kirill Degtyarenko, Wolfgang Fleischmann, Gill Fraser, Cathy Gedman,
       Henning Hermjakob, Vivien Junker, Youla Karavidopoulou, Paul Kersey,
       Fiona Lang, Minna Lehvaslaiho, Michele Magrane, Maria Jesus Martin,
       Steffen Moeller, Nicoletta Mitaritonna, Virginie Mittard, Nicola Mulder,
       Claire O'Donovan, John F. O'Rourke, Isabelle Phan, Sandrine Pilbout,
       Lucia Rodriguez-Monge, Eleanor Whitfield and  Allyson Williams
       at the EMBL Outstation - European Bioinformatics Institute (EBI)
       in Hinxton, UK;
    o  Amos Bairoch and Alain Gateau at the Swiss Institute of Bioinformatics
       in Geneva, Switzerland.

    Copyright Notice
    TrEMBL copyright (c) 2000 EMBL-EBI
    This manual and the database it accompanies may be copied and
    redistributed freely, without advance permission, provided
    that this copyright statement is reproduced with each copy.

    Citation

    If you  want to  cite  TrEMBL  in  a  publication  please  use  the
    following reference:


              Bairoch A, and Apweiler R.
              The SWISS-PROT protein sequence data bank and its supplement
              TrEMBL in 2000.
              Nucl. Acids Res. 28:45-48(2000).


                         1. Introduction


TrEMBL is a computer-annotated protein sequence database supplementing the
SWISS-PROT Protein Sequence Data Bank. TrEMBL contains the translations of
all coding sequences (CDS) present in the EMBL Nucleotide Sequence Database
not yet integrated in SWISS-PROT. TrEMBL can be considered as a preliminary
section of SWISS-PROT. For all TrEMBL entries which should finally be
upgraded to the standard SWISS-PROT quality, SWISS-PROT accession numbers
have been assigned.

                        2. Why a supplement to SWISS-PROT?

The ongoing gene sequencing and mapping projects have dramatically
increased the number of protein sequences to be incorporated into SWISS-PROT.
We do not want to dilute the quality standards of SWISS-PROT by incorporating
sequences without proper sequence analysis and annotation, but we do want to
make the sequences available as fast as possible. TrEMBL achieves this second
goal, and is a major step in the process of speeding up subsequent
upgrading of annotation to the standard SWISS-PROT quality.
To address the problem of redundancy, the translations of all coding
sequences (CDS) in the EMBL Nucleotide Sequence Database already included
in SWISS-PROT have been removed from TrEMBL.


                             3. The Release

The goal of this TrEMBL release is to achieve synchronization with SWISS-PROT
release 39.0. Therefore, all sequence entries present in SWISS-PROT release
39.0 have been removed from TrEMBL release 13, existing TrEMBL entries have
been further upgraded, and only very few new entries have been
incorporated.

TrEMBL release 14 contains 351'834 sequence entries, comprising 100'069'442
amino acids.

TrEMBL is split in two main sections: SP-TrEMBL and REM-TrEMBL:
SP-TrEMBL (SWISS-PROT TrEMBL) contains the entries (297'973) which should be
eventually incorporated into SWISS-PROT. SWISS-PROT accession numbers have
been assigned for all SP-TrEMBL entries.

SP-TrEMBL is organised in subsections:

arc.dat (Archaea):             11776 entries
fun.dat (Fungi):                8877 entries
hum.dat (Human):               13757 entries
inv.dat (Invertebrates):       44786 entries
mam.dat (Other Mammals):        4977 entries
mhc.dat (MHC proteins):         5264 entries
org.dat (Organelles):          25730 entries
phg.dat (Bacteriophages):       2666 entries
pln.dat (Plants):              28878 entries
pro.dat (Prokaryotes):         62006 entries
rod.dat (Rodents):             10226 entries
unc.dat (Unclassified):           31 entries
vrl.dat (Viruses):             72237 entries
vrt.dat (Other Vertebrates):    6762 entries

112 new entries have been integrated in SP-TrEMBL. The sequences of
40 SP-TrEMBL entries have been updated and the annotation has been
updated in 105'056 entries.

In the document deleteac.txt, you will find a list of all accession numbers
which were previously present in TrEMBL, but which have now been deleted from
the database.

REM-TrEMBL (REMaining TrEMBL) contains the entries (53'861) that we do
not want to include in SWISS-PROT. REM-TrEMBL entries have no accession
numbers. This section is organised in six subsections:

   1) Immunoglobulins and T-cell receptors (Immuno.dat)
      Most REM-TrEMBL entries are  immunoglobulins and  T-cell receptors. We
      stopped entering immunoglobulins and T-cell receptors into SWISS-PROT,
      because we only want to keep  the  germ line gene derived translations
      of these proteins in SWISS-PROT and not all known somatic recombinated
      variations of  these proteins.  We would like to  create a specialised
      database  dealing  with  these  sequences  as a further  supplement to
      SWISS-PROT  and  keep  only a  representative  cross-section of  these
      proteins in SWISS-PROT.

   2) Synthetic sequences (Synth.dat)
      Another category of data that will not be included in SWISS-PROT is
      synthetic sequences.  Again, we do not want to  leave these entries in
      TrEMBL. Ideally one should build a specialised database for artificial
      sequences as a further supplement to SWISS-PROT.

   3) Patent application sequences (Patent.dat)
      A third  subsection consists of  coding sequences captured from patent
      applications.  A thorough  survey of  these  entries  has shown  that
      apart from a rather small minority  (which in  most cases have already
      been integrated in SWISS-PROT), most of these sequences contain either
      erroneous data or concern artificially generated sequences outside the
      scope of SWISS-PROT.

   4) Small fragments (Smalls.dat)
      Another  subsection  consists of fragments  with less than eight amino
      acids.

   5) CDS not coding for real proteins (Pseudo.dat)
      This subsection consists of CDS translations where we have strong
      evidence to believe that these CDS are not coding for real proteins.

   6) Truncated proteins (Truncated.dat)
      The last subsection consists of truncated proteins which result from
      events like mutations introducing a stop codon leading to the truncation
      of the protein product.

                4. Format Differences Between SWISS-PROT and TrEMBL

The format and conventions used by TrEMBL follow as closely as possible
those of SWISS-PROT. Hence, it is not necessary to produce an additional
user manual and extensive release notes for TrEMBL. The information given
in the SWISS-PROT release notes and user manual is in general valid for
TrEMBL. The differences are mentioned below.

The general structure of an entry is identical in SWISS-PROT and TrEMBL.
The data class used in TrEMBL (in the ID line) is always 'PRELIMINARY',
whereas in SWISS-PROT it is always 'STANDARD'.

Differences in line types present in SWISS-PROT and TrEMBL:

The ID line (IDentification):

The entry name used in SP-TrEMBL is the same as the Accession Number of the
entry. The entry name used in REM-TrEMBL is the stable part of the protein_id
tagged to the corresponding CDS in the EMBL Nucleotide Sequence Database.
'protein_id' stands for the "Protein Identification" number. It is a number
that you will find in the feature table of the EMBL nucleotide sequence
entries in a qualifier called "/protein_id" which is tagged to every CDS.

Example:

FT   CDS             339..1514
FT                   /codon_start=1
FT                   /db_xref="PID:g1256015"
FT                   /product="dystrobrevin-epsilon"
FT                   /protein_id="AAC50431.1"

The protein_id is defined as follows in the DDBJ/EMBL/GenBank Feature Table
Definition documentation:
Qualifier          /protein_id
Definition         Protein Identifier, issued by International collaborators.
                   This qualifier consists of a stable ID portion (3+5 format
                   with 3 position letters and 5 numbers) plus a version
                   number after the decimal point.

Value format       <identifier>
Example            /protein_id="AAA12345.1"
Comment            When the protein sequence encoded by the CDS changes, only
                   the version number of the /protein_id value is incremented.
                   The stable part of the /protein_id remains unchanged and
                   as a result will permanently be associated with a given
                   protein. This qualifier is valid only on CDS features
                   which translate into a valid protein.


The DT line (DaTe)

The format of the DT lines, which indicate when an entry was created and
updated, is identical to that defined in SWISS-PROT, but the DT lines in
TrEMBL refer to the TrEMBL release. The difference is shown in the
example below.

    DT lines in a SWISS-PROT entry:

    DT   01-JAN-1988 (Rel. 06, Created)
    DT   01-JUL-1989 (Rel. 11, Last sequence update)
    DT   01-AUG-1992 (Rel. 23, Last annotation update)

    DT lines in a TrEMBL entry:

    DT   01-NOV-1996 (TrEMBLrel. 01, Created)
    DT   01-NOV-1996 (TrEMBLrel. 01, Last sequence update)
    DT   01-FEB-1997 (TrEMBLrel. 02, Last annotation update)

                5. Weekly updates of TrEMBL and non-redundant data sets

Weekly cumulative updates of TrEMBL are available by anonymous FTP and
from the EBI SRS server.

We also produce every week a complete non-redundant protein sequence
collection by providing three compressed files (these are in the directory
/pub/databases/sp_tr_nrdb on the EBI FTP server and in databases/sp_tr_nrdb
on the ExPASy server): sprot.dat.Z, trembl.dat.Z and trembl_new.dat.Z.
This set of non-redundant files is especially important for two types of
users:
(i) Managers of similarity search services. They can now provide what is
currently the most comprehensive and non-redundant data set of protein
sequences.
(ii) Anybody wanting to update their full copy of SWISS-PROT + TrEMBL to
their own schedule without having to wait for full releases of SWISS-PROT
or of TrEMBL.

                6. Access/Data Distribution

FTP server:     ftp.ebi.ac.uk/pub/databases/trembl
SRS server:     http://srs.ebi.ac.uk/

TrEMBL is also available on the SWISS-PROT CD-ROM.
SWISS-PROT + TrEMBL is searchable on the following servers at the EBI:

FASTA3  (http://www.ebi.ac.uk/fasta3/)
BLAST2  (http://www.ebi.ac.uk/blast2/)
Bic_sw  (http://www.ebi.ac.uk/bic_sw/)
Scanps  (http://www.ebi.ac.uk/scanps/)
MPSrch  (http://www.ebi.ac.uk/MPsrch/)

                7. Description of changes made to TrEMBL since release 13.

7.1  Changes concerning cross-references (DR line)

We have added cross-references from TrEMBL to the PRODOM database
available at ftp://ftp.toulouse.inra.fr/pub/prodom


A list of all cross-references in TrEMBL to other databases is provided below:
AARHUS/GHENT-2DPAGE
EMBL
FLYBASE
GCRDB
HSSP
MENDEL
MGD
MIM
PDB
INTERPRO
PFAM
PIR
PRINTS
PRODOM
PROSITE
SGD
SUBTILIST
TIGR
TRANSFAC
WORMPEP
ZFIN

                8. Planned changes

8.1 Introduction of a new line type and code (OX line)

The OX (Organism taxonomy cross-reference) line contains
cross-references to taxonomy resource(s) for the organism(s)
listed in the OS line(s) of the entry.
The format of the OX line is:

OX   TAXONOMY_RESOURCE_NAME=IDENTIFIER_1[, IDENTIFIER_2,... IDENTIFIER_N];

Where the currently defined taxonomic resources are:

Name:     NCBI_TaxID
Resource: National Center for Biotechnology Information Taxonomy Browser

- The identifiers are listed in the order of the corresponding organisms
in the OS line(s).
- In the rare case where all the information will not fit on a single
line, more than one OX line is used.

Examples:
OX   NCBI_TaxID=256;
OX   NCBI_TaxID=126566, 38, 846, 23412;

The OX line(s) will be after the OC lines in the entry line order.


8.2 Modification of the RX line type

We are extending the scope of the RX line to include PubMed
as well as Medline. We are also changing the syntax of this
line.
An example is given below:

RX   MEDLINE=99382246; PubMed=10448073;

8.3 Conversion of TrEMBL to mixed case

We are beginning the conversion of the RC lines to mixed case.
Please see the SWISS-PROT release 39 release notes for further details.


  

TrEMBL release 13.0

Published May 2, 2000

                              TrEMBL Release Notes
                              Release 13, May 2000

    EMBL Outstation
    European Bioinformatics Institute (EBI)
    Wellcome Trust Genome Campus
    Hinxton
    Cambridge CB10 1SD
    United Kingdom

    Telephone: (+44 1223) 494 444
    Fax: (+44 1223) 494 468
    Electronic mail address: DATALIB@EBI.AC.UK
    WWW server: http://www.ebi.ac.uk/

    Amos Bairoch
    Swiss Institute of Bioinformatics (SIB)
    Centre Medical Universitaire
    1, rue Michel Servet
    1211 Geneva 4
    Switzerland

    Telephone: (+41 22) 702 58 60
    Fax: (+41 22) 702 55 02
    Electronic mail address: BAIROCH@CMU.UNIGE.CH
    WWW server: http://www.expasy.ch/


    Acknowledgements

    TrEMBL has been prepared by:

    o  Rolf Apweiler, Kirsty Bates, Margaret Biswas, Sergio Contrino,
       Kirill Degtyarenko, Wolfgang Fleischmann, Gill Fraser, Cathy Gedman,
       Henning Hermjakob, Vivien Junker, Youla Karavidopoulou, Paul Kersey,
       Fiona Lang, Minna Lehvaslaiho, Michele Magrane, Maria Jesus Martin,
       Steffen Moeller, Nicoletta Mitaritonna, Virginie Mittard, Nicola Mulder,
       Claire O'Donovan, John F. O'Rourke, Isabelle Phan, Sandrine Pilbout,
       Lucia Rodriguez-Monge, Eleanor Whitfield and  Allyson Williams
       at the EMBL Outstation - European Bioinformatics Institute (EBI)
       in Hinxton, UK;
    o  Amos Bairoch and Alain Gateau at the Swiss Institute of Bioinformatics
       in Geneva, Switzerland.

    Copyright Notice
    TrEMBL copyright (c) 2000 EMBL-EBI
    This manual and the database it accompanies may be copied and
    redistributed freely, without advance permission, provided
    that this copyright statement is reproduced with each copy.

    Citation

    If you  want to  cite  TrEMBL  in  a  publication  please  use  the
    following reference:


              Bairoch A, and Apweiler R.
              The SWISS-PROT protein sequence data bank and its supplement
              TrEMBL in 2000.
              Nucl. Acids Res. 28:45-48(2000).


                         1. Introduction


TrEMBL is a computer-annotated protein sequence database supplementing the
SWISS-PROT Protein Sequence Data Bank. TrEMBL contains the translations of
all coding sequences (CDS) present in the EMBL Nucleotide Sequence Database
not yet integrated in SWISS-PROT. TrEMBL can be considered as a preliminary
section of SWISS-PROT. For all TrEMBL entries which should finally be
upgraded to the standard SWISS-PROT quality, SWISS-PROT accession numbers
have been assigned.

                        2. Why a supplement to SWISS-PROT?

The ongoing gene sequencing and mapping projects have dramatically
increased the number of protein sequences to be incorporated into SWISS-PROT.
We do not want to dilute the quality standards of SWISS-PROT by incorporating
sequences without proper sequence analysis and annotation, but we do want to
make the sequences available as fast as possible. TrEMBL achieves this second
goal, and is a major step in the process of speeding up subsequent
upgrading of annotation to the standard SWISS-PROT quality.
To address the problem of redundancy, the translations of all coding
sequences (CDS) in the EMBL Nucleotide Sequence Database already included
in SWISS-PROT have been removed from TrEMBL.


                             3. The Release

This TrEMBL release was created from the EMBL Nucleotide Sequence Database
release 62 and contains 353'156 sequence entries, comprising 100'750'187 amino
acids. To minimize redundancy, the translations of all coding sequences (CDS)
in the EMBL Nucleotide Sequence Database already included in SWISS-PROT release
38 and updates up to 26.4.2000 have been removed from TrEMBL release 13.

TrEMBL is split in two main sections: SP-TrEMBL and REM-TrEMBL:
SP-TrEMBL (SWISS-PROT TrEMBL) contains the entries (300'192) which should be
eventually incorporated into SWISS-PROT. SWISS-PROT accession numbers have
been assigned for all SP-TrEMBL entries.

SP-TrEMBL is organized in subsections:

arc.dat (Archaea):             11814 entries
fun.dat (Fungi):                8916 entries
hum.dat (Human):               13979 entries
inv.dat (Invertebrates):       45133 entries
mam.dat (Other Mammals):        5037 entries
mhc.dat (MHC proteins):         5265 entries
org.dat (Organelles):          25755 entries
phg.dat (Bacteriophages):       2667 entries
pln.dat (Plants):              28814 entries
pro.dat (Prokaryotes):         62455 entries
rod.dat (Rodents):             10377 entries
unc.dat (Unclassified):           31 entries
vrl.dat (Viruses):             72264 entries
vrt.dat (Other Vertebrates):    6788 entries

65'315 new entries have been integrated in SP-TrEMBL. The sequences of
1'204 SP-TrEMBL entries have been updated and the annotation has been
updated in 161'387 entries.

In the document deleteac.txt, you will find a list of all accession numbers
which were previously present in TrEMBL, but which have now been deleted from
the database.

REM-TrEMBL (REMaining TrEMBL) contains the entries (53'861) that we do
not want to include in SWISS-PROT. REM-TrEMBL entries have no accession
numbers. This section is organized in six subsections:

   1) Immunoglobulins and T-cell receptors (Immuno.dat)
      Most REM-TrEMBL entries are  immunoglobulins and  T-cell receptors. We
      stopped entering immunoglobulins and T-cell receptors into SWISS-PROT,
      because we only want to keep  the  germ line gene derived translations
      of these proteins in SWISS-PROT and not all known somatic recombinated
      variations of  these proteins.  We would like to  create a specialized
      database  dealing  with  these  sequences  as a further  supplement to
      SWISS-PROT  and  keep  only a  representative  cross-section of  these
      proteins in SWISS-PROT.

   2) Synthetic sequences (Synth.dat)
      Synthetic sequences are another category of data that will not be
      included in SWISS-PROT. Again, we do not want to leave these entries in
      TrEMBL. Ideally one should build a specialized database for artificial
      sequences as a further supplement to SWISS-PROT.

   3) Patent application sequences (Patent.dat)
      A third  subsection consists of  coding sequences captured from patent
      applications.  A thorough  survey of  these  entries  has shown  that
      apart from a rather small minority  (which in  most cases have already
      been integrated in SWISS-PROT), most of these sequences contain either
      erroneous data or concern artificially generated sequences outside the
      scope of SWISS-PROT.

   4) Small fragments (Smalls.dat)
      Another  subsection  consists of fragments  with fewer than eight amino
      acids.

   5) CDS not coding for real proteins (Pseudo.dat)
      This subsection consists of CDS translations where we have strong
      evidence to believe that these CDS are not coding for real proteins.

   6) Truncated proteins (Truncated.dat)
      The last subsection consists of truncated proteins which result from
      events like mutations introducing a stop codon leading to the truncation
      of the protein product.

                4. Format Differences Between SWISS-PROT and TrEMBL

The format and conventions used by TrEMBL follow those of SWISS-PROT as
closely as possible. Hence, it is not necessary to produce an additional
user manual and extensive release notes for TrEMBL. The information given
in the SWISS-PROT release notes and user manual is in general valid for
TrEMBL. The differences are mentioned below.

The general structure of an entry is identical in SWISS-PROT and TrEMBL.
The data class used in TrEMBL (in the ID line) is always 'PRELIMINARY',
whereas in SWISS-PROT it is always 'STANDARD'.

Differences in line types present in SWISS-PROT and TrEMBL:

The ID line (IDentification):

The entry name used in SP-TrEMBL is the same as the Accession Number of the
entry. The entry name used in REM-TrEMBL is the stable part of the protein_id
tagged to the corresponding CDS in the EMBL Nucleotide Sequence Database.
'protein_id' stands for the "Protein Identification" number. It is a number
that you will find in the feature table of the EMBL nucleotide sequence
entries in a qualifier called "/protein_id" which is tagged to every CDS.

Example:

FT   CDS             339..1514
FT                   /codon_start=1
FT                   /db_xref="PID:g1256015"
FT                   /product="dystrobrevin-epsilon"
FT                   /protein_id="AAC50431.1"

The protein_id is defined as follows in the DDBJ/EMBL/GenBank Feature Table
Definition documentation:
Qualifier          /protein_id
Definition         Protein Identifier, issued by International collaborators.
                   This qualifier consists of a stable ID portion (3+5 format
                   with 3 position letters and 5 numbers) plus a version
                   number after the decimal point.

Value format       <identifier>
Example            /protein_id="AAA12345.1"
Comment            When the protein sequence encoded by the CDS changes, only
                   the version number of the /protein_id value is incremented.
                   The stable part of the /protein_id remains unchanged and
                   as a result will permanently be associated with a given
                   protein. This qualifier is valid only on CDS features
                   which translate into a valid protein.
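
As an illustration of the 3+5 format described above, the following Python
sketch splits a /protein_id value into its stable part and its version
number. The function name split_protein_id and the regular expression are
assumptions made for this example; they are not part of the TrEMBL or EMBL
tools, and the pattern covers only the 3+5 format quoted above.

```python
import re

# Assumed pattern for the "3+5" format: 3 letters, 5 digits, ".", version.
_PROTEIN_ID = re.compile(r"^([A-Z]{3}[0-9]{5})\.([0-9]+)$")


def split_protein_id(value):
    """Return (stable_part, version) for a value such as 'AAC50431.1'."""
    match = _PROTEIN_ID.match(value)
    if match is None:
        raise ValueError("not a 3+5 protein_id: %r" % value)
    return match.group(1), int(match.group(2))


# The stable part is what REM-TrEMBL uses as the entry name.
stable, version = split_protein_id("AAC50431.1")
print(stable, version)  # AAC50431 1
```

Keeping the version as a separate integer makes it easy to compare two
versions of the same protein while the stable part stays fixed.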


The DT line (DaTe)

The format of the DT lines, which indicate when an entry was created and
updated, is identical to that defined in SWISS-PROT, but the DT lines in
TrEMBL refer to the TrEMBL release. The difference is shown in the example
below.

    DT lines in a SWISS-PROT entry:

    DT   01-JAN-1988 (Rel. 06, Created)
    DT   01-JUL-1989 (Rel. 11, Last sequence update)
    DT   01-AUG-1992 (Rel. 23, Last annotation update)

    DT lines in a TrEMBL entry:

    DT   01-NOV-1996 (TrEMBLrel. 01, Created)
    DT   01-NOV-1996 (TrEMBLrel. 01, Last sequence update)
    DT   01-FEB-1997 (TrEMBLrel. 02, Last annotation update)

                5. Weekly updates of TrEMBL and non-redundant data sets

Weekly cumulative updates of TrEMBL are available by anonymous FTP and
from the EBI SRS server.

We also produce every week a complete non-redundant protein sequence
collection by providing three compressed files (these are in the directory
/pub/databases/sp_tr_nrdb on the EBI FTP server and in databases/sp_tr_nrdb
on the ExPASy server): sprot.dat.Z, trembl.dat.Z and trembl_new.dat.Z.
This set of non-redundant files is especially important for two types of
users:
(i) Managers of similarity search services. They can now provide what is
currently the most comprehensive and non-redundant data set of protein
sequences.
(ii) Anybody wanting to update their full copy of SWISS-PROT + TrEMBL to
their own schedule without having to wait for full releases of SWISS-PROT
or of TrEMBL.

                6. Access/Data Distribution

FTP server:     ftp.ebi.ac.uk/pub/databases/trembl
SRS server:     http://srs.ebi.ac.uk/

TrEMBL is also available on the SWISS-PROT CD-ROM.
SWISS-PROT + TrEMBL is searchable on the following servers at the EBI:

FASTA3  (http://www.ebi.ac.uk/fasta3/)
BLAST2  (http://www.ebi.ac.uk/blast2/)
Bic_sw  (http://www.ebi.ac.uk/bic_sw/)
Scanps  (http://www.ebi.ac.uk/scanps/)
MPSrch  (http://www.ebi.ac.uk/MPsrch/)

  7. Description of changes made to TrEMBL since release 12.

7.1 Conversion of TrEMBL to mixed case.

In this release we have converted the RA lines to mixed case.

7.2 Change in the syntax of the SQ line.

The last item in the SQ line, the 32-bit CRC value, has been replaced by a
64-bit CRC. The format of the SQ line is:

SQ   SEQUENCE  XXXX AA; XXXXXX MW;  XXXXXXXXXXXXXXXX CRC64;

Example:

SQ   SEQUENCE   308 AA;  35309 MW;  B26AE1BBDA64E683 CRC64;
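
A parser for the new SQ line could look like the Python sketch below. It is
only an illustration: the function name parse_sq_line and the regular
expression are assumptions derived from the format line and the example
above, not code used to build TrEMBL. It reads the CRC64 value but does not
recompute it.

```python
import re

# Assumed layout, taken from the format line above:
# "SQ   SEQUENCE  <length> AA; <weight> MW;  <16 hex digits> CRC64;"
_SQ = re.compile(
    r"^SQ\s+SEQUENCE\s+(\d+) AA;\s+(\d+) MW;\s+([0-9A-F]{16}) CRC64;$"
)


def parse_sq_line(line):
    """Return (length, molecular_weight, crc64) from one SQ line."""
    match = _SQ.match(line.strip())
    if match is None:
        raise ValueError("unrecognised SQ line: %r" % line)
    length, weight, crc64 = match.groups()
    return int(length), int(weight), crc64


example = "SQ   SEQUENCE   308 AA;  35309 MW;  B26AE1BBDA64E683 CRC64;"
print(parse_sq_line(example))  # (308, 35309, 'B26AE1BBDA64E683')
```

Because the pattern requires exactly 16 hexadecimal digits, a line that
still carries an old 32-bit CRC is rejected rather than silently accepted.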

7.3 Multiple AC lines

In the ongoing process of updating and annotating TrEMBL entries, we find
entries which can be merged. We keep all the original accession numbers
in the resulting merged entry. The repetition of such a process sometimes
produces an accession number list which can no longer fit in a single AC
line. Therefore there will now be some entries with two, three or more AC
lines (as shown below).

AC   Q00626; Q57339; O08022; O08102; O09490; O09483; O09393; O09396;
AC   O09397; O09398; O09399; O09400; O09401; O09402; O09403; O09404;
AC   O09405; O09406; O09407; O09408; O09481; O09482;
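
Programs that read only the first AC line will miss accession numbers in
such entries. The Python sketch below shows one way to gather the accession
numbers from every AC line of an entry; the function name collect_accessions
is an assumption made for this example.

```python
def collect_accessions(lines):
    """Gather accession numbers from all AC lines of one entry, in order."""
    accessions = []
    for line in lines:
        if line.startswith("AC   "):
            # Accession numbers on a line are separated by semicolons.
            for item in line[5:].split(";"):
                item = item.strip()
                if item:
                    accessions.append(item)
    return accessions


entry_lines = [
    "AC   Q00626; Q57339; O08022; O08102; O09490; O09483; O09393; O09396;",
    "AC   O09397; O09398; O09399; O09400; O09401; O09402; O09403; O09404;",
    "AC   O09405; O09406; O09407; O09408; O09481; O09482;",
]
accessions = collect_accessions(entry_lines)
print(len(accessions), accessions[0])  # 22 Q00626
```

The order of the list is preserved, so the first element is the first
accession number of the first AC line.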


7.4  Changes concerning cross-references (DR line)

We have added cross-references from TrEMBL to
the INTERPRO (Integrated Resource of Protein Domains and Functional Sites)
database available at ftp.ebi.ac.uk/pub/databases/interpro/


A list of all cross-references in TrEMBL to other databases is provided below:
AARHUS/GHENT-2DPAGE
EMBL
FLYBASE
GCRDB
HSSP
MENDEL
MGD
MIM
PDB
INTERPRO
PFAM
PIR
PRINTS
PROSITE
SGD
SUBTILIST
TIGR
TRANSFAC
WORMPEP
ZFIN
  

TrEMBL release 12.0

Published November 1, 1999

                              TrEMBL Release Notes
                              Release 12, November 1999

    EMBL Outstation
    European Bioinformatics Institute (EBI)
    Wellcome Trust Genome Campus
    Hinxton
    Cambridge CB10 1SD
    United Kingdom

    Telephone: (+44 1223) 494 444
    Fax: (+44 1223) 494 468
    Electronic mail address: DATALIB@EBI.AC.UK
    WWW server: http://www.ebi.ac.uk/

    Amos Bairoch
    Swiss Institute of Bioinformatics (SIB)
    Centre Medical Universitaire
    1, rue Michel Servet
    1211 Geneva 4
    Switzerland

    Telephone: (+41 22) 784 40 82
    Fax: (+41 22) 702 55 02
    Electronic mail address: BAIROCH@CMU.UNIGE.CH
    WWW server: http://www.expasy.ch/


    Acknowledgements

    TrEMBL has been prepared by:

    o  Rolf Apweiler, Kirsty Bates, Margaret Biswas, Sergio Contrino,
       Wolfgang Fleischmann, Gill Fraser, Cathy Gedman, Henning Hermjakob,
       Vivien Junker, Youla Karavidopoulou, Fiona Lang, Minna Lehvaslaiho,
       Michele Magrane, Maria Jesus Martin, Steffen Moeller, Nicoletta
       Mitaritonna, Virginie Mittard, Nicola Mulder, Claire O'Donovan,
       Isabelle Phan, Sandrine Pilbout, Lucia Rodriguez-Monge and Eleanor
       Whitfield at the EMBL Outstation - European Bioinformatics Institute
       (EBI) in Hinxton, UK;
    o  Amos Bairoch and Alain Gateau at the Swiss Institute of Bioinformatics
       in Geneva, Switzerland.

    Copyright Notice
    TrEMBL copyright (c) 1999 EMBL-EBI
    This manual and the database it accompanies may be copied and
    redistributed freely, without advance permission, provided
    that this copyright statement is reproduced with each copy.

    Citation

    If you  want to  cite  TrEMBL  in  a  publication  please  use  the
    following reference:


              Bairoch A, and Apweiler R.
              The SWISS-PROT protein sequence data bank and its supplement
              TrEMBL in 1999.
              Nucleic Acids Res. 27:49-54(1999).


                         1. Introduction


TrEMBL is a computer-annotated protein sequence database supplementing the
SWISS-PROT Protein Sequence Data Bank. TrEMBL contains the translations of
all coding sequences (CDS) present in the EMBL Nucleotide Sequence Database
not yet integrated in SWISS-PROT. TrEMBL can be considered as a preliminary
section of SWISS-PROT. For all TrEMBL entries which should finally be
upgraded to the standard SWISS-PROT quality, SWISS-PROT accession numbers
have been assigned.


                        2. Why a supplement to SWISS-PROT?

The ongoing gene sequencing and mapping projects have dramatically
increased the number of protein sequences to be incorporated into SWISS-PROT.
We do not want to dilute the quality standards of SWISS-PROT by incorporating
sequences without proper sequence analysis and annotation, but we do want to
make the sequences available as fast as possible. TrEMBL achieves this second
goal, and is a major step in the process of speeding up subsequent
upgrading of annotation to the standard SWISS-PROT quality.
To address the problem of redundancy, the translations of all coding
sequences (CDS) in the EMBL Nucleotide Sequence Database already included
in SWISS-PROT have been removed from TrEMBL.


                             3. The Release

This TrEMBL release was created from the EMBL Nucleotide Sequence Database
release 60 and contains 276'472 sequence entries, comprising 75'524'740 amino
acids. To minimize redundancy, the translations of all coding sequences (CDS)
in the EMBL Nucleotide Sequence Database already included in SWISS-PROT release
38 and updates up to 18.10.1999 have been removed from TrEMBL release 12.

TrEMBL is split in two main sections: SP-TrEMBL and REM-TrEMBL:
SP-TrEMBL (SWISS-PROT TrEMBL) contains the entries (225'878) which should be
eventually incorporated into SWISS-PROT. SWISS-PROT accession numbers have
been assigned for all SP-TrEMBL entries.

SP-TrEMBL is organized in subsections:

arc.dat (Archaea):             10017 entries
fun.dat (Fungi):                7130 entries
hum.dat (Human):                9417 entries
inv.dat (Invertebrates):       26516 entries
mam.dat (Other Mammals):        3498 entries
mhc.dat (MHC proteins):         4563 entries
org.dat (Organelles):          18368 entries
phg.dat (Bacteriophages):       2201 entries
pln.dat (Plants):              18993 entries
pro.dat (Prokaryotes):         51151 entries
rod.dat (Rodents):              7992 entries
unc.dat (Unclassified):          129 entries
vrl.dat (Viruses):             61006 entries
vrt.dat (Other Vertebrates):    4897 entries

28'141 new entries have been integrated in SP-TrEMBL. The sequences of
522 SP-TrEMBL entries have been updated and the annotation has been updated in
56'201 entries.

In the document deleteac.txt, you will find a list of all accession numbers
which were previously present in TrEMBL, but which have now been deleted from
the database.

REM-TrEMBL (REMaining TrEMBL) contains the entries (50'594) that we do
not want to include in SWISS-PROT. REM-TrEMBL entries have no accession
numbers. This section is organized in six subsections:

   1) Immunoglobulins and T-cell receptors (Immuno.dat)
      Most REM-TrEMBL entries are  immunoglobulins and  T-cell receptors. We
      stopped entering immunoglobulins and T-cell receptors into SWISS-PROT,
      because we only want to keep  the  germ line gene derived translations
      of these proteins in SWISS-PROT and not all known somatic recombinated
      variations of  these proteins.  We would like to  create a specialized
      database  dealing  with  these  sequences  as a further  supplement to
      SWISS-PROT  and  keep  only a  representative  cross-section of  these
      proteins in SWISS-PROT.

   2) Synthetic sequences (Synth.dat)
      Synthetic sequences are another category of data that will not be
      included in SWISS-PROT. Again, we do not want to leave these entries in
      TrEMBL. Ideally one should build a specialized database for artificial
      sequences as a further supplement to SWISS-PROT.

   3) Patent application sequences (Patent.dat)
      A third  subsection consists of  coding sequences captured from patent
      applications.  A thorough  survey of  these  entries  has shown  that
      apart from a rather small minority  (which in  most cases have already
      been integrated in SWISS-PROT), most of these sequences contain either
      erroneous data or concern artificially generated sequences outside the
      scope of SWISS-PROT.

   4) Small fragments (Smalls.dat)
      Another  subsection  consists of fragments  with fewer than eight amino
      acids.

   5) CDS not coding for real proteins (Pseudo.dat)
      This subsection consists of CDS translations where we have strong
      evidence to believe that these CDS are not coding for real proteins.

   6) Truncated proteins (Truncated.dat)
      The last subsection consists of truncated proteins which result from
      events like mutations introducing a stop codon leading to the truncation
      of the protein product.
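
The entry counts quoted in this section can be cross-checked with a few
lines of Python. The sketch below copies the release 12 figures from the
text above; the variable and function names are illustrative only and do not
correspond to any TrEMBL production script.

```python
# Entry counts copied from section 3 of the release 12 notes.
SP_TREMBL_SUBSECTIONS = {
    "arc.dat": 10017, "fun.dat": 7130, "hum.dat": 9417, "inv.dat": 26516,
    "mam.dat": 3498, "mhc.dat": 4563, "org.dat": 18368, "phg.dat": 2201,
    "pln.dat": 18993, "pro.dat": 51151, "rod.dat": 7992, "unc.dat": 129,
    "vrl.dat": 61006, "vrt.dat": 4897,
}
SP_TREMBL_TOTAL = 225878   # entries stated for SP-TrEMBL
REM_TREMBL_TOTAL = 50594   # entries stated for REM-TrEMBL
RELEASE_TOTAL = 276472     # entries stated for the whole release


def check_release_counts(subsections, sp_total, rem_total, release_total):
    """Return True when subsection counts and section totals agree."""
    return (sum(subsections.values()) == sp_total
            and sp_total + rem_total == release_total)


print(check_release_counts(SP_TREMBL_SUBSECTIONS, SP_TREMBL_TOTAL,
                           REM_TREMBL_TOTAL, RELEASE_TOTAL))  # True
```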


                4. Format Differences Between SWISS-PROT and TrEMBL

The format and conventions used by TrEMBL follow those of SWISS-PROT as
closely as possible. Hence, it is not necessary to produce an additional
user manual and extensive release notes for TrEMBL. The information given
in the SWISS-PROT release notes and user manual is in general valid for
TrEMBL. The differences are mentioned below.

The general structure of an entry is identical in SWISS-PROT and TrEMBL.
The data class used in TrEMBL (in the ID line) is always 'PRELIMINARY',
whereas in SWISS-PROT it is always 'STANDARD'.

Differences in line types present in SWISS-PROT and TrEMBL:

The ID line (IDentification):

The entry name used in SP-TrEMBL is the same as the Accession Number of the
entry. The entry name used in REM-TrEMBL is the stable part of the protein_id
tagged to the corresponding CDS in the EMBL Nucleotide Sequence Database.
'protein_id' stands for the "Protein Identification" number. It is a number
that you will find in the feature table of the EMBL nucleotide sequence
entries in a qualifier called "/protein_id" which is tagged to every CDS.

Example:

FT   CDS             339..1514
FT                   /codon_start=1
FT                   /db_xref="PID:g1256015"
FT                   /product="dystrobrevin-epsilon"
FT                   /protein_id="AAC50431.1"

The protein_id is defined as follows in the DDBJ/EMBL/GenBank Feature Table
Definition documentation:
Qualifier          /protein_id
Definition         Protein Identifier, issued by International collaborators.
                   This qualifier consists of a stable ID portion (3+5 format
                   with 3 position letters and 5 numbers) plus a version
                   number after the decimal point.

Value format       <identifier>
Example            /protein_id="AAA12345.1"
Comment            When the protein sequence encoded by the CDS changes, only
                   the version number of the /protein_id value is incremented.
                   The stable part of the /protein_id remains unchanged and
                   as a result will permanently be associated with a given
                   protein. This qualifier is valid only on CDS features
                   which translate into a valid protein.


The DT line (DaTe)

The format of the DT lines, which indicate when an entry was created and
updated, is identical to that defined in SWISS-PROT, but the DT lines in
TrEMBL refer to the TrEMBL release. The difference is shown in the example
below.

    DT lines in a SWISS-PROT entry:

    DT   01-JAN-1988 (Rel. 06, Created)
    DT   01-JUL-1989 (Rel. 11, Last sequence update)
    DT   01-AUG-1992 (Rel. 23, Last annotation update)

    DT lines in a TrEMBL entry:

    DT   01-NOV-1996 (TrEMBLrel. 01, Created)
    DT   01-NOV-1996 (TrEMBLrel. 01, Last sequence update)
    DT   01-FEB-1997 (TrEMBLrel. 02, Last annotation update)
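
A program reading both databases can tell the two DT styles apart by the
release label. The Python sketch below is a minimal illustration; the
function name parse_dt_line and the regular expression are assumptions
inferred from the example lines above, not an official parser.

```python
import re

# Assumed layout, inferred from the examples above:
# "DT   DD-MMM-YYYY (<release label> NN, <event>)"
_DT = re.compile(
    r"^DT\s+(\d{2}-[A-Z]{3}-\d{4}) \((Rel\.|TrEMBLrel\.) (\d+), (.+)\)$"
)


def parse_dt_line(line):
    """Return (date, database, release, event) for one DT line."""
    match = _DT.match(line.strip())
    if match is None:
        raise ValueError("unrecognised DT line: %r" % line)
    date, label, release, event = match.groups()
    # "TrEMBLrel." marks a TrEMBL release, "Rel." a SWISS-PROT release.
    database = "TrEMBL" if label == "TrEMBLrel." else "SWISS-PROT"
    return date, database, int(release), event


print(parse_dt_line("DT   01-JUL-1989 (Rel. 11, Last sequence update)"))
print(parse_dt_line("DT   01-FEB-1997 (TrEMBLrel. 02, Last annotation update)"))
```

The release number is returned as an integer, so "02" becomes 2.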


                5. Weekly updates of TrEMBL and non-redundant data sets

Weekly cumulative updates of TrEMBL are available by anonymous FTP and
from the EBI SRS server.

We also produce every week a complete non-redundant protein sequence
collection by providing three compressed files (these are in the directory
/pub/databases/sp_tr_nrdb on the EBI FTP server and in databases/sp_tr_nrdb
on the ExPASy server): sprot.dat.Z, trembl.dat.Z and trembl_new.dat.Z.
This set of non-redundant files is especially important for two types of
users:

(i) Managers of similarity search services. They can now provide what is
currently the most comprehensive and non-redundant data set of protein
sequences.
(ii) Anybody wanting to update their full copy of SWISS-PROT + TrEMBL to
their own schedule without having to wait for full releases of SWISS-PROT
or of TrEMBL.


                6. Access/Data Distribution

FTP server:     ftp.ebi.ac.uk/pub/databases/trembl
SRS server:     http://srs.ebi.ac.uk/

TrEMBL is also available on the SWISS-PROT CD-ROM.
SWISS-PROT + TrEMBL is searchable on the following servers at the EBI:
FASTA3  (http://www2.ebi.ac.uk/fasta3/)
BLAST2  (http://www2.ebi.ac.uk/blast2/)
Bic_sw  (http://www2.ebi.ac.uk/bic_sw/)
Scanps  (http://www2.ebi.ac.uk/scanps/)
SSearch (http://www2.ebi.ac.uk/ssearch3/)


  7. Description of changes made to TrEMBL since release 11.


7.1  Addition of new subsection of REM-TrEMBL.

In this release we have created a new subsection called truncated.dat.

7.2  Changes concerning cross-references (DR line)

We have added cross-references from TrEMBL to:

a) the PRINTS database available at ftp.seqnet.dl.ac.uk (see: Attwood, T.K.,
Beck, M.E., Bleasby, A.J. and Parry-Smith, D.J. (1994) PRINTS - A database of
protein motif fingerprints. Nucleic Acids Res. 24:182-188(1994)).
b) the HSSP: Homology derived Secondary Structure of Proteins database
available at http://www.sander.ebi.ac.uk/hssp/ (see: Sander C. & Schneider R.
(1991) Database of homology-derived protein structures. Proteins, Structure,
Function & Genetics, 9:56-68).

A list of all cross-references in TrEMBL to other databases is provided below:
AARHUS/GHENT-2DPAGE
EMBL
FLYBASE
GCRDB
HSSP
MENDEL
MGD
MIM
PDB
PFAM
PIR
PRINTS
PROSITE
SGD
SUBTILIST
TIGR
TRANSFAC
WORMPEP
ZFIN
  

TrEMBL release 11.0

Published July 30, 1999

                              TrEMBL Release Notes
                              Release 11, August 1999

    EMBL Outstation
    European Bioinformatics Institute (EBI)
    Wellcome Trust Genome Campus
    Hinxton
    Cambridge CB10 1SD
    United Kingdom

    Telephone: (+44 1223) 494 444
    Fax: (+44 1223) 494 468
    Electronic mail address: DATALIB@EBI.AC.UK
    WWW server: http://www.ebi.ac.uk/

    Amos Bairoch
    Swiss Institute of Bioinformatics (SIB)
    Centre Medical Universitaire
    1, rue Michel Servet
    1211 Geneva 4
    Switzerland

    Telephone: (+41 22) 784 40 82
    Fax: (+41 22) 702 55 02
    Electronic mail address: BAIROCH@CMU.UNIGE.CH
    WWW server: http://www.expasy.ch/


    Acknowledgements

    TrEMBL has been prepared by:

    o  Rolf Apweiler, Kirsty Bates, Margaret Biswas, Sergio Contrino,
       Wolfgang Fleischmann, Gill Fraser, Henning Hermjakob, Vivien Junker,
       Youla Karavidopoulou, Fiona Lang,  Minna Lehvaslaiho, Michele Magrane,
       Maria Jesus Martin, Steffen Moeller, Nicoletta Mitaritonna, Nicola Mulder,
       Claire O'Donovan, Lucia Rodriguez-Monge and Eleanor Whitfield
       at the EMBL Outstation - European Bioinformatics Institute (EBI) in
       Hinxton, UK;
    o  Amos Bairoch and Alain Gateau at the Swiss Institute of Bioinformatics
       in Geneva, Switzerland.

    Notes

    This manual and the database it accompanies may be copied and
    redistributed freely, without advance permission, provided
    that this statement is reproduced with each copy.

    Citation

    If you  want to  cite  TrEMBL  in  a  publication  please  use  the
    following reference:


              Bairoch A, and Apweiler R.
              The SWISS-PROT protein sequence data bank and its supplement
              TrEMBL in 1999.
              Nucleic Acids Res. 27:49-54(1999).


                         1. Introduction


TrEMBL is a computer-annotated protein sequence database supplementing the
SWISS-PROT Protein Sequence Data Bank. TrEMBL contains the translations of
all coding sequences (CDS) present in the EMBL Nucleotide Sequence Database
not yet integrated in SWISS-PROT. TrEMBL can be considered as a preliminary
section of SWISS-PROT. For all TrEMBL entries which should finally be
upgraded to the standard SWISS-PROT quality, SWISS-PROT accession numbers
have been assigned.


                        2. Why a supplement to SWISS-PROT?

The ongoing gene sequencing and mapping projects have dramatically
increased the number of protein sequences to be incorporated into SWISS-PROT.
We do not want to dilute the quality standards of SWISS-PROT by incorporating
sequences without proper sequence analysis and annotation, but we do want to
make the sequences available as fast as possible. TrEMBL achieves this second
goal, and is a major step in the process of speeding up subsequent
upgrading of annotation to the standard SWISS-PROT quality.
To address the problem of redundancy, the translations of all coding
sequences (CDS) in the EMBL Nucleotide Sequence Database already included
in SWISS-PROT have been removed from TrEMBL.

We name this supplement TrEMBL (Translation from EMBL), since the tools
used to create the translations of the CDS are based on the program
'trembl' written by Thure Etzold at the EMBL.


                             3. The Release

The goal of this TrEMBL release is to achieve synchronization with SWISS-PROT
release 38.0. Therefore, all sequence entries present in SWISS-PROT release 38.0
have been removed from TrEMBL release 11; further upgrading of existing TrEMBL
entries was achieved, and only a very few new entries were incorporated.

TrEMBL release 11 contains 245'761 sequence entries, comprising 56'545'670 amino
acids.

TrEMBL is split in two main sections: SP-TrEMBL and REM-TrEMBL:
SP-TrEMBL (SWISS-PROT TrEMBL) contains the entries (199'794) which should be
eventually incorporated into SWISS-PROT. SWISS-PROT accession numbers have
been assigned for all SP-TrEMBL entries.

SP-TrEMBL is organized in subsections:

arc.dat (Archaea):              7383 entries
fun.dat (Fungi):                6656 entries
hum.dat (Human):                7880 entries
inv.dat (Invertebrates):       23594 entries
mam.dat (Other Mammals):        3094 entries
mhc.dat (MHC proteins):         4210 entries
org.dat (Organelles):          16227 entries
phg.dat (Bacteriophages):       1963 entries
pln.dat (Plants):              17250 entries
pro.dat (Prokaryotes):         45908 entries
rod.dat (Rodents):              7348 entries
unc.dat (Unclassified):           44 entries
vrl.dat (Viruses):             53911 entries
vrt.dat (Other Vertebrates):    4326 entries

29 new entries have been integrated in SP-TrEMBL. The sequences of
297 SP-TrEMBL entries have been updated and the annotation has been updated in
1'330 entries.

In the document deleteac.txt, you will find a list of all accession numbers
which were previously present in TrEMBL, but which have now been deleted from
the database.

REM-TrEMBL (REMaining TrEMBL) contains the entries (45'967) that we do
not want to include in SWISS-PROT. REM-TrEMBL entries have no accession
numbers. This section is organized in five subsections:

   1) Immunoglobulins and T-cell receptors (Immuno.dat)
      Most REM-TrEMBL entries are  immunoglobulins and  T-cell receptors. We
      stopped entering immunoglobulins and T-cell receptors into SWISS-PROT,
      because we only want to keep  the  germ line gene derived translations
      of these proteins in SWISS-PROT and not all known somatic recombinated
      variations of  these proteins.  We would like to  create a specialized
      database  dealing  with  these  sequences  as a further  supplement to
      SWISS-PROT  and  keep  only a  representative  cross-section of  these
      proteins in SWISS-PROT.

   2) Synthetic sequences (Synth.dat)
      Synthetic sequences are another category of data that will not be
      included in SWISS-PROT. Again, we do not want to leave these entries in
      TrEMBL. Ideally one should build a specialized database for artificial
      sequences as a further supplement to SWISS-PROT.

   3) Patent application sequences (Patent.dat)
      A third  subsection consists of  coding sequences captured from patent
      applications.  A thorough  survey of  these  entries  has shown  that
      apart from a rather small minority  (which in  most cases have already
      been integrated in SWISS-PROT), most of these sequences contain either
      erroneous data or concern artificially generated sequences outside the
      scope of SWISS-PROT.

   4) Small fragments (Smalls.dat)
      Another  subsection  consists of fragments  with fewer than eight amino
      acids.

   5) CDS not coding for real proteins (Pseudo.dat)
      The last subsection consists of  CDS translations where we have strong
      evidence to believe that these CDS are not coding for real proteins.


                4. Format Differences Between SWISS-PROT and TrEMBL

The format and conventions used by TrEMBL follow those of SWISS-PROT as
closely as possible. Hence, it is not necessary to produce an additional
user manual and extensive release notes for TrEMBL. The information given
in the SWISS-PROT release notes and user manual is in general valid for
TrEMBL. The differences are mentioned below.

The general structure of an entry is identical in SWISS-PROT and TrEMBL.
The data class used in TrEMBL (in the ID line) is always 'PRELIMINARY',
whereas in SWISS-PROT it is always 'STANDARD'.
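
The data class gives a simple way to tell which database an entry comes
from. The Python sketch below is an illustration only: the function name is
an assumption, and the ID line used in the example is hypothetical (these
notes do not show a complete ID line), with the data class assumed to be the
third whitespace-separated field.

```python
def database_from_id_line(line):
    """Guess the source database from the data class on an ID line."""
    fields = line.split()
    if len(fields) < 3 or fields[0] != "ID":
        raise ValueError("not an ID line: %r" % line)
    # Assumption: the data class is the third field, followed by ';'.
    data_class = fields[2].rstrip(";")
    if data_class == "PRELIMINARY":
        return "TrEMBL"
    if data_class == "STANDARD":
        return "SWISS-PROT"
    raise ValueError("unknown data class: %r" % data_class)


# Hypothetical ID line, for illustration only.
print(database_from_id_line("ID   Q12345      PRELIMINARY;      PRT;   154 AA."))
```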

Differences in line types present in SWISS-PROT and TrEMBL:

The ID line (IDentification):

The entry name used in SP-TrEMBL is the same as the Accession Number of the
entry. The entry name used in REM-TrEMBL is the protein_id tagged to the
corresponding CDS in the EMBL Nucleotide Sequence Database. 'protein_id'
stands for the "Protein IDentification" number. It is a number that you
will find in the feature table of the EMBL nucleotide sequence entries
in a qualifier called "/protein_id" which is tagged to every CDS.

Example:

FT   CDS             339..1514
FT                   /codon_start=1
FT                   /db_xref="PID:g1256015"
FT                   /product="dystrobrevin-epsilon"
FT                   /protein_id="AAC50431.1"


The DT line (DaTe)

The format of the DT lines, which indicate when an entry was created and
updated, is identical to that defined in SWISS-PROT, but the DT lines in
TrEMBL refer to the TrEMBL release. The difference is shown in the example
below.

    DT lines in a SWISS-PROT entry:

    DT   01-JAN-1988 (Rel. 06, Created)
    DT   01-JUL-1989 (Rel. 11, Last sequence update)
    DT   01-AUG-1992 (Rel. 23, Last annotation update)

    DT lines in a TrEMBL entry:

    DT   01-NOV-1996 (TrEMBLrel. 01, Created)
    DT   01-NOV-1996 (TrEMBLrel. 01, Last sequence update)
    DT   01-FEB-1997 (TrEMBLrel. 02, Last annotation update)


                5. Weekly updates of TrEMBL and non-redundant data sets

Weekly cumulative updates of TrEMBL are available by anonymous FTP and
from the EBI SRS server.

We also produce every week a complete non-redundant protein sequence
collection by providing three compressed files (these are in the directory
/pub/databases/sp_tr_nrdb on the EBI FTP server and in databases/sp_tr_nrdb
on the ExPASy server): sprot.dat.Z, trembl.dat.Z and trembl_new.dat.Z.
This set of non-redundant files is especially important for two types of
users:
(i) Managers of similarity search services. They can now provide what is
currently the most comprehensive and non-redundant data set of protein
sequences.
(ii) Anybody wanting to update their full copy of SWISS-PROT + TrEMBL to
their own schedule without having to wait for full releases of SWISS-PROT
or of TrEMBL.
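
Users who maintain their own copy may want to read the three files as one
stream of entries. The Python sketch below shows one way to split flat-file
lines into entries. It assumes that the files have already been
uncompressed and that each entry ends with a line containing only "//", as
in the SWISS-PROT flat-file format; the function name iter_entries and the
sample lines are hypothetical.

```python
def iter_entries(lines):
    """Yield each entry as a list of lines, without the '//' terminator."""
    entry = []
    for line in lines:
        line = line.rstrip("\n")
        if line == "//":
            if entry:
                yield entry
            entry = []
        else:
            entry.append(line)
    if entry:
        # Trailing lines without a terminator are still returned.
        yield entry


# Hypothetical sample standing in for the lines of an uncompressed file.
sample = ["ID   Q00001\n", "AC   Q00001;\n", "//\n",
          "ID   Q00002\n", "AC   Q00002;\n", "//\n"]
print(sum(1 for _ in iter_entries(sample)))  # 2
```

The same function can be applied in turn to the lines of sprot.dat,
trembl.dat and trembl_new.dat, for example with itertools.chain, to walk
through the whole non-redundant collection.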

                6. Access/Data Distribution

FTP server:     ftp.ebi.ac.uk/pub/databases/trembl
SRS server:     http://srs.ebi.ac.uk/

TrEMBL is also available on the SWISS-PROT CD-ROM.
SWISS-PROT + TrEMBL is searchable on the following servers at the EBI:
FASTA3  (http://www2.ebi.ac.uk/fasta3/)
BLAST2  (http://www2.ebi.ac.uk/blast2/)
Bic_sw  (http://www2.ebi.ac.uk/bic_sw/)
Scanps  (http://www2.ebi.ac.uk/scanps/)
SSearch (http://www2.ebi.ac.uk/ssearch3/)

  7. Description of changes made to TrEMBL since release 10.


7.1  Conversion of TrEMBL to mixed case.

In this release we have converted the RL lines to mixed case.

This conversion is described in detail in section 3.1 of the release notes
of SWISS-PROT release 37.0.





  

TrEMBL release 10.0

Published June 1, 1999

                              TrEMBL Release Notes
                              Release 10, May 1999

    EMBL Outstation
    European Bioinformatics Institute (EBI)
    Wellcome Trust Genome Campus
    Hinxton
    Cambridge CB10 1SD
    United Kingdom

    Telephone: (+44 1223) 494 444
    Fax: (+44 1223) 494 468
    Electronic mail address: DATALIB@EBI.AC.UK
    WWW server: http://www.ebi.ac.uk/

    Amos Bairoch
    Swiss Institute of Bioinformatics (SIB)
    Centre Medical Universitaire
    1, rue Michel Servet
    1211 Geneva 4
    Switzerland

    Telephone: (+41 22) 784 40 82
    Fax: (+41 22) 702 55 02
    Electronic mail address: BAIROCH@CMU.UNIGE.CH
    WWW server: http://www.expasy.ch/


    Acknowledgements

    TrEMBL has been prepared by:

    o  Rolf Apweiler, Kirsty Bates, Margaret Biswas, Sergio Contrino,
       Wolfgang Fleischmann, Gill Fraser, Henning Hermjakob, Vivien Junker,
       Youla Karavidopoulou, Fiona Lang,  Minna Lehvaslaiho, Michele Magrane,
       Maria Jesus Martin, Steffen Moeller, Nicoletta Mitaritonna, Nicola Mulder,
       Claire O'Donovan and Eleanor Whitfield
       at the EMBL Outstation - European Bioinformatics Institute (EBI) in
       Hinxton, UK;
    o  Amos Bairoch and Alain Gateau at the Swiss Institute of Bioinformatics
       in Geneva, Switzerland.

    Notes

    This manual and the database it accompanies may be copied and
    redistributed freely, without advance permission, provided
    that this statement is reproduced with each copy.

    Citation

    If you  want to  cite  TrEMBL  in  a  publication  please  use  the
    following reference:


              Bairoch A, and Apweiler R.
              The SWISS-PROT protein sequence data bank and its supplement
              TrEMBL in 1999.
              Nucleic Acids Res. 27:49-54(1999).


                         1. Introduction


TrEMBL is a computer-annotated protein sequence database supplementing the
SWISS-PROT Protein Sequence Data Bank. TrEMBL contains the translations of
all coding sequences (CDS) present in the EMBL Nucleotide Sequence Database
not yet integrated in SWISS-PROT. TrEMBL can be considered as a preliminary
section of SWISS-PROT. For all TrEMBL entries which should finally be
upgraded to the standard SWISS-PROT quality, SWISS-PROT accession numbers
have been assigned.

                        2. Why a supplement to SWISS-PROT?

The ongoing gene sequencing and mapping projects have dramatically
increased the number of protein sequences to be incorporated into SWISS-PROT.
We do not want to dilute the quality standards of SWISS-PROT by incorporating
sequences without proper sequence analysis and annotation, but we do want to
make the sequences available as fast as possible. TrEMBL achieves this second
goal, and is a major step in the process of speeding up subsequent
upgrading of annotation to the standard SWISS-PROT quality.
To address the problem of redundancy, the translations of all coding
sequences (CDS) in the EMBL Nucleotide Sequence Database already included
in SWISS-PROT have been removed from TrEMBL.

We name this supplement TrEMBL (Translation from EMBL), since the tools
used to create the translations of the CDS are based on the program
'trembl' written by Thure Etzold at the EMBL.

                             3. The Release

This TrEMBL release was created from the EMBL Nucleotide Sequence Database
release 58 and contains 244'862 sequence entries, comprising 66'562'800 amino
acids. To minimize redundancy, the translations of all coding sequences (CDS)
in the EMBL Nucleotide Sequence Database already included in SWISS-PROT 37
have been removed from TrEMBL release 10.

TrEMBL is split into two main sections, SP-TrEMBL and REM-TrEMBL:

SP-TrEMBL (SWISS-PROT TrEMBL) contains the entries (201'082) which should be
eventually incorporated into SWISS-PROT. SWISS-PROT accession numbers have
been assigned for all SP-TrEMBL entries.

SP-TrEMBL is organized in subsections:

arc.dat (Archaea):              7408 entries
fun.dat (Fungi):                6679 entries
hum.dat (Human):                8518 entries
inv.dat (Invertebrates):       23653 entries
mam.dat (Other Mammals):        3130 entries
mhc.dat (MHC proteins):         4236 entries
org.dat (Organelles):          16261 entries
phg.dat (Bacteriophages):       1971 entries
pln.dat (Plants):              17352 entries
pro.dat (Prokaryotes):         45992 entries
rod.dat (Rodents):              7480 entries
unc.dat (Unclassified):           44 entries
vrl.dat (Viruses):             53916 entries
vrt.dat (Other Vertebrates):    4442 entries
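
The subsection counts can be checked against the SP-TrEMBL total quoted
above. The following is a minimal sketch, assuming the listing is available
as plain text in the layout shown; the names LISTING, LINE_RE and
parse_listing are illustrative and not part of any TrEMBL tool.

```python
import re

# Subsection listing transcribed from the release 10 notes.
LISTING = """\
arc.dat (Archaea):              7408 entries
fun.dat (Fungi):                6679 entries
hum.dat (Human):                8518 entries
inv.dat (Invertebrates):       23653 entries
mam.dat (Other Mammals):        3130 entries
mhc.dat (MHC proteins):         4236 entries
org.dat (Organelles):          16261 entries
phg.dat (Bacteriophages):       1971 entries
pln.dat (Plants):              17352 entries
pro.dat (Prokaryotes):         45992 entries
rod.dat (Rodents):              7480 entries
unc.dat (Unclassified):           44 entries
vrl.dat (Viruses):             53916 entries
vrt.dat (Other Vertebrates):    4442 entries
"""

# One listing line: file name, label in parentheses, entry count.
LINE_RE = re.compile(r"^(\w+\.dat) \((.+)\):\s+(\d+) entries$")


def parse_listing(text):
    """Return {file name: entry count} for every listing line."""
    counts = {}
    for line in text.splitlines():
        match = LINE_RE.match(line.strip())
        if match:
            counts[match.group(1)] = int(match.group(3))
    return counts


counts = parse_listing(LISTING)
total = sum(counts.values())
print(len(counts), total)  # 14 201082
```

The sum over the fourteen files equals the 201'082 SP-TrEMBL entries stated
for this release.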

25'166 new entries have been integrated in SP-TrEMBL. The sequences of
894 SP-TrEMBL entries have been updated and the annotation has been updated in
57'510 entries.

In the document deleteac.txt, you will find a list of all accession numbers
which were previously present in TrEMBL, but which have now been deleted from
the database.

REM-TrEMBL (REMaining TrEMBL) contains the entries (43'780) that we do
not want to include in SWISS-PROT. REM-TrEMBL entries have no accession
numbers. This section is organized in five subsections:

   1) Immunoglobulins and T-cell receptors (Immuno.dat)
      Most REM-TrEMBL entries are immunoglobulins and T-cell receptors. We
      stopped entering immunoglobulins and T-cell receptors into SWISS-PROT
      because we only want to keep the germ line gene derived translations
      of these proteins in SWISS-PROT, not all known somatically recombined
      variants. We would like to create a specialized database dealing with
      these sequences as a further supplement to SWISS-PROT and keep only a
      representative cross-section of these proteins in SWISS-PROT.

   2) Synthetic sequences (Synth.dat)
      Another category of data that will not be included in SWISS-PROT is
      synthetic sequences. Again, we do not want to leave these entries in
      TrEMBL. Ideally one should build a specialized database for artificial
      sequences as a further supplement to SWISS-PROT.

   3) Patent application sequences (Patent.dat)
      A third subsection consists of coding sequences captured from patent
      applications. A thorough survey of these entries has shown that,
      apart from a rather small minority (which in most cases have already
      been integrated in SWISS-PROT), most of these sequences either contain
      erroneous data or concern artificially generated sequences outside the
      scope of SWISS-PROT.

   4) Small fragments (Smalls.dat)
      Another subsection consists of fragments with fewer than eight amino
      acids.

   5) CDS not coding for real proteins (Pseudo.dat)
      The last subsection consists of CDS translations for which we have
      strong evidence that the CDS do not code for real proteins.


                4. Format Differences Between SWISS-PROT and TrEMBL

The format and conventions used by TrEMBL follow those of SWISS-PROT as
closely as possible. Hence, it is not necessary to produce an additional
user manual and extensive release notes for TrEMBL. The information given
in the SWISS-PROT release notes and user manual is in general valid for
TrEMBL. The differences are mentioned below.

The general structure of an entry is identical in SWISS-PROT and TrEMBL.
The data class used in TrEMBL (in the ID line) is always 'PRELIMINARY',
whereas in SWISS-PROT it is always 'STANDARD'.

Differences in line types present in SWISS-PROT and TrEMBL:

The ID line (IDentification):

The entry name used in SP-TrEMBL is the same as the Accession Number of the
entry. The entry name used in REM-TrEMBL is the protein_id tagged to the
corresponding CDS in the EMBL Nucleotide Sequence Database. 'protein_id'
stands for the "Protein IDentification" number. It is a number that you
will find in the feature table of the EMBL nucleotide sequence entries
in a qualifier called "/protein_id" which is tagged to every CDS.

Example:

FT   CDS             339..1514
FT                   /codon_start=1
FT                   /db_xref="PID:g1256015"
FT                   /product="dystrobrevin-epsilon"
FT                   /protein_id="AAC50431.1"
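
As an illustration of how the REM-TrEMBL entry name could be recovered from
an EMBL feature table, here is a minimal sketch. It assumes the CDS
qualifiers appear one per FT line as in the example above; the function name
protein_ids is hypothetical.

```python
import re

# Feature table lines copied from the example above.
FT_LINES = """\
FT   CDS             339..1514
FT                   /codon_start=1
FT                   /db_xref="PID:g1256015"
FT                   /product="dystrobrevin-epsilon"
FT                   /protein_id="AAC50431.1"
"""

# A /protein_id qualifier on its own FT line; the value is the quoted text.
PROTEIN_ID_RE = re.compile(r'^FT\s+/protein_id="([^"]+)"\s*$', re.MULTILINE)


def protein_ids(feature_table):
    """Return every /protein_id value found in the feature table text."""
    return PROTEIN_ID_RE.findall(feature_table)


print(protein_ids(FT_LINES))  # ['AAC50431.1']
```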


The DT line (DaTe)

The format of the DT lines, which indicate when an entry was created and
updated, is identical to that defined in SWISS-PROT, but the DT lines in
TrEMBL refer to the TrEMBL release. The difference is shown in the example
below.

    DT lines in a SWISS-PROT entry:

    DT   01-JAN-1988 (Rel. 06, Created)
    DT   01-JUL-1989 (Rel. 11, Last sequence update)
    DT   01-AUG-1992 (Rel. 23, Last annotation update)

    DT lines in a TrEMBL entry:

    DT   01-NOV-1996 (TrEMBLrel. 01, Created)
    DT   01-NOV-1996 (TrEMBLrel. 01, Last sequence update)
    DT   01-FEB-1997 (TrEMBLrel. 02, Last annotation update)
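
The two DT variants can be told apart by the release tag inside the
parentheses. The sketch below is one possible way to parse them, assuming
the layout shown in the examples; DTLine and parse_dt are illustrative
names. Matching is case-insensitive so that the upper-case form used before
the mixed-case conversion is accepted as well.

```python
import re
from typing import NamedTuple


class DTLine(NamedTuple):
    date: str       # e.g. "01-FEB-1997"
    database: str   # "SWISS-PROT" or "TrEMBL"
    release: int    # release number of that database
    event: str      # e.g. "Last annotation update"


# "Rel." marks a SWISS-PROT release, "TrEMBLrel." a TrEMBL release.
DT_RE = re.compile(
    r"^DT\s+(\d{2}-[A-Z]{3}-\d{4}) \((TrEMBLrel|Rel)\. (\d+), ([^)]+)\)$",
    re.IGNORECASE,
)


def parse_dt(line):
    """Parse one DT line; raise ValueError if it does not fit the layout."""
    match = DT_RE.match(line.strip())
    if match is None:
        raise ValueError(f"not a DT line: {line!r}")
    date, tag, release, event = match.groups()
    database = "TrEMBL" if tag.lower() == "tremblrel" else "SWISS-PROT"
    return DTLine(date, database, int(release), event)


print(parse_dt("DT   01-FEB-1997 (TrEMBLrel. 02, Last annotation update)"))
print(parse_dt("DT   01-JAN-1988 (Rel. 06, Created)"))
```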

                5. Weekly updates of TrEMBL and non-redundant data sets

Weekly cumulative updates of TrEMBL are available by anonymous FTP and
from the EBI SRS server.

We also produce every week a complete non-redundant protein sequence
collection by providing three compressed files (these are in the directory
/pub/databases/sp_tr_nrdb on the EBI FTP server and in databases/sp_tr_nrdb
on the ExPASy server): sprot.dat.Z, trembl.dat.Z and trembl_new.dat.Z.
This set of non-redundant files is especially important for two types of
users:
(i) Managers of similarity search services. They can now provide what is
currently the most comprehensive and non-redundant data set of protein
sequences.
(ii) Anybody wanting to update their full copy of SWISS-PROT + TrEMBL to
their own schedule without having to wait for full releases of SWISS-PROT
or of TrEMBL.

                6. Access/Data Distribution

FTP server:     ftp.ebi.ac.uk/pub/databases/trembl
SRS server:     http://srs.ebi.ac.uk/

TrEMBL is also available on the SWISS-PROT CD-ROM.
SWISS-PROT + TrEMBL is searchable on the following servers at the EBI:
FASTA3  (http://www2.ebi.ac.uk/fasta3/)
BLAST2  (http://www2.ebi.ac.uk/blast2/)
Bic_sw  (http://www2.ebi.ac.uk/bic_sw/)
Scanps  (http://www2.ebi.ac.uk/scanps/)
SSearch (http://www2.ebi.ac.uk/ssearch3/)

  7. Description of changes made to TrEMBL since release 9.

7.1  Extension of the accession number system.

As all possible numbers in the 'O' series have been used, a new system of
accession numbers has been introduced in TrEMBL release 10. This system is
based on the following format:

    1        2       3          4            5            6
    [O,P,Q]  [0-9]  [A-Z, 0-9]  [A-Z, 0-9]   [A-Z, 0-9]   [0-9]

In other words, accession numbers remain six characters long. Position 1 is
O, P or Q, positions 3, 4 and 5 may each hold a letter or a digit, and
positions 2 and 6 must be digits.

Examples: Q9ZXM8, Q9ZNQ2, Q9YGB9, Q9YH55
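
The format above translates directly into a regular expression. The
following is a minimal sketch of a validity check; ACCESSION_RE and
is_valid_accession are illustrative names, not part of any official tool.

```python
import re

# Position 1: O, P or Q; position 2: digit; positions 3-5: letter or digit;
# position 6: digit.
ACCESSION_RE = re.compile(r"[OPQ][0-9][A-Z0-9]{3}[0-9]")


def is_valid_accession(accession):
    """Return True if the string fits the six-character accession format."""
    return ACCESSION_RE.fullmatch(accession) is not None


for accession in ("Q9ZXM8", "Q9ZNQ2", "Q9YGB9", "Q9YH55", "Q9ZXMA"):
    print(accession, is_valid_accession(accession))
```

The four examples from the release notes pass the check. The last string
fails because position 6 holds a letter.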


7.2  New Protein Sequence Identifier in the cross-references to the EMBL
nucleotide sequence database.

The new protein sequence identifiers introduced in the EMBL nucleotide
sequence database release 58 are now present in all cross-references
to the EMBL nucleotide sequence database and replace the previous PIDs.

The new protein sequence identifier consists of a stable ID portion
(3+5 format: 3 letters followed by 5 digits) plus a version number
after a decimal point.

Example:

DR   EMBL; L37685; AAC41668.1; -.
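
To show how the stable ID and the version can be separated, here is a
minimal sketch that parses a DR line of the form shown above. It assumes
semicolon-separated fields with the protein sequence identifier in the third
field; parse_embl_dr is a hypothetical helper, and the last field of the
line is not interpreted.

```python
import re

# Stable ID (3 letters + 5 digits), a decimal point, then the version number.
PROTEIN_ID_RE = re.compile(r"([A-Z]{3}[0-9]{5})\.([0-9]+)")


def parse_embl_dr(line):
    """Return (nucleotide accession, stable protein ID, version)."""
    if not line.startswith("DR"):
        raise ValueError(f"not a DR line: {line!r}")
    fields = [f.strip() for f in line[2:].strip().rstrip(".").split(";")]
    if len(fields) < 3 or fields[0] != "EMBL":
        raise ValueError(f"not an EMBL cross-reference: {line!r}")
    match = PROTEIN_ID_RE.fullmatch(fields[2])
    if match is None:
        raise ValueError(f"unexpected protein identifier: {fields[2]!r}")
    return fields[1], match.group(1), int(match.group(2))


print(parse_embl_dr("DR   EMBL; L37685; AAC41668.1; -."))
# ('L37685', 'AAC41668', 1)
```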


7.3  Conversion of TrEMBL to mixed case.

In this release we have converted the following line types to mixed
case:

 DT, OS, OG, OC, RL and KW

This conversion is described in detail in section 3.1 of the release notes
of SWISS-PROT release 37.0.


7.4  RL lines.

The format for RL lines for submissions to DNA databases has changed.

Example: in release 9 the format was:

RL   SUBMITTED (APR-1996) TO EMBL/GENBANK/DDBJ DATA BANKS.

In release 10 it is:

RL   Submitted (APR-1996) to the EMBL/GenBank/DDBJ databases.
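
Software that still holds release 9 style entries could map the old
submission line to the new wording. The following is a minimal sketch based
only on the two example lines above; convert_submission_rl is a hypothetical
helper, and any RL line that does not match the old pattern is returned
unchanged.

```python
import re

# Old upper-case form of an RL line for a submission to the DNA databases.
OLD_RL_RE = re.compile(
    r"^RL   SUBMITTED \(([A-Z]{3}-\d{4})\) TO EMBL/GENBANK/DDBJ DATA BANKS\.$"
)


def convert_submission_rl(line):
    """Rewrite a release 9 submission RL line in the release 10 format."""
    match = OLD_RL_RE.match(line)
    if match is None:
        return line  # not an old-style submission line; leave it alone
    return f"RL   Submitted ({match.group(1)}) to the EMBL/GenBank/DDBJ databases."


print(convert_submission_rl(
    "RL   SUBMITTED (APR-1996) TO EMBL/GENBANK/DDBJ DATA BANKS."
))
```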

  

TrEMBL release 9.0

Published January 1, 1999

                              TrEMBL Release Notes
                              Release 9, January 1999

    EMBL Outstation
    European Bioinformatics Institute (EBI)
    Wellcome Trust Genome Campus
    Hinxton
    Cambridge CB10 1SD
    United Kingdom

    Telephone: (+44 1223) 494 444
    Fax: (+44 1223) 494 468
    Electronic mail address: DATALIB@EBI.AC.UK
    WWW server: http://www.ebi.ac.uk/

    Amos Bairoch
    Swiss Institute of Bioinformatics (SIB)
    Centre Medical Universitaire
    1, rue Michel Servet
    1211 Geneva 4
    Switzerland

    Telephone: (+41 22) 784 40 82
    Fax: (+41 22) 702 55 02
    Electronic mail address: BAIROCH@CMU.UNIGE.CH
    WWW server: http://www.expasy.ch/


    Acknowledgements

    TrEMBL has been prepared by:

    o  Rolf Apweiler, Kirsty Bates, Sergio Contrino, Wolfgang Fleischmann,
       Gill Fraser, Henning Hermjakob, Vivien Junker, Youla Karavidopoulou,
       Fiona Lang, Michele Magrane, Maria Jesus Martin, Steffen Moeller,
       Nicoletta Mitaritonna, Claire O'Donovan and Eleanor Whitfield
       at the EMBL Outstation - European Bioinformatics Institute (EBI) in
       Hinxton, UK;
    o  Amos Bairoch and Alain Gateau at the Swiss Institute of Bioinformatics
       in Geneva, Switzerland.

    Notes

    This manual and the database it accompanies may be copied and
    redistributed freely, without advance permission, provided
    that this statement is reproduced with each copy.

    Citation

    If you  want to  cite  TrEMBL  in  a  publication  please  use  the
    following reference:


              Bairoch A, and Apweiler R.
              The SWISS-PROT protein sequence data bank and its supplement
              TrEMBL in 1999.
              Nucleic Acids Res. 27:49-54(1999).


                         1. Introduction


TrEMBL is a computer-annotated protein sequence database supplementing the
SWISS-PROT Protein Sequence Data Bank. TrEMBL contains the translations of
all coding sequences (CDS) present in the EMBL Nucleotide Sequence Database
not yet integrated in SWISS-PROT. TrEMBL can be considered as a preliminary
section of SWISS-PROT. For all TrEMBL entries which should finally be
upgraded to the standard SWISS-PROT quality, SWISS-PROT accession numbers
have been assigned.

                        2. Why a supplement to SWISS-PROT?

The ongoing gene sequencing and mapping projects have dramatically
increased the number of protein sequences to be incorporated into SWISS-PROT.
We do not want to dilute the quality standards of SWISS-PROT by incorporating
sequences without proper sequence analysis and annotation, but we do want to
make the sequences available as fast as possible. TrEMBL achieves this second
goal, and is a major step in the process of speeding up subsequent
upgrading of annotation to the standard SWISS-PROT quality.
To address the problem of redundancy, the translations of all coding
sequences (CDS) in the EMBL Nucleotide Sequence Database already included
in SWISS-PROT have been removed from TrEMBL.

We name this supplement TrEMBL (Translation from EMBL), since the tools
used to create the translations of the CDS are based on the program
'trembl' written by Thure Etzold at the EMBL.

                             3. The Release

The goal of this TrEMBL release is to achieve synchronization with SWISS-PROT
release 37.0. Therefore, all sequence entries present in SWISS-PROT release 37.0
have been removed from TrEMBL release 9. Existing TrEMBL entries have been
further upgraded, and only very few new entries have been incorporated.

TrEMBL release 9 contains 221'422 sequence entries, comprising 59'461'791 amino
acids.

TrEMBL is split into two main sections, SP-TrEMBL and REM-TrEMBL:

SP-TrEMBL (SWISS-PROT TrEMBL) contains the entries (179'066) which should be
eventually incorporated into SWISS-PROT. SWISS-PROT accession numbers have
been assigned for all SP-TrEMBL entries.

SP-TrEMBL is organized in subsections:

arc.dat (Archaea):              7315 entries
fun.dat (Fungi):                5862 entries
hum.dat (Human):                7594 entries
inv.dat (Invertebrates):       22665 entries
mam.dat (Other Mammals):        2792 entries
mhc.dat (MHC proteins):         3981 entries
org.dat (Organelles):          13996 entries
phg.dat (Bacteriophages):       1736 entries
pln.dat (Plants):              14626 entries
pro.dat (Prokaryotes):         39243 entries
rod.dat (Rodents):              6863 entries
unc.dat (Unclassified):           44 entries
vrl.dat (Viruses):             48436 entries
vrt.dat (Other Vertebrates):    3913 entries

407 new entries have been integrated in SP-TrEMBL. The sequences of
979 SP-TrEMBL entries have been updated and the annotation has been updated in
22'224 entries.

In the document deleteac.txt, you will find a list of all accession numbers
which were present in the TrEMBL data bank, but which have now been deleted from
the database.

REM-TrEMBL (REMaining TrEMBL) contains the entries (42'356) that we do
not want to include in SWISS-PROT. REM-TrEMBL entries have no accession
numbers. This section is organized in five subsections:

   1) Immunoglobulins and T-cell receptors (Immuno.dat)
      Most REM-TrEMBL entries are immunoglobulins and T-cell receptors. We
      stopped entering immunoglobulins and T-cell receptors into SWISS-PROT
      because we only want to keep the germ line gene derived translations
      of these proteins in SWISS-PROT, not all known somatically recombined
      variants. We would like to create a specialized database dealing with
      these sequences as a further supplement to SWISS-PROT and keep only a
      representative cross-section of these proteins in SWISS-PROT.

   2) Synthetic sequences (Synth.dat)
      Another category of data that will not be included in SWISS-PROT is
      synthetic sequences. Again, we do not want to leave these entries in
      TrEMBL. Ideally one should build a specialized database for artificial
      sequences as a further supplement to SWISS-PROT.

   3) Patent application sequences (Patent.dat)
      A third subsection consists of coding sequences captured from patent
      applications. A thorough survey of these entries has shown that,
      apart from a rather small minority (which in most cases have already
      been integrated in SWISS-PROT), most of these sequences either contain
      erroneous data or concern artificially generated sequences outside the
      scope of SWISS-PROT.

   4) Small fragments (Smalls.dat)
      Another subsection consists of fragments with fewer than eight amino
      acids.

   5) CDS not coding for real proteins (Pseudo.dat)
      The last subsection consists of CDS translations for which we have
      strong evidence that the CDS do not code for real proteins.


                4. Format Differences Between SWISS-PROT and TrEMBL

The format and conventions used by TrEMBL follow those of SWISS-PROT as
closely as possible. Hence, it is not necessary to produce an additional
user manual and extensive release notes for TrEMBL. The information given
in the SWISS-PROT release notes and user manual is in general valid for
TrEMBL. The differences are mentioned below.

The general structure of an entry is identical in SWISS-PROT and TrEMBL.
The data class used in TrEMBL (in the ID line) is always 'PRELIMINARY',
whereas in SWISS-PROT it is always 'STANDARD'.

Differences in line types present in SWISS-PROT and TrEMBL:

The ID line (IDentification):

The entry name used in SP-TrEMBL is the same as the Accession Number of the
entry. The entry name used in REM-TrEMBL is the PID tagged to the
corresponding CDS in the EMBL Nucleotide Sequence Database. 'PID' stands for
the "Protein IDentification" number. It is a number that you will find in
EMBL nucleotide sequence entries in a qualifier called "/db_xref" which is
tagged to every CDS in the nucleotide database. Example:

   FT   CDS            54..1382
   FT                  /note="ribulose-1,5-bisphosphate carboxylase/
   FT                  oxygenase activase precursor"
   FT                  /db_xref="PID:g1006835"

The DT line (DaTe)

The format of the DT lines, which indicate when an entry was created and
updated, is identical to that defined in SWISS-PROT, but the DT lines in
TrEMBL refer to the TrEMBL release. The difference is shown in the example
below.

    DT lines in a SWISS-PROT entry:

    DT   01-JAN-1988 (REL. 06, CREATED)
    DT   01-JUL-1989 (REL. 11, LAST SEQUENCE UPDATE)
    DT   01-AUG-1992 (REL. 23, LAST ANNOTATION UPDATE)

    DT lines in a TrEMBL entry:

    DT   01-NOV-1996 (TREMBLREL. 01, CREATED)
    DT   01-NOV-1996 (TREMBLREL. 01, LAST SEQUENCE UPDATE)
    DT   01-FEB-1997 (TREMBLREL. 02, LAST ANNOTATION UPDATE)

                5. Weekly updates of TrEMBL and non-redundant data sets

Weekly cumulative updates of TrEMBL are available by anonymous FTP and
from the EBI SRS server.
We also produce every week a complete non-redundant protein sequence
collection by providing three compressed files (these are in the directory
/pub/databases/sp_tr_nrdb on the EBI FTP server and in databases/sp_tr_nrdb
on the ExPASy server): sprot.dat.Z, trembl.dat.Z and trembl_new.dat.Z.
This set of non-redundant files is especially important for two types of
users:
(i) Managers of similarity search services. They can now provide what is
currently the most comprehensive and non-redundant data set of protein
sequences.
(ii) Anybody wanting to update their full copy of SWISS-PROT + TrEMBL to
their own schedule without having to wait for full releases of SWISS-PROT
or of TrEMBL.

                6. Access/Data Distribution

FTP server:     ftp.ebi.ac.uk/pub/databases/trembl
SRS server:     http://srs.ebi.ac.uk/

TrEMBL is also available on the SWISS-PROT CD-ROM.
SWISS-PROT + TrEMBL is searchable on the following servers at the EBI:
FASTA3  (http://www2.ebi.ac.uk/fasta3/)
BLAST2  (http://www2.ebi.ac.uk/blast2/)
Bic_sw  (http://www2.ebi.ac.uk/bic_sw/)
Scanps  (http://www2.ebi.ac.uk/scanps/)
SSearch (http://www2.ebi.ac.uk/ssearch3/)



                7. Planned changes

7.1  Extension of the accession number system

As explained in detail in the SWISS-PROT release 37.0 release notes section 2.5,
we will modify the accession number format in TrEMBL release 10.

7.2  Conversion of TrEMBL to mixed-case characters

In synchronization with SWISS-PROT, we will gradually convert TrEMBL
entries from all 'UPPER CASE' to 'MiXeD CaSe'. This conversion is described
in detail in SWISS-PROT release 37.0 release notes section 3.1.

  

TrEMBL release 8.0

Published November 1, 1998

                              TrEMBL Release Notes
                              Release 8, November 1998

    EMBL Outstation
    European Bioinformatics Institute (EBI)
    Wellcome Trust Genome Campus
    Hinxton
    Cambridge CB10 1SD
    United Kingdom

    Telephone: (+44 1223) 494 444
    Fax: (+44 1223) 494 468
    Electronic mail address: DATALIB@EBI.AC.UK
    WWW server: http://www.ebi.ac.uk/

    Amos Bairoch
    Swiss Institute of Bioinformatics (SIB)
    Centre Medical Universitaire
    1, rue Michel Servet
    1211 Geneva 4
    Switzerland

    Telephone: (+41 22) 784 40 82
    Fax: (+41 22) 702 55 02
    Electronic mail address: BAIROCH@CMU.UNIGE.CH
    WWW server: http://www.expasy.ch/


    Acknowledgements

    TrEMBL has been prepared by:

    o  Rolf Apweiler, Sergio Contrino, Wolfgang Fleischmann, Gill Fraser,
       Henning Hermjakob, Vivien Junker, Stephanie Kappus, Youla
       Karavidopoulou, Fiona Lang, Michele Magrane, Maria Jesus Martin,
       Steffen Moeller, Nicoletta Mitaritonna and Claire O'Donovan at
       the EMBL Outstation - European Bioinformatics Institute (EBI) in
       Hinxton, UK;
    o  Amos Bairoch and Alain Gateau at the Swiss Institute of Bioinformatics
       in Geneva, Switzerland.

    Notes

    This manual and the database it accompanies may be copied and
    redistributed freely, without advance permission, provided
    that this statement is reproduced with each copy.

    Citation

    If you  want to  cite  TrEMBL  in  a  publication  please  use  the
    following reference:

              Bairoch A., and Apweiler R.
              The SWISS-PROT protein sequence data bank and its
              supplement TrEMBL in 1998.
              Nucleic Acids Res. 26:38-42(1998).

                         1. Introduction


TrEMBL is a computer-annotated protein sequence database supplementing the
SWISS-PROT Protein Sequence Data Bank. TrEMBL contains the translations of
all coding sequences (CDS) present in the EMBL Nucleotide Sequence Database
not yet integrated in SWISS-PROT. TrEMBL can be considered as a preliminary
section of SWISS-PROT. For all TrEMBL entries which should finally be
upgraded to the standard SWISS-PROT quality, SWISS-PROT accession numbers
have been assigned.

                        2. Why a supplement to SWISS-PROT?

The ongoing gene sequencing and mapping projects have dramatically
increased the number of protein sequences to be incorporated into SWISS-PROT.
We do not want to dilute the quality standards of SWISS-PROT by incorporating
sequences without proper sequence analysis and annotation, but we do want to
make the sequences available as fast as possible. TrEMBL achieves this second
goal, and is a major step in the process of speeding up subsequent
upgrading of annotation to the standard SWISS-PROT quality.
To address the problem of redundancy, the translations of all coding
sequences (CDS) in the EMBL Nucleotide Sequence Database already included
in SWISS-PROT have been removed from TrEMBL.

We name this supplement TrEMBL (Translation from EMBL), since the tools
used to create the translations of the CDS are based on the program
'trembl' written by Thure Etzold at the EMBL.

                             3. The Release

This TrEMBL release is created from the EMBL Nucleotide Sequence Database
release 56 and contains 224'543 sequence entries, comprising 60'188'661 amino
acids. To minimize redundancy, the translations of all coding sequences (CDS)
in the EMBL Nucleotide Sequence Database already included in SWISS-PROT 36
have been removed from TrEMBL release 8.

TrEMBL is split into two main sections, SP-TrEMBL and REM-TrEMBL:

SP-TrEMBL (SWISS-PROT TrEMBL) contains the entries (180'763) which should be
eventually incorporated into SWISS-PROT. SWISS-PROT accession numbers have
been assigned for all SP-TrEMBL entries.

SP-TrEMBL is organized in subsections:

arc.dat (Archaea):              7397 entries
fun.dat (Fungi):                6007 entries
hum.dat (Human):                7688 entries
inv.dat (Invertebrates):       22829 entries
mam.dat (Other Mammals):        2892 entries
mhc.dat (MHC proteins):         3985 entries
org.dat (Organelles):          14230 entries
phg.dat (Bacteriophages):       1824 entries
pln.dat (Plants):              14749 entries
pro.dat (Prokaryotes):         39777 entries
rod.dat (Rodents):              6923 entries
unc.dat (Unclassified):           44 entries
vrl.dat (Viruses):             48472 entries
vrt.dat (Other Vertebrates):    3946 entries

20'082 new entries have been integrated in SP-TrEMBL. 922 sequences of
SP-TrEMBL entries have been updated and in 124'261 cases the annotation
has been updated.
In the document deleteac.txt you will find a list of all accession numbers
which appeared earlier in the TrEMBL data bank, but have been deleted from the
database.

REM-TrEMBL (REMaining TrEMBL) contains the entries (43'780) that we do
not want to include in SWISS-PROT. REM-TrEMBL entries have no accession
numbers. This section is organized in five subsections:

   1) Immunoglobulins and T-cell receptors (Immuno.dat)
      Most REM-TrEMBL entries are immunoglobulins and T-cell receptors. We
      stopped entering immunoglobulins and T-cell receptors into SWISS-PROT
      because we only want to keep the germ line gene derived translations
      of these proteins in SWISS-PROT, not all known somatically recombined
      variants. We would like to create a specialized database dealing with
      these sequences as a further supplement to SWISS-PROT and keep only a
      representative cross-section of these proteins in SWISS-PROT.

   2) Synthetic sequences (Synth.dat)
      Another category of data that will not be included in SWISS-PROT is
      synthetic sequences. Again, we do not want to leave these entries in
      TrEMBL. Ideally one should build a specialized database for artificial
      sequences as a further supplement to SWISS-PROT.

   3) Patent application sequences (Patent.dat)
      A third subsection consists of coding sequences captured from patent
      applications. A thorough survey of these entries has shown that,
      apart from a rather small minority (which in most cases have already
      been integrated in SWISS-PROT), most of these sequences either contain
      erroneous data or concern artificially generated sequences outside the
      scope of SWISS-PROT.

   4) Small fragments (Smalls.dat)
      Another subsection consists of fragments with fewer than eight amino
      acids.

   5) CDS not coding for real proteins (Pseudo.dat)
      The last subsection consists of CDS translations for which we have
      strong evidence that the CDS do not code for real proteins.


                4. Format Differences Between SWISS-PROT and TrEMBL

The format and conventions used by TrEMBL follow those of SWISS-PROT as
closely as possible. Hence, it is not necessary to produce an additional
user manual and extensive release notes for TrEMBL. The information given
in the SWISS-PROT release notes and user manual is in general valid for
TrEMBL. The differences are mentioned below.

The general structure of an entry is identical in SWISS-PROT and TrEMBL.
The data class used in TrEMBL (in the ID line) is always 'PRELIMINARY',
whereas in SWISS-PROT it is always 'STANDARD'.

Differences in line types present in SWISS-PROT and TrEMBL:

The ID line (IDentification):

The entry name used in SP-TrEMBL is the same as the Accession Number of the
entry. The entry name used in REM-TrEMBL is the PID tagged to the
corresponding CDS in the EMBL Nucleotide Sequence Database. 'PID' stands for
the "Protein IDentification" number. It is a number that you will find in
EMBL nucleotide sequence entries in a qualifier called "/db_xref" which is
tagged to every CDS in the nucleotide database. Example:

   FT   CDS            54..1382
   FT                  /note="ribulose-1,5-bisphosphate carboxylase/
   FT                  oxygenase activase precursor"
   FT                  /db_xref="PID:g1006835"

The DT line (DaTe)

The format of the DT lines, which indicate when an entry was created and
updated, is identical to that defined in SWISS-PROT, but the DT lines in
TrEMBL refer to the TrEMBL release. The difference is shown in the example
below.

    DT lines in a SWISS-PROT entry:

    DT   01-JAN-1988 (REL. 06, CREATED)
    DT   01-JUL-1989 (REL. 11, LAST SEQUENCE UPDATE)
    DT   01-AUG-1992 (REL. 23, LAST ANNOTATION UPDATE)

    DT lines in a TrEMBL entry:

    DT   01-NOV-1996 (TREMBLREL. 01, CREATED)
    DT   01-NOV-1996 (TREMBLREL. 01, LAST SEQUENCE UPDATE)
    DT   01-FEB-1997 (TREMBLREL. 02, LAST ANNOTATION UPDATE)

                5. Weekly updates of TrEMBL and non-redundant data sets

Weekly cumulative updates of TrEMBL are available by anonymous FTP and
from the EBI SRS server.
We also produce every week a complete non-redundant protein sequence
collection by providing three compressed files (these are in the directory
/pub/databases/sp_tr_nrdb on the EBI FTP server and in databases/sp_tr_nrdb
on the ExPASy server): sprot.dat.Z, trembl.dat.Z and trembl_new.dat.Z.
This set of non-redundant files is especially important for two types of
users:
(i) Managers of similarity search services. They can now provide what is
currently the most comprehensive and non-redundant data set of protein
sequences.
(ii) Anybody wanting to update their full copy of SWISS-PROT + TrEMBL to
their own schedule without having to wait for full releases of SWISS-PROT
or of TrEMBL.

                6. Access/Data Distribution

FTP server:     ftp.ebi.ac.uk/pub/databases/trembl
SRS server:     http://srs.ebi.ac.uk/

TrEMBL is also available on the SWISS-PROT CD-ROM.
SWISS-PROT + TrEMBL is searchable on the following servers at the EBI:
FASTA3  (http://www2.ebi.ac.uk/fasta3/)
BLAST2  (http://www2.ebi.ac.uk/blast2/)
Bic_sw  (http://www2.ebi.ac.uk/bic_sw/)
Scanps  (http://www2.ebi.ac.uk/scanps/)
SSearch (http://www2.ebi.ac.uk/ssearch3/)


  7. Description of changes made to TrEMBL since release 7.

7.1  Switch to the NCBI taxonomy

To standardize the taxonomies used by different databases we have switched
to the NCBI taxonomy, which is already used by the DDBJ/EMBL/GenBank
nucleotide sequence databases.

7.2  Introduction of RT lines

We have introduced a new line type, the RT (Reference Title) line.
This optional line is placed between the RA and RL lines. The RT line gives the
title of the paper (or other work) as accurately as possible given the limitations
of the computer character set. The form used is that which is used in a
citation rather than that displayed at the top of the published paper. For
instance, where journals capitalize major title words this is not preserved.
The title is enclosed in double quotes, and may continue over several
lines if necessary. Note these lines are in mixed case. The title lines are
terminated by a semicolon. An example of the use of RT lines is shown below:

 RT   "Sequence analysis of the genome of the unicellular cyanobacterium
 RT   Synechocystis sp. strain PCC6803. I. Sequence features in the 1 Mb
 RT   region from map positions 64% to 92% of the genome.";
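
The rules above (double quotes around the title, continuation over several
lines, a terminating semicolon) can be illustrated with a minimal sketch that
joins RT lines back into one title string. The function name join_rt_lines is
our own and is not part of any TrEMBL or SWISS-PROT tool; the sketch assumes
that continuation lines are joined with a single space.

```python
def join_rt_lines(lines):
    """Join consecutive RT lines into a single title string."""
    parts = []
    for line in lines:
        stripped = line.strip()
        if stripped.startswith("RT"):
            # Drop the two-letter line code and the padding that follows it.
            parts.append(stripped[2:].strip())
    title = " ".join(parts)
    # The title lines are terminated by a semicolon.
    if title.endswith(";"):
        title = title[:-1]
    # The title itself is enclosed in double quotes.
    if len(title) >= 2 and title[0] == '"' and title[-1] == '"':
        title = title[1:-1]
    return title


example = [
    'RT   "Sequence analysis of the genome of the unicellular cyanobacterium',
    'RT   Synechocystis sp. strain PCC6803. I. Sequence features in the 1 Mb',
    'RT   region from map positions 64% to 92% of the genome.";',
]
print(join_rt_lines(example))
```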


                8. Planned changes

8.1  Extension of the accession number system

As explained in detail in the SWISS-PROT release notes under 2.3, we will
extend the accession number system when the 'O' series is used up. This can
be anticipated for January 1999.
  

TrEMBL release 7.0

Published August 1, 1998

                              TrEMBL Release Notes
                              Release 7, August 1998

    EMBL Outstation
    European Bioinformatics Institute (EBI)
    Wellcome Trust Genome Campus
    Hinxton
    Cambridge CB10 1SD
    United Kingdom

    Telephone: (+44 1223) 494 444
    Fax: (+44 1223) 494 468
    Electronic mail address: DATALIB@EBI.AC.UK
    WWW server: http://www.ebi.ac.uk/

    Amos Bairoch
    Swiss Institute of Bioinformatics (SIB)
    Centre Medical Universitaire
    1, rue Michel Servet
    1211 Geneva 4
    Switzerland

    Telephone: (+41 22) 784 40 82
    Fax: (+41 22) 702 55 02
    Electronic mail address: BAIROCH@CMU.UNIGE.CH
    WWW server: http://www.expasy.ch/


    Acknowledgements

    TrEMBL has been prepared by:

    o  Rolf Apweiler, Sergio Contrino, Wolfgang Fleischmann, Henning
       Hermjakob, Vivien Junker, Stephanie Kappus, Fiona Lang, Michele
       Magrane, Maria Jesus Martin, Steffen Moeller, Nicoletta
       Mitaritonna and Claire O'Donovan at the EMBL Outstation
       - European Bioinformatics Institute (EBI) in Hinxton, UK;
    o  Amos Bairoch and Alain Gateau at the Swiss Institute of Bioinformatics
       in Geneva, Switzerland.

    Notes

    This manual and the database it accompanies may be copied and
    redistributed freely, without advance permission, provided
    that this statement is reproduced with each copy.

    Citation

    If you  want to  cite  TrEMBL  in  a  publication  please  use  the
    following reference:

              Bairoch A., and Apweiler R.
              The SWISS-PROT protein sequence data bank and its
              supplement TrEMBL in 1998.
              Nucleic Acids Res. 26:38-42(1998).



                         1. Introduction


TrEMBL is a computer-annotated protein sequence database supplementing the
SWISS-PROT Protein Sequence Data Bank. TrEMBL contains the translations of
all coding sequences (CDS) present in the EMBL Nucleotide Sequence Database
not yet integrated in SWISS-PROT. TrEMBL can be considered as a preliminary
section of SWISS-PROT. For all TrEMBL entries which should finally be
upgraded to the standard SWISS-PROT quality, SWISS-PROT accession numbers
have been assigned.


                        2. Why a supplement to SWISS-PROT?

The ongoing gene sequencing and mapping projects have dramatically
increased the number of protein sequences to be incorporated into SWISS-PROT.
We do not want to dilute the quality standards of SWISS-PROT by incorporating
sequences without proper sequence analysis and annotation, but we do want to
make the sequences available as fast as possible. TrEMBL achieves this second
goal, and is a major step in the process of speeding up subsequent
upgrading of annotation to the standard SWISS-PROT quality.
To address the problem of redundancy, the translations of all coding
sequences (CDS) in the EMBL Nucleotide Sequence Database already included
in SWISS-PROT have been removed from TrEMBL.

We name this supplement TrEMBL (Translation from EMBL), since the tools
used to create the translations of the CDS are based on the program
'trembl' written by Thure Etzold at the EMBL.


                             3. The Release

This TrEMBL release is created from the EMBL Nucleotide Sequence Database
release 55 and contains 193'860 sequence entries, comprising 53'601'062 amino
acids. To minimize redundancy, the translations of all coding sequences (CDS)
in the EMBL Nucleotide Sequence Database already included in SWISS-PROT 36
have been removed from TrEMBL release 7.

TrEMBL is split into two main sections, SP-TrEMBL and REM-TrEMBL:

SP-TrEMBL (SWISS-PROT TrEMBL) contains the entries (165'420) which should
eventually be incorporated into SWISS-PROT. SWISS-PROT accession numbers have
been assigned for all SP-TrEMBL entries.

SP-TrEMBL is organized in subsections:

arc.dat (Archaea):              7434 entries
fun.dat (Fungi):                5261 entries
hum.dat (Human):                6976 entries
inv.dat (Invertebrates):       21991 entries
mam.dat (Other Mammals):        2684 entries
mhc.dat (MHC proteins):         3601 entries
org.dat (Organelles):          12699 entries
phg.dat (Bacteriophages):       1604 entries
pln.dat (Plants):              12668 entries
pro.dat (Prokaryotes):         35857 entries
rod.dat (Rodents):              6346 entries
unc.dat (Unclassified):           88 entries
vrl.dat (Viruses):             44561 entries
vrt.dat (Other Vertebrates):    3650 entries

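As a quick consistency check, the subsection counts listed above can be
summed and compared with the totals quoted in this section. The sketch below
is only an illustration; the variable names are our own. The difference
between the release total and the SP-TrEMBL total should equal the number of
REM-TrEMBL entries reported in these notes (28'440).

```python
# Entry counts per SP-TrEMBL subsection, copied from the list above.
sp_trembl_counts = {
    "arc.dat": 7434,
    "fun.dat": 5261,
    "hum.dat": 6976,
    "inv.dat": 21991,
    "mam.dat": 2684,
    "mhc.dat": 3601,
    "org.dat": 12699,
    "phg.dat": 1604,
    "pln.dat": 12668,
    "pro.dat": 35857,
    "rod.dat": 6346,
    "unc.dat": 88,
    "vrl.dat": 44561,
    "vrt.dat": 3650,
}

release_total = 193860  # total sequence entries stated for TrEMBL release 7

sp_total = sum(sp_trembl_counts.values())
rem_total = release_total - sp_total

print(sp_total)   # 165420, the SP-TrEMBL count quoted above
print(rem_total)  # 28440, the REM-TrEMBL count quoted in these notes
```
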
16'379 new entries have been integrated in SP-TrEMBL. 580 sequences of
SP-TrEMBL entries have been updated and in 54'681 cases the annotation
has been updated.

In the document deleteac.txt you will find a list of all accession numbers
which appeared in the TrEMBL data bank, but have been deleted from the
database.

REM-TrEMBL (REMaining TrEMBL) contains the entries (28'440) that we do
not want to include in SWISS-PROT. REM-TrEMBL entries have no accession
numbers. This section is organized in five subsections:

   1) Immunoglobulins and T-cell receptors (Immuno.dat)
      Most REM-TrEMBL entries are immunoglobulins and T-cell receptors. We
      stopped entering immunoglobulins and T-cell receptors into SWISS-PROT,
      because we only want to keep the germ line gene derived translations
      of these proteins in SWISS-PROT and not all known somatically
      recombined variations of these proteins. We would like to create a
      specialized database dealing with these sequences as a further
      supplement to SWISS-PROT and keep only a representative cross-section
      of these proteins in SWISS-PROT.

   2) Synthetic sequences (Synth.dat)
      Synthetic sequences are another category of data that will not be
      included in SWISS-PROT. Again, we do not want to leave these entries in
      TrEMBL. Ideally one should build a specialized database for artificial
      sequences as a further supplement to SWISS-PROT.

   3) Patent application sequences (Patent.dat)
      A third subsection consists of coding sequences captured from patent
      applications. A thorough survey of these entries has shown that,
      apart from a rather small minority (which in most cases has already
      been integrated in SWISS-PROT), most of these sequences either contain
      erroneous data or concern artificially generated sequences outside the
      scope of SWISS-PROT.

   4) Small fragments (Smalls.dat)
      Another subsection consists of fragments with fewer than eight amino
      acids.

   5) CDS not coding for real proteins (Pseudo.dat)
      The last subsection consists of  CDS translations where we have strong
      evidence to believe that these CDS are not coding for real proteins.


                4. Format Differences Between SWISS-PROT and TrEMBL

The format and conventions used by TrEMBL follow those of SWISS-PROT as
closely as possible. Hence, it is not necessary to produce an additional
user manual and extensive release notes for TrEMBL. The information given
in the SWISS-PROT release notes and user manual is in general valid for
TrEMBL. The differences are mentioned below.

The general structure of an entry is identical in SWISS-PROT and TrEMBL.
The data class used in TrEMBL (in the ID line) is always 'PRELIMINARY',
whereas in SWISS-PROT it is always 'STANDARD'.
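
A program that reads a mixed collection of entries could use the data class to
tell the two databases apart. The sketch below is a minimal illustration, not
an official parser. It assumes that the data class is the third
whitespace-separated field of the ID line and is followed by a semicolon, as
in the SWISS-PROT flat-file format of the time; this layout is not spelled out
in these notes. The entry names in the sample lines are hypothetical.

```python
def data_class(id_line):
    """Return the data class (e.g. 'STANDARD') from an ID line."""
    fields = id_line.split()
    if len(fields) < 3 or fields[0] != "ID":
        raise ValueError("not an ID line: %r" % id_line)
    # Assumed layout: ID  <entry name>  <data class>;  PRT;  <length> AA.
    return fields[2].rstrip(";")


def is_trembl_entry(id_line):
    """True if the ID line carries the TrEMBL data class 'PRELIMINARY'."""
    return data_class(id_line) == "PRELIMINARY"


# Hypothetical ID lines, for illustration only.
trembl_id = "ID   O99999      PRELIMINARY;      PRT;   120 AA."
sprot_id = "ID   EXAMPLE_HUMAN  STANDARD;      PRT;   250 AA."

print(is_trembl_entry(trembl_id))
print(is_trembl_entry(sprot_id))
```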

Differences in line types present in SWISS-PROT and TrEMBL:

The ID line (IDentification):

The entry name used in SP-TrEMBL is the same as the Accession Number of the
entry. The entry name used in REM-TrEMBL is the PID tagged to the
corresponding CDS in the EMBL Nucleotide Sequence Database. 'PID' stands for
the "Protein IDentification" number. It is a number that you will find in
EMBL nucleotide sequence entries in a qualifier called "/db_xref" which is
tagged to every CDS in the nucleotide database. Example:

   FT   CDS            54..1382
   FT                  /note="ribulose-1,5-bisphosphate carboxylase/
   FT                  oxygenase activase precursor"
   FT                  /db_xref="PID:g1006835"
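
As an illustration of how the PID could be recovered from such feature table
lines, the following sketch searches each line for a /db_xref qualifier whose
value starts with "PID:". The function and pattern names are our own, and the
sketch assumes that the qualifier fits on a single FT line, as in the example
above.

```python
import re

# Matches /db_xref="PID:<identifier>" and captures the identifier.
PID_PATTERN = re.compile(r'/db_xref="PID:([^"]+)"')


def extract_pids(feature_lines):
    """Return all PIDs found in a list of EMBL feature table (FT) lines."""
    pids = []
    for line in feature_lines:
        match = PID_PATTERN.search(line)
        if match:
            pids.append(match.group(1))
    return pids


example = [
    'FT   CDS            54..1382',
    'FT                  /note="ribulose-1,5-bisphosphate carboxylase/',
    'FT                  oxygenase activase precursor"',
    'FT                  /db_xref="PID:g1006835"',
]
print(extract_pids(example))
```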

The DT line (DaTe)

The format of the DT lines, which indicate when an entry was created and
updated, is identical to that defined in SWISS-PROT, but the DT lines in
TrEMBL refer to the TrEMBL release. The difference is shown in the example
below.

    DT lines in a SWISS-PROT entry:

    DT   01-JAN-1988 (REL. 06, CREATED)
    DT   01-JUL-1989 (REL. 11, LAST SEQUENCE UPDATE)
    DT   01-AUG-1992 (REL. 23, LAST ANNOTATION UPDATE)

    DT lines in a TrEMBL entry:

    DT   01-NOV-1996 (TREMBLREL. 01, CREATED)
    DT   01-NOV-1996 (TREMBLREL. 01, LAST SEQUENCE UPDATE)
    DT   01-FEB-1997 (TREMBLREL. 02, LAST ANNOTATION UPDATE)
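
The two DT variants differ only in the release prefix ("REL." versus
"TREMBLREL."), so a single pattern can handle both. The sketch below is an
illustration based solely on the six example lines above; the function name
and the returned dictionary keys are our own choices, and real entries may
contain forms that this pattern does not cover.

```python
import re
from datetime import date

MONTHS = {"JAN": 1, "FEB": 2, "MAR": 3, "APR": 4, "MAY": 5, "JUN": 6,
          "JUL": 7, "AUG": 8, "SEP": 9, "OCT": 10, "NOV": 11, "DEC": 12}

# DT   DD-MMM-YYYY (REL. nn, EVENT)  or  DT   DD-MMM-YYYY (TREMBLREL. nn, EVENT)
DT_PATTERN = re.compile(
    r"^DT\s+(\d{2})-([A-Z]{3})-(\d{4}) \((TREMBLREL|REL)\. (\d+), ([A-Z ]+)\)$"
)


def parse_dt_line(line):
    """Parse one DT line into its date, database, release and event."""
    match = DT_PATTERN.match(line.strip())
    if match is None:
        raise ValueError("unrecognized DT line: %r" % line)
    day, month, year, prefix, release, event = match.groups()
    return {
        "date": date(int(year), MONTHS[month], int(day)),
        "database": "TrEMBL" if prefix == "TREMBLREL" else "SWISS-PROT",
        "release": int(release),
        "event": event,
    }


print(parse_dt_line("DT   01-JUL-1989 (REL. 11, LAST SEQUENCE UPDATE)"))
print(parse_dt_line("DT   01-FEB-1997 (TREMBLREL. 02, LAST ANNOTATION UPDATE)"))
```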


                5. Weekly updates of TrEMBL and non-redundant data sets

Weekly cumulative updates of TrEMBL are available by anonymous FTP and
from the EBI SRS server.
We also produce every week a complete non-redundant protein sequence
collection by providing three compressed files (these are in the directory
/pub/databases/sp_tr_nrdb on the EBI FTP server and in databases/sp_tr_nrdb
on the ExPASy server): sprot.dat.Z, trembl.dat.Z and trembl_new.dat.Z.
This set of non-redundant files is especially important for two types of
users:

(i) Managers of similarity search services. They can now provide what is
currently the most comprehensive and non-redundant data set of protein
sequences.
(ii) Anybody wanting to update their full copy of SWISS-PROT + TrEMBL to
their own schedule without having to wait for full releases of SWISS-PROT
or of TrEMBL.
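
Users who maintain their own copy may want to count or iterate over the
entries in these files once they have been uncompressed. The following sketch
assumes that each entry ends with a line containing only "//", which is the
SWISS-PROT flat-file convention but is not stated in these notes. The function
name iter_entries and the sample lines (including the entry names) are
hypothetical.

```python
def iter_entries(lines):
    """Yield each entry as a list of lines, without the '//' terminator."""
    entry = []
    for line in lines:
        line = line.rstrip("\n")
        if line.strip() == "//":
            if entry:
                yield entry
            entry = []
        else:
            entry.append(line)
    if entry:
        # Tolerate a final entry that lacks its terminator line.
        yield entry


# Hypothetical two-entry sample, for illustration only.
sample = [
    "ID   O99999      PRELIMINARY;      PRT;   120 AA.",
    "DT   01-NOV-1996 (TREMBLREL. 01, CREATED)",
    "//",
    "ID   O99998      PRELIMINARY;      PRT;    85 AA.",
    "//",
]
entries = list(iter_entries(sample))
print(len(entries))
```

Because the function accepts any iterable of lines, an open file handle for
an uncompressed local copy of one of the three files could be passed to it
directly.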


                6. Access/Data Distribution

FTP server:     ftp.ebi.ac.uk/pub/databases/trembl
SRS server:     http://srs.ebi.ac.uk/

TrEMBL is also available on the SWISS-PROT CD-ROM.

SWISS-PROT + TrEMBL is searchable on the following servers at the EBI:

FASTA3  (http://www2.ebi.ac.uk/fasta3/)
BLAST2  (http://www2.ebi.ac.uk/blast2/)
Bic_sw  (http://www2.ebi.ac.uk/bic_sw/)
Scanps  (http://www2.ebi.ac.uk/scanps/)
SSearch (http://www2.ebi.ac.uk/ssearch3/)



                7. Planned changes

7.1  Extension of the accession number system

As explained in detail in the SWISS-PROT release notes under 2.3, we will
extend the accession number system when we have used up the 'O' series
of accession numbers. This is anticipated for January 1999.

7.2  Switch to the NCBI taxonomy

To standardize the taxonomies used by different databases, we will change
our taxonomy with SWISS-PROT release 37 and TrEMBL release 8. We
will switch to the NCBI taxonomy, which is already used as the common
taxonomy by the DDBJ/EMBL/GenBank nucleotide sequence databases.

7.3  Introduction of RT lines

With SWISS-PROT release 37 and TrEMBL release 8 we will introduce
a new line type, the RT (Reference Title) line. This optional line will be
placed between the RA and RL lines. The RT line gives the title of the paper
(or other work) as exactly as possible given the limitations of the computer
character set. The form which will be used is that which would be used in a
citation rather than that displayed at the top of the published paper. For
instance, where journals capitalize major title words this is not preserved.
The title is enclosed in double quotes, and may be continued over several
lines as necessary. The title lines are terminated by a semicolon. An
example of the use of RT lines is shown below:

 RT   "Sequence analysis of the genome of the unicellular cyanobacterium
 RT   Synechocystis sp. strain PCC6803. I. Sequence features in the 1 Mb
 RT   region from map positions 64% to 92% of the genome.";

  

TrEMBL release 6.0

Published June 1, 1998

                            TREMBL Release Notes
                            Release 6, June 1998


        EMBL Outstation
        European Bioinformatics Institute (EBI)
        Wellcome Trust Genome Campus
        Hinxton
        Cambridge CB10 1SD
        United Kingdom

        Telephone: (+44 1223) 494 400
        Fax: (+44 1223) 494 468
        Electronic mail address: DATALIB@EBI.AC.UK
        WWW server: http://www.ebi.ac.uk/

        Amos Bairoch
        Medical Biochemistry Department
        Centre Medical Universitaire
        1211 Geneva 4
        Switzerland

        Telephone: (+41 22) 784 40 82
        Fax: (+41 22) 702 55 02
        Electronic mail address: BAIROCH@CMU.UNIGE.CH
        WWW server: http://www.expasy.ch/


  INTRODUCTION
  ============

  TREMBL is a protein sequence database supplementing the SWISS-PROT
  Protein Sequence Data Bank. TREMBL contains the translations of all
  coding sequences (CDS) present in the EMBL Nucleotide Sequence
  Database not yet integrated in SWISS-PROT. TREMBL can be considered
  as a preliminary section of SWISS-PROT. For all TREMBL entries
  which should finally be upgraded to the standard SWISS-PROT
  quality, SWISS-PROT accession numbers have been assigned.


  RELEASE 6.0 OF TREMBL
  =====================

  This TREMBL release is created from the EMBL Nucleotide Sequence Database
  release 54 and contains 177'757 sequence entries, comprising 48'796'878
  amino acids.

  TREMBL is split into two main sections, SP-TREMBL and REM-TREMBL:

  SP-TREMBL (SWISS-PROT TREMBL) contains the entries (150'329) which should
  eventually be incorporated into SWISS-PROT. SWISS-PROT accession numbers have
  been assigned for all SP-TREMBL entries.

  SP-TREMBL is organized in subsections:

  arc.dat (Archaea):              5556 entries  (new section)
  fun.dat (Fungi):                4781 entries
  hum.dat (Human):                6312 entries
  inv.dat (Invertebrates):       20364 entries
  mam.dat (Other Mammals):        2450 entries
  mhc.dat (MHC proteins):         3524 entries
  org.dat (Organelles):          11685 entries
  phg.dat (Bacteriophages):       1240 entries
  pln.dat (Plants):              11217 entries
  pro.dat (Prokaryotes):         32000 entries
  rod.dat (Rodents):              5913 entries
  unc.dat (Unclassified):           88 entries
  vrl.dat (Viruses):             41882 entries
  vrt.dat (Other Vertebrates):    3317 entries

  REM-TREMBL (REMaining TREMBL) contains the entries (27'428) that we do
  not want to include in SWISS-PROT.


  WEEKLY UPDATES OF TREMBL AND NON-REDUNDANT DATA SETS
  ====================================================

  Weekly cumulative updates of TREMBL are available by anonymous FTP and
  from the EBI SRS server.

  We also produce every week a complete non-redundant protein sequence
  collection by providing three compressed files (these are in the directory
  /pub/databases/sp_tr_nrdb on the EBI FTP server):
  sprot.dat.Z, trembl.dat.Z and trembl_new.dat.Z.


  ACCESS/DATA DISTRIBUTION
  ========================

  FTP server:     ftp.ebi.ac.uk/pub/databases/trembl
  SRS server:     http://srs.ebi.ac.uk:5000/

  TREMBL is also available on the SWISS-PROT CD-ROM.
  SWISS-PROT + TREMBL is searchable on the FASTA3, BLAST2 and Bic_sw servers
  of the EBI.

  TREMBL HAS BEEN PREPARED BY:
  ============================

  Rolf Apweiler, Christian Desaintes, Sergio Contrino, Wolfgang Fleischmann,
  Henning Hermjakob, Vivien Junker, Stephanie Kappus, Fiona Lang, Michele
  Magrane, Maria Jesus Martin, Steffen Moeller, Nicoletta Mitaritonna and
  Claire O'Donovan at the EMBL Outstation - European Bioinformatics Institute
  (EBI) in Hinxton, UK;
  Amos Bairoch and Alain Gateau at the Medical Biochemistry Department of
  the University of Geneva, Switzerland.

  Notes
  =====

  This manual and the database it accompanies may be copied and redistributed
  freely, without advance permission, provided that this statement is reproduced
  with each copy.


  Citation
  ========

 If you  want to  cite  TREMBL  in  a  publication  please  use  the following
 reference:

           Bairoch A., and Apweiler R.
           The SWISS-PROT protein sequence data bank and its
           supplement TREMBL in 1998.
           Nucleic Acids Res. 26:38-42(1998).

  

TrEMBL release 5.0

Published January 29, 1998

                              TREMBL Release Notes
                              Release 5, January 1998


    EMBL Outstation
    European Bioinformatics Institute (EBI)
    Wellcome Trust Genome Campus
    Hinxton
    Cambridge CB10 1SD
    United Kingdom

    Telephone: (+44 1223) 494 400
    Fax: (+44 1223) 494 468
    Electronic mail address: DATALIB@EBI.AC.UK
    WWW server: http://www.ebi.ac.uk/

    Amos Bairoch
    Medical Biochemistry Department
    Centre Medical Universitaire
    1211 Geneva 4
    Switzerland

    Telephone: (+41 22) 784 40 82
    Fax: (+41 22) 702 55 02
    Electronic mail address: BAIROCH@CMU.UNIGE.CH
    WWW server: http://www.expasy.ch/

    Acknowledgements

    TREMBL has been prepared by:

    o  Rolf Apweiler, Sergio Contrino, Wolfgang Fleischmann, Henning
       Hermjakob, Vivien Junker, Stephanie Kappus, Fiona Lang, Michele
       Magrane, Maria Jesus Martin, Steffen Moeller, Nicoletta Mitaritonna
       and Claire O'Donovan at the EMBL Outstation - European Bioinformatics
       Institute (EBI) in Hinxton, UK;
    o  Amos Bairoch and Alain Gateau at the Medical Biochemistry Department
       of the University of Geneva, Switzerland.

    Notes

    This manual and the database it accompanies may be copied and
    redistributed freely, without advance permission, provided
    that this statement is reproduced with each copy.

    Citation

    If you  want to  cite  TREMBL  in  a  publication  please  use  the
    following reference:

              Bairoch A., and Apweiler R.
              The SWISS-PROT protein sequence data bank and its
              supplement TREMBL in 1998.
              Nucleic Acids Res. 26:38-42(1998).

                         1. Introduction


TREMBL is a computer-annotated protein sequence database supplementing the
SWISS-PROT Protein Sequence Data Bank. TREMBL contains the translations of
all coding sequences (CDS) present in the EMBL Nucleotide Sequence Database
not yet integrated in SWISS-PROT. TREMBL can be considered as a preliminary
section of SWISS-PROT. For all TREMBL entries which should finally be
upgraded to the standard SWISS-PROT quality, SWISS-PROT accession numbers
have been assigned.

                        2. Why a supplement to SWISS-PROT?

The ongoing gene sequencing and mapping projects have dramatically
increased the number of protein sequences to be incorporated into SWISS-PROT.
We do not want to dilute the quality standards of SWISS-PROT by incorporating
sequences without proper sequence analysis and annotation, but we do want to
make the sequences available as fast as possible. TREMBL achieves this second
goal, and is a major step in the process of speeding up subsequent
upgrading of annotation to the standard SWISS-PROT quality.
To address the problem of redundancy, the translations of all coding
sequences (CDS) in the EMBL Nucleotide Sequence Database already included
in SWISS-PROT have been removed from TREMBL.

We name this supplement TREMBL (TRanslation from EMBL), since the tools
used to create the translations of the CDS are based on the program
'trembl' written by Thure Etzold at the EMBL.

                             3. The Release

This TREMBL release is created from the EMBL Nucleotide Sequence Database
release 53 and contains 166'361 sequence entries, comprising 45'671'684 amino
acids. To minimize redundancy, the translations of all coding sequences (CDS)
in the EMBL Nucleotide Sequence Database already included in SWISS-PROT
Release 35 have been removed from TREMBL release 5.

TREMBL is split into two main sections, SP-TREMBL and REM-TREMBL:

SP-TREMBL (SWISS-PROT TREMBL) contains the entries (140'555) which should
eventually be incorporated into SWISS-PROT. SWISS-PROT accession numbers have been
assigned for all SP-TREMBL entries.

SP-TREMBL is organized in subsections:

fun.dat (Fungi):                4694 entries
hum.dat (Human):                6101 entries
inv.dat (Invertebrates):       18423 entries
mam.dat (Other Mammals):        2444 entries
mhc.dat (MHC proteins):         3336 entries
org.dat (Organelles):          10561 entries
phg.dat (Bacteriophages):       1111 entries
pln.dat (Plants):               9871 entries
pro.dat (Prokaryotes):         34832 entries
rod.dat (Rodents):              5976 entries
unc.dat (Unclassified):          109 entries
vrl.dat (Viruses):             39943 entries
vrt.dat (Other Vertebrates):    3154 entries

30'007 new entries have been integrated in SP-TREMBL. 1'860 sequences of
SP-TREMBL entries have been updated and in 26'093 cases the annotation
has been updated.

In the document deleteac.txt you will find a list of all accession numbers
which appeared in the TREMBL data bank, but have been deleted from the
database.

REM-TREMBL (REMaining TREMBL) contains the entries (25'806) that we do
not want to include in SWISS-PROT. REM-TREMBL entries have no accession
numbers. This section is organized in five subsections:

   1) Immunoglobulins and T-cell receptors (Immuno.dat)
      Most REM-TREMBL entries will be immunoglobulins and T-cell receptors.
      We stopped entering immunoglobulins and T-cell receptors into
      SWISS-PROT, because we only want to keep the germ line gene derived
      translations of these proteins in SWISS-PROT and not all known
      somatically recombined variations of these proteins. We would like to
      create a specialized database dealing with these sequences as a
      further supplement to SWISS-PROT and keep only a representative
      cross-section of these proteins in SWISS-PROT.

   2) Synthetic sequences (Synth.dat)
      Synthetic sequences are another category of data that will not be
      included in SWISS-PROT. Again, we do not want to leave these entries in
      TREMBL. Ideally one should build a specialized database for artificial
      sequences as a further supplement to SWISS-PROT.

   3) Patent application sequences (Patent.dat)
      A third subsection consists of coding sequences captured from patent
      applications. A thorough survey of these entries has shown that,
      apart from a rather small minority (which in most cases has already
      been integrated in SWISS-PROT), most of these sequences either contain
      erroneous data or concern artificially generated sequences outside the
      scope of SWISS-PROT.

   4) Small fragments (Smalls.dat)
      Another subsection consists of fragments with fewer than eight amino
      acids.

   5) CDS not coding for real proteins (Pseudo.dat)
      The last subsection consists of CDS translations where we have strong
      evidence to believe that these CDS are not coding for real proteins.


                4. Format Differences Between SWISS-PROT and TREMBL

The format and conventions used by TREMBL follow those of SWISS-PROT as
closely as possible. Hence, it is not necessary to produce an additional
user manual and extensive release notes for TREMBL. The information given
in the SWISS-PROT release notes and user manual is in general valid for
TREMBL. The differences are mentioned below.

The general structure of an entry is identical in SWISS-PROT and TREMBL.
The data class used in TREMBL (in the ID line) is always 'PRELIMINARY',
whereas in SWISS-PROT it is always 'STANDARD'.

Differences in line types present in SWISS-PROT and TREMBL:

The ID line (IDentification):

The entry name used in SP-TREMBL is the same as the Accession Number of the
entry. The entry name used in REM-TREMBL is the PID tagged to the
corresponding CDS in the EMBL Nucleotide Sequence Database. 'PID' stands for
the "Protein IDentification" number. It is a number that you will find in
EMBL nucleotide sequence entries in a qualifier called "/db_xref" which is
tagged to every CDS in the nucleotide database. Example:

   FT   CDS            54..1382
   FT                  /note="ribulose-1,5-bisphosphate carboxylase/
   FT                  oxygenase activase precursor"
   FT                  /db_xref="PID:g1006835"

The DT line (DaTe)

The format of the DT lines, which indicate when an entry was created and
updated, is identical to that defined in SWISS-PROT, but the DT lines in
TREMBL refer to the TREMBL release. The difference is shown in the example
below.

    DT lines in a SWISS-PROT entry:

    DT   01-JAN-1988 (REL. 06, CREATED)
    DT   01-JUL-1989 (REL. 11, LAST SEQUENCE UPDATE)
    DT   01-AUG-1992 (REL. 23, LAST ANNOTATION UPDATE)

    DT lines in a TREMBL entry:

    DT   01-NOV-1996 (TREMBLREL. 01, CREATED)
    DT   01-NOV-1996 (TREMBLREL. 01, LAST SEQUENCE UPDATE)
    DT   01-FEB-1997 (TREMBLREL. 02, LAST ANNOTATION UPDATE)

                5. Weekly updates of TREMBL and non-redundant data sets

Weekly cumulative updates of TREMBL are available by anonymous FTP and
from the EBI SRS server.
We also produce every week a complete non-redundant protein sequence
collection by providing three compressed files (these are in the directory
/pub/databases/sp_tr_nrdb on the EBI FTP server and in databases/sp_tr_nrdb
on the ExPASy server): sprot.dat.Z, trembl.dat.Z and trembl_new.dat.Z.
This set of non-redundant files is especially important for two types of
users:
(i) Managers of similarity search services. They can now provide what is
currently the most comprehensive and non-redundant data set of protein
sequences.
(ii) Anybody wanting to update their full copy of SWISS-PROT + TREMBL to
their own schedule without having to wait for full releases of SWISS-PROT
or of TREMBL.

                6. Access/Data Distribution

FTP server:     ftp.ebi.ac.uk/pub/databases/trembl
SRS server:     http://srs.ebi.ac.uk:5000/

TREMBL is also available on the SWISS-PROT CD-ROM.
SWISS-PROT + TREMBL is searchable on the FASTA3, BLAST2 and Bic_sw servers
of the EBI.
  

TrEMBL release 4.0

Published July 1, 1997

                              TREMBL Release Notes
                              Release 4, Jul 1997


    EMBL Outstation
    European Bioinformatics Institute (EBI)
    Wellcome Trust Genome Campus
    Hinxton
    Cambridge CB10 1SD
    United Kingdom

    Telephone: (+44 1223) 494 400
    Fax: (+44 1223) 494 468
    Electronic mail address: DATALIB@EBI.AC.UK
    WWW server: http://www.ebi.ac.uk/

    Amos Bairoch
    Medical Biochemistry Department
    Centre Medical Universitaire
    1211 Geneva 4
    Switzerland

    Telephone: (+41 22) 784 40 82
    Fax: (+41 22) 702 55 02
    Electronic mail address: BAIROCH@CMU.UNIGE.CH
    WWW server: http://expasy.hcuge.ch/

    Acknowledgements

    TREMBL has been prepared by:

    o  Rolf Apweiler, Sergio Contrino, Vivien Junker, Stephanie Kappus,
       Fiona Lang, Michele Magrane, Maria Jesus Martin, Nicoletta
       Mitaritonna and Claire O'Donovan at the EMBL Outstation - European
       Bioinformatics Institute (EBI) in Hinxton, UK;
    o  Amos Bairoch and Alain Gateau at the Medical Biochemistry Department
       of the University of Geneva, Switzerland.

    Notes

    This manual and the database it accompanies may be copied and
    redistributed freely, without advance permission, provided
    that this statement is reproduced with each copy.

    Citation

    If you  want to  cite  TREMBL  in  a  publication  please  use  the
    following reference:

              Bairoch A., and Apweiler R.
              The SWISS-PROT protein sequence data bank and its
              supplement TREMBL.
              Nucleic Acids Res. 25:31-36(1997).


                         1. Introduction


TREMBL is a computer-annotated protein sequence database supplementing the
SWISS-PROT Protein Sequence Data Bank. TREMBL contains the translations of
all coding sequences (CDS) present in the EMBL Nucleotide Sequence Database
not yet integrated in SWISS-PROT. TREMBL can be considered as a preliminary
section of SWISS-PROT. For all TREMBL entries which should finally be
upgraded to the standard SWISS-PROT quality, SWISS-PROT accession numbers
have been assigned.


                        2. Why a supplement to SWISS-PROT?

The ongoing genome sequencing and mapping projects have dramatically
increased the number of protein sequences to be incorporated into SWISS-PROT.
We do not want to dilute the quality standards of SWISS-PROT by incorporating
sequences without proper sequence analysis and annotation, but we do want to
make the sequences available as fast as possible. TREMBL achieves this second
goal, and is a major step in the process of speeding up subsequent
upgrading of annotation to the standard SWISS-PROT quality.
To address the problem of redundancy, the translations of all coding
sequences (CDS) in the EMBL Nucleotide Sequence Database already included
in SWISS-PROT have been removed from TREMBL.

We name this supplement TREMBL (TRanslation from EMBL), since the tools
used to create the translations of the CDS are based on the program
'trembl' written by Thure Etzold at the EMBL.


                             3. The Release

This TREMBL release is created from the EMBL Nucleotide Sequence Database
release 51 and contains 139'208 sequence entries, comprising 37'836'288 amino
acids. To minimize redundancy, the translations of all coding sequences (CDS)
in the EMBL Nucleotide Sequence Database already included in SWISS-PROT Release
34 and in SWISS-PROT updates by 21.7. have been removed from TREMBL release 4.

TREMBL is split into two main sections, SP-TREMBL and REM-TREMBL:

SP-TREMBL (SWISS-PROT TREMBL) contains the entries (116'769) which should
eventually be incorporated into SWISS-PROT. SWISS-PROT accession numbers have been
assigned for all SP-TREMBL entries.

SP-TREMBL is organized in subsections:

fun.dat (Fungi):                3757 entries
hum.dat (Human):                5605 entries
inv.dat (Invertebrates):       15940 entries
mam.dat (Other Mammals):        2081 entries
mhc.dat (MHC proteins):         2694 entries
org.dat (Organelles):           8213 entries
phg.dat (Bacteriophages):        961 entries
pln.dat (Plants):               7776 entries
pro.dat (Prokaryotes):         26576 entries
rod.dat (Rodents):              5551 entries
unc.dat (Unclassified):          279 entries
vrl.dat (Viruses):             34508 entries
vrt.dat (Other Vertebrates):    2828 entries

REM-TREMBL (REMaining TREMBL) contains the entries (22'439) that we do
not want to include in SWISS-PROT. REM-TREMBL entries have no accession
numbers. This section is organized in five subsections:

   1) Immunoglobulins and T-cell receptors
      Most REM-TREMBL entries will be immunoglobulins and T-cell receptors.
      We stopped entering immunoglobulins and T-cell receptors into
      SWISS-PROT, because we only want to keep the germ line gene derived
      translations of these proteins in SWISS-PROT and not all known
      somatically recombined variations of these proteins. We would like to
      create a specialized database dealing with these sequences as a
      further supplement to SWISS-PROT and keep only a representative
      cross-section of these proteins in SWISS-PROT.

   2) Synthetic sequences
      Another category of data which will not be included in SWISS-PROT are
      synthetic sequences.  Again, we do not want to leave these entries in
      TREMBL.  Ideally   one  should   build  a  specialized  database  for
      artificial sequences as a further supplement to SWISS-PROT.

   3) Patent application sequences
      A third  subsection consists  of coding sequences captured from patent
      applications. A thorough survey of these entries have shown that apart
      from a rather small  minority (which in  most cases  have already been
      integrated  in SWISS-PROT),  most of  these sequences  contains  either
      erroneous data or concern artificially generated sequences outside the
      scope of SWISS-PROT.

   4) Small fragments
      Another  subsection consists  of fragments with less than seven amino
      acids.

   5) CDS not coding for real proteins
      The last subsection consists of CDS translations where we have strong
      evidence to believe that these CDS are not coding for real proteins.



                4. Format Differences Between SWISS-PROT and TREMBL

The format and conventions used by TREMBL follow as closely as possible
those of SWISS-PROT. Hence, it is not necessary to produce an additional
user manual and extensive release notes for TREMBL. The information given
in the SWISS-PROT release notes and user manual is in general valid for
TREMBL. The differences are mentioned below.

The general structure of an entry is identical in SWISS-PROT and TREMBL.
The data class used in TREMBL (in the ID line) is always 'PRELIMINARY',
whereas in SWISS-PROT it is always 'STANDARD'.
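
As a minimal sketch of how the data class could be used to tell the two
databases apart, the Python function below reads the second field of an ID
line. It assumes the usual SWISS-PROT ID line layout (entry name, data
class, molecule type, length); the sample TREMBL line and its entry name
Q00000 are hypothetical and serve only as an illustration.

```python
def data_class(id_line: str) -> str:
    """Return the data class token of an ID line, e.g. 'STANDARD'."""
    if not id_line.startswith("ID "):
        raise ValueError("not an ID line: %r" % id_line)
    fields = id_line[2:].split()
    # fields[0] is the entry name; fields[1] is the data class plus ';'.
    return fields[1].rstrip(";")


def source_database(id_line: str) -> str:
    """Map the data class to the database it indicates."""
    cls = data_class(id_line)
    if cls == "PRELIMINARY":
        return "TREMBL"
    if cls == "STANDARD":
        return "SWISS-PROT"
    raise ValueError("unexpected data class: %r" % cls)


if __name__ == "__main__":
    # SWISS-PROT style line, and a hypothetical TREMBL line (assumption).
    print(source_database("ID   CRAM_CRAAB     STANDARD;      PRT;    46 AA."))
    print(source_database("ID   Q00000         PRELIMINARY;   PRT;   100 AA."))
```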

Differences in line types present in SWISS-PROT and TREMBL:

The ID line (IDentification):

The entry name used in SP-TREMBL is the same as the Accession Number of the
entry. The entry name used in REM-TREMBL is the PID tagged to the
corresponding CDS in the EMBL Nucleotide Sequence Database. 'PID' stands for
the "Protein IDentification" number. It is a number that you will find in
EMBL nucleotide sequence entries in a qualifier called "/db_xref" which is
tagged to every CDS in the nucleotide database. Example:

   FT   CDS            54..1382
   FT                  /note="ribulose-1,5-bisphosphate carboxylase/
   FT                  oxygenase activase precursor"
   FT                  /db_xref="PID:g1006835"
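
A minimal sketch, in Python, of how the PID could be pulled out of such
feature table lines. The regular expression and the function name
extract_pids are illustrative assumptions based only on the example above,
not a complete parser for EMBL entries.

```python
import re

# Matches a qualifier such as /db_xref="PID:g1006835" and captures the PID.
PID_PATTERN = re.compile(r'/db_xref="PID:([^"]+)"')


def extract_pids(ft_lines):
    """Return every PID found in the FT lines of a nucleotide entry."""
    pids = []
    for line in ft_lines:
        if not line.lstrip().startswith("FT"):
            continue  # only feature table lines carry qualifiers
        match = PID_PATTERN.search(line)
        if match:
            pids.append(match.group(1))
    return pids


FEATURE_LINES = [
    'FT   CDS            54..1382',
    'FT                  /note="ribulose-1,5-bisphosphate carboxylase/',
    'FT                  oxygenase activase precursor"',
    'FT                  /db_xref="PID:g1006835"',
]

if __name__ == "__main__":
    print(extract_pids(FEATURE_LINES))  # ['g1006835']
```

For a REM-TREMBL entry derived from this CDS, the value returned here
would correspond to the entry name described above.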

The DT line (DaTe)

The format of the DT lines that serve to indicate when an entry was
created and updated is identical to that defined in SWISS-PROT, but the
DT lines in TREMBL refer to the TREMBL release. The difference is
shown in the example below.

    DT lines in a SWISS-PROT entry:

    DT   01-JAN-1988 (REL. 06, CREATED)
    DT   01-JUL-1989 (REL. 11, LAST SEQUENCE UPDATE)
    DT   01-AUG-1992 (REL. 23, LAST ANNOTATION UPDATE)

    DT lines in a TREMBL entry:

    DT   01-NOV-1996 (TREMBLREL. 01, CREATED)
    DT   01-NOV-1996 (TREMBLREL. 01, LAST SEQUENCE UPDATE)
    DT   01-FEB-1997 (TREMBLREL. 02, LAST ANNOTATION UPDATE)
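
    A minimal Python sketch of how both DT line variants could be parsed
    with one pattern. The pattern is inferred only from the six example
    lines above; the names DT_PATTERN, DateLine and parse_dt are
    illustrative assumptions.

```python
import re
from typing import NamedTuple

# Date, release prefix (TREMBLREL or REL), release number, and event text.
DT_PATTERN = re.compile(
    r"^DT\s+(\d{2}-[A-Z]{3}-\d{4}) \((TREMBLREL|REL)\. (\d+), ([A-Z ]+)\)$"
)


class DateLine(NamedTuple):
    date: str
    database: str
    release: int
    event: str


def parse_dt(line: str) -> DateLine:
    """Parse one DT line and report which database's release it refers to."""
    match = DT_PATTERN.match(line.strip())
    if match is None:
        raise ValueError("unrecognized DT line: %r" % line)
    date, prefix, release, event = match.groups()
    database = "TREMBL" if prefix == "TREMBLREL" else "SWISS-PROT"
    return DateLine(date, database, int(release), event)


if __name__ == "__main__":
    print(parse_dt("DT   01-JUL-1989 (REL. 11, LAST SEQUENCE UPDATE)"))
    print(parse_dt("DT   01-FEB-1997 (TREMBLREL. 02, LAST ANNOTATION UPDATE)"))
```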


                5. Weekly updates of TREMBL

Weekly cumulative updates of TREMBL are available by anonymous FTP and
from the EBI SRS server.


                6. Access/Data Distribution

FTP server:     ftp.ebi.ac.uk/pub/databases/trembl
SRS server:     http://srs.ebi.ac.uk:5000/

TREMBL is also available on the SWISS-PROT CD-ROM.
SP-TREMBL is searchable on the FASTA server of the EBI and will soon be
searchable on the BLITZ server of the EBI.
  

TrEMBL release 3.0

Published May 1, 1997

                              TREMBL Release Notes
                              Release 3, May 1997


    EMBL Outstation
    European Bioinformatics Institute (EBI)
    Wellcome Trust Genome Campus
    Hinxton
    Cambridge CB10 1SD
    United Kingdom

    Telephone: (+44 1223) 494 400
    Fax: (+44 1223) 494 468
    Electronic mail address: DATALIB@EBI.AC.UK
    WWW server: http://www.ebi.ac.uk/

    Amos Bairoch
    Medical Biochemistry Department
    Centre Medical Universitaire
    1211 Geneva 4
    Switzerland

    Telephone: (+41 22) 784 40 82
    Fax: (+41 22) 702 55 02
    Electronic mail address: BAIROCH@CMU.UNIGE.CH
    WWW server: http://expasy.hcuge.ch/

    Acknowledgements

    TREMBL has been prepared by:

    o  Rolf Apweiler, Sergio Contrino, Vivien Junker, Stephanie Kappus,
       Fiona Lang, Michele Magrane, Maria Jesus Martin, Nicoletta
       Mitaritonna and Claire O'Donovan at the EMBL Outstation - European
       Bioinformatics Institute (EBI) in Hinxton, UK;
    o  Amos Bairoch and Alain Gateau at the Medical Biochemistry Department
       of the University of Geneva, Switzerland.

    Notes

    This manual and the database it accompanies may be copied and
    redistributed freely, without advance permission, provided
    that this statement is reproduced with each copy.

    Citation

    If you  want to  cite  TREMBL  in  a  publication  please  use  the
    following reference:

              Bairoch A., and Apweiler R.
              The SWISS-PROT protein sequence data bank and its
              supplement TREMBL.
              Nucleic Acids Res. 25:31-36(1997).

                               1. Introduction



TREMBL is a computer-annotated protein sequence database supplementing the
SWISS-PROT Protein Sequence Data Bank. TREMBL contains the translations of
all coding sequences (CDS) present in the EMBL Nucleotide Sequence Database
not yet integrated in SWISS-PROT. TREMBL can be considered as a preliminary
section of SWISS-PROT. For all TREMBL entries which should finally be
upgraded to the standard SWISS-PROT quality, SWISS-PROT accession numbers
have been assigned.


                        2. Why a supplement to SWISS-PROT?

The ongoing genome sequencing and mapping projects have dramatically
increased the number of protein sequences to be incorporated into SWISS-PROT.
We do not want to dilute the quality standards of SWISS-PROT by incorporating
sequences without proper sequence analysis and annotation, but we do want to
make the sequences available as fast as possible. TREMBL achieves this second
goal, and is a major step in the process of speeding up subsequent
upgrading of annotation to the standard SWISS-PROT quality.
To address the problem of redundancy, the translations of all coding
sequences (CDS) in the EMBL Nucleotide Sequence Database already included
in SWISS-PROT have been removed from TREMBL.

We name this supplement TREMBL (TRanslation from EMBL), since the tools
used to create the translations of the CDS are based on the program
'trembl' written by Thure Etzold at the EMBL.


                             3. The Release

This TREMBL release is created from the EMBL Nucleotide Sequence Database
release 50 and contains 126'995 sequence entries, comprising 34'178'645
amino acids. To minimize redundancy, the translations of all coding
sequences (CDS) in the EMBL Nucleotide Sequence Database already included
in SWISS-PROT Release 34 and in SWISS-PROT updates by 19.5. have been
removed from TREMBL release 3.

TREMBL is split into two main sections, SP-TREMBL and REM-TREMBL:

SP-TREMBL (SWISS-PROT TREMBL) contains the entries (104'865) which should
eventually be incorporated into SWISS-PROT. SWISS-PROT accession numbers have
been assigned for all SP-TREMBL entries.

SP-TREMBL is organized in subsections:

fun.dat (Fungi):                3425 entries
hum.dat (Human):                5126 entries
inv.dat (Invertebrates):       14229 entries
mam.dat (Other Mammals):        1958 entries
mhc.dat (MHC proteins):         2537 entries
org.dat (Organelles):           7313 entries
phg.dat (Bacteriophages):        883 entries
pln.dat (Plants):               6812 entries
pro.dat (Prokaryotes):         23718 entries
rod.dat (Rodents):              5118 entries
unc.dat (Unclassified):           39 entries
vrl.dat (Viruses):             31052 entries
vrt.dat (Other Vertebrates):    2655 entries

REM-TREMBL (REMaining TREMBL) contains the entries (22'130) that we do
not want to include in SWISS-PROT. REM-TREMBL entries have no accession
numbers. This section is organized in five subsections:

   1) Immunoglobulins and T-cell receptors
      Most REM-TREMBL entries will be immunoglobulins and T-cell receptors.
      We stopped entering immunoglobulins and T-cell receptors into
      SWISS-PROT, because we only want to keep the germ line gene derived
      translations of these proteins in SWISS-PROT and not all known
      somatically recombined variations of these proteins. We would like
      to create a specialized database dealing with these sequences as a
      further supplement to SWISS-PROT and keep only a representative
      cross-section of these proteins in SWISS-PROT.

   2) Synthetic sequences
      Another category of data which will not be included in SWISS-PROT is
      synthetic sequences. Again, we do not want to leave these entries in
      TREMBL. Ideally one should build a specialized database for
      artificial sequences as a further supplement to SWISS-PROT.

   3) Patent application sequences
      A third subsection consists of coding sequences captured from patent
      applications. A thorough survey of these entries has shown that, apart
      from a rather small minority (which in most cases have already been
      integrated in SWISS-PROT), most of these sequences either contain
      erroneous data or concern artificially generated sequences outside the
      scope of SWISS-PROT.

   4) Small fragments
      Another subsection consists of fragments with fewer than seven amino
      acids.

   5) CDS not coding for real proteins
      The last subsection consists of CDS translations where we have strong
      evidence to believe that these CDS are not coding for real proteins.


           4. Format Differences Between SWISS-PROT and TREMBL

The format and conventions used by TREMBL follow as closely as possible
those of SWISS-PROT. Hence, it is not necessary to produce an additional
user manual and extensive release notes for TREMBL. The information given
in the SWISS-PROT release notes and user manual is in general valid for
TREMBL. The differences are mentioned below.

The general structure of an entry is identical in SWISS-PROT and TREMBL.
The data class used in TREMBL (in the ID line) is always 'PRELIMINARY',
whereas in SWISS-PROT it is always 'STANDARD'.

Differences in line types present in SWISS-PROT and TREMBL:

The ID line (IDentification):

The entry name used in SP-TREMBL is the same as the Accession Number of the
entry. The entry name used in REM-TREMBL is the PID tagged to the
corresponding CDS in the EMBL Nucleotide Sequence Database. 'PID' stands for
the "Protein IDentification" number. It is a number that you will find in
EMBL nucleotide sequence entries in a qualifier called "/db_xref" which is
tagged to every CDS in the nucleotide database. Example:

   FT   CDS            54..1382
   FT                  /note="ribulose-1,5-bisphosphate carboxylase/
   FT                  oxygenase activase precursor"
   FT                  /db_xref="PID:g1006835"

The DT line (DaTe)

The format of the DT lines that serve to indicate when an entry was
created and updated is identical to that defined in SWISS-PROT, but the
DT lines in TREMBL refer to the TREMBL release. The difference is
shown in the example below.

    DT lines in a SWISS-PROT entry:

    DT   01-JAN-1988 (REL. 06, CREATED)
    DT   01-JUL-1989 (REL. 11, LAST SEQUENCE UPDATE)
    DT   01-AUG-1992 (REL. 23, LAST ANNOTATION UPDATE)

    DT lines in a TREMBL entry:

    DT   01-NOV-1996 (TREMBLREL. 01, CREATED)
    DT   01-NOV-1996 (TREMBLREL. 01, LAST SEQUENCE UPDATE)
    DT   01-FEB-1997 (TREMBLREL. 02, LAST ANNOTATION UPDATE)
  

TrEMBL release 2.0

Published February 1, 1997

                            TREMBL Release Notes
                            Release 2, February 1997




    EMBL Outstation
    European Bioinformatics Institute (EBI)
    Wellcome Trust Genome Campus
    Hinxton
    Cambridge CB10 1SD
    United Kingdom

    Telephone: (+44 1223) 494 400
    Fax: (+44 1223) 494 468
    Electronic mail address: DATALIB@EBI.AC.UK
    WWW server: http://www.ebi.ac.uk/



    Amos Bairoch
    Medical Biochemistry Department
    Centre Medical Universitaire
    1211 Geneva 4
    Switzerland

    Telephone: (+41 22) 784 40 82
    Fax: (+41 22) 702 55 02
    Electronic mail address: BAIROCH@CMU.UNIGE.CH
    WWW server: http://expasy.hcuge.ch/




    Acknowledgements

    TREMBL has been prepared by:

    o  Rolf Apweiler, Sergio Contrino, Vivien Junker, Stephanie Kappus,
       Fiona Lang, Michele Magrane, Maria Jesus Martin, Nicoletta
       Mitaritonna and Claire O'Donovan at the EMBL Outstation - European
       Bioinformatics Institute (EBI) in Hinxton, UK;
    o  Amos Bairoch and Alain Gateau at the Medical Biochemistry Department
       of the University of Geneva, Switzerland.


    Notes

    This manual and the database it accompanies may be copied and
    redistributed freely, without advance permission, provided
    that this statement is reproduced with each copy.



    Citation

    If you  want to  cite  TREMBL  in  a  publication  please  use  the
    following reference:

              Bairoch A., and Apweiler R.
              The SWISS-PROT protein sequence data bank and its
              supplement TREMBL.
              Nucleic Acids Res. 25:31-36(1997).









                               1. Introduction



TREMBL is a computer-annotated protein sequence database supplementing the
SWISS-PROT Protein Sequence Data Bank. TREMBL contains the translations of
all coding sequences (CDS) present in the EMBL Nucleotide Sequence Database
not yet integrated in SWISS-PROT. TREMBL can be considered as a preliminary
section of SWISS-PROT. For all TREMBL entries which should finally be
upgraded to the standard SWISS-PROT quality, SWISS-PROT accession numbers
have been assigned.




                        2. Why a supplement to SWISS-PROT?


The ongoing genome sequencing and mapping projects have dramatically
increased the number of protein sequences to be incorporated into SWISS-PROT.
We do not want to dilute the quality standards of SWISS-PROT by incorporating
sequences without proper sequence analysis and annotation, but we do want to
make the sequences available as fast as possible. TREMBL achieves this second
goal, and is a major step in the process of speeding up subsequent
upgrading of annotation to the standard SWISS-PROT quality.
To address the problem of redundancy, the translations of all coding
sequences (CDS) in the EMBL Nucleotide Sequence Database already included
in SWISS-PROT have been removed from TREMBL.

We name this supplement TREMBL (TRanslation from EMBL), since the tools
used to create the translations of the CDS are based on the program
'trembl' written by Thure Etzold at the EMBL.




                             3. The Release


This TREMBL release is created from the EMBL Nucleotide Sequence Database
release 49 and contains 116'379 sequence entries, comprising 31'293'053
amino acids.


TREMBL is split into two main sections, SP-TREMBL and REM-TREMBL:

SP-TREMBL (SWISS-PROT TREMBL) contains the entries (96'757) which should
eventually be incorporated into SWISS-PROT. SWISS-PROT accession numbers have
been assigned for all SP-TREMBL entries.

SP-TREMBL is organized in subsections:

fun.dat (Fungi):                3108 entries
hum.dat (Human):                4541 entries
inv.dat (Invertebrates):       13269 entries
mam.dat (Other Mammals):        1740 entries
mhc.dat (MHC proteins):         2307 entries
org.dat (Organelles):           6705 entries
phg.dat (Bacteriophages):        884 entries
pln.dat (Plants):               5699 entries
pro.dat (Prokaryotes):         23331 entries
rod.dat (Rodents):              4769 entries
vrl.dat (Viruses):             28150 entries
vrt.dat (Other Vertebrates):    2254 entries



REM-TREMBL (REMaining TREMBL) contains the entries (19'622) that we do
not want to include in SWISS-PROT. REM-TREMBL entries have no accession
numbers. This section is organized in five subsections:

   1) Immunoglobulins and T-cell receptors
      Most REM-TREMBL entries will be immunoglobulins and T-cell receptors.
      We stopped entering immunoglobulins and T-cell receptors into
      SWISS-PROT, because we only want to keep the germ line gene derived
      translations of these proteins in SWISS-PROT and not all known
      somatically recombined variations of these proteins. We would like
      to create a specialized database dealing with these sequences as a
      further supplement to SWISS-PROT and keep only a representative
      cross-section of these proteins in SWISS-PROT.

   2) Synthetic sequences
      Another category of data which will not be included in SWISS-PROT is
      synthetic sequences. Again, we do not want to leave these entries in
      TREMBL. Ideally one should build a specialized database for
      artificial sequences as a further supplement to SWISS-PROT.

   3) Patent application sequences
      A third subsection consists of coding sequences captured from patent
      applications. A thorough survey of these entries has shown that, apart
      from a rather small minority (which in most cases have already been
      integrated in SWISS-PROT), most of these sequences either contain
      erroneous data or concern artificially generated sequences outside the
      scope of SWISS-PROT.

   4) Small fragments
      Another subsection consists of fragments with fewer than seven amino
      acids.

   5) CDS not coding for real proteins
      The last subsection consists of CDS translations where we have strong
      evidence to believe that these CDS are not coding for real proteins.






           4. Format Differences Between SWISS-PROT and TREMBL


The format and conventions used by TREMBL follow as closely as possible
those of SWISS-PROT. Hence, it is not necessary to produce an additional
user manual and extensive release notes for TREMBL. The information given
in the SWISS-PROT release notes and user manual is in general valid for
TREMBL. The differences are mentioned below.

The general structure of an entry is identical in SWISS-PROT and TREMBL.
The data class used in TREMBL (in the ID line) is always 'PRELIMINARY',
whereas in SWISS-PROT it is always 'STANDARD'.

Differences in line types present in SWISS-PROT and TREMBL:

The ID line (IDentification):

The entry name used in SP-TREMBL is the same as the Accession Number of the
entry. The entry name used in REM-TREMBL is the PID tagged to the
corresponding CDS in the EMBL Nucleotide Sequence Database. 'PID' stands for
the "Protein IDentification" number. It is a number that you will find in
EMBL nucleotide sequence entries in a qualifier called "/db_xref" which is
tagged to every CDS in the nucleotide database. Example:

   FT   CDS            54..1382
   FT                  /note="ribulose-1,5-bisphosphate carboxylase/
   FT                  oxygenase activase precursor"
   FT                  /db_xref="PID:g1006835"


The DT line (DaTe)

The format of the DT lines that serve to indicate when an entry was
created and updated is identical to that defined in SWISS-PROT, but the
DT lines in TREMBL refer to the TREMBL release. The difference is
shown in the example below.

    DT lines in a SWISS-PROT entry:

    DT   01-JAN-1988 (REL. 06, CREATED)
    DT   01-JUL-1989 (REL. 11, LAST SEQUENCE UPDATE)
    DT   01-AUG-1992 (REL. 23, LAST ANNOTATION UPDATE)


    DT lines in a TREMBL entry:

    DT   01-NOV-1996 (TREMBLREL. 01, CREATED)
    DT   01-NOV-1996 (TREMBLREL. 01, LAST SEQUENCE UPDATE)
    DT   01-FEB-1997 (TREMBLREL. 02, LAST ANNOTATION UPDATE)
  

TrEMBL release 1.0

Published October 2, 1996
First release of TREMBL, a protein sequence database supplementing
the SWISS-PROT Protein Sequence Data Bank


INTRODUCTION
============

TREMBL is a protein sequence database supplementing the SWISS-PROT Protein
Sequence Data Bank. TREMBL contains the translations of all coding
sequences (CDS) present in the EMBL Nucleotide Sequence Database not yet
integrated in SWISS-PROT. TREMBL can be considered as a preliminary
section of SWISS-PROT. For all TREMBL entries which should finally be
upgraded to the standard SWISS-PROT quality, SWISS-PROT accession numbers
have been assigned.


FIRST RELEASE OF TREMBL
=======================

This TREMBL release is created from the EMBL Nucleotide Sequence Database
release 48 and contains 104'962 sequence entries, comprising 28'063'108
amino acids. TREMBL release 1 will be distributed with SWISS-PROT release 34.


TREMBL is split into two main sections, SP-TREMBL and REM-TREMBL:

SP-TREMBL (SWISS-PROT TREMBL) contains the entries (86'040) which should be
incorporated into SWISS-PROT. SWISS-PROT accession numbers have been assigned
for all SP-TREMBL entries.

SP-TREMBL is organized in subsections:

fun.dat (Fungi):                3367 entries
hum.dat (Human):                4322 entries
inv.dat (Invertebrates):       11920 entries
mam.dat (Other Mammals):        1768 entries
mhc.dat (MHC proteins):         2050 entries
org.dat (Organelles):           6285 entries
phg.dat (Bacteriophages):        866 entries
pln.dat (Plants):               5524 entries
pro.dat (Prokaryotes):         17222 entries
rod.dat (Rodents):              4573 entries
vrl.dat (Viruses):             26042 entries
vrt.dat (Other Vertebrates):    2101 entries


REM-TREMBL (REMaining TREMBL) contains the entries (19'255) that we do
not want to include in SWISS-PROT.


ACCESS/DATA DISTRIBUTION
========================

FTP server:     ftp.ebi.ac.uk/pub/databases/trembl
SRS server:     http://www.ebi.ac.uk/srs/srsc

TREMBL is also available on the SWISS-PROT CD-ROM.
SP-TREMBL will be searchable on the FASTA and BLITZ servers of the EBI.



TREMBL HAS BEEN PREPARED BY:
============================

Rolf Apweiler, Alain Gateau, Vivien Junker, Stephanie Kappus, Fiona Lang,
Nicoletta Mitaritonna and Claire O'Donovan at the EMBL Outstation - European
Bioinformatics Institute (EBI) in Hinxton, UK;
Amos Bairoch at the Medical Biochemistry Department of the University
of Geneva, Switzerland.


=======================================================================
Rolf Apweiler                           | SWISS-PROT Coordinator
EMBL Outstation                         | Email:apweiler@ebi.ac.uk
European Bioinformatics Institute (EBI) | URL:  http://www.ebi.ac.uk
Wellcome Trust Genome Campus, Hinxton   | Tel:  +44 (1223) 494435
Cambridge CB10 1SD, UK                  | Fax:  +44 (1223) 494968
========================================================================