#############################################################
README for
ftp://ncbi.nlm.nih.gov/refseq/

Last updated: July 17, 2020
#############################################################
_________________________________________________________________________

        National Center for Biotechnology Information (NCBI)
        National Library of Medicine
        National Institutes of Health
        8600 Rockville Pike
        Bethesda, MD 20894, USA
        tel: (301) 496-2475
        fax: (301) 480-9241
        e-mail: info@ncbi.nlm.nih.gov
_________________________________________________________________________

=========================================================================
UPDATES TO THIS FTP SITE:

July 2, 2003
    RefSeq Release 1 is available by anonymous FTP
    ftp://ftp.ncbi.nih.gov/refseq/release/
August 16, 2005
    Added documentation of the 'wgs' directory
    Removed 'cumulative' directory and related documentation as it
    was replaced by the RefSeq release.
    Modified documentation of the LocusLink directory to reflect the
    obsolete/archival status.
April 27, 2007
    Added documentation of 'special_requests' directory
October 3, 2007
    Added documentation of the uniprotkb directory
October 12, 2007
    Added documentation about the /removed/ directory
June 11, 2010
    Modified format for new report files in the /removed/ directory
    to remove space padding for 'replaced-by' accession entries.
    Removed references to the historical resource LocusLink.
August 23, 2016
    Revised accession prefix list
    Updated directory descriptions, and reorganized content order
    Added documentation for TargetedLoci directory
    Added documentation for H_sapiens/RefSeqGene
    Added documentation for H_sapiens/alignments
September 19, 2017
    Added a note on the 4 different categories of suppressed/withdrawn
    records
January 11, 2017
    Deleted column for 'gi' from description of 'removed' directory.
October 4, 2018
    Added section 'supplemental' with documentation on the
    'NP_YP_WP.txt' protein ID mapping file
July 17, 2020
    Added documentation for the 'FunctionalElements' and 'MANE'
    directories

See the README file and the RefSeq Release notes for more information:
    ftp://ftp.ncbi.nih.gov/refseq/release/README
    ftp://ftp.ncbi.nih.gov/refseq/release/release-notes/
=========================================================================

The NCBI Reference Sequence project (RefSeq) provides reference sequence
standards for the naturally occurring molecules of the central dogma,
from chromosomes to mRNAs to proteins. RefSeq standards provide a
foundation for the functional annotation of the human genome. They
provide a stable reference point for mutation analysis, gene expression
studies, and polymorphism discovery.

Scope:
Currently, RefSeq records are provided for the following molecule types:

Molecule Type        Accession Prefix
----------------------------------------------
protein              NP_; XP_; AP_; YP_; WP_
rna                  NM_; NR_; XM_; XR_
genomic              NC_; AC_; NG_; NT_; NW_; NZ_

Additional information is available from
https://www.ncbi.nih.gov/RefSeq/
_________________________________________________________________________

The following directories are available from this RefSeq ftp site:

RefSeq FTP release and interim updates:
    daily
    release
    removed
    wgs

Organism-specific directories:
    B_taurus
    D_rerio
    H_sapiens
    M_musculus
    R_norvegicus
    S_scrofa
    X_tropicalis

Additional content:
    FunctionalElements
    MANE
    special_requests
    supplemental
    TargetedLoci
    uniprotkb

==========================================
RefSeq FTP release and interim updates
==========================================
release
==========================================
Regular RefSeq releases are made available in this directory area.
The directory is organized into several sub-directories.
Sequence content is provided as ASN.1 (only in the complete directory),
as nucleotide or protein FASTA, and as nucleotide GenBank format or
protein GenPept format.

Directory               Contents
---------------------------------
archaea                 sequence
bacteria                sequence
complete                sequence
fungi                   sequence
invertebrate            sequence
mitochondrion           sequence
plant                   sequence
plasmid                 sequence
plastid                 sequence
protozoa                sequence
release-catalog         documentation;
                          . accessions included in the release
                          . accession to GeneID correspondence
                          . files installed (sequence data) for the release
                          . accessions removed since last release
                          . organisms added and changed since the last release
                          . mapping of prokaryotic WP_ proteins to genome annotation
                          . report of multispecies prokaryotic WP_ proteins
release-notes           documentation;
                          content, scope, organization, structure
release-statistics      documentation;
                          statistics per sequence directory
                          global statistics
vertebrate_mammalian    sequence
vertebrate_other        sequence
viral                   sequence

==========================================
daily
==========================================
The daily directory contains daily updates of non-WGS RefSeq gi's since
the RefSeq release. This directory is not cumulative. The contents of the
directory are removed following the installation of a new RefSeq release.
Release-related updates to this directory may result in a small number of
retained files that represent sequences that were released during the
time period that the RefSeq release was being processed.
File name format:
rsnc.[MonthDay.YEAR].bna.gz     Nucleotide sequence, in ASN.1 binary format
rsnc.[MonthDay.YEAR].faa.gz     Protein sequences, in FASTA format
rsnc.[MonthDay.YEAR].fna.gz     Nucleotide sequences, in FASTA format
rsnc.[MonthDay.YEAR].gbff.gz    GenBank flatfile view (nucleotides)
rsnc.[MonthDay.YEAR].gpff.gz    GenPept flatfile view (proteins)

==========================================
removed
==========================================
The removed directory contains daily update reports of RefSeq records
that have been removed from the collection since the last RefSeq release.
If no records have been removed, then no file is supplied for that day.
The directory is not cumulative and contents are removed following the
installation of a new RefSeq release.

File name format:
removed-records.mmdd.yyyy

Columns (tab delimited)
accession version
removal category        --where category may be one of:
                            temporarily suppressed
                            permanently suppressed
                            temporarily withdrawn
                            permanently withdrawn
                            replaced by {accession} - see NOTE1.
removal date            --yyyy.mm.dd

NOTE1: the report file format has been updated to remove padded spaces
       for those accessions in replaced-by entries.
NOTE2: there is an additional category of removal, reported as dead
       proteins in the RefSeq release report of removed records, that is
       not currently included in this daily report.
NOTE3: the distinction between the 4 different categories of
       suppressed/withdrawn records is largely internal; many accessions
       that are 'temporarily suppressed' will remain permanently in that
       state, and some accessions that are 'permanently suppressed' may
       occasionally be revived at a later date.

See also:
ftp://ftp.ncbi.nih.gov/refseq/release/release-catalog/README
documentation for: release#.removed-records

==========================================
wgs
==========================================
The wgs directory contains daily updates of WGS RefSeq gi's since the
RefSeq release. This directory is not cumulative.
The contents of this directory are removed following the installation of
a new RefSeq release. Release-related updates to this directory may
result in a small number of retained files that represent sequences that
were released during the time period that the RefSeq release was being
processed.

File name format:
rswgs.[WGS_project].bna.gz      Nucleotide sequence, in ASN.1 binary format
rswgs.[WGS_project].faa.gz      Protein sequences, in FASTA format
rswgs.[WGS_project].fna.gz      Nucleotide sequences, in FASTA format
rswgs.[WGS_project].gbff.gz     GenBank flatfile view (nucleotides)
rswgs.[WGS_project].gpff.gz     GenPept flatfile view (proteins)

==========================================
Organism specific directories
==========================================
Select organism-specific files are also provided so that previously
provided service is not discontinued. We do not plan to add additional
organism-specific directories at this time.

Data is updated weekly in a cycle independent of the RefSeq release,
cumulative, and daily update processing, and constitutes a full release
of transcript and protein data for the organism.
    ftp://ftp.ncbi.nlm.nih.gov/refseq/B_taurus/mRNA_Prot/
    ftp://ftp.ncbi.nlm.nih.gov/refseq/D_rerio/mRNA_Prot/
    ftp://ftp.ncbi.nlm.nih.gov/refseq/H_sapiens/mRNA_Prot/
    ftp://ftp.ncbi.nlm.nih.gov/refseq/H_sapiens/RefSeqGene/
    ftp://ftp.ncbi.nlm.nih.gov/refseq/H_sapiens/alignments/
    ftp://ftp.ncbi.nlm.nih.gov/refseq/M_musculus/mRNA_Prot/
    ftp://ftp.ncbi.nlm.nih.gov/refseq/R_norvegicus/mRNA_Prot/
    ftp://ftp.ncbi.nlm.nih.gov/refseq/S_scrofa/mRNA_Prot/
    ftp://ftp.ncbi.nlm.nih.gov/refseq/X_tropicalis/mRNA_Prot/

mRNA_Prot:
----------
sequence data is available in the following formats
update frequency: weekly

organism.#.protein.faa.gz       protein fasta format
organism.#.protein.gpff.gz      protein GenPept format
organism.#.rna.fna.gz           nucleotide fasta format
organism.#.rna.gbff.gz          nucleotide GenBank format
organism.files.installed        list of file names

Note: where # is a numerical increment; files are split based on size
      thresholds to support access by customers with different internet
      connections.

H_sapiens/RefSeqGene:
---------------------
See also: https://www.ncbi.nlm.nih.gov/refseq/rsg/
Update frequency: weekly (gbff and fasta files), daily (non-sequence files)

refseqgene.#.genomic.gbff.gz    nucleotide GenBank format; see # note above
refseqgene.#.fna.gz             nucleotide fasta format; see # note above
GCF_000001405.*_refseqgene_alignments.gff3
                                RefSeqGene alignments to the primary human
                                reference assembly
                                where '*' indicates the specific assembly version
                                See: https://www.ncbi.nlm.nih.gov/assembly/
Aligned2RefSeqGene              previous versions of transcript reference standards
gene_RefSeqGene                 reports GeneID to RefSeqGene accession
LRG_RefSeqGene                  reports data associations among GeneID,
                                RefSeqGene, LRG
presentations                   public presentations

H_sapiens/alignments in GFF3 format:
------------------------------------
More about the gff3 standard:
http://www.sequenceontology.org/gff3.shtml

GCF_000001405.*.refseqgene_alignments.gff3
    RefSeqGene alignments to the primary human reference assembly
    Updated daily
GCF_000001405.*.knownrefseq_alignments.gff3
    Known RefSeq (NM_, NR_, NP_) alignments to the human reference assembly
    Updated daily
GCF_000001405.*.modelrefseq_alignments.gff3
    Model RefSeq (XM_, XR_, XP_) alignments to the human reference assembly
    Updated per full annotation release

Note: where '*' indicates the specific assembly version
See: https://www.ncbi.nlm.nih.gov/assembly/

==========================================
Additional Content
==========================================
==========================================
FunctionalElements
==========================================
Data from the RefSeq Functional Elements project representing
experimentally validated human and mouse non-genic functional elements.
See https://www.ncbi.nlm.nih.gov/refseq/functionalelements/ for
additional information.

Files provided (updated weekly):
[human/mouse].biological_region.fna.gz
    -- RefSeq accessions for genomic biological regions (NG_ prefix)
       in FASTA format
[human/mouse].biological_region.gbff.gz
    -- RefSeq accessions for genomic biological regions (NG_ prefix)
       in GenBank flatfile format

Directory:

trackhub:
---------
Track hub for RefSeq Functional Element biological regions, features,
regulatory interactions and recombination partners. The track hub can be
viewed on a compatible genome browser, including the UCSC Genome Browser
(all tracks), the NCBI Genome Data Viewer (select tracks only) or the
Ensembl genome browser (select tracks only), using the following URL:
https://ftp.ncbi.nlm.nih.gov/refseq/FunctionalElements/trackhub/hub.txt
See https://ftp.ncbi.nlm.nih.gov/refseq/FunctionalElements/trackhub/RefSeqFE_Hub.html
for more details.
data sub-directory
    -- Species-specific annotation release (AR##) sub-directories
       containing:
       Genome-annotated biological region and feature files in bigBed
       format:
           FEbiolregions_AR##.bb -- biological regions with metadata
           FEfeats_AR##.bb -- functional features with metadata
       Pairwise interaction data files in bigInteract format:
           FErecombpartners_AR##.inter.bb -- recombination interactions
           FEregintxns_AR##.inter.bb -- regulatory interactions
       Where ## represents the NCBI annotation release identifier
other
    -- Genome assembly sub-directories and other files necessary for
       track hub support. See
       http://genome.ucsc.edu/goldenPath/help/hgTrackHubHelp.html
       for more details.

==========================================
MANE
==========================================
Data from the Matched Annotation from NCBI and EMBL-EBI (MANE) project.
See https://www.ncbi.nlm.nih.gov/refseq/MANE/ for additional information.
See the MANE/README.txt file for directory description.

==========================================
special_requests
==========================================
Additional reports are provided upon request. The reports may be limited
in scope in that they report data for a sub-set of the RefSeq collection.
See the special_requests/README file for additional information.

==========================================
supplemental
==========================================
In 2014 and 2015, NCBI re-annotated all prokaryotic genomes, except a
small set of Reference Genomes, using NCBI's Prokaryotic Genome
Annotation Pipeline based on a new protein data model. This new RefSeq
non-redundant protein model is identified by a "WP_" accession prefix,
which is different from the traditional RefSeq prokaryotic protein "NP_"
or "YP_" accession. This re-annotation resulted in the removal of nearly
7 million NP_ and YP_ accessions as prokaryotic genomes were updated to
directly cross-reference the new non-redundant WP_ accessions.
For conserved proteins, the same WP accession may appear on thousands of
genomes. However, we are aware that the NP_ and YP_ accessions have been
used in many publications and biomedical projects, which may refer
scientists to NCBI protein pages, which currently provide the new
non-redundant proteins with WP_ accessions.

The file "NP_YP_WP.txt" is a protein ID mapping file that provides the
association of traditional NP_ and YP_ proteins with new WP_ proteins of
identical sequences.

The ID mapping file consists of five columns:
IPG            - the IPG ID (https://www.ncbi.nlm.nih.gov/ipg/)
NP_YP_AccVer   - the NP/YP accession and version
WP_AccVer      - the associated WP accession
NP_YP_Taxid    - Taxonomy ID
NP_YP_Status   - the status of NP/YP protein
    live:       the NP/YP protein is still annotated on Reference Genomes
    replaced:   the NP/YP protein was replaced by a WP protein
    suppressed: the NP/YP protein was first replaced by WP protein, which
                was subsequently suppressed because it is no longer
                annotated on any genome
    withdrawn:  the NP/YP protein is no longer annotated on any genome

Additional information:
https://www.ncbi.nlm.nih.gov/refseq/about/prokaryotes/#reference_genomes
https://www.ncbi.nlm.nih.gov/genome/annotation_prok/
https://www.ncbi.nlm.nih.gov/refseq/about/nonredundantproteins/
https://www.ncbi.nlm.nih.gov/refseq/about/prokaryotes/reannotation/

==========================================
TargetedLoci
==========================================
Additional information:
https://www.ncbi.nlm.nih.gov/refseq/targetedloci/

Directories:
Archaea     5S rRNA, 16S rRNA, 23S rRNA from type material
Bacteria    5S rRNA, 16S rRNA, 23S rRNA from type material
Fungi       28S rRNA, internal transcribed spacer from type/validated
            material

Sequence data is provided in GenBank format and FASTA format.

==========================================
uniprotkb
==========================================
In collaboration with UniProtKB, corresponding RefSeq to UniProt protein
accession data are now reported.
UniProt calculates corresponding accessions based on the following
criteria:

a) Identical sequence and identical species (NCBI tax_id)
   --the majority of the corresponding pairs fall into this category
b) Common protein ID and identical tax_id
   where both RefSeq and UniProt records cite the same protein accession
   as one that was used to create the RefSeq or UniProt record.
c) Common protein ID and equivalent but non-identical tax_id
   where the common protein ID is as above, and tax_ids are converted
   from strain or sub-strain level to the species level (e.g., UniProt
   and RefSeq may differ in their decisions to represent a sequence as
   the species vs a specific strain but they are based on the same
   underlying GenBank data and are considered equivalent).

File: gene_refseq_uniprotkb_collab
----------------------------------
Column header line is the first line in the file.
Columns are tab-delimited with one accession pair per line.
Accession values are not unique; a single accession from one database
may have multiple corresponding accessions from the other database.

Column 1: RefSeq protein accession
Column 2: UniProt protein accession
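
Example: reading gene_refseq_uniprotkb_collab
---------------------------------------------
Because accession values are not unique, a reader of this file generally
needs a one-to-many lookup in both directions. The following Python
sketch shows one way to build such a lookup from the layout described
above (header on the first line, tab-delimited pairs). It is an
illustration only, not an NCBI-provided tool. The function name, the
header text and the accessions in SAMPLE are all invented placeholders,
and the code assumes the file has already been opened as text
(decompressed first if needed).

```python
import csv
import io
from collections import defaultdict


def load_accession_pairs(handle):
    """Return (refseq -> set of uniprot, uniprot -> set of refseq) mappings."""
    reader = csv.reader(handle, delimiter="\t")
    next(reader, None)  # the first line of the file is the column header
    refseq_to_uniprot = defaultdict(set)
    uniprot_to_refseq = defaultdict(set)
    for row in reader:
        if len(row) < 2:
            continue  # skip blank or incomplete lines
        refseq_acc = row[0].strip()   # column 1: RefSeq protein accession
        uniprot_acc = row[1].strip()  # column 2: UniProt protein accession
        refseq_to_uniprot[refseq_acc].add(uniprot_acc)
        uniprot_to_refseq[uniprot_acc].add(refseq_acc)
    return refseq_to_uniprot, uniprot_to_refseq


# Placeholder header and accessions, invented for illustration only.
SAMPLE = (
    "refseq_acc\tuniprot_acc\n"
    "NP_000001.1\tP00001\n"
    "NP_000001.1\tQ00002\n"
    "XP_000003.1\tP00001\n"
)

refseq_to_uniprot, uniprot_to_refseq = load_accession_pairs(io.StringIO(SAMPLE))
print(sorted(refseq_to_uniprot["NP_000001.1"]))
print(sorted(uniprot_to_refseq["P00001"]))
```

Sets are used rather than single values so that repeated accessions on
either side are kept instead of silently overwritten. For a real file,
pass an open text handle in place of the io.StringIO sample.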