Archiving Guidelines
This page summarizes the archival process and standards for data providers submitting continual or recurring data streams in ATRAC. See the NCEI Archive page for general information about the archive process.
Follow these guidelines to prepare recurring data for submission to the archive. Use Send2NCEI (S2N) to submit non-recurring or single-delivery data. Please note that specific programs or scientific themes may have additional guidelines that aren't included here.
Archive Submission Process
Contact NCEI before submitting an archive request to ensure there is sufficient time and available resources to support new data submissions. NCEI staff will provide crucial feedback and project-specific guidelines during the data development phase to improve the data and documentation before they are archived.
Steps to Submit Data:
- Archive request and appraisal: Fill out the ATRAC submission form to start the process. NCEI will use the information you provide to assess the data's suitability, feasibility and complexity.
- Submission Agreement: Once the archive appraisal is approved and the data submission is scheduled, you can negotiate the details of the data submission with an NCEI archive representative and document the information in the data submission agreement. The data submission agreement codifies how the data will be submitted and what subsequent services will be provided.
Submission Formats and Technical Requirements
NCEI has specific technical requirements and limitations for continual data submissions, including the total size, number of files, naming conventions, and metadata format. Refer to the NCEI common guidance for archive file formats.
General Requirements
- Data submissions should not include proprietary file formats. Any proprietary file format must be converted to an open standard file format before the data submission.
- Inclusion of source code is optional. Do not include executable files or other files that pose an IT security risk.
File Packaging and Sizing
- File limitations per data submission:
  - Typical maximum data volume submitted in 24 hours: 0.5 TB
  - Typical maximum file size: 10 GB (can be compressed)
  - Absolute maximum file size: 25 GB (can be compressed)
  - Typical minimum file size: 10 MB (can be lower for low file counts)
  - Typical maximum file count submitted in 24 hours: 500
  - Absolute maximum number of files in a tar file: 50,000
- Organize and group data by time, type, or another organizing principle so users can easily find and understand it.
- You may need to aggregate your data into tar files to meet file number limitations. Tar file contents can be extracted after submission to improve future discovery and access.
- Each submitted tar file should be complete, containing every file expected for that tar file.
- The tar files should not contain any symbolic links or file names with special characters.
- The contents of the archived tar files are usually inventoried. If a tar file contains subdirectories, then the file paths will be inventoried.
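As a rough pre-submission check, the tar file rules above could be tested with a short script. This is only a sketch and not NCEI tooling: the function name `check_tar`, the allowed-character pattern (borrowed from the file naming requirements), and the reading of 1 GB as 10^9 bytes are all assumptions made for illustration.

```python
import os
import re
import tarfile

MAX_TAR_FILES = 50_000       # absolute maximum number of files in a tar file
MAX_TAR_BYTES = 25 * 10**9   # absolute maximum file size; assumes 1 GB = 10**9 bytes
# Assumption: "special characters" means anything other than letters, digits,
# underscores, dashes, periods, and the "/" path separator.
SAFE_PATH = re.compile(r"[A-Za-z0-9_./-]+")


def check_tar(tar_path):
    """Return a list of problems found in the tar file (empty if none)."""
    problems = []
    if os.path.getsize(tar_path) > MAX_TAR_BYTES:
        problems.append("tar file is larger than 25 GB")
    # The default read mode also opens gzip, bzip2, and xz compressed tar files.
    with tarfile.open(tar_path) as tar:
        members = tar.getmembers()
    file_count = sum(1 for m in members if m.isfile())
    if file_count > MAX_TAR_FILES:
        problems.append(f"tar file holds {file_count} files (limit 50,000)")
    for member in members:
        if member.issym():
            problems.append(f"symbolic link: {member.name}")
        if not SAFE_PATH.fullmatch(member.name):
            problems.append(f"special character in name: {member.name}")
    return problems
```

An empty list from `check_tar("my_data.tar")` means none of these particular checks failed; it does not replace the review done by NCEI during ingest.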
File Naming Conventions
See the NCEI common guidance for file naming conventions.
Requirements
- Human-readable.
- Exclude special characters such as punctuation, symbols and white spaces (other than underscores, dashes, and periods for the file name extension).
- Unique file name for the dataset.
- Case-sensitive file name field values.
- Less than 160 characters in length for the archive file name.
- Less than 255 characters in length including the file paths within an archive tar file when extracted on a file system.
- Must contain enough descriptive information to indicate the type of data contained in the file, including the date/time covered by the data.
- Use the appropriate file format extension for the type of file (data, documentation, or ancillary).
Recommendations
- Only use periods ('.') to denote the file format extension.
- Use underscores or dashes to delineate fields in the file name.
- Include a field in the file names for the "file creation date" to make the file name values unique.
- Include a data product version in the file name (if applicable).
  - Use Semantic Versioning practices to format the version number.
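Several of the requirements above are mechanical enough to check before submission. The following is a minimal sketch, assuming that the allowed characters are letters, digits, underscores, dashes, and periods; `check_file_name` is a hypothetical helper, and it does not test uniqueness, readability, or whether the name describes the data.

```python
import re

# Assumption: only letters, digits, underscores, dashes, and periods are allowed.
ALLOWED_NAME = re.compile(r"[A-Za-z0-9_.-]+")
MAX_NAME_CHARS = 160  # the archive file name must be shorter than this


def check_file_name(name):
    """Return a list of naming problems for one archive file name."""
    problems = []
    if len(name) >= MAX_NAME_CHARS:
        problems.append("name is 160 characters or longer")
    if not ALLOWED_NAME.fullmatch(name):
        problems.append("name contains special characters or white space")
    if "." not in name:
        problems.append("name has no file format extension")
    return problems
```

For example, `check_file_name("patmosx-level2_v05r03_NOAA-18_d20121231_c20130831.tar")` returns an empty list, while a name containing a space or a question mark is flagged.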
Submission Manifest Files
The submission manifest file contains checksum information for each submitted data file. This information is used by the data ingest system to ensure the integrity of the received file.
A manifest file contains three text values, comma-delimited on one line with no spaces:
- The associated data file name
- The corresponding checksum value
- The data file size in bytes
Content format of a submission manifest file:
[filename],[checksum],[filesize]
Example content:
patmosx-level2_v05r03_NOAA-18_d20121231_c20130831.tar,da3e100dc9e7bebb810985e37875de38,130631488
File name format of a submission manifest file:
[filename].mnf
Example file name:
patmosx-level2_v05r03_NOAA-18_d20121231_c20130831.tar.mnf
Additional Information
- One submission manifest per associated data file
- Transferred via the same protocol and method as the associated data file
- Transferred following a successful transfer of the associated data file
- Multiple checksum algorithms are supported, including MD5, SHA-256, and SHA-384
- A correct submission manifest file is required for the successful ingest of the associated data file
- Code to create a submission manifest file can be provided upon request
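For illustration, a manifest in the format described above could be produced along these lines. This is a sketch under stated assumptions and not the code NCEI provides: `write_manifest` is a hypothetical name, the checksum is written as a lowercase hexadecimal digest (as in the MD5 example above), and the trailing newline is a guess about what the ingest system accepts.

```python
import hashlib
import os


def write_manifest(data_path, algorithm="md5"):
    """Write [filename].mnf next to the data file and return the manifest path.

    `algorithm` is any name accepted by hashlib.new, such as "md5",
    "sha256", or "sha384".
    """
    digest = hashlib.new(algorithm)
    with open(data_path, "rb") as data_file:
        # Read in 1 MiB chunks so large files are not loaded into memory at once.
        for chunk in iter(lambda: data_file.read(1024 * 1024), b""):
            digest.update(chunk)
    file_name = os.path.basename(data_path)
    file_size = os.path.getsize(data_path)
    manifest_path = data_path + ".mnf"
    with open(manifest_path, "w", newline="\n") as manifest:
        # One line: [filename],[checksum],[filesize] with no spaces.
        manifest.write(f"{file_name},{digest.hexdigest()},{file_size}\n")
    return manifest_path
```

Calling `write_manifest("patmosx-level2_v05r03_NOAA-18_d20121231_c20130831.tar")` would create `patmosx-level2_v05r03_NOAA-18_d20121231_c20130831.tar.mnf` in the same directory. Confirm the checksum algorithm and exact format with your NCEI point-of-contact before relying on it.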
Supported File Transfer Protocols and Guidelines for Submitting Data to NCEI
- Options for file transfer protocol in order of preference:
  - S3 Get by Data Ingest from the Data Provider
  - SFTP Push by the Data Provider to Data Ingest
  - FTPS Pull by Data Ingest from the Data Provider
  - SFTP Pull by Data Ingest from the Data Provider
  - Authenticated FTP Pull by Data Ingest from the Data Provider
  - Anonymous FTP Pull by Data Ingest from the Data Provider
- All new file transfer connections (Push or Pull) must first be reviewed and approved.
- The data provider is required to keep data available for no less than seven days after it is initially staged or pushed to data ingest. Fourteen days or more is recommended to minimize the chance of data loss.
- Operational data ingest and archive systems are designed to run continuously. However, NCEI can only monitor, identify and resolve operational issues during Federal Government business hours (M-F, 8AM-5PM).
- Submitting data on physical media takes more time and effort because NCEI needs additional resources to mount, scan, move, and potentially process the data. As a result, it takes significantly longer to establish a data connection for data on physical media.
File Submission Testing
You will need to test your connection to the data ingest system before submitting your data. Prepare representative sample files of the data you plan to archive and discuss a testing connection method with your NCEI point-of-contact. You must submit files with companion submission manifests following the expected file naming conventions via the file transfer interface.
Once successful testing is demonstrated and a deployment to operations is approved, the data submission start date is scheduled. If the initial file transfer is delayed more than 3 months after the start date, the testing steps must be repeated.
Data Discovery and Access
NCEI provides common services for the public to search and access the data in its archive. Additional, customized services that meet specific community needs may be supported depending on the project's scope.
The services to distribute the data and provide web access support are discussed during the project's planning phase and are documented in the associated Data Submission Agreement with NCEI. The data access services provided by NCEI depend on the resources available for the project.
All data submitted to the archive may be publicly shared unless otherwise stated in the data submission agreement or another agreement with NCEI. NCEI can only provide public access to data after the data has been ingested to the archive and access services are deployed.
You (the data provider) are responsible for providing metadata information about your data. NCEI is responsible for ensuring the provided metadata complies with NOAA requirements following ISO standards.
NCEI may adjust titles, abstracts, and other identification information about the data to promote data access and discoverability on the web. NCEI will issue a DOI for the collection after the data are archived and the ISO metadata are published.
NCEI provides Tier 1 and Tier 2 Data Access services in support of Data Stewardship.
| Tier 1 | Long Term Preservation and Basic Access |
| Tier 2 | Enhanced Access and Basic Quality Assurance |
Specific implementations can be broken down as follows:
Existing User Community and Legacy Capabilities
Tier 1
- Web Accessible Folder (WAF/https, S3, Object Storage, etc.)
- FTP (not recommended, but supported)
- Cloud compatible formats (Zarr, Parquet, etc.)
Tier 2
- Web Application and API Service Support
- Built to specification per container; included Common Data Services components (OneStop, Common Access, OnlineStore/NES2, etc.)
- Data subsetting and order fulfillment functions
- GIS map viewers and services
- Data APIs
  - ZarrDAP (mimics OpenDAP protocol)
  - THREDDS Data Server
  - ERDDAP
  - Hyrax
- Jupyter Notebook clients
- Metadata APIs (JSON, DCAT, OSDD, ATOM, STAC etc.)