S3

Preserv uses S3 to receive bags from depositors, to store them in long-term preservation, and to restore files and objects to depositors. Preserv also uses a staging bucket to process items during ingest. In addition, Glacier, Glacier Deep Archive, and Wasabi provide preservation storage buckets.

Receiving Buckets

To deposit materials into APTrust, depositors upload tarred bags to their receiving bucket. A cron job periodically scans receiving buckets for new bags, creating a Registry WorkItem and an NSQ message for each new bag.

Receiving buckets are accessible to APTrust users and admins at the owning institution. Bucket names follow the patterns below. In each case, inst identifier is the institution's identifier, which is also its domain name (virginia.edu, georgetown.edu, etc.).

aptrust.receiving.<inst identifier>          # Production Environment
aptrust.receiving.test.<inst identifier>     # Demo Environment
aptrust.receiving.staging.<inst identifier>  # Staging Environment
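
The naming patterns above can be expressed as a small helper. This is a minimal sketch, not code taken from Preserv: the function name receiving_bucket and the environment labels production, demo, and staging are assumptions made for illustration.

```python
# Hypothetical helper. The prefixes come from the patterns above; the
# function name and environment labels are illustrative assumptions.
RECEIVING_PREFIXES = {
    "production": "aptrust.receiving.",
    "demo": "aptrust.receiving.test.",
    "staging": "aptrust.receiving.staging.",
}


def receiving_bucket(inst_identifier: str, env: str = "production") -> str:
    """Return the receiving bucket name for an institution and environment."""
    if env not in RECEIVING_PREFIXES:
        raise ValueError(f"unknown environment: {env!r}")
    return RECEIVING_PREFIXES[env] + inst_identifier


print(receiving_bucket("virginia.edu"))            # aptrust.receiving.virginia.edu
print(receiving_bucket("georgetown.edu", "demo"))  # aptrust.receiving.test.georgetown.edu
```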

Restoration Buckets

Users and admins at depositing institutions also have access to their institution's restoration bucket. When a depositor requests a restoration, Preserv copies files and objects to this bucket for depositors to download.

Restoration buckets follow this naming scheme:

aptrust.restore.<inst identifier>          # Production Environment
aptrust.restore.test.<inst identifier>     # Demo Environment
aptrust.restore.staging.<inst identifier>  # Staging Environment

When Preserv restores an entire object, it creates a new bag from all of the object's preserved files and puts the bag in the depositor's restoration bucket. If the University of Virginia were to restore an object called BagOfPhotos, it would appear in the restoration bucket like this:

aptrust.restore.virginia.edu/BagOfPhotos.tar

A single file restored from that bag would appear in a subdirectory like this:

aptrust.restore.virginia.edu/BagOfPhotos/data/photo.jpg
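
The two restoration layouts above differ only in what follows the bucket name. The sketch below reproduces both for the production environment. The function names and parameters are hypothetical and are not part of Preserv.

```python
# Hypothetical helpers; names and signatures are assumptions for illustration.
def restored_object_location(inst_identifier: str, bag_name: str) -> str:
    """Where a fully restored object (a tarred bag) appears in production."""
    return f"aptrust.restore.{inst_identifier}/{bag_name}.tar"


def restored_file_location(inst_identifier: str, bag_name: str, path_in_bag: str) -> str:
    """Where a single restored file appears in production."""
    return f"aptrust.restore.{inst_identifier}/{bag_name}/{path_in_bag}"


print(restored_object_location("virginia.edu", "BagOfPhotos"))
print(restored_file_location("virginia.edu", "BagOfPhotos", "data/photo.jpg"))
```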

Staging Bucket

During ingest, the ingest_staging_uploader extracts files from tarred bags in the receiving buckets and copies them into a staging bucket for further processing. The staging bucket is not public and is not accessible to any depositors. Files remain in the staging bucket until ingest completes, at which point the ingest_cleanup worker deletes them.

If a bag fails ingest due to too many transient errors (usually network errors), its files will remain in the staging bucket so that when an APTrust admin requeues the WorkItem, Preserv can pick up where it left off and continue the ingest.

The staging buckets are:

aptrust.prod.staging         # Production environment
aptrust.test.staging         # Demo environment
aptrust.staging.staging      # Staging environment

If an item fails ingest due to transient errors and an APTrust admin decides to cancel the ingest instead of requeueing it, the admin will need to manually delete the files from staging.

Inside the staging bucket is a folder for each Ingest WorkItem in process. Inside that folder are the bag's manifests, tag manifests, and tag files, as well as the bag's payload files. Manifests and tag files are stored under the same name they had in the original bag (bag-info.txt, manifest-sha256.txt, etc.). Payload files are stored with UUID names instead of their original paths. They'll have these same UUID names when copied to preservation storage.

The staging bucket entries for WorkItem 5432 on the demo system would look like this:

aptrust.test.staging/5432
aptrust.test.staging/5432/aptrust-info.txt
aptrust.test.staging/5432/bag-info.txt
aptrust.test.staging/5432/manifest-sha256.txt
aptrust.test.staging/5432/1e3fa668-73b7-4885-98e1-f16170b7ad54
aptrust.test.staging/5432/49c5ebc5-e840-4fcb-b851-c8f02cd4953d
aptrust.test.staging/5432/5a757826-31a1-406d-a889-83c05c245213
aptrust.test.staging/5432/e6b32693-6ed9-4974-946c-0ce1808015aa
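
The key layout above can be sketched as follows. This is an illustration under stated assumptions, not the uploader's actual code: it treats any path under data/ as a payload file (the BagIt convention), generates version 4 UUIDs, and uses the hypothetical function name staging_keys.

```python
import uuid


def staging_keys(work_item_id: int, bag_paths: list[str]) -> dict[str, str]:
    """Map each path inside a bag to an illustrative staging bucket key.

    Manifests and tag files keep their original names. Payload files,
    assumed here to be everything under data/, get UUID names instead.
    """
    keys = {}
    for path in bag_paths:
        if path.startswith("data/"):
            name = str(uuid.uuid4())  # assumption: a random (v4) UUID
        else:
            name = path
        keys[path] = f"{work_item_id}/{name}"
    return keys


paths = ["aptrust-info.txt", "bag-info.txt", "manifest-sha256.txt", "data/photo.jpg"]
for original, key in staging_keys(5432, paths).items():
    print(f"{original} -> aptrust.test.staging/{key}")
```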

Preservation Buckets

Preservation buckets are for long-term preservation storage. They are not publicly accessible and nothing inside them is publicly accessible. APTrust admins can list bucket contents and retrieve files but cannot delete anything from these buckets.

All deletions must be done by the depositing institution, through the Registry, and all must be approved by an administrator at the depositing institution before Preserv will carry them out.

Files in the preservation buckets are stored with UUID names and have the following metadata:

| Name | Description |
| --- | --- |
| x-amz-meta-md5 | The md5 checksum we calculated for this file on the most recent ingest. |
| x-amz-meta-sha256 | The sha256 checksum we calculated for this file on the most recent ingest. |
| x-amz-meta-bagpath | The original path of the file inside the bag that the depositor submitted. E.g. data/photos/image01.jpg. Note: In Wasabi, this field is called x-amz-meta-bagpath-encoded, due to a bug in Wasabi's parsing of HTTP headers. Wasabi will not accept any header with two or more consecutive spaces, which we sometimes encounter in depositor file names. The encoded version of the bagpath uses URL query string encoding to replace spaces with %20. |
| x-amz-meta-institution | The identifier (domain name) of the institution that owns the bag. |
| x-amz-meta-bag | The Registry's identifier for the bag to which this file belongs. E.g. virginia.edu/BagOfPhotos. |
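
The sketch below shows how the metadata for one file might be assembled, including the Wasabi variant of the bagpath field. It is illustrative only: the function name and parameters are hypothetical, and urllib.parse.quote stands in for whatever encoder Preserv uses. quote replaces each space with %20, as described above, but how Preserv escapes other characters is not stated here and is an assumption.

```python
from urllib.parse import quote


def preservation_metadata(md5: str, sha256: str, bagpath: str,
                          institution: str, bag: str,
                          wasabi: bool = False) -> dict[str, str]:
    """Build an illustrative metadata dict for a preserved file."""
    if wasabi:
        # Wasabi rejects headers containing consecutive spaces, so the
        # bagpath is stored percent-encoded under a different name.
        bagpath_key = "x-amz-meta-bagpath-encoded"
        bagpath_value = quote(bagpath)  # spaces become %20, "/" is kept
    else:
        bagpath_key = "x-amz-meta-bagpath"
        bagpath_value = bagpath
    return {
        "x-amz-meta-md5": md5,
        "x-amz-meta-sha256": sha256,
        bagpath_key: bagpath_value,
        "x-amz-meta-institution": institution,
        "x-amz-meta-bag": bag,
    }


# Placeholder checksum strings, not real digests.
meta = preservation_metadata("md5-placeholder", "sha256-placeholder",
                             "data/photos/my  image.jpg",
                             "virginia.edu", "virginia.edu/BagOfPhotos",
                             wasabi=True)
print(meta["x-amz-meta-bagpath-encoded"])  # data/photos/my%20%20image.jpg
```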

Note

Due to an early depositor decision back in 2014, preservation buckets generally do not use encryption. The exceptions are Glacier archives and Wasabi buckets, where encryption is mandatory.

Also due to a 2014 depositor decision, versioning is turned off on all preservation buckets. Newer versions of files overwrite older versions. If a depositor explicitly wants to save two versions of an object, they have to upload bags with distinct names, such as BagOfPhotos_v1.tar and BagOfPhotos_v2.tar.

Production Preservation Bucket Names

| Name | Storage Option | Encrypted | Versioned | Description |
| --- | --- | --- | --- | --- |
| aptrust.preservation.storage | Standard | No | No | Standard S3 preservation storage bucket in Virginia. Everything in this bucket is replicated to aptrust.preservation.oregon. |
| aptrust.preservation.oregon | Standard | Yes | No | Glacier storage for items using the standard storage option. Everything in here is replicated from aptrust.preservation.storage. |
| aptrust.preservation.glacier-deep.oh | Glacier-Deep-OH | Yes | No | Glacier Deep Archive in Ohio. |
| aptrust.preservation.glacier-deep.or | Glacier-Deep-OR | Yes | No | Glacier Deep Archive in Oregon. |
| aptrust.preservation.glacier-deep.va | Glacier-Deep-VA | Yes | No | Glacier Deep Archive in Virginia. |
| aptrust.preservation.glacier.oh | Glacier-OH | Yes | No | Glacier archive in Ohio. |
| aptrust.preservation.glacier.or | Glacier-OR | Yes | No | Glacier archive in Oregon. |
| aptrust.preservation.glacier.va | Glacier-VA | Yes | No | Glacier archive in Virginia. |
| aptrust-production-wasabi-or | Wasabi-OR | Yes | No | Wasabi storage in Oregon. |
| aptrust-production-wasabi-va | Wasabi-VA | Yes | No | Wasabi storage in Virginia. |

Demo Preservation Bucket Names

| Name | Storage Option | Encrypted | Versioned | Description |
| --- | --- | --- | --- | --- |
| aptrust.test.preservation.storage | Standard | No | No | Standard S3 preservation storage bucket in Virginia. Everything in this bucket is replicated to aptrust.test.preservation.oregon. |
| aptrust.test.preservation.oregon | Standard | Yes | No | Glacier storage for items using the standard storage option. Everything in here is replicated from aptrust.test.preservation.storage. |
| aptrust.test.preservation.glacier-deep.oh | Glacier-Deep-OH | Yes | No | Glacier Deep Archive in Ohio. |
| aptrust.test.preservation.glacier-deep.or | Glacier-Deep-OR | Yes | No | Glacier Deep Archive in Oregon. |
| aptrust.test.preservation.glacier-deep.va | Glacier-Deep-VA | Yes | No | Glacier Deep Archive in Virginia. |
| aptrust.test.preservation.glacier.oh | Glacier-OH | Yes | No | Glacier archive in Ohio. |
| aptrust.test.preservation.glacier.or | Glacier-OR | Yes | No | Glacier archive in Oregon. |
| aptrust.test.preservation.glacier.va | Glacier-VA | Yes | No | Glacier archive in Virginia. |
| aptrust-demo-wasabi-or | Wasabi-OR | Yes | No | Wasabi storage in Oregon. |
| aptrust-demo-wasabi-va | Wasabi-VA | Yes | No | Wasabi storage in Virginia. |

Staging Preservation Bucket Names

| Name | Storage Option | Encrypted | Versioned | Description |
| --- | --- | --- | --- | --- |
| aptrust.staging.preservation.storage | Standard | No | No | Standard S3 preservation storage bucket in Virginia. Everything in this bucket is replicated to aptrust.staging.preservation.oregon. |
| aptrust.staging.preservation.oregon | Standard | Yes | No | Glacier storage for items using the standard storage option. Everything in here is replicated from aptrust.staging.preservation.storage. |
| aptrust.staging.preservation.glacier-deep.oh | Glacier-Deep-OH | Yes | No | Glacier Deep Archive in Ohio. |
| aptrust.staging.preservation.glacier-deep.or | Glacier-Deep-OR | Yes | No | Glacier Deep Archive in Oregon. |
| aptrust.staging.preservation.glacier-deep.va | Glacier-Deep-VA | Yes | No | Glacier Deep Archive in Virginia. |
| aptrust.staging.preservation.glacier.oh | Glacier-OH | Yes | No | Glacier archive in Ohio. |
| aptrust.staging.preservation.glacier.or | Glacier-OR | Yes | No | Glacier archive in Oregon. |
| aptrust.staging.preservation.glacier.va | Glacier-VA | Yes | No | Glacier archive in Virginia. |
| aptrust-staging-wasabi-or | Wasabi-OR | Yes | No | Wasabi storage in Oregon. |
| aptrust-staging-wasabi-va | Wasabi-VA | Yes | No | Wasabi storage in Virginia. |
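
The three tables share one pattern per environment, which the lookup below tries to capture. It is a sketch built only from the names listed above, not Preserv's own configuration code. The function name preservation_buckets and the environment labels are assumptions, and because S3 bucket names cannot contain spaces, the demo and staging names are written as unbroken dotted names.

```python
# Hypothetical lookup; everything except the bucket name parts is assumed.
DOTTED_PREFIXES = {
    "production": "aptrust.preservation",
    "demo": "aptrust.test.preservation",
    "staging": "aptrust.staging.preservation",
}

# The Standard option maps to two buckets: the primary and its replica.
DOTTED_SUFFIXES = {
    "Standard": ["storage", "oregon"],
    "Glacier-Deep-OH": ["glacier-deep.oh"],
    "Glacier-Deep-OR": ["glacier-deep.or"],
    "Glacier-Deep-VA": ["glacier-deep.va"],
    "Glacier-OH": ["glacier.oh"],
    "Glacier-OR": ["glacier.or"],
    "Glacier-VA": ["glacier.va"],
}

WASABI_REGIONS = {"Wasabi-OR": "or", "Wasabi-VA": "va"}


def preservation_buckets(storage_option: str, env: str = "production") -> list[str]:
    """Return the preservation bucket names for a storage option."""
    if env not in DOTTED_PREFIXES:
        raise ValueError(f"unknown environment: {env!r}")
    if storage_option in WASABI_REGIONS:
        return [f"aptrust-{env}-wasabi-{WASABI_REGIONS[storage_option]}"]
    if storage_option in DOTTED_SUFFIXES:
        prefix = DOTTED_PREFIXES[env]
        return [f"{prefix}.{suffix}" for suffix in DOTTED_SUFFIXES[storage_option]]
    raise ValueError(f"unknown storage option: {storage_option!r}")


print(preservation_buckets("Standard"))
print(preservation_buckets("Wasabi-VA", "staging"))
```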