S3
Preserv uses S3 to receive bags from depositors, to store them in long-term preservation, and to restore files and objects to depositors. Preserv also uses a staging bucket to process items during ingest. In addition, Glacier, Glacier Deep Archive, and Wasabi provide preservation storage buckets.
Receiving Buckets
To deposit materials into APTrust, depositors upload tarred bags to their receiving bucket. A cron job periodically scans receiving buckets for new bags, creating a Registry WorkItem and an NSQ message for each new bag.
Receiving buckets are accessible to APTrust users and admins at the owning institution. Bucket names follow the patterns below. In each case, <inst identifier> is the institution's identifier, which is also its domain name (virginia.edu, georgetown.edu, etc.).
aptrust.receiving.<inst identifier> # Production Environment
aptrust.receiving.test.<inst identifier> # Demo Environment
aptrust.receiving.staging.<inst identifier> # Staging Environment
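The naming patterns above can be sketched as a small lookup. This is an illustration only: the function name `receiving_bucket` and the environment keys `production`, `demo`, and `staging` are hypothetical and are not taken from Preserv's code.

```python
# Hypothetical helper. The prefixes are copied from the patterns listed above;
# the environment keys are names chosen for this example.
RECEIVING_PREFIXES = {
    "production": "aptrust.receiving.",
    "demo": "aptrust.receiving.test.",
    "staging": "aptrust.receiving.staging.",
}


def receiving_bucket(environment: str, inst_identifier: str) -> str:
    """Return the receiving bucket name for an institution in an environment."""
    try:
        prefix = RECEIVING_PREFIXES[environment]
    except KeyError:
        raise ValueError(f"unknown environment: {environment!r}") from None
    return prefix + inst_identifier


if __name__ == "__main__":
    for env in RECEIVING_PREFIXES:
        print(env, receiving_bucket(env, "virginia.edu"))
```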
Restoration Buckets
Users and admins at depositing institutions also have access to their institution's restoration bucket. When a depositor requests a restoration, Preserv copies files and objects to this bucket for depositors to download.
Restoration buckets follow this naming scheme:
aptrust.restore.<inst identifier> # Production Environment
aptrust.restore.test.<inst identifier> # Demo Environment
aptrust.restore.staging.<inst identifier> # Staging Environment
When Preserv restores an entire object, it creates a new bag from all of the object's preserved files and puts the bag in the depositor's restoration bucket. If the University of Virginia were to restore an object called BagOfPhotos, it would appear in the restoration bucket like this:
aptrust.restore.virginia.edu/BagOfPhotos.tar
When Preserv restores a single file from that bag, the file appears in a subdirectory like this:
aptrust.restore.virginia.edu/BagOfPhotos/data/photo.jpg
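As a rough sketch of the two layouts above, the functions below build the restoration location for a whole object and for a single file. All function names are hypothetical, and the sketch assumes the object identifier has the form `<inst identifier>/<bag name>` (for example `virginia.edu/BagOfPhotos`).

```python
# Hypothetical helpers. Prefixes are copied from the restoration bucket
# naming scheme above; the environment keys are chosen for this example.
RESTORE_PREFIXES = {
    "production": "aptrust.restore.",
    "demo": "aptrust.restore.test.",
    "staging": "aptrust.restore.staging.",
}


def _split_identifier(object_identifier: str) -> tuple[str, str]:
    """Split 'virginia.edu/BagOfPhotos' into ('virginia.edu', 'BagOfPhotos')."""
    institution, _, bag_name = object_identifier.partition("/")
    if not institution or not bag_name:
        raise ValueError(f"bad object identifier: {object_identifier!r}")
    return institution, bag_name


def restored_object_location(environment: str, object_identifier: str) -> str:
    """Location of the tarred bag created when a whole object is restored."""
    institution, bag_name = _split_identifier(object_identifier)
    bucket = RESTORE_PREFIXES[environment] + institution
    return f"{bucket}/{bag_name}.tar"


def restored_file_location(environment: str, object_identifier: str,
                           path_in_bag: str) -> str:
    """Location of a single restored file, under a folder named for the bag."""
    institution, bag_name = _split_identifier(object_identifier)
    bucket = RESTORE_PREFIXES[environment] + institution
    return f"{bucket}/{bag_name}/{path_in_bag}"
```

For example, `restored_file_location("production", "virginia.edu/BagOfPhotos", "data/photo.jpg")` yields the single-file path shown above.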
Staging Bucket
During ingest, the ingest_staging_uploader extracts files from tarred bags in the receiving buckets and copies them into a staging bucket for further processing. The staging bucket is not public and is not accessible to any depositors. Files remain in the staging bucket until ingest completes, at which point the ingest_cleanup worker deletes them.
If a bag fails ingest due to too many transient errors (usually network errors), its files will remain in the staging bucket so that when an APTrust admin requeues the WorkItem, Preserv can pick up where it left off and continue the ingest.
The staging buckets are:
aptrust.prod.staging # Production environment
aptrust.test.staging # Demo environment
aptrust.staging.staging # Staging environment
If an item fails ingest due to transient errors and an APTrust admin decides to cancel the ingest instead of requeueing it, the admin will need to manually delete the files from staging.
Inside the staging bucket is a folder for each Ingest WorkItem in process. Inside that folder are the bag's manifests, tag manifests, and tag files, as well as the bag's payload files. Manifests and tag files are stored under the same name they had in the original bag (bag-info.txt, manifest-sha256.txt, etc.). Payload files are stored with UUID names instead of their original paths. They'll have these same UUID names when copied to preservation storage.
The staging bucket entries for WorkItem 5432 on the demo system would look like this:
aptrust.test.staging/5432
aptrust.test.staging/5432/aptrust-info.txt
aptrust.test.staging/5432/bag-info.txt
aptrust.test.staging/5432/manifest-sha256.txt
aptrust.test.staging/5432/1e3fa668-73b7-4885-98e1-f16170b7ad54
aptrust.test.staging/5432/49c5ebc5-e840-4fcb-b851-c8f02cd4953d
aptrust.test.staging/5432/5a757826-31a1-406d-a889-83c05c245213
aptrust.test.staging/5432/e6b32693-6ed9-4974-946c-0ce1808015aa
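The key layout in this listing can be sketched as follows. This is an illustration, not Preserv's implementation: the function name `staging_keys` is hypothetical, and the sketch assumes that payload files are the ones under the bag's `data/` directory (as in the BagIt layout) and that a random UUID is assigned to each.

```python
import uuid


def staging_keys(work_item_id: int, bag_paths: list[str]) -> dict[str, str]:
    """Map each path inside a bag to a key within the staging bucket.

    Manifests and tag files keep their original names. Payload files
    (assumed here to be everything under data/) get a UUID name.
    """
    keys = {}
    for path in bag_paths:
        if path.startswith("data/"):
            name = str(uuid.uuid4())  # payload file: stored under a UUID
        else:
            name = path  # manifest or tag file: stored under its own name
        keys[path] = f"{work_item_id}/{name}"
    return keys


if __name__ == "__main__":
    paths = ["aptrust-info.txt", "bag-info.txt", "manifest-sha256.txt",
             "data/photo.jpg"]
    for original, key in staging_keys(5432, paths).items():
        print(f"{original} -> aptrust.test.staging/{key}")
```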
Preservation Buckets
Preservation buckets are for long-term preservation storage. They are not publicly accessible and nothing inside them is publicly accessible. APTrust admins can list bucket contents and retrieve files but cannot delete anything from these buckets.
All deletions must be done by the depositing institution, through the Registry, and all must be approved by an administrator at the depositing institution before Preserv will carry them out.
Files in the preservation buckets are stored with UUID names and have the following metadata:
Name | Description |
---|---|
x-amz-meta-md5 | The md5 checksum we calculated for this file on the most recent ingest. |
x-amz-meta-sha256 | The sha256 checksum we calculated for this file on the most recent ingest. |
x-amz-meta-bagpath | The original path of the file inside the bag that the depositor submitted. E.g. data/photos/image01.jpg. Note: In Wasabi, this field is called x-amz-meta-bagpath-encoded, due to a bug in Wasabi's parsing of HTTP headers. Wasabi will not accept any header with two or more consecutive spaces, which we sometimes encounter in depositor file names. The encoded version of the bagpath uses URL query string encoding to replace spaces with %20. |
x-amz-meta-institution | The identifier (domain name) of the institution that owns the bag. |
x-amz-meta-bag | The Registry's identifier for the bag to which this file belongs. E.g. virginia.edu/BagOfPhotos. |
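A minimal sketch of how the metadata in this table might be assembled, including the Wasabi variant of the bagpath field. The function name `preservation_metadata` and its parameters are hypothetical. The sketch uses Python's `urllib.parse.quote`, which turns each space into %20 and leaves `/` unescaped; the encoder Preserv actually uses may treat other characters differently.

```python
from urllib.parse import quote


def preservation_metadata(md5: str, sha256: str, bagpath: str,
                          institution: str, bag: str,
                          wasabi: bool = False) -> dict[str, str]:
    """Build the metadata headers described in the table above."""
    metadata = {
        "x-amz-meta-md5": md5,
        "x-amz-meta-sha256": sha256,
        "x-amz-meta-institution": institution,
        "x-amz-meta-bag": bag,
    }
    if wasabi:
        # Wasabi rejects header values with consecutive spaces, so the
        # path is percent-encoded and stored under a different name.
        metadata["x-amz-meta-bagpath-encoded"] = quote(bagpath)
    else:
        metadata["x-amz-meta-bagpath"] = bagpath
    return metadata


if __name__ == "__main__":
    # Placeholder checksums, for illustration only.
    meta = preservation_metadata("<md5>", "<sha256>", "data/my  photo.jpg",
                                 "virginia.edu", "virginia.edu/BagOfPhotos",
                                 wasabi=True)
    print(meta["x-amz-meta-bagpath-encoded"])
```

Many S3 client libraries take user metadata keys without the `x-amz-meta-` prefix and add it on the wire; the full header names are used here only to match the table.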
Note
Due to an early depositor decision back in 2014, preservation buckets generally do not use encryption. The exceptions are Glacier archives and Wasabi buckets, where encryption is mandatory.
Also due to a 2014 depositor decision, versioning is turned off on all preservation buckets. Newer versions of files overwrite older versions. If a depositor explicitly wants to save two versions of an object, they have to upload bags with distinct names, such as BagOfPhotos_v1.tar and BagOfPhotos_v2.tar.
Production Preservation Bucket Names
Name | Storage Option | Encrypted | Versioned | Description |
---|---|---|---|---|
aptrust.preservation.storage | Standard | No | No | Standard S3 preservation storage bucket in Virginia. Everything in this bucket is replicated to aptrust.preservation.oregon. |
aptrust.preservation.oregon | Standard | Yes | No | Glacier storage for items using the standard storage option. Everything in here is replicated from aptrust.preservation.storage. |
aptrust.preservation.glacier-deep.oh | Glacier-Deep-OH | Yes | No | Glacier Deep Archive in Ohio. |
aptrust.preservation.glacier-deep.or | Glacier-Deep-OR | Yes | No | Glacier Deep Archive in Oregon. |
aptrust.preservation.glacier-deep.va | Glacier-Deep-VA | Yes | No | Glacier Deep Archive in Virginia. |
aptrust.preservation.glacier.oh | Glacier-OH | Yes | No | Glacier archive in Ohio. |
aptrust.preservation.glacier.or | Glacier-OR | Yes | No | Glacier archive in Oregon. |
aptrust.preservation.glacier.va | Glacier-VA | Yes | No | Glacier archive in Virginia. |
aptrust-production-wasabi-or | Wasabi-OR | Yes | No | Wasabi storage in Oregon. |
aptrust-production-wasabi-va | Wasabi-VA | Yes | No | Wasabi storage in Virginia. |
Demo Preservation Bucket Names
Name | Storage Option | Encrypted | Versioned | Description |
---|---|---|---|---|
aptrust.test.preservation.storage | Standard | No | No | Standard S3 preservation storage bucket in Virginia. Everything in this bucket is replicated to aptrust.test.preservation.oregon. |
aptrust.test.preservation.oregon | Standard | Yes | No | Glacier storage for items using the standard storage option. Everything in here is replicated from aptrust.test.preservation.storage. |
aptrust.test.preservation.glacier-deep.oh | Glacier-Deep-OH | Yes | No | Glacier Deep Archive in Ohio. |
aptrust.test.preservation.glacier-deep.or | Glacier-Deep-OR | Yes | No | Glacier Deep Archive in Oregon. |
aptrust.test.preservation.glacier-deep.va | Glacier-Deep-VA | Yes | No | Glacier Deep Archive in Virginia. |
aptrust.test.preservation.glacier.oh | Glacier-OH | Yes | No | Glacier archive in Ohio. |
aptrust.test.preservation.glacier.or | Glacier-OR | Yes | No | Glacier archive in Oregon. |
aptrust.test.preservation.glacier.va | Glacier-VA | Yes | No | Glacier archive in Virginia. |
aptrust-demo-wasabi-or | Wasabi-OR | Yes | No | Wasabi storage in Oregon. |
aptrust-demo-wasabi-va | Wasabi-VA | Yes | No | Wasabi storage in Virginia. |
Staging Preservation Bucket Names
Name | Storage Option | Encrypted | Versioned | Description |
---|---|---|---|---|
aptrust.staging.preservation.storage | Standard | No | No | Standard S3 preservation storage bucket in Virginia. Everything in this bucket is replicated to aptrust.staging.preservation.oregon. |
aptrust.staging.preservation.oregon | Standard | Yes | No | Glacier storage for items using the standard storage option. Everything in here is replicated from aptrust.staging.preservation.storage. |
aptrust.staging.preservation.glacier-deep.oh | Glacier-Deep-OH | Yes | No | Glacier Deep Archive in Ohio. |
aptrust.staging.preservation.glacier-deep.or | Glacier-Deep-OR | Yes | No | Glacier Deep Archive in Oregon. |
aptrust.staging.preservation.glacier-deep.va | Glacier-Deep-VA | Yes | No | Glacier Deep Archive in Virginia. |
aptrust.staging.preservation.glacier.oh | Glacier-OH | Yes | No | Glacier archive in Ohio. |
aptrust.staging.preservation.glacier.or | Glacier-OR | Yes | No | Glacier archive in Oregon. |
aptrust.staging.preservation.glacier.va | Glacier-VA | Yes | No | Glacier archive in Virginia. |
aptrust-staging-wasabi-or | Wasabi-OR | Yes | No | Wasabi storage in Oregon. |
aptrust-staging-wasabi-va | Wasabi-VA | Yes | No | Wasabi storage in Virginia. |
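The three tables follow a regular naming pattern, which the sketch below captures as a lookup from environment and storage option to bucket names. The function name `preservation_buckets` and the environment keys are hypothetical; the suffixes and prefixes are read off the tables, and bucket names are written without spaces. Note that the Standard option maps to two buckets, because the Virginia bucket is replicated to Oregon.

```python
# Hypothetical lookup. Prefixes and suffixes are derived from the tables above.
S3_PREFIXES = {
    "production": "aptrust.preservation.",
    "demo": "aptrust.test.preservation.",
    "staging": "aptrust.staging.preservation.",
}

S3_SUFFIXES = {
    "Standard": ("storage", "oregon"),  # primary bucket, then its replica
    "Glacier-Deep-OH": ("glacier-deep.oh",),
    "Glacier-Deep-OR": ("glacier-deep.or",),
    "Glacier-Deep-VA": ("glacier-deep.va",),
    "Glacier-OH": ("glacier.oh",),
    "Glacier-OR": ("glacier.or",),
    "Glacier-VA": ("glacier.va",),
}

WASABI_REGIONS = {"Wasabi-OR": "or", "Wasabi-VA": "va"}


def preservation_buckets(environment: str, storage_option: str) -> list[str]:
    """Return the preservation bucket name(s) for a storage option."""
    if environment not in S3_PREFIXES:
        raise ValueError(f"unknown environment: {environment!r}")
    if storage_option in WASABI_REGIONS:
        region = WASABI_REGIONS[storage_option]
        return [f"aptrust-{environment}-wasabi-{region}"]
    if storage_option in S3_SUFFIXES:
        prefix = S3_PREFIXES[environment]
        return [prefix + suffix for suffix in S3_SUFFIXES[storage_option]]
    raise ValueError(f"unknown storage option: {storage_option!r}")
```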