AWS Glacier Deep Archive Storage

The Amazon S3 Glacier storage classes are purpose-built for data archiving and provide durable, low-cost archive storage in the cloud. There are a number of tools listed on our Cloud Backup Options webpage that facilitate migrating data to the cloud.

Consultation

If you'd like to discuss budgeting and options for migrating your data to a cloud archive, we recommend contacting our team for assistance at rc@zi.columbia.edu.

Suggested use

Long-term archiving of data that is unlikely to be needed again, but which must be retained for compliance or other reasons. While the storage is inexpensive, retrieval fees can be costly ($20 per TB retrieved, plus per-object request fees).

AWS Glacier Deep Archive Costs

Current pricing can be viewed here: https://aws.amazon.com/s3/pricing/

Pricing for storage and retrieval, not including additional per-object request fees (last updated 3/16/2023):

Monthly storage cost per TB                      $0.99
Glacier Deep Archive retrieval fees per TB       $20.00

Preparation of data to be archived

Data transfer rates are much faster with a few large files than with many small files, even when the same total amount of data is being transferred. To get data restored as quickly as possible, it is best to make a small number of large tar, tar-gzip or zip archives. Of course, you must also weigh the time spent transferring data you do not actually need if a restore only requires part of a large archive.

Additionally, there is a per-request cost, with each file being a separate request. For datasets containing many millions of files, this can add up to a significant cost and must be considered.

There is no "one size fits all" approach here. Only you know your data, and only you can judge what might need to be restored, but using tar, tar-gzip or zip archives to organize the data to be archived can save a lot of money when done properly. For data that benefits from compression, you will also see savings in storage and transfer fees if you compress it. For data that does not compress well with gzip or zip (photo images are one example), compression may be a waste of time, and plain tar may be the best option. Putting in the time to research and/or test how compression affects your data can be well worth it. We stress that once an archive workflow has been designed, time spent automating it is very well spent, as doing such work manually could be prohibitively time-consuming.
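If you are unsure whether compression is worth the CPU time for your data, a quick test on a representative sample can answer the question before you commit to a workflow. A minimal sketch, where "sample_dir" is a placeholder for a representative subset of your data:

# Create one uncompressed and one gzip-compressed archive of the same sample directory
tar -cf sample.tar sample_dir
tar -czf sample.tgz sample_dir

# Compare the sizes; if the .tgz is not meaningfully smaller, compression is probably not worth it
du -h sample.tar sample.tgz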

Example Lab Usage and Costs

Lab A has determined that they have 10 datasets that will probably never be needed again, but which must be retained for compliance purposes. Each dataset is 10TB and is made up of many directories, each containing many small files.
We advised them to make tar or tgz archives grouped around the subsets of data they think might need to be restored together. This balances the throughput problems of pushing many small files to Glacier against the per-request charges, which add up quickly when transferring many small files. In this case each 10TB dataset is packed into ten 1TB tar archives.

Lab A uses s3cmd on a Linux server with sufficient RAM and CPU to push 100TB of tar archives to Glacier Deep Archive storage. This is done from the CU network, over the 10Gb DirectConnect link to AWS US East 1. It will cost them roughly $100/month to store this data in Glacier Deep Archive, and data transfer in for the upload is free. There is a charge for requests during the upload (each file is an HTTPS request when uploaded or downloaded), but since the data is collected into tar archives this is not significant ($0.05).

One year later, Lab A needs to retrieve one of the datasets, which means that ten 1TB tar files need to be downloaded. Glacier Deep Archive retrieval is $20 per TB, so that is $200 to retrieve the data. The standard retrieval request cost is $0.05 per 1,000 requests, which is minor in this case but could add up quickly for millions of files.

Below we show how changing the way the data is grouped using tar affects the costs. For the purpose of this comparison, we assume that without the use of tar the 100TB dataset would contain 100,000,000 files, and that the 10TB dataset being retrieved contains 10,000,000 files:

Storage costs for 100TB of data at the one-year point

Dataset organization                               | Storage fees (1 year) | Upload request fees ($0.05 per 1,000) | Total
Data organized in 1TB tar archives (100 files)     | $1,200.00             | $0.05                                 | $1,200.05
Data organized in 1GB tar archives (100,000 files) | $1,200.00             | $5.00                                 | $1,205.00
Data archived as-is (100,000,000 files)            | $1,200.00             | $5,000.00                             | $6,200.00

Retrieval Costs for 10TB of data


Dataset organization                               | Retrieval request fees ($0.05 per 1,000) | Retrieval fees ($20 per TB) | DirectConnect transfer fees | Total
Data organized in 1TB tar archives (10 files)      | $0.05                                    | $200.00                     | $200.00                     | $400.05
Data organized in 1GB tar archives (10,000 files)  | $0.50                                    | $200.00                     | $200.00                     | $400.50
Data archived as-is (10,000,000 files)             | $500.00                                  | $200.00                     | $200.00                     | $900.00


If you have files in the hundreds of millions, we stress the need to review API request fees to accurately project your costs. Remember to count directories as well as files when calculating your total object count. Keep in mind that other object-level operations (listing objects, lifecycle transitions, and so on) also count as requests, so a dataset stored as millions of individual objects can continue to accrue request fees over its lifetime.
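A quick way to estimate request fees before archiving is to count the objects a dataset would become and apply the per-1,000-request rate from the tables above. The path below is a placeholder; `find` prints the top-level directory, every subdirectory, and every file, so the count matches the "directories count as files" rule.

# Count every file and directory under the dataset (each would be a separate object/request)
find /path/to/dataset | wc -l

# Estimate request fees at $0.05 per 1,000 requests, e.g. for 100,000,000 objects
echo "100000000 / 1000 * 0.05" | bc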

Tools for working with Glacier

s3tools are command-line tools for transferring data to S3, Glacier and other S3-compatible storage services. s3cmd is for Linux and macOS, and S3Express is for Windows.
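Before s3cmd can talk to S3 it needs your AWS access keys. A minimal first-time setup sketch, assuming you already have an access key and secret key for your AWS account:

# One-time interactive setup: prompts for access key, secret key, region and encryption options,
# and writes them to ~/.s3cfg
s3cmd --configure

# Quick sanity check that the credentials work
s3cmd ls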

s4cmd is a drop-in replacement for s3cmd that is multi-threaded and can sometimes drastically outperform s3cmd, especially on servers with generous amounts of RAM and CPU.

S3 Browser is a Windows GUI tool for transferring data to S3, Glacier and other S3-compatible storage services.

Using PowerShell to interact with S3/Glacier

Compression/Archive tools

tar is "old faithful" as it has been around for a long time, and is a standard format that is widely used and stable. 

gzip is another favorite which is often combined with tar. It has good compression and is widely available and supported.

pigz is a parallel implementation of gzip and can in some cases be far faster (see the example after this list).

bzip2 is an alternative to gzip, and can be used with tar.

7zip is yet another alternative to gzip.
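Because pigz produces standard gzip output, it can be dropped into a tar pipeline and the result extracted with ordinary tools. A minimal sketch, assuming pigz is installed and "source" is the directory to archive:

# Create a gzip-compatible archive, compressing on all available CPU cores
tar -cf - source | pigz > archive.tgz

# The result is ordinary gzip data, so it extracts with standard tar
tar -xvzf archive.tgz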

Linux command line examples

s3cmd examples
Make bucket
    s3cmd mb s3://BUCKET
Remove bucket
    s3cmd rb s3://BUCKET
List objects or buckets
    s3cmd ls [s3://BUCKET[/PREFIX]]
List all objects in all buckets
    s3cmd la 
Put file into bucket
    s3cmd put FILE [FILE...] s3://BUCKET[/PREFIX]
Get file from bucket
    s3cmd get s3://BUCKET/OBJECT LOCAL_FILE
Delete file from bucket
    s3cmd del s3://BUCKET/OBJECT
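Note that a plain put typically lands objects in the S3 Standard storage class; to get Glacier Deep Archive pricing you need either a bucket lifecycle rule that transitions the objects, or an explicit storage class on upload. A hedged sketch, assuming a recent s3cmd version that supports the --storage-class option, using the same FILE/BUCKET placeholders as above:

Put file directly into the Deep Archive storage class
    s3cmd put --storage-class=DEEP_ARCHIVE FILE s3://BUCKET[/PREFIX]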
tar examples
Usage:

tar [options] [archive-file] [file or directory to be archived]

Some options:

-c : create an archive
-x : extract an archive
-f : use the given archive file name
-t : list the files in an archive
-u : append files that are newer than the copies already in the archive
-v : display verbose information
-A : concatenate archive files
-z : filter the archive through gzip (for .gz/.tgz archives)
-j : filter the archive through bzip2
-W : verify the archive after writing it (used together with -c; not available for compressed archives)
-r : append files or directories to an existing (uncompressed) .tar file

Examples where "source" is the directory to be archived, and "archive.tar" is the resulting tar archive:

Create uncompressed tar archive:
    tar cvf archive.tar source
Create gzip compressed tar archive:
    tar cvzf archive.tgz source
Verify a tar archive as it is written (-W must be combined with -c and cannot be used with compression):
    tar -cvWf archive.tar source
Check that an existing tar or tar-gzip archive is readable from start to finish by listing its full contents:
    tar -tvf archive.tar > /dev/null
    tar -tvzf archive.tgz > /dev/null
List the contents of a tar archive:
    tar -tf archive.tar
Extract the contents of a tar archive to a specific destination:
    tar -xvf archive.tar -C /path/to/destination
Tips/Tricks for usage of tar/tar-gzip
  • Paths are included relative to where the command is run from (unless otherwise specified). The examples above assume running the command from the directory containing both the source and the resulting tar archive. It is possible to write the resulting archive anywhere on the filesystem (or even on another host's filesystem, for example by piping the output through SSH), but it is usually best to run the command from the parent directory of the file or directory to be archived, unless you have a specific reason not to and know exactly what you are doing (see the example after these tips).
  • Naming and file extensions are up to you and are not enforced by tar. If you do not use compression but name your output file with a .tgz extension, extracting with -xvzf will give an error, because tar expects a compressed file and does not care about file extensions. The same confusion arises in the opposite case (compressing but naming the output .tar instead of .tgz). These names are conventions that make the files easier to handle later, and I suggest you use them to avoid confusion.
  • Verbose mode (-v) is used in these examples because I want to show what is happening. If you don't want the output, omit the -v from the command. I suggest using verbose mode when practical, and always when verifying archives.
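To illustrate the first tip, here is a sketch of creating an archive while writing it somewhere other than the current directory; the /data and /archive paths are placeholders.

# Run from the parent directory of "source", but write the archive to a different location
cd /data/parent
tar -cvf /archive/destination/source.tar source

# Equivalent without changing directory: -C tells tar to change to the parent directory first
tar -cvf /archive/destination/source.tar -C /data/parent source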
Example tar and archive to S3 Glacier workflow

The data to be archived is called "DatasetA", and it contains data from three groups of mice, "mice1", "mice2" and "mice3", each in its own directory inside DatasetA.

Example workflow
# Create a bucket to hold the archives (s3cmd mb creates only the bucket itself;
# DatasetA becomes a prefix within the bucket when the archives are uploaded below)
s3cmd mb s3://General_Archive_Bucket_Name

# Create a tar archive of mice1, mice2, and mice3
cd DatasetA
tar -cf mice1.tar mice1
tar -cf mice2.tar mice2
tar -cf mice3.tar mice3

# Verify that the tar archives are readable by listing their contents
for f in mice*.tar; do tar -tf "$f" > /dev/null; done

# Push the archives to the Glacier bucket
s3cmd put mice*.tar s3://General_Archive_Bucket_Name/DatasetA/

# Check that the archives were copied up to S3
s3cmd ls s3://General_Archive_Bucket_Name/DatasetA

# Delete the original data and local copies of the tar archives
rm -rf mice1
rm -rf mice2
rm -rf mice3
rm mice1.tar
rm mice2.tar
rm mice3.tar
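Retrieving data later is a two-step process: objects in Glacier Deep Archive must first be restored (made temporarily downloadable) before they can be fetched. A minimal sketch using s3cmd's restore support (available in recent versions); the bucket and object names follow the example above, and the number of days and the priority are illustrative values only:

# Request that an archived object be restored and kept available for 7 days
# (standard retrievals from Deep Archive typically complete within about 12 hours)
s3cmd restore --restore-days=7 --restore-priority=standard s3://General_Archive_Bucket_Name/DatasetA/mice1.tar

# Once the restore has completed, download the object as usual
s3cmd get s3://General_Archive_Bucket_Name/DatasetA/mice1.tar mice1.tar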
Additional reading on tar usage

https://www.linuxtechi.com/17-tar-command-examples-in-linux/

https://www.geeksforgeeks.org/tar-command-linux-examples/

https://linux.die.net/man/1/tar

Getting set up on AWS

Information is available on the Cloud Resources at Zuckerman page, and CU accounts must be set up here.