AWS Glacier Deep Archive Storage
The Amazon S3 Glacier storage classes are purpose-built for data archiving and provide durable, low-cost archive storage in the cloud. A number of tools listed on our Cloud Backup Options webpage facilitate migrating data to the cloud.
Consultation
If you'd like to discuss budgeting and options on how to migrate your data to cloud archive, we recommend contacting our team for assistance at rc@zi.columbia.edu.
Suggested use
Long-term archiving of data that is unlikely to be needed again but must be retained for compliance or other reasons. While the storage is inexpensive, retrieval fees can be costly ($20 per TB retrieved, plus per-object request fees).
AWS Glacier Deep Archive Costs
Current pricing can be viewed here: https://aws.amazon.com/s3/pricing/
Pricing for storage and retrieval, not including additional per-object request fees (last updated 3/16/2023):

| Fee | Price |
|---|---|
| Monthly storage cost per TB | $0.99 |
| Glacier Deep Archive retrieval fees per TB | $20.00 |
Preparation of data to be archived
Data transfer rates are much faster with a small number of large files than with a large number of small files, even when the same amount of data is being transferred. To get data restored as quickly as possible, it is best to make a small number of large tar, tar-gzip or zip archives. Of course, you must also weigh this against the time spent transferring unneeded data if an archive bundles more than you actually need to restore.
Additionally, there is a per-request cost, with each file counting as a separate "request". For datasets containing many millions of files, this can add up to a significant cost and must be considered.
There is no "one size fits all" approach here. Only you know your data, and only you can guess what might need to be restored, but using tar, tar-gzip or zip archives to organize the data to be archived can save a lot of money when done properly. For data that benefits from compression, compressing it also reduces storage and transfer fees. For data that does not compress well with gzip or zip (photo images are one example), compression may be a waste of time, and plain tar may be the best option. Putting in the time to research and/or test how compression affects your data can be well worth it; a quick test is sketched below. We stress that once an archive workflow has been designed, time spent automating it is very well spent, as doing such work manually could be prohibitively time-consuming.
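For example, one quick way to gauge whether compression is worthwhile is to compress a representative sample of your data and compare the sizes. The directory name sample_dir below is just a placeholder:

```bash
# "sample_dir" is a placeholder for a representative subset of your data.
tar -cvf sample.tar sample_dir
gzip -k sample.tar          # -k keeps sample.tar so both sizes can be compared
ls -lh sample.tar sample.tar.gz
```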
Example Lab Usage and Costs
Lab A has determined that they have 10 datasets that they feel will probably never be needed again, but which have to be retained for compliance purposes. Each dataset is 10TB and is made up of many directories, each with many small files.
We advised them to make tar or tgz archives, grouping together data that they think would need to be restored together. This balances the throughput issues of pushing such data to Glacier against the per-request charges, which can add up when transferring many small files. In this case, each 10TB dataset is packed into ten 1TB tar archives.
Lab A uses s3cmd on a Linux server with sufficient RAM and CPU to push 100TB of tar archives to Glacier Deep Archive storage. This is done from the CU network, over the 10Gb/s DirectConnect link to AWS us-east-1. Storing this data in Glacier Deep Archive costs them $100/month, and the transfer fees for the upload are free. There is a charge for requests during the upload (each file is an HTTPS request when uploaded or downloaded), but since the files are collected into tar archives this is insignificant ($0.05).
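A minimal sketch of what such an upload might look like with s3cmd. The bucket name labA-archive and the local paths are hypothetical, and this assumes an s3cmd version whose --storage-class option accepts DEEP_ARCHIVE:

```bash
# Hypothetical bucket and paths. A larger multipart chunk size is needed for
# files this big, since S3 allows at most 10,000 parts per multipart upload.
s3cmd put --storage-class=DEEP_ARCHIVE --multipart-chunk-size-mb=512 \
    DatasetA/*.tar s3://labA-archive/DatasetA/
```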
One year later, Lab A needs to retrieve one of the datasets, which means that ten 1TB tar files will need to be downloaded. Glacier Deep Archive retrieval is $20 per TB, so that would be $200 to retrieve the data. The standard retrieval request cost of $0.05 per 1,000 requests is minor in this case, but could add up for millions of files.
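A sketch of what that retrieval might look like with s3cmd. The bucket and paths are hypothetical; objects in Deep Archive must first be restored to a temporary readable copy before they can be downloaded, and a standard restore can take on the order of 12 hours:

```bash
# Ask S3 to restore the archived objects for 7 days (hypothetical bucket/prefix).
s3cmd restore --recursive --restore-days=7 --restore-priority=standard \
    s3://labA-archive/DatasetA/

# Once the restore has completed, download the tar files.
s3cmd get --recursive s3://labA-archive/DatasetA/ ./DatasetA-restored/
```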
Below we show how changing the way the data is grouped using tar affects the costs. For this comparison, we assume that without the use of tar the 100TB dataset would contain 100,000,000 files, and that the 10TB dataset being retrieved contains 10,000,000 files:
Storage Costs for 100TB of data at the 1-year point

| | Storage fees | API request upload costs for 100TB of data ($0.05 per 1,000) | Total |
|---|---|---|---|
| Data organized in 1TB tar archives (100 files) | $1200 | $0.05 | $1200.05 |
| Data organized in 1GB tar archives (100,000 files) | $1200 | $5.00 | $1205.00 |
| Data archived as-is (100,000,000 files) | $1200 | $5000 | $6200.00 |
Retrieval Costs for 10TB of data

| | API request retrieval costs for 10TB of data ($0.05 per 1,000) | Storage retrieval costs for 10TB of data ($20 per TB) | DirectConnect transfer fees | Total |
|---|---|---|---|---|
| Data organized in 1TB tar archives (10 files retrieved) | $0.05 | $200 | $100 | $300.05 |
| Data organized in 1GB tar archives (10,000 files retrieved) | $0.50 | $200 | $100 | $300.50 |
| Data archived as-is (10,000,000 files retrieved) | $500 | $200 | $100 | $800 |
If you have files in the hundreds of millions, we stress the need to review API request fees to accurately project your costs. Remember to count your directories as files when calculating your total number of files. Note that even deleting files counts as API requests, so in the example above with the data archived as-is, the act of deleting the files would cost another $5000 in API requests.
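Before deciding how to archive a dataset, a quick way to count its files and directories (and so estimate per-request charges) is sketched below; dataset_dir is a placeholder:

```bash
# Counts every file and directory under dataset_dir (a placeholder path).
find dataset_dir | wc -l
```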
Tools for working with Glacier
s3tools provides command line tools for transferring data to S3, Glacier and other S3-compatible storage services: s3cmd for Linux and macOS, and S3Express for Windows.
s4cmd is a drop-in replacement for s3cmd that is multi-threaded and can sometimes drastically outperform s3cmd, especially on servers with generous amounts of RAM and CPU.
s3browser is a Windows GUI tool for transferring data to S3, Glacier and other S3-compatible storage services.
Using PowerShell to interact with S3/Glacier
Compression/Archive tools
tar is "old faithful" as it has been around for a long time, and is a standard format that is widely used and stable.
gzip is another favorite which is often combined with tar. It has good compression and is widely available and supported.
pigz is a parallel implementation of gzip, and can in some cases be far faster; see the example at the end of the Linux command line examples section below.
bzip2 is an alternative to gzip, and can be used with tar.
7zip is yet another alternative to gzip.
Linux command line examples
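The examples below use a placeholder directory called mydata; adapt the file and directory names to your own data.

```bash
# Create an uncompressed tar archive of a directory:
tar -cvf mydata.tar mydata/

# Create a gzip-compressed tar archive of a directory:
tar -czvf mydata.tgz mydata/

# List (verify) the contents of an archive without extracting it:
tar -tvf mydata.tar
tar -tzvf mydata.tgz

# Extract an uncompressed tar archive:
tar -xvf mydata.tar

# Extract a gzip-compressed tar archive:
tar -xvzf mydata.tgz
```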
Tips/Tricks for usage of tar/tar-gzip
- Paths are included relative to where the command is run from (unless otherwise specified). The examples above assume the command is run from the directory containing both the source and the resulting tar archive. It is possible to write the resulting tar archive anywhere on the filesystem (or even on another host's filesystem, for example by piping the output through SSH), but it is usually best to run the command from the parent directory of the file or directory being archived, unless you have a specific reason not to and know exactly what you are doing.
- Naming and file extensions are up to you and are not enforced by tar. If you don't use compression but name your output file with a .tgz extension, extracting with -xvzf will give an error, because tar expects a gzip-compressed file and ignores the file extension. The same goes for the opposite (compressing but naming the output .tar instead of .tgz). These names are conventions that make the files easier to handle later, and we suggest you follow them to avoid confusion.
- Verbose mode (-v) is used in these examples to show what is happening. If you don't want the output, omit -v from the command. We suggest using verbose mode when practical, and always when verifying archives.
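As noted in the compression/archive tools list above, pigz can speed up compression considerably on multi-core machines. A sketch of how it can be combined with tar (mydata is again a placeholder; the -I option requires GNU tar):

```bash
# Compress with pigz via GNU tar's -I/--use-compress-program option:
tar -I pigz -cvf mydata.tgz mydata/

# Equivalent pipeline form, which also works with tar versions lacking -I:
tar -cvf - mydata/ | pigz > mydata.tgz
```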
Example tar and archive to S3 Glacier workflow
The data to be archived is called "DatasetA", and it contains data from three groups of mice, "mice1", "mice2" and "mice3", each in its own directory inside DatasetA.
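A sketch of what this workflow might look like. The path /data/DatasetA and the bucket name labA-archive are placeholders, and the upload assumes an s3cmd version that supports --storage-class=DEEP_ARCHIVE:

```bash
# Work from DatasetA itself, the parent directory of the mice directories
# (/data/DatasetA and the bucket name below are placeholders).
cd /data/DatasetA

# Create one gzip-compressed tar archive per group of mice:
tar -czvf mice1.tgz mice1/
tar -czvf mice2.tgz mice2/
tar -czvf mice3.tgz mice3/

# Verify the archives by listing their contents:
tar -tzvf mice1.tgz
tar -tzvf mice2.tgz
tar -tzvf mice3.tgz

# Upload the archives to S3 using the Deep Archive storage class:
s3cmd put --storage-class=DEEP_ARCHIVE mice1.tgz mice2.tgz mice3.tgz s3://labA-archive/DatasetA/
```

One archive per group of mice keeps each group independently restorable while still avoiding the per-file request charges of uploading the raw directories.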
Additional reading on tar usage
https://www.linuxtechi.com/17-tar-command-examples-in-linux/
https://www.geeksforgeeks.org/tar-command-linux-examples/
https://linux.die.net/man/1/tar
Getting set up on AWS
Information is on the Cloud Resources at Zuckerman page, and CU accounts must be set up here.