Engram: Technical and Cost Comparison to Cloud Services

The information below was accurate as of July 2020. The latest pricing for cloud offerings from AWS can be found here (for S3) and here (for EFS).

A frequent question is how Engram storage compares to cloud services such as Amazon Web Services (AWS), Google Cloud Storage, and Microsoft Azure Storage.  This page clarifies where Engram fits into a hybrid public/private cloud model by comparing Engram storage with AWS S3 and EFS on technical, operational, and cost/benefit grounds.  We use AWS as an exemplar for cloud computing more broadly because it is both the current market leader and the first cloud provider to have become publicly available.

Technical Comparison

The most glaring difference between Engram and Amazon's cloud offerings is geographical.  Engram is located in only one data center within the Jerome L. Greene Building.  Amazon's cloud storage, in contrast, can be located in one of many different regions (such as Germany, Ohio, and Northern Virginia).  This makes cloud storage more resilient to outages, which is why it often appeals to internet start-ups whose web services require high uptime.  While Engram is only located in one data center, it's important to note that its backups are stored at a second location away from Columbia. 

Beyond geography, there are also differences in the kinds of storage offered.  Engram is designed to support a wide range of demanding, large-scale file applications and workloads. 

Engram storage is accessible over a network, using a network filesystem such as NFS or SMB/CIFS.  The data stored is also automatically backed up to archival storage on a regular basis (see here).  Engram's method of providing access makes its tiers most comparable to Amazon's Elastic File System (EFS), which also provides access to files over a network. Engram's automatic backup to archival storage is similar to Amazon's S3 Glacier offering, which we will discuss later. 

Amazon's EFS has two major tiers:

  • EFS Standard is for everyday usage.  
  • EFS Infrequent Access is for files that are accessed less often.  It has a lower at-rest storage rate, but charges for each GB transferred when data is accessed (see Egress, Request and Retrieval Costs below).

The services provided by Engram and EFS both differ from block-level storage, which is provided as part of Cortex or Amazon EBS (see here for more details).  Block-level storage only allows one computer to access it at a time, while network filesystems allow many computers to access the same storage concurrently.

An alternative to network filesystems is object storage.  Object storage differs from traditional filesystems in that data in object stores are treated as abstract entities that can have metadata attributes, such as tags (allowing you to categorize files similar to how you might categorize Tweets or label emails in Gmail).  It can be laid over traditional filesystems as a useful abstraction for data management purposes.  It can best be thought of as occupying a space between a traditional unstructured hierarchical filesystem and a highly structured database.  AWS S3 is an example of object storage.

Engram does not currently provide an object storage abstraction, but it is still useful to compare it to object storage services such as S3.  The two aren't directly comparable, but the practical differences between them only apply if you are using features specific to object storage.  Features particular to S3 that are absent in a network filesystem include: the ability to add rich metadata to objects, the ability to automatically manipulate data using database-like transactions, and the ability to readily mirror your data across several geographical regions on a global scale.  Object storage also places data into buckets instead of folders.  Such buckets do not natively have folders or a hierarchy; instead, each data object is given a unique name (its key).  However, AWS S3 provides the illusion of a hierarchical filesystem through some additional logic.
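
As a minimal sketch of how that illusion is produced (assuming Python with the boto3 library; the bucket and key names are hypothetical placeholders), a delimiter-based listing groups flat object keys into pseudo-folders:

    import boto3  # AWS SDK for Python

    s3 = boto3.client("s3")

    # Keys are flat strings; "experiment1/" is part of the object's name,
    # not a real directory.  Bucket and key names here are hypothetical.
    s3.put_object(Bucket="my-lab-data", Key="experiment1/trial01.csv", Body=b"...")

    # Listing with "/" as a delimiter groups keys by shared prefix, which
    # is how tools like Cyberduck render the illusion of a folder hierarchy.
    response = s3.list_objects_v2(Bucket="my-lab-data", Delimiter="/")
    for prefix in response.get("CommonPrefixes", []):
        print(prefix["Prefix"])  # e.g. "experiment1/"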

AWS S3 has several major tiers:

  • S3 Standard is the general purpose tier.  It is most comparable to Engram's labshare or staging tiers.  Data stored in S3 Standard are distributed across 3 different availability zones (separate data centers that reside within the same geographical region, such as Germany, Ohio, or Northern Virginia).
  • S3 Infrequent Access is most comparable to Engram's locker tier.  Data stored here are also distributed across 3 availability zones.
  • S3 Infrequent Access (One-Zone) is also comparable to locker, but instead of being distributed across 3 availability zones / data centers, data are only stored in one data center.
  • Glacier is an archival tier meant for cold storage.  It can be used for files that you think you'll only need to access a few times a year.  Access times are not immediate; retrieving files takes several hours.  It is often used for long-term backup.  Its performance characteristics / speed are comparable to the tape backup for Engram volumes.
  • Glacier Deep Archive is a variant of Glacier that is more restrictive.  It is typically used for files that you may only need to access once or twice a year (if that).

A caveat of object storage in AWS is that it does not use the traditional Unix or Windows permission schemes.  Instead, access to a resource is granted or denied based on policy documents written in the JSON data exchange format.  These policy documents are generally substantially more complex than conventional access control mechanisms, and can provide very fine-grained access to various AWS resources beyond S3 (see here for more information).
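
To give a concrete sense of these policy documents, here is a minimal sketch of one (applied with Python's boto3 library; the account ID, user name, and bucket are hypothetical placeholders) that grants a single IAM user read-only access to a bucket.  Real policies are often considerably longer:

    import json
    import boto3

    # A minimal read-only bucket policy.  The principal, bucket, and
    # account ID below are hypothetical placeholders.
    policy = {
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Principal": {"AWS": "arn:aws:iam::123456789012:user/collaborator"},
            "Action": ["s3:GetObject", "s3:ListBucket"],
            "Resource": [
                "arn:aws:s3:::my-lab-data",    # ListBucket targets the bucket
                "arn:aws:s3:::my-lab-data/*",  # GetObject targets its objects
            ],
        }],
    }

    s3 = boto3.client("s3")
    s3.put_bucket_policy(Bucket="my-lab-data", Policy=json.dumps(policy))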

Operational Comparison

There are some differences in how a group would interact with public cloud storage vs on-premises storage like Engram.  To illustrate these differences, we'll go over a few common scenarios that you may encounter when managing your lab's data, and how you might handle them with a cloud-based workflow vs an on-premises workflow.  The cloud-based scenarios assume the creation of an AWS account (see here for more details on how to set an account up at Columbia).

Sending Raw Data to the Cloud vs Engram

In cloud workflows, typically an internet-enabled workstation or storage device (e.g., a Synology NAS) would receive data directly from the data acquisition instrumentation (e.g., an fMRI scanner, microscope, EEG net).  Depending on the warranty terms and capabilities of your instrumentation, this process may involve the manual transfer of data using external hard drives or thumb drives.  Other times, you may be able to use data copying utilities such as rsync or format-specific protocols such as the DICOM network protocol (for medical images).

Once the raw data is on the workstation or storage device, you would use a utility such as Cyberduck, s3cmd or the AWS CLI to upload the data to S3.  In some cases, storage devices (such as a Synology) might come with an AWS S3 utility built in.  These S3 utilities require that you set up an AWS user with appropriately scoped permissions and generate AWS access and secret key credentials for that user (see here).  To upload the data to EFS, you would mount your EFS volume on an AWS EC2 instance, which is comparable to a Cortex VM (see here for a comparison between Cortex and EC2).  You would then use a tool such as rsync to copy your data up to EFS with the EC2 instance as an intermediary.  The previously mentioned tools (s3cmd, rsync, etc.) can be configured to run at regular intervals automatically, as sketched below.
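
As a rough sketch of the S3 half of that workflow (assuming Python with boto3; the profile, bucket, and directory names are hypothetical placeholders), a script like the following could be run by hand or on a schedule via cron:

    import os
    import boto3

    # Credentials for the scoped AWS user would typically live in
    # ~/.aws/credentials under a named profile; this name is a placeholder.
    session = boto3.Session(profile_name="lab-uploader")
    s3 = session.client("s3")

    LOCAL_DIR = "/data/raw"   # hypothetical acquisition directory
    BUCKET = "my-lab-data"    # hypothetical bucket

    # Walk the local tree and mirror it into S3, using each file's
    # relative path as its object key.
    for root, _dirs, files in os.walk(LOCAL_DIR):
        for name in files:
            path = os.path.join(root, name)
            key = os.path.relpath(path, LOCAL_DIR)
            s3.upload_file(path, BUCKET, key)
            print(f"uploaded {path} -> s3://{BUCKET}/{key}")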

In an Engram-based workflow, you might still copy data from data acquisition instrumentation to an intermediate workstation or storage device.  However, in contrast to an AWS-based workflow, Engram can be mounted directly on a workstation or even connected directly to a storage device or instrumentation.  This means that there is a more direct path between your tools and data storage.  Engram also does not require that your instrumentation or workstation be connected to the public internet, an important consideration for sensitive data.

Interacting with Data Stored in the Cloud vs Engram

Workstations and Laptops

To interact with data in the cloud using a workstation or laptop, you would need to first download the data to your local PC.  This would incur additional request and egress costs (see Egress, Request and Retrieval Costs below) as data is streamed over the public internet to your PC.  For S3, you could navigate the directory tree for your data by using a tool like Cyberduck, or you could mount your storage using a filesystem interface like s3fs.  Note that s3fs, while giving the appearance of a typical external drive, still requires that any file you open be downloaded to a hidden local folder on your PC's hard disk before use, and that modified files be re-uploaded when they are closed.  Additionally, while s3fs can give object storage the appearance of network storage like Engram, S3 remains fundamentally object storage.  As a final caveat, s3fs also carries substantial performance penalties compared to using a network filesystem like Engram.

EFS is not designed to be mounted on local PCs outside of AWS, so it is not a practical option for workstation or laptop access.

In contrast, with Engram you would use the instructions at the following links to connect your Engram storage to a PC:

Your Engram storage would then be navigable just like an external hard drive, without the need to download your data to your PC.

Within the Cloud Itself

If you interact with your data in the cloud itself (e.g. using the aforementioned EC2 instances) you will not be charged data egress fees if your EC2 instances and storage are colocated in the same geographical region.  For S3 storage, you would still need to download your data to your EC2 instance's block-level EBS storage / locally-attached hard disk or use s3fs.  If your EC2 instance and S3 storage are colocated in the same data center, downloads can potentially be very fast.  EFS volumes, in contrast, would not require you to download data and could be attached to and navigated within EC2 instances in much the same way as Engram.

Data Sharing

Private Datasets

In an AWS-centric workflow, you might upload your dataset to a non-public S3 bucket.  You would then create an AWS user for your collaborator and apply a JSON policy to your bucket that would give access to your collaborator.  Your collaborator would then be able to download your data to either a local workstation (with egress costs) or an EC2 instance.  This workflow may make sense if you and your collaborator are both working in the public cloud, although there are less costly alternatives if both you and your collaborator can use locally available resources.
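
A minimal sketch of that user-creation step (again with boto3; the user name is a hypothetical placeholder) might look like the following; a bucket policy like the one sketched earlier would then name this user as its Principal:

    import boto3

    iam = boto3.client("iam")

    # Create the collaborator's IAM user and generate the access/secret
    # key pair they will use with tools like Cyberduck or the AWS CLI.
    iam.create_user(UserName="collaborator")
    keys = iam.create_access_key(UserName="collaborator")["AccessKey"]

    # Deliver these credentials to your collaborator over a secure channel.
    print("Access key ID:    ", keys["AccessKeyId"])
    print("Secret access key:", keys["SecretAccessKey"])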

Our team can connect your Engram storage to Globus, which you can then use to transfer data to another institution.  Globus is a university-provided service that provides very fast transfers between institutions using the GridFTP protocol and the Internet2 backbone (see CUIT's page on Globus here).  Globus can even collect data directly from instrumentation and make it available to colleagues at other institutions in near real-time.  We can assist you with setting up permissions and coordinating with IT groups at other universities to make data available to your colleagues.

University policy currently prohibits the use of Globus for datasets containing sensitive information (such as protected health information).

Public Datasets

The AWS-centric workflow for public datasets would not differ substantially from the workflow for private datasets.  The primary difference is that you would not need to create users or JSON policies in AWS.  Another major difference is that Amazon will potentially subsidize the cost of storing your dataset in S3 after an approval process, through its public dataset program. This public dataset program is used by the Human Connectome Project, OpenNeuro, HHMI Janelia's FlyLight project, and the Allen Brain Observatory.

Globus can also be used for public dataset sharing, and even integrates with S3 (see here).  If you are already sharing a private dataset via Globus with colleagues, you can readily send it to an S3 bucket for public consumption.

Cost Comparison

For this document, cloud rates will be quoted relative to Amazon's Northern Virginia data center, although they may vary slightly at other data centers.

Rates

The rates for AWS services and Engram are summarized below:

Service Name                           Storage Type    Rate per GB/month (USD)

Engram Staging                         Network         0.02035
Engram Labshare                        Network         0.00977
Engram Locker                          Network         0.00610
EFS Standard                           Network         0.30000
EFS Infrequent Access                  Network         0.02500
S3 Standard (First 50 TB / Month)      Object          0.02300
S3 Standard (Next 450 TB / Month)      Object          0.02200
S3 Standard (Over 500 TB / Month)      Object          0.02100
S3 Infrequent Access                   Object          0.01250
S3 Infrequent Access (One-Zone)        Object          0.01000
S3 Glacier                             Object          0.00400
S3 Glacier Deep Archive                Object          0.00099

The long-term cumulative costs of these rates can be visualized as follows for network filesystem storage:

[Graph: cumulative cost over time of network filesystem storage (Engram staging, labshare, and locker vs. EFS Standard and EFS Infrequent Access)]

For a comparison of AWS object storage cumulative costs and Engram, we present the following graph (which assumes the S3 Standard rate for the Next 450 TB tier):

[Graph: cumulative cost over time of Engram tiers vs. S3 object storage tiers]


As we can see in both cases, the costs for Engram storage are substantially lower in the long term, with the exception of files that are stored in S3's archival tiers (Glacier or Glacier Deep Archive).
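
As a rough worked example of these rates (10 TB held for three years, ignoring request and egress fees, and simplifying to 1 TB = 1,000 GB):

    # Monthly and 3-year at-rest cost of holding 10 TB, using the
    # per-GB/month rates from the table above.
    size_gb = 10 * 1000  # simplifying to 1 TB = 1,000 GB

    rates = {
        "Engram Locker": 0.00610,
        "Engram Labshare": 0.00977,
        "S3 Infrequent Access": 0.01250,
        "S3 Standard (first 50 TB)": 0.02300,
        "EFS Standard": 0.30000,
    }

    for name, rate in rates.items():
        monthly = size_gb * rate
        print(f"{name:28s} ${monthly:>9,.2f}/month   ${monthly * 36:>12,.2f} over 3 years")

For example, 10 TB on Engram's locker tier costs about $61 per month, versus about $3,000 per month on EFS Standard.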

Egress, Request and Retrieval Costs

In addition to more expensive rates for storage at rest, AWS S3 and EFS Infrequent Access also charge for the number of times you interact with your data.  EFS Infrequent Access charges by the number of GB transferred as a result of an action, while S3 charges for the number of actions that you perform.  These rates are as follows:

Service Name                           Unit of Billing                 Rate (USD)

EFS Infrequent Access                  per GB transferred              0.01
S3 Standard (First 50 TB / Month)      per 1000 read/write actions     0.005
S3 Standard (Next 450 TB / Month)      per 1000 read/write actions     0.005
S3 Standard (Over 500 TB / Month)      per 1000 read/write actions     0.005
S3 Infrequent Access                   per 1000 read/write actions     0.01
S3 Infrequent Access (One-Zone)        per 1000 read/write actions     0.01
S3 Glacier                             per 1000 read/write actions     0.05
S3 Glacier Deep Archive                per 1000 read/write actions     0.05
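
As a quick worked example of the per-action charges (rates from the table above):

    # Reading one million objects out of S3 Standard, billed at $0.005
    # per 1,000 read/write actions:
    requests = 1_000_000
    print(f"${requests / 1_000 * 0.005:,.2f}")  # prints $5.00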

There are also fees for the amount of data that is downloaded from S3 (egress fees).  These fees are incurred for downloads out to the public internet and are segmented into separate bands depending on the amount of data transferred over the course of a month:

Data Transferred per Month             Rate (USD per GB)

First 1 GB                             0.00
Next 9.999 TB                          0.09
Next 40 TB                             0.085
Next 100 TB                            0.07
Over 150 TB                            0.05

For small amounts of data, these charges will not amount to much, but they can be very costly for large datasets.
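
A short sketch of how these bands compound (band sizes and rates taken from the table above, simplifying to 1 TB = 1,000 GB):

    # Monthly S3 egress cost under the tiered bands above: the first
    # 1 GB is free, then each successive band is billed at its own rate.
    BANDS = [  # (band size in GB, rate in USD per GB)
        (1, 0.00),
        (9_999, 0.09),
        (40_000, 0.085),
        (100_000, 0.07),
        (float("inf"), 0.05),
    ]

    def egress_cost(gb: float) -> float:
        cost, remaining = 0.0, gb
        for size, rate in BANDS:
            in_band = min(remaining, size)
            cost += in_band * rate
            remaining -= in_band
            if remaining <= 0:
                break
        return cost

    # Downloading a 5 TB dataset to a local workstation in one month:
    print(f"${egress_cost(5_000):,.2f}")  # prints $449.91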

Additionally, Amazon's archival tiers (S3 Glacier and S3 Glacier Deep Archive) both incur fees to retrieve data from storage.  These costs are as follows:

Service Name                           Rate (USD per GB retrieved)

S3 Glacier                             0.01
S3 Glacier (faster retrieval)          0.03
S3 Glacier Deep Archive                0.02

Engram does not charge extra for read/write operations and does not incur charges for retrieving backups from archival storage.  Engram does not charge for any data egress over the VPN.

Advantages of Public Cloud Platforms

  • AWS EFS may make sense if you are making use of AWS EC2 instances (see here), and have a use case where multiple independent EC2 instances need to access the same storage simultaneously.  AWS S3 has similar advantages, although it is not typically mounted and interacted with as a filesystem would be.  Since EFS volumes and S3 buckets can be colocated within the same infrastructure / data centers as EC2 instances, they should have faster access times when used together (i.e., it wouldn't make as much sense to mix a Cortex VM with cloud storage).
  • EC2 instances also incur egress costs for data that is downloaded to non-AWS storage over the internet.  These egress costs are waived when you use AWS-native storage services such as S3 and EFS, so it may make sense to use these services if you are primarily performing computations using EC2.
  • AWS S3 is an ideal platform for sharing open access datasets.  Through its public dataset program, Amazon will potentially subsidize the cost of storing datasets in S3 after an approval process.  Google Cloud has a similar program described here.
  • For dataset sharing, AWS S3 gives you the ability to mirror your dataset across multiple geographical regions.  This means that colleagues in other countries, such as Germany, can download your dataset from a server that is physically closer and better connected to their campus infrastructure.  S3 is particularly popular among commercial website authors (such as Netflix), since it can be used as a way to store images and other web assets as close to their users as possible (see here).  Generally speaking, S3 should be used for such public sharing of files over the web, while EFS should be used for internal sharing.
  • AWS S3 can serve as a backing store for data sharing utilities like the Globus Toolkit, which can make sharing with other institutions easier.

Disadvantages of Public Cloud Platforms

  • There is better network connectivity between your lab's instrumentation and Engram.  The Jerome L. Greene Building is wired with a 10Gbit fiber network, so data arrives on Engram faster than it would over the public internet.  The closest AWS data centers are in Northern Virginia and Ohio, and traffic to them must route through numerous intermediaries; Engram is in the same building as your tools.
  • AWS storage does not come with backups baked in; you have to implement a backup strategy independently, and AWS backup storage incurs an additional cost above and beyond regular usage.  Engram is backed up regularly at no extra charge (see here).
  • Since Engram is actively managed by Zuckerman IT, there is a closer line of support for storage-related issues.
  • You can commit to a pre-allocated amount of storage for a one-time fee with Engram, which can be advantageous for budgeting and grant-writing purposes.  AWS S3 uses a variable-cost billing model, so charges can be inconsistent from month to month.

Summary

  • For general-purpose everyday use, Engram is faster and more cost-efficient than AWS.
  • AWS S3 has a clear advantage for public dataset sharing over Engram, and this is the application for which we recommend using it.  If this is something you'd like to set up, please reach out to rc@zi.columbia.edu so that we can configure S3 storage through our Amazon contacts.
  • Both AWS S3 and EFS may make sense if you are processing data that is already available in AWS (such as a public dataset) and you prioritize speed over cost.