Managing Files and Data

Cluster Disk Space

Disk space on the cluster is limited to the amount purchased by the group that granted you access. Your home directory on the cluster lives in a shared file system belonging to your group. While there is no quota on your home directory itself, it counts against your group's quota, so if you fill it up, no one else in your group can work. Ideally, only active data should live in cluster storage; completed research data should be moved to Engram or other permanent storage.

Places to save data include (where [UNI] is your UNI and [LABACCOUNTNAME] is the abbreviated name of your lab account as specified in the table on this page):

  • Your home directory: A directory unique to your user on the cluster.  This directory is not intended for sharing with other users on Axon.  It can be found at the following paths:
    • ~ (the tilde character is shorthand for your home directory in Unix-based operating systems like Linux)
    • /home/[UNI]
    • /share/[LABACCOUNTNAME]/users/[UNI]
  • Your group's shared directory: a directory where you can store data when working on a project with multiple people in your group (we are moving away from this; see the Engram CIFS shares below).  We suggest that you create a sub-directory within the shared directory with your project's name.  Your group's shared directory is located at the following path:
    • /share/[LABACCOUNTNAME]/projects
  • Engram CIFS shares: We are moving away from using shared directories at /share and towards using AutoFS to mount required Engram shares on an as-needed basis, both to simplify workflows and to improve performance. This is explained below.
    • /mnt/smb/[tier]/[LABACCOUNTNAME]-[tier]

Engram Mounts

In order to best handle file ownership in a CIFS environment, we have set up an automount system for Engram shares. Your login credentials are used to authenticate the Engram mount as your user. Once the mount has been idle for a fairly short time, it is automatically unmounted, so the suggested workflow is to mount the share, run your process, and then re-mount before the next stage where the share is needed. Mounting is done by using the "ls" command to request the contents of the directory in question:

ls /mnt/smb/<tier>/<labname>-<tier>

Replace <labname> and <tier> with the actual lab name and tier desired. In the case of my "lab" (ZRC) and tier (locker), the command would be:

ls /mnt/smb/locker/zrc-locker

In a bash script, this pattern would be used as follows:

# Mount the share by listing it, then run the first stage of work
ls /mnt/smb/locker/zrc-locker
/usr/local/bin/command /mnt/smb/locker/zrc-locker/work_being_done
# Re-mount (the share may have idled out during the previous stage) before the next stage
ls /mnt/smb/locker/zrc-locker
/usr/local/bin/command /mnt/smb/locker/zrc-locker/next_work_being_done
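
If you run several stages back-to-back, one way to keep the script readable is to wrap the re-mount step in a small shell function. This is only a sketch; the function name, share path, and work commands below are illustrative, not part of the automount system:

#!/bin/bash
# remount: trigger the AutoFS mount of a share by listing its contents
remount() {
    ls "$1" > /dev/null
}

SHARE=/mnt/smb/locker/zrc-locker   # substitute your own <tier>/<labname>-<tier>

remount "$SHARE"
/usr/local/bin/command "$SHARE/work_being_done"

remount "$SHARE"   # re-mount in case the share idled out during the previous stage
/usr/local/bin/command "$SHARE/next_work_being_done"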


Checking Your Home Directory's Disk Usage

Linux and other Unix-based operating systems do not continuously tabulate how much space directories take up due to performance and software complexity concerns.  This means that you will often need to manually run a command such as du to determine how much space a directory (such as your home directory) takes up.  Depending on how many files that directory has (and whether or not they've been recently accessed), du may take a particularly long time.
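
For example, to get a summary of your home directory's total size, or a per-subdirectory breakdown sorted by size, you could run something like the following (both may take a long time on directories with many files):

du -sh ~
du -h --max-depth=1 ~ | sort -h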

To work around this, Zuckerman Research Computing has created a script that runs daily on the Axon login node and computes the amount of disk space that every directory nested under your home directory takes up.  It then saves these disk usage records to the login node's hard disk, where you can access them at any point during the day.  These records only reflect the specific time at which the computation took place, but should give you a relatively good idea of which sub-directories take up the most space in your home directory.  Bear in mind, however, that your actual disk usage at any given time may differ substantially if files have been added or removed since the calculation took place.

To access these pre-computed home directory usage values, you can run the following command (only available on the login node):

(base) [jsp2205@axon ~]$ homeusage
Note: Results reflect disk usage as of: 2020-09-08 04:13:07.768825790 -0400

Note that the command will first print a timestamp indicating when the usage records were computed.  It will then place you in an interface that you can browse with the arrow keys.  For instructions on how to use this interface, type the ? character and a help screen will appear.  Typing q will quit the interface.

Note: The pre-computed disk usage results are only available to your individual researcher account; no other users are allowed to see them. If you need to share these results with someone else, email rc@zi.columbia.edu.

The homeusage script is only available on Axon's login node (axon.rc.zi.columbia.edu).

Checking the Top Disk Usage for Your Lab Account's Storage

In addition to the homeusage command, there is also a labusage command that will give you a pre-computed list of how much disk space each researcher within your lab account is using.  This list is sorted in descending order, so lab members that are using the most disk space will be at the top:

(base) [jsp2205@axon ~]$ labusage
Note: Results reflect disk usage as of: 2020-10-22 05:04:33.188372594 -0400
Showing disk usage for all members of account "zrc"
===================
UNI     Space Used
===================
jsp2205  1.6T
aa3301   213G
bs2679   53G
lh2332   528
ds3688   68

Note: The pre-computed lab account disk usage results are only available to you and other members of your lab account; no other users are allowed to see them. They only show the aggregate sum of disk space used and do not indicate which directories take up the most space under any given home directory.  For that you will need to run the homeusage command.  You may also need to run the du command against any directories in your group's shared projects directory, as shown below.
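
For example, to see how much space each sub-directory of your group's shared projects directory uses (substituting your lab account name), something like the following should work:

du -sh /share/[LABACCOUNTNAME]/projects/*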

The labusage script is only available on Axon's login node (axon.rc.zi.columbia.edu).

Transferring Data to the Cluster

To transfer files to one of the paths described above, use the command below to upload files via the cluster login node.  The example below uploads to the shared storage for the ctn lab account:

$ scp -r amygdala-stuff/ UNI@axon.rc.zi.columbia.edu:/share/ctn/projects

Running this command would copy the directory amygdala-stuff, and all of its contents, to the /share/ctn/projects directory on the cluster, creating a new sub-directory with that name (e.g. /share/ctn/projects/amygdala-stuff).

If you want to upload data to Axon and are off-campus without using the CUIT VPN, you will need to modify the above scp command to the following form:

$ scp -P 55 -r amygdala-stuff/ UNI@axon-remote.rc.zi.columbia.edu:/share/ctn/projects

The above form uses the axon-remote.rc.zi.columbia.edu domain, which can be accessed off-campus without the VPN, and includes the -P flag, which tells scp to use the port that is open for ssh and scp on axon-remote.rc.zi.columbia.edu.
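
The same command structure works in the other direction.  For instance, to copy a results directory from the cluster back to the current directory on your local machine (the path below is illustrative), you could run:

$ scp -P 55 -r UNI@axon-remote.rc.zi.columbia.edu:/share/ctn/projects/amygdala-stuff ./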

SSD Read Cache

Axon follows NVIDIA's best practices for networked storage by implementing a local read cache of files (see here).  Any files that live within your home directory or within your lab's storage are automatically copied to a node's local SSD whenever you access them.  As a result, if the dataset you are working with fits entirely on the node's local disk, you can achieve substantial performance increases for deep learning workloads: the first epoch "warms" the cache, and subsequent epochs do not need to load data over the network.

The SSD read cache cannot improve write performance over the network.  If your jobs are write-intensive, then we recommend using scratch storage (described below).

Local SSD Scratch Space

Datasets are automatically copied over the network to the SSD read cache described above whenever you access them.  For this reason, copying datasets to the local scratch space is not necessary; in fact, doing so causes the files to be copied twice (once to the read cache and once to the scratch space).

Each node in the cluster comes equipped with locally attached SSDs mounted under /local.  If you know that your job is write-intensive, it can make sense to write out intermediate results to these SSDs instead of writing to Engram.  The amount of space on each scratch disk can be found on Technical Specifications or can be displayed using the sfree command described here.  Note that these scratch disks are not backed up, and that they are automatically cleared out every 10 days by an automated process.
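
A minimal sketch of this pattern in a job script, assuming you can create a sub-directory under /local (the directory, command, and destination paths below are illustrative):

# create a per-job working directory on the node-local scratch SSD
SCRATCH=/local/$USER/my_job
mkdir -p "$SCRATCH"

# write intermediate output to the local SSD instead of over the network
/usr/local/bin/command --output "$SCRATCH/results.dat"

# copy the final results back to your lab's storage when the job is done
cp "$SCRATCH/results.dat" /share/[LABACCOUNTNAME]/projects/my_project/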

GitHub

The only node that can reach GitHub is the login node, so you'll have to do your git pulling and pushing from there.
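
For example, to clone a repository into your home directory from the login node (the repository URL below is illustrative, and assumes you have set up GitHub access for your account):

git clone https://github.com/your-org/your-repo.git ~/your-repo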