How to Manage your Data Usage

Understand how critical data management is

Managing data is an integral part of doing computational analysis work. Scientific applications generally use and create very large amounts of raw data. The graphs and tables that eventually show up in reports, theses and research publications represent this data in post-processed form. However, the raw data generated in the process may take up several terabytes. Once the raw data has been post-processed, it is seldom necessary to keep it, and it should be removed as soon as possible.

Data management is the responsibility of the Principal Investigator

Individual users, mostly post-graduate students, come and go. For this reason, it is the responsibility of the principal investigator to manage the data and storage usage of their research programme. This is a big task, and we recommend that PIs of large groups appoint a person responsible for managing the compute activities of the group. Where users need access to shared data, submit a helpdesk request for a shared group directory to be created.

Pre-process - Run - Post-process - Store results - Clean-up

This mantra is critical:

  1. Prepare your run: download and link driving data, generate grids, set up initial and boundary conditions, prepare run scripts
  2. Execute the run
  3. Inspect the raw results, then post-process the raw information immediately
  4. Store the processed results
  5. Remove the raw results and driving data, unless these will immediately be re-used
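As an illustration of this workflow, the steps above can be strung together in a single batch job script. The sketch below assumes a PBS job; the solver name, input data location and post-processing script are hypothetical placeholders, not CHPC-provided tools, and the PBS directives are only indicative:

#!/bin/bash
#PBS -l select=1:ncpus=24
#PBS -l walltime=12:00:00
#PBS -q normal

cd $PBS_O_WORKDIR

# 1. Prepare: link the (hypothetical) driving data into the run directory
ln -s /mnt/lustre/users/username/shared/driving_data input

# 2. Execute the run (hypothetical solver and case file)
./my_solver input/case.cfg > run.log

# 3. Post-process the raw output immediately (hypothetical script)
./postprocess.sh raw_output/ processed/

# 4. Store the processed results as a compressed archive
tar -cjvf results_$(date +%Y%m%d).tar.bz2 processed/

# 5. Clean up the raw results and the link to the driving data
rm -rf raw_output input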

Offload data with Globus

If you need to move a lot of data off the cluster to your own storage, consider using Globus.
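Globus transfers are normally set up through its web interface, but if the Globus command-line client happens to be available in your environment, a transfer can also be scripted. A minimal sketch, in which the endpoint UUIDs, paths and search term are placeholders rather than actual CHPC values:

# Authenticate with Globus (prints a login URL)
globus login

# Find the endpoint UUIDs of the cluster and of your own storage
globus endpoint search "CHPC"

# Transfer a results directory off the cluster (UUIDs and paths are placeholders)
globus transfer --recursive --label "Lengau results offload" \
    SOURCE_ENDPOINT_UUID:/mnt/lustre/users/username/results \
    DEST_ENDPOINT_UUID:/backups/results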

Request long-term storage from DIRISA

If you need large amounts of data to be archived or stored in an accessible form, request this from DIRISA, the data-intensive research branch of the National Integrated Cyberinfrastructure System (NICIS).

The CHPC's limited storage capacity

On the Lengau cluster, users have access to a total of 4 Petabytes of “scratch” storage space. The hardware underpinning this storage is a Lustre storage cluster, consisting of meta-data servers, object storage servers (OSSs), object storage targets (OSTs), a high-speed Infiniband network and several thousand spinning hard drives. To the user it looks like a single file system. This 4 PB scratch space is intended purely for short term storage. The system prioritises speed over reliability. The CHPC's mostly unread policy document states clearly that files older than 90 days may be removed at the CHPC's discretion.

Please be pro-active about managing your data before we do it for you, without first asking your permission. To get a list of files that have not been accessed in the last 90 days, use the find command:

find . -type f -atime +90 -printf "%h/%f, %s, %AD\n"

which will produce a CSV table with the path, size in bytes and date of last access. To make it even easier for yourself, simply delete these files automagically with find:

find . -type f -atime +90 -delete
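Before deleting anything, it may be worth checking how much space those old files actually occupy. A small sketch that sums their sizes with awk (this check is a suggestion, not part of the CHPC policy):

find . -type f -atime +90 -printf "%s\n" | awk '{total += $1} END {printf "%.1f GB\n", total/1e9}'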

Compressing Files

Some data is stored in text-format files, such as the common CSV file format. These take up more space than binary data files (like HDF5 or CDF). Many applications that read CSV files can also read them if they are compressed using the Linux standard gzip or bzip2 compression methods. The latter method is preferred as it gives better compression ratios, but may not be supported by all applications.

To compress all CSV files, with extension .csv, in a sub-directory, use the bzip2 command along with the find program:

find . -iname "*.csv" -exec bzip2 -z \{\} \;

This will compress all CSV files in place, replacing each .csv file with a compressed file of the same name and new extension .csv.bz2.

NOTE: By default, bzip2 deletes the input files during compression or decompression. To keep the original files, use the -k or --keep option.
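Compressed CSV files can still be inspected without decompressing them to disk, for example with bzcat (or zcat for gzip-compressed files). The file name and search term below are hypothetical:

# View the first few lines of a compressed CSV file
bzcat results.csv.bz2 | head

# Search inside it without extracting
bzcat results.csv.bz2 | grep "temperature"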

The tar command supports automatic use of gzip and bzip2 via the -z and -j options, respectively. For example, use

tar -cjvf archive1.tar.bz2 dir1

to create, and compress with bzip2 (the j option), a tar file called archive1.tar.bz2 which contains all files and sub-directories of the directory dir1.
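To check what is inside such an archive, or to unpack it again later, use the matching -t (list) and -x (extract) options:

tar -tjvf archive1.tar.bz2

tar -xjvf archive1.tar.bz2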

Building a very large compressed file archive can take a great deal of time. To accelerate this process, you can use parallel bzip2. This is available as a module:

module load chpc/compmech/pbzip2/1.1.13

Help on using it is available by typing in

pbzip2 -h 
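tar has no dedicated option for pbzip2, but it can be told to use it as the compression program, or the tar output can be piped through pbzip2. A sketch, assuming the module above has been loaded (the directory and archive names follow the earlier example):

# Let tar call pbzip2 as its compression program
tar --use-compress-program=pbzip2 -cvf archive1.tar.bz2 dir1

# Equivalent pipeline form, limiting pbzip2 to 8 processors
tar -cvf - dir1 | pbzip2 -p8 > archive1.tar.bz2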