High-performance computing is closely related to “big data”. Many HPC users both use and generate very large data sets. The CHPC's default recommendation to users is:
However, the CHPC recognizes that this workflow is not universal. Many research groups own or produce very large data sets that need to be stored safely for sharing or future re-use. This is where the Data Intensive Research Initiative of South Africa (DIRISA) comes into play. Like the CHPC, DIRISA is part of South Africa's National Integrated Cyber Infrastructure System (NICIS). DIRISA operates storage systems that are designed for long-term safe storage of research data.
Register for DIRISA's data storage services here.
Here is a video tutorial of the process.
DIRISA operates two object stores:
These storage systems can be accessed by way of DIRISA's web-based Data Deposit Tool. However, the Data Deposit Tool is not well suited to the requirements of a CHPC cluster user. It is intended for smaller quantities of data that are not already stored on NICIS resources.
The DIRISA object stores are directly accessible from the `dtn` node in the CHPC cluster. The good news is that this data transfer node mounts all the cluster file systems, which makes it possible to transfer data more efficiently to the DIRISA object stores. The bad news is that DIRISA's storage systems differ fundamentally from the cluster's storage:
All computer users are familiar with the directory and file structure of a POSIX file system. An object store has several compelling advantages for long-term data storage, but lacks the transaction-friendly structure of a file system. To facilitate the transfer of data to and from the object store, the iCommands tools were developed. Superficially these are simply the familiar Unix file-handling utilities, prefixed with the character `i`:
| Unix | iCommand | Meaning |
|---|---|---|
| cp | icp | copy |
| rm | irm | delete (remove) |
| ls | ils | list files |
| cd | icd | change directory |
| mkdir | imkdir | make directory |
| rmdir | irmdir | remove directory |
| | iput | copy file to object store |
| | iget | copy file from object store |
It is important to get away from thinking in terms of a directory structure. The data objects exist in a namespace, and this namespace can be addressed with the iCommands in such a way that it resembles a directory structure. Data objects in the same namespace can be stored in either the iRODS or the tape system. From the perspective of the user, the command `ils` will display data objects in the current namespace, whether they have been stored in iRODS or tape. For example:
    [jblogs@dtn:~/lustre]$ ipwd
    /dirisa.ac.za/home/jblogs
    [jblogs@dtn:~/lustre]$ imkdir myObjectCollection
    [jblogs@dtn:~/lustre]$ icd myObjectCollection
    [jblogs@dtn:~/lustre]$ iput LotsOfMyFiles.tgz LotsOfMyFiles_toiRODS.tgz
    [jblogs@dtn:~/lustre]$ iput -R LTA LotsOfMyFiles.tgz LotsOfMyFiles_toTape.tgz
    [jblogs@dtn:~/lustre]$ ils
    /dirisa.ac.za/home/jblogs/myObjectCollection:
      LotsOfMyFiles_toiRODS.tgz
      LotsOfMyFiles_toTape.tgz
    [jblogs@dtn:~/lustre]$ ils -l
    /dirisa.ac.za/home/jblogs/myObjectCollection:
      jblogs 0 dirisa_root;dirisa_replication;CapeTown;irods-resccpt02Resource 2492058420 2023-07-14.18:26 & LotsOfMyFiles_toiRODS.tgz
      jblogs 0 LTA 2492058420 2023-07-14.18:26 & LotsOfMyFiles_toTape.tgz
In the above example, a new collection called “myObjectCollection” was created and we moved into that namespace with the `icd` command. Two `iput` commands were then executed. In the first, a file was transferred to the default (iRODS) storage system. The second `iput` command had the parameter `-R LTA`, which means “use the resource long term archive”. When we check the contents of the collection with `ils` we see that both data objects are listed. However, `ils -l` shows that the two objects in the same collection are stored in two totally different storage systems.
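Retrieving a data object works the same way in reverse, using `iget`. The following is a minimal sketch rather than an official CHPC or DIRISA procedure. It assumes the collection and object names from the example above. `fetch_object` is a hypothetical helper name, and the `-K` option asks `iget` to verify the checksum of the transferred data.

```shell
# Sketch: fetch one data object from a DIRISA collection into a local directory.
# fetch_object is a hypothetical helper, not part of the iCommands.
fetch_object() {
    local collection="$1"   # e.g. /dirisa.ac.za/home/jblogs/myObjectCollection
    local object="$2"       # e.g. LotsOfMyFiles_toiRODS.tgz
    local dest_dir="$3"     # existing local directory, e.g. on Lustre

    # Show where the object is stored (iRODS or tape) before fetching it.
    ils -l "$collection/$object" || return 1

    # -K verifies the checksum of the transferred data.
    iget -K "$collection/$object" "$dest_dir/"
}

# Usage (run on the dtn node):
# fetch_object /dirisa.ac.za/home/jblogs/myObjectCollection LotsOfMyFiles_toiRODS.tgz ~/lustre
```

Listing the object first makes it obvious whether the data will come from iRODS or from tape before the transfer starts.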
If you don't know what caveat means, go look it up.
Managing large quantities of data is a specialized, mission-critical and difficult task that cannot be undertaken casually. It is expensive to create a lot of data but very easy to muck it up. Each research group must appoint an experienced full-time researcher to take custody of its valuable data. This is not a job for amateurs. If you are intimidated by a raw command line, managing big data is not for you.
It is cripplingly slow and inefficient to transfer lots of small files. Roll small files up into a tarball and transfer that.
tar czf LotsOfMyFiles.tgz myDirectoryContainingLotsOfTinyFiles
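Before transferring a tarball, it can be worth confirming that the archive is readable and recording a checksum for later verification. The following is a minimal sketch, assuming GNU `tar` and `sha256sum` are available on the node. `pack_for_transfer` is a hypothetical helper name, not a CHPC tool.

```shell
# Sketch: pack a directory of small files into one tarball and record a checksum.
# pack_for_transfer is a hypothetical helper name.
pack_for_transfer() {
    local src_dir="$1"   # directory containing many small files
    local tarball="$2"   # output file, e.g. LotsOfMyFiles.tgz

    # Create a gzip-compressed tarball of the directory.
    tar czf "$tarball" "$src_dir" || return 1

    # Read the archive end to end to check that it is intact.
    tar tzf "$tarball" > /dev/null || return 1

    # Store a SHA-256 checksum next to the tarball.
    sha256sum "$tarball" > "$tarball.sha256"
}

# Usage:
# pack_for_transfer myDirectoryContainingLotsOfTinyFiles LotsOfMyFiles.tgz
# sha256sum -c LotsOfMyFiles.tgz.sha256    # re-check after retrieving the file
```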
The DIRISA object stores are designed for storage, not use. At best you can use the Data Deposit Tool to edit metadata and view the first few lines of files. Avoid uploading or downloading data stored in iRODS directly within a PBS jobscript. You might be wondering, “What about automating my workflow?”
It is generally recommended to perform data transfers separately from jobscripts, either before or after job execution, using dedicated tools and scripts. This approach ensures better control over data transfer processes, reduces the impact on job performance, and enhances jobscript portability and reliability.
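One way to keep transfers out of the jobscript is a small stage-out step that is run by hand on the `dtn` node once the job has finished. The following is only a sketch under stated assumptions: `stage_out` is a hypothetical helper, the paths are placeholders, and the target collection is assumed to exist already. The `-K` option asks `iput` to verify the checksum of the uploaded data.

```shell
# Sketch: archive a finished job's results to DIRISA from the dtn node,
# as a separate step after the PBS job has completed.
# stage_out is a hypothetical helper; names and paths are placeholders.
stage_out() {
    local results_dir="$1"   # directory written by the finished job
    local collection="$2"    # existing DIRISA collection
    local name tarball
    name="$(basename "$results_dir")"
    tarball="$name.tgz"

    # Roll the results up into a single tarball in the current directory.
    tar czf "$tarball" -C "$(dirname "$results_dir")" "$name" || return 1

    # Upload with checksum verification (-K).
    iput -K "$tarball" "$collection/$tarball" || return 1

    # Confirm that the object is now listed in the collection.
    ils -l "$collection/$tarball"
}

# Usage (run on the dtn node, after the job has finished):
# stage_out ~/lustre/myJob/results /dirisa.ac.za/home/jblogs/myObjectCollection
```

The sketch deliberately deletes nothing. Removing the local copy is left as a separate, manual decision.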
Read the documentation for the iCommands to ensure that you know what these commands and their parameters do.
You are only an i away from permanently deleting valuable data. Proceed with due care.
`irm -r myDataDirectory` will delete your data collection at DIRISA.
`rm -r myDataDirectory` will irretrievably delete your data on Lustre.
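One possible safeguard is a wrapper that lists a collection and asks for explicit confirmation before removing it. This is a minimal sketch, not a DIRISA-provided tool. `confirm_irm` is a hypothetical name, and the wrapper simply calls `ils -r` and `irm -r` after you type the collection name.

```shell
# Sketch: ask for explicit confirmation before removing a DIRISA collection.
# confirm_irm is a hypothetical wrapper, not part of the iCommands.
confirm_irm() {
    local collection="$1"
    local answer

    # Show exactly what would be removed.
    ils -r "$collection" || return 1

    printf 'Type the collection name to delete it from DIRISA: '
    read -r answer
    if [ "$answer" = "$collection" ]; then
        irm -r "$collection"
    else
        echo "Names do not match; nothing was deleted."
        return 1
    fi
}

# Usage:
# confirm_irm myDataDirectory
```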
To move really large data files, consider using terminal multiplexers like screen or tmux. These tools allow you to detach from a session, leaving your tasks (such as large data transfers) running in the background. Even if you close the terminal window or your local machine is switched off, the tasks will continue to run. When you're ready, you can open a new terminal window and reattach to the still-running session. This ensures that your tasks continue uninterrupted even when you sign out of LENGAU.
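As an illustration, a transfer can be started inside a detached `tmux` session. The following is a minimal sketch, assuming `tmux` is installed on the node you are using. `start_transfer_session` is a hypothetical helper and the session name is arbitrary.

```shell
# Sketch: run a long transfer inside a detached tmux session.
# start_transfer_session is a hypothetical helper name.
start_transfer_session() {
    local session="$1"   # e.g. transfer
    local command="$2"   # e.g. "iput -K LotsOfMyFiles.tgz"

    # Refuse to start a second session with the same name.
    if tmux has-session -t "$session" 2>/dev/null; then
        echo "Session '$session' already exists; attach with: tmux attach -t $session"
        return 1
    fi

    # -d starts the session detached, so it keeps running after you log out.
    tmux new-session -d -s "$session" "$command"
    echo "Started; reattach later with: tmux attach -t $session"
}

# Usage:
# start_transfer_session transfer "iput -K LotsOfMyFiles.tgz > transfer.log 2>&1"
```

A session started this way ends when the command finishes, so redirecting output to a log file, as in the usage line, preserves the result for later inspection.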