When determining your data management strategy for your workflow, considering a range of backup options for your data beyond just a single copy on your workstation or your external hard drive is paramount. Creating a seamless workspace that will easily transition between workstations and while maintaining durability and availability is easily achievable once you know what resources might be available and a general guideline.
General considerations for how you will be managing and sharing data is crucial, especially for collaborative projects when files must often be accessible in real time.
Considering how long you might need to retain data and how often you might need to access it will drastically change your approach to your storage strategy.
3-2-1 Data Backup Rule
If you walk away form this with nothing else, remember the 3-2-1 rule. The key to ensuring durability of your data—preventing loss due to hardware or software malfunction, fire, viruses, and institutional changes or uproars—is following the 3-2-1 Rule. Maintaining three or more copies on two or more different mediums (i.e. cloud and HDD) with at least one off-site copy.
An example of this would be to have a primary copy of your data on your desktop that is backed up continuously via Dropbox and nightly via an external hard drive. There are three copies of your data between your local workstation, external hard drive (HD), and Dropbox. By having your media saved on hard drive disks (HDDs) on your workstation and external HD in addition to ‘the cloud’ (Dropbox), you have accomplished spreading your data across exactly two different mediums. Lastly, since cloud storage is located on external servers connected via the internet, you have successfully maintained at least one off-site copy. Additionally, with a second external HD, you could create weekly/monthly/yearly backups and store this HD offsite.
Version Control Versus Data Backup
Maintaining a robust version control protocol does not ensure your data will be properly backed up and vice versa. Notably, you should not be relying on services such as GitHub to back up your data, only your code (and possibly very small datasets, i.e. <50 MB). However, you should still maintain an effective strategy for version control.
- Code Version Control
- Should use Git, it is the standard
- GitHub is a great hosting service for small files (e.g. code, limited data inputs, limited data outputs), with a limit of 100 MB (but a practical limit of 50 MB)
- Git Guides:
- Large File Version Control
- GitHub is not the place to be storing and sharing large datasets, only the code to produce large datasets
- Git Large File Storage (LFS) can be used for a Git-based version-control on large files
Data Storage: Compression
Compressing data reduces the amount of storage required (thereby reducing cost), but ensuring the data’s integrity is an extremely complex topic that is continuously changing. While standard compression techniques (e.g. .ZIP and HDF5) are generally effective at compression without issues, accessing such files requires additional steps before having the data in a usable format (i.e. decompressing the files is required). It is common practice (and often a common courtesy) to compress files prior to sharing them, especially when emailed.
7-Zip is a great open-source tool for standard compression file types (.ZIP, .RAR) and has its own compression file type. Additionally, a couple of guides looking into using HDF5/zlib for NetCFD files are located here and here.
Creating Your Storage Strategy
To comply with the 3-2-1 strategy, you must actively choose where you wish to back up your files. In addition to pushing your code to GitHub, choosing how to best push your files to be backed up is necessary. However, you must consider any requirements you might have for your data handling:
My personal strategy costs approximately $120 per year. For my workstation on campus, I primarily utilize DropBox with a now-outdated version control history plugin that allows for me to access files one year after deletion. Additionally, I instantaneously sync these files to GoogleDrive (guide to syncing). Beyond these cloud services, I utilize an external HDD that backs up select directories nightly (refer below to my script that works with Windows 7).
It should be noted that Cornell could discontinue its contracts with Google so that unlimited storage on Google Drive is no longer available. Additionally, it is likely that Cornell students will lose access to Google Drive and Cornell Box upon graduation, rendering these options impractical for long-term or permanent storage.
- Minimal Cost (Cornell Students)
- Cornell Box
- Google Drive
- Local Storage
- TheCube
- Accessibility and Sharing
- DropBox
- Google Drive
- Cornell Box (for sharing within Cornell, horrid for external sharing)
- Minimal Local Computer Storage Availability
Access Via Web Interface (Cloud Storage) or File Explorer
- DropBox
- Google Drive (using Google Stream)
- Cornell Box
- TheCube
- External HDD
- Reliable (accessibility through time)
- Local Storage (especially an external HDD if you will be relocating)
- Dropbox
- TheCube
- Always Locally Accessible
- Local Storage (notably where you will be utilizing the data, e.g. keep data on TheCube if you plan to utilize it there)
- DropBox (with all files saved locally)
- Cornell Box (with all files saved locally)
- Large Capacity (~2 TB total)
- Use Cornell Box or Google Drive
- Extremely Large Capacity (or unlimited file size)
Storage Option Details and Tradeoffs
Working with large datasets can be challenging to do between workstations, changing the problem from simply incorporating the files directly within your workflow to interacting with the files from afar (e.g. keeping and utilizing files on TheCube).
But on a personal computer level, the most significant differentiator between storage types is whether you can (almost) instantaneously update and access files across computers (cloud-based storage with desktop file access) or if manual/automated backups occur. I personally like to have a majority of my files easily accessible, so I utilize Dropbox and Google Drive to constantly update between computers. I also back up all of my files from my workstation to an external hard drive just to maintain an extra layer of data protection in case something goes awry.
- Requirements for Data Storage
- Local Storage: The Tried and True
- Internal HDD
- Installed on your desktop or laptop
- Can most readily access data for constant use, making interactions with files the fastest
- Likely the most at-risk version due to potential exposure to viruses in addition to nearly-constant uptime (and bumps for laptops)
- Note that Solid State Drives (SSDs) do not have the same lifespan for the number of read/write as an HDD, leading to slowdowns or even failures if improperly managed. However, newer SSDs are less prone to these issues due to a combination of firmware and hardware advances.
- A separate data drive (a secondary HDD that stores data and not the primary operating system) is useful for expanding easily-accessible space. However, it is not nearly as isolated as data contained within a user’s account on a computer and must be properly configured to ensure privacy of files
- External Hard Drive Disk (HDD)
- One-time cost ($50-200), depending on portability/size/speed
- Can allow for off-line version of data to be stored, avoiding newly introduced viruses from preventing access or corrupting older versions (e.g. ransomware)—requires isolation from your workflow
- May back up data instantaneously or as often as desired: general practice is to back up nightly or weekly
- Software provided with external hard drives is generally less effective than self-generated scripts (e.g. Robocopy in Windows)
- Unless properly encrypted, can be easily accessed by anyone with physical access
- May be used without internet access, only requiring physical access
- High quality (and priced) HDDs generally increase capacity and/or write/read speeds
- Alternative Media Storage
- Flash Thumb Drive
- Don’t use these for data storage, only temporary transfer of files (e.g. for a presentation)
- Likely to be lost
- Likely to malfunction/break
- Outdated Methods
- DVD/Blu-Ray
- Floppy Disks
- Magnetic Tapes
- M-Discs
- Required a Blu-Ray or DVD reader/writer
- Supposedly lasts multiple lifetimes
- 375 GB for $67.50
- Dropbox
- My experience is that Dropbox is the easiest cloud-storage solution to use
- Free Version includes 2 GB of space without bells and whistles
- 1 TB storage for $99.00/year
- Maximum file size of 20 GB
- Effective (and standard) for filesharing
- 30-day version history (extended version history for one year can be purchased for an additional $39.00/year)
- Professional, larger plans with additional features (e.g. collaborative document management) also available
- Can easily create collaborative folders, but storage counts against all individuals added (an issue if individuals are sharing large datasets)
- Can interface with both a web interface and across as operating system desktops
- Fast upload/download speeds
- Previous version control can allow access to previous versions if ransomware becomes an issue
- Supports two-factor authentication
- Requires internet access for online storage/backup, but has offline access
- Google Drive
- My experience is that Google Drive is relatively straight forward
- Unlimited data/email storage for Cornell students, staff, and faculty
- Costs $9.99/mo for 1 TB
- Maximum file size of 5 GB
- Easy access to G Suite, which allows for real-time collaboration on browser-based documents
- Likely to lose access to storage capabilities upon graduation
- Google Drive is migrating over to Google Stream which stores less commonly used files online as opposed to on your hard drive
- Google File Stream (used to sync files with desktop) requires a constant internet connection except for recently-used files
- Previous version control can allow access to previous versions if ransomware becomes an issue
- Supports two-factor authentication
- Requires internet access for online storage/backup
- Cornell Box
- My experiences are that Cornell Box is not easy to use relative to other options
- Unlimited storage space, 15 GB file-size limit
- Free for Cornell students, staff, and faculty, but alumni lose access once graduating
- Can only be used for university-related activities (e.g. classwork, research)
- Sharable links for internal Cornell users; however, it is very intrusive to access files for external users (requires making an account)
- Version history retains the 100 most recent versions for each file
- Can connect with Google Docs
- Previous version control can allow access to previous versions if ransomware becomes an issue
- Supports two-factor authentication
- Requires internet access for online storage/backup, but has offline access
- TheCube
- Contains 1 TB of storage for each user
- Includes all space being used by the user within the Linux OS
- Cannot easily be share data between users
- Requires software (e.g. MobaXterm) or linux with Secure Shell (SSH) interface to interact with files between your computer and the cluster (transfer speeds limited to bandwidth contraints)
- Globus (high-speed transfers of data) is recommended for larger transfers.
Long-Term (5+ Years) Data Storage
It should be noted that most local media types degrade through time. Utilizing the 3-2-1 strategy is most important for long-term storage (with an emphasis on multiple media types and off-site storage). Notably, even if stored offline and never used, external HDDs, CDs, and Blu-Ray disks can only be expected to last at most around five years. Other strategies, such as magnetic tapes (10 years) or floppy disks (10-20 year), may last longer, there is no truly permanent storage strategy (source of lifespans).
M-Discs are a write-once (i.e. read only, cannot be modified) storage strategy that is projected to last many lifetimes and up to 1,000 years. If you’re able to dust off an old Blu-Ray disk reader/writer, M-Discs are likely the best long-term data strategy that is likely to survive the test of time—making two copies stored in two locations is definitely worthwhile. However, the biggest drawback is that M-Discs are relatively difficult to access compared to plugging in an external HD.
Because of the range of lifespans and how cheap storage has become, I would recommend maintaining your old (and likely relatively small) data archives within your regular storage strategy which is likely to migrate between services through time.
For larger datasets that you are required to retain and would like to easily access, I would maintain them on at least two offline external hard drive stored in separate locations (e.g. at home and your office) while occasionally (i.e. every six months) checking the health of the hard drives in perpetuity and replacing them as required.
Relying only on cloud storage for long-term storage is not recommended due to the possibility of companies closing their doors or simply deactivating your account. However, they can be used as an additional layer of protection in addition to having physical copies (i.e. external HD, M-Discs).
Windows 7 Robocopy Code
The script I use for backing up specific directories from my workstation (Windows 7) to an external HD is shown below. To set up my hard drive, I first formatted it to a format compatible with multiple operating systems using this guide. Note that your maximum file size and operating system requirements require different formats. Following this, I used the following guide to implement a nightly backup of all of my data while keeping a log on my C: drive. Note that I have only new files and versions of files copied over, ensuring that the back up does not take ages.
@echo off
robocopy C:\Users\pqs4\Desktop F:\Backups\Desktop /E /XA:H /W:0 /R:3 /REG > C:\externalbackup.log
robocopy E:\Dropbox F:\Backups\Dropbox /E /XA:SH /W:0 /R:3 /REG /XJ >> C:\externalbackup.log
robocopy C:\Users\pqs4\Downloads F:\Backups\Downloads /E /XA:SH /W:0 /R:3 /REG /XJ >> C:\externalbackup.log
robocopy C:\Users\pqs4\Documents F:\Backups\Documents /E /XA:SH /W:0 /R:3 /REG /XJ >> C:\externalbackup.log
robocopy C:\Users\pqs4 F:\Backups\UserProfile /E /XA:SH /W:0 /R:3 /REG /XJ >> C:\externalbackup.log
robocopy E:\Program Files\Zotero F:\Backups\Zotero /E /XA:SH /W:0 /R:3 /REG /XJ >> C:\externalbackup.log