
Segment Those Files

Segmenting large files during backup reduces network load.

Business Solutions, May 15, 2001

As part of disaster recovery planning, network systems administrators commonly implement data backup or traditional HSM (hierarchical storage management) technology to protect their companies from data loss. Although small files do not present a problem in these setups, large files can easily overload the network, and writing entire files to a single medium, such as tape or magneto-optical disk, results in slow access times.

Most data protection schemes are set up so that every time a file is updated - no matter how small or large the change - the entire file is resaved to the storage media. This approach ignores the inefficiency of rewriting an entire file when only a portion of it has actually changed.

In video editing, for example, producers often edit small segments of a file and leave the rest of the production as is. In this case, there is no need to resave the entire file; doing so ties up network resources for no benefit.

File segmentation, or the ability to back up only the parts of a large file that have changed, avoids unnecessary data backup and can shave backup and restore times from minutes down to seconds. With a proper file segmentation function, the data storage software recognizes which portions of a file have changed and updates only those portions, leaving the rest of the file untouched.
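A minimal sketch of how segment-level change detection might work, assuming fixed-size segments and a per-segment checksum catalog (the segment size, hash choice, and function names are illustrative, not taken from any particular product):

    import hashlib

    SEGMENT_SIZE = 64 * 1024 * 1024  # 64 MB per segment; an illustrative choice

    def segment_digests(path):
        """Return one checksum per fixed-size segment of the file."""
        digests = []
        with open(path, "rb") as f:
            while True:
                chunk = f.read(SEGMENT_SIZE)
                if not chunk:
                    break
                digests.append(hashlib.sha1(chunk).hexdigest())
        return digests

    def changed_segments(path, previous_digests):
        """Compare current checksums against the last backup's catalog and
        return the indexes of segments that must be re-archived."""
        current = segment_digests(path)
        return [i for i, d in enumerate(current)
                if i >= len(previous_digests) or d != previous_digests[i]]

Only the segments whose checksums differ need to travel across the network to the storage media; everything else stays where it is.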

RAIT Still Requires Full Restore
To appreciate the strain that large, unsegmented files place on a network, it helps to review how administrators have tried to work around the limitations. Many implement RAIT (redundant array of independent tape drives), which allows a single large file to be striped (fragmented) across multiple, simultaneously accessed tape drives. However, to change or add even a single byte of a striped file, the entire file must still be restored, and that requires allocating the same number of tape drives as were used to write the file to tape. If all of the drives or tapes are not available simultaneously, the file cannot be accessed at all.
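The constraint is easier to see in a sketch of round-robin striping (the layout and function names here are hypothetical). Every drive holds an interleaved share of the file, so reassembling any part of it requires every stripe to be mounted at once:

    def stripe(blocks, num_drives):
        """Distribute file blocks round-robin across tape drives.
        Returns one list of blocks per drive."""
        stripes = [[] for _ in range(num_drives)]
        for i, block in enumerate(blocks):
            stripes[i % num_drives].append(block)
        return stripes

    def restore(stripes):
        """Reassemble the original block order. Every stripe (every drive
        and tape) must be available simultaneously; one missing stripe
        makes the whole file unreadable."""
        num_drives = len(stripes)
        total = sum(len(s) for s in stripes)
        return [stripes[i % num_drives][i // num_drives] for i in range(total)]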

Another limitation of backup and traditional HSM systems is that users must usually wait until an entire file has been restored to the file system before any of it can be read. This significantly hurts the overall access performance of the network and file system. For example, a user would wait an average of 10 minutes to get the first byte of data from a two-gigabyte file using a DLT 7000 tape drive; larger files or slower drives produce substantially longer access times.
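The arithmetic behind that figure is easy to sketch. Assuming a sustained DLT 7000 throughput of roughly 3.5 MB per second (the drive's rated native speed is 5 MB per second, but sustained rates are typically lower), restoring the whole file before the first byte can be read takes about ten minutes:

    file_size_mb = 2 * 1024   # a two-gigabyte file, expressed in megabytes
    rate_mb_per_sec = 3.5     # assumed sustained DLT 7000 throughput
    wait_minutes = file_size_mb / rate_mb_per_sec / 60
    print(round(wait_minutes, 1))  # ~9.8 minutes before the first byte is readable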

When a file resides in a UNIX file system, however, only the affected disk blocks need to be accessed or modified. File segmentation software builds on this underlying capability to work around traditional HSM limitations. For example, some software manages large files by automatically dividing any file that exceeds a predetermined size into segments, allowing each segment to be managed as if it were an individual file.

In this setup, network users still see the file as one very large file through standard UNIX-style commands. When only a segment of the file changes, the program re-archives (recopies) just that segment rather than the entire file. The program also supports the release of individual segments (freeing up online disk space), the archiving of individual segments (making multiple nearline copies), and the staging of selected segments (bringing them back from nearline media). Together these functions can eliminate major data bottlenecks and dramatically improve read and write performance when accessing the entire file.
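One way to picture these per-segment operations is as a small state machine over each segment's residency. The states and method names below are hypothetical, not any product's actual interface:

    ONLINE, ARCHIVED, RELEASED = "online", "archived", "released"

    class Segment:
        def __init__(self, index):
            self.index = index
            self.state = ONLINE        # newly written data starts on disk

        def archive(self):
            """Copy this segment to nearline media; the disk copy remains."""
            self.state = ARCHIVED

        def release(self):
            """Free the online disk space; only the nearline copy remains."""
            assert self.state == ARCHIVED, "must archive before releasing"
            self.state = RELEASED

        def stage(self):
            """Bring the segment back from nearline media to disk."""
            if self.state == RELEASED:
                self.state = ARCHIVED  # disk and nearline copies valid again

Because each segment moves through these states independently, a rarely touched portion of a huge file can sit on tape while the active portion stays on disk.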

Segmenting Speeds Access
This type of file segmentation is completely transparent to the user and the application. When an application opens a file and seeks a position within it, the file system transparently determines which segments contain the desired data. The program identifies which of the requested data already resides on disk (rather than on secondary tape storage) and automatically stages the remaining segments back to disk for fast access. Coupled with read-ahead during file staging, the user can potentially access the "first requested byte" within seconds, regardless of file size or where the data lies within the file.
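A sketch of the offset-to-segment mapping that makes this transparency possible; the segment size, read-ahead depth, and helper names are assumptions for illustration:

    SEGMENT_SIZE = 64 * 1024 * 1024   # illustrative segment size
    READ_AHEAD = 2                    # stage this many segments past the request

    def segments_for_read(offset, length):
        """Map an application's byte range onto segment indexes."""
        first = offset // SEGMENT_SIZE
        last = (offset + length - 1) // SEGMENT_SIZE
        return list(range(first, last + 1))

    def handle_read(offset, length, resident, stage):
        """Serve a read: stage any non-resident segments the request touches,
        then read ahead so sequential access keeps finding data on disk."""
        needed = segments_for_read(offset, length)
        ahead = range(needed[-1] + 1, needed[-1] + 1 + READ_AHEAD)
        for seg in needed + list(ahead):
            if seg not in resident:
                stage(seg)            # bring the segment back from nearline media
                resident.add(seg)

The application simply sees its read complete; it never knows whether the bytes came straight from disk or were staged in from tape moments earlier.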

In summary, a file segmentation program should allow network administrators and system users to:

  • stripe a single file across multiple nearline devices in parallel
  • restore striped files with as few as one nearline device (rather than the many required under traditional schemes)
  • copy only changed parts of a file back to secondary media
  • restore only the parts of a file that are requested
  • support files larger than the capacity of the underlying file system.

Questions about this article? E-mail the author at mutke@lsci.com.