Reagan Moore, Joseph Lopez, Charles Lofton, Wayne Schroeder, George
Kremenek
San Diego Supercomputer Center
Michael Gleicher
Gleicher Enterprises, LLC
Abstract
Archival storage systems must operate under stringent requirements,
providing 100% availability while guaranteeing that data will not be lost.
In this paper we explore the multiple interconnected subsystems that must
be tuned to simultaneously provide high data transfer rates, high transaction
rates, and guaranteed meta-data backup. We examine how resources must be
allocated to the subsystems to keep the archive operational, while simultaneously
allocating resources to support the user I/O demands. Based on practical
experience gained running one of the largest IBM High Performance Storage
Systems, we propose tuning guidelines that should be considered by any
group that is seeking to improve the performance of an archival storage
system.
Introduction
The San Diego Supercomputer Center (SDSC) is the leading-edge site for the National Partnership for Advanced Computational Infrastructure (NPACI). This National Science Foundation project supports academic computational research throughout the United States. NPACI has the aggressive long-term objective of providing a multi-teraflops compute engine linked to a petabyte-sized archive that can support data movement at rates up to 10 GB/sec. The combined system will support compute-intensive work in which terabyte-sized data sets are written to the archive, and data-intensive work in which multi-terabyte data sets are read from a data collection stored within the archive. We expect the analysis of very large data sets to become a major supercomputing application.
The critical element in the NPACI data-intensive computing system is an archival storage system that can meet the transaction and I/O rate demands of the compute engine. The long-term goal is to provide the same access to the archive as the compute engine has to its local disk. This will make it possible to decrease the amount of disk cache that must be provided to both systems. Based upon measurements made on Cray supercomputers [1,2], a teraflops-capable computer is expected to generate data at the rate of 10 GB/sec. A sustained rate of 10 GB/sec is equivalent to moving roughly a petabyte of data per day (about 0.86 PB). We therefore seek to improve archival storage performance by understanding the configuration and tuning needed to keep any component of the system from becoming a performance bottleneck.
The difficulty is that archival storage systems are quite complex. They
contain multiple subsystems that support data movement, transaction processing,
event logging for meta-data backup, and data migration between multiple
levels of a cache hierarchy. We examine each of these subsystems to understand
how they can impact overall archival storage performance. We illustrate
the configuration selection and associated tuning based upon the IBM High
Performance Storage System (HPSS) [3]. The SDSC HPSS system is one of the
largest HPSS systems in production use [4]. It stores over 6.2 million
files comprising 80 TB of data distributed between three tape silos, and
internally moves up to two terabytes of data per day. The total capacity
for uncompressed data is 180 TB, using 10-gigabyte tapes. SDSC plans to
increase the size of the system by a factor of at least 10 over the next
three years. We therefore also look at the scalability of the system to
determine whether the goal of a high-performance petabyte archive that
can be completely read in a single day is achievable.
Designing an archive
Archival storage systems typically provide resources to support long term storage of data on inexpensive media such as tape. To improve response times, the data is cached on disk. The performance of a system is then usually considered in terms of external usage metrics such as the average access time needed to retrieve data, the rate at which the system can ingest or export data, or the total capacity of the system. From this perspective, archive configuration is focused on the appropriate allocation of resources to disk caches and tape robots to meet user demand.
Of equal importance, however, are the resources that are dedicated to the archival storage infrastructure. These include the number of CPUs used to execute the storage servers, the amount of memory used to support executing processes, the disk space allocated to support internal meta-data directories and transaction logs, and the tape resources dedicated to backup of internal system tables. Tuning of an archival storage configuration must therefore address the allocation of resources to support internal archive functions as well as the external user load.
The configuration of an archival storage system can be understood in terms of three fundamental subsystems: transaction support, data movement, and backup support.
The tuning steps revealed the complex set of interactions that occur
between the archival storage subsystems. A number of potential problems
had to be overcome:
Configuration
The SDSC HPSS configuration is required to support a peak-teraflops compute engine in 1999. The estimated load that must be sustained in 1999 is on the order of 10 TB of data movement per day, into an archive that holds up to 500 TB of data. The dominant concern in the configuration tuning was to develop a system that could meet the corresponding sustained I/O rate of roughly 100 MB/sec. The approach taken was to first validate the ability of the underlying operating system (AIX) and hardware (IBM SP with eight Silver nodes) to handle the maximum expected load. We then evaluated the I/O capability of the system, and the ability of the archival storage system to drive a large fraction of the system I/O rate.
The IBM SP Silver nodes are SMPs, each containing four 604e processors. The nodes are interconnected by a TrailBlazer3 (TB3) switch, which has a peak I/O bandwidth of 150 MB/sec per node. The HPSS standard suite of load tests was run, using the SP to both generate the load and support the HPSS system. This limited the maximum load level to 7 terabytes of data moved through the HPSS system per day. When the load is generated by a separate compute engine, we expect the sustainable data movement to be at least 10 terabytes of data per day. The stress test included all of the service classes, effectively designed to support small, medium, and large files as shown in Table 1.
The tests revealed multiple hardware problems, some of which could only be seen at the highest load. Upgrades to the RAID disk, the SP switch, and the node hardware eliminated all of the problems. In some cases this required the next version of hardware or software. In other cases faulty hardware was replaced. The process eliminated all of the potential sources of hardware problems for the production use of the system. In fact, during the following month of November, the archival storage system was stable, with the only down time occurring during preventive maintenance periods.
The I/O capability of the system was measured by explicit timed movement of large data sets from disk to an SP node and between SP nodes. Two types of disk storage were used: HiPPI attached MAXSTRAT disk that had a measured I/O bandwidth of 60 MB/sec, and IBM SSA attached RAID disk that sustained up to 30 MB/sec for disk reads. An IBM High Performance Gateway Node (HPGN) was used to tie external communication channels into the IBM TB3 switch. To send data to an external computer, the data would flow from a Silver node, through the TB3 switch into the HPGN, and then across the external network. Data movement between nodes through the TB3 switch was measured at rates up to 130 MB/sec. Data movement between nodes connected through the HPGN was measured at rates up to 90 MB/sec.
The I/O rates that could be sustained using HPSS version 3.2 were then measured. The test configuration consisted of seven Silver SMP nodes. One held the HPSS core servers and three Encina SFS (meta-data) servers. Two nodes supported FTP clients and also served as bitfile mover platforms for accessing SSA RAID disk. Four additional nodes served as dedicated bitfile mover platforms. The nodes were interconnected through the TB3 switch. The core server machine had 3 GB of memory. Five of the remaining nodes had 2 GB of memory, with a 3-GB node assigned randomly as a client or mover.
The test consisted of reading a 2-GB file using the HPSS Parallel File Transfer Protocol (PFTP) interface. PFTP is one of the high performance user interfaces to HPSS and is functionally a superset of conventional FTP.
Case 1. The 2-GB file was read from one SSA RAID string on one mover node. Results were 24-26 MB/sec.
Case 2. The file was set up as a parallel file and was read from two SSA RAID strings on one mover node with HPSS striping between the two RAID strings. Results were 48-51 MB/sec.
Case 3. The file was read from four SSA RAID strings connected to two mover nodes, with two SSA strings on each mover node, and with HPSS striping between the four RAID strings. Results were 97-98 MB/sec. That is, the client node was receiving 97-98 MB/sec, while each mover node was sending about half that amount.
Case 4. Two simultaneous files were transferred as in Case 3, using two client nodes, four mover nodes, and eight SSA strings. Essentially, two Case 3 studies were run, each with its own client node, mover nodes, and SSA strings, but sharing a common HPSS core server and system image. Results were 192-194 MB/sec aggregate across the client nodes. The load on the HPSS core server peaked at 75% CPU utilization, implying that higher bandwidths will either require a faster processor or use of an eight-way SMP node.
The above numbers were obtained with no other processes running in the
system. In particular there was no contention for the SSA disk strings.
Since the tests were run within the bounds of a single SP, the SP's own
TB3 switch served as the network. The network protocol was TCP/IP, which
was tuned for the test. The TCP/IP tuning was critical to the success of
the effort, with performance varying by a factor of 2.5 depending upon the
parameter settings. Given data sets large enough to stripe across at least
four SSA RAID strings, the SDSC HPSS system should be able to support disk
data movement of 10 terabytes per day.
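
The near-linear scaling across Cases 1-4 can be checked with a short calculation. The Python sketch below is illustrative only: it takes the midpoint of each measured range (an assumption made here for simplicity) and divides by the number of SSA RAID strings, giving a per-string rate relative to Case 1. The names `MEASURED`, `per_string_rate`, and `scaling_efficiency` are hypothetical and are not part of HPSS.

```python
# Measured PFTP read rates from Cases 1-4, in MB/sec, as (low, high) ranges,
# keyed by the number of SSA RAID strings used in the transfer.
MEASURED = {
    1: (24, 26),    # Case 1: one string, one mover
    2: (48, 51),    # Case 2: two strings, one mover
    4: (97, 98),    # Case 3: four strings, two movers
    8: (192, 194),  # Case 4: eight strings, four movers, two clients
}


def per_string_rate(strings, low, high):
    """Midpoint of the measured range divided by the number of strings."""
    return (low + high) / 2.0 / strings


def scaling_efficiency(measured):
    """Per-string rate of each case relative to the single-string case."""
    base = per_string_rate(1, *measured[1])
    return {
        strings: per_string_rate(strings, *rates) / base
        for strings, rates in sorted(measured.items())
    }


if __name__ == "__main__":
    for strings, eff in scaling_efficiency(MEASURED).items():
        rate = per_string_rate(strings, *MEASURED[strings])
        print(f"{strings} strings: {rate:.2f} MB/sec per string, "
              f"{eff:.1%} of single-string rate")
```

Under the midpoint assumption, the per-string rate falls from 25 MB/sec with one string to about 24.1 MB/sec with eight, so striping retains roughly 96.5% of the single-string rate at the largest configuration tested.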
Subsystem resource allocation
Resources were allocated to the HPSS subsystems to avoid internal performance bottlenecks. The HPSS configurations for transaction support, data movement support, and backup support are shown in Figures 1, 2, and 3, respectively. In practice, all of the subsystems reside on the IBM Silver node SP, augmented with a High node, a Wide node, an RS/6000, and the HPGN.
The goals of the transaction support tuning were to improve stability
and decrease the amount of system maintenance. The individual steps in
the process included:
Figure 1. Transaction support subsystem for HPSS.
The goals of the data movement subsystem tuning were to increase sustainable
I/O rates and decrease the latency of file access. The individual steps
in the process included:
Figure 2. Data movement subsystem for HPSS.
The goals of the backup subsystem tuning were to increase reliability
and remove all contention for backup resources. The individual steps in
the process included:
Figure 3. Backup support subsystem for HPSS.
Resource allocation for user data
We analyzed the usage of HPSS by logging statistics about every data access transaction. The transaction records are maintained in an Oracle database, which allows general queries to be issued to compose usage statistics over arbitrary time periods [5]. During the month of November 1998, the HPSS system ran stably. A daemon pinged the system every 15 minutes and recorded no down times except during the weekly preventive maintenance period. This method of recording availability includes effects due to network reliability and application client reliability, as well as HPSS reliability. During November, a total of 6.7 TB of data was moved between HPSS and external clients.
The dominant use of the HPSS system at SDSC is for backup of files, constituting 30% of the data movement and 64% of the files stored. This provides a relatively uniform background load on the archive. User demand for storing or analyzing very large data sets is the next major use of the system. User demand for storing data from numerical simulations appeared to be relatively constant, and only decreased during holidays.
The user load is separated into three classes of service for small, medium, and large files. Separate disk space is allocated for each service class. Files are automatically assigned a service class based upon their size, as specified in Table 1. Note that separate disk space is allocated to support service classes that request a second copy of the data. The second copy is made when the data sets are written to tape, as all of the disks are RAID. The backup systems at SDSC dominate the usage of the small service class, and are a substantial component of the usage of the medium service class.
The actual user load was calculated by summing across all transactions that took place over the month of November. The total amount of data read from or written into HPSS during November was 6.7 TB, giving an average data movement of 223 GB per day. The peak daily data movement was also calculated for each service class, and is listed in Table 2. The largest amount of data moved in a single day was 642 GB. Note that this is a measurement of the external I/O load on HPSS. The total I/O load is typically twice as much, since the data are written to disk and then to tape. During the peak data movement day in November, HPSS supported 1.4 TB of combined internal and external data movement. Support for service classes in which data are automatically replicated can increase the total data movement to three times the external data request rate.
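
The load arithmetic in this paragraph can be reproduced with a few lines of Python. This is a minimal sketch, assuming decimal units (1 TB = 1000 GB) and a 30-day November; the function names are hypothetical. The multipliers of two and three are the ones stated above for single-copy and two-copy service classes.

```python
GB_PER_TB = 1000.0  # decimal units are an assumption of this sketch


def average_daily_gb(total_tb, days):
    """Average external data movement per day, in GB."""
    return total_tb * GB_PER_TB / days


def total_load_gb(external_gb, tape_copies=1):
    """External load plus the internal writes to tape.

    Each externally moved byte is counted once for disk and once for each
    tape copy, so one tape copy doubles the load and two copies triple it.
    """
    return external_gb * (1 + tape_copies)


if __name__ == "__main__":
    print(f"average: {average_daily_gb(6.7, 30):.0f} GB/day")
    # Bounds on the total load for the 642 GB peak day.
    print(f"peak day, single copy: {total_load_gb(642, 1):.0f} GB")
    print(f"peak day, two copies:  {total_load_gb(642, 2):.0f} GB")
```

For the 642 GB peak day the two bounds are 1284 GB and 1926 GB. The reported 1.4 TB falls between them, which is what a mix of single-copy and two-copy service classes would produce.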
Table 1. HPSS service classes at SDSC.
[Table contents not recoverable. Surviving column headings: Single Copy, Two Copies.]
Table 2. HPSS usage during November 1998.
[Table contents not recoverable. Surviving column heading fragment: per day.]
The peak loads were caused by higher than normal storage of data into HPSS. This was driven by the migration of data from other computer centers into the SDSC archive. During November, the total amount of data written to the archive was 4.4 TB, and the total amount of data read was 2.3 TB. The peak amount stored in a single day was 380 GB, while the peak amount read was 264 GB. The peak daily data movement occurred on different days for each service class.
The effectiveness of the disk cache can be estimated by calculating the average length of time that a data set could remain on disk, before being purged to make room for new data. This requires knowing the average hit rate of the disk cache. During November, 62% of the file accesses were satisfied from disk, implying that the data were already resident on disk before the request was made. The average daily rate at which the disk cache is filled is then estimated as 38% of the read rate (representing data cached onto disk from tape) added to the average storage rate into the archive. By dividing this estimate of the average daily load into the disk space reserved for each service class, the average residency time of a file in the cache can be estimated. Using the peak daily data movement rate gives the minimum cache life. Since this procedure assumes all reads and writes are done against different data sets and that no rewrite of data is done within the cache, this gives a lower bound on the cache lifetime.
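
The estimate described above reduces to two short formulas, sketched here in Python. The read and write totals and the 62% hit rate are the November figures quoted in the text. The cache size passed in the example (1000 GB) is purely hypothetical, since the per-class disk allocations are not reproduced here, and the function names are illustrative.

```python
def daily_fill_rate_gb(read_gb_per_day, write_gb_per_day, hit_rate):
    """Rate at which new data enters the disk cache, in GB per day.

    Cache misses (1 - hit_rate of the reads) are staged from tape onto
    disk, and every write lands on disk first.
    """
    return (1.0 - hit_rate) * read_gb_per_day + write_gb_per_day


def cache_lifetime_days(cache_gb, read_gb_per_day, write_gb_per_day, hit_rate):
    """Lower bound on how long a file stays on disk before being purged."""
    fill = daily_fill_rate_gb(read_gb_per_day, write_gb_per_day, hit_rate)
    return cache_gb / fill


if __name__ == "__main__":
    days = 30
    read_per_day = 2.3 * 1000 / days   # 2.3 TB read in November
    write_per_day = 4.4 * 1000 / days  # 4.4 TB written in November
    fill = daily_fill_rate_gb(read_per_day, write_per_day, hit_rate=0.62)
    print(f"fill rate: {fill:.1f} GB/day")
    # HYPOTHETICAL cache size, for illustration only.
    life = cache_lifetime_days(1000.0, read_per_day, write_per_day, 0.62)
    print(f"lifetime of a 1000 GB cache: {life:.1f} days")
```

Applied to the aggregate November load, the fill rate is about 176 GB per day. Substituting the peak daily rates for a service class in place of the averages gives the minimum cache life for that class.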
Table 3. HPSS file disk cache lifetimes.
The resource allocation to the Small service class is meeting its dominant requirement, that small files can be retrieved from disk without having to mount tapes. The target for the Small service class is to hold data for at least a month on disk. The target for the Medium service class is to store data for a week to support research projects more effectively, while the Large service class should hold data for at least two days. When the load on the system increases, the amount of disk assigned to each service class will need to be expanded proportionally.
By comparing the total data movement per day against the average tape speed, we can estimate the number of tape drives that must be provided to support migration of data off the disk cache. This computation includes the average tape access latency, which is dominated by the time needed to spin the tape forward to the file location. We assume that the data access latency is incurred primarily on data retrieval from the archive. Writes are assumed to be done directly to an available tape. This implies that a workload dominated by tape reads will require more tape drives than one dominated by tape writes.
Table 4. HPSS tape drive utilization.
[Table contents not recoverable. Surviving column headings: I/O Rate (MB/sec), tape drives (average load).]
Thus on average usage days, we keep two tape drives continually busy migrating data off disk and retrieving data sets. When a tape drive is in use, it spends roughly 15% of the time reading or writing data. The rest of the time is spent positioning the tape to read the file. To achieve a higher effective tape speed, the size of the data sets will have to be increased.
In practice, a substantially larger number of tape drives is needed
to support multiple service classes. If all SDSC service classes are accessed
simultaneously, a total of 12 tape drives would be needed.
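
The drive-count estimate can be sketched as the required tape bandwidth divided by the effective rate of one drive. In the Python sketch below, the 15% transfer fraction comes from the text, and the daily tape traffic is taken as the November writes plus the 38% of reads that miss the disk cache. The 10 MB/sec native drive rate is an assumption made here for illustration and is not a figure from this paper; the function names are hypothetical.

```python
import math

SECONDS_PER_DAY = 86400


def effective_drive_rate(native_mb_per_sec, transfer_fraction):
    """Sustained rate of one drive once positioning time is included."""
    return native_mb_per_sec * transfer_fraction


def drives_needed(daily_gb, native_mb_per_sec, transfer_fraction):
    """Whole number of drives needed to carry daily_gb of tape traffic."""
    required_mb_per_sec = daily_gb * 1000.0 / SECONDS_PER_DAY
    per_drive = effective_drive_rate(native_mb_per_sec, transfer_fraction)
    return math.ceil(required_mb_per_sec / per_drive)


if __name__ == "__main__":
    # Average daily tape traffic: writes plus cache-miss reads, in GB/day.
    daily_gb = (4.4 + 0.38 * 2.3) * 1000 / 30
    # ASSUMED native drive rate of 10 MB/sec; not taken from the paper.
    print(drives_needed(daily_gb, native_mb_per_sec=10.0,
                        transfer_fraction=0.15))
```

The result depends directly on the assumed native rate: halving it doubles the number of drives. The count also covers only the average aggregate load, not the per-class contention that raises the practical requirement.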
Scalability
Based on the analysis of the actual load on the HPSS system at SDSC, and the corresponding utilization of the HPSS resources, the scaling needed to achieve a sustained data rate of 100 MB/sec can be estimated. This is roughly 40 times as much data movement as is presently supported in production. The transaction processing subsystem is expected to support external data rates of 100 MB/sec with the present configuration, based upon benchmark tests. Note that this implies an internal data rate of at least 200 MB/sec within HPSS. The benchmark tests indicated the present system is capable of sustaining this rate.
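
The factor of 40 follows from converting the November average of 223 GB per day into a sustained rate and comparing it with the 100 MB/sec target. A minimal Python sketch, assuming decimal units (1 GB = 1000 MB) and hypothetical function names:

```python
SECONDS_PER_DAY = 86400


def gb_per_day_to_mb_per_sec(gb_per_day):
    """Convert a daily volume in GB to a sustained rate in MB/sec."""
    return gb_per_day * 1000.0 / SECONDS_PER_DAY


def mb_per_sec_to_tb_per_day(mb_per_sec):
    """Convert a sustained rate in MB/sec to a daily volume in TB."""
    return mb_per_sec * SECONDS_PER_DAY / 1.0e6


def scale_factor(target_mb_per_sec, current_gb_per_day):
    """Ratio of the target rate to the current average rate."""
    return target_mb_per_sec / gb_per_day_to_mb_per_sec(current_gb_per_day)


if __name__ == "__main__":
    print(f"current: {gb_per_day_to_mb_per_sec(223):.2f} MB/sec")
    print(f"target:  {mb_per_sec_to_tb_per_day(100):.2f} TB/day")
    print(f"scale:   {scale_factor(100, 223):.1f}x")
```

The current average works out to about 2.6 MB/sec, so the target is a factor of about 38.7 higher, which the text rounds to 40.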
The data movement subsystem will require expansion of the disk data cache size to 40 TB to maintain a comparable file cache life. A smaller disk cache will decrease the hit rate, causing a larger fraction of the data sets to be read from tape and increasing the number of tape mounts that are needed. Even with a 40 TB disk cache, at least 80 tape drives would be needed to support the migration of data. However, if the average size of the data sets increases by a factor of 40, the effective speed of the drives increases by a factor of five. The larger transfer decreases the fraction of the time devoted to tape manipulation and improves the effective transmission rate. The total number of drives that are needed is then only 18 to sustain the average I/O rate. If the disk cache is decreased in size by a factor of two, the number of tape drives that is needed is expected to double. Thus a 20 TB disk cache and 36 tape drives may be sufficient to support a teraflops supercomputer. It is interesting to note that the critical element for the disk cache is storage capacity and associated file cache lifetime, while the critical element for the tape system is average file size and effective bandwidth.
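
The dependence of effective tape speed on file size can be illustrated with a simple model. The Python sketch below assumes that the positioning time per file is fixed and that transfer time grows in proportion to file size; this fixed-latency model is an assumption made here, not the method used to derive the figures above. It starts from the roughly 15% transfer fraction reported earlier, and the function names are hypothetical.

```python
def transfer_fraction(size_factor, base_fraction):
    """Fraction of drive time spent moving data after files grow.

    base_fraction is the current transfer fraction. Positioning time per
    file is held fixed while transfer time scales with size_factor.
    """
    transfer = base_fraction * size_factor
    positioning = 1.0 - base_fraction
    return transfer / (transfer + positioning)


def effective_speedup(size_factor, base_fraction):
    """Gain in effective drive rate relative to the current file sizes."""
    return transfer_fraction(size_factor, base_fraction) / base_fraction


if __name__ == "__main__":
    for factor in (1, 10, 40):
        print(f"files {factor:>2}x larger: "
              f"speedup {effective_speedup(factor, 0.15):.2f}x")
```

Under this model a 40-fold increase in file size raises the effective rate by a factor of about 5.8, in the same range as the factor of five quoted above. The gain saturates near 1/0.15, or about 6.7, because a drive cannot exceed its native rate.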
The backup subsystem is expected to support the desired I/O rate, but
may require parallelization of the transaction logging across more servers.
Conclusion
The SDSC production HPSS system has been configured and tuned in anticipation of the I/O loads expected from a teraflops-capable computer. The process has illustrated the necessity to adequately characterize and support the internal subsystems of the archival storage system, as well as the need to characterize and support the user I/O requirements. Successfully implementing a scalable archival storage system requires assigning sufficient hardware resources to keep all system components from failing when under heavy load.
The next challenge is to increase the data rate to
the 10 GB/sec range needed to move a petabyte of data per day.
This requires much cheaper disks and much faster tape drives. Cost-effective
disk storage capacity is becoming available at the rate needed to build
data caches that can handle the storage of hundreds of terabytes. High-performance
tape storage will be harder to acquire. Increasing bandwidths to the desired
level for tape subsystems will either require major advances in tape drive
I/O rates, or striping over a much larger number of peripherals.
Acknowledgements
This work was supported by the National Science Foundation Cooperative
Agreement ACI-9619020.
References