Data Oasis Overview
The Lustre file system consists of Object Storage Servers (OSSs), which store file data on one or more Object Storage Targets (OSTs), analogous to virtual hard drives; a Metadata Server (MDS), which maintains filenames, directories, file locations, and so on; and a high-performance network that links the compute nodes to the file system. Oasis /phase1 in particular has 1 MDS, 16 OSSs, and 64 OSTs (4 OSTs per OSS).
To achieve parallel I/O, data is typically split across multiple OSTs, a technique referred to as striping. Users can control the stripe count (number of OSTs to stripe over), stripe size (number of bytes on each OST), and stripe index (OST index of the first stripe) per file or directory. All three can affect I/O performance and should be matched to the type of I/O being performed. The default stripe count is currently set to 1.
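As an illustration of the parameters above, the following sketch (not any Lustre API, just the arithmetic) shows how round-robin striping maps a byte offset in a file to one of the OSTs the file is striped over:

```python
def ost_index(offset, stripe_count, stripe_size):
    """Index (0 .. stripe_count-1) of the OST holding the byte at
    `offset`, assuming simple round-robin stripe placement."""
    return (offset // stripe_size) % stripe_count

# With a 1 MB stripe size and a stripe count of 4, consecutive 1 MB
# chunks rotate across the 4 OSTs, and the pattern repeats every 4 MB.
MB = 1024 * 1024
print(ost_index(0, 4, MB))       # 0
print(ost_index(MB, 4, MB))      # 1
print(ost_index(4 * MB, 4, MB))  # 0
```

Reads and writes that stay within one stripe touch a single OST; transfers larger than the stripe size fan out across OSTs, which is where the parallelism comes from.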
Striping provides an increase in bandwidth when performing I/O and increases the available disk space for single files. However, the maximum stripe count should not be used indiscriminately, as striping over more OSTs also increases contention on the network. To set the striping for a given file or directory, use the command ‘lfs setstripe’, e.g.:
/opt/lustre/bin/lfs setstripe --size <stripe_size> --count <stripe_count> --index <start_ost_index> <directory|filename>

where:

stripe_size: Number of bytes on each OST (0 = filesystem default). Can be specified with k, m or g (KB, MB and GB respectively)
stripe_count: Number of OSTs to stripe over (0 = default, -1 = all)
start_ost_index: OST index of first stripe (-1 = default)

The current stripe settings for a directory/file can be found by using getstripe:

/opt/lustre/bin/lfs getstripe <directory|filename>
I/O Scenarios

Different I/O usage patterns can stress different parts of the Lustre file system. Specific examples are examined below:
Serial I/O

In this setup, a single process performs all I/O for the application. If the program is a parallel computation, this means all I/O data must be sent to one task, and the interconnect will become a bottleneck for the program. The stripe count and stripe size are important factors in getting the best performance out of the system. The table below shows read/write speeds for different stripe counts, stripe sizes, and transfer sizes for a single writer:
| Stripe Count | Stripe Size | Transfer Size | Max Write (MB/s) | Max Read (MB/s) |
A stripe count of 2 or 4, with a stripe and transfer size of 1m should be sufficient to get good performance on /phase1. Note: If you have a very large file (250GB or more) you should set the stripe count to a number larger than 1 to avoid overfilling individual OSTs.
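The note above can be made concrete with a little arithmetic. The per-OST budget below is a hypothetical number chosen for illustration, not a measured limit of /phase1; the point is only that the stripe count must grow with file size so that no single OST has to hold too large a share of the file:

```python
import math

def min_stripe_count(file_size_bytes, max_bytes_per_ost):
    """Smallest stripe count that keeps each OST's share of the
    file at or below max_bytes_per_ost."""
    return max(1, math.ceil(file_size_bytes / max_bytes_per_ost))

GB = 1024 ** 3
# A 500 GB file against a hypothetical 250 GB-per-OST budget needs
# at least 2 stripes; ~1 TB needs at least 5.
print(min_stripe_count(500 * GB, 250 * GB))   # 2
print(min_stripe_count(1024 * GB, 250 * GB))  # 5
```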
Parallel I/O with one file per process

In this example, each process writes its own individual file. Since multiple files will be open at the same time, the I/O network and the metadata resources become important factors in performance.
For low core counts (<128 cores) there should not be an issue writing one file per process if default striping is used. If a non-default stripe count is set, consideration should be given to the number of writers, the stripe count per file, the stripe size, and the Lustre server load. For example, if a program writes 32 files simultaneously with a stripe count of 2, all 64 OSTs will be utilized (32 writers * 2 stripes = 64 hits). Lustre will normally balance these writes across the 64 OSTs available on /phase1. However, if the program has 128 writers with a stripe count of 4, there will be 512 hits. As /phase1 has 64 OSTs, each OST will have ~8 stripes writing to it at once. This causes lower performance and a higher load on each Lustre server, and the increased load raises the chances of timeouts and failures. For this core count, the best performance is achieved with a low stripe count of 1 or 2.
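The load estimates in this paragraph reduce to writers * stripe_count "hits" spread across the available OSTs; the sketch below simply restates that arithmetic with the /phase1 numbers:

```python
def stripes_per_ost(writers, stripe_count, num_osts=64):
    """Average number of stripes each OST services at once."""
    return writers * stripe_count / num_osts

# The two cases from the text, on /phase1's 64 OSTs:
print(stripes_per_ost(32, 2))    # 1.0 -- balanced, one stripe per OST
print(stripes_per_ost(128, 4))   # 8.0 -- ~8 stripes per OST, heavy load
```

Keeping this ratio near 1 is a reasonable rule of thumb when choosing a stripe count for file-per-process workloads.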
For medium core counts (<512 cores), it is not recommended to have all processes writing out files at the same time, as the file system's throughput limits will be reached. Medium jobs that require this, and all large jobs (>512 cores), must stagger their I/O in order to avoid putting too high a load on the machine.
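One simple staggering scheme is to let ranks write in fixed-size batches, so that only a bounded number of files are open at once. This is a sketch only; in a real MPI code each phase would end with a barrier so the groups do not overlap:

```python
def io_phases(n_ranks, batch_size):
    """Split ranks into batches; only the ranks in the current batch
    perform I/O during that phase (with a barrier between phases in
    an actual MPI program)."""
    return [list(range(start, min(start + batch_size, n_ranks)))
            for start in range(0, n_ranks, batch_size)]

# 512 writers staggered in groups of 64: eight phases, each putting at
# most 64 simultaneous writers on the file system instead of 512.
phases = io_phases(512, 64)
print(len(phases))    # 8
print(phases[0][:4])  # [0, 1, 2, 3]
```

The batch size here is a tunable the application would choose based on the observed server load; 64 is an assumption for the example, not a recommended value.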