UCSD Mail Code 0505
9500 Gilman Drive
La Jolla, CA 92093-0505
Phone: 619 534-5158
Symbiosis is a term from biology meaning the living together of dissimilar organisms in close proximity. We adapt the term to refer to an increase in throughput that can occur when jobs are coscheduled on multithreaded machines. On a multithreaded machine such as the Tera MTA (Multithreaded Architecture), coscheduled jobs share system resources very intimately, on a cycle-by-cycle basis. This can increase system utilization and boost throughput, but it can also lead to pathological resource conflicts that lower overall system performance. We exhibit a number of job interactions, both beneficial and harmful, and explain the observed phenomena in a framework of shared system resources. We describe a user-space jobscheduler called S.O.S. that dynamically determines which jobs ought to be coscheduled based on resource utilization measurements. S.O.S. can boost system throughput by more than 10% even when the job mix being scheduled is already highly tuned and efficient.

1 Introduction
A brief discussion of the motivation and architectural innovation behind multithreading will set the stage for our discussion.

1.1 The Efficacy of Multithreading
The Tera MTA (Multithreaded Architecture) uses multithreading to cover operational latencies and so keep the processor highly utilized. Multithreading assists the goal of high processor utilization by fetching instructions from multiple threads. Many modern processors attempt to exploit ILP (instruction-level parallelism) within a single thread to cover latencies: independent instructions can be executed without waiting for each other to complete. However, there may be insufficient ILP in a single sequence of instructions to cover all latencies. When latencies cannot be covered, idle cycles called phantoms occur. The probability of having an instruction available for issue at every opportunity can be increased by fetching and scheduling from multiple threads. The source of these threads can be independent parts of the same parallel program (such as independent iterations of a loop) or threads from different programs. Figure 1 shows a comparison of processor utilization levels for three non-multithreaded processors and the MTA running three floating-point-intensive NAS kernel benchmarks. It can be seen that the MTA is, on average, able to achieve greater processor utilization by multithreading.
Figure 1: Performance as a percentage of peak of three commodity processors and the MTA on the three floating point intensive NPB 2.3 Class A benchmark kernels. Source for all results except the MTA is http://www.nas.nasa.gov/Software/NPB
1.2 Architectural overview of the MTA
The MTA system is designed to have from 1 to 256 processors that share a large memory. The machine can have either 1 GB or 2 GB of memory per processor, but randomized memory mapping and a high-interconnectivity network provide near-uniform access time from any processor to any memory location. There are no data caches. Instead, multithreading is used to tolerate the latency of memory accesses, which require on the order of 100 to 150 cycles. Each processor has about 100 streams, where a stream is hardware (including 32 registers and a program counter) that is devoted to running a single thread of control. The processor makes a context switch on each cycle, choosing the next instruction from one of the streams that is ready to execute. From the point of view of a thread, when it is assigned to a stream, its instructions are executed one at a time, requiring 21 cycles per instruction. Each instruction includes a "lookahead" number (between 0 and 7) that designates how many additional instructions can be executed before the result of that instruction's memory operation is needed. Since memory operations take 5 or 6 times the 21-cycle issue interval, fast thread execution requires that the compiler be able to schedule memory operations ahead of when their results are needed. From the point of view of a processor, each stream can provide it with at best one operation every 21 cycles (if there is sufficient lookahead), but at worst, fewer than one operation every 100 cycles. Full utilization requires the compiler to schedule enough threads that have sufficient lookahead. If the processor has no stream ready to issue an instruction, the no-op for that issue slot is called a phantom. If a single program has insufficient threads to fully utilize the processor, additional programs can be scheduled, assigned to streams and allowed to compete for the processor.
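The arithmetic of streams, lookahead and latency above can be put into a toy model. The 21-cycle issue interval and roughly 150-cycle memory latency follow the text; treating lookahead as dividing the memory stall by (k + 1) is our simplifying assumption, not an MTA specification:

```python
# Toy model of multithreaded issue on an MTA-like processor.
# Assumed parameters, not measured hardware data.

ISSUE_INTERVAL = 21     # minimum cycles between issues from one stream
MEM_LATENCY = 150       # assumed memory latency in cycles

def per_stream_issue_interval(lookahead):
    """Average cycles between issues for a single stream.

    With lookahead k, up to k later instructions overlap a memory
    operation's latency, so the effective interval per instruction is
    MEM_LATENCY / (k + 1), floored at the 21-cycle pipeline limit.
    """
    return max(ISSUE_INTERVAL, MEM_LATENCY / (lookahead + 1))

def utilization(num_streams, lookahead):
    """Fraction of issue slots filled by num_streams identical streams."""
    return min(1.0, num_streams / per_stream_issue_interval(lookahead))
```

In this model, with full lookahead (7) just 21 streams reach full utilization, while with no lookahead even 100 streams leave phantoms, echoing the "at best one operation every 21 cycles, at worst fewer than one every 100" range described above.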
2 Shared System Resources and Symbiosis
Coscheduled jobs on a multithreaded machine share system resources very intimately. This intimate sharing leads to interactions, both harmful and beneficial, that would not be observed on traditional, non-multithreaded machines.

2.1 Shared system resources
When a multithreaded machine is multiprogrammed, system resources are shared between the running programs cycle by cycle. This is in contrast to the resource sharing that takes place on multiprogrammed, non-multithreaded machines. On traditional machines, multiprogramming is achieved by timeslicing: one program acquires the CPU for several million cycles and has exclusive use of the fetch unit, functional units and other CPU hardware during that time. After expiration of the timeslice, another program steps in and uses the CPU, and this continues in some fair fashion. However, on the MTA, as on other multithreaded machines such as the SMT, multiprogrammed jobs compete against each other for system resources each cycle. They compete, for example, for issue slots. Each cycle a ready instruction from one of the streams is selected for issue. Although the algorithm used for selecting the next ready instruction is a fair policy approximating round robin, one program may get more of the CPU if it has more streams and/or more ready instructions than the other. The coscheduled jobs compete for other resources as well, such as instruction cache entries and memory bandwidth. Because sharing of and competition for system resources is so intimate on a multithreaded machine, the interactions between multiprogrammed jobs can be complicated. The performance of a job can be profoundly affected by what jobs it is coscheduled with. The usage of shared system resources, and ultimately the utilization of the machine, can be dependent on which jobs are selected to run together.

2.2 Symbiosis
Symbiosis is a term from biology that means the beneficial living together of distinct organisms in close proximity. We have adapted that term to refer to an increase in system utilization that can occur when two or more programs are executed together on a multithreaded machine. In previous work we exhibited symbiosis for a jobmix made up of the NPB 2.3 kernels. Figure 2 extends results of that paper and shows the benefit of running modified versions of the kernels and a volume rendering application called MPIRE pairwise on the MTA. Symbiosis is defined as a quantity normally between 0 and 1 on the MTA (when there is some benefit to coscheduling) and is negative when throughput would be better if the jobs were run separately. Formally, we define symbiosis in terms of throughput rate, the number of jobs completed per unit time.
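One formalization consistent with this description (the notation is ours, offered as a plausible reconstruction rather than the paper's own equation) is:

```latex
% Throughput rate of a set of n jobs completing in time T:
\text{throughput rate} = \frac{n}{T}
% With standalone times T_1, \dots, T_n and coscheduled completion
% time T_{co}, a symbiosis measure with the stated sign behavior is:
\text{Symbiosis} = 1 - \frac{T_{co}}{\sum_{i=1}^{n} T_i}
```

This quantity is positive (and below 1) exactly when running the jobs together beats running them back to back, and negative when coscheduling hurts, matching the sign convention described above.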
Figure 2: Symbiosis of pairs of jobmix NPB 2.3 kernels Class A and MPIRE.
2.2.1 Positive symbiosis
Positive symbiosis is the usual result of coscheduling two or more jobs on the MTA. This corresponds to the positive bars in Figure 2. Jobs on the MTA typically alternate serial and parallel sections. When a job is in a parallel section it has many threads, each corresponding to a distinct piece of parallel work. These threads in turn are mapped to streams; the probability of having an instruction to issue each cycle from one of the streams is high because many streams are active. Phantoms are low in parallel sections. In serial sections, by contrast, system utilization is low and phantoms high. A serial section corresponds to just one thread, and this is mapped to just one stream. The MTA cannot issue instructions from the same stream more often than once every 21 cycles, so if a serial section is not coscheduled with some other job, 20/21 ~= 95% of the cycles will be phantoms. Figure 3 shows output of a dynamic execution profiling tool called Traceview for FT and MPIRE running each alone and then paired together. Traceview graphs system utilization during execution; a value of 1.0 indicates 100% system utilization. Traceview plots have the shape of a step function, with the steps being parallel sections and the gaps between the steps being serial sections. Phantoms manifest as vertical space between the top of the step function and the line marked '1.0' (these are phantoms in parallel sections) and also as horizontal space between steps (these are phantoms in serial sections). It will be seen from Fig. 3 that when two jobs are run together they overlap each other's serial sections with parallel sections. The net result is a significant boost in system utilization and throughput. If a parallel job does not fully utilize the machine in its parallel sections, then system throughput can also be increased by overlapping parallel sections. In the last graph of Figure 3 both horizontal and vertical space have been reduced, lowering overall phantoms.
Figure 3: Traceviews of FT and MPIRE running alone and then together. The top graph is FT running alone, the middle MPIRE running alone and the bottom graph is both running together.
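The overlap effect visible in the Traceview plots can be illustrated with a toy timeline model. The phase lengths and rates below are invented for illustration (a 21-cycle-equivalent serial phase at rate 1/21, a large parallel phase at 90% of peak issue rate); they are not the measured FT or MPIRE profiles:

```python
def time_alone(phases):
    """Completion time (cycles) of a job run by itself.

    A job is a list of (work, rate) phases: `work` instructions
    retired at `rate` instructions per cycle.
    """
    return sum(work / rate for work, rate in phases)

def time_together(phase_lists):
    """Completion time when the jobs share one processor's issue slots.

    The processor issues at most 1.0 instruction per cycle; when the
    jobs' combined issue demand exceeds that, each job is scaled back
    in proportion (an idealization of the fair issue policy).
    """
    jobs = [[[w, r] for w, r in p] for p in phase_lists]  # mutable copy
    t = 0
    while any(jobs):
        total = sum(j[0][1] for j in jobs if j)
        scale = min(1.0, 1.0 / total) if total else 1.0
        for j in jobs:
            if j:
                j[0][0] -= j[0][1] * scale   # drain current phase's work
                if j[0][0] <= 1e-9:
                    j.pop(0)                 # phase complete
        t += 1
    return t

# Hypothetical jobs: serial phase (one stream, rate 1/21) plus a
# parallel phase at 0.9 of peak issue rate.
A = [(1.0, 1 / 21), (100.0, 0.9)]   # serial first, then parallel
B = [(100.0, 0.9), (1.0, 1 / 21)]   # parallel first, then serial
```

For these two jobs the simulated coschedule finishes well before the roughly 264-cycle sequential total, because each job's serial phase hides under the other's parallel phase, exactly the horizontal-gap filling seen in the bottom Traceview graph.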
Fig. 4 gives a breakdown of how much of the increase in utilization for each pair of Fig. 2 was due to eliminating phantoms in parallel sections and how much was due to eliminating phantoms in serial sections. It will be seen that, for these highly tuned codes, little gain is achieved from overlapping efficient parallel sections and most symbiosis comes from overlapping serial with parallel sections. EP + MPIRE actually caused an increase in phantoms, which will be discussed in the next section.
Figure 4: Breakdown of phantoms reduced in symbiotic pairings of Fig. 2 by category.
2.2.2 Negative symbiosis
Figs. 2 and 4 show one case where a pathological job pairing causes system utilization to drop. While this is not the usual case on the MTA, a case like this exhibits a resource allocation policy shortcoming that can lead to reduced system utilization and throughput. In the case of EP + MPIRE, it would have been better to schedule the jobs alone or with other partners. Coscheduling these particular jobs resulted in decreased system utilization and throughput.
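One way such pathologies can be avoided, anticipating the discussion below, is to pair a fair-split allocation policy with jobs that periodically re-request streams. The following sketch uses a hypothetical StreamPool manager; the MTA's actual runtime interface is not shown in this paper:

```python
class StreamPool:
    """Hypothetical stream manager enforcing an equal-split policy.

    Jobs are expected to "come up for air": call request() again
    periodically instead of keeping the grant they got at startup.
    Fairness here relies on every long-running job re-querying.
    """
    def __init__(self, total):
        self.total = total
        self.granted = {}              # job name -> streams currently held

    def request(self, job, wanted):
        """Grant at most an equal share of the pool among active jobs."""
        self.granted.setdefault(job, 0)
        fair = self.total // len(self.granted)
        self.granted[job] = min(wanted, fair)
        return self.granted[job]

    def release(self, job):
        """A finished job gives back all of its streams."""
        self.granted.pop(job, None)
```

A lone job gets everything it asks for; once a second job arrives, both converge to 45 of 90 streams as they re-request; and when one finishes, the survivor's next request recovers the whole pool, avoiding the long throttled tail.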
When two parallel jobs both ask for streams at the same time, typically one gets all that it asks for and the other gets only what is left over. Now when the first job goes out of its parallel section and gives up its streams, one of two things can happen: 1) the other job recognizes this and acquires more streams, or 2) the other job does not recognize this and goes on as before with fewer streams than it could have. While software development on the MTA is evolving towards a system which will flexibly and dynamically reapportion streams as they become available, it is currently possible to get into pathological situations where one job "throttles" the other. In this case EP asks for and gets 90 streams. MPIRE asks for 90 but gets only 10. When EP finishes, MPIRE does not know to ask for more and continues in a throttled state. The result is a long inefficient tail where EP has completed and MPIRE runs with reduced streams. This can be corrected by rewriting MPIRE to "come up for air" and query the system more often, or by causing the system to split the available stream pool into two equal halves. However, with just the second approach, the longer running MPIRE job will still be somewhat throttled since it will not know when EP has finished and 50 more streams have become available. If a policy of fair resource division is combined with smart programs that "come up for air" then the problem is fully resolved: each job gets a fair share of the system when both are running, and neither gets less than it could have when the other finishes and more resources become available. Besides encountering resource scheduling anomalies, jobs can conflict on system resources such as instruction cache entries and bandwidth to main memory. In section 4 we present a scheme for detecting and avoiding harmful job interactions.

2.3 Dynamic hardware counters - oversubscription and undersubscription
Symbiosis is difficult to exploit in a practical way if, to establish whether or not jobs are symbiotic, we need to know the execution times of jobs running alone as well as their execution time when run together. If we were to run the same batch of jobs every night we could experiment and, over time, evolve to the most efficient schedule. However, we want a dynamic job scheduler that can identify good coschedules on the fly even if the jobs have never been run before. Fortunately, machine utilization is the instantaneous, differentiable component of throughput. If a job pairing results in increased system utilization over the course of its run, then we can be sure that the overall execution time will be reduced. (This could only be false if the jobs run together were somehow executing more instructions than the jobs run separately. This does not occur on the MTA.) The Tera MTA comes equipped with dynamic hardware counters that can be used to measure system utilization. Some of the more useful are:
Cycle counter. The difference between successive reads of this counter is the number of cycles in the elapsed interval, per processor.
Issue counter. The difference between successive reads of this counter is the number of instructions issued in the elapsed interval, per program.
Phantom counter. The difference between successive reads of this counter is the number of phantoms in the elapsed interval, per processor.
Ready counter. The difference between successive reads of this counter is the number of ready instructions (those available to be issued) in the elapsed interval, per processor.
For each processor, the sum of the Issue counters over all programs plus the Phantom counter equals the Cycle counter, since on every cycle an instruction is either issued or not. The phantom counter is a measure of system undersubscription. If this counter is high, and if other resources (such as streams or memory) are not overallocated, it would make sense to bring in more jobs to increase the odds of issuing each cycle. The phantom counter is typically high in serial sections.
The ready counter is a measure of oversubscription. Ideally the system would have one ready instruction every cycle. It does not hurt to have more than one, but there is no advantage to it, as only one instruction per CPU can issue each cycle. If the current set of coscheduled jobs is oversubscribed while another set (either on a different processor or in a different timeslice) is undersubscribed, it may make sense to swap a highly parallel job in the current set with a less resource-intensive job in the other set. This would boost resource utilization in the undersubscribed set and would probably not adversely affect system utilization in the current oversubscribed set.
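These counter readings can be turned into a simple subscription classifier. The thresholds below are illustrative assumptions of ours, not values taken from the MTA or from this paper:

```python
def classify(cycles, phantoms, ready):
    """Label a sampling interval from hardware-counter deltas.

    phantoms / cycles -> fraction of issue slots wasted; high values
                         mean the interval was undersubscribed.
    ready / cycles    -> mean ready instructions per slot; only one can
                         issue, so values well above 1 mean the interval
                         was oversubscribed.
    """
    phantom_rate = phantoms / cycles
    ready_rate = ready / cycles
    if phantom_rate > 0.25:          # assumed threshold
        return "undersubscribed"
    if ready_rate > 2.0:             # assumed threshold
        return "oversubscribed"
    return "balanced"
```

A scheduler could sample such labels per coschedule and move a highly parallel job from an oversubscribed set to an undersubscribed one, as the paragraph above suggests.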
Let us see how smoothing out system resource utilization can pay off. If we take four NPB benchmarks and number them 0:CG, 1:EP, 2:FT, 3:MG we can see that there are three possible ways of taking two jobs at a time corresponding to three possible partners for CG. We can identify these three schedules as 01_23, 02_13 and 03_12. 01_23 means that in this schedule jobs 0 and 1 (CG and EP) are run together and then jobs 2 and 3 (FT and MG); the other schedules are also labeled in a way that describes their pairings. It turns out that the schedule 03_12 is slightly faster than 02_13 and about 4.3% faster than 01_23. Figure 5 shows the ready counter for each pairing of each schedule. 01_23 is highly imbalanced. CG and EP as a pair oversubscribe the machine. FT and MG together undersubscribe it. The other schedules smooth out system issue slot competition as indicated by the reduced difference in the height of their bars in Fig. 5.
Figure 5: Ready counter measurements for the 1st and 2nd pairs of each of the possible schedules of 0:CG, 1:EP, 2:FT, 3:MG.
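The schedules compared above can be enumerated mechanically. This small helper (ours, not part of S.O.S.) generates every pairing of a job list and names it in the 01_23 style:

```python
def schedules(jobs):
    """All ways to partition an even-length job list into unordered pairs."""
    if not jobs:
        return [[]]
    first, rest = jobs[0], jobs[1:]
    result = []
    for i, partner in enumerate(rest):
        remaining = rest[:i] + rest[i + 1:]   # jobs not yet paired
        for tail in schedules(remaining):
            result.append([(first, partner)] + tail)
    return result

def label(schedule):
    """Name a schedule in the 01_23 style used in the text."""
    return "_".join(f"{a}{b}" for a, b in schedule)
```

For four jobs this yields exactly the three schedules 01_23, 02_13 and 03_12; the count grows as (n-1)(n-3)...1, so sampling every schedule stays cheap only for small job sets.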
2.4 Symbiosis with managed resource sharing
As observed above in section 2.2.2, one way of staying out of pathological job interactions is to manage resources from a higher level. Instead of letting jobs fight among themselves for how many streams or other resources each ought to get, a resource manager can divide the resources up evenly. However, while dividing up resources evenly is an easy and fair policy to implement, a clever resource scheduler can do better. There is a sense in which one job can "deserve" more resources than another. Apart from priority (which is just an outside valuation of one job as being more valuable than another), a job that uses resources efficiently contributes to system goals such as utilization and throughput. A scheduler that recognizes this can exploit it. The efficiency of jobs on the MTA vs. the number of streams devoted to them has the typical shapes shown in Figures 6 and 7.
Figure 6: A plot of execution time vs. streams devoted to FT
Figure 7: A plot of execution time vs. streams devoted to EP
Because 21 streams are required simply to cover the pipeline latency, most parallel jobs have nearly linear speedups up to that number of streams. More streams may be required to cover memory latencies. After a certain point there is no more benefit to adding streams, either because the system is fully utilized and all (or almost all) latencies are covered, or because the job has insufficient parallelism to benefit from more streams. If it is the first reason, then the job is highly efficient and will exhibit little symbiosis, at least in its parallel sections. However, if it is the second reason, then the job may have high phantoms in its parallel sections and so present opportunities for symbiosis. Furthermore, it makes sense to devote proportionally more streams to the symbiotic partner that can make better use of them. Note in Figures 6 and 7 that the slope of EP is steeper between 20 and 30 streams than that of FT. In fact FT suffers from a lack of outer-loop parallelism. There is an optimization that corrects this by doing multiple butterflies at once, but we do not consider that version of FT here. Fig. 8 shows that there is an 8% performance gain to be had by splitting 90 streams 30-60 in favor of EP rather than 45-45 evenly between EP and FT.
Figure 8: Execution time of both EP and FT together depending on streams devoted to both (out of 90 streams total on one CPU.)
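The sweet-spot search behind Fig. 8 can be sketched as a brute-force scan over stream splits. The two rate curves below are made-up saturating curves chosen only to mimic the qualitative shapes of Figures 6 and 7 (FT saturating early, EP scaling further); they are not measured data, though with these assumptions the scan happens to land on the same 30-60 division reported above:

```python
def rate_ft(streams):
    """Issue demand of a job whose parallelism saturates early (FT-like)."""
    return 0.9 * min(streams, 30) / 30

def rate_ep(streams):
    """Issue demand of a job that keeps scaling to more streams (EP-like)."""
    return 1.0 * min(streams, 70) / 70

def best_split(total, rate_a, rate_b):
    """Scan every division of `total` streams between two jobs and
    return the split (a, b) that maximizes combined issue demand."""
    a = max(range(total + 1),
            key=lambda s: rate_a(s) + rate_b(total - s))
    return a, total - a
```

The scan gives streams first to whichever job has the steeper marginal benefit, which is exactly the "devote proportionally more streams to the partner that can make better use of them" rule stated in the text.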
2.5 Symbiosis with a realtime constraint
While high system utilization with associated high throughput is a laudable goal, some jobs have realtime constraints that must be respected even if utilization is adversely affected. MPIRE is a job that is used, in one mode, to interactively volume render datasets. In interactive mode, there is a tolerable threshold for image turnaround that should not be violated. This amounts to a soft realtime constraint: the user will tolerate delays of 1 or 2 s between frames, but not 10 s. To ensure fast turnaround one might consider not coscheduling any other job with MPIRE. However, Fig. 9 shows that utilization can be boosted with only a slight increase in image turnaround if the number of streams devoted to a competing job is kept low. In this case a few streams are devoted to FT, with an increase in system utilization and throughput and a small increase in the latency between images generated by MPIRE. The more resources devoted to FT, the better the overall system utilization (up to a point). There is a tradeoff between boosting system utilization and degrading MPIRE performance. An increase in image turnaround of 3-4% will not be noticed by the user. One could consider assigning a priority to jobs such as MPIRE indicative of the amount of the machine they would be willing to share. A priority of 0 would indicate that they are unwilling to share the machine at all. A priority of 5 would mean they would be willing to share 5 streams (or 5% of the machine). This sort of scheme would allow the system to fill cracks around high-priority jobs without affecting them very much.
Figure 9: Increase in throughput vs. decrease in MPIRE performance when streams are devoted to a parasite program (FT)
4 S.O.S. (Sample, Optimize, Symbios)
Drawing on techniques and observations detailed in the previous section, we have designed a user-level scheduler that schedules jobs fairly while boosting system utilization and throughput. Presented with a mix of jobs to run and a priority for each, expressed as a maximum percentage of the machine which each job would be willing to share, the scheduler first runs several schedules of the jobs for short bursts each. This is called the sample phase. A schedule is a set of tuples where each tuple is a set of coscheduled jobs; in this study our tuples are pairs. Schedules differ by what jobs are paired together. Every pair must be physically runnable (that is, it cannot overallocate memory or streams) and every schedule must run all of the jobs. During the sample phase, data is collected on each schedule. Specifically, the Ready, Phantom and Issue counters are read. The sampled data may be consulted to guide the permuting of schedules; so, for example, a job in an overallocated pair may be swapped for a job in an underallocated pair to form a new schedule. In the optimize phase one of the sampled schedules is predicted to be better than the others based on examination of the sampled data. In this case a schedule with a lower phantom rate is predicted to be better than a schedule with higher phantoms. In the symbios phase the best predicted schedule is run to completion. In this case we sample for 10 seconds and symbios for at least 500 seconds. We have experimented with two flavors of S.O.S., one in which job pairs are completely replaced each timeslice and one in which only one job is replaced each timeslice. The second flavor reduces pressure on the memory subsystem during swapping by ensuring that at least one job is executing while at most one is paging in and out.
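The sample-optimize-symbios cycle can be compressed into a few lines of scheduler logic. Here phantom_counter is a stand-in callable for running a pair briefly and reading the hardware phantom counter; the names and structure are our sketch, not the S.O.S. implementation itself:

```python
def sample(schedule, phantom_counter, burst=10):
    """Sample phase: run each pair in `schedule` for a short burst and
    total the phantoms observed. phantom_counter(pair, seconds) stands
    in for executing `pair` and reading the hardware counter."""
    return sum(phantom_counter(pair, burst) for pair in schedule)

def sos(candidate_schedules, phantom_counter, run):
    """Optimize: predict that the schedule with the fewest sampled
    phantoms is best. Symbios: run that schedule to completion."""
    best = min(candidate_schedules,
               key=lambda s: sample(s, phantom_counter))
    return run(best), best
```

Because the sample bursts are short (10 seconds against 500 or more of symbios), the cost of trying the candidate schedules is amortized over the long run of the winner.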
Figure 10 shows the results of using S.O.S. on a jobmix made up of versions of the 5 NPB kernels and MPIRE. We schedule a single MTA CPU and 1 GB of main memory. These jobs are modified to each run about 150 seconds when run alone and to each require 500 MB of memory. Due to the memory required, no more than two jobs can be coscheduled. We noted above that, in any case, there is not much benefit to running more than two highly tuned programs at once. The first column, labeled SEQ, is the total execution time if each job is run by itself. The second, BLIND, is the average execution time to complete the job mix achieved by several schedules that blindly pair jobs. The third, SYMBIO, is that achieved by the scheduler after first doing a sample phase to determine efficient pairings. The fourth, HIERARCH, is that achieved by the scheduler applying the method of SYMBIO but also looking for performance sweet spots in the division of streams during the sample phase, as in section 2.4. The fifth, BEST, applies all of the previous techniques but also only swaps one job at a time (effectively halving the swapping overhead). We can see that although all of these jobs are highly efficient on the MTA (in fact all have been extensively tuned for the MTA) none maxes out machine utilization. Higher machine utilization is achieved by pairing jobs for co-execution. Doing this cleverly, by trying several possible pairings and picking the best one, pays off in yet higher machine utilization and throughput. By avoiding pairs that overallocate the machine and those that underallocate it, SYMBIO and the disciplines that improve upon it (HIERARCH and BEST) avoid pathological pairings such as EP + MPIRE. These show up as overallocated pairs in the sample phase as they both have big, low-phantom parallel sections. SYMBIO and the improvements on it smooth out resource allocation across the schedule and so stay out of pathological stream-throttling situations.
Giving more resources to jobs that use them better, and overlapping swapping and execution of jobs to hide swapping overhead, yields even more improvement.
Figure 10: Wall clock time to complete jobmix CG, EP, FT, IS, MG and MPIRE using various S.O.S. scheduling disciplines.
Multithreaded machines such as the MTA have the ability to operate as throughput engines. It is possible to squeeze almost the last cycle of performance out of the machine regardless of the performance of the individual jobs. Compute facilities do not have to be stuck with the utilization levels of individual jobs; even inefficient jobs, if coscheduled together, can use almost every cycle. However, some care must go into the design of jobschedulers for these machines. Issues and concerns arise due to the intimate sharing of resources that simply would not arise on more traditional machines. The S.O.S. scheme allows the scheduler to dynamically discover good coschedules and to avoid pitfalls in coscheduling. S.O.S. pays off in improved system throughput.

Future Work
In future work we will show S.O.S. avoiding other coscheduling pitfalls that come about when jobs compete for scarce system resources such as instruction cache and memory bandwidth. We will also report results for job tuples of size greater than two, with less highly tuned programs.
This work was supported in part by NSF award ASC-9613855 and DARPA contract DABT63-97-C-0028.

References
Boisseau, Jay, Larry Carter, Allan Snavely, David Callahan, John Feo, Simon Kahan and Zhijun Wu. "CRAY T90 vs. Tera MTA: The Old Champ Faces a New Challenger", Cray User's Group Conference, June 1998.
Carter, Larry, John Feo and Allan Snavely. "Performance and Programming Experience on the Tera MTA", SIAM, April 1999.
Johnson, Greg and Jon Genetti. "MPIRE: Massively Parallel Interactive Rendering Environment", http://mpire.sdsc.edu.
Snavely, Allan, Larry Carter, Jay Boisseau, Amit Majumdar, Kang Su Gatlin, Nick Mitchell, John Feo and Brian Koblenz. "Multi-processor Performance on the Tera MTA", SC 98, November 1998.
Snavely, Allan, Nick Mitchell, Larry Carter, Jeanne Ferrante and Dean Tullsen. "Explorations in Symbiosis on two Multithreaded Architectures", M-TEAC, January 1999.
Agarwal, R.C., B. Alpern, L. Carter, F.G. Gustavson, D. Klepacki, R. Lawrence and M. Zubair, ``High Performance Parallel Implementations of the NAS Kernel Benchmarks on the IBM SP2,'' IBM Systems Journal, Vol 34, pp. 263-272 (1995).
Linzer, E. and E. Feig, ``Modified FFT's for Fused Multiply-Add Architectures'', Mathematics of Computation, Vol 60, pp. 347-361 (January 1993).
Goedecker, S., ``Fast Radix 2, 3, 4, and 5 Kernels for Fast Fourier Transformations on Computers with overlapping multiply-add instructions'', SIAM Journal on Scientific Computing, Vol 16, pp. 1605-1611, (November 1997).
Crowl, L. A., ``How to Measure, Present, and Compare Parallel Performance,'' IEEE Parallel & Distributed Technology, pp. 9-25, (Spring 1994).
Hockney, R. W., ``A Framework for Benchmark Performance Analysis,'' Proc. 2nd European Workshop, (September, 1991).
Lo, Jack L., Susan J. Eggers, Joel S. Emer, Henry M. Levy, Rebecca L. Stamm and Dean Tullsen. ``Converting Thread-Level Parallelism to Instruction-Level Parallelism via Simultaneous Multithreading'', ACM Transactions on Computer Systems, August 1997.