Blue Gene/L

Table of Contents
Abstract
Blue Gene/L
Introduction
The performance spectrum
Design and Analysis of the Blue Gene/L Torus Interconnection Network
Torus Network
Simulator Overview
Sample Performance Studies
Application
The protein folding problem
Current view of folding mechanisms
References
www.research.ibm.com/bluegene

Abstract

Blue Gene/L (BG/L) is a 64K (65,536) node scientific and engineering supercomputer that IBM is developing with partial funding from the United States Department of Energy. This paper describes one of the primary BG/L interconnection networks, a three-dimensional torus. We describe a parallel performance simulator that was used extensively to help architect and design the torus network, and present sample simulator performance studies that contributed to design decisions. In addition to such studies, the simulator was also used during the logic verification phase of BG/L for performance verification, and its use there uncovered a bug in the VHDL implementation of one of the arbiters.

Blue Gene/L (BG/L) is a scientific and engineering, message-passing supercomputer that IBM is developing with partial funding from the U.S. Department of Energy Lawrence Livermore National Laboratory. A 64K node system is scheduled to be delivered to Livermore, while a 20K node system will be installed at the IBM T.J. Watson Research Center for use in life sciences computing, primarily protein folding. A more complete overview of BG/L may be found in [1], but we briefly describe the primary features of the machine.

The Blue Gene/L architecture

Introduction
The first computer in the Blue Gene series, Blue Gene/L, developed through a partnership with Lawrence Livermore National Laboratory, cost US$100 million and is intended to scale to speeds in the hundreds of TFLOPS, with a theoretical peak performance of 360 TFLOPS. This is almost ten times as fast as the Earth Simulator, the fastest supercomputer in the world before Blue Gene. In June 2004, two Blue Gene/L prototypes entered the TOP500 Supercomputer List at the #4 and #8 positions. On September 29, 2004, IBM announced that a Blue Gene/L prototype at IBM Rochester (Minnesota) had overtaken NEC's Earth Simulator as the fastest computer in the world, with a speed of 36.01 TFLOPS, beating the Earth Simulator's 35.86 TFLOPS. The machine later reached a speed of 70.72 TFLOPS.

Linux will be the main operating system for IBM's upcoming family of "Blue Gene" supercomputers, a major endorsement for the operating system and the open-source computing model it represents. The decision to adopt Linux came, in part, as a result of the growing size and strength of the open-source community. Thousands of developers around the world are participating in the evolution of Linux. Creating a new OS inside IBM would require a massive engineering effort.

On March 24, 2005, the US Department of Energy announced that Blue Gene/L had broken its own world speed record, reaching 135.5 TFLOPS. This feat was possible because the number of racks was doubled to 32, with each rack holding 1,024 compute nodes. This is still only half of the final configuration with 65,536 compute nodes.

The final Blue Gene/L installation will have a total of 65,536 compute nodes (i.e., 2^16 nodes) and an additional 1,024 I/O nodes. Each compute or I/O node is a single ASIC with associated DRAM memory chips. The ASIC integrates two PowerPC 440 embedded processors, a cache sub-system and communication sub-systems. Each node is attached to three parallel communications networks: a 3D toroidal network for peer-to-peer communication between compute nodes, a collective network for collective communication, and a global interrupt network for fast barriers. The I/O nodes, which run the Linux operating system, provide communication with the world via an Ethernet network. Finally, a separate and private Ethernet network provides access to any node for configuration, booting and diagnostics.

Blue Gene/L compute nodes run a minimal operating system that supports a single process thread and lacks interrupts and virtual memory. To allow multiple programs to run concurrently, compute nodes can be partitioned into electronically isolated sets of nodes. The number of nodes in a partition must be a positive integer power of 2, and a partition must contain at least 2^5 = 32 nodes. The maximum partition is all nodes in the computer. To run a program on Blue Gene/L, a partition of the computer must first be reserved. The program is then run on all the nodes within the partition, and no other program may access nodes within the partition while it is in use. Upon completion, the partition nodes are released for future programs to use.

With so many nodes, components will be failing frequently. Thus, the system will be able to electrically isolate faulty hardware to allow the machine to continue to run.
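The partition-size rule described above (a power of 2, at least 32 nodes, at most the whole machine) can be checked with a few lines of Python. This is only an illustrative sketch: the function names and the default machine size of 65,536 nodes are assumptions chosen to match the figures in the text, not part of any Blue Gene/L software interface.

```python
def is_valid_partition(num_nodes: int, total_nodes: int = 65_536,
                       min_nodes: int = 32) -> bool:
    """Return True if num_nodes satisfies the partition rule in the text."""
    # A positive integer is a power of two when exactly one bit is set.
    is_power_of_two = num_nodes > 0 and (num_nodes & (num_nodes - 1)) == 0
    return is_power_of_two and min_nodes <= num_nodes <= total_nodes


def valid_partition_sizes(total_nodes: int = 65_536,
                          min_nodes: int = 32) -> list:
    """List every allowed partition size, smallest first."""
    sizes = []
    size = min_nodes
    while size <= total_nodes:
        sizes.append(size)
        size *= 2
    return sizes
```

For example, `is_valid_partition(48)` is false because 48 is not a power of 2, while `valid_partition_sizes()` runs from 32 up to the full 65,536-node machine.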

In public relations terms, it is being positioned as the successor of IBM's Deep Blue chess computer; however, it bears little architectural resemblance to Deep Blue.

Technical Details
The performance spectrum
In computing, FLOPS is an abbreviation of FLoating point Operations Per Second. It is used as a measure of a computer's performance, especially in fields of scientific calculation that make heavy use of floating-point calculations. (Note: a hertz is a cycle (or operation) per second. Compare to MIPS, million instructions per second.) One should speak in the singular of a FLOPS and not of a FLOP, although the latter is frequently encountered. The final S stands for second and does not indicate a plural.

Computing devices exhibit an enormous range of performance levels in floating-point applications, so it makes sense to introduce larger units than the FLOPS. The standard SI prefixes can be used for this purpose, resulting in such units as the megaFLOPS (MFLOPS, 10^6 FLOPS), the gigaFLOPS (GFLOPS, 10^9 FLOPS), the teraFLOPS (TFLOPS, 10^12 FLOPS), and the petaFLOPS (PFLOPS, 10^15 FLOPS).

A cheap but modern desktop computer using, for example, a Pentium 4 or Athlon 64 CPU typically runs at a clock frequency in excess of 2 GHz and provides computational performance in the range of a few GFLOPS. Even some video game consoles of late-1990s vintage, such as the Gamecube and Dreamcast, had performance in excess of one GFLOPS.

The original supercomputer, the Cray-1, was set up at Los Alamos National Laboratory in 1976. The Cray-1 was capable of 80 MFLOPS (or, according to another source, 138 to 250 MFLOPS). In fewer than 30 years since then, the computational speed of supercomputers has jumped a millionfold.

The fastest computer in the world as of November 5, 2004, the IBM Blue Gene supercomputer, measures 70.72 TFLOPS. This supercomputer was a prototype of the Blue Gene/L machine IBM is building for the Lawrence Livermore National Laboratory in California. During a speed test on 24 March 2005, it was rated at 135.5 TFLOPS. Blue Gene's new record was achieved by doubling the number of racks to 32. Each rack holds 1,024 compute nodes, yet the chips are the same as those found in high-end computers. The complete version will have a total of 64 racks and a theoretical peak speed of 360 TFLOPS.

Architecture Details
BG/L is built using system-on-a-chip technology in which all functions of a node (except for main memory) are integrated onto a single ASIC. This ASIC includes two 32-bit PowerPC cores (the 440); the 440 was developed for embedded applications. Associated with each core is a 64-bit "double" floating-point unit (FPU) that can operate in SIMD mode. Each (single) FPU can execute up to two multiply-adds per cycle, meaning that the peak performance of the chip is 8 floating-point operations per cycle. Each 440 has its own instruction and data caches (each 32 KB), a small L2 cache that primarily serves as a pre-fetch buffer, a 4 MB shared L3 cache built from embedded DRAM, and a DDR memory controller.

In addition, the logic for five different networks is integrated onto the ASIC. These networks include a JTAG control and monitoring network, a Gbit Ethernet macro, a global barrier and alert network, a "tree" network for broadcasts and combining operations such as those used in the MPI collective communications library, and a three-dimensional torus network for point-to-point communications between nodes.

The ASIC can be used as either an I/O node or as a Compute node. I/O nodes have their Ethernet macro connected to an external switch enabling connectivity to hosts; however, they do not use the torus network. Compute nodes do not connect their Ethernet, and talk to the I/O nodes over the tree network. The Livermore machine will have 64 Compute nodes for each I/O node. I/O nodes will have at least 512 MB and Compute nodes will have at least 256 MB of memory, depending on the cost of memory at the time of delivery.

Because of the high level of integration and relatively low target clock speed (700 MHz target), the system is designed to deliver unprecedented aggregate performance at both low cost and low power consumption. At this clock rate, each node has a peak of 5.6 GFlops, while the 64K node system has a peak of 367 TeraFlops. Each ASIC will consume only 12 watts of power. Because of the low power, a very high density of packaging can be achieved. Two compute ASICs and their associated memory are packaged onto a compute card, 16 compute cards are mounted on a node card, and 16 node cards are packaged in a 512 node midplane. Two midplanes are packaged in a 1024 node rack, which is about the size of a large refrigerator.

Because the 440 core does not contain shared memory support, the L1 caches of the two cores on the same ASIC are not coherent. Memory is consistent from the L2 on out, but software is required to appropriately manage the L1s. The system can operate in one of two modes. In communications coprocessor mode, one core is responsible for computing while the other core handles most messaging functions. Careful software coordination is required in this mode to overcome the lack of L1 coherence. When configured in this mode, the peak performance of the 64K node system is 183 TeraFlops. In the second mode, "virtual node" mode, each core has its own memory space and each core is responsible for both computing and message handling; the system has two sets of network injection and reception FIFOs, so that both cores can simultaneously access the network interfaces.
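The peak-performance and packaging figures quoted above follow from simple arithmetic, which the short Python sketch below reproduces. All constant and function names are illustrative assumptions; only the numbers (700 MHz, two cores, two multiply-adds per FPU per cycle, the card and midplane counts) come from the text.

```python
# Figures taken from the text; the names are hypothetical.
CLOCK_MHZ = 700                  # target clock rate
CORES_PER_NODE = 2               # two PowerPC 440 cores per ASIC
FLOPS_PER_CORE_PER_CYCLE = 4     # 2 multiply-adds per cycle, 2 flops each


def node_peak_gflops(cores_computing: int = CORES_PER_NODE) -> float:
    """Peak GFlops of one node with the given number of cores computing."""
    return CLOCK_MHZ * FLOPS_PER_CORE_PER_CYCLE * cores_computing / 1000


def system_peak_tflops(nodes: int,
                       cores_computing: int = CORES_PER_NODE) -> float:
    """Peak TFlops of a system with the given number of nodes."""
    return nodes * node_peak_gflops(cores_computing) / 1000


# Packaging hierarchy described in the text.
ASICS_PER_COMPUTE_CARD = 2
COMPUTE_CARDS_PER_NODE_CARD = 16
NODE_CARDS_PER_MIDPLANE = 16
MIDPLANES_PER_RACK = 2

NODES_PER_MIDPLANE = (ASICS_PER_COMPUTE_CARD * COMPUTE_CARDS_PER_NODE_CARD
                      * NODE_CARDS_PER_MIDPLANE)
NODES_PER_RACK = NODES_PER_MIDPLANE * MIDPLANES_PER_RACK


def racks_needed(nodes: int) -> int:
    """Number of racks needed to hold the given number of nodes."""
    return -(-nodes // NODES_PER_RACK)  # ceiling division
```

With both cores computing, `system_peak_tflops(65_536)` gives roughly 367 TFlops; passing `cores_computing=1` models communications coprocessor mode and gives roughly 183 TFlops, matching the two figures in the text.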

Design and Analysis of the Blue Gene/L Torus Interconnection Network


Torus Network
Many of the design decisions were driven by simulation performance studies. The torus network uses dynamic routing with virtual cut-through buffering. A torus was chosen because it provides high-bandwidth nearest-neighbor connectivity, which is common in scientific applications, but also for its scalability, cost and packaging considerations. A torus requires no long cables and, because the network is integrated onto the same chip that does computing, no separate switch is required. Previous supercomputers such as the Cray T3E have also used torus networks.

Torus packets are variable in size, from 32 to 256 bytes in increments of 32 byte chunks. The first eight bytes of each packet contain link level protocol information (e.g., sequence number) and routing information including destination, virtual channel and size. A 24-bit CRC is appended to each packet, along with a one byte valid indicator. The CRC permits link level checking of each packet received, and a timeout mechanism is used for retransmission of corrupted packets. The error detection and recovery protocol is similar to that used in IBM SP interconnection networks as well as in the HIPPI standard.

For routing, the header includes six "hint" bits, which indicate in which directions the packet may be routed. For example, hint bits of 100100 mean that the packet can be routed in the x+ and y- directions. Either the x+ or x- hint bit, but not both, may be set. If no x hops are required, the x hint bits are set to 0. Each node maintains registers that contain the coordinates of its neighbors, and hint bits are set to 0 when a packet leaves a node in a direction such that it will arrive at its destination in that dimension. These hint bits appear early in the header, so that arbitration may be efficiently pipelined. The hint bits can be initialized either by software or hardware; if done by hardware, a set of two registers per dimension is used to determine the appropriate directions.
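As a rough illustration of how hint bits might be initialized in software for minimal-hop routing, the Python sketch below derives them from source and destination coordinates. The function name `hint_bits`, the modular-distance calculation, and the tie-break toward the + direction when both ways are equally short are assumptions made for this example; the semantics of the hardware registers are not specified in the text. The bit order x+, x-, y+, y-, z+, z- is inferred from the 100100 example above.

```python
def hint_bits(src, dst, dims):
    """Return the six hint bits as a string in the order x+ x- y+ y- z+ z-.

    src, dst and dims are (x, y, z) tuples; dims holds the torus size in
    each dimension. Assumption: the shorter way around the ring is chosen,
    and an exact tie goes to the + direction.
    """
    bits = []
    for s, d, n in zip(src, dst, dims):
        forward_hops = (d - s) % n       # hops needed going in the + direction
        if forward_hops == 0:
            bits.append("00")            # no hops required in this dimension
        elif forward_hops <= n - forward_hops:
            bits.append("10")            # + direction is minimal
        else:
            bits.append("01")            # - direction is minimal
    return "".join(bits)
```

On an 8x8x8 midplane, `hint_bits((0, 0, 0), (1, 7, 0), (8, 8, 8))` yields `"100100"`: one hop in x+, one hop in y- through the wraparound link, and no z hops, which reproduces the example in the text.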
These registers can be configured to provide minimal hop routing. The routing is accomplished entirely by examining the hint bits and virtual channels, i.e., there are no routing tables. Packets may be either dynamically or statically (xyz) routed. Besides point-to-point packets, a bit in the header may be set that causes a packet to be broadcast down any dimension. The hardware does not have the capability to route around "dead" nodes or links; however, software can set the hint bits appropriately so that such nodes are avoided. Full connectivity can be maintained when there are up to three faulty nodes, provided they are not co-linear.

The torus logic consists of three major units: a processor interface, a send unit and a receive unit. The processor interface consists of network injection and reception FIFOs. Access to these FIFOs is via the double FPU registers, i.e., data is loaded into the FIFOs via 128 bit memory mapped stores from a pair of FPU registers, and data is read from the FIFOs via 128 bit loads to the FPU registers. There are a total of 8 injection FIFOs organized into two groups: two high priority (for inter-node OS messages) and six normal priority FIFOs, which are sufficient for nearest neighbor connectivity. Packets in all FIFOs can go out in any direction. Each group of reception FIFOs contains 7 FIFOs, one high priority and one dedicated to each of the incoming directions. More specifically, there is a dedicated bus between each receiver and its corresponding reception FIFO. Up to six injection and six reception FIFOs may be simultaneously active.

Each of the six receivers, as shown in Figure 1, has four virtual channels (VCs). Multiple VCs help reduce head-of-line blocking [4], but in addition, mesh networks including tori with dynamic routing can deadlock unless appropriate additional "escape" VCs are provided. We use a recent, elegant solution to this problem, the previously proposed "bubble" escape VC. BG/L has two dynamic VCs, one bubble escape VC that can be used both for deadlock prevention and static routing, and one high priority bubble VC. Each VC has 1 KB of buffering, enough for four full-sized packets.
In addition to the VCs, the receivers include a "bypass" channel so that packets can flow through a node without entering the VC buffers, under appropriate circumstances. Dynamic packets can only enter the bubble escape VC if no valid dynamic VCs are available. A token flow control algorithm is used to prevent overflowing the VC buffers. Each token represents a 32 B chunk. For simplicity in the arbiters, a VC is marked as unavailable unless 8 tokens (a full-sized packet) are available. However, token counts for packets on dynamic VCs are incremented and decremented according to the size of the packet.

The bubble rules, as originally proposed, require that tokens for one full-sized packet be available for a packet already on the bubble VC to advance, but that tokens for two full-sized packets be available for a packet to enter the bubble VC, upon either injection, a turn into a new direction, or when a dynamic VC packet enters the bubble. This rule ensures that buffer space for one packet is always available after an insertion and thus some packet can always, eventually, move. However, we discovered that this rule is incomplete for variable-sized packets when our simulator deadlocked using this rule. With this rule, the remaining free space for one full-sized packet can become fragmented, resulting in a potential deadlock. To prevent this, the bubble rules are simply modified so that each packet on the bubble is accounted for as if it were a full-sized (8 chunk) packet.

Eight byte acknowledgement (ack-only) or combined token-acknowledgement (token-ack) packets are returned when packets are either successfully received, or when space has freed up in a VC. Acknowledgements permit the torus send units to delete packets from their retransmission FIFOs, which are used in the error recovery protocol. The send units also arbitrate between requests from the receiver and injection units.

Due to the density of packaging and pin constraints, each link is bit serial. The torus is internally clocked at one-fourth the rate of the processor, so at the target 700 MHz clock rate, each torus link is 175 MB/sec. There are sufficient internal busses so that each of the 6 outgoing and 6 incoming links can be simultaneously busy; thus each node can be sending and receiving 1.05 GB/sec. In addition, there are two transfer busses (paths) coming out of each receiver that connect with the senders. Thus, a single receiver can have up to 4 simultaneous transfers, e.g., one to its normal reception FIFO, one to the high priority reception FIFO, and two to two different senders.
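The modified bubble rules can be made concrete with a small token-accounting model. The Python sketch below is an illustration only, not the hardware logic: the class name `BubbleVC` and its methods are hypothetical, and the buffer is assumed to hold 32 tokens (1 KB of buffering divided into 32-byte chunks, as stated in the text).

```python
CHUNK_BYTES = 32
FULL_PACKET_TOKENS = 8                    # a full-sized packet is 8 chunks
VC_BUFFER_TOKENS = 1024 // CHUNK_BYTES    # 1 KB per VC gives 32 tokens


class BubbleVC:
    """Token accounting for one bubble escape VC (illustrative model)."""

    def __init__(self, tokens: int = VC_BUFFER_TOKENS):
        self.capacity = tokens
        self.free_tokens = tokens

    def can_accept(self, entering: bool) -> bool:
        # Entering the bubble (injection, turn, or arrival from a dynamic VC)
        # needs room for two full-sized packets; advancing along it needs one.
        needed = 2 * FULL_PACKET_TOKENS if entering else FULL_PACKET_TOKENS
        return self.free_tokens >= needed

    def accept(self, packet_chunks: int, entering: bool) -> bool:
        if not 1 <= packet_chunks <= FULL_PACKET_TOKENS:
            raise ValueError("packets are 1 to 8 chunks long")
        if not self.can_accept(entering):
            return False
        # Modified rule: charge every packet as if it were full-sized,
        # so the remaining free space can never become fragmented.
        self.free_tokens -= FULL_PACKET_TOKENS
        return True

    def release(self) -> None:
        """Return the tokens of one packet that has left the buffer."""
        self.free_tokens = min(self.capacity,
                               self.free_tokens + FULL_PACKET_TOKENS)
```

Charging a one-chunk packet the full 8 tokens wastes some buffer space, but it guarantees that whatever free space remains is always a whole number of full-sized packet slots, which is the property the fragmentation fix relies on.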
Arbitration is distributed and pipelined, but occurs in three basic phases. It generalizes an approach used in [3] and represents tradeoffs between complexity, performance, and ability to meet timing constraints.

First, each packet at the head of the injection or VC FIFOs decides in which direction and on what VC it prefers to move. For statically routed packets, there is only one valid choice, but dynamically routed packets may have many choices. The preferred direction and VC are selected using a modified "Join the Shortest Queue" (JSQ) algorithm as follows. The senders provide the receivers and injection FIFOs with a bit indicating both link and token availability for each VC in each direction. This bit vector is and-ed with a bit vector of possible moves constructed from the packet's hint bits and VC. This defines the set of possible and available arbitration requests. In addition, the sender provides 2 bits for each VC indicating one of four ranges of available downstream tokens. Of all the possible and available dynamic direction/VC pairs, the packet selects the one with the most available downstream tokens. Ties are randomly broken. If no dynamic direction/VC combination is available, the packet will request its bubble escape direction/VC pair (if available), and if that is also unavailable, the packet makes no arbitration request. This is a somewhat simplified description since bus availability must also be taken into account. In addition, when a packet reaches its destination, the "direction" requested is simply the corresponding reception FIFO.

Second, since each receiver has multiple VC FIFOs (plus the bypass), an arbitration phase is required to determine which of the requesting packets in the receiver wins the right to request. If a high priority packet is requesting, it wins. Barring that, a modified "Serve the Longest Queue" (SLQ) is used, based on 2 bit (4 ranges) FIFO fullness indicators, i.e., the packet from the most full VC (as measured to within the 2 bits of granularity) wins. However, this cannot always be used since doing so may completely block out a VC. Therefore, a certain (programmable) fraction of the arbitration cycles are designated SLQ cycles in which the above algorithm is used, while the remaining cycles select the winner randomly. A packet on the bypass channel always receives the lowest priority (unless it is a high priority packet).

Third, the receivers and injection FIFOs present their requests to the senders. Note that on a given cycle a receiver will present at most one request to the senders. Thus each sender arbiter can operate independently. The sender gives highest priority to token-ack or ack-only packets, if any. Barring that, the senders tend to favor packets already in the network and use a similar modified SLQ algorithm in which there are SLQ cycles and random cycles. In particular, a certain programmable fraction of cycles (typically 1.0) give priority to packets already in the network (unless the only high priority packet requesting is in an injection FIFO). On such cycles the modified SLQ algorithm is used. Higher priority can be given to injection packets by lowering the above in-network priority fraction. On cycles in which injection packets receive priority (barring in-network high priority packets), the modified SLQ algorithm is also used.
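The first arbitration phase, the modified JSQ selection, can be sketched in a few lines of Python. This is a simplified model under stated assumptions: the function name `jsq_request` and its dictionary-based inputs are hypothetical, direction/VC pairs are represented as tuples, and bus availability (which the text notes must also be considered) is ignored.

```python
import random


def jsq_request(possible, available, token_range, escape, rng=random):
    """Pick the direction/VC pair a packet should request, or None.

    possible    -- dynamic (direction, vc) pairs allowed by the hint bits
    available   -- dict: pair -> bool, the link-and-token bit from the sender
    token_range -- dict: pair -> 0..3, the 2-bit downstream token range
    escape      -- the packet's bubble escape (direction, vc) pair, or None
    rng         -- object with a choice() method, used to break ties
    """
    # "And" the possible moves with the sender-provided availability bits.
    candidates = [pair for pair in possible if available.get(pair, False)]
    if candidates:
        # Join the shortest queue: prefer the most downstream tokens.
        best = max(token_range[pair] for pair in candidates)
        return rng.choice([p for p in candidates if token_range[p] == best])
    # No dynamic choice is available: fall back to the bubble escape VC.
    if escape is not None and available.get(escape, False):
        return escape
    return None  # the packet makes no arbitration request this cycle
```

Because the senders report only four token ranges, several candidates often share the best range, so the random tie-break is exercised frequently; passing a seeded `random.Random` instance as `rng` makes a run reproducible.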

Simulator Overview
Given the complexity and scale of the BG/L interconnection network, having an accurate performance simulator was essential during the design phase of the project. Due to the potential size of such a model, simulation speed was a significant concern and a proven shared memory parallel simulation approach was selected. In particular, parallel simulation on shared memory machines has been shown to be very effective in simulating interconnection networks, whereas success with message passing parallel interconnection network simulators is harder to come by.

We also recognized the difficulties in developing an execution driven simulator for a system with up to 64K processes, and therefore decided upon a simulator that would primarily be driven by application pseudo-codes, in which message passing calls could be easily passed to the simulator; such calls include the time since the last call (the execution burst time), the destination and size of the message, etc. This pseudo-code included a subset of the MPI point-to-point messaging calls as a workload driver for the simulator. We also extended the IBM UTE trace capture utility that runs on IBM SP machines and were able to use such traces as simulator inputs (for up to several hundreds of nodes).

The basic unit of simulation time is a network cycle, which is defined to be the time it takes to transfer one byte. As BG/L is organized around 512 node (8x8x8) midplanes, the simulator partitions its work on a midplane basis, i.e., all nodes on the same midplane are simulated by the same processor (thread) and midplanes are assigned to threads in as even a manner as possible. Because different threads are concurrently executing, the local simulation clocks of the threads need to be properly synchronized.
To deal with this problem, we use a simple but effective "conservative" parallel simulation protocol known as "YAWNS". In particular, we take advantage of the fact that the minimum transit time between midplanes is known and is at least some constant w >= 1 cycles. In this protocol, time "windows" of length w are simulated in parallel by each of the threads. Consider an event that is executed during the window (starting at time t) on processor i that is destined to arrive on processor j in the future; such an event represents the arrival of the first byte of a packet. Since the minimum transit time is w, the arrival cannot occur during the current window, represented by the interval [t, t+w-1]. Processor i simply puts a pointer to the event on an i-to-j linked list. When each processor reaches the end of the window, it enters a barrier synchronization. Upon leaving the barrier, each processor is sure that every other processor has executed all events up to time t+w-1 and that all inter-processor events are on the appropriate inter-processor linked lists. Processor j can therefore go through all its i-to-j linked lists, remove events from them, and put the events on its own future event list. Once this is done, the processors can simulate the next window [t+w, t+2w-1]. If w=1, then this protocol requires a barrier synchronization every cycle; however, on BG/L, the minimum inter-midplane delay will be approximately w=10 network cycles. When a large number of BG/L nodes are being simulated, each processor will execute many events during a window, i.e., between barriers, and thus the simulator should obtain good speedups.

The simulator runs on a 16-way IBM "nighthawk" SMP with 64 GB of memory. The model of the torus hardware contains close to 100 resources per node (links, VC token counters, busses, FIFOs, etc.), so that a full 64K node system can be thought of as a large queuing network with approximately 6 million resources. It consumes a large amount of memory and runs slowly; a 32K node simulation of a fully loaded network advances at about 0.25 microseconds of BG/L time per second of wall clock time. However, it obtains excellent speedup, typically more than 12 on 16 nodes, and sometimes achieves superlinear speedup due to the private 8 MB L3 caches on the SMP and the smaller per node memory footprint of the parallel simulator.
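The window protocol described above can be illustrated with a small sequential emulation. The sketch below is not the authors' simulator: the function `simulate_yawns`, the event tuple layout, and the `handler` callback are all hypothetical, and the logical processors are stepped one after another in a loop where the real simulator would run them as parallel threads separated by a barrier.

```python
import heapq
from itertools import count


def simulate_yawns(num_procs, w, initial_events, handler, end_time):
    """Emulate window-based conservative simulation with lookahead w.

    initial_events -- iterable of (time, proc, payload)
    handler        -- handler(time, proc, payload) returns a list of
                      (delay, dst_proc, payload) events it generates
    Returns the list of executed events as (time, proc, payload).
    """
    seq = count()                       # tie-breaker so payloads never compare
    future = [[] for _ in range(num_procs)]   # per-processor future event list
    for time, proc, payload in initial_events:
        heapq.heappush(future[proc], (time, next(seq), payload))
    executed = []
    t = 0
    while t < end_time:
        window_end = t + w - 1
        # outbox[i][j] plays the role of the i-to-j linked list.
        outbox = [[[] for _ in range(num_procs)] for _ in range(num_procs)]
        for i in range(num_procs):      # parallel threads in a real simulator
            while future[i] and future[i][0][0] <= window_end:
                time, _, payload = heapq.heappop(future[i])
                executed.append((time, i, payload))
                for delay, dst, new_payload in handler(time, i, payload):
                    event = (time + delay, next(seq), new_payload)
                    if dst == i:
                        if delay < 1:
                            raise ValueError("local delay must be >= 1")
                        heapq.heappush(future[i], event)
                    elif delay < w:
                        raise ValueError("inter-processor delay below w")
                    else:
                        outbox[i][dst].append(event)
        # Barrier: every processor has finished the window [t, t+w-1].
        for j in range(num_procs):
            for i in range(num_procs):
                for event in outbox[i][j]:
                    heapq.heappush(future[j], event)
        t += w
    return executed
```

The check that every inter-processor delay is at least w is what makes the protocol conservative: an event created during a window can never need to be executed on another processor before the next barrier.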
The model, which was written before the VHDL, is thought to be a quite accurate representation of the BG/L hardware, although a number of simplifications were made. For example, in BG/L the arbitration is pipelined and occurs over several cycles. In the simulator, this is modeled as a delay of several cycles followed by presentation of the arbitration request. Because the simulator focuses on what happens once packets are inside the network, a gross simplification was the assumption that the injection FIFOs were of infinite size, and that packets are placed in these FIFOs as early as possible rather than as space frees up in the FIFOs. This has little effect on network response time and throughput measurements during the middle of a run, but can affect the dynamics, particularly near the end of runs. The simulator also did not model the error recovery protocol, i.e., no link errors were simulated and the ack-only packets that are occasionally sent if a link is idle for a long time were not modeled. However, the arbitration algorithms and token flow control are modeled to a high level of detail.

Sample Performance Studies

In this section3 #e present some e8amples of use of the simulator to stud! desi n trade; offs in BG/L. The studies presented are illustrati(e and sometimes use assumptions and correspondin parameters about the s!stem that do not reflect the final BG/L desi n. -esponse Time in Li ht TrafficH Ai ure ) plots the response time for (arious 3)2 node BG/L confi urations #hen the #or$load dri(er enerates pac$ets for random destinations and the pac$et eneration rate is lo# enou h so that the a(era e lin$ utiliFation is less than one. This Ai ure compares static routin to d!namic routin #ith one or more d!namic 9Cs and one or more busses /paths0 connectin recei(ers to senders. &impler3 random3 arbitration rules than &LP and =&P #ere used and the plot #as enerated earl! in our studies #hen the tar et lin$ band#idth #as 35< 4B/sec. /The 35< 4B/sec. assumption essentiall! onl! affects results b! a rescalin of the !;a8is.0 The fi ure sho#s the clear benefit of d!namic o(er static routin . It also sho#s that there is little benefit in increasin the number of d!namic 9Cs unless the number of paths is also increased. Ainall!3 it sho#s onl! mar inal benefit in oin from a ) 9C/) path to + 9C/+ path confi uration. All;to;AllH 4*ISAlltoAll is an important 4*I collecti(e communications operation in #hich e(er! node sends a different messa e to e(er! other node. plots the a(era e lin$ utiliFation durin the communications pattern implied b! this collecti(e. The Ai ure a ain sho#s the benefit of d!namic o(er static routin . Aor this pattern3 there is mar inal benefit in oin from 1 to ) d!namic 9Cs3 but #hat is important is that the a(era e lin$

utilization is, at approximately 98%, close to the theoretical peak. This peak includes the overhead for the token-ack packets, the packet headers, and the 4-byte CRC trailers. A reasonable assumption for the BG/L software is that each packet carries 240 bytes of payload, and with this assumption the plot shows that the payload occupies 87% of the links. Not shown in these plots is the fact that a very low percentage of the traffic flows on the escape bubble VC and that statistics collected during the run showed that few of the VC buffers are full. Three-dimensional FFT algorithms often require the equivalent of an All-to-All, but on a subset of the nodes consisting of either a plane or a line in the torus. Simulations of these communications patterns also resulted in near-peak performance.

The above simulation was for a symmetric BG/L. The situation is less favorable for an asymmetric BG/L. For example, the 64K node system will be a 64x32x32 node torus. In such a system, the average number of hops in the x dimension is twice that of the y and z dimensions, so that even if every x link is 100% busy, the y and z links can be at most 50% busy. Thus, the peak link utilization is at most 66.7%. Since 12% of that is overhead, the best possible payload utilization is 59%. However, we expect significantly more blocking and throughput degradation due to full VC buffers. Indeed, a simulation of the All-to-All communications pattern on a 32x16x16 torus resulted in an average link utilization of 49% and a payload utilization of 44%, corresponding to 74% of the best possible payload utilization. This figure is probably somewhat pessimistic due to the simulator artifact of infinite-sized injection FIFOs, which distorts the effects at the end of the simulation. We also believe that appropriate injection flow control software algorithms can reduce VC buffer blocking and achieve closer to peak performance. Nevertheless, the above study points out a disadvantage of the torus architecture for asymmetric machines in which the application cannot be easily mapped so as to result in a close-proximity communications pattern.

Virtual Channel Architecture: Here we consider several different deadlock prevention escape VC architectures. The first proposal has two escape VCs per direction. Each dimension has a "dateline." Before crossing the dateline, the escape VC is the lower numbered of the pair, but after crossing the dateline the escape VC is the higher numbered of the pair. In addition we consider dimension ordered or direction ordered escape VCs. In dimension ordered, the escape VC is x first, then y if no x hops remain, then z if no x or y hops remain. In direction ordered, the escape VCs are ordered by x+, y+, z+, x-, y-, z- (other orderings are possible). We also consider dimension and direction ordered escape VCs for the bubble escape.

We again use the hot region workload where the hot region starts at coordinates (0,0,0) and the datelines are set at the maximum coordinate value in each dimension. One figure plots the throughput as a function of time. The dimension ordered dateline pair shows particularly poor and erratic behavior, with a steep decline in throughput, followed by a rise and then another steep decline. A second figure plots the throughput on a per-VC basis for a longer period of time. The decreasing and increasing bandwidth waves persist even over this much longer time scale. An appreciable fraction of the traffic flows on the escape VCs, indicating a high level of VC buffer occupation.

What causes these waves? First, the placement of the dateline causes an asymmetry in the torus, whereas the bubble escape is perfectly symmetrical in each dimension. Since there are two escape VCs, we thought it likely that packets at the head of the VC buffers could be waiting for one of the escape VCs while tokens are returned for the other escape VC. In such a situation, no packets could move even though the link may be available and downstream buffer space is available. To confirm this, the simulator was instrumented to collect additional statistics. In particular, we measured the fraction of the time that a returned token-ack frees at least one previously blocked packet to move. A third figure plots this unblocking probability along with the throughput as a function of time. The unblocking probability is relatively constant for the bubble (after the initial decline), but varies directly with the throughput for the dateline pair; when the unblocking probability increases, the throughput increases, and vice versa.
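The escape VC orderings described above can be sketched in a few lines of Python. This is an illustrative sketch only, not the BG/L router logic: the function names, the representation of remaining hops as a signed count per dimension, and the numbering of the dateline pair as VC 0 and VC 1 are all assumptions made for the example.

```python
# Illustrative sketch of the escape VC orderings; all names are hypothetical.
# hops maps each dimension to the signed number of hops still to travel
# (positive means the + direction, negative means the - direction).

DIMENSION_ORDER = ("x", "y", "z")
DIRECTION_ORDER = (("x", +1), ("y", +1), ("z", +1),
                   ("x", -1), ("y", -1), ("z", -1))


def dimension_ordered_escape(hops):
    """x first, then y if no x hops remain, then z if no x or y hops remain."""
    for dim in DIMENSION_ORDER:
        if hops[dim] != 0:
            return dim, (1 if hops[dim] > 0 else -1)
    return None  # packet has arrived


def direction_ordered_escape(hops):
    """Directions are tried in the order x+, y+, z+, x-, y-, z-."""
    for dim, sign in DIRECTION_ORDER:
        if hops[dim] * sign > 0:
            return dim, sign
    return None  # packet has arrived


def dateline_escape_vc(crossed_dateline):
    """Dateline pair: lower-numbered VC before the dateline, higher after."""
    return 1 if crossed_dateline else 0


remaining = {"x": 0, "y": -2, "z": 3}
print(dimension_ordered_escape(remaining))   # next escape hop, dimension order
print(direction_ordered_escape(remaining))   # next escape hop, direction order
```

For the same packet the two orderings can pick different next hops, which is why the choice of ordering affects how traffic piles up on the escape VCs.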
Performance Verification: To verify the VHDL logic of the torus, we built a multi-node verification testbench. This testbench, which runs on the Cadence VHDL simulator, consisted of workload drivers that inject packets into the injection FIFOs, links between nodes on which bits could be corrupted to test the error recovery protocol, and packet checkers that pull packets out of the reception FIFOs and check them for a variety of conditions, such as whether the packet arrived at the correct destination and whether its contents were received correctly. The workload drivers could be flexibly configured to simulate a number of different traffic patterns.

As we neared the end of the logic verification process, we wanted to ensure that network performance was as intended. One of the benchmarks we tested was the All-to-All. The VHDL simulator was limited (by memory) to a maximum of 64 nodes, so we simulated both a 4x4x4 torus and an 8x8x1 torus and compared the average link utilizations to those predicted by the performance simulator. While these agreed to within 2%, the VHDL (corresponding to the actual network hardware) indicated that VC buffers were fuller than predicted by the performance simulator. A close inspection of the arbitration logic revealed that a one-cycle gap in the arbitration pipeline of the receivers could occur when all possible outgoing links/VCs were busy. This gap was sufficient to permit packets from the injection FIFOs to sneak into the network, leading to fuller VCs than intended. A simple fix to eliminate this possibility was implemented, and subsequent VHDL simulations indicated greatly reduced levels of VC buffer occupation.
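As a cross-check on All-to-All link-utilization figures of this kind, the bound quoted earlier for asymmetric tori (66.7% peak link utilization and 59% payload utilization on a 64x32x32 machine) can be recomputed from the torus dimensions alone. The sketch below assumes an idealized uniform All-to-All with minimal routing, in which the load on a link is proportional to the average hop count in its dimension. The function names are hypothetical, and the 12% overhead default is the figure quoted in the text.

```python
# Idealized bound on All-to-All link utilization in a torus (a sketch, not the
# BG/L performance simulator). All function names are hypothetical.

def mean_hops(n):
    """Average minimal hop count along one torus dimension of size n."""
    return sum(min(k, n - k) for k in range(n)) / n


def peak_link_utilization(dims):
    """Upper bound on average link utilization when the busiest dimension
    is 100% busy. Dimensions of size 1 have no links and are ignored."""
    loads = [mean_hops(n) for n in dims if n > 1]
    busiest = max(loads)
    return sum(load / busiest for load in loads) / len(loads)


def peak_payload_utilization(dims, overhead_fraction=0.12):
    """Payload share of the peak, using the 12% overhead quoted in the text."""
    return peak_link_utilization(dims) * (1.0 - overhead_fraction)


for dims in [(8, 8, 8), (64, 32, 32), (32, 16, 16)]:
    print(dims,
          round(100 * peak_link_utilization(dims), 1),
          round(100 * peak_payload_utilization(dims), 1))
```

Under these assumptions only the ratios between dimension sizes matter, so the 32x16x16 torus used in the simulation has the same bound as the full 64x32x32 machine.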

Application
Machines like Blue Gene/L are designed to handle data-intensive applications such as content distribution, simulation and modeling, web serving, data mining, and business intelligence. Another important application is predicting how chains of biochemical building blocks described by DNA fold into proteins, massive molecules such as hemoglobin. Most biological functions involve proteins, and while a protein's chemical composition is determined by a sequence of amino acids joined like links of a chain, a protein folds into a highly complex, three-dimensional shape such as illustrated in the two figures below.

Fig 1.

It is hypothesized that the shape of a protein is the principal determinant of its function. Arbitrary strings of amino acids do not, in general, fold into a well-defined three-dimensional structure, but evolution has selected out the proteins used in biological processes for their ability to fold reproducibly (sometimes with assistance, sometimes without) into a particular three-dimensional structure within a relatively short time. Some diseases are actually caused by slight misfoldings of a particular protein. Understanding the mechanisms that cause a string of amino acids to fold into a specific three-dimensional structure is an outstanding scientific challenge. Appropriate use of large-scale biomolecular simulation to study protein folding is expected to shed significant light on this process. Extensive collaborations with the biological research community will be needed to find the best way of applying the unique computational resources available to the Blue Gene project to advance our understanding of protein folding. The level of performance provided by Blue Gene (sufficient to simulate the folding of a small protein in a year of running time) is expected to enable a tremendous increase in the scale of simulations that can be carried out as compared with existing supercomputers.

The scientific community considers protein folding one of the most significant "grand challenges": a fundamental problem in science or engineering that has broad economic and scientific impact and whose solution can be advanced only by applying high-performance computing technologies.

Proteins control all cellular processes in the human body. Comprising strings of amino acids that are joined like links of a chain, a protein folds into a highly complex, three-dimensional shape that determines its function. Any change in shape dramatically alters the function of a protein, and even the slightest change in the folding process can turn a desirable protein into a cause of disease. Better understanding of how proteins fold will give scientists and doctors better insight into diseases and ways to combat them. Pharmaceutical companies could design high-tech prescription drugs customized to the specific needs of individual people. And doctors could respond more rapidly to changes in bacteria and viruses that cause them to become drug-resistant.

The human genome is currently thought to contain approximately 40,000 genes, which code for a much larger number of proteins through alternative splicing and post-translational modification, a molecular toolkit assembled to handle a huge diversity of functions. An understanding of how proteins function is essential for understanding the cell life cycle and metabolism, how cells send signals to their environment, and how cells receive and process signals from their environment. An understanding of protein structure and function can serve as a basis for innovation in new therapies, diagnostic devices, and even industrial applications. When proteins fold into the wrong structure, the results can be fatal; for example, "mad cow" disease probably results from an autocatalyzed wrong fold in the prion protein [6], and cystic fibrosis is also connected with protein (mis)folding.

Protein architecture.

Protein architecture [8] is based on three principles: (1) the formation of a polymer chain; (2) the folding of this chain into a compact, function-enabling structure, or native structure; and (3) post-translational modification of the folded structure.

The protein chain (or peptide chain if short in length) is a heteropolymer built up from alpha amino acid monomers, as shown in Figure 2. The sequence of amino acid residues in the peptide chain is termed the primary structure of the protein. The 20 different choices for each amino acid in the chain give the possibility of enormous diversity, even for small proteins. For example, a peptide of 30 residues yields the astonishing number of about $20^{30}$, or approximately $10^{39}$, possible unique sequences.
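The sequence count quoted above is easy to check with exact integer arithmetic. The short sketch below does so; the function name is hypothetical, and the alphabet size of 20 and chain length of 30 are the values given in the text.

```python
import math


def count_sequences(length, alphabet_size=20):
    """Number of distinct chains of the given length over the amino acid alphabet."""
    return alphabet_size ** length


n = count_sequences(30)
print(n)               # exact integer value of 20**30
print(math.log10(n))   # order of magnitude (base-10 exponent)
```

Because each added residue multiplies the count by 20, exhaustive enumeration of sequences is out of reach even for short peptides.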

Fig 2.

The protein folding problem.

There are two important facets to the protein folding problem: prediction of three-dimensional structure from amino acid sequence, and understanding the mechanisms and pathways whereby the three-dimensional structure forms within biologically relevant timescales. The prediction of structure from sequence data is the subject of an enormous amount of research and a series of conferences that assess the state of the art in structure prediction [9]. While this area is extremely important, good progress in the area of structural predictions has been made using only modest amounts of computational power. The effort described in this paper is aimed at improving our understanding of the mechanisms behind protein folding, rather than at structure prediction. Even though biologists have been most interested in structure prediction, there has been an increasing recognition of the role that misfolding of proteins plays in certain disease processes, notably Alzheimer's disease and mad cow disease [6]. The section that follows describes some of the fundamental reasons for interest in the process of protein folding.

Current view of folding mechanisms.

A simplistic but illustrative way of viewing protein folding is to note that the amino acid R groups (see Figure 2, caption) fall into three main classes: (1) charged, (2) hydrophilic ("water-loving"), and (3) hydrophobic ("water-hating"). In the simplest picture, the folded state of the peptide chain is stabilized primarily (for a globular protein in water) by the sequestration of many of the hydrophobic groups into the core of the protein, out of contact with water, while the hydrophilic and charged groups remain in contact with water. The stability can be described in terms of the Gibbs free-energy change $\Delta G$:

$\Delta G = \Delta H - T \Delta S$,

where $\Delta H$ is the enthalpy change, $\Delta S$ is the entropy change, and $T$ is the absolute temperature. $\Delta H$ is negative due to the more favorable hydrophobic interactions in the folded state, but so is $\Delta S$, because the folded state is much more ordered and has lower entropy than the unfolded state. The balance between the enthalpy and entropy terms is a delicate one, and the total free-energy change is only of order 15 kilocalories per mole. Evidently the internal hydrophobic/external hydrophilic packing requirement places strong constraints on the amino acid sequence, as does the requirement that the native state be kinetically accessible.

It is helpful to think of the physics of the folding process as a "free-energy funnel." Since the folding process is slow relative to motions at atomic scale, we can think of partially folded configurations as having a quasi-equilibrium value of the free energy. The free energy surface may be displayed as a function of some reduced-dimensionality representation of the system configuration in a given state of the protein [12]. The most unfolded configurations are the most numerous, but have the highest free energy, and occur on the rim of the funnel. Going into the funnel represents a loss of number of configurations (decrease of entropy), but a gradual decrease in free energy, until the native state with very few configurations and the lowest free energy is reached at the bottom of the funnel. The walls of the funnel contain only relatively shallow subsidiary minima, which can trap the folding protein in non-native states, but only for a short time. The evolution of the system as it folds can now be described in terms of the funnel. The system starts off in a physically probable state on the rim of the funnel, and then makes transitions to a series of physically accessible states within the funnel, until the bottom of the funnel is gradually approached.

Figure 3 illustrates folding. Here the unfolded peptide chain on the left already contains some folded secondary structure, alpha helices (red), and a beta hairpin (blue). It is still a long way from the compact native structure at right. The folding process in different proteins spans an enormous dynamic range, from approximately 20 microseconds to approximately 1 second.
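The delicate enthalpy-entropy balance behind this picture can be illustrated numerically. In the sketch below the enthalpy and entropy values are hypothetical, chosen only so that the result lands near the order of 15 kilocalories per mole mentioned in the text; they are not measurements for any real protein, and the function name is likewise an assumption.

```python
def gibbs_free_energy_change(delta_h, delta_s, temperature):
    """Delta G = Delta H - T * Delta S (energies in kcal/mol, T in kelvin)."""
    return delta_h - temperature * delta_s


# Hypothetical folding values, for illustration only.
temperature = 298.15      # K, roughly room temperature
delta_h = -100.0          # kcal/mol, favorable enthalpy change on folding
delta_s = -0.285          # kcal/(mol*K), entropy lost on folding

delta_g = gibbs_free_energy_change(delta_h, delta_s, temperature)
print(delta_g)
```

Two large terms of opposite effect nearly cancel, so a small relative error in either one changes the predicted stability substantially.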

Fig 3.

The scientific knowledge derived from research on protein folding can potentially be applied to a variety of related life sciences problems of great scientific and commercial interest, including: (1) protein-drug interactions (docking); (2) enzyme catalysis (through use of hybrid quantum and classical methods); and (3) refinement of protein structures created through other methods.

We shall also explore the use of Blue Gene in other scientific computing areas. We expect that lessons learned from this project will apply to future high-performance IBM systems in a broader range of scientific and commercial applications. Examples of those applications include the modeling of the aging and properties of materials, and the modeling of turbulence. This technology opens the door to a number of applications of great interest to civilian industry and business, such as biology and other life sciences. The future of US high-performance computing will benefit tremendously from pursuing both of these paths in parallel.

"One day, you're going to be able to walk into a doctor's office and have a computer analyze a tissue sample, identify the pathogen that ails you, and then instantly prescribe a treatment best suited to your specific illness and individual genetic makeup."

Consider the following three types of protein science studies that might employ large-scale numerical simulation techniques: (1) structure prediction; (2) folding pathway characterization; and (3) folding kinetics.

Protein structure prediction can be carried out using a large number of techniques [8] and, as previously discussed, it is unnecessary to spend a "petaflop year" on the prediction of a single protein structure. That said, there is some reason to believe that atomistic simulation techniques may be useful in refining structures obtained by other methods.

Folding pathway characterization typically involves the study of thermodynamic properties of a protein in quasi-equilibrium during the folding process. Mapping out the free-energy "landscape" that the protein traverses as it samples conformations during the folding process can give insights into the nature of intermediate states along the folding pathway and into the "ruggedness" of the free-energy surface that is traversed during this process. Because such studies involve computations of average values of selected functions of the system's state, one has the choice of either averaging over time as the system samples a large number of states (molecular dynamics) or averaging over configurations (Monte Carlo). Aggressive sampling techniques that may improve the computational efficiency with which such averages can be computed can be used to good effect in these studies. Simulation techniques to compute these averages over the appropriate thermodynamic ensembles are available.

Simulation studies of folding kinetics are aimed at understanding the rates at which the protein makes transitions between various conformations. In this case, the calculation of thermodynamic averages is not enough; the actual dynamics of the system must be simulated with sufficient accuracy to allow estimation of rates. Of course, a large number of transition events must be simulated in order to derive rate estimates with reasonable statistical uncertainties. Another challenge faced in such simulations is that the simulation techniques used to reproduce thermodynamic averages in ensembles other than constant particle number, volume, and energy (NVE) are, strictly speaking, inappropriate for studies of folding kinetics.
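The idea of averaging over configurations (Monte Carlo) rather than over time can be sketched with a Metropolis sampler on a toy system. This is a minimal illustration under stated assumptions, not the sampling method of the Blue Gene project: the one-dimensional harmonic energy function, the reduced units with kT = 1, the step size, and all function names are hypothetical choices for the example.

```python
import math
import random


def metropolis_average(observable, energy, n_steps, kT=1.0, step=1.5, seed=0):
    """Estimate a canonical-ensemble average of observable(x) by Metropolis sampling."""
    rng = random.Random(seed)
    x = 0.0
    e = energy(x)
    total = 0.0
    for _ in range(n_steps):
        trial = x + rng.uniform(-step, step)      # propose a new configuration
        e_trial = energy(trial)
        # Accept downhill moves always, uphill moves with Boltzmann probability.
        if e_trial <= e or rng.random() < math.exp(-(e_trial - e) / kT):
            x, e = trial, e_trial
        total += observable(x)                    # rejected moves count the old state again
    return total / n_steps


def harmonic_energy(x):
    """Toy energy surface: a single harmonic well with unit force constant."""
    return 0.5 * x * x


mean_square = metropolis_average(lambda x: x * x, harmonic_energy, n_steps=200_000)
print(mean_square)   # should approach kT / k = 1 for this toy well
```

A sampler like this yields equilibrium averages only. It carries no information about real time, which is why the kinetics studies described above require actual dynamics.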

Challenges for computational modeling

The current expectation is that it will be sufficient to use classical techniques, such as molecular dynamics (MD), to model proteins in the Blue Gene project. This is because many aspects of the protein folding process do not involve the making and breaking of covalent bonds. While disulfide bonds play a role in many protein structures, their formation will not be addressed by classical atomistic simulations. In classical atomistic approaches, a model for the interatomic interactions is used. This is known as a potential, or force field, since the forces on all the particles can be computed from it, if one has its mathematical expression and all its parameters. The MD approach is to compute all the forces on all the atoms of the computer model of the protein and solvent, then use those forces to compute the new positions of all the atoms a very short time later. By doing this repeatedly, a trajectory of the atoms of the system can be traced out, producing atomic coordinates as a function of time.
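The force-then-move loop described above can be sketched for a toy system of two particles on a line joined by a harmonic spring. This is a minimal sketch, not the Blue Gene MD code: the velocity Verlet integrator is an assumed choice (the text does not name an integration scheme), and the spring "force field", its parameters, and all function names are hypothetical.

```python
def spring_forces(positions, k=1.0, r0=1.0):
    """Toy force field: two particles on a line joined by a harmonic spring."""
    stretch = (positions[1] - positions[0]) - r0
    return [k * stretch, -k * stretch]


def total_energy(positions, velocities, masses, k=1.0, r0=1.0):
    """Kinetic plus potential energy of the two-particle system."""
    stretch = (positions[1] - positions[0]) - r0
    kinetic = sum(0.5 * m * v * v for m, v in zip(masses, velocities))
    return kinetic + 0.5 * k * stretch * stretch


def run_md(positions, velocities, masses, dt, n_steps, force_fn=spring_forces):
    """Velocity Verlet: compute forces, advance a short time dt, repeat."""
    x, v = list(positions), list(velocities)
    f = force_fn(x)
    trajectory = [tuple(x)]
    for _ in range(n_steps):
        v = [vi + 0.5 * dt * fi / mi for vi, fi, mi in zip(v, f, masses)]  # half kick
        x = [xi + dt * vi for xi, vi in zip(x, v)]                         # move
        f = force_fn(x)                                                    # new forces
        v = [vi + 0.5 * dt * fi / mi for vi, fi, mi in zip(v, f, masses)]  # half kick
        trajectory.append(tuple(x))
    return trajectory, v


trajectory, final_velocities = run_md([0.0, 1.5], [0.0, 0.0], [1.0, 1.0],
                                      dt=0.01, n_steps=1000)
print(trajectory[-1])
```

The returned trajectory is the list of atomic coordinates as a function of time. In a real protein simulation the force evaluation over all atom pairs dominates the cost, which is what motivates a machine of this scale.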

Conclusion
"Blue Gene" is an ambitious project to expand the horizons of supercomputing, with the ultimate goal of creating a system that can perform one quadrillion calculations per second, or one petaflop. IBM is hoping that expanded performance, more efficient data access for processors, and lower operational costs will give Blue Gene a big leg up in the world of high-performance computing.

The Blue Gene project represents a unique opportunity to explore novel research into a number of areas, including machine architecture, programming models, algorithmic techniques, and biomolecular simulation science. Every aspect of this highly adventurous project involves significant challenges. Carrying out our planned program will require a collaborative effort across many disciplines and the involvement of the worldwide scientific and technical community. In particular, the scientific program will engage with the life sciences community in order to make best use of this unique computational resource.

References
www.research.ibm.com/bluegene
www.research.ibm/journals/sj/402/allen.html
www.bio-itworld.com/news/071503-report2898.html
www.linuxdevices.com

Adiga et al. (2002). An Overview of the BG/L Supercomputer. Proceedings of the 2002 Supercomputing Conference. www.scconference.org/sc2002/

Benveniste, C. and Heidelberger, P. (1995). Parallel Simulation of the IBM Interconnection Network. In Proceedings of the 1995 Winter Simulation Conference. IEEE Computer Society Press, 584-589.

Dally, W.J. (1992). Virtual-Channel Flow Control. IEEE Transactions on Parallel and Distributed Systems 3, No. 2, 194-205.

Puente, V., Beivide, R., Gregorio, J.A., Prellezo, J.M., Duato, J., and Izu, C. (1999). Adaptive Bubble Router: A Design to Improve Performance in Torus Networks. In Proceedings of the 1999 International Conference on Parallel Processing, 58-67.

Dally, W.J. and Seitz, C.L. (1987). Deadlock-Free Message Routing in Multiprocessor Interconnection Networks. IEEE Transactions on Computers C-36, No. 5, 547-553.

Dickens, P.M., Heidelberger, P., and Nicol, D.M. (1996). Parallelized Direct Execution Simulation of Message-Passing Parallel Programs. IEEE Transactions on Parallel and Distributed Systems 7, No. 10, 1090-1105.

IBM Research Report on torus interconnection network by M. Blumrich, D. Chen, P. Coteus, A. Gara, M. Giampapa, and P. Heidelberger.
