
The Tiger Video Fileserver

William J. Bolosky, Joseph S. Barrera, III,
Richard P. Draves, Robert P. Fitzgerald,
Garth A. Gibson, Michael B. Jones, Steven P. Levi,
Nathan P. Myhrvold, Richard F. Rashid

April, 1996

Technical Report
MSR-TR-96-09

Microsoft Research
Advanced Technology Division
Microsoft Corporation
One Microsoft Way
Redmond, WA 98052

Paper presented at the Sixth International Workshop on Network and Operating System Support for Digital Audio and Video (NOSSDAV 96), April, 1996.
The Tiger Video Fileserver

William J. Bolosky, Joseph S. Barrera, III, Richard P. Draves, Robert P. Fitzgerald, Garth A. Gibson (1),
Michael B. Jones, Steven P. Levi, Nathan P. Myhrvold, Richard F. Rashid

Microsoft Research, Microsoft Corporation
One Microsoft Way
Redmond, WA 98052

bolosky@microsoft.com, mbj@microsoft.com

Abstract

Tiger is a distributed, fault-tolerant real-time fileserver. It provides data streams at a constant, guaranteed rate to a large number of clients, in addition to supporting more traditional filesystem operations. It is intended to be the basis for multimedia (video on demand) fileservers, but may also be used in other applications needing constant rate data delivery. The fundamental problem addressed by the Tiger design is that of efficiently balancing user load against limited disk, network and I/O bus resources. Tiger accomplishes this balancing by striping file data across all disks and all computers in the (distributed) system, and then allocating data streams in a schedule that rotates across the disks. This paper describes the Tiger design and an implementation that runs on a collection of personal computers connected by an ATM switch.

1. Introduction

As computers and the networks to which they are attached become more capable, the demand for richer data types is increasing. In particular, temporally continuous types such as video and audio are becoming popular. Today, most applications that use these types store the data on local media such as CD-ROM or hard disk drives because they are the only available devices that meet the necessary storage capacity, bandwidth and latency requirements. However, the recent rapid increase in popularity of the internet and the increases in corporate and university networking infrastructure indicate that network servers for continuous media will increasingly become feasible. Networked servers have substantial advantages over local storage, principally that they can store a wider range of content, and that the stored content can be changed much more easily.

Serving continuous media over a network presents a host of problems. The network itself may limit bandwidth, introduce jitter or lose data. The load presented by users may be uneven. The content on the server must be easily changeable, and the server must be highly available. The server must be inexpensive and easy to maintain.

The paper describes Tiger, a continuous media fileserver. Tiger's goal is to produce a potentially large number of constant bit rate streams over a switched network. It aims to be inexpensive, scalable and highly available. A Tiger system is built out of a collection of personal computers with attached high-speed disk drives connected by a switched broadband network. Because it is built from extremely high volume, commodity parts, Tiger achieves low cost by exploiting the volume curves of the personal computer and networking industries. Tiger balances the load presented by users by striping the data of each file across all of the disks and all of the machines within the system. Mirroring yields fault tolerance for failures of disks, disk interfaces, network interfaces or entire computers.

This paper describes the work done by Microsoft Research on the Tiger prototype. Most of the work described in this paper was done in 1993 and 1994. Any products based on Tiger may differ from the prototype.

2. The Tiger Architecture

Tiger is organized as a real-time distributed system running on a collection of computers connected by a high-speed network. Each of these computers is called a cub. Every cub has some number of disks dedicated to storing the data in the Tiger filesystem as well as a disk used for running the operating system. All cubs in a particular system are the same type of computer, with the same type of network interface(s) and the same number and type of disks. Tiger stripes its file data across all of the disks in the system, with a stripe granularity chosen to balance disk efficiency, buffering requirements and stream start latency. Fault tolerance is provided by data mirroring.

(1) Garth Gibson is at Carnegie Mellon University.
The fundamental problem solved by Tiger is to efficiently balance the disk storage and bandwidth requirements of the users across the system. A disk device provides a certain amount of storage and a certain amount of bandwidth that can be used to access its storage. Sourcing large numbers of streams requires the server to have a great amount of bandwidth, while holding a large amount of content requires much storage capacity. However, most of the bandwidth demanded may be for a small fraction of the storage used; that is to say, some files may be much more popular than others. By striping all of the system's files across all of its disks, Tiger is able to divorce the drives' storage from their bandwidth, and in so doing to balance the load.

[Figure 1: Basic Tiger Hardware Layout. The controller and cubs 0 through n are joined by a low-bandwidth control network; each cub has disks on a SCSI bus and an ATM fiber link into the ATM switching fabric, which carries the data outputs.]

In addition to the cubs, the Tiger system has a central controller machine. This controller is used as a contact point for clients of the system, as the system clock master, and for a few other low-effort bookkeeping activities. None of the file data ever passes through the controller, and once a file has begun streaming to a client, the controller takes no action until the end of file is reached or the client requests termination of the stream.

In order to ensure that system resources exist to stream a file to a client, Tiger must avoid having more than one client requiring service from a single disk at a given time. Inserting pauses or glitches into a running stream if resources are overcommitted would violate the basic intent of a continuous media server. Instead, Tiger briefly delays a request to start streaming a file so that the request is offset from all others in progress, and hence will not compete for hardware resources with the other streams. This technique is implemented using the Tiger schedule, which is described in detail in section 3.

Because the primary purpose of Tiger is to supply time-critical data, network transmission protocols (such as TCP) that rely on retransmission for reliability are inappropriate. By the time an omission was detected and the data was retransmitted, the data would no longer be useful to the client. Furthermore, when running over an ATM network, the quality of service guarantees provided by ATM ensure that little if any data is lost in the network. Tiger thus uses UDP, and so any data that does not make it through the network on the first try is lost. As the results in section 6 demonstrate, in practice very little data is lost in the network.

Tiger makes no assumptions about the content of the files it serves, other than that all of the files on a particular server have the same bit rate. We have run MPEG-1, MPEG-2, uncompressed AVI, compressed AVI, various audio formats, and debugging files containing only test data. We commonly run Tiger servers that simultaneously send files of different types to different clients.

We implemented Tiger on a collection of personal computers running Microsoft Windows NT. Personal computers have an excellent price/performance ratio, and NT turned out to be a good implementation platform.

3. Scheduling

The key to Tiger's ability to efficiently exploit distributed I/O systems is its schedule. The schedule is composed of a list of schedule slots, which provide service to at most one viewer. The size of the schedule is determined by the capacity of the whole system; there is exactly one schedule slot for each potential simultaneous output stream. The schedule is distributed between the cubs, which use it to send the appropriate block of data to a viewer at the correct time.

The unit of striping in Tiger is the block; each block of a file is stored on the drive following the one holding its predecessor. The block size is typically in the range of 64 KBytes to 1 MByte, and is the same for every block of every file on a particular Tiger system. The time it takes to play a block at the given stream rate is called the block play time. If the disks are the bottleneck in the system, the worst-case time that it takes a disk to read a block, together with some time reserved to cover for failed disks, is known as the block service time. If some other resource such as the network is the bottleneck, the block service time is made to be large enough to not overwhelm that resource. Every disk in a Tiger system walks down the schedule, processing a schedule slot every block service time. Disks are offset from one another in the schedule by a block play time. That is, disks proceed through the schedule in such a way that every block play time each schedule slot is serviced by exactly one disk. Furthermore, the disk that services the schedule slot is the successor to the disk that serviced the schedule slot one block play time previously.
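As an illustration of the rotation just described, the sketch below (our own, not code from the Tiger prototype) models which slot a disk services at a given time, assuming each disk advances one slot per block service time and consecutive disks are offset by one block play time; the demo uses the three-disk, eight-slot configuration of Figure 2 below.

    # A minimal sketch (ours) of the rotating schedule described above: each disk
    # advances one slot per block service time, and consecutive disks trail one
    # another by one block play time, so every slot is serviced by exactly one
    # disk each block play time.

    def slot_serviced(disk, t, n_slots, block_service_time, block_play_time):
        """Schedule slot that `disk` services at time t (None before its start)."""
        local_time = t - disk * block_play_time   # successor disks trail by one play time
        if local_time < 0:
            return None
        return int(local_time // block_service_time) % n_slots

    if __name__ == "__main__":
        # The three-disk, eight-slot example of Figure 2; for the service and
        # play rates to match, the block play time must be 8/3 block service times.
        n_disks, n_slots, t_service = 3, 8, 1.0
        t_play = n_slots / n_disks * t_service
        for t in range(8):
            print(t, [slot_serviced(d, float(t), n_slots, t_service, t_play)
                      for d in range(n_disks)])

Running the demo shows, for instance, that disk 1 services slot 0 one block play time after disk 0 did, which is the successor property stated above.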
In practice, disks usually run somewhat ahead of the schedule, in order to minimize disruptions due to transient events. That is, by using a modest amount of buffer and taking advantage of the fact that the schedule is known well ahead of time, the disks are able to do their reads somewhat before they're due, and to use the extra time to allow for events such as disk thermal recalibrations, short-term busy periods on the I/O or SCSI busses, or other short disruptions. The cub still delivers its data to the network at the time appointed by the schedule, regardless of how early the disk read completes.

[Figure 2: Typical Tiger Schedule. Eight schedule slots (slot 0/viewer 4, slot 1/viewer 3, slot 2/free, slot 3/viewer 0, slot 4/viewer 5, slot 5/viewer 2, slot 6/free, slot 7/viewer 1) walked by disks 0, 1 and 2, which are spaced one block play time apart and spend one block service time on each slot.]

Figure 2 shows a typical Tiger schedule for a three-disk, eight-viewer system. The disks all walk down the disk schedule simultaneously. When they reach the end of the slot that they're processing, the cub delivers the data read for that viewer to the network, which sends it to the viewer. When a disk reaches the bottom of the schedule, it wraps to the top. In this example, disk 1 is about to deliver the data for viewer 0 (in slot 3) to the network. One block play time from the instant shown in the figure, disk 2 will be about to hand the next block for viewer 0 to the network. Because the files are striped across the disks, disk 2 holds the block after the one being read by disk 1.

Striping and scheduling in this way takes advantage of several properties of video streams and the hardware architecture. Namely, that there is sufficient buffering in the cubs to speed-match between the disks and output streams; that files (movies) usually are long relative to the product of the number of disks in the system and the block play time; and that all output streams are of the same (constant) bandwidth.

Buffering is inherently necessary in any system that tries to match a bursty data source with a smooth data consumer. Disks are bursty producers, while video rendering devices are smooth consumers. Placing the buffering in the cubs rather than the clients makes better use of memory and smoothes the load presented to the network.

Because files are long relative to the number of disks in the system, each file has data on each disk. Then, even if all clients are viewing the same file, the resources of the entire system will be available to serve all of the clients. If files are short relative to the number of disks, then the maximum number of simultaneous streams for a particular file will be limited by the bandwidth of the disks on which parts of the file lie. Because the striping unit is typically a second of video, a two-hour movie could be spread over more than 7,000 disks, yielding an upper limit on scaling that is the aggregate streaming capacity of that many disks; for 2 Mbit/s streams on Seagate ST15150N (4 GByte Barracuda) disks, this is over 60,000 streams. Beyond that level, it is necessary to replicate content in order to have more simultaneous viewers of a single file. Furthermore, the network switching infrastructure may impose a tighter bound than the disks.

Having all output streams be of the same bandwidth allows the allocation of simple, equal-sized schedule slots.
When a viewer requests service from a Tiger system, the controller machine sends the request to the cub holding the first block of the file to be sent, which then adds the viewer to the schedule. When the disk containing the first block of data processes the appropriate schedule slot, the viewer's stream will begin; all subsequent blocks will be delivered on time. When a cub receives a request for new service, it selects the next unused schedule slot that will be processed by the disk holding the viewer's first block. This task is isomorphic to inserting an element into an open hash table with sequential chaining. Knuth [Knuth 73] reports that, on average, this takes (1 + 1/(1-a)^2)/2 probes, where a is the fraction of the slots that are full. In Tiger, the duration of a probe is one block service time. Since service requests may arrive in the middle of a slot on the appropriate drive, on average an additional one half of a block service time is spent waiting for the first full slot boundary. This delay is in addition to any latency in sending the start playing request to the server by the client, and any latency in the network delivering the data from the cub to the client. Knuth's model predicts behavior only in the case that viewers never leave the schedule; if they do, latency will be less than predicted by the model.
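Under that model, a rough estimate of the scheduling component of start latency can be written down directly. The sketch below is ours; it adopts one simple reading of the model (the expected probe time plus the average half-slot wait) and uses the 221 ms block service time of the configuration measured in section 6 purely as an example.

    # Rough sketch (ours) of the expected scheduling delay implied by the model
    # above: Knuth's expected probe count for open hashing with sequential
    # chaining, each probe costing one block service time, plus an average half
    # block service time waiting for the next slot boundary.

    def expected_scheduling_delay(slot_occupancy, block_service_time):
        """slot_occupancy is `a`, the fraction of schedule slots that are full."""
        probes = (1.0 + 1.0 / (1.0 - slot_occupancy) ** 2) / 2.0
        return (probes + 0.5) * block_service_time

    block_service_time = 0.221    # seconds, as configured in section 6
    for a in (0.25, 0.5, 0.75, 0.9):
        print(a, round(expected_scheduling_delay(a, block_service_time), 2), "s")

At 90% slot occupancy this estimate exceeds ten seconds, which is consistent with the observation in section 6 that Tiger servers would not normally be run at very high loads because of the increased startup latency.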
Tiger supports traditional filesystem read and write operations in addition to continuous, scheduled play. These operations are not handled through the schedule, and are of lower priority than all scheduled operations. They are passed to the cub holding the block to be read or written. When a cub has idle disk and network capacity, it processes these non-scheduled operations. In practice, because the block service time is based on worst-case performance, because extra capacity is reserved for operation when system components are failed (see section 4), and because not all schedule slots will be filled, there is usually capacity left over to complete non-scheduled operations, even when the system load is near rated capacity.

Non-scheduled operations are used to implement fast forward and fast reverse (FF/FR). A viewer requests that parts of a file be sent within specific time bounds, and Tiger issues non-scheduled operations to try to deliver the requested data to the user. If system capacity is not available to provide the requested data within the provided time bounds, the request is discarded. In practice, the clients' FF/FR requests require less disk and network bandwidth than scheduled play, and almost always complete, even at high system load and when components have failed.

4. Availability

Tiger systems can grow quite large, and are usually built out of off-the-shelf personal computer components. Large Tiger systems will have large numbers of these components. As a result, component failures will occur from time to time. Because of the striping of all files across all disks, without fault tolerance a failure in any part of the system would result in a disruption of service to all clients. To avoid such a disruption, Tiger is fault tolerant. Tiger is designed to survive the failure of any cub or disk. Simple extensions will allow tolerance of controller machine failures. Tiger does not provide for tolerating failures of the switching network. At an added cost, Tiger systems may be constructed using highly available network components.

Tiger uses mirroring, wherein two copies of the data are stored on two different disks, rather than parity encoding. There are several reasons for this approach. Because Tiger must survive not only the failure of a disk, but also the failure of a cub or network interface card, using commercial RAID [Patterson et al. 88] arrays (in which fault tolerance is provided only in the disk system) will not meet the requirements. Building RAID-like stripe sets across machines and still meeting Tiger's timeliness guarantees would consume a very large amount of both network bandwidth and buffer memory when a disk is failed. Furthermore, in order to provide the bandwidth to supply a large number of streams, Tiger systems will have to have a large number of disks even when they're not required for storage capacity. Combined with the extremely low cost per megabyte of today's disk storage, these factors led us to choose mirroring as the basis for our fault tolerance strategy.

A problem with mirroring in a video fileserver is that the data from the secondary copy is needed at the time the primary data would have been delivered. The disk holding the secondary copy of the data must continue to supply its normal primary load as well as the extra load due to the failed disk. A straightforward solution to this problem would result in reserving half of the system bandwidth for failure recovery. Since system bandwidth (unlike disk capacity) is likely to be the primary determiner of overall system cost, Tiger uses a different scheme. Tiger declusters its mirrored data; that is, a block of data having its primary copy on disk i has its secondary copy spread out over disks i+1 through i+d, where d is the decluster factor. Disks i+1 through i+d are called disk i's secondaries. Thus, when a disk fails, its load is shared among d different disks. In order not to overload any other disk when a disk fails, it is necessary to guarantee that every scheduled read to a failed disk uses an equal amount of bandwidth from each of the failed disk's secondaries. So, every block is split into d pieces and spread over the next d disks.
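A small sketch may make the declustered placement concrete. The code below is ours, not the prototype's; in particular, putting piece j of a block on disk i+1+j is one natural reading of "spread over the next d disks", and the exact piece layout used by Tiger is not specified here.

    # Sketch (ours) of declustered mirroring: the primary copy of a block lives
    # on disk i, and the block is split into d pieces whose secondary copies
    # live on disks i+1 .. i+d, so a failed disk's load is shared by d disks.

    def secondary_locations(primary_disk, n_disks, d):
        """Disks holding the d secondary pieces of a block stored on primary_disk."""
        # Piece j on disk primary_disk + 1 + j is one plausible layout; the text
        # only requires the pieces to be spread evenly over the next d disks.
        return [(primary_disk + 1 + j) % n_disks for j in range(d)]

    def disks_affected_by_failure(failed_disk, n_disks, d):
        """Disks holding the failed disk's secondaries, and disks whose
        secondaries the failed disk held (2*d disks in total, as noted below)."""
        holders = set(secondary_locations(failed_disk, n_disks, d))
        owners = {(failed_disk - 1 - j) % n_disks for j in range(d)}
        return holders, owners

    if __name__ == "__main__":
        n_disks, d = 15, 4      # the measured configuration of section 6
        print(secondary_locations(3, n_disks, d))        # disks 4, 5, 6, 7
        print(disks_affected_by_failure(3, n_disks, d))  # disks 4-7 and disks 14, 0, 1, 2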
There are tradeoffs involved in selecting a decluster factor. Increasing it reduces the amount of network and I/O bus bandwidth that must be reserved for failure recovery, because it increases the number of machines over which the work of a failed cub is spread. Increasing the decluster factor also reduces the amount of disk bandwidth that must be reserved for failed-mode operation, but larger decluster factors result in smaller disk reads, which decrease disk efficiency because the same seek and rotation overheads and command initiation time are amortized over fewer bytes transferred. Larger decluster factors also increase the window of vulnerability to catastrophic failure: the system fails completely only when both copies of some data are lost. When a disk fails, the number of other disks that hold its secondaries, or whose secondaries it holds, is twice the decluster factor, because any disk holds the secondaries for the previous d disks; the next d disks after any disk hold its secondaries. We expect that decluster factors from 4 to 8 are most appropriate with a block size of .5 Mbytes if the system bottleneck is the disks. Backplane- or network-limited systems may benefit from larger d.

Bits stored on modern disks must occupy a certain linear amount of magnetic oxide (as opposed to subtending a fixed angle). Since the tracks on the outside part of a disk are longer than those on the inner part, more bits can be stored on the outer tracks. Because the disk rotates at a constant speed, the outer tracks store more data than the inner ones and the data transfer rate from the outer tracks is greater than that from the inner ones. Tiger exploits this fact by placing the secondaries on the inner, slower half of the disk. Because of declustering, during failures at least d times more data will be read from the primary (outer) region of any given disk than from its secondary (inner) region. So, when computing the worst-case block service time and taking into account the spare capacity that must be left for reading secondaries, the primary would be read from the middle of the disk (the slowest part of the primary region) and the secondary from the inside. These bandwidth differences can be considerable: we measured 7.1 MB/s reading sequentially from the outside of a 4GB Seagate Barracuda ST15150N formatted to 2 Kbyte sectors, 6.6 MB/s at the middle and 4.5 MB/s at the hub. Given our measured worst-case seek and rotation delays of 31 ms from end to end and 27 ms from outside to middle, the time to read both a 1/2 MByte block from the outer half of the disk and a 1/8 MByte secondary from the inner half is 162 ms. Reading them both from near the inner half of the disk (the worst case without the Tiger inner/outer optimization) would take 201 ms.
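The 162 ms and 201 ms figures can be approximately reproduced with a simple seek-plus-transfer model. The breakdown below is our own back-of-the-envelope check; in particular, which of the two quoted seek and rotation figures applies to each read is our guess, chosen so that the totals match the published numbers.

    # Back-of-the-envelope check (ours) of the two read times quoted above,
    # using a (seek + rotation) + transfer model and the measured zone bandwidths.

    def read_time_ms(block_mbytes, bandwidth_mb_per_s, seek_rotate_ms):
        return seek_rotate_ms + 1000.0 * block_mbytes / bandwidth_mb_per_s

    # With the inner/outer optimization: a 1/2 MB primary read at roughly the
    # middle of the disk (6.6 MB/s, assumed 27 ms seek and rotation) plus a
    # 1/8 MB secondary read at the hub (4.5 MB/s, assumed 31 ms).
    optimized = read_time_ms(0.5, 6.6, 27) + read_time_ms(0.125, 4.5, 31)

    # Without it: both reads near the inner half of the disk at 4.5 MB/s.
    unoptimized = read_time_ms(0.5, 4.5, 31) + read_time_ms(0.125, 4.5, 31)

    print(round(optimized), round(unoptimized))   # about 162 and 201 ms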
                  Disk 0          Disk 1          Disk 2
    Rim (fast)    Primary 0       Primary 1       Primary 2
    Middle        Secondary 2.0   Secondary 0.0   Secondary 1.0
    Hub (slow)    Secondary 1.1   Secondary 2.1   Secondary 0.1

Figure 3: Disk Layout, Decluster 2

Figure 3 shows the layout of a three-disk Tiger system with a decluster factor of 2. Here, "Primary n" is the primary data from disk n, while "Secondary n.m" is part m of the secondary copy of the data from disk n. Since d=2, m is either 0 or 1. In a real Tiger layout, all secondaries stored on a particular disk (like 2.0 and 1.1 stored on disk 0) are interleaved by block, and are only shown here as being stored consecutively for illustrative purposes.

If a disk suffers a failure resulting in permanent loss of data, Tiger will reconstruct the lost data when the failed disk is replaced with a working one. After reconstruction is completed, Tiger automatically brings the drive online.

The failure of an entire cub is logically similar to losing all of the disks on the cub at the same time. The only additional problems are detecting the failure and dealing with connection cleanup, as well as reconnection when the cub comes back online. Failure detection is accomplished by a deadman protocol: each cub is responsible for watching the cub to its left and sending periodic pings to the cub to its right. Because drives are striped across cubs (i.e., cub 0 has drives 0, n, 2n, etc., where n is the number of cubs), drives on one cub do not hold mirror copies of data for another drive on that cub unless the decluster factor is greater than or equal to the number of cubs, in which case the system cannot tolerate cub failures.
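The constraint in the last sentence can be checked mechanically: with the round-robin drive-to-cub assignment described above and secondaries on the next d drives, a drive and the drives holding its secondaries fall on distinct cubs exactly when d is smaller than the number of cubs. The check below is our own illustration, not code from the prototype.

    # Sketch (ours) of the round-robin drive-to-cub assignment described above
    # and of the condition under which a cub failure can be tolerated.

    def cub_of(drive, n_cubs):
        # Drives are striped across cubs: cub 0 has drives 0, n, 2n, ...,
        # so drive k lives on cub k mod n_cubs.
        return drive % n_cubs

    def cub_failure_tolerable(n_drives, n_cubs, d):
        """True if no drive shares a cub with any drive holding its secondaries."""
        for drive in range(n_drives):
            for j in range(1, d + 1):
                if cub_of(drive, n_cubs) == cub_of((drive + j) % n_drives, n_cubs):
                    return False
        return True

    if __name__ == "__main__":
        # 5 cubs with 3 data drives each and decluster factor 4 (section 6):
        print(cub_failure_tolerable(15, 5, 4))   # True, since d < number of cubs
        # A decluster factor equal to the number of cubs wraps onto the same cub:
        print(cub_failure_tolerable(15, 5, 5))   # False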
5. Network Support

The network components of Tiger provide the crucial function of combining blocks from different disks into single streams. In other words, Tiger's network switch performs the same function as the I/O backplane on supercomputer-based video servers: it recombines the file data that has been striped across numerous disks. Tiger systems may be configured with any type of switched network, typically either ATM or switched ethernet. ATM has the advantage of providing quality of service guarantees to video streams, so that they will not suffer degradation as other load is placed on the switch, but is not designed to handle a single data stream originating from multiple sources.

Microsoft has evangelized a multipoint-to-point (funnel) ATM virtual circuit standard in the ATM forum. In this standard, a normal point-to-point virtual circuit is established, and after establishment other machines are able to join in to the circuit. It is the responsibility of the senders to assure that all of the cells of one packet have been transmitted before any cells of any other packet are sent. When Tiger uses an ATM network, it assures packet integrity by passing a token among the senders to a particular funnel, thus preventing multiple simultaneous senders on a funnel.

Tiger's only non-standard kernel-mode code is a UDP network software component that implements the token protocol as well as sending all of the data packets. This special component allocates the buffer memory used to hold file data and wires it down. It is able to perform network sends without doing any copying or inspection of the data, but rather simply has the network device driver DMA directly out of the memory into which the disk places the file data. Because there is no copying or inspection of Tiger file data, each data bit passes the I/O and memory busses exactly twice: once traveling from the disk to the memory buffer, and then once again traveling from memory to the network device.
6. Measurements

We have built many Tiger configurations at Microsoft, running video and audio streams at bitrates from 14.4 Kbits/s to 8 Mbits/s and using several output networks ranging from the internet through switched 10 and 100 Mbit/s ethernet to 100 Mbit/s and OC-3 (155 Mbit/s) ATM. This paper describes an ATM configuration set up for 6 Mbit/s video streams. It uses five cubs, each of which is a Gateway Pentium 133 MHz personal computer with 48 Mbytes of RAM, a PCI bus, three Seagate ST32550W 2 Gbyte drives and a single FORE Systems OC-3 ATM adapter. The cubs each have two Adaptec 2940W SCSI controllers, one controller having two of the data disks and the boot disk, and the other having one data disk. The data disks have been formatted to have a sector size of 2 Kbytes rather than the more traditional 512 bytes. Larger sectors improve the drive speeds as well as increasing the amount of useful space on the drive. The Tiger controller machine is a Gateway 486/66. It is not on the ATM network and communicates with the cubs over a 10 Mbit/s Ethernet. The controller machine is many times slower than the cubs. The cubs communicate among themselves over the ATM network. Ten 486/66 machines attached to the ATM network by 100 Mbit/s fiber links serve as clients. Each of these machines is capable of receiving several simultaneous 6 Mbit/s streams. For the purpose of data collection, we ran a special client application that does not render any video, but rather simply makes sure that the expected data arrives on time. This client allows more than one stream to be received by a single computer.

This 15-disk Tiger system is capable of storing slightly more than 6 hours of content at 6 Mbits/s. It is configured for .75 Mbyte blocks (hence a block play time of 1s) and a decluster factor of 4. According to our measurements, in the worst case each of the disks is capable of delivering about 4.7 primary streams while doing its part in covering for a failed peer. Thus, the 15 disks in the system can deliver at most 70 streams. Each of the Fore ATM NICs is able to sustain 17 simultaneous streams at 6 Mbits/s because some bandwidth is used for control communication and due to limitations in the FORE hardware and firmware. Since 20% of the capacity is reserved for failed-mode operation, the network cards limit the system to 68 streams. We measured the PCI busses on the cubs delivering 65 Mbytes/s of data from disk to memory; they are more than fast enough to keep up with the disk and network needs in this Tiger configuration. Therefore the NICs are the system bottleneck, giving a rated capacity of 68 streams.
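The rated capacity follows from the figures in this paragraph as the smaller of the disk bound and the network interface bound; the lines below are our own restatement of that bottleneck arithmetic.

    # Our restatement of the capacity analysis above: the rated capacity is the
    # smaller of the disk bound and the network-interface bound.

    n_disks, streams_per_disk_failed_mode = 15, 4.7
    disk_bound = int(n_disks * streams_per_disk_failed_mode)               # 70 streams

    n_cubs, streams_per_nic, failed_mode_reserve = 5, 17, 0.20
    nic_bound = int(n_cubs * streams_per_nic * (1 - failed_mode_reserve))  # 68 streams

    rated_capacity = min(disk_bound, nic_bound)
    print(disk_bound, nic_bound, rated_capacity)    # 70 68 68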
We ran two experiments: unfailed and failed. The first experiment consisted of loading up the system with none of the components failed. The second experiment had one of the cubs (and consequently all of its disks) failed for the entire duration of the run. In each of the experiments, we ramped the system up to its full rated capacity of 68 streams.

Both consisted of increasing the load on the server by adding one stream at a time, waiting for at least 100s and then recording various system load factors. Because we are using relatively slow machines for clients, they are unable to read as many streams as would fit in on the 100 Mbit/s ATM lines attached to them. To simulate more load, we ran some of the clients as "black holes": they receive more data than they can process, and ignore all incoming data packets. The other clients generated reports if they did not see all the data that they expected.

The system was loaded with 24 different files with a mean length of about 900 seconds. The clients randomly selected a file, played it from beginning to end and repeated. Because the clients' starts were staggered and the cubs' buffer caches were relatively small, there was a low probability of a buffer cache hit. The overall cache hit rate was less than 1% over the entire run for each of the experiments. The disks were almost entirely full, so reads were distributed across the entire disks and were not concentrated in the faster, outer portions.

The most important measurement of Tiger system performance is its success in reliably delivering data on time. We measured this in two ways in our experiments. When the server does not send a block to the network because the disk operation failed to complete on time or because the server is running behind schedule, it reports that fact. When a non-black-hole client fails to receive an expected block, it also reports it. In the no-failure experiment, neither the server nor the clients reported any data loss among the more than 400,000 blocks that were sent. In the test with one cub failed, the server failed to place 94 blocks on the network and the clients reported a comparable number of blocks not arriving, for a lost block rate just over 0.02%. Most of these undelivered blocks happened in three distinct events, none of which occurred at the highest load levels. The system was able to deliver streams at its rated capacity without substantial losses.

In addition to measuring undelivered blocks, we also measured the load on various system components. In particular, we measured the CPU load on the controller machine and cubs, and the disk loading. CPU load is as reported by the perfmon tool from Microsoft Windows NT. Perfmon samples the one-second NT CPU load average every second, and keeps 100 seconds of history. It reports the mean load from the last 100 seconds. For the cubs, the mean of all cubs is reported; the cubs typically had loads very close to one another, so the mean is representative. Disk loading is the percentage of time during which the disk was waiting for an I/O completion.

A fourth measurement is the startup latency of each stream. This latency is measured from the time that the client completes its connection to the Tiger to the time that the first block has completely arrived at the client. Because the block play time is 1 second and the ATM network sends the blocks at just over the 6 Mbit/s rate, one second of latency in each start is due to the block transmission. Clients other than our test program could begin using the data before the entire block arrives. In this configuration the block service time is set to 221 ms so as not to overrun the network, so that much time is added to the startup latency in order to read the block from the disk. Therefore, the smallest latency seen is about 1.3s. The measured latencies are not averages, but rather actual latencies of single stream starts, which explains why the curve is so jumpy. We do not expect that Tiger servers would be run at very high loads because of the increased startup latency, so the latencies at the highest loads are not typical of what Tiger users would see.

Figures 4 and 5 show the measured numbers for the normal operation and one-cub-failed tests, respectively. The three mean load measurements should be read against the left-hand y-axis scale, while the startup latency curve uses the right-hand scale.

An important observation is that the machines' loads increase as you would expect: the cubs' load increases linearly in the number of streams, while the controller's is relatively constant. Most of the work done by the controller is updating its display with status information.

Even with one cub failed and the system at its rated maximum, the cubs didn't exceed 50% mean CPU usage. Tiger ran its disks at over 75% duty cycle while still delivering all streams in a timely and reliable fashion. Each disk delivered 4.25 Mbytes/s when running at load.

At the highest load, 68 streams at 6 Mbits/s were being delivered by 4 cubs. This means that each cub was sustaining a send rate of 12.75 Mbytes/s of file data, and 25.5 Mbytes/s over its I/O bus, as well as handling all appropriate overheads. We feel confident that these machines could handle a higher load, given more capable network adapters.
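The per-cub rates in the last paragraph follow directly from the peak stream count; the arithmetic below is our own restatement.

    # Our restatement of the per-cub throughput arithmetic above.
    streams, stream_mbits, surviving_cubs = 68, 6, 4

    send_rate_mbytes = streams * stream_mbits / 8 / surviving_cubs   # 12.75 MB/s of file data per cub
    io_bus_mbytes = 2 * send_rate_mbytes                             # 25.5 MB/s: disk-to-memory plus memory-to-NIC
    print(send_rate_mbytes, io_bus_mbytes)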
7. Related Work

Tiger systems are typically built entirely of commodity hardware components, allowing them to take advantage of commodity hardware price curves. By contrast, other commercial video servers, such as those produced by Silicon Graphics [Nelson et al. 95] and Oracle [Laursen et al. 94], tend to rely on supercomputer backplane speeds or massively parallel memory and I/O systems in order to provide the needed bandwidth.

These servers also tend to allocate entire copies of movies at single servers, requiring that content be replicated across a number of servers proportional to the expected demand for the content. Tiger, by contrast, stripes all content, eliminating the need for additional replicas to satisfy changing load requirements.

Also, while other systems tend to rely on RAID disk arrays [Patterson et al. 88] to provide fault tolerance, Tiger will transparently tolerate failures of both individual disks and entire server machines. As well as striping across disks in an array, Tiger stripes data across server machines.

[Berson et al. 94] proposes an independently developed single-machine disk striping algorithm with some similarities to that used by Tiger.

8. Conclusions

Tiger is a successful exercise in providing data- and bandwidth-intensive computing via a collection of commodity components. Tiger achieves low cost and high reliability by using commodity computing hardware and a distributed, real-time fault-tolerant software design. Striping of data across many loosely connected personal computers allows the system to handle highly unbalanced loads without having to replicate data to increase bandwidth. Tiger's scheduling algorithm is able to provide very high quality of service without introducing unacceptable startup latency.
Acknowledgments

Yoram Bernet, Jan de Rie, John Douceur, Craig Dowell, Erik Hedberg and Craig Link also contributed to the Tiger server design and implementation.

References

[Berson et al. 94] S. Berson, S. Ghandeharizadeh, R. Muntz, X. Ju. Staggered Striping in Multimedia Information Systems. In ACM SIGMOD 94, pages 79-90.

[Knuth 73] Donald E. Knuth. The Art of Computer Programming, Volume 3: Sorting and Searching, pages 520-521. Addison-Wesley, 1973.

[Laursen et al. 94] Andrew Laursen, Jeffrey Olkin, Mark Porter. Oracle Media Server: Providing Consumer Based Interactive Access to Multimedia Data. In ACM SIGMOD 94, pages 470-477.

[Nelson et al. 95] Michael N. Nelson, Mark Linton, Susan Owicki. A Highly Available, Scalable ITV System. In SOSP 15, pages 54-67.

[Patterson et al. 88] D. Patterson, G. Gibson, R. Katz. A Case for Redundant Arrays of Inexpensive Disks (RAID). In ACM SIGMOD 88, pages 109-116.

Some or all of the work presented in this paper may be covered by patents, patents pending, and/or copyright. Publication of this paper does not grant any rights to any intellectual property. All rights reserved.

[Figure 4: Tiger loads during normal operations. Chart titled "Tiger Loads, All Cubs Running": mean cub CPU load, Tiger controller CPU load and mean disk load in percent on the left axis, and startup latency in seconds on the right axis, plotted against the number of active streams.]
[Figure 5: Tiger loads running with one cub of five failed and a decluster factor of 4. Chart titled "Tiger Loads, One Cub Failed": mean cub CPU load, Tiger controller CPU load and mean disk load in percent on the left axis, and startup latency in seconds on the right axis, plotted against the number of active streams.]
