
Draft

ZFS On-Disk Specification

Sun Microsystems, Inc. 4150 Network Circle Santa Clara, CA 95054 U.S.A

©2006 Sun Microsystems, Inc. 4150 Network Circle Santa Clara, CA 95054 U.S.A.

This product or document is protected by copyright and distributed under licenses restricting its use, copying, distribution, and decompilation. No part of this product or document may be reproduced in any form by any means without prior written authorization of Sun and its licensors, if any. Third-party software, including font technology, is copyrighted and licensed from Sun suppliers.

Parts of the product may be derived from Berkeley BSD systems, licensed from the University of California.

Sun, Sun Microsystems, the Sun logo, Java, JavaServer Pages, Solaris, and StorEdge are trademarks or registered trademarks of Sun Microsystems, Inc. in the U.S. and other countries.

U.S. Government Rights Commercial software. Government users are subject to the Sun Microsystems, Inc. standard license agreement and applicable provisions of the FAR and its supplements.

DOCUMENTATION IS PROVIDED AS IS AND ALL EXPRESS OR IMPLIED CONDITIONS, REPRESENTATIONS AND WARRANTIES, INCLUDING ANY IMPLIED WARRANTY OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE OR NON-INFRINGEMENT, ARE DISCLAIMED, EXCEPT TO THE EXTENT THAT SUCH DISCLAIMERS ARE HELD TO BE LEGALLY INVALID.

Unless otherwise licensed, use of this software is authorized pursuant to the terms of the license found at: http://developers.sun.com/berkeley_license.html

Table of Contents
Introduction
Chapter One: Virtual Devices (vdevs), Vdev Labels, and Boot Block
    Section 1.1: Virtual Devices
    Section 1.2: Vdev Labels
        Section 1.2.1: Label Redundancy
        Section 1.2.2: Transactional Two Staged Label Update
    Section 1.3: Vdev Technical Details
        Section 1.3.1: Blank Space
        Section 1.3.2: Boot Block Header
        Section 1.3.3: Name-Value Pair List
        Section 1.3.4: The Uberblock
    Section 1.4: Boot Block
Chapter Two: Block Pointers and Indirect Blocks
    Section 2.1: DVA - Data Virtual Address
    Section 2.2: GRID
    Section 2.3: GANG
    Section 2.4: Checksum
    Section 2.5: Compression
    Section 2.6: Block Size
    Section 2.7: Endian
    Section 2.8: Type
    Section 2.9: Level
    Section 2.10: Fill
    Section 2.11: Birth Transaction
    Section 2.12: Padding
Chapter Three: Data Management Unit
    Section 3.1: Objects
    Section 3.2: Object Sets
Chapter Four: DSL
    Section 4.1: DSL Infrastructure
    Section 4.2: DSL Implementation Details
    Section 4.3: Dataset Internals
    Section 4.4: DSL Directory Internals
Chapter Five: ZAP
    Section 5.1: The Micro Zap
    Section 5.2: The Fat Zap
        Section 5.2.1: zap_phys_t
        Section 5.2.2: Pointer Table
        Section 5.2.3: zap_leaf_phys_t
        Section 5.2.4: zap_leaf_chunk
Chapter Six: ZPL
    Section 6.1: ZPL Filesystem Layout
    Section 6.2: Directories and Directory Traversal
    Section 6.3: ZFS Access Control Lists
Chapter Seven: ZFS Intent Log
    Section 7.1: ZIL header
    Section 7.2: ZIL blocks
Chapter Eight: ZVOL (ZFS volume)

Introduction
ZFS is a new filesystem technology that provides immense capacity (128-bit), provable data integrity, always-consistent on-disk format, self-optimizing performance, and real-time remote replication.

ZFS departs from traditional filesystems by eliminating the concept of volumes. Instead, ZFS filesystems share a common storage pool consisting of writeable storage media. Media can be added or removed from the pool as filesystem capacity requirements change. Filesystems dynamically grow and shrink as needed without the need to re-partition underlying storage.

ZFS provides a truly consistent on-disk format by using a copy-on-write (COW) transaction model. This model ensures that on-disk data is never overwritten and all on-disk updates are done atomically.

The ZFS software is composed of seven distinct pieces: the SPA (Storage Pool Allocator), the DSL (Dataset and Snapshot Layer), the DMU (Data Management Unit), the ZAP (ZFS Attribute Processor), the ZPL (ZFS Posix Layer), the ZIL (ZFS Intent Log), and ZVOL (ZFS Volume). The on-disk structures associated with each of these pieces are explained in the following chapters: SPA (Chapters 1 and 2), DSL (Chapter 4), DMU (Chapter 3), ZAP (Chapter 5), ZPL (Chapter 6), ZIL (Chapter 7), ZVOL (Chapter 8).

Chapter One: Virtual Devices (vdevs), Vdev Labels, and Boot Block
Section 1.1: Virtual Devices
ZFS storage pools are made up of a collection of virtual devices. There are two types of virtual devices: physical virtual devices (sometimes called leaf vdevs) and logical virtual devices (sometimes called interior vdevs). A physical vdev is a writeable media block device (a disk, for example). A logical vdev is a conceptual grouping of physical vdevs.

Vdevs are arranged in a tree with physical vdevs existing as leaves of the tree. All pools have a special logical vdev called the "root" vdev which roots the tree. All direct children of the "root" vdev (physical or logical) are called top-level vdevs. The illustration below shows a tree of vdevs representing a sample pool configuration containing two mirrors. The first mirror (labeled "M1") contains two disks, represented by "vdev A" and "vdev B". Likewise, the second mirror "M2" contains two disks represented by "vdev C" and "vdev D". Vdevs A, B, C, and D are all physical vdevs. "M1" and "M2" are logical vdevs; they are also top-level vdevs since they originate from the "root vdev".

[Figure: vdev tree. Internal/logical vdevs: "root vdev". Top-level vdevs: "M1" vdev (mirror A/B) and "M2" vdev (mirror C/D). Physical/leaf vdevs: "A" vdev (disk), "B" vdev (disk), "C" vdev (disk), "D" vdev (disk).]

Illustration 1: vdev tree sample configuration

Section 1.2: Vdev Labels

Each physical vdev within a storage pool contains a 256KB structure called a vdev label. The vdev label contains information describing this particular physical vdev and all other vdevs which share a common top-level vdev as an ancestor. For example, the vdev label structure contained on vdev "C", in the previous illustration, would contain information describing the following vdevs: "C", "D", and "M2". The contents of the vdev label are described in greater detail in section 1.3, Vdev Technical Details.

The vdev label serves two purposes: it provides access to a pool's contents and it is used to verify a pool's integrity and availability. To ensure that the vdev label is always available and always valid, redundancy and a staged update model are used. To provide redundancy, four copies of the label are written to each physical vdev within the pool. The four copies are identical within a vdev, but are not identical across vdevs in the pool. During label updates, a two staged transactional approach is used to ensure that a valid vdev label is always available on disk. Vdev label redundancy and the transactional update model are described in more detail below.

Section 1.2.1: Label Redundancy

Four copies of the vdev label are written to each physical vdev within a ZFS storage pool. Aside from the small time frame during label update (described below), these four labels are identical and any copy can be used to access and verify the contents of the pool. When a device is added to the pool, ZFS places two labels at the front of the device and two labels at the back of the device. The drawing below shows the layout of these labels on a device of size N; L0 and L1 represent the front two labels, L2 and L3 represent the back two labels.
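[Figure: label layout on a device of size N, with offsets marked at 0, 256K, 512K, N-512K, N-256K, and N. L0 and L1 occupy the first 512K of the device; L2 and L3 occupy the last 512K.]

Illustration 2: Vdev label layout on a block device of size N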

Based on the assumption that corruption (or accidental disk overwrites) typically occurs in contiguous chunks, placing the labels in non-contiguous locations (front and back) provides ZFS with a better probability that some label will remain accessible in the case of media failure or accidental overwrite (e.g. using the disk as a swap device while it is still part of a ZFS storage pool).
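The label placement described in this section can be sketched in C as follows. This is a minimal illustration, not code taken from ZFS: the function name `vdev_label_offset_sketch` and the constants are hypothetical, and the sketch assumes the device size is a multiple of the 256KB label size.

```c
#include <assert.h>
#include <stdint.h>

#define VDEV_LABEL_SIZE   (256ULL * 1024)  /* each label is 256KB */
#define VDEV_LABEL_COUNT  4                /* L0..L3 */

/*
 * Return the byte offset of label `l` (0..3) on a device of `dev_size`
 * bytes. L0 and L1 sit at the front of the device; L2 and L3 sit at
 * the back. (Hypothetical helper, assumes dev_size is label-aligned.)
 */
uint64_t
vdev_label_offset_sketch(uint64_t dev_size, int l)
{
	assert(l >= 0 && l < VDEV_LABEL_COUNT);
	assert(dev_size >= VDEV_LABEL_COUNT * VDEV_LABEL_SIZE);

	if (l < 2)
		return (l * VDEV_LABEL_SIZE);
	return (dev_size - (VDEV_LABEL_COUNT - l) * VDEV_LABEL_SIZE);
}
```

For a 1GB device, this places L0 at byte 0, L1 at 256K, L2 at N-512K, and L3 at N-256K, matching the layout described above.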

Section 1.2.2: Transactional Two Staged Label Update

The location of the vdev labels is fixed at the time the device is added to the pool. Thus, the vdev label does not have copy-on-write semantics like everything else in ZFS. Consequently, when a vdev label is updated, the contents of the label are overwritten. Any time on-disk data is overwritten, there is a potential for error. To ensure that ZFS always has access to its labels, a staged approach is used during update. The first stage of the update writes the even labels (L0 and L2) to disk. If, at any point in time, the system comes down or faults during this update, the odd labels will still be valid. Once the even labels have made it out to stable storage, the odd labels (L1 and L3) are updated and written to disk. This approach has been carefully designed to ensure that a valid copy of the label remains on disk at all times.
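The ordering of the two stages can be illustrated with a small simulation. This is only a sketch under simplifying assumptions: `label_set_t`, `label_update_sketch`, and `labels_with_txg` are hypothetical names, each label is reduced to a single transaction group number, and a fault is modeled as the update stopping after a given number of complete label writes (a torn write of a single label is not modeled).

```c
#include <stdint.h>

#define NLABELS 4

/* Hypothetical in-memory stand-in for the four labels of one vdev. */
typedef struct label_set {
	uint64_t txg[NLABELS];	/* txg recorded in each label copy */
} label_set_t;

/*
 * Update the labels to `new_txg` in two stages: even labels (L0, L2)
 * first, then odd labels (L1, L3). `writes_before_crash` simulates a
 * fault: only that many label writes complete (pass NLABELS for none).
 * Returns the number of label writes that completed.
 */
int
label_update_sketch(label_set_t *ls, uint64_t new_txg, int writes_before_crash)
{
	static const int order[NLABELS] = { 0, 2, 1, 3 };
	int done = 0;

	for (int i = 0; i < NLABELS && done < writes_before_crash; i++) {
		ls->txg[order[i]] = new_txg;
		done++;
	}
	return (done);
}

/* Count how many label copies carry the given txg. */
int
labels_with_txg(const label_set_t *ls, uint64_t txg)
{
	int n = 0;

	for (int i = 0; i < NLABELS; i++)
		if (ls->txg[i] == txg)
			n++;
	return (n);
}
```

If the simulated fault hits during the first stage, both odd labels still carry the previous transaction group, which is the property the staged update is meant to preserve.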

Section 1.3: Vdev Technical Details


The contents of a vdev label are broken up into four pieces: 8KB of blank space, 8KB of boot header information, 112KB of name-value pairs, and 128KB of 1KB sized uberblock structures. The drawing below shows an expanded view of the L0 label. A detailed description of each component follows: blank space (section 1.3.1), boot block header (section 1.3.2), name/value pair list (section 1.3.3), and uberblock array (section 1.3.4).
[Figure: expanded view of label L0 (of L0, L1, L2, L3), with offsets marked at 8K, 16K, 128K, and 256K. Regions in order: Blank Space, Boot Header, Name/Value Pairs, Uberblock Array.]

Illustration 3: Components of a vdev label (blank space, boot block header, name/value pairs, uberblock array)
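The region sizes listed above determine the offset of each region within a label. The following C sketch derives those offsets; all macro and function names (`VL_*`, `uberblock_slot_offset`) are hypothetical and chosen only for this illustration.

```c
#include <stdint.h>

/* Sizes of the four regions of a vdev label, in bytes. */
#define VL_BLANK_SIZE       (8u * 1024)
#define VL_BOOT_HDR_SIZE    (8u * 1024)
#define VL_NVPAIRS_SIZE     (112u * 1024)
#define VL_UBERBLOCKS_SIZE  (128u * 1024)
#define VL_UBERBLOCK_SIZE   (1u * 1024)

/* Offsets of each region relative to the start of the label. */
#define VL_BOOT_HDR_OFF     (VL_BLANK_SIZE)
#define VL_NVPAIRS_OFF      (VL_BOOT_HDR_OFF + VL_BOOT_HDR_SIZE)
#define VL_UBERBLOCKS_OFF   (VL_NVPAIRS_OFF + VL_NVPAIRS_SIZE)
#define VL_LABEL_SIZE       (VL_UBERBLOCKS_OFF + VL_UBERBLOCKS_SIZE)

/* Number of 1KB uberblock slots in the uberblock array. */
#define VL_UBERBLOCK_COUNT  (VL_UBERBLOCKS_SIZE / VL_UBERBLOCK_SIZE)

/* Byte offset, within a label, of uberblock slot `i`. */
uint32_t
uberblock_slot_offset(uint32_t i)
{
	return (VL_UBERBLOCKS_OFF + i * VL_UBERBLOCK_SIZE);
}
```

The four regions add up to the 256KB label size, and the 128KB array holds 128 uberblock slots of 1KB each.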

Section 1.3.1: Blank Space

ZFS supports both VTOC (Volume Table of Contents) and EFI disk labels as valid methods of describing disk layout.[1] While EFI labels are not written as part of a slice (they have their own reserved space), VTOC labels must be written to the first 8K of slice 0. Thus, to support VTOC labels, the first 8K of the vdev_label is left empty to prevent potentially overwriting a VTOC disk label.

Section 1.3.2: Boot Block Header

The boot block header is an 8K structure that is reserved for future use. The contents of this block will be described in a future appendix of this paper.

Section 1.3.3: Name-Value Pair List

The next 112KB of the label holds a collection of name-value pairs describing this vdev and all of its related vdevs. Related vdevs are defined as all vdevs within the subtree rooted at this vdev's top-level vdev. For example, the vdev label on device "A" (seen in the illustration below) would contain information describing the highlighted subtree: vdevs "A", "B", and "M1" (top-level vdev).

[1] Disk labels describe disk partition and slice information. See fdisk(1M) and/or format(1M) for more information on disk partitions and slices. It should be noted that disk labels are a completely separate entity from vdev labels and, while their naming is similar, they should not be confused as being similar.

[Figure: vdev tree. Internal/logical vdevs: "root vdev". Top-level vdevs: "M1" vdev (mirror A/B) and "M2" vdev (mirror C/D). Physical/leaf vdevs: "A" vdev (disk), "B" vdev (disk), "C" vdev (disk), "D" vdev (disk). The related vdevs are circled.]

Illustration 4: vdev tree showing related vdevs in highlighted circle

All name-value pairs are stored in XDR encoded nvlists. For more information on XDR encoding or nvlists, see the libnvpair(3LIB) and nvlist_free(3NVPAIR) man pages. The following name-value pairs are contained within this 112KB portion of the vdev_label.

Version:
    Name: "version"
    Value: DATA_TYPE_UINT64
    Description: On-disk format version. Current value is "1".

Name:
    Name: "name"
    Value: DATA_TYPE_STRING
    Description: Name of the pool to which this vdev belongs.

State:
    Name: "state"
    Value: DATA_TYPE_UINT64
    Description: State of this pool. The following table shows all existing pool states.

    State                    Value
    POOL_STATE_ACTIVE        0
    POOL_STATE_EXPORTED      1
    POOL_STATE_DESTROYED     2

    Table 1: Pool states and values.

Transaction:
    Name: "txg"
    Value: DATA_TYPE_UINT64
    Description: Transaction group number in which this label was written to disk.

Pool Guid:
    Name: "pool_guid"
    Value: DATA_TYPE_UINT64
    Description: Global unique identifier (guid) for the pool.

Top Guid:
    Name: "top_guid"
    Value: DATA_TYPE_UINT64
    Description: Global unique identifier for the top-level vdev of this subtree.

Guid:
    Name: "guid"
    Value: DATA_TYPE_UINT64
    Description: Global unique identifier for this vdev.

Vdev Tree:
    Name: "vdev_tree"
    Value: DATA_TYPE_NVLIST
    Description: The vdev_tree is an nvlist structure which is used recursively to describe the hierarchical nature of the vdev tree as seen in illustrations one and four. The vdev_tree recursively describes each "related" vdev within this vdev's subtree. The illustration below shows what the "vdev_tree" entry might look like for "vdev A" as shown in illustrations one and four earlier in this document.

vdev_tree
    type='mirror'
    id=1
    guid=16593009660401351626
    metaslab_array=13
    metaslab_shift=22
    ashift=9
    asize=519569408
    children[0]
        vdev_tree
        type='disk'
        id=2
        guid=6649981596953412974
        path='/dev/dsk/c4t0d0'
        devid='id1,sd@SSEAGATE_ST373453LW_3HW0J0FJ00007404E4NS/a'
    children[1]
        vdev_tree
        type='disk'
        id=3
        guid=3648040300193291405
        path='/dev/dsk/c4t1d0'
        devid='id1,sd@SSEAGATE_ST373453LW_3HW0HLAW0007404D6MN/a'

Illustration 5: vdev tree nvlist entry for "vdev A" as seen in Illustrations 1 and 4

Each vdev_tree nvlist contains the following elements as described in the section below. Note that not all nvlist elements are applicable to all vdev types. Therefore, a vdev_tree nvlist may contain only a subset of the elements described below.

Name: "type"
Value: DATA_TYPE_STRING
Description: String value indicating type of vdev. The following vdev_types are valid.

    Type          Description
    "disk"        Leaf vdev: block storage
    "file"        Leaf vdev: file storage
    "mirror"      Interior vdev: mirror
    "raidz"       Interior vdev: raidz
    "replacing"   Interior vdev: a slight variation on the mirror vdev; used by ZFS when replacing one disk with another
    "root"        Interior vdev: the root of the vdev tree

    Table 2: Vdev Type Strings

Name: "id"
Value: DATA_TYPE_UINT64
Description: The id is the index of this vdev in its parent's children array.

Name: "guid"
Value: DATA_TYPE_UINT64
Description: Global Unique Identifier for this vdev_tree element.

Name: "path"
Value: DATA_TYPE_STRING
Description: Device path. Only used for leaf vdevs.

Name: "devid"
Value: DATA_TYPE_STRING
Description: Device ID for this vdev_tree element. Only used for vdevs of type disk.

Name: "metaslab_array"
Value: DATA_TYPE_UINT64
Description: Object number of an object containing an array of object numbers. Each element of this array (ma[i]) is, in turn, an object number of a space map for metaslab 'i'.

Name: "metaslab_shift"
Value: DATA_TYPE_UINT64
Description: Log base 2 of the metaslab size.

Name: "ashift"
Value: DATA_TYPE_UINT64
Description: Log base 2 of the minimum allocatable unit for this top-level vdev. This is currently '10' for a RAIDZ configuration, '9' otherwise.

Name: "asize"
Value: DATA_TYPE_UINT64
Description: Amount of space that can be allocated from this top-level vdev.

Name: "children"
Value: DATA_TYPE_NVLIST_ARRAY
Description: Array of vdev_tree nvlists for each child of this vdev_tree element.

Section 1.3.4: The Uberblock

Immediately following the nvpair lists in the vdev label is an array of uberblocks. The uberblock is the portion of the label containing information necessary to access the contents of the pool.[2] Only one uberblock in the pool is active at any point in time. The uberblock with the highest transaction group number and a valid SHA-256 checksum is the active uberblock.

To ensure constant access to the active uberblock, the active uberblock is never overwritten. Instead, all updates to an uberblock are done by writing a modified uberblock to another element of the uberblock array. Upon writing the new uberblock, the transaction group number and timestamps are incremented, thereby making it the new active uberblock in a single atomic action. Uberblocks are written in a round robin fashion across the various vdevs within the pool. The illustration below has an expanded view of two uberblocks within an uberblock array.

[2] The uberblock is similar to the superblock in UFS.

[Figure: expanded view of label L0 (Blank Space, Boot Header, Name/Value Pairs, uberblock array) showing two uberblock_phys_t entries of the array, one of them marked as the active uberblock. Each entry contains: uint64_t ub_magic, uint64_t ub_version, uint64_t ub_txg, uint64_t ub_guid_sum, uint64_t ub_timestamp, blkptr_t ub_rootbp.]

Illustration 6: Uberblock array showing uberblock contents
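The rule for choosing the active uberblock (highest transaction group number among the entries with a valid checksum) can be sketched as below. The type `ub_candidate_t` and the function `find_active_uberblock` are hypothetical; the sketch assumes the checksum of each slot has already been verified elsewhere and is passed in as a flag.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical summary of one uberblock slot after it has been read. */
typedef struct ub_candidate {
	uint64_t txg;		/* ub_txg read from the slot */
	int	 cksum_ok;	/* nonzero if the SHA-256 checksum verified */
} ub_candidate_t;

/*
 * Return the index of the active uberblock: the slot with the highest
 * transaction group number among those with a valid checksum.
 * Returns -1 if no slot has a valid checksum.
 */
int
find_active_uberblock(const ub_candidate_t *slots, size_t nslots)
{
	int best = -1;

	for (size_t i = 0; i < nslots; i++) {
		if (!slots[i].cksum_ok)
			continue;
		if (best < 0 || slots[i].txg > slots[best].txg)
			best = (int)i;
	}
	return (best);
}
```

A slot with a higher transaction group number but a failed checksum is skipped, so a partially written uberblock never becomes active.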

Uberblock Technical Details

The uberblock is stored in the machine's native endian format and has the following contents:

ub_magic
    The uberblock magic number is a 64 bit integer used to identify a device as containing ZFS data. The value of the ub_magic is 0x00bab10c (oo-ba-block). The following table shows the ub_magic number as seen on disk.

    Machine Endianness    Uberblock Value
    Big Endian            0x00bab10c
    Little Endian         0x0cb1ba00

    Table 3: Uberblock values per machine endian type.

ub_version
    The version field is used to identify the on-disk format in which this data is laid out. The current on-disk format version number is 0x1. This field contains the same value as the "version" element of the name/value pairs described in section 1.3.3.

ub_txg
    All writes in ZFS are done in transaction groups. Each group has an associated transaction group number. The ub_txg value reflects the transaction group in which this uberblock was written. The ub_txg number must be greater than or equal to the "txg" number stored in the nvlist for this label to be valid.

ub_guid_sum
    The ub_guid_sum is used to verify the availability of vdevs within a pool. When a pool is opened, ZFS traverses all leaf vdevs within the pool and totals a running sum of all the GUIDs it encounters (a vdev's guid is stored in the guid nvpair entry, see section 1.3.3). This computed sum is checked against the ub_guid_sum to verify the availability of all vdevs within this pool.

ub_timestamp
    Coordinated Universal Time (UTC) when this uberblock was written, in seconds since January 1st 1970 (GMT).

ub_rootbp
    The ub_rootbp is a blkptr structure containing the location of the MOS. Both the MOS and blkptr structures are described in later chapters of this document: Chapters 4 and 2 respectively.
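Two of the checks described above lend themselves to a short C sketch: recognizing the magic number regardless of the writer's endianness, and totaling leaf vdev GUIDs. The function names are hypothetical. The sketch assumes the magic is compared as a full 64-bit value, so the byte-swapped form is 0x0cb1ba0000000000 (its first four on-disk bytes are the 0c b1 ba 00 shown in the table), and it assumes the GUID sum wraps on 64-bit overflow, which the text does not state.

```c
#include <stddef.h>
#include <stdint.h>

#define UB_MAGIC 0x00bab10cULL	/* value of ub_magic given in the text */

/* Reverse the byte order of a 64-bit value. */
uint64_t
bswap64_sketch(uint64_t v)
{
	uint64_t r = 0;

	for (int i = 0; i < 8; i++) {
		r = (r << 8) | (v & 0xff);
		v >>= 8;
	}
	return (r);
}

/*
 * Classify a ub_magic value as read into a native uint64_t:
 *   1  -> written with the reader's endianness
 *  -1  -> written with the opposite endianness (fields need swapping)
 *   0  -> not an uberblock
 */
int
ub_magic_check(uint64_t magic_as_read)
{
	if (magic_as_read == UB_MAGIC)
		return (1);
	if (magic_as_read == bswap64_sketch(UB_MAGIC))
		return (-1);
	return (0);
}

/* Sum leaf vdev GUIDs; unsigned 64-bit arithmetic wraps on overflow. */
uint64_t
guid_sum_sketch(const uint64_t *leaf_guids, size_t n)
{
	uint64_t sum = 0;

	for (size_t i = 0; i < n; i++)
		sum += leaf_guids[i];
	return (sum);
}
```

A pool would be considered complete, in this sketch, when `guid_sum_sketch` over the leaf vdevs found at open time equals the stored ub_guid_sum.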

Section 1.4: Boot Block


Immediately following the L0 and L1 labels is a 3.5MB chunk reserved for future use. The contents of this block will be described in a future appendix of this paper.
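[Figure: device layout including the boot block, with offsets marked at 0, 256K, 512K, 4M, N-512K, N-256K, and N. L0 and L1 occupy the first 512K, the boot block extends from 512K to 4M, and L2 and L3 occupy the last 512K of the device.]

Illustration 7: Vdev label layout including boot block reserved space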

Chapter Two: Block Pointers and Indirect Blocks


Data is transferred between disk and main memory in units called blocks. A block pointer (blkptr_t) is a 128 byte ZFS structure used to physically locate, verify, and describe blocks of data on disk. The 128 byte blkptr_t structure layout is shown in the illustration below.
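[Figure: block pointer layout as sixteen 64-bit words, numbered 0 through f, with bit positions 64, 56, 48, 40, 32, 24, 16, 8, 0 marked across the top. Fields are listed from most significant to least significant.]

    Word   Fields
    0      vdev1 | GRID | ASIZE
    1      G | offset1
    2      vdev2 | GRID | ASIZE
    3      G | offset2
    4      vdev3 | GRID | ASIZE
    5      G | offset3
    6      E | lvl | type | cksum | comp | PSIZE | LSIZE
    7      padding
    8      padding
    9      padding
    a      birth txg
    b      fill count
    c      checksum[0]
    d      checksum[1]
    e      checksum[2]
    f      checksum[3]

Illustration 8: Block pointer structure showing byte by byte usage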
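The sixteen-word layout can be mirrored by a C structure to confirm that it adds up to 128 bytes. This is a sketch only: the type and member names (`dva_sketch_t`, `blkptr_sketch_t`, `blk_prop`, and so on) are hypothetical, and the bit-level packing of the fields inside each word is deliberately left out.

```c
#include <stddef.h>
#include <stdint.h>

/* One DVA occupies two 64-bit words of the block pointer. */
typedef struct dva_sketch {
	uint64_t dva_word[2];	/* word 0: vdev, GRID, ASIZE; word 1: G, offset */
} dva_sketch_t;

typedef struct blkptr_sketch {
	dva_sketch_t	blk_dva[3];	/* words 0-5: up to three DVAs */
	uint64_t	blk_prop;	/* word 6: E, lvl, type, cksum, comp, PSIZE, LSIZE */
	uint64_t	blk_pad[3];	/* words 7-9: padding */
	uint64_t	blk_birth;	/* word a: birth txg */
	uint64_t	blk_fill;	/* word b: fill count */
	uint64_t	blk_cksum[4];	/* words c-f: checksum[0..3] */
} blkptr_sketch_t;

/* Compile-time check: the array size is negative unless the size is 128. */
typedef char blkptr_size_check[(sizeof (blkptr_sketch_t) == 128) ? 1 : -1];
```

Because every member is a 64-bit word, no compiler padding is introduced and the word numbers in the illustration map directly to byte offsets (word number times 8).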

Section 2.1: DVA - Data Virtual Address


The data virtual address is the name given to the combination of the vdev and offset portions of the block pointer; for example, the combination of vdev1 and offset1 makes up a DVA (dva1). ZFS provides the capability of storing up to three copies of the data pointed to by the block pointer, each pointed to by a unique DVA (dva1, dva2, or dva3). The data stored in each of these copies is identical. The number of DVAs used per block pointer is purely a policy decision and is called the "wideness" of the block pointer: single wide block pointer (1 DVA), double wide block pointer (2 DVAs), and triple wide block pointer (3 DVAs).

The vdev portion of each DVA is a 32 bit integer which uniquely identifies the vdev ID containing this block. The offset portion of the DVA is a 63 bit integer value holding the offset (starting after the vdev labels (L0 and L1) and boot block) within that device where the data lives. Together, the vdev and offset uniquely identify the block address of the data it points to.

The value stored in offset is the offset in terms of sectors (512 byte blocks). To find the physical block byte offset from the beginning of a slice, the value inside offset must be shifted over (<<) by 9 (2^9 = 512) and this value must be added to 0x400000 (size of two vdev_labels and boot block).

    physical block address = (offset << 9) + 0x400000 (4MB)
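The address computation above translates directly into C. The names `SECTOR_SHIFT`, `LABEL_BOOT_RESERVE`, and `dva_offset_to_physical` are hypothetical, introduced only for this sketch.

```c
#include <stdint.h>

#define SECTOR_SHIFT        9		/* 2^9 = 512 byte sectors */
#define LABEL_BOOT_RESERVE  0x400000ULL	/* L0 + L1 + boot block = 4MB */

/*
 * Convert the sector offset stored in a DVA into a byte offset from
 * the beginning of the slice.
 */
uint64_t
dva_offset_to_physical(uint64_t offset_sectors)
{
	return ((offset_sectors << SECTOR_SHIFT) + LABEL_BOOT_RESERVE);
}
```

The 4MB constant is the sum of the two front labels (2 x 256KB) and the 3.5MB boot block, so a DVA offset of 0 refers to the first byte after the boot block.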

Section 2.2: GRID

Raid-Z layout information, reserved for future use.

Section 2.3: GANG

A gang block is a block whose contents contain block pointers. Gang blocks are used when the amount of space requested is not available in a contiguous block. In a situation of this kind, several smaller blocks will be allocated (totaling up to the size requested) and a gang block will be created to contain the block pointers for the allocated blocks. A pointer to this gang block is returned to the requester, giving the requester the perception of a single block. Gang blocks are identified by the "G" bit.

    "G" bit value    Description
    0                non-gang block
    1                gang block

    Table 4: Gang Block Values

Gang blocks are 512 byte sized, self checksumming blocks. A gang block contains up to 3 block pointers followed by a 32 byte checksum. The format of the gang block is described by the following structures.

    typedef struct zio_gbh {
        blkptr_t         zg_blkptr[SPA_GBH_NBLKPTRS];
        uint64_t         zg_filler[SPA_GBH_FILLER];
        zio_block_tail_t zg_tail;
    } zio_gbh_phys_t;

zg_blkptr: array of block pointers. Each 512 byte gang block can hold up to 3 block pointers.
zg_filler: The filler field pads out the gang block so that it is nicely byte aligned.

    typedef struct zio_block_tail {
        uint64_t    zbt_magic;
        zio_cksum_t zbt_cksum;
    } zio_block_tail_t;

zbt_magic: ZIO block tail magic number. The value is 0x210da7ab10c7a11 (zio-data-bloc-tail).

    typedef struct zio_cksum {
        uint64_t zc_word[4];
    } zio_cksum_t;

zc_word: four 8 byte words containing the checksum for this gang block.
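The sizes stated above (a 512 byte gang block, 128 byte block pointers, an 8 byte magic number, and a 32 byte checksum) are enough to work out the array lengths. In the sketch below the definitions of SPA_GBH_NBLKPTRS and SPA_GBH_FILLER are derived from those sizes rather than quoted from the ZFS source, and `blkptr_sketch_t` and `GANG_BLOCK_SIZE` are hypothetical stand-ins.

```c
#include <stdint.h>

#define GANG_BLOCK_SIZE 512	/* hypothetical name for the 512 byte size */

/* Stand-in for the 128 byte blkptr_t: sixteen 64-bit words. */
typedef struct blkptr_sketch {
	uint64_t blk_word[16];
} blkptr_sketch_t;

typedef struct zio_cksum {
	uint64_t zc_word[4];		/* 32 byte checksum */
} zio_cksum_t;

typedef struct zio_block_tail {
	uint64_t	zbt_magic;	/* 8 byte magic number */
	zio_cksum_t	zbt_cksum;
} zio_block_tail_t;

/* Derived here from the stated sizes (assumption, not quoted source). */
#define SPA_GBH_NBLKPTRS \
	((GANG_BLOCK_SIZE - sizeof (zio_block_tail_t)) / sizeof (blkptr_sketch_t))
#define SPA_GBH_FILLER \
	((GANG_BLOCK_SIZE - sizeof (zio_block_tail_t) - \
	SPA_GBH_NBLKPTRS * sizeof (blkptr_sketch_t)) / sizeof (uint64_t))

typedef struct zio_gbh {
	blkptr_sketch_t	 zg_blkptr[SPA_GBH_NBLKPTRS];
	uint64_t	 zg_filler[SPA_GBH_FILLER];
	zio_block_tail_t zg_tail;
} zio_gbh_phys_t;
```

With a 40 byte tail, three block pointers fit (384 bytes) and the remaining 88 bytes become 11 filler words, giving exactly 512 bytes.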

Section 2.4: Checksum


By default ZFS checksums all of its data and metadata. ZFS supports several algorithms for checksumming including fletcher2, fletcher4 and SHA-256 (256-bit Secure Hash Algorithm in FIPS 180-2, available at http://csrc.nist.gov/cryptval). The algorithm used to checksum this block is identified by the 8 bit integer stored in the cksum portion of the block pointer. The following table pairs each integer with a description and algorithm used to checksum this block's contents.

    Description    Value    Algorithm
    on             1        fletcher2
    off            2        none
    label          3        SHA-256
    gang header    4        SHA-256
    zilog          5        fletcher2
    fletcher2      6        fletcher2
    fletcher4      7        fletcher4
    SHA-256        8        SHA-256

    Table 5: Checksum Values and associated checksum algorithms.

A 256 bit checksum of the data is computed for each block using the algorithm identified in cksum. If the cksum value is 2 (off), a checksum will not be computed and checksum[0], checksum[1], checksum[2], and checksum[3] will be zero. Otherwise, the 256 bit checksum computed for this block is stored in the checksum[0], checksum[1], checksum[2], and checksum[3] fields.

Note: The computed checksum is always of the data, even if this is a gang block. Gang blocks (see above) and zilog blocks (see Chapter 7) are self checksumming.
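The mapping from the cksum value to an algorithm, and the rule that an "off" block carries an all-zero checksum, can be sketched as a simple lookup. The enum and function names below are hypothetical and exist only for this illustration; the values 1 through 8 are those of Table 5.

```c
#include <stdint.h>

typedef enum cksum_alg {
	CKSUM_ALG_UNKNOWN,
	CKSUM_ALG_NONE,
	CKSUM_ALG_FLETCHER2,
	CKSUM_ALG_FLETCHER4,
	CKSUM_ALG_SHA256
} cksum_alg_t;

/* Map the 8 bit cksum field of a block pointer to its algorithm. */
cksum_alg_t
cksum_algorithm(unsigned cksum)
{
	switch (cksum) {
	case 1: return (CKSUM_ALG_FLETCHER2);	/* on */
	case 2: return (CKSUM_ALG_NONE);	/* off */
	case 3: return (CKSUM_ALG_SHA256);	/* label */
	case 4: return (CKSUM_ALG_SHA256);	/* gang header */
	case 5: return (CKSUM_ALG_FLETCHER2);	/* zilog */
	case 6: return (CKSUM_ALG_FLETCHER2);	/* fletcher2 */
	case 7: return (CKSUM_ALG_FLETCHER4);	/* fletcher4 */
	case 8: return (CKSUM_ALG_SHA256);	/* SHA-256 */
	default: return (CKSUM_ALG_UNKNOWN);
	}
}

/*
 * When checksumming is off (value 2), all four checksum words must be
 * zero. Returns nonzero if the words are consistent with that rule.
 */
int
cksum_words_consistent(unsigned cksum, const uint64_t words[4])
{
	if (cksum_algorithm(cksum) != CKSUM_ALG_NONE)
		return (1);
	return ((words[0] | words[1] | words[2] | words[3]) == 0);
}
```

A reader of a block pointer would first call `cksum_algorithm` to decide which routine to run, then compare the result against the four stored checksum words.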

Section 2.5: Compression


ZFS supports several algorithms for compression. The type of compression used to compress this block is stored in the comp portion of the block pointer.

    Description    Value    Algorithm
    on             1        lzjb
    off            2        none
    lzjb           3        lzjb

    Table 6: Compression Values and associated algorithm.

Section 2.6: Block Size


The size of a block is described by three different fields in the block pointer: psize, lsize, and asize.

lsize: Logical size. The size of the data without compression, raidz or gang overhead.
psize: Physical size of the block on disk after compression.
asize: Allocated size. The total size of all blocks allocated to hold this data, including any gang headers or raid-Z parity information.

If compression is turned off and ZFS is not on Raid-Z storage, lsize, asize, and psize will all be equal. All sizes are stored as the number of 512 byte sectors (minus one) needed to represent the size of this block.
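The "sectors minus one" encoding can be made concrete with a pair of conversion helpers. This sketch follows the statement above literally and applies the same encoding to every size field; the function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define SECTOR_SHIFT 9	/* 512 byte sectors */

/* Decode a stored size field (sectors minus one) into bytes. */
uint64_t
size_field_to_bytes(uint64_t field)
{
	return ((field + 1) << SECTOR_SHIFT);
}

/* Encode a byte count (a nonzero multiple of 512) into a size field. */
uint64_t
bytes_to_size_field(uint64_t bytes)
{
	assert(bytes > 0 && (bytes & ((1ULL << SECTOR_SHIFT) - 1)) == 0);
	return ((bytes >> SECTOR_SHIFT) - 1);
}
```

Under this encoding a stored value of 0 means one 512 byte sector, and a stored value of 255 means 128KB.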

Section 2.7: Endian


ZFS is an adaptive-endian filesystem (providing the restrictions described in Chapter One) that allows for moving pools across machines with different architectures: little endian vs. big endian. The "E" portion of the block pointer indicates which format this block has been written out in. Blocks are always written out in the machine's native endian format.

    Endian           Value
    Little Endian    1
    Big Endian       0

    Table 7: Endian Values

If a pool is moved to a machine with a different endian format, the contents of the block are byte swapped on read.
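The swap-on-read decision can be sketched as a comparison between the block's endian bit and the host's own byte order. All function names here are hypothetical, and the sketch handles a single 64-bit value rather than a whole block.

```c
#include <stdint.h>

/* Returns 1 on a little-endian host, 0 on a big-endian host (as in Table 7). */
int
host_endian_bit(void)
{
	const uint16_t probe = 1;

	return (*(const unsigned char *)&probe == 1);
}

/* Nonzero if a block written with endian bit `blk_e` needs swapping here. */
int
block_needs_byteswap(int blk_e)
{
	return (blk_e != host_endian_bit());
}

/* Reverse the byte order of a 64-bit value. */
uint64_t
bswap64_sketch(uint64_t v)
{
	uint64_t r = 0;

	for (int i = 0; i < 8; i++) {
		r = (r << 8) | (v & 0xff);
		v >>= 8;
	}
	return (r);
}

/* Interpret one 64-bit word of a block, swapping only when required. */
uint64_t
read_u64_from_block(uint64_t raw, int blk_e)
{
	return (block_needs_byteswap(blk_e) ? bswap64_sketch(raw) : raw);
}
```

Because blocks are always written in the writer's native format, no work is done on the common path where the pool stays on the same architecture.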

Section 2.8: Type


The type portion of the block pointer indicates what type of data this block holds. The type can be one of the following values. More detail is provided in Chapter 3 regarding object types.

    Type                             Value
    DMU_OT_NONE                      0
    DMU_OT_OBJECT_DIRECTORY          1
    DMU_OT_OBJECT_ARRAY              2
    DMU_OT_PACKED_NVLIST             3
    DMU_OT_NVLIST_SIZE               4
    DMU_OT_BPLIST                    5
    DMU_OT_BPLIST_HDR                6
    DMU_OT_SPACE_MAP_HEADER          7
    DMU_OT_SPACE_MAP                 8
    DMU_OT_INTENT_LOG                9
    DMU_OT_DNODE                     10
    DMU_OT_OBJSET                    11
    DMU_OT_DSL_DATASET               12
    DMU_OT_DSL_DATASET_CHILD_MAP     13
    DMU_OT_OBJSET_SNAP_MAP           14
    DMU_OT_DSL_PROPS                 15
    DMU_OT_DSL_OBJSET                16
    DMU_OT_ZNODE                     17
    DMU_OT_ACL                       18
    DMU_OT_PLAIN_FILE_CONTENTS       19
    DMU_OT_DIRECTORY_CONTENTS        20
    DMU_OT_MASTER_NODE               21
    DMU_OT_DELETE_QUEUE              22
    DMU_OT_ZVOL                      23
    DMU_OT_ZVOL_PROP                 24

    Table 8: Object Types

Section 2.9: Level


The level portion of the block pointer is the number of levels (number of block pointers which need to be traversed to arrive at this data). See Chapter 3 for a more complete definition of level.

Section 2.10: Fill


The fill count describes the number of non-zero block pointers under this block pointer. The fill count for a data block pointer is 1, as it does not have any block pointers beneath it.

The fill count is used slightly differently for block pointers of type DMU_OT_DNODE. For block pointers of this type, the fill count contains the number of free dnodes beneath this block pointer. For more information on dnodes see Chapter 3.

Section 2.11: Birth Transaction


The birth transaction stored in the "birth txg" block pointer field is a 64 bit integer containing the transaction group number in which this block pointer was allocated.

Section 2.12: Padding

The three padding fields in the block pointer are space reserved for future use.

Chapter Three: Data Management Unit

The Data Management Unit (DMU) consumes blocks and groups them into logical units called objects. Objects can be further grouped by the DMU into object sets. Both objects and object sets are described in this chapter.

Section 3.1: Objects


With the exception of a small amount of infrastructure, described in chapters 1 and 2, everything in ZFS is an object. The following table lists existing ZFS object types; many of these types are described in greater detail in future chapters of this document.

Type                             Description
DMU_OT_NONE                      Unallocated object
DMU_OT_OBJECT_DIRECTORY          DSL object directory ZAP object
DMU_OT_OBJECT_ARRAY              Object used to store an array of object numbers.
DMU_OT_PACKED_NVLIST             Packed nvlist object.
DMU_OT_SPACE_MAP                 SPA disk block usage list.
DMU_OT_INTENT_LOG                Intent Log
DMU_OT_DNODE                     Object of dnodes (metadnode)
DMU_OT_OBJSET                    Collection of objects.
DMU_OT_DSL_DATASET_CHILD_MAP     DSL ZAP object containing child DSL directory information.
DMU_OT_DSL_OBJSET_SNAP_MAP       DSL ZAP object containing snapshot information for a dataset.
DMU_OT_DSL_PROPS                 DSL ZAP properties object containing properties for a DSL dir object.
DMU_OT_BPLIST                    Block pointer list: used to store the "deadlist" (list of block pointers deleted since the last snapshot) and the "deferred free list" used for sync to convergence.
DMU_OT_BPLIST_HDR                BPLIST header: stores the bplist_phys_t structure.
DMU_OT_ACL                       ACL (Access Control List) object
DMU_OT_PLAIN_FILE                ZPL Plain file
DMU_OT_DIRECTORY_CONTENTS        ZPL Directory ZAP Object
DMU_OT_MASTER_NODE               ZPL Master Node ZAP object: head object used to identify root directory, delete queue, and version for a filesystem.
DMU_OT_DELETE_QUEUE              The delete queue provides a list of deletes that were in-progress when the filesystem was force unmounted or as a result of a system failure such as a power outage. Upon the next mount of the filesystem, the delete queue is processed to remove the files/dirs that are in the delete queue. This mechanism is used to avoid leaking files and directories in the filesystem.
DMU_OT_ZVOL                      ZFS volume (ZVOL)
DMU_OT_ZVOL_PROP                 ZVOL properties

Table 9 DMU Object Types

Objects are defined by 512 byte structures called dnodes[3]. A dnode describes and organizes a collection of blocks making up an object. The dnode (dnode_phys_t structure), seen in the illustration below, contains several fixed length fields and two variable length fields. Each of these fields is described in detail below.

dnode_phys_t

    Fixed length fields:
        uint8_t    dn_type;
        uint8_t    dn_indblkshift;
        uint8_t    dn_nlevels;
        uint8_t    dn_nblkptr;
        uint8_t    dn_bonustype;
        uint8_t    dn_checksum;
        uint8_t    dn_compress;
        uint8_t    dn_pad[1];
        uint16_t   dn_datablkszsec;
        uint16_t   dn_bonuslen;
        uint8_t    dn_pad2[4];
        uint64_t   dn_maxblkid;
        uint64_t   dn_secphys;
        uint64_t   dn_pad3[4];

    Variable length fields:
        blkptr_t   dn_blkptr[N];
        uint8_t    dn_bonus[BONUSLEN];

Illustration 9 dnode_phys_t structure

[3] A dnode is similar to an inode in UFS.

dn_type
An 8-bit numeric value indicating an object's type. See Table 8 for a list of valid object types and their associated 8 bit identifiers.

dn_indblkshift and dn_datablkszsec
ZFS supports variable data and indirect (see dn_nlevels below for a description of indirect blocks) block sizes ranging from 512 bytes to 128 Kbytes.

dn_indblkshift: 8-bit numeric value containing the log (base 2) of the size (in bytes) of an indirect block for this object.

dn_datablkszsec: 16-bit numeric value containing the data block size (in bytes) divided by 512 (size of a disk sector). This value can range between 1 (for a 512 byte block) and 256 (for a 128 Kbyte block).

dn_nblkptr and dn_blkptr
dn_blkptr is a variable length field that can contain between one and three block pointers. The number of block pointers that the dnode contains is set at object allocation time and remains constant throughout the life of the dnode.

dn_nblkptr: 8 bit numeric value containing the number of block pointers in this dnode.

dn_blkptr: block pointer array containing dn_nblkptr block pointers.

dn_nlevels
dn_nlevels is an 8 bit numeric value containing the number of levels that make up this object. These levels are often referred to as levels of indirection.

Indirection
A dnode has a limited number (dn_nblkptr, see above) of block pointers to describe an object's data. For a dnode using the largest data block size (128KB) and containing the maximum number of block pointers (3), the largest object size it can represent (without indirection) is 384 KB: 3 x 128KB = 384KB. To allow for larger objects, indirect blocks are used. An indirect block is a block containing block pointers. The number of block pointers that an indirect block can hold is dependent on the indirect block size (represented by dn_indblkshift) and can be calculated by dividing the indirect block size by the size of a blkptr (128 bytes). The largest indirect block (128KB) can hold up to 1024 block pointers. As an object's size increases, more indirect blocks and levels of indirection are created. A new level of indirection is created once an object grows so large that it exceeds the capacity of the current level. ZFS provides up to six levels of indirection to support files up to 2^64 bytes long.

The illustration below shows an object with 3 levels of blocks (level 0, level 1, and level 2). This object has triple wide block pointers (dva1, dva2, and dva3) for metadata and single wide block pointers for its data (see Chapter Two for a description of block pointer wideness). The blocks at level 0 are data blocks.

[Illustration 10: a dnode_phys_t with dn_nlevels = 3 and dn_nblkptr = 3. Each entry of dn_blkptr[] carries three DVAs (dva1, dva2, dva3) and points to a level 2 indirect block; level 2 blocks point to level 1 indirect blocks, which in turn point to the level 0 data blocks.]

Illustration 10 Object with 3 levels. Triple wide block pointers used for metadata; single wide block pointers used for data.
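The size arithmetic described for dn_datablkszsec, dn_indblkshift and dn_nblkptr can be checked with a short sketch. The function names are hypothetical, and the sketch assumes indirect block sizes obey the same 512 byte to 128 Kbyte range as data blocks (a shift between 9 and 17).

```c
#include <assert.h>
#include <stdint.h>

#define SECTOR_SIZE  512u   /* size of a disk sector, per the text */
#define BLKPTR_SIZE  128u   /* size of one block pointer, per the text */

/* Data block size in bytes from dn_datablkszsec (1..256). */
static uint32_t data_block_size(uint16_t dn_datablkszsec)
{
    assert(dn_datablkszsec >= 1 && dn_datablkszsec <= 256);
    return (uint32_t)dn_datablkszsec * SECTOR_SIZE;
}

/* Indirect block size in bytes from dn_indblkshift (log2 of the size). */
static uint32_t indirect_block_size(uint8_t dn_indblkshift)
{
    assert(dn_indblkshift >= 9 && dn_indblkshift <= 17);
    return 1u << dn_indblkshift;
}

/* Number of block pointers one indirect block can hold. */
static uint32_t blkptrs_per_indirect(uint8_t dn_indblkshift)
{
    return indirect_block_size(dn_indblkshift) / BLKPTR_SIZE;
}

/* Largest object size (bytes) a dnode can describe without indirection. */
static uint64_t max_size_without_indirection(uint16_t dn_datablkszsec,
                                             uint8_t dn_nblkptr)
{
    assert(dn_nblkptr >= 1 && dn_nblkptr <= 3);
    return (uint64_t)data_block_size(dn_datablkszsec) * dn_nblkptr;
}
```

With dn_datablkszsec = 256 and dn_nblkptr = 3 this reproduces the 384 KB limit, and dn_indblkshift = 17 gives the 1024 pointers per indirect block quoted above.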

dn_maxblkid
An object's blocks are identified by block ids. The blocks in each level of indirection are numbered from 0 to N, where the first block at a given level is given an id of 0, the second an id of 1, and so forth. The dn_maxblkid field in the dnode is set to the value of the largest data (level zero) block id for this object.

Note on Block Ids: Given a block id and level, ZFS can determine the exact branch of indirect blocks which contain the block. This calculation is done using the block id, block level, and number of block pointers in an indirect block. For example, take an object which has 128KB sized indirect blocks. An indirect block of this size can hold 1024 block pointers. Given a level 0 block id of 16360, it can be determined that block 15 (block id 15) of level 1 contains the block pointer for level 0 blkid 16360.

    level 1 blkid = 16360 / 1024 = 15 (integer division; the remainder is discarded)

This calculation can be performed recursively up the tree of indirect blocks until the top level of indirection has been reached.
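The block id calculation can be written as a pair of shifts and masks. This is a sketch with hypothetical function names; it relies only on the 128 byte block pointer size and the power-of-two indirect block size given above.

```c
#include <assert.h>
#include <stdint.h>

#define BLKPTR_SHIFT 7u   /* log2(128): size of one block pointer */

/*
 * Block id, at the given level of indirection, of the block on the path
 * to a level 0 block. Level 0 returns the block id unchanged.
 */
static uint64_t blkid_at_level(uint64_t level0_blkid, unsigned level,
                               uint8_t dn_indblkshift)
{
    assert(dn_indblkshift >= 9 && dn_indblkshift <= 17);
    assert(level <= 6);
    unsigned ptrs_shift = dn_indblkshift - BLKPTR_SHIFT; /* log2(pointers per block) */
    return level0_blkid >> (level * ptrs_shift);
}

/* Index of the block pointer inside its parent indirect block. */
static uint64_t slot_in_parent(uint64_t blkid, uint8_t dn_indblkshift)
{
    assert(dn_indblkshift >= 9 && dn_indblkshift <= 17);
    uint64_t ptrs = 1ULL << (dn_indblkshift - BLKPTR_SHIFT);
    return blkid & (ptrs - 1);
}
```

For the example above, blkid_at_level(16360, 1, 17) is 15, and slot_in_parent(16360, 17) is 1000: the pointer to level 0 block 16360 is entry 1000 of level 1 block 15. The quotient selects the parent block; the remainder selects the slot within it.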

dn_secphys
The sum of all asize values for all block pointers (data and indirect) for this object.

dn_bonus, dn_bonuslen, and dn_bonustype
The bonus buffer (dn_bonus) is defined as the space following a dnode's block pointer array (dn_blkptr). The amount of space is dependent on object type and can range between 64 and 320 bytes.

dn_bonus: dn_bonuslen sized chunk of data. The format of this data is defined by dn_bonustype.

dn_bonuslen: Length (in bytes) of the bonus buffer.

dn_bonustype: 8 bit numeric value identifying the type of data contained within the bonus buffer. The following table shows valid bonus buffer types and the structures which are stored in the bonus buffer. The contents of each of these structures will be discussed later in this specification.

Bonus Type                   Value   Metadata Structure    Description
DMU_OT_PACKED_NVLIST_SIZE    4       uint64_t              Bonus buffer type containing uint64_t size in bytes of a DMU_OT_PACKED_NVLIST object.
DMU_OT_SPACE_MAP_HEADER      7       space_map_obj_t       SPA space map header.
DMU_OT_DSL_DIR               12      dsl_dir_phys_t        DSL Directory object used to define relationships and properties between related datasets.
DMU_OT_DSL_DATASET           16      dsl_dataset_phys_t    DSL dataset object used to organize snapshot and usage static information for objects of type DMU_OT_OBJSET.
DMU_OT_ZNODE                 17      znode_phys_t          ZPL metadata

Table 10 Bonus Buffer Types and associated structures.

Section 3.2: Object Sets

The DMU organizes objects into groups called object sets. Object sets are used in ZFS to group related objects, such as objects in a filesystem, snapshot, clone, or volume. Object sets are represented by a 1K byte objset_phys_t structure. Each member of this structure is defined in detail below.

objset_phys_t

    dnode_phys_t    metadnode;
    zil_header_t    os_zil_header;
    uint64_t        os_type;

Illustration 11 objset_phys_t structure

os_type
The DMU supports several types of object sets, where each object set type has its own well defined format/layout for its objects. The object set's type is identified by a 64 bit integer, os_type. The table below lists available DMU object set types and their associated os_type integer value.

Object Set Type    Description                       Value
DMU_OST_NONE       Uninitialized Object Set          0
DMU_OST_META       DSL Object Set, See Chapter 4     1
DMU_OST_ZFS        ZPL Object Set, See Chapter 6     2
DMU_OST_ZVOL       ZVOL Object Set, See Chapter 8    3

Table 11 DMU Object Set Types

os_zil_header
The ZIL header is described in detail in Chapter 7 of this document.

metadnode
As described earlier in this chapter, each object is described by a dnode_phys_t. The collection of dnode_phys_t structures describing the objects in this object set is stored as an object pointed to by the metadnode. The data contained within this object is formatted as an array of dnode_phys_t structures (one for each object within the object set).

Each object within an object set is uniquely identified by a 64 bit integer called an object number. An object's "object number" identifies the array element, in the dnode array, containing this object's dnode_phys_t.

The illustration below shows an object set with the metadnode expanded. The metadnode contains three block pointers, each of which has been expanded to show its contents. Object number 4 has been further expanded to show the details of the dnode_phys_t and the block structure referenced by this dnode.

[Illustration 12: an objset_phys_t (metadnode, os_zil_header, os_type, os_pad) whose metadnode has dn_type = DMU_OT_DNODE and dn_nblkptr = 3. The three entries of its dn_blkptr[] reference blocks of the dnode array, indexed by object number. One dnode_phys_t in that array, with dn_nlevels = 3 and dn_nblkptr = 3, is expanded to show its level 2, level 1 and level 0 blocks.]

Illustration 12 Object Set
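Because the metadnode's data is an array of 512 byte dnode_phys_t structures indexed by object number, finding an object's dnode is simple arithmetic. The sketch below is illustrative only: `locate_dnode` and `dnode_location_t` are hypothetical names, and the metadnode's data block size is passed in as a parameter (in a real pool it would come from the metadnode's dn_datablkszsec).

```c
#include <assert.h>
#include <stdint.h>

#define DNODE_SIZE 512u   /* size of one dnode_phys_t, per the text */

typedef struct dnode_location {
    uint64_t blkid;    /* level 0 block id within the metadnode's data */
    uint32_t offset;   /* byte offset of the dnode_phys_t in that block */
} dnode_location_t;

/* Map an object number to the position of its dnode in the dnode array. */
static dnode_location_t locate_dnode(uint64_t object, uint32_t meta_block_size)
{
    assert(meta_block_size >= DNODE_SIZE);
    assert(meta_block_size % DNODE_SIZE == 0);

    uint32_t dnodes_per_block = meta_block_size / DNODE_SIZE;
    dnode_location_t loc;
    loc.blkid  = object / dnodes_per_block;
    loc.offset = (uint32_t)(object % dnodes_per_block) * DNODE_SIZE;
    return loc;
}
```

For example, with an assumed 16 Kbyte metadnode block size (32 dnodes per block), object number 4 lives in block 0 at byte offset 2048.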

Chapter Four: DSL

The DSL (Dataset and Snapshot Layer) provides a mechanism for describing and managing relationships between, and properties of, object sets. Before describing the DSL and the relationships it describes, a brief overview of the various flavors of object sets is necessary.

Object Set Overview

ZFS provides the ability to create four kinds of object sets: filesystems, clones, snapshots, and volumes.

ZFS filesystem: A filesystem stores and organizes objects in an easily accessible, POSIX compliant manner.

ZFS clone: A clone is identical to a filesystem with the exception of its origin. Clones originate from snapshots, and a clone's initial contents are identical to those of the snapshot from which it originated.

ZFS snapshot: A snapshot is a read-only version of a filesystem, clone, or volume at a particular point in time.

ZFS volume: A volume is a logical volume that is exported by ZFS as a block device.

ZFS supports several operations and/or configurations which cause interdependencies amongst object sets. The purpose of the DSL is to manage these relationships. The following is a list of such relationships.

Clones: A clone is related to the snapshot from which it originated. Once a clone is created, the snapshot from which it originated can not be deleted unless the clone is also deleted.

Snapshots: A snapshot is a point-in-time image of the data in the object set in which it was created. A filesystem, clone, or volume can not be destroyed unless its snapshots are also destroyed.

Children: ZFS supports hierarchically structured object sets; object sets within object sets. A child is dependent on the existence of its parent. A parent can not be destroyed without first destroying all children.

Section 4.1: DSL Infrastructure

Each object set is represented in the DSL as a dataset. A dataset manages space consumption statistics for an object set, contains object set location information, and keeps track of any snapshot inter-dependencies. Datasets are grouped together hierarchically into collections called Dataset Directories. Dataset Directories manage a related grouping of datasets and the properties associated with that grouping. A DSL directory always has exactly one "active dataset". All other datasets under the DSL directory are related to the "active" dataset through snapshots, clones, or child/parent dependencies.

The following picture shows the DSL infrastructure including a pictorial view of how object set relationships are described via the DSL datasets and DSL directories. The top level DSL Directory can be seen at the top/center of this figure. Directly below the DSL Directory is the "active dataset". The active dataset represents the live filesystem. Originating from the active dataset is a linked list of snapshots which have been taken at different points in time. Each dataset structure points to a DMU Object Set which is the actual object set containing object data. To the left of the top level DSL Directory is a child ZAP[4] object containing a listing of all child/parent dependencies. To the right of the DSL directory is a properties ZAP object containing properties for the datasets within this DSL directory. A listing of all properties can be seen in Table 12 below.

Datasets and DSL Directories are described in detail in the Dataset Internals and DSL Directory Internals sections below.
[Illustration 13: a top level DSL Directory with a DSL Child Dataset ZAP Object (child dataset information, referencing the DSL Directories child1 and child2) to its left and a DSL Properties ZAP Object to its right. Below the directory, the active DSL Dataset heads a linked list of snapshot DSL Datasets. Each DSL Dataset (active or snapshot) points to its own DMU Object Set.]

Illustration 13 DSL Infrastructure

[4] The ZAP is explained in Chapter 5.

Section 4.2: DSL Implementation Details

The DSL is implemented as an object set of type DMU_OST_META. This object set is often called the Meta Object Set, or MOS. There is only one MOS per pool and the uberblock (see Chapter One) points to it directly.

There is a single distinguished object in the Meta Object Set. This object is called the object directory and is always located in the second element of the dnode array (index 1). All objects, with the exception of the object directory, can be located by traversing through a set of object references starting at this object.

The object directory
The object directory is a ZAP object (an object containing name/value pairs; see Chapter 5 for a description of ZAP objects) containing three attribute pairs (name/value) named: root_dataset, config, and sync_bplist.

root_dataset: The "root_dataset" attribute contains a 64 bit integer value identifying the object number of the root DSL directory for the pool. The root DSL directory is a special object whose contents reference all top level datasets within the pool. The "root_dataset" directory is an object of type DMU_OT_DSL_DIR and will be explained in greater detail in Section 4.4: DSL Directory Internals.

config: The "config" attribute contains a 64 bit integer value identifying the object number for an object of type DMU_OT_PACKED_NVLIST. This object contains XDR_ENCODED name value pairs describing this pool's vdev configuration. Its contents are similar to those described in section 1.3.3: name/value pairs list.

sync_bplist: The "sync_bplist" attribute contains a 64 bit integer value identifying the object number for an object of type DMU_OT_SYNC_BPLIST. This object contains a list of block pointers which need to be freed during the next transaction.

The illustration below shows the meta object set (MOS) in relation to the uberblock and label structures discussed in Chapter 1.

[Illustration 14: the uberblock_phys_t array in the vdev label holds the active uberblock (ub_magic, ub_version, ub_txg, ub_guid_sum, ub_timestamp, ub_rootbp). ub_rootbp points to the MOS objset_phys_t (os_type = DMU_OST_META), whose metadnode (dn_type = DMU_OT_DNODE, dn_blkptr[3]) references the dnode array. The dnode at index 1 has dn_type = DMU_OT_OBJECT_DIRECTORY; it is the object directory, a ZAP object holding the root_dataset, config and sync_bplist name/value pairs.]

Illustration 14 Meta Object Set

Section 4.3: Dataset Internals

Datasets are stored as an object of type DMU_OT_DSL_DATASET. This object type uses the bonus buffer in the dnode_phys_t to hold a dsl_dataset_phys_t structure. The contents of the dsl_dataset_phys_t structure are shown below.

uint64_t ds_dir_obj: Object number of the DSL directory referencing this dataset.

uint64_t ds_prev_snap_obj: If this dataset represents a filesystem, volume, or clone, this field contains the 64 bit object number for the most recent snapshot taken; this field is zero if no snapshots have been taken. If this dataset represents a snapshot, this field contains the 64 bit object number for the snapshot taken prior to this snapshot. This field is zero if there are no previous snapshots.

uint64_t ds_prev_snap_txg: The transaction group number when the previous snapshot (pointed to by ds_prev_snap_obj) was taken.

uint64_t ds_next_snap_obj: This field is only used for datasets representing snapshots. It contains the object number of the dataset which is the most recent snapshot. This field is always zero for datasets representing clones, volumes, or filesystems.

uint64_t ds_snapnames_zapobj: Object number of a ZAP object (see Chapter 5) containing name value pairs for each snapshot of this dataset. Each pair contains the name of the snapshot and the object number associated with its DSL dataset structure.

uint64_t ds_num_children: Always zero if not a snapshot. For snapshots, this is the number of references to this snapshot: 1 (from the next snapshot taken, or from the active dataset if no snapshots have been taken) + the number of clones originating from this snapshot.

uint64_t ds_creation_time: Seconds since January 1st 1970 (GMT) when this dataset was created.

uint64_t ds_creation_txg: The transaction group number in which this dataset was created.

uint64_t ds_deadlist_obj: The object number of the deadlist (an array of blkptr's deleted since the last snapshot).

uint64_t ds_used_bytes: Unique bytes used by the object set represented by this dataset.

uint64_t ds_compressed_bytes: Number of compressed bytes in the object set represented by this dataset.

uint64_t ds_uncompressed_bytes: Number of uncompressed bytes in the object set represented by this dataset.

uint64_t ds_unique_bytes: When a snapshot is taken, its initial contents are identical to that of the active copy of the data. As the data changes in the active copy, more and more data becomes unique to the snapshot (the data diverges from the snapshot). As that happens, the amount of data unique to the snapshot increases. The amount of unique snapshot data is stored in this field: it is zero for clones, volumes, and filesystems.

uint64_t ds_fsid_guid: 64 bit ID that is guaranteed to be unique amongst all currently open datasets. Note, this ID could change between successive dataset opens.

uint64_t ds_guid: 64 bit global id for this dataset. This value never changes during the lifetime of the object set.

uint64_t ds_restoring: The field is set to "1" if ZFS is in the process of restoring to this dataset through 'zfs restore'[5].

blkptr_t ds_bp: Block pointer containing the location of the object set that this dataset represents.
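The ds_prev_snap_obj field forms a backward chain from a dataset through its earlier snapshots. The sketch below walks that chain over a small in-memory table. Everything apart from the field name ds_prev_snap_obj is an assumption for illustration: `ds_stub_t`, `lookup` and `count_snapshots` are hypothetical, and a real reader would fetch each dsl_dataset_phys_t from the bonus buffer of the corresponding MOS object instead of searching an array.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a dataset object: only what the walk needs. */
typedef struct ds_stub {
    uint64_t obj;                /* object number of this dataset */
    uint64_t ds_prev_snap_obj;   /* previous snapshot, or 0 if none */
} ds_stub_t;

/* Linear search standing in for "read the dataset with this object number". */
static const ds_stub_t *lookup(const ds_stub_t *tbl, size_t n, uint64_t obj)
{
    for (size_t i = 0; i < n; i++)
        if (tbl[i].obj == obj)
            return &tbl[i];
    return NULL;
}

/* Count the snapshots reachable from start_obj via ds_prev_snap_obj. */
static size_t count_snapshots(const ds_stub_t *tbl, size_t n, uint64_t start_obj)
{
    size_t count = 0;
    const ds_stub_t *ds = lookup(tbl, n, start_obj);

    /* count < n bounds the walk even if the chain were corrupt and cyclic. */
    while (ds != NULL && ds->ds_prev_snap_obj != 0 && count < n) {
        ds = lookup(tbl, n, ds->ds_prev_snap_obj);
        if (ds != NULL)
            count++;
    }
    return count;
}
```

Starting from the active dataset, the walk ends at the oldest snapshot, whose ds_prev_snap_obj is zero.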

Section 4.4: DSL Directory Internals

The DSL Directory object contains a dsl_dir_phys_t structure in its bonus buffer. The contents of this structure are described in detail below.

uint64_t dd_creation_time: Seconds since January 1st, 1970 (GMT) when this DSL directory was created.

uint64_t dd_head_dataset_obj: 64 bit object number of the active dataset object.

uint64_t dd_parent_obj: 64 bit object number of the parent DSL directory.

uint64_t dd_clone_parent_obj: For cloned object sets, this field contains the object number of the snapshot used to create this clone.

uint64_t dd_child_dir_zapobj: Object number of a ZAP object containing name-value pairs for each child of this DSL directory.

uint64_t dd_used_bytes: Number of bytes used by all datasets within this directory: includes any snapshot and child dataset used bytes.

uint64_t dd_compressed_bytes: Number of compressed bytes for all datasets within this DSL directory.

uint64_t dd_uncompressed_bytes: Number of uncompressed bytes for all datasets within this DSL directory.

uint64_t dd_quota: Designated quota, if any, which can not be exceeded by the datasets within this DSL directory.

uint64_t dd_reserved: The amount of space reserved for consumption by the datasets within this DSL directory.

uint64_t dd_props_zapobj: 64 bit object number of a ZAP object containing the properties for all datasets within this DSL directory. Only the non-inherited / locally set values are represented in this ZAP object. Default, inherited values are inferred when there is an absence of an entry.

[5] See the ZFS Admin Guide for information about the zfs command.

The following table shows valid property values.

Property       Description                                                                          Values
aclinherit     Controls inheritance behavior for datasets.                                          discard = 0, noallow = 1, passthrough = 3, secure = 4 (default)
aclmode        Controls chmod and file/dir creation behavior for datasets.                          discard = 0, groupmask = 2 (default), passthrough = 3
atime          Controls whether atime is updated on objects within a dataset.                       off = 0, on = 1 (default)
checksum       Checksum algorithm for all datasets within this DSL Directory.                       on = 1 (default), off = 0
compression    Compression algorithm for all datasets within this DSL Directory.                    on = 1, off = 0 (default)
devices        Controls whether device nodes can be opened on datasets.                             devices = 0, nodevices = 1 (default)
exec           Controls whether files can be executed on a dataset.                                 exec = 1 (default), noexec = 0
mountpoint     Mountpoint path for datasets within this DSL Directory.                              string
quota          Limits the amount of space all datasets within a DSL directory can consume.          quota size in bytes, or zero for no quota (default)
readonly       Controls whether objects can be modified on a dataset.                               readonly = 1, readwrite = 0 (default)
recordsize     Block size for all objects within the datasets contained in this DSL Directory.      recordsize in bytes
reservation    Amount of space reserved for this DSL Directory, including all child datasets and child DSL Directories.    reservation size in bytes
setuid         Controls whether the set-UID bit is respected on a dataset.                          setuid = 1 (default), nosetuid = 0
sharenfs       Controls whether the datasets in a DSL Directory are shared by NFS.                  string: any valid nfs share options
snapdir        Controls whether .zfs is hidden or visible in the root filesystem.                   hidden = 0, visible = 1 (default)
volblocksize   For volumes, specifies the block size of the volume. The blocksize cannot be changed once the volume has been written, so it should be set at volume creation time.    between 512 and 128K, powers of two; defaults to 8K
volsize        Volume size, only applicable to volumes.                                             volume size in bytes
zoned          Controls whether a dataset is managed through a local zone.                          on = 1, off = 0 (default)

Table 12 Editable Property Values stored in the dd_props_zapobj

Chapter Five: ZAP

The ZAP (ZFS Attribute Processor) is a module which sits on top of the DMU and operates on objects called ZAP objects. A ZAP object is a DMU object used to store attributes in the form of name-value pairs. The name portion of the attribute is a zero-terminated string of up to 256 bytes (including terminating NULL). The value portion of the attribute is an array of integers whose size is only limited by the size of a ZAP data block.

ZAP objects are used to store properties for a dataset, navigate filesystem objects, store pool properties and more. The following table contains a list of ZAP object types.

ZAP Object Type
DMU_OT_OBJECT_DIRECTORY
DMU_OT_DSL_DIR_CHILD_MAP
DMU_OT_DSL_DS_SNAP_MAP
DMU_OT_DSL_PROPS
DMU_OT_DIRECTORY_CONTENTS
DMU_OT_MASTER_NODE
DMU_OT_DELETE_QUEUE
DMU_OT_ZVOL_PROP

Table 13 ZAP Object Types

ZAP objects come in two forms: microzap objects and fatzap objects. Microzap objects are a lightweight version of the fatzap and provide a simple and fast lookup mechanism for a small number of attribute entries. The fatzap is better suited for ZAP objects containing large numbers of attributes.

The following guidelines are used by ZFS to decide whether or not to use a fatzap or a microzap object. A microzap object is used if all three conditions below are met:

    All name-value pair entries fit into one block. The maximum data block size in ZFS is 128KB and this size block can fit up to 2047 microzap entries.
    The value portion of all attributes is of type uint64_t.
    The name portion of each attribute is less than or equal to 50 characters in length (including NULL terminating character).

If any of the above conditions are not met, a fatzap object is used.

The first 64 bit word in each block of a ZAP object is used to identify the type of ZAP contents contained within this block. The table below shows these values.

Identifier    Description                                                                                                             Value
ZBT_MICRO     This block contains microzap entries.                                                                                   (1ULL << 63) + 3
ZBT_HEADER    This block is used for the fatzap. This identifier is only used for the first block in a fatzap object.                 (1ULL << 63) + 1
ZBT_LEAF      This block is used for the fatzap. This identifier is used for all blocks in the fatzap with the exception of the first.   (1ULL << 63) + 0

Table 14 ZAP Object Block Types
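The three microzap conditions and the block type identifiers can be expressed compactly. This is a sketch, not the ZFS decision code: `mzap_capacity` and `can_use_microzap` are hypothetical helpers, and the caller is assumed to have already gathered the entry count, the longest name length and whether every value is a single uint64_t.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Block type identifiers from Table 14. */
#define ZBT_LEAF    ((1ULL << 63) + 0)
#define ZBT_HEADER  ((1ULL << 63) + 1)
#define ZBT_MICRO   ((1ULL << 63) + 3)

#define MZAP_ENT_LEN        64u              /* bytes per microzap entry */
#define MZAP_MAX_NAME_LEN   50u              /* includes the terminator */
#define MAX_BLOCK_SIZE      (128u * 1024u)   /* largest ZFS data block */

/*
 * Entries that fit in one microzap block. The first 64 bytes of the block
 * hold the header fields, so one 64-byte slot is not available for entries.
 */
static uint32_t mzap_capacity(uint32_t block_size)
{
    assert(block_size >= 2 * MZAP_ENT_LEN && block_size <= MAX_BLOCK_SIZE);
    return block_size / MZAP_ENT_LEN - 1;
}

/* Nonzero if all three microzap conditions hold; otherwise a fatzap is needed. */
static int can_use_microzap(size_t num_entries,
                            size_t longest_name_with_nul,
                            int all_values_single_uint64)
{
    return num_entries <= mzap_capacity(MAX_BLOCK_SIZE)
        && longest_name_with_nul <= MZAP_MAX_NAME_LEN
        && all_values_single_uint64;
}
```

mzap_capacity(128 * 1024) evaluates to 2047, matching the entry limit quoted in the first condition.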

Section 5.1: The Micro Zap

The microzap implements a simple mechanism for storing a small number of attributes. A microzap object consists of a single block containing an array of microzap entries (mzap_ent_phys_t structures). Each attribute stored in a microzap object is represented by one of these microzap entry structures.

A microzap block is laid out as follows: the first 128 bytes of the block contain a microzap header structure called the mzap_phys_t. This structure contains a 64 bit ZBT_MICRO value indicating that this block is used to store microzap entries. Following this value is a 64 bit salt value that is stirred into the hash so that the hash function is different for each ZAP object. The next 48 bytes of this header are intentionally left blank (128 - 8 - 8 - 64 = 48) and the last 64 bytes contain the first microzap entry (a structure of type mzap_ent_phys_t). The remaining bytes in this block are used to store an array of mzap_ent_phys_t structures. The illustration below shows the layout of this block.
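[Illustration 15: a microzap block. The first 128 bytes hold the block type, the salt, padding and the first mzap_ent_phys_t; the rest of the block is the mzap_ent_phys_t array.]

Illustration 15 Microzap block layout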

The mzap_ent_phys_t structure and associated #defines are shown below.

    #define MZAP_ENT_LEN    64
    #define MZAP_NAME_LEN   (MZAP_ENT_LEN - 8 - 4 - 2)

    typedef struct mzap_ent_phys {
        uint64_t mze_value;
        uint32_t mze_cd;
        uint16_t mze_pad;
        char     mze_name[MZAP_NAME_LEN];
    } mzap_ent_phys_t;

mze_value: 64 bit integer.

mze_cd: 32 bit collision differentiator ("CD"): associated with an entry whose hash value is the same as another entry within this ZAP object. When an entry is inserted into the ZAP object, the lowest CD which is not already used by an entry with the same hash value is assigned. In the absence of hash collisions, the CD value will be zero.

mze_pad: reserved for future use.

mze_name: NULL terminated string less than or equal to 50 characters in length.
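The collision differentiator rule ("lowest CD not already used by an entry with the same hash") can be sketched as below. The structure is restated so the example compiles on its own; `pick_cd` is a hypothetical helper, and because the on-disk entry does not store the name hash, the sketch assumes the caller supplies a parallel array of precomputed hashes.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MZAP_ENT_LEN    64
#define MZAP_NAME_LEN   (MZAP_ENT_LEN - 8 - 4 - 2)

typedef struct mzap_ent_phys {
    uint64_t mze_value;
    uint32_t mze_cd;
    uint16_t mze_pad;
    char     mze_name[MZAP_NAME_LEN];
} mzap_ent_phys_t;

/* 8 + 4 + 2 + 50 bytes: each entry occupies exactly one 64-byte slot. */
static_assert(sizeof(mzap_ent_phys_t) == MZAP_ENT_LEN,
              "microzap entry must be 64 bytes");

/*
 * Return the lowest CD not used by any existing entry whose name hash
 * equals new_hash. hashes[i] is the (assumed precomputed) hash of
 * entries[i].mze_name.
 */
static uint32_t pick_cd(const mzap_ent_phys_t *entries, const uint64_t *hashes,
                        size_t n, uint64_t new_hash)
{
    uint32_t cd = 0;
    for (;;) {
        int used = 0;
        for (size_t i = 0; i < n; i++) {
            if (hashes[i] == new_hash && entries[i].mze_cd == cd) {
                used = 1;
                break;
            }
        }
        if (!used)
            return cd;   /* at most n CDs can be taken, so this terminates */
        cd++;
    }
}
```

With no colliding entries the function returns zero, as the text describes.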

Section 5.2: The Fat Zap

The fatzap implements a flexible architecture for storing large numbers of attributes, and/or attributes with long names or complex values (not uint64_t). This section begins with an explanation of the basic structure of a fatzap object and is followed by a detailed explanation of each component of a fatzap object.

All entries in a fatzap object are arranged based on a 64 bit hash of the attribute's name. The hash is used to index into a pointer table (as can be seen on the left side of the illustration below). The number of bits used to index into this table (sometimes called the prefix) is dependent on the number of entries in the table. The number of entries in the table can change over time. As policy stands today, the pointer table will grow if the number of entries hashing to a particular bucket exceeds the capacity of one leaf block (explained in detail below).

The pointer table entries reference a chain of fatzap blocks called leaf blocks, represented by the zap_leaf_phys structure. Each leaf block is broken up into some number of chunks (zap_leaf_chunks) and each attribute is stored in one or more of these leaf chunks. The illustration below shows the basic fatzap structures; each component is explained in detail in the following sections.
[Illustration 16: the first block in the ZAP object holds the zap_phys_t, including the pointer table. Pointer table entries reference zap_leaf_phys_t blocks, each of which contains zap leaf chunks.]

Illustration 16 fatzap structure overview

Section 5.2.1: zap_phys_t

The first block of a fatzap object contains a 128KB zap_phys_t structure. Depending on the size of the pointer table, this structure may contain the pointer table. If the pointer table is too large to fit in the space provided by the zap_phys_t, some information about where it can be found is stored in the zap_table_phys portion of this structure. The definitions of the zap_phys_t contents are as follows:
zap_phys_t

    uint64_t zap_block_type;
    uint64_t zap_magic;
    struct zap_table_phys {
        uint64_t zt_blk;
        uint64_t zt_numblks;
        uint64_t zt_shift;
        uint64_t zt_nextblk;
        uint64_t zt_blks_copied;
    } zap_ptrtbl;
    uint64_t zap_freeblk;
    uint64_t zap_num_leafs;
    uint64_t zap_num_entries;
    uint64_t zap_salt;
    uint64_t zap_pad[8181];
    uint64_t zap_leafs[8192];

Illustration 17 zap_phys_t structure

zap_block_type: 64 bit integer identifying type of ZAP block. Always set to ZBT_HEADER (see Table 14) for the first block in the fatzap.
zap_magic: 64 bit integer containing the ZAP magic number: 0x2F52AB2AB (zfs-zap-zap)
zap_table_phys: structure whose contents are used to describe the pointer table
    zt_blk: Blkid for the first block of the pointer table. This field is only used when the pointer table is external to the zap_phys_t structure; zero otherwise.
    zt_numblks: Number of blocks used to hold the pointer table. This field is only used when the pointer table is external to the zap_phys_t structure; zero otherwise.
    zt_shift: Number of bits used from the hash value to index into the pointer table. If the pointer table is contained within the zap_phys, this value will be 13.
    uint64_t zt_nextblk:
    uint64_t zt_blks_copied:
    The above two fields are used when the pointer table changes sizes.
zap_freeblk: 64 bit integer containing the first available ZAP block that can be used to allocate a new zap_leaf.
zap_num_leafs: Number of zap_leaf_phys_t structures (described below) contained within this ZAP object.
zap_salt: The salt value is a 64 bit integer that is stirred into the hash function, so that the hash function is different for each ZAP object.
zap_num_entries: Number of attributes stored in this ZAP object.
zap_leafs[8192]: The zap_leaf array contains 2^13 (8192) slots. If the pointer table has fewer than 2^13 entries, the pointer table will be stored here. If not, this field is unused.
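The relationship between the zap_table_phys fields can be sketched in C. This is only an illustration under stated assumptions: the helper names are hypothetical (they are not taken from the ZFS source), the table is treated as embedded exactly when zt_numblks is zero, and the table is assumed to hold 2^zt_shift entries because zt_shift bits are used to index it.

```c
#include <assert.h>
#include <stdint.h>

/* Pointer-table descriptor, with the fields listed for zap_table_phys. */
struct zap_table_phys {
	uint64_t zt_blk;
	uint64_t zt_numblks;
	uint64_t zt_shift;
	uint64_t zt_nextblk;
	uint64_t zt_blks_copied;
};

/*
 * Hypothetical helper: nonzero if the pointer table is stored in
 * zap_leafs[] inside the zap_phys_t (assumption: zt_numblks == 0).
 */
static int
zap_ptrtbl_is_embedded(const struct zap_table_phys *zt)
{
	assert(zt != 0);
	return (zt->zt_numblks == 0);
}

/*
 * Hypothetical helper: number of pointer table entries, assumed to be
 * 2^zt_shift since zt_shift hash bits select an entry.
 */
static uint64_t
zap_ptrtbl_entries(const struct zap_table_phys *zt)
{
	assert(zt != 0 && zt->zt_shift < 64);
	return ((uint64_t)1 << zt->zt_shift);
}
```

With zt_shift equal to 13 the second helper yields 8192, which matches the number of slots in zap_leafs.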

Section 5.2.2: Pointer Table


The pointer table is a hash table which uses a chaining method to handle collisions. Each hash bucket contains a 64 bit integer which describes the level zero block id (see Chapter 3 for a description of block ids) of the first element in the chain of entries hashed here. An entry's hash bucket is determined by using the first few bits of the 64 bit ZAP entry hash computed from the attribute's name. The value used to index into the pointer table is called the prefix and is the zt_shift high order bits of the 64 bit computed hash.
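Taking the high order bits of the hash amounts to a single right shift. The following C sketch shows this; the function name is hypothetical and not part of the ZFS source.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper: return the pointer table index (the "prefix")
 * for a 64 bit ZAP hash, i.e. its zt_shift high order bits.
 */
static uint64_t
zap_prefix_index(uint64_t hash, unsigned zt_shift)
{
	/* A shift by 64 is undefined in C, so zt_shift must be at least 1. */
	assert(zt_shift >= 1 && zt_shift <= 64);
	return (hash >> (64 - zt_shift));
}
```

For an embedded pointer table (zt_shift of 13), the result always falls in the range 0 to 8191.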

Section 5.2.3: zap_leaf_phys_t


The zap_leaf_phys_t is the structure referenced by the pointer table. Collisions in the pointer table result in zap_leaf_phys_t structures being strung together in a linked list fashion. The zap_leaf_phys_t structure contains a header, a hash table, and some number of chunks.
typedef struct zap_leaf_phys {
    struct zap_leaf_header {
        uint64_t lhr_block_type;
        uint64_t lhr_next;
        uint64_t lhr_prefix;
        uint32_t lhr_magic;
        uint16_t lhr_nfree;
        uint16_t lhr_nentries;
        uint16_t lhr_prefix_len;
        uint16_t lh_freelist;
        uint8_t lh_pad2[12];
    } l_hdr; /* 2 24-byte chunks */

    uint16_t l_hash[ZAP_LEAF_HASH_NUMENTRIES];

    union zap_leaf_chunk {
        struct zap_leaf_entry {
            uint8_t le_type;
            uint8_t le_int_size;
            uint16_t le_next;
            uint16_t le_name_chunk;
            uint16_t le_name_length;
            uint16_t le_value_chunk;
            uint16_t le_value_length;
            uint16_t le_cd;
            uint8_t le_pad[2];
            uint64_t le_hash;
        } l_entry;
        struct zap_leaf_array {
            uint8_t la_type;
            uint8_t la_array[ZAP_LEAF_ARRAY_BYTES];
            uint16_t la_next;
        } l_array;
        struct zap_leaf_free {
            uint8_t lf_type;
            uint8_t lf_pad[ZAP_LEAF_ARRAY_BYTES];
            uint16_t lf_next;
        } l_free;
    } l_chunk[ZAP_LEAF_NUMCHUNKS];
} zap_leaf_phys_t;

Header
The header for the ZAP leaf is stored in a zap_leaf_header structure. Its description is as follows:
lhr_block_type: always ZBT_LEAF (see Table 14 for values)
lhr_next: 64 bit integer block id for the next leaf in a block chain.
lhr_prefix and lhr_prefix_len: Each leaf (or chain of leafs) stores the ZAP entries whose first lhr_prefix_len bits of their hash value equal lhr_prefix. lhr_prefix_len can be equal to or less than zt_shift (the number of bits used to index into the pointer table), in which case multiple pointer table buckets reference the same leaf.
lhr_magic: leaf magic number == 0x2AB1EAF (zap-leaf)
lhr_nfree: number of free chunks in this leaf (chunks described below)
lhr_nentries: number of ZAP entries stored in this leaf
lhr_freelist: head of a list of free chunks, 16 bit integer used to index into the zap_leaf_chunk array (this field is declared as lh_freelist in the structure above)

Leaf Hash
The next 8KB of the zap_leaf_phys_t is the zap leaf hash table. The entries in the hash table reference chunks of type zap_leaf_entry. Twelve bits (the twelve following the lhr_prefix_len bits used to uniquely identify this block) of the attribute's hash value are used to index into this table. Hash table collisions are handled by chaining entries. Each bucket in the table contains a 16 bit integer which is the index into the zap_leaf_chunk array.
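As a sketch of the indexing rule above: the leaf hash index is the 12 bit field that sits immediately below the prefix bits of the 64 bit hash. The function name is hypothetical, and the code assumes the prefix occupies the highest lhr_prefix_len bits of the hash, as described for the pointer table.

```c
#include <assert.h>
#include <stdint.h>

#define LEAF_HASH_BITS 12 /* 8KB of 16 bit buckets = 4096 = 2^12 entries */

/*
 * Hypothetical helper: extract the 12 hash bits that follow the first
 * prefix_len (lhr_prefix_len) high order bits of the hash.
 */
static unsigned
zap_leaf_hash_index(uint64_t hash, unsigned prefix_len)
{
	assert(prefix_len <= 64 - LEAF_HASH_BITS);
	return ((unsigned)((hash >> (64 - LEAF_HASH_BITS - prefix_len)) &
	    ((1u << LEAF_HASH_BITS) - 1)));
}
```

The bucket found at this index then holds the 16 bit index of the first zap_leaf_entry chunk on that bucket's chain.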

Section 5.2.4: zap_leaf_chunk


Each leaf contains an array of chunks. There are three types of chunks: zap_leaf_entry, zap_leaf_array, and zap_leaf_free. Each attribute is represented by some number of these chunks: one zap_leaf_entry and some number of zap_leaf_array chunks. The illustration below shows how these chunks are arranged. A detailed description of each chunk type follows the illustration.
[Figure: zap_leaf_entry chunks are chained through le_next, with the last entry holding le_next = 0xffff. Each entry references chains of zap_leaf_array chunks (la_array[21]), and the last chunk in each chain holds la_next = 0xffff.]

Illustration 18 zap leaf structure

zap_leaf_entry: The leaf hash table (described above) points to chunks of this type. This entry contains pointers to chunks of type zap_leaf_array which hold the name and value for the attributes being stored here.
    le_type: ZAP_LEAF_ENTRY == 252
    le_int_size: Size of integers in bytes for this entry.
    le_next: Next entry in the zap_leaf_chunk chain. Chains occur when there are collisions in the hash table. The end of the chain is designated by a le_next value of 0xffff.
    le_name_chunk: 16 bit integer identifying the chunk of type zap_leaf_array which contains the first 21 characters of this attribute's name.
    le_name_length: The length of the attribute's name, including the NULL character.
    le_value_chunk: 16 bit integer identifying the first chunk (type zap_leaf_array) containing the first 21 bytes of the attribute's value.
    le_value_length: The length, in integer increments (le_int_size)
    le_cd: The collision differentiator ("CD") is a value associated with an entry whose hash value is the same as another entry within this ZAP object. When an entry is inserted into the ZAP object, the lowest CD which is not already used by an entry with the same hash value is assigned. In the absence of hash collisions, the CD value will be zero.
    le_hash: 64 bit hash of this attribute's name.
zap_leaf_array: Chunks of the zap_leaf_array hold either the name or the value of the ZAP attribute. These chunks can be strung together to provide for long names or large values. zap_leaf_array chunks are pointed to by a zap_leaf_entry chunk.
    la_type: ZAP_LEAF_ARRAY == 251
    la_array: 21 byte array containing the name or value's value. Values of type "integer" are always stored in big endian format, regardless of the machine's native endianness.
    la_next: 16 bit integer used to index into the zap_leaf_chunk array and references the next zap_leaf_array chunk for this attribute; a value of 0xffff (CHAIN_END) is used to designate the end of the chain
zap_leaf_free: Unused chunks are kept in a chained free list. The root of the free list is stored in the leaf header.
    lf_type: ZAP_LEAF_FREE == 253
    lf_next: 16 bit integer pointing to the next free chunk.
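Reassembling a name or value from its chain of zap_leaf_array chunks can be sketched as follows. This is a simplified, in-memory illustration: the helper name and its error convention are hypothetical, and the chunk array is modelled as a plain array of zap_leaf_array structures instead of the on-disk union.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define ZAP_LEAF_ARRAY_BYTES 21     /* payload bytes per array chunk */
#define ZAP_LEAF_ARRAY       251    /* la_type of an array chunk */
#define CHAIN_END            0xffff /* terminates a chunk chain */

struct zap_leaf_array {
	uint8_t  la_type;
	uint8_t  la_array[ZAP_LEAF_ARRAY_BYTES];
	uint16_t la_next;
};

/*
 * Hypothetical helper: copy len bytes of a name or value that starts at
 * chunk index first (le_name_chunk or le_value_chunk) into out.
 * Returns 0 on success, or -1 if the chain ends early, leaves the chunk
 * array, or reaches a chunk that is not of type ZAP_LEAF_ARRAY.
 */
static int
zap_array_read(const struct zap_leaf_array *chunks, size_t nchunks,
    uint16_t first, uint8_t *out, size_t len)
{
	uint16_t idx = first;
	size_t done = 0;

	assert(chunks != NULL && out != NULL);
	while (done < len) {
		size_t n = len - done;

		if (idx == CHAIN_END || idx >= nchunks ||
		    chunks[idx].la_type != ZAP_LEAF_ARRAY)
			return (-1);
		if (n > ZAP_LEAF_ARRAY_BYTES)
			n = ZAP_LEAF_ARRAY_BYTES;
		memcpy(out + done, chunks[idx].la_array, n);
		done += n;
		idx = chunks[idx].la_next; /* follow the chain */
	}
	return (0);
}
```

For a name, len would be le_name_length; for a value it would be le_value_length multiplied by le_int_size, and integer values would still need converting from big endian.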


Chapter Six - ZPL
The ZPL, ZFS POSIX Layer, makes DMU objects look like a POSIX filesystem. POSIX is a standard defining the set of services a filesystem must provide. ZFS filesystems provide all of these required services. The ZPL represents filesystems as an object set of type DMU_OST_ZFS. All snapshots, clones and filesystems are implemented as an object set of this type. The ZPL uses a well defined format for organizing objects in its object set. The section below describes this layout.

Section 6.1: ZPL Filesystem Layout


A ZPL object set has one object with a fixed location and fixed object number. This object is called the "master node" and always has an object number of 1. The master node is a ZAP object containing three attributes: DELETE_QUEUE, VERSION, and ROOT.

Name: DELETE_QUEUE
Value: 64 bit object number for the delete queue object
Description: The delete queue provides a list of deletes that were in progress when the filesystem was force unmounted or as a result of a system failure such as a power outage. Upon the next mount of the filesystem, the delete queue is processed to remove the files/dirs that are in the delete queue. This mechanism is used to avoid leaking files and directories in the filesystem.

Name: VERSION
Value: Currently a value of "1".
Description: ZPL version used to lay out this filesystem.

Name: ROOT
Value: 64 bit object number
Description: This attribute's value contains the object number for the top level directory in this filesystem, the root directory.

Section 6.2: Directories and Directory Traversal


Filesystem directories are implemented as ZAP objects (object type DMU_OT_DIRECTORY). Each directory holds a set of name-value pairs which contain the names and object numbers for each directory entry. Traversing through a directory tree is as simple as looking up the value for an entry and reading that object number. All filesystem objects contain a znode_phys_t structure in the bonus buffer of their dnode. This structure stores the attributes for the filesystem object. The znode_phys_t structure is shown below.


typedef struct znode_phys {
    uint64_t zp_atime[2];
    uint64_t zp_mtime[2];
    uint64_t zp_ctime[2];
    uint64_t zp_crtime[2];
    uint64_t zp_gen;
    uint64_t zp_mode;
    uint64_t zp_size;
    uint64_t zp_parent;
    uint64_t zp_links;
    uint64_t zp_xattr;
    uint64_t zp_rdev;
    uint64_t zp_flags;
    uint64_t zp_uid;
    uint64_t zp_gid;
    uint64_t zp_pad[4];
    zfs_znode_acl_t zp_acl;
} znode_phys_t;

zp_atime: Two 64 bit integers containing the last file access time in seconds (zp_atime[0]) and nanoseconds (zp_atime[1]) since January 1st 1970 (GMT).
zp_mtime: Two 64 bit integers containing the last file modification time in seconds (zp_mtime[0]) and nanoseconds (zp_mtime[1]) since January 1st 1970 (GMT).
zp_ctime: Two 64 bit integers containing the last file change time in seconds (zp_ctime[0]) and nanoseconds (zp_ctime[1]) since January 1st 1970 (GMT).
zp_crtime: Two 64 bit integers containing the file's creation time in seconds (zp_crtime[0]) and nanoseconds (zp_crtime[1]) since January 1st 1970 (GMT).
zp_gen: 64 bit generation number, contains the transaction group number of the creation of this file.
zp_mode: 64 bit integer containing file mode bits and file type. The lower 9 bits of the mode contain the access mode bits, for example 755 (octal). The 10th bit is the sticky bit and can be a value of zero or one. Bits 13-16 are used to designate the file type. The file types can be seen in the table below.


Type        Description                 Value in bits 13-16
S_IFIFO     Fifo                        0x1
S_IFCHR     Character Special Device    0x2
S_IFDIR     Directory                   0x4
S_IFBLK     Block special device        0x6
S_IFREG     Regular file                0x8
S_IFLNK     Symbolic Link               0xA
S_IFSOCK    Socket                      0xC
S_IFDOOR    Door                        0xD
S_IFPORT    Event Port                  0xE

Table 15 File Types and their associated mode bits
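Extracting the file type from zp_mode is a shift and a mask. The sketch below assumes bits are numbered from 1 at the least significant end, so that bits 13-16 are the nibble at bit offset 12. The ZP_IF* constant names and the helper names are hypothetical, chosen to avoid clashing with the S_IF* macros in system headers.

```c
#include <assert.h>
#include <stdint.h>

/* File type values as stored in bits 13-16 of zp_mode (hypothetical names). */
enum zp_file_type {
	ZP_IFIFO  = 0x1,
	ZP_IFCHR  = 0x2,
	ZP_IFDIR  = 0x4,
	ZP_IFBLK  = 0x6,
	ZP_IFREG  = 0x8,
	ZP_IFLNK  = 0xA,
	ZP_IFSOCK = 0xC,
	ZP_IFDOOR = 0xD,
	ZP_IFPORT = 0xE
};

/* Hypothetical helper: return the 4 bit file type field of zp_mode. */
static unsigned
zp_mode_file_type(uint64_t zp_mode)
{
	return ((unsigned)((zp_mode >> 12) & 0xF));
}

/* Hypothetical helper: nonzero if zp_mode designates the given type. */
static int
zp_mode_is_type(uint64_t zp_mode, unsigned type)
{
	assert(type <= 0xF);
	return (zp_mode_file_type(zp_mode) == type);
}
```

For example, a mode of 0100755 (octal) decodes to the regular file type with access bits 755.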

zp_size: size of file in bytes
zp_parent: object id of the parent directory containing this file
zp_links: number of hard links to this file
zp_xattr: object ID of a ZAP object which is the hidden attribute directory. It is treated like a normal directory in ZFS, except that it is hidden and an application will need to "tunnel" into the file via openat() to get to it.
zp_rdev: dev_t for files of type S_IFCHR or S_IFBLK
zp_flags: Persistent flags set on the file. The following are valid flag values.
Flag               Value
ZFS_XATTR          0x1
ZFS_INHERIT_ACE    0x2

Table 16 zp_flag values

zp_uid: 64 bit integer (uid_t) of the file's owner.
zp_gid: 64 bit integer (gid_t) owning group of the file.
zp_acl: zfs_znode_acl structure containing any ACL entries set on this file. The zfs_znode_acl structure is defined below.

Section 6.3: ZFS Access Control Lists


Access control lists (ACL) serve as a mechanism to allow or restrict user access privileges on a ZFS object. ACLs are implemented in ZFS as a table containing ACEs (Access Control Entries). The znode_phys_t contains a zfs_znode_acl structure. This structure is shown below.
#define ACE_SLOT_CNT 6

typedef struct zfs_znode_acl {
    uint64_t z_acl_extern_obj;
    uint32_t z_acl_count;
    uint16_t z_acl_version;
    uint16_t z_acl_pad;
    ace_t z_ace_data[ACE_SLOT_CNT];
} zfs_znode_acl_t;

z_acl_extern_obj: Used for holding ACLs that won't fit in the znode. In other words, it is for ACLs with more than 6 ACEs. The object type of an extern ACL is DMU_OT_ACL.
z_acl_count: number of ACE entries that make up an ACL
z_acl_version: reserved for future use.
z_acl_pad: reserved for future use.
z_ace_data: Array of up to 6 ACEs.

An ACE specifies an access right to an individual user or group for a specific object.
typedef struct ace {
    uid_t a_who;
    uint32_t a_access_mask;
    uint16_t a_flags;
    uint16_t a_type;
} ace_t;

a_who: This field is only meaningful when the ACE_OWNER, ACE_GROUP and ACE_EVERYONE flags (set in a_flags, described below) are not asserted. The a_who field contains a UID or GID. If the ACE_IDENTIFIER_GROUP flag is set in a_flags (see below), the a_who field will contain a GID. Otherwise, this field will contain a UID.
a_access_mask: 32 bit access mask. The table below shows the access attribute associated with each bit.


Attribute                 Value
ACE_READ_DATA             0x00000001
ACE_LIST_DIRECTORY        0x00000001
ACE_WRITE_DATA            0x00000002
ACE_ADD_FILE              0x00000002
ACE_APPEND_DATA           0x00000004
ACE_ADD_SUBDIRECTORY      0x00000004
ACE_READ_NAMED_ATTRS      0x00000008
ACE_WRITE_NAMED_ATTRS     0x00000010
ACE_EXECUTE               0x00000020
ACE_DELETE_CHILD          0x00000040
ACE_READ_ATTRIBUTES       0x00000080
ACE_WRITE_ATTRIBUTES      0x00000100
ACE_DELETE                0x00010000
ACE_READ_ACL              0x00020000
ACE_WRITE_ACL             0x00040000
ACE_WRITE_OWNER           0x00080000
ACE_SYNCHRONIZE           0x00100000

Table 17 Access Mask Values

a_flags: 16 bit integer whose value describes the ACL entry type and inheritance flags.
ACE flag                          Value
ACE_FILE_INHERIT_ACE              0x0001
ACE_DIRECTORY_INHERIT_ACE         0x0002
ACE_NO_PROPAGATE_INHERIT_ACE      0x0004
ACE_INHERIT_ONLY_ACE              0x0008
ACE_SUCCESSFUL_ACCESS_ACE_FLAG    0x0010
ACE_FAILED_ACCESS_ACE_FLAG        0x0020
ACE_IDENTIFIER_GROUP              0x0040
ACE_OWNER                         0x1000
ACE_GROUP                         0x2000
ACE_EVERYONE                      0x4000

Table 18 Entry Type and Inheritance Flag Value

a_type: The type of this ACE. The valid types are listed in the table below.


Type                           Value     Description
ACE_ACCESS_ALLOWED_ACE_TYPE    0x0000    Grants access as described in a_access_mask.
ACE_ACCESS_DENIED_ACE_TYPE     0x0001    Denies access as described in a_access_mask.
ACE_SYSTEM_AUDIT_ACE_TYPE      0x0002    Audit the successful or failed accesses (depending on the presence of the successful/failed access flags) as defined in the a_access_mask. (See footnote 6.)
ACE_SYSTEM_ALARM_ACE_TYPE      0x0003    Alarm the successful or failed accesses as defined in the a_access_mask. (See footnote 7.)

Table 19 ACE Types and Values

6 The action taken as an effect of triggering an audit is currently undefined in Solaris.
7 The action taken as an effect of triggering an alarm is currently undefined in Solaris.
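The rules for interpreting a_who can be sketched in C as below. The enum, the function name, and the use of uint32_t in place of uid_t are assumptions made for a self-contained illustration; the flag values are those listed in the a_flags table. Checking the ACE_OWNER, ACE_GROUP and ACE_EVERYONE flags before ACE_IDENTIFIER_GROUP is a design choice of this sketch, so that a_who is only consulted when it is meaningful.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define ACE_IDENTIFIER_GROUP 0x0040
#define ACE_OWNER            0x1000
#define ACE_GROUP            0x2000
#define ACE_EVERYONE         0x4000

typedef struct ace {
	uint32_t a_who; /* uid_t in the specification; fixed width here */
	uint32_t a_access_mask;
	uint16_t a_flags;
	uint16_t a_type;
} ace_t;

/* Hypothetical classification of who an ACE applies to. */
enum ace_subject {
	ACE_SUBJ_OWNER,    /* file owner; a_who not meaningful */
	ACE_SUBJ_GROUP,    /* owning group; a_who not meaningful */
	ACE_SUBJ_EVERYONE, /* everyone; a_who not meaningful */
	ACE_SUBJ_UID,      /* a_who holds a UID */
	ACE_SUBJ_GID       /* a_who holds a GID */
};

/* Hypothetical helper: decide how a_who should be interpreted. */
static enum ace_subject
ace_classify_who(const ace_t *ace)
{
	assert(ace != NULL);
	if (ace->a_flags & ACE_OWNER)
		return (ACE_SUBJ_OWNER);
	if (ace->a_flags & ACE_GROUP)
		return (ACE_SUBJ_GROUP);
	if (ace->a_flags & ACE_EVERYONE)
		return (ACE_SUBJ_EVERYONE);
	if (ace->a_flags & ACE_IDENTIFIER_GROUP)
		return (ACE_SUBJ_GID);
	return (ACE_SUBJ_UID);
}
```

A caller would then test individual bits of a_access_mask, for example (a_access_mask & 0x00000002) for ACE_WRITE_DATA, and apply a_type to decide whether the entry allows or denies that access.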


Chapter Seven - ZFS Intent Log


The ZFS intent log (ZIL) saves transaction records of system calls that change the file system in memory with enough information to be able to replay them. These are stored in memory until either the DMU transaction group (txg) commits them to the stable pool and they can be discarded, or they are flushed to the stable log (also in the pool) due to a fsync, O_DSYNC or other synchronous requirement. In the event of a panic or power failure, the log records (transactions) are replayed.

There is one ZIL per file system. Its on-disk (pool) format consists of 3 parts:
- ZIL header
- ZIL blocks
- ZIL records

A log record holds a system call transaction. Log blocks can hold many log records and the blocks are chained together. Each ZIL block contains a block pointer in the trailer (blkptr_t) to the next ZIL block in the chain. Log blocks can be different sizes. The ZIL header points to the first block in the chain. Note there is not a fixed place in the pool to hold blocks. They are dynamically allocated and freed as needed from the blocks available. The illustration below shows the ZIL structure showing log blocks and log records of different sizes:
[Figure: the header references a chain of log blocks; each log block holds log records followed by a trailer.]

Illustration 19 Overview of ZIL Structure

More details of the current ZIL on disk structures are given below.

Section 7.1: ZIL header


There is one of these per ZIL and it has a simple structure:
typedef struct zil_header {
    uint64_t zh_claim_txg;   /* txg in which log blocks were claimed */
    uint64_t zh_replay_seq;  /* highest replayed sequence number */
    blkptr_t zh_log;         /* log chain */
} zil_header_t;


Section 7.2: ZIL blocks


ZIL blocks contain ZIL records. The blocks are allocated on demand and are of a variable size according to need. The size field is part of the blkptr_t which points to a log block. Each block is filled with records and contains a zil_trailer_t at the end of the block:

ZIL Trailer
typedef struct zil_trailer {
    blkptr_t zit_next_blk;    /* next block in chain */
    uint64_t zit_nused;       /* bytes in log block used */
    zio_block_tail_t zit_bt;  /* block trailer */
} zil_trailer_t;

ZIL records

ZIL record common structure
ZIL records all start with a common section followed by a record (transaction) specific structure. The common log record structure and record types (values for lrc_txtype) are:
typedef struct {             /* common log record header */
    uint64_t lrc_txtype;     /* intent log transaction type */
    uint64_t lrc_reclen;     /* transaction record length */
    uint64_t lrc_txg;        /* dmu transaction group number */
    uint64_t lrc_seq;        /* intent log sequence number */
} lr_t;

#define TX_CREATE    1   /* Create file */
#define TX_MKDIR     2   /* Make directory */
#define TX_MKXATTR   3   /* Make XATTR directory */
#define TX_SYMLINK   4   /* Create symbolic link to a file */
#define TX_REMOVE    5   /* Remove file */
#define TX_RMDIR     6   /* Remove directory */
#define TX_LINK      7   /* Create hard link to a file */
#define TX_RENAME    8   /* Rename a file */
#define TX_WRITE     9   /* File write */
#define TX_TRUNCATE  10  /* Truncate a file */
#define TX_SETATTR   11  /* Set file attributes */
#define TX_ACL       12  /* Set acl */
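Because every record begins with the common header, a reader can step through a log block using lrc_reclen. The sketch below illustrates that idea only. It assumes records are packed back to back from the start of the block, that nused is the zit_nused value from the block's trailer, and that the buffer is already in native byte order; the function name is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

typedef struct {             /* common log record header */
	uint64_t lrc_txtype; /* intent log transaction type */
	uint64_t lrc_reclen; /* transaction record length */
	uint64_t lrc_txg;    /* dmu transaction group number */
	uint64_t lrc_seq;    /* intent log sequence number */
} lr_t;

/*
 * Hypothetical helper: count the records in the first nused bytes of a
 * log block. Returns -1 if a record header is truncated or carries a
 * record length that is too small or runs past the used area.
 */
static int
zil_count_records(const uint8_t *blk, uint64_t nused)
{
	uint64_t off = 0;
	int count = 0;

	assert(blk != NULL);
	while (off < nused) {
		lr_t lr;

		if (nused - off < sizeof (lr))
			return (-1);
		/* Copy the header out to avoid unaligned access. */
		memcpy(&lr, blk + off, sizeof (lr));
		if (lr.lrc_reclen < sizeof (lr) ||
		    lr.lrc_reclen > nused - off)
			return (-1);
		off += lr.lrc_reclen; /* step to the next record */
		count++;
	}
	return (count);
}
```

A replay loop would have the same shape, dispatching on lrc_txtype at each step instead of counting.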

ZIL record specific structures
For each of the record (transaction) types listed above there is a specific structure which embeds the common structure. Within each record enough information is saved in order to be able to replay the transaction (usually one VOP call). The VOP layer will pass in-memory pointers to vnodes. These have to be converted to stable pool object identifiers (oids). When replaying the transaction the VOP layer is called again. To do this we reopen the object and pass its vnode.

Some of the record specific structures are used for more than one transaction type. The lr_create_t record specific structure is used for TX_CREATE, TX_MKDIR, TX_MKXATTR and TX_SYMLINK, and lr_remove_t is used for both TX_REMOVE and TX_RMDIR. All fields (other than strings and user data) are 64 bits wide. This provides for a well defined alignment which allows for easy compatibility between different architectures, and easy endianness conversion if necessary. Here's the definition of the record specific structures:
typedef struct {
    lr_t lr_common;         /* common portion of log record */
    uint64_t lr_doid;       /* object id of directory */
    uint64_t lr_foid;       /* object id of created file object */
    uint64_t lr_mode;       /* mode of object */
    uint64_t lr_uid;        /* uid of object */
    uint64_t lr_gid;        /* gid of object */
    uint64_t lr_gen;        /* generation (txg of creation) */
    uint64_t lr_crtime[2];  /* creation time */
    uint64_t lr_rdev;       /* rdev of object to create */
    /* name of object to create follows this */
    /* for symlinks, link content follows name */
} lr_create_t;

typedef struct {
    lr_t lr_common;         /* common portion of log record */
    uint64_t lr_doid;       /* obj id of directory */
    /* name of object to remove follows this */
} lr_remove_t;

typedef struct {
    lr_t lr_common;         /* common portion of log record */
    uint64_t lr_doid;       /* obj id of directory */
    uint64_t lr_link_obj;   /* obj id of link */
    /* name of object to link follows this */
} lr_link_t;

typedef struct {
    lr_t lr_common;         /* common portion of log record */
    uint64_t lr_sdoid;      /* obj id of source directory */
    uint64_t lr_tdoid;      /* obj id of target directory */
    /* strings: names of source and destination follow this */
} lr_rename_t;

typedef struct {
    lr_t lr_common;         /* common portion of log record */
    uint64_t lr_foid;       /* file object to write */
    uint64_t lr_offset;     /* offset to write to */
    uint64_t lr_length;     /* user data length to write */
    uint64_t lr_blkoff;     /* offset represented by lr_blkptr */
    blkptr_t lr_blkptr;     /* spa block pointer for replay */
    /* write data will follow for small writes */
} lr_write_t;

typedef struct {
    lr_t lr_common;         /* common portion of log record */
    uint64_t lr_foid;       /* object id of file to truncate */
    uint64_t lr_offset;     /* offset to truncate from */
    uint64_t lr_length;     /* length to truncate */
} lr_truncate_t;

typedef struct {
    lr_t lr_common;         /* common portion of log record */
    uint64_t lr_foid;       /* file object to change attributes */
    uint64_t lr_mask;       /* mask of attributes to set */
    uint64_t lr_mode;       /* mode to set */
    uint64_t lr_uid;        /* uid to set */
    uint64_t lr_gid;        /* gid to set */
    uint64_t lr_size;       /* size to set */
    uint64_t lr_atime[2];   /* access time */
    uint64_t lr_mtime[2];   /* modification time */
} lr_setattr_t;

typedef struct {
    lr_t lr_common;         /* common portion of log record */
    uint64_t lr_foid;       /* obj id of file */
    uint64_t lr_aclcnt;     /* number of acl entries */
    /* lr_aclcnt number of ace_t entries follow this */
} lr_acl_t;


Chapter Eight - ZVOL (ZFS volume)


ZVOL (ZFS Volumes) provides a mechanism for creating logical volumes. ZFS volumes are exported as block devices and can be used like any other block device. ZVOLs are represented in ZFS as an object set of type DMU_OST_ZVOL (see Table 11). A ZVOL object set has a very simple format consisting of two objects: a properties object and a data object, object type DMU_OT_ZVOL_PROP and DMU_OT_ZVOL respectively. Both objects have statically assigned object Ids. Each object is described below.

ZVOL Properties Object
Type: DMU_OT_ZVOL_PROP
Object #: 2
Description: The ZVOL property object is a ZAP object containing attributes associated with this volume. A particular attribute of interest is the "volsize" attribute. This attribute contains the size, in bytes, of the volume.

ZVOL Data
Type: DMU_OT_ZVOL
Object #: 1
Description: This object stores the contents of this virtual block device.

