
Solaris 10 Deep Dive ZFS

Bob Netherton
Technical Specialist, Solaris Adoption
Sun Microsystems, Inc.
http://blogs.sun.com/bobn

What is ZFS?
Why a new file system?
What's different about it?
What can I do with it?
How much does it cost?
Where does ZFS go from here?
2

What is ZFS?
A new way to manage data

End-to-End Data Integrity
> With checksumming and copy-on-write transactions

Immense Data Capacity
> The world's first 128-bit file system

Easier Administration
> Pooled storage model; no volume manager

Huge Performance Gains
> Especially architected for speed

3

Why a New File System?

Data Management Costs are High
The Value of Data is Becoming Even More Critical
The Amount of Storage is Ever-Increasing

Trouble with Existing File Systems?
Good for the time they were designed, but...

No Defense Against Silent Data Corruption
> Any defect in the data path can corrupt data... undetected

Difficult to Administer; Need a Volume Manager
> Volumes, labels, partitions, provisioning and lots of limits

Older/Slower Data Management Techniques
> Fat locks, fixed block size, naive pre-fetch, dirty region logging

ZFS Design Principles

Start with a new design around today's requirements

Pooled storage
> Eliminate the notion of volumes
> Do for storage what virtual memory did for RAM

End-to-end data integrity
> Historically considered too expensive
> Now, data is too valuable not to protect

Transactional operation
> Maintain a consistent on-disk format
> Reorder transactions for a big performance win

Evolution of Disks and Volumes

Initially, we had simple disks
Abstraction of disks into volumes to meet requirements
An industry grew up around HW / SW volume management

[Diagram: three file system / volume manager stacks, each pairing two 1 GB devices into a concatenated 2 GB volume (lower + upper), a striped 2 GB volume (even + odd), and a mirrored 1 GB volume (left + right)]
7

FS/Volume Model vs. ZFS

Traditional Volumes
> 1:1 FS-to-volume relationship
> Grow / shrink by hand
> Limited bandwidth
> Storage fragmented

ZFS Pooled Storage
> No partitions / volumes
> Grow / shrink automatically
> All bandwidth always available
> All storage in the pool is shared

[Diagram: each traditional FS sits on its own volume manager and volume; multiple ZFS file systems share a single storage pool]

FS / Volume Model vs. ZFS

FS / Volume I/O Stack

FS to Volume
> Block device interface
> Write a block, write a block, ...
> Loss of power = loss of on-disk consistency
> Workaround: journaling, which is slow & complex

Volume to Disk
> Block device interface
> Write each block to each disk immediately to keep mirrors in sync
> Loss of power = resync
> Synchronous & slow

ZFS I/O Stack

ZFS to Data Management Unit (DMU)
> Object-based transactions
> "Make these changes to these objects"
> All or nothing

DMU to Storage Pool
> Transaction group commit
> All or nothing
> Always consistent on disk
> Journal not needed

Storage Pool (SPA) to Disk
> Schedule, aggregate, and issue I/O at will; runs at platter speed
> No resync if power lost

DATA

INTEGRITY

10

ZFS Data Integrity Model

Everything is copy-on-write
> Never overwrite live data
> On-disk state always valid, so no fsck

Everything is transactional
> Related changes succeed or fail as a whole
> No need for journaling

Everything is checksummed
> No silent corruption
> No panics from bad metadata

Enhanced data protection
> Mirrored pools, RAID-Z, disk scrubbing
11

Copy-on-Write and Transactional

[Diagram: a block tree rooted at the uber-block, showing original data and pointers alongside new data and pointers]

1. Start with the initial block tree
2. Write a copy of the changed data (the original data is never overwritten)
3. Copy-on-write the indirect blocks, so new pointers reference the new data
4. Rewrite the uber-block to commit the change
12

End-to-End Checksums

Checksums are separated from the data
The entire I/O path is self-validating, up to the uber-block

Prevents:
> Silent data corruption
> Panics from corrupted metadata
> Phantom writes
> Misdirected reads and writes
> DMA parity errors
> Errors from driver bugs
> Accidental overwrites
13

Self-Healing Data

ZFS can detect bad data using checksums and heal the data using its mirrored copy.

[Diagram, three stages through the Application / ZFS / Mirror stack:]
1. Detects bad data
2. Gets good data from the mirror
3. Heals the bad copy


14

Disk Scrubbing

> Uses checksums to verify the integrity of all data
> Traverses metadata to read every copy of every block
> Finds latent errors while they're still correctable
> Like ECC memory scrubbing, but for disks
> Provides fast and reliable re-silvering of mirrors
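
A minimal sketch of running a scrub by hand (assuming a pool named tank, as in the examples later in the deck); zpool status reports scrub progress and any errors found:

# zpool scrub tank
# zpool status -v tank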

15

RAID-Z Protection
RAID-5 and More

ZFS provides better than RAID-5 availability


> Copy-on-write approach solves historical problems

Striping uses dynamic widths


> Each logical block is its own stripe

All writes are full-stripe writes


> Eliminates read-modify-write (So it's fast!)

Eliminates RAID-5 write hole


> No need for NVRAM
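
As a brief sketch, a single-parity RAID-Z pool is created in one command; the pool and disk names here are placeholders in the same c#t#d# style used elsewhere in the deck:

# zpool create rzpool raidz c9t42d0 c13t11d0 c9t43d0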

16

Immense Data Capacity

> The world's first 128-bit file system
> No practical limitations on file size, directory entries, etc.
> All metadata is dynamic
> Concurrent everything

17

EASIER

ADMINISTRATION

18

Easier Administration

Pooled storage design makes for easier administration
No need for a volume manager!

Straightforward commands and a GUI
> Snapshots & clones
> Quotas & reservations
> Compression
> Pool migration
> ACLs for security
19

No More Volume Manager!

Automatically add capacity to the shared storage pool

[Diagram: multiple applications and ZFS file systems all drawing from a single storage pool]
20

ZFS File systems are Hierarchical


> File system properties are inherited
> Inheritance makes administration a snap
> File systems become control points
> Manage logically related file systems as a group
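
For example, a sketch using the tank/home tree from the following slides: set a property once on the parent and every child reports it with an inherited source.

# zfs set compression=on tank/home
# zfs get -r compression tank/home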

21

Create ZFS Pools and File Systems


Create a ZFS pool consisting of two mirrored drives
# zpool create tank mirror c9t42d0 c13t11d0
# df -h -F zfs
Filesystem            size   used  avail capacity  Mounted on
tank                   33G     1K    33G     1%    /tank

Create home directory file system


# zfs create tank/home
# zfs set mountpoint=/export/home tank/home
# df -h -F zfs
Filesystem            size   used  avail capacity  Mounted on
tank                   33G    24K    33G     1%    /tank
tank/home              33G    27K    33G     1%    /export/home

22

Create ZFS Pools and File Systems


Create home directories for users
# zfs create tank/home/ahrens
# zfs create tank/home/bonwick
# zfs create tank/home/billm
# df -h -F zfs
Filesystem            size   used  avail capacity  Mounted on
tank                   33G    24K    33G     1%    /tank
tank/home              33G    27K    33G     1%    /export/home
tank/home/ahrens       33G    24K    33G     1%    /export/home/ahrens
tank/home/bonwick      33G    24K    33G     1%    /export/home/bonwick
tank/home/billm        33G    24K    33G     1%    /export/home/billm

Add space to the pool

# zpool add tank mirror c9t43d0 c13t12d0
# df -h -F zfs
Filesystem            size   used  avail capacity  Mounted on
tank                   66G    24K    66G     1%    /tank
tank/home              66G    27K    66G     1%    /export/home
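
To confirm the layout after adding the second mirror, zpool status lists both mirrored vdevs (a sketch; output abbreviated):

# zpool status tank
  pool: tank
 state: ONLINE
config:
        NAME          STATE
        tank          ONLINE
          mirror      ONLINE
            c9t42d0   ONLINE
            c13t11d0  ONLINE
          mirror      ONLINE
            c9t43d0   ONLINE
            c13t12d0  ONLINE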

23

Quotas and Reservations

To control pooled storage usage, administrators can set a quota or a reservation on a per-file-system basis

# df -h -F zfs
Filesystem            size   used  avail capacity  Mounted on
tank/home              66G    28K    66G     1%    /export/home
tank/home/ahrens       66G    24K    66G     1%    /export/home/ahrens
tank/home/bonwick      66G    24K    66G     1%    /export/home/bonwick
# zfs set quota=10g tank/home/ahrens
# zfs set reservation=20g tank/home/bonwick
# df -h -F zfs
Filesystem            size   used  avail capacity  Mounted on
tank/home              66G    28K    46G     1%    /export/home
tank/home/ahrens       10G    24K    10G     1%    /export/home/ahrens
tank/home/bonwick      66G    24K    66G     1%    /export/home/bonwick
24
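
The settings can also be read back directly with zfs get (a minimal sketch):

# zfs get quota,reservation tank/home/ahrens tank/home/bonwick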

File System Attributes

Attributes are set on a file system and inherited by child file systems in the tree

# zfs set compression=on tank
# zfs set sharenfs=rw tank/home
# zfs get all tank
NAME  PROPERTY       VALUE                  SOURCE
tank  type           filesystem             -
tank  creation       Fri Sep  1  9:38 2006  -
tank  used           20.0G                  -
tank  available      46.4G                  -
tank  compressratio  1.00x                  -
tank  mounted        yes                    -
tank  quota          none                   default
tank  reservation    none                   default
tank  recordsize     128K                   default
tank  mountpoint     /tank                  default
tank  sharenfs       off                    default
tank  compression    on                     local
tank  atime          on                     default
...
25
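
To drop a locally set value and fall back to the inherited or default one, zfs inherit can be used (a sketch, continuing the compression example above):

# zfs inherit compression tank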

ZFS Snapshots
> Provide a read-only, point-in-time copy of a file system
> Copy-on-write makes them essentially free
> Very space efficient: only changes are tracked
> And instantaneous: the old copy simply isn't deleted

[Diagram: the current data shares unchanged blocks between the new uber-block and the snapshot uber-block]

26

ZFS Snapshots
Simple to create and roll back with snapshots

# zfs list -r tank
NAME               USED  AVAIL  REFER  MOUNTPOINT
tank              20.0G  46.4G  24.5K  /tank
tank/home         20.0G  46.4G  28.5K  /export/home
tank/home/ahrens  24.5K  10.0G  24.5K  /export/home/ahrens
tank/home/billm   24.5K  46.4G  24.5K  /export/home/billm
tank/home/bonwick 24.5K  66.4G  24.5K  /export/home/bonwick

# zfs snapshot tank/home/billm@s1
# zfs list -r tank/home/billm
NAME                 USED  AVAIL  REFER  MOUNTPOINT
tank/home/billm     24.5K  46.4G  24.5K  /export/home/billm
tank/home/billm@s1      0      -  24.5K  -

# cat /export/home/billm/.zfs/snapshot/s1/foo.c
# zfs rollback tank/home/billm@s1
# zfs destroy tank/home/billm@s1
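
Snapshots can also be listed by themselves (a minimal sketch):

# zfs list -t snapshot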

27

ZFS Clones
A clone is a writable copy of a snapshot
> Created instantly, unlimited number

Perfect for read-mostly file systems: source directories, application binaries and configuration, etc.

# zfs list -r tank/home/billm
NAME                 USED  AVAIL  REFER  MOUNTPOINT
tank/home/billm     24.5K  46.4G  24.5K  /export/home/billm
tank/home/billm@s1      0      -  24.5K  -

# zfs clone tank/home/billm@s1 tank/newbillm
# zfs list -r tank/home/billm tank/newbillm
NAME                 USED  AVAIL  REFER  MOUNTPOINT
tank/home/billm     24.5K  46.4G  24.5K  /export/home/billm
tank/home/billm@s1      0      -  24.5K  -
tank/newbillm           0  46.4G  24.5K  /tank/newbillm
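
Like any other file system, the clone can then be given its own mountpoint (a sketch continuing the example above; the path is only an illustration):

# zfs set mountpoint=/export/home/newbillm tank/newbillm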
28

ZFS Send / Receive (Backup / Restore)


Backup and restore ZFS snapshots
> Full backup of any snapshot
> Incremental backup of differences between snapshots

Create full backup of a snapshot


# zfs send tank/fs@snap1 > /backup/fs-snap1.zfs

Create incremental backup


# zfs send -i tank/fs@snap1 tank/fs@snap2 > \
    /backup/fs-diff1.zfs

Replicate ZFS file system remotely


# zfs send -i tank/fs@11:31 tank/fs@11:32 | \
    ssh host zfs receive -d /tank/fs
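
Restoring goes the other way with zfs receive; the target name tank/fs_restored below is only an example:

# zfs receive tank/fs_restored < /backup/fs-snap1.zfs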
29

Storage Pool Migration

Adaptive Endian-ness
> Hosts always write in their native endian-ness
> Opposite-endian systems: write and copy operations will eventually byte-swap all the data!

Config Data is Stored within the Data
> When the data moves, so does its config info


30

ZFS Data Migration


Host-neutral format on-disk
> Move data from SPARC to x86 transparently
> Data is always written in native format; reads reformat the data if needed

ZFS pools may be moved from host to host


> ZFS handles device ids & paths, mount points, etc.

Export pool from original host


source# zpool export tank

Import pool on new host


destination# zpool import tank
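
Run with no arguments on the destination host, zpool import simply lists the pools available for import (a minimal sketch):

destination# zpool import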
31

Data Compression

> Reduces the amount of disk space used
> Reduces the amount of data transferred to disk, increasing data throughput

[Diagram: data is compressed as ZFS writes it to disk]
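
The achieved ratio can be checked per dataset once compression is enabled (a sketch, assuming the tank pool from the earlier examples):

# zfs get compressratio tank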
32

Data Security

ACLs and Checksums

ACLs based on NFSv4 (NT-style)
> Full allow / deny semantics with inheritance
> Fine-grained privilege control model (17 attributes)

The uber-block checksum can serve as a digital signature for the entire file system
> 256-bit, military-grade checksum (SHA-256) available

Encrypted file system support coming soon
Secure deletion (scrubbing) coming soon
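
Selecting the stronger checksum is a one-line property change (a minimal sketch, assuming the tank pool from earlier):

# zfs set checksum=sha256 tank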
33

ZFS and Zones


Two great tastes that go great together
> You've got ZFS data in my zone!
> Hey, you've got your zone on my ZFS!

ZFS datasets (pools or file systems) can be delegated to zones
> Zone administrator controls the contents of the dataset

Zoneroot may (soon) be placed on ZFS
> Separate ZFS file system per zone
> Snapshots and clones make zone creation fast
34

ZFS Pools and Zones


[Diagram: zones A, B, and C are each delegated their own dataset (tank/a, tank/b, tank/c) from the tank pool, which is managed in the global zone]

35

Framework for Examples


Zones
> z1: sparse root, zoneroot on ZFS
> z2: full root, zoneroot on ZFS
> z4: sparse root, zoneroot on UFS

ZFS Pools & Filesystems


> p1: mirrored ZFS pool, mounted as /zones
> p2: mirrored ZFS pool, mounted as /p2
> p3: unmirrored ZFS pool, mounted as /p3

36

Adding ZFS as Mounted File System


Mount ZFS filesystem into a zone like any other loopback filesystem
# zfs create p2/z1a
# zfs set mountpoint=legacy p2/z1a
# zonecfg -z z1
zonecfg:z1> add fs
zonecfg:z1:fs> set type=zfs
zonecfg:z1:fs> set dir=/z1a
zonecfg:z1:fs> set special=p2/z1a
zonecfg:z1:fs> end
zonecfg:z1> verify
zonecfg:z1> commit
zonecfg:z1> exit

Must set mountpoint to legacy so that the zone manages the mount

37

Adding ZFS as Delegated File System


Delegate a ZFS dataset to a zone
> Zone administrator manages file systems within the zone

# zfs create p2/z1b
# mkdir /zones/z1/root/z1b
# zonecfg -z z1
zonecfg:z1> add dataset
zonecfg:z1:dataset> set name=p2/z1b
zonecfg:z1:dataset> end
zonecfg:z1> commit
zonecfg:z1> exit
# zoneadm -z z1 boot
# zlogin z1 df -h
Filesystem           size   used  avail capacity  Mounted on
p2/z1b                12G    24K    12G     1%    /p2/z1b
# zlogin z1 zfs list
NAME     USED  AVAIL  REFER  MOUNTPOINT
p2       136K  11.5G  25.5K  /p2
p2/z1b  24.5K  11.5G  24.5K  /p2/z1b


38

zoned Property for a ZFS File System


Once a FS is delegated to a zone, the zoned property is set. If set, the FS can no longer be managed in the global zone.
> Zone admin might have changed things in incompatible ways (mountpoint, for example)

39

Zoneroot on ZFS (Soon)


# cat z5.conf
create
set zonepath=/zones/z5
set autoboot=false
add net
set address=192.168.100.1/25
set physical=nge0
end
commit
# zonecfg -z z5 -f z5.conf
# zoneadm -z z5 install
A ZFS file system has been created for this zone.
Preparing to install zone <z5>.
Creating list of files to copy from the global zone.
Copying <2587> files to the zone.
Initializing zone product registry.
Determining zone package initialization order.
Preparing to initialize <957> packages on the zone.
Initialized <957> packages on zone.
Zone <z5> is initialized.
40

Zoneroot on ZFS (Soon)


# zfs list
NAME    USED  AVAIL  REFER  MOUNTPOINT
p1     3.44G  8.06G    38K  /zones
p1/z5  81.1M  8.06G  81.1M  /zones/z5
# zlogin z5 zfs list
no datasets available
# zfs set quota=500m p1/z5
# zfs list
NAME    USED  AVAIL  REFER  MOUNTPOINT
p1     3.45G  8.06G    38K  /zones
p1/z5  81.1M   419M  81.1M  /zones/z5
# zfs set reservation=500m p1/z5
# zfs list
NAME    USED  AVAIL  REFER  MOUNTPOINT
p1     3.45G  7.65G    38K  /zones
p1/z5  81.1M   419M  81.1M  /zones/z5

41

Cloning Zones with ZFS

# zfs list
NAME    USED  AVAIL  REFER  MOUNTPOINT
p1     3.37G  8.14G    36K  /zones
p1/z1   127M  8.14G   127M  /zones/z1
p1/z2  3.24G  8.14G  3.24G  /zones/z2
# cp z2.conf z3.conf
<make changes necessary for z3 identity>
# zonecfg -z z3 -f z3.conf
# zoneadm -z z3 clone z2
Cloning snapshot p1/z2@SUNWzone1
Instead of copying, a ZFS clone has been created for this zone.
# zfs list
NAME             USED  AVAIL  REFER  MOUNTPOINT
p1              3.37G  8.14G    37K  /zones
p1/z1            127M  8.14G   127M  /zones/z1
p1/z2           3.24G  8.14G  3.24G  /zones/z2
p1/z2@SUNWzone1 94.5K      -  3.24G  -
p1/z3            116K  8.14G  3.24G  /zones/z3
42

ZFS Object-Based Storage


The DMU provides a general-purpose object store
The zvol interface allows creation of raw devices
> Use them for databases, create UFS in them, etc.

[Diagram: the ZFS POSIX interface and the ZFS volume emulator (zvol, serving iSCSI, swap, and raw consumers) both sit on the Data Management Unit (DMU), which sits on the Storage Pool Allocator (SPA)]


43

ZFS ZVOL Interface


Create zvol interfaces just as any other ZFS file system
Devices are located under /dev/zvol/
> /dev/zvol/rdsk/<poolname>/<volname>

# zfs create -V 4g tank/v1
# newfs /dev/zvol/rdsk/tank/v1
<newfs output>
# mount /dev/zvol/dsk/tank/v1 /mnt
# df -h /mnt
Filesystem              size   used  avail capacity  Mounted on
/dev/zvol/dsk/tank/v1   3.9G   4.0M   3.9G     1%    /mnt
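
A zvol can likewise back swap space (a sketch; the volume name swapvol is just an example):

# zfs create -V 2g tank/swapvol
# swap -a /dev/zvol/dsk/tank/swapvol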
44

BREATHTAKING

PERFORMANCE

45

Architected for Speed

> Copy-on-write design
> Multiple block sizes
> Pipelined I/O
> Dynamic striping
> Intelligent pre-fetch


46

Cost and Source Code

ZFS is FREE*
*Free
USD0 EUR0 GBP0 SEK0 YEN0 YUAN0
47

ZFS source code is included in OpenSolaris
> 47 ZFS patents added to the CDDL patent commons

And for the Future


More Flexible
> Pool resize and device removal
> Booting / root file system
> Integration with Solaris Containers

More Secure
> Encryption
> Secure delete: overwriting for absolute deletion

More Reliable
> Fault Management Architecture integration
> Hot spares
> DTrace providers


48

Solaris 10 Deep Dive ZFS


Bob Netherton
bob.netherton@sun.com
