
Cheap Clustering with OCFS2

Mark Fasheh, Oracle
August 14, 2006


What is OCFS2?

General purpose cluster file system


Shared disk model
Symmetric architecture
Almost POSIX compliant

fcntl(2) locking
Shared writable mmap

Cluster stack

Small, suitable only for a file system

Why use OCFS2?

Versus NFS

Fewer points of failure
Data consistency
OCFS2 nodes have direct disk access

Higher performance

Widely distributed, supported


In Linux kernel
Novell SLES9, SLES10
Oracle support for RAC customers

OCFS2 Uses

File Serving

FTP
NFS

Web serving (Apache)
Xen image migration
Oracle Database

Why do we need cheap clusters?

Shared disk hardware can be expensive

Fibre Channel as a rough example


Switches: $3,000 - $20,000
Cards: $500 - $2,000
Cables, GBICs: hundreds of dollars
Disk(s): the sky's the limit

Networks are getting faster and faster

Gigabit PCI card: $6
Performance not necessarily critical

Some want to prototype larger systems

Hardware

Cheap commodity hardware is easy to find:

Refurbished from name brands (Dell, HP, IBM, etc)
Large hardware stores (Fry's Electronics, etc)
Online: eBay, Amazon, Newegg, etc
Dual core CPUs running at 2GHz and up
Gigabit network
SATA, SATA II

Impressive Performance

Hardware Examples - CPU

2.66GHz, Dual Core w/MB: $129

Built-in video and network

Hardware Examples - RAM

1GB DDR2: $70

Hardware Examples - Disk

100GB SATA: $50

Hardware Examples - Network

Gigabit network card: $6

Can direct-connect nodes rather than buy a switch; at this price, buy two!

Hardware Examples - Case

400 Watt Case: $70

Hardware Examples - Total

Total hardware cost per node: $326


3 node cluster for less than $1,000!
One machine exports disk via network

Dedicated gigabit network for the storage
At $50 each, simple to buy an extra, dedicated disk
Generally, this node cannot mount the shared disk

Spend slightly more for nicer hardware


PCI-Express gigabit card: $30
Athlon X2 3800+ with MB (SATA II, DDR2): $180

Shared Disk via iSCSI

SCSI over TCP/IP

Can be routed
Support for authentication, many enterprise features

iSCSI server: iSCSI Enterprise Target (IETD)

Can run on any disks, regular files
Kernel / user space components

iSCSI client: Open-iSCSI Initiator

Kernel / user space components

Trivial iSCSI Target Config.

Name the target

iqn.YYYY-MM.com.example:disk.name

Lun definitions describe disks to export

fileio type for normal disks
Special nullio type for testing

Create Target stanza in /etc/ietd.conf


Target iqn.2006-08.com.example:lab.exports
    Lun 0 Path=/dev/sdX,Type=fileio
    Lun 1 Sectors=10000,Type=nullio
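
After editing /etc/ietd.conf, the target daemon has to re-read it before the new LUNs are visible. A minimal sketch, assuming the IETD init script was installed as /etc/init.d/iscsi-target (the script name varies by distribution and package; the ietd daemon can also be started by hand):

# restart so ietd re-reads /etc/ietd.conf (script name is an assumption)
$ /etc/init.d/iscsi-target restart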


Trivial iSCSI Initiator Config.

Recent releases have a DB driven config.


Use the iscsiadm program to manipulate it
rm -f /var/db/iscsi/* to start fresh

3 steps


Add discovery address
Log into target
When done, log out of target

$ iscsiadm -m discovery --type sendtargets --portal examplehost
[cbb01c] 192.168.1.6:3260,1 iqn.2006-08.com.example:lab.exports
$ iscsiadm -m node --record cbb01c --login
$ iscsiadm -m node --record cbb01c --logout
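
Once the login succeeds, the exported LUNs show up as ordinary local SCSI disks. A low-tech way to spot the new device name, assuming no other disks are being added at the same time:

$ cat /proc/partitions        # note the existing disks
$ iscsiadm -m node --record cbb01c --login
$ cat /proc/partitions        # the new entry (e.g. sdb) is the iSCSI disk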

Shared Disk via SLES10

Easiest option

No downloading, all packages included
Very simple setup using YaST2


Simple to use GUI configuration utility
Text mode available

Supported by Novell/SUSE
OCFS2 also integrated with Linux-HA software
Demo on Wednesday

Visit Oracle booth for details

Shared Disk via AoE

ATA over Ethernet


Very simple standard: 6 page spec!
Lightweight client

Less CPU overhead than iSCSI

Very easy to set up: auto configuration via Ethernet broadcast
Not routable, no authentication

Targets and clients must be on the same Ethernet network

Disks addressed by shelf and slot #'s

AoE Target Configuration

Virtual Blade (vblade) software available for Linux, FreeBSD


Very small, user space daemon
Buffered I/O against a device or file

Useful only for prototyping
O_DIRECT patches available

Stock performance is not very high

Very simple command:

vbladed <shelf> <slot> <ethn> <device>
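
A minimal concrete sketch of the command above; the device and interface names are examples only, assuming the spare disk is /dev/sdb and the storage network sits on eth1:

# export /dev/sdb as AoE shelf 0, slot 1 on eth1
$ vbladed 0 1 eth1 /dev/sdb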

AoE Client Configuration

Single kernel module load required


Automatically finds blades
Optional load time option, aoe_iflist

List of interfaces to listen on

Aoetools package

Programs to get AoE status, bind interfaces, create devices, etc
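
A minimal sketch of the client side, assuming the storage traffic arrives on eth1 (the interface name is an example) and the aoetools package is installed:

# load the driver, limited to the storage interface
$ modprobe aoe aoe_iflist="eth1"
# list the discovered AoE devices (from aoetools)
$ aoe-stat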

OCFS2

1.2 tree

Shipped with SLES9/SLES10
RPMS for other distributions available online
Builds against many kernels
Feature freeze, bug fix only

1.3 tree

Active development tree
Included in Linux kernel
Bug fixes and features go to -mm first

OCFS2 Tools

Standard set of file system utilities

mkfs.ocfs2, mount.ocfs2, fsck.ocfs2, etc
Cluster aware

o2cb to start/stop/configure cluster
Work with both OCFS2 trees

ocfs2console GUI configuration utility

Can create entire cluster configuration
Can distribute configuration to all nodes

RPMS for non-SLES distributions available online

OCFS2 Configuration

Major goal for OCFS2 was simple config.

/etc/ocfs2/cluster.conf

Single file, identical on all nodes
Can configure to start at boot

Only step before mounting is to start cluster

$ /etc/init.d/o2cb online <cluster name>
Loading module "configfs": OK
Mounting configfs filesystem at /sys/kernel/config: OK
Loading module "ocfs2_nodemanager": OK
Loading module "ocfs2_dlm": OK
Loading module "ocfs2_dlmfs": OK
Mounting ocfs2_dlmfs filesystem at /dlm: OK
Starting O2CB cluster ocfs2: OK

Sample cluster.conf
node:
    ip_port = 7777
    ip_address = 192.168.1.7
    number = 0
    name = keevan
    cluster = ocfs2

node:
    ip_port = 7777
    ip_address = 192.168.1.2
    number = 1
    name = opaka
    cluster = ocfs2

cluster:
    node_count = 2
    name = ocfs2
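
The same file must be present on every node. A minimal sketch of copying it out by hand, using the two hostnames from the sample above (ocfs2console can also distribute the configuration for you):

$ scp /etc/ocfs2/cluster.conf root@opaka:/etc/ocfs2/cluster.conf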

OCFS2 Tuning - Heartbeat

Default heartbeat timeout tuned very low for our purposes

May result in node reboots for lower performance clusters
Timeout must be the same on all nodes
Increase the O2CB_HEARTBEAT_THRESHOLD value in /etc/sysconfig/o2cb (see the example below)
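
A minimal sketch of the change; the variable name comes from the slides, but 31 is only an illustrative value and must be set identically on every node before the cluster stack is restarted:

# /etc/sysconfig/o2cb (fragment)
# Roughly, the disk heartbeat timeout works out to about
# (threshold - 1) * 2 seconds, so 31 gives around a minute.
O2CB_HEARTBEAT_THRESHOLD=31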

OCFS2 Tools 1.2.3 release will add this to the configuration script.

SLES10 users can use Linux-HA instead

OCFS2 Tuning - mkfs.ocfs2

OCFS2 uses cluster and block sizes

Clusters for data, range from 4K-1M

Use -C <clustersize> option

Blocks for meta data, range from .5K-4K

Use -b <blocksize> option

More meta data updates -> larger journal

-J size=<journalsize> to pick a different size


-T mail option for meta data heavy workloads
-T datafiles for file systems with very large files

mkfs.ocfs2 -T <filesystem-type>
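
Pulling the options together, a minimal sketch of formatting a shared volume; the device name, label, and node slot count are examples, not values from the slides:

# 4K blocks, 32K clusters, a label, and slots for 4 nodes to mount concurrently
$ mkfs.ocfs2 -b 4K -C 32K -N 4 -L cheapcluster /dev/sdb1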

OCFS2 Tuning - Practices

No indexed directories yet

Keep directory sizes small to medium
Read only access is not a problem

Try to keep writes local to a node

Reduce resource contention


Each node has its own directory
Each node has its own logfile

Spread things out by using multiple file systems

Allows you to fine-tune mkfs options depending on each file system's target usage

References

http://oss.oracle.com/projects/ocfs2/
http://oss.oracle.com/projects/ocfs2-tools/
http://www.novell.com/linux/storage_foundation/
http://iscsitarget.sf.net/
http://www.open-iscsi.org/
http://aoetools.sf.net/
http://www.coraid.com/
http://www.frys-electronics-ads.com/
http://www.cdw.com/
