Running High Performance Computing Workloads on Red Hat Enterprise Linux
Imed Chihi
Senior Technical Account Manager
Red Hat Global Support Services
21 January 2014
Salaam and good morning. My name is Imed Chihi and I am a Senior Technical Account Manager at Red Hat. I am part of the Support and Engineering organisation within the company.


KAU! WEP "#$% & Imed Chihi 2
Agenda
The case of HPC on commodity platforms
From irrelevance to ubiquity
The compute model
Tuning Red Hat Enterprise Linux
  Process placement
  Memory placement
  Interrupt placement
*)e purpose o" today-s presentation is to talk a.out t)e use o" +ed ,at Enterprise /inux in ,PC
environments and t)e common tunin& areas to w)ic) t)e ,PC user and administrator needs to pay
attention'


The case of HPC on commodity platforms


HPC on commodity hardware
HPC on a personal computer sounded weird
"If you want HPC, get a Cray"
Early work on "cycle harvesting" since the '80s
"Beowulf" clusters in the mid-to-late '90s spread the popularity of the concept
Although it sounds like a de facto standard nowadays, the idea of running HPC workloads on PC-like architectures has not always been mainstream.
20 to 30 years ago, this idea was totally strange, as the rule was to run HPC on custom proprietary hardware, namely from Cray Research and others.
In the '80s, early research started to consider options to take advantage of the idle cycles on personal workstations to run demanding processing in a distributed fashion.
In the early '90s a trend called "Beowulf clusters" appeared: collections of Linux personal computers connected to a LAN and running scientific processing tasks in a distributed fashion. Work on Beowulf has, for instance, been the main drive for the proliferation of Ethernet NIC drivers on Linux as early as 1994.


From irrelevance to ubiquity


From irrelevance to ubiquity
96% of the top 500 supercomputers run Linux
(November 2013 listing)
Today, this is no longer a "clever idea" but "the standard".
The fact that 96% of the fastest supercomputers on the planet run Linux speaks volumes about the success of the concept.
The operating system's kernel tries to make the most optimal decisions for general purpose workloads. However, we will see in the rest of this talk that there are areas of optimisation, and that it is important to understand how both the hardware and the operating system work in order to analyse problems and optimise execution.


The compute model
Tuning examples


Tuning…
There is no magical recipe
This is only a tuning primer
Need to know how the system works, how the application works and what your priorities are
It is important to realise that tuning is not a magic recipe which, somehow, makes my jobs run twice as fast. Depending on the workload's characteristics and the available platform, some adjustments can be made to improve execution times.
Other workloads have priorities other than execution time: those could be low latency for financial trading engines, for instance, or reliability for database transactions.


The traditional example

for (i=0; i<MAX_DEPTH; i++)
    if (i==DIRAC_POINT)
        process_dirac();
    else
        process_non_dirac();      /* <- jump! */

for (i=0; i<MAX_DEPTH; i++)
    if (i!=DIRAC_POINT)
        process_non_dirac();
    else
        process_dirac();          /* <- jump! */
I am starting here with the simplest example I could think of about the impact of the platform on execution performance.
The two loops here are semantically identical. However, they run in very different ways on the hardware, resulting in different performance results.
The if test results in a code jump when the condition is false, but costs nearly nothing when the condition is true. This is the reason why the first loop incurs MAX_DEPTH-1 additional jumps compared to the second one.
$ cat dirac.c
int i;
#define MAX_DEPTH 99999
#define DIRAC_POINT 500
void process_dirac() { return; }
void process_non_dirac() { return; }
int main(int argc, char **argv) {
    for (i=0; i<MAX_DEPTH; i++)
        if (i!=DIRAC_POINT)
            process_non_dirac();
        else
            process_dirac();
    return(0);
}

The compiler generates this:

$ gcc -S dirac.c
$ more dirac.s
(..)
.L9:
        movl    i(%rip), %eax
        cmpl    $500, %eax
        je      .L7
        movl    $0, %eax
        call    process_non_dirac
        jmp     .L8
.L7:
        movl    $0, %eax
        call    process_dirac
.L8:
        movl    i(%rip), %eax
        addl    $1, %eax
        movl    %eax, i(%rip)
(..)
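To make the difference observable, the branch behaviour of the two variants can be measured directly; a quick sketch, assuming gcc and the perf tool are installed (the binary name dirac is used here for illustration):

$ gcc -o dirac dirac.c
$ perf stat -e branches,branch-misses ./dirac

Building each loop variant and comparing the branches and branch-misses counters makes it visible how differently the same semantics can execute on the hardware.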


The compute model
We need smart algorithms to place the workload on the computing resources. Those placement decisions are reviewed continuously.

[Diagram: processes (workload) and external events (interrupts workload) being placed onto the memory and processors of the platform]
As stated previously, in order to do any kind of optimisation or tuning, we need to understand how the computing platform works. Given the excessive complexity of modern computers, we'll use this simplistic model to present the core tuning concepts.
A computing platform provides computing resources (memory and processors): this is the basic von Neumann model on which modern computers are built. On top of those resources we try to run a workload. This workload is typically our user processes, which are sequences of instructions. There is also an "asynchronous" workload which is generated by the events from the attached devices. Those events are implemented using interrupts to request processing for network and storage operations in an asynchronous manner.


Process placement and scheduling
Processor affinity
Process migrations
Forcing process placement
  taskset
  sched_setaffinity()
Process priority
  nice
  chrt and sched_setscheduler()
We will start here with the first right-hand side arrow on the previous slide, which is about placing processes on CPUs.
There is a key concept here called "processor affinity", which is the tendency of the scheduler to keep scheduling a process on the same CPU as much as possible. When a process runs and accesses memory locations, it loads the most frequently used ones into the various processor caches; therefore, scheduling the process on a new processor would make it lose the benefit of rapid cached access until the caches of the new processor are filled.
In some cases, however, the scheduler is forced to "migrate" a process because the imbalance between the load of processors increases too much. There are dedicated kernel threads to perform this migration: [migration/X].
Those scheduling decisions can be overridden by manually forcing task assignments or limiting them to particular CPUs. Namely, you may use the taskset command to bind processes to particular CPUs. Applications can do the same by calling sched_setaffinity() to bind themselves to particular CPUs:

# taskset -c 2 memhog -r9999 1g
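For reference, a minimal sketch of how an application can pin itself from C with sched_setaffinity(), equivalent in spirit to the taskset call above (assuming Linux and glibc):

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

int main(void) {
    cpu_set_t set;

    CPU_ZERO(&set);
    CPU_SET(2, &set);                 /* allow CPU 2 only */
    /* pid 0 means "the calling process" */
    if (sched_setaffinity(0, sizeof(set), &set) == -1) {
        perror("sched_setaffinity");
        return 1;
    }
    /* ... the workload now stays on CPU 2 ... */
    return 0;
}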
The workload process typically runs alongside multiple other supporting tasks for system management. Therefore, it might make sense to give the workload process a higher priority so that it takes longer time slices on the CPUs and gets scheduling priority over other tasks. This can be implemented using the traditional Unix nice command and/or the nice() system call.
The real-time capabilities of the Linux kernel can allow tasks to gain even higher privileges to use the CPU by marking them as real-time tasks, as in:

# chrt -r 99 memhog -r9999 1g
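The programmatic counterpart of chrt is sched_setscheduler(); a minimal sketch, assuming the process has the privilege to set real-time policies (the priority value 50 is an arbitrary illustration):

#include <sched.h>
#include <stdio.h>

int main(void) {
    /* real-time priorities range from 1 (lowest) to 99 (highest) */
    struct sched_param sp = { .sched_priority = 50 };

    /* SCHED_RR is the round-robin real-time policy used by `chrt -r`;
       this call requires root or CAP_SYS_NICE */
    if (sched_setscheduler(0, SCHED_RR, &sp) == -1) {
        perror("sched_setscheduler");
        return 1;
    }
    /* ... run the latency-sensitive work here ... */
    return 0;
}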


Memory placement and NUMA
Uniform vs. Non-Uniform Memory Access
Forcing memory placement with numactl
Memory page migration in NUMA systems

[Diagram: a uniform (UMA) layout where CPU0-CPU3 share one memory with uniform access, versus a NUMA layout where CPU0-CPU3 each sit in NUMA nodes 0-3 with their own local memory]
Memory management is often the most intricate part of an operating system. It is very difficult to implement a virtual memory manager which works properly both on a single CPU with 64MB of RAM and on 64 CPUs with 2TB of RAM.
The traditional PC architecture uses linear memory hardware which can be accessed from all CPUs at the same cost: accessing a given memory location takes the same time regardless of which CPU the access takes place from.
However, this architecture model does not scale to match the requirements of modern platforms, which tend to have tens of CPUs and hundreds of gigabytes of memory. Therefore, modern servers are built around a Non-Uniform Memory Access model where the system is comprised of multiple groupings of memory modules and CPUs: those groupings are called "NUMA nodes". On those models, access to a memory location from CPU0 takes much less time than from CPU1. The NUMA topology can be viewed with numactl, as in:
# numactl --hardware
available: 4 nodes (0-3)
node 0 cpus: 0 1 2 3 4 5 24 25 26 27 28 29
node 0 size: 65521 MB
node 0 free: 62106 MB
node 1 cpus: 6 7 8 9 10 11 30 31 32 33 34 35
node 1 size: 65536 MB
node 1 free: 62977 MB
node 2 cpus: 12 13 14 15 16 17 36 37 38 39 40 41
node 2 size: 65536 MB
node 2 free: 63453 MB
node 3 cpus: 18 19 20 21 22 23 42 43 44 45 46 47
node 3 size: 65536 MB
node 3 free: 63028 MB
node distances:
node   0   1   2   3
  0:  10  21  21  21
  1:  21  10  21  21
  2:  21  21  10  21
  3:  21  21  21  10
Just like process migration threads exist to move a process to a different CPU, recent kernels implement NUMA page migration in order to "move" memory allocated to a process to a NUMA node "closer" to where the process is running.
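As a usage sketch, a job and its memory can be constrained to one node up front (the binary name ./my_hpc_app is hypothetical):

# numactl --cpunodebind=0 --membind=0 ./my_hpc_app

The numastat tool can then show how many allocations hit the local node versus remote ones.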


Memory management
Page cache considerations
Huge pages
Overcommit
Out of memory killer
The page cache is a dynamic memory space used primarily for caching IO. The Linux kernel typically uses large amounts of physical memory for caching when that memory is free.
Those allocations are not initiated by user processes, so they can be confusing, as it is not obvious to track them with precision. The Linux kernel implements controls to restrict how much memory can be used this way. However, the page cache should not affect how much memory applications can allocate, since the memory it uses is usually reclaimable when there is user demand.
HugePages are an implementation of page sizes larger than the typical 4KB size common on commodity platforms. The main motive is, again, a concern of scalability, as the use of 4KB pages would imply page tables and memory management data structures with millions of entries.
Traversing linked lists with millions of entries can have a serious impact on performance. Therefore, considering the use of HugePages is no longer optional, as the typical compute node is likely to have large amounts of memory.
=vercommit is a "eature o" virtual memory mana&ers .y w)ic) t)ey can satis"y more memory allocation
reBuests t)an it )as availa.le in p)ysical memory' (t is very common to "ind t)at t)e a&&re&ate amount o"
virtual memory allocations is muc) )i&)er t)an t)e actual availa.le p)ysical memory' C)eck
/*roc/meminfo "or details' *)is is very analo&ous to over.ookin& done .y some airlines> it provides "or
optimal utilisation o" t)e resources0 .ut0 we may run t)e risk o" )avin& users really claimin& to use w)at t)ey
)ave allocated w)ic) would yield to a pro.lematic situation'
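The accounting itself can be inspected and the policy switched; a sketch of the relevant knobs (the values shown describe the stock policies, not a recommendation):

# grep -i commit /proc/meminfo   # Committed_AS versus CommitLimit
# sysctl vm.overcommit_memory=2  # 0 = heuristic (default), 2 = strict accounting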
The typical behaviour of virtual memory managers is to rely on a swap space as a slow extension of physical memory. However, when application demand goes beyond the actual available memory, situations of "thrashing" are likely to occur. "Thrashing" is a situation where the memory manager keeps moving pages between RAM and the swap space indefinitely, ending up in a near-lockup situation.
The "out of memory killer" is a kernel mechanism which attempts to kill processes in order to bring the system out of thrashing situations. Therefore, it is important to watch for excessive memory usage, as it may trigger the out of memory killer.
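A quick way to watch for both conditions on a running node (the kernel log wording varies between versions):

# vmstat 1                          # sustained si/so activity suggests thrashing
# dmesg | grep -i "killed process"  # OOM killer victims, if any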


Interrupts placement
IRQ-to-CPU affinity
irqbalance
[ksoftirqd/X] kernel threads
Multi-queue networking
Zero-copy IO
Offloading engines
Interrupts are asynchronous events which need to be processed by CPUs. They are asynchronous because they are not initiated by the user and their timing cannot be controlled. Interrupts are the main method of communicating with external devices, namely networking and storage.
With high speed network interfaces at 10GbE and 40GbE, or fibre channel links at 8Gbps per port, the number of interrupts can require huge processing power from the CPUs. Therefore, the assignment of interrupts to CPUs can be tuned for optimal processing. The irqbalance service can be used to distribute the interrupt load among processors. However, this may not be the optimal choice for certain workloads.
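Manual IRQ-to-CPU affinity is set through procfs; a sketch, assuming the NIC's interrupt turned out to be IRQ 53 (an illustrative number) and CPU 2 should handle it:

# grep eth0 /proc/interrupts          # find the device's IRQ number(s)
# echo 4 > /proc/irq/53/smp_affinity  # bitmask 0x4 selects CPU 2
# service irqbalance stop             # otherwise irqbalance may rewrite the mask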
Interrupt handling is actually done in two phases. There is a first, synchronous phase where the CPU receives the interrupt, acknowledges it, and schedules the remainder of the processing to be completed later; this has to happen the moment the interrupt is raised and without delay, otherwise packet loss could occur. The second phase is processed asynchronously by kernel threads called [ksoftirqd/X]. Those threads are scheduled just like any other process, as they are not under time constraints.
Modern network devices and device drivers are capable of delivering incoming packets to multiple receive queues. This allows multiple processors to pick up and process packets in parallel, because a receive queue can only be accessed under a CPU lock.
Another, more common, optimisation is offloading engines, which are hardware implementations of processing on network traffic. Actions like packet re-assembly or checksum calculation, which are usually done by the CPU, can be offloaded to the network interface.
The Unix programming model expects that transmission and reception of data from network or storage interfaces is done with two copies: the kernel copies the data from user space to a buffer in kernel space, then from this kernel buffer to the transmission device. This double copy has plagued the performance of highly demanding applications on Linux/Unix, especially as CPU speed and network speed have grown much faster than memory speed, which has mostly stagnated over the past 20 years. Zero-copy is a mechanism which permits transmission from user buffers directly to the hardware, which improves performance. However, there are still no standard, common interfaces to do this, and it still requires some hacking to implement.
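One long-standing kernel interface that at least removes the user-space bounce for the common file-to-socket case is sendfile(2); a minimal sketch (the descriptor names are illustrative):

#include <fcntl.h>
#include <sys/sendfile.h>
#include <sys/stat.h>
#include <unistd.h>

/* Stream a whole file to an already-connected socket without
   bouncing the data through a user-space buffer. */
ssize_t send_file_zero_copy(int sock_fd, const char *path) {
    struct stat st;
    off_t offset = 0;
    ssize_t sent;
    int fd = open(path, O_RDONLY);

    if (fd < 0)
        return -1;
    if (fstat(fd, &st) < 0) {
        close(fd);
        return -1;
    }
    /* the kernel moves the pages itself: no read()/write() pair */
    sent = sendfile(sock_fd, fd, &offset, st.st_size);
    close(fd);
    return sent;
}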


)#ed "hihi )#ed "hihi
http>??people9redhat9co#?ichihi?p? http>??people9redhat9co#?ichihi?p?
ichihi@redhat.com ichihi@redhat.com
Red Hat Global Support Services Red Hat Global Support Services
19 19 http>??---9redhat9co#?training?courses?rh442? http>??---9redhat9co#?training?courses?rh442?
29 Red Hat &nterprise 'inu( 4 @ !eror#ance Tuning Guide 29 Red Hat &nterprise 'inu( 4 @ !eror#ance Tuning Guide
*9 Red Hat Su##it 201*> !eror#ance Analysis and Tuning *9 Red Hat Su##it 201*> !eror#ance Analysis and Tuning
o Red Hat &nterprise 'inu( o Red Hat &nterprise 'inu(
