
UNIVERSIDAD AUTÓNOMA DE MADRID

MASTER'S PROGRAM IN RESEARCH AND INNOVATION IN INFORMATION
AND COMMUNICATION TECHNOLOGIES (I2-ICT)
Numerical and Data-Intensive Computing (Course 2012/2013)

Laboratory 1: PROFILERS (perf and gprof)


Proceed carefully through the following steps, completing the lab report as requested. The
report must be delivered as a PDF document through the Moodle portal by the next lab class.
The material within this report can be used for the final term report, which is to be delivered by
the end of the ordinary evaluation period.
1. Download associated material (profilers.tar.gz) from Moodle's course page into personal
working directory.
2. Uncompress and untar associated material:
gunzip profilers.tar.gz
tar xvf profilers.tar
PERF (I)
1. Go to directory task1:
cd profilers/task1
2. Edit and understand example task1.c:
gedit task1.c &
3. Compile task1.c:
make
4. Run task1 with the performance analysis tool (perf), reporting:
a. CPU cycles at user level: -e cycles:u
temporal@cmult-25-67-217:~/profilers/task1$ perf stat -e cycles:u ./task1
CPU = 34.900000 ms

 Performance counter stats for './task1':

    10660168914  cycles                   #      0.000 M/sec

    3.505056003  seconds time elapsed

Statistics on how many times something was done


b. Machine code instructions at user level: -e instructions:u
temporal@cmult-25-67-217:~/profilers/task1$ perf stat -e instructions:u ./task1
CPU = 34.900000 ms

 Performance counter stats for './task1':

     4002106745  instructions             #      0.000 IPC

    3.508743022  seconds time elapsed

temporal@cmult-25-67-217:~/profilers/task1$

c. First-level data cache loads at user level: -e l1-dcache-loads:u
d. First-level data cache load misses at user level: -e l1-dcache-load-misses:u
e. First-level data cache stores at user level: -e l1-dcache-stores:u
f. First-level data cache store misses at user level: -e l1-dcache-store-misses:u
g. Last-level cache loads at user level: -e llc-loads:u
h. Last-level cache load misses at user level: -e llc-load-misses:u
i. Last-level cache stores at user level: -e llc-stores:u
j. Last-level cache store misses at user level: -e llc-store-misses:u

perf stat -e cycles:u -e instructions:u \
     -e l1-dcache-loads:u -e l1-dcache-load-misses:u \
     -e l1-dcache-stores:u -e l1-dcache-store-misses:u \
     -e llc-loads:u -e llc-load-misses:u \
     -e llc-stores:u -e llc-store-misses:u ./task1

Notes:
Number of misses from level 1 onward.
How many times data was stored to the L1 cache.
Level-1 cache misses.
Last-level cache (level 3): LLC.
Misses and reads in L1 and in the LLC.
L1: level-1 cache memory.
L2: level-2 cache memory.
L3: level-3 cache memory (last-level cache).
After that, accesses go to RAM.
perf asks the CPU how many misses and stores occurred.
perf provides statistical information from the CPU.
temporal@cmult-25-67-217:~/profilers/task1$ perf stat -e cycles:u -e instructions:u ./task1
CPU = 34.800000 ms

 Performance counter stats for './task1':

    10657835906  cycles                   #      0.000 M/sec
     4002106780  instructions             #      0.376 IPC

    3.502425458  seconds time elapsed

temporal@cmult-25-67-217:~/profilers/task1$ perf stat -e cycles:u -e instructions:u -e l1-dcache-loads:u ./task1
CPU = 34.900000 ms

 Performance counter stats for './task1':

    10652615063  cycles                   #      0.000 M/sec
     4002106781  instructions             #      0.376 IPC
      400036820  L1-dcache-loads          #      0.000 M/sec

    3.506406766  seconds time elapsed

temporal@cmult-25-67-217:~/profilers/task1$ perf stat -e cycles:u -e instructions:u -e l1-dcache-loads:u -e l1-dcache-load-misses:u ./task1
CPU = 34.800000 ms

 Performance counter stats for './task1':

    10644786361  cycles                   #      0.000 M/sec
     4002106769  instructions             #      0.376 IPC
      400036808  L1-dcache-loads          #      0.000 M/sec
      935423583  L1-dcache-load-misses    #      0.000 M/sec

    3.497828986  seconds time elapsed

temporal@cmult-25-67-217:~/profilers/task1$ perf stat -e cycles:u -e instructions:u -e l1-dcache-loads:u -e l1-dcache-load-misses:u -e l1-dcache-stores:u ./task1
CPU = 34.800000 ms

 Performance counter stats for './task1':

    10651824488  cycles                   #      0.000 M/sec
     4002106765  instructions             #      0.376 IPC
      400036804  L1-dcache-loads          #      0.000 M/sec
      935870744  L1-dcache-load-misses    #      0.000 M/sec
      800018046  L1-dcache-stores         #      0.000 M/sec

    3.502611901  seconds time elapsed

temporal@cmult-25-67-217:~/profilers/task1$ perf stat -e cycles:u -e instructions:u -e l1-dcache-loads:u -e l1-dcache-load-misses:u -e l1-dcache-stores:u -e l1-dcache-store-misses:u ./task1
CPU = 34.800000 ms

 Performance counter stats for './task1':

    10651317588  cycles                   #      0.000 M/sec
     4002106782  instructions             #      0.376 IPC
      400036821  L1-dcache-loads          #      0.000 M/sec
      936246092  L1-dcache-load-misses    #      0.000 M/sec
      800018063  L1-dcache-stores         #      0.000 M/sec
              0  L1-dcache-store-misses   #      0.000 M/sec

    3.506921655  seconds time elapsed

temporal@cmult-25-67-217:~/profilers/task1$ perf stat -e cycles:u -e instructions:u -e l1-dcache-loads:u -e l1-dcache-load-misses:u -e l1-dcache-stores:u -e l1-dcache-store-misses:u -e llc-loads:u ./task1
CPU = 34.900000 ms

 Performance counter stats for './task1':

    10673775578  cycles                   #      0.000 M/sec
     4003634110  instructions             #      0.375 IPC
      398972589  L1-dcache-loads          #      0.000 M/sec
      466453707  L1-dcache-load-misses    #      0.000 M/sec
      799469226  L1-dcache-stores         #      0.000 M/sec
      196830114  L1-dcache-store-misses   #      0.000 M/sec
      543983919  LLC-loads                #      0.000 M/sec

    3.505318415  seconds time elapsed
    (counts scaled from 57.06% to 85.74% of run time, depending on the event)

temporal@cmult-25-67-217:~/profilers/task1$ perf stat -e cycles:u -e instructions:u -e l1-dcache-loads:u -e l1-dcache-load-misses:u -e l1-dcache-stores:u -e l1-dcache-store-misses:u -e llc-loads:u ./task1
CPU = 34.800000 ms

 Performance counter stats for './task1':

    10646911534  cycles                   #      0.000 M/sec
     4003947805  instructions             #      0.376 IPC
      400485285  L1-dcache-loads          #      0.000 M/sec
      467534052  L1-dcache-load-misses    #      0.000 M/sec
      799711711  L1-dcache-stores         #      0.000 M/sec
      198216756  L1-dcache-store-misses   #      0.000 M/sec
      533935204  LLC-loads                #      0.000 M/sec

    3.495701801  seconds time elapsed
    (counts scaled from 57.10% to 85.81% of run time, depending on the event)

temporal@cmult-25-67-217:~/profilers/task1$ perf stat -e cycles:u -e instructions:u -e l1-dcache-loads:u -e l1-dcache-load-misses:u -e l1-dcache-stores:u -e l1-dcache-store-misses:u -e llc-loads:u -e llc-load-misses:u ./task1
CPU = 34.900000 ms

 Performance counter stats for './task1':

    10666020714  cycles                   #      0.000 M/sec
     4004466241  instructions             #      0.375 IPC
      398378214  L1-dcache-loads          #      0.000 M/sec
      465165709  L1-dcache-load-misses    #      0.000 M/sec
      799150369  L1-dcache-stores         #      0.000 M/sec
      197478094  L1-dcache-store-misses   #      0.000 M/sec
      535311019  LLC-loads                #      0.000 M/sec
       56811441  LLC-load-misses          #      0.000 M/sec

    3.506729036  seconds time elapsed
    (counts scaled from 49.86% to 75.13% of run time, depending on the event)

temporal@cmult-25-67-217:~/profilers/task1$ perf stat -e cycles:u -e instructions:u -e l1-dcache-loads:u -e l1-dcache-load-misses:u -e l1-dcache-stores:u -e l1-dcache-store-misses:u -e llc-loads:u -e llc-load-misses:u -e llc-stores:u ./task1
CPU = 34.900000 ms

 Performance counter stats for './task1':

    10682090958  cycles                   #      0.000 M/sec
     4003403900  instructions             #      0.375 IPC
      396680731  L1-dcache-loads          #      0.000 M/sec
      471412419  L1-dcache-load-misses    #      0.000 M/sec
      799389996  L1-dcache-stores         #      0.000 M/sec
      202155031  L1-dcache-store-misses   #      0.000 M/sec
      549729695  LLC-loads                #      0.000 M/sec
       68255348  LLC-load-misses          #      0.000 M/sec
      383485849  LLC-stores               #      0.000 M/sec

    3.509918495  seconds time elapsed
    (counts scaled from 44.22% to 66.84% of run time, depending on the event)

temporal@cmult-25-67-217:~/profilers/task1$ perf stat -e cycles:u -e instructions:u -e l1-dcache-loads:u -e l1-dcache-load-misses:u -e l1-dcache-stores:u -e l1-dcache-store-misses:u -e llc-loads:u -e llc-load-misses:u -e llc-stores:u -e llc-store-misses:u./task1

 usage: perf stat [<options>] <command>

    -e, --event <event>   event selector. use 'perf list' to list available events
    -i, --inherit         child tasks inherit counters
    -p, --pid <n>         stat events on existing pid
    -a, --all-cpus        system-wide collection from all CPUs
    -c, --scale           scale/normalize counters
    -v, --verbose         be more verbose (show counter open errors, etc)
    -r, --repeat <n>      repeat command and print average + stddev (max: 100)
    -n, --null            null run - dont start any counters

temporal@cmult-25-67-217:~/profilers/task1$ perf stat -e cycles:u -e instructions:u -e l1-dcache-loads:u -e l1-dcache-load-misses:u -e l1-dcache-stores:u -e l1-dcache-store-misses:u -e llc-loads:u -e llc-load-misses:u -e llc-stores:u -e llc-store-misses:u ./task1
CPU = 34.900000 ms

 Performance counter stats for './task1':

    10659650396  cycles                   #      0.000 M/sec
     3998752355  instructions             #      0.375 IPC
      394546020  L1-dcache-loads          #      0.000 M/sec
      932447914  L1-dcache-load-misses    #      0.000 M/sec
      799272109  L1-dcache-stores         #      0.000 M/sec
              0  L1-dcache-store-misses   #      0.000 M/sec
      546259120  LLC-loads                #      0.000 M/sec
       68158849  LLC-load-misses          #      0.000 M/sec
      389598397  LLC-stores               #      0.000 M/sec
       10877361  LLC-store-misses         #      0.000 M/sec

    3.505104811  seconds time elapsed
    (counts scaled from 39.75% to 60.25% of run time, depending on the event)

temporal@cmult-25-67-217:~/profilers/task1$
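The IPC value that perf prints is the instruction count divided by the cycle count, and a cache miss ratio can be derived the same way from a misses/accesses pair of counters. The following is a minimal sketch in C of those two ratios; the function names ipc and miss_ratio are illustrative and not part of the lab material.

```c
#include <stdint.h>

/* Instructions per cycle; returns 0.0 when no cycles were counted. */
double ipc(uint64_t instructions, uint64_t cycles)
{
    return cycles ? (double)instructions / (double)cycles : 0.0;
}

/* Fraction of accesses that missed; returns 0.0 when there were no accesses. */
double miss_ratio(uint64_t misses, uint64_t accesses)
{
    return accesses ? (double)misses / (double)accesses : 0.0;
}
```

With the counters of the last run, ipc(3998752355, 10659650396) is about 0.375, which agrees with the IPC column, and miss_ratio(68158849, 546259120) is roughly 0.125 for last-level cache loads. When perf scales the counts, such ratios are estimates rather than exact values.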

5. Exchange indices in source code of task1.c, such that array is now traversed by rows
instead of by columns, hence taking advantage of both the row-major order used in C
and the cache hierarchy:
array[j][i]  ->  array[i][j]

6. Compile task1.c and execute it through perf, reporting the same events as before.
7. Analyze the new results and compare them with the previous results.
temporal@cmult-25-67-217:~/profilers/task1$ perf stat -e cycles:u -e instructions:u -e l1-dcache-loads:u -e l1-dcache-load-misses:u -e l1-dcache-stores:u -e l1-dcache-store-misses:u -e llc-loads:u -e llc-load-misses:u -e llc-stores:u -e llc-store-misses:u ./task1
CPU = 19.700000 ms

 Performance counter stats for './task1':

     6034968746  cycles                   #      0.000 M/sec
     4110499159  instructions             #      0.681 IPC
      419891997  L1-dcache-loads          #      0.000 M/sec
      420610030  L1-dcache-load-misses    #      0.000 M/sec
      781834156  L1-dcache-stores         #      0.000 M/sec
              0  L1-dcache-store-misses   #      0.000 M/sec
       18247292  LLC-loads                #      0.000 M/sec
         236547  LLC-load-misses          #      0.000 M/sec
      397726674  LLC-stores               #      0.000 M/sec
        9624784  LLC-store-misses         #      0.000 M/sec

    1.984014911  seconds time elapsed
    (counts scaled from 39.62% to 60.38% of run time, depending on the event)

temporal@cmult-25-67-217:~/profilers/task1$
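The index exchange of step 5 can be illustrated with the sketch below. It is not the contents of task1.c: the array size N, the element type, the stored values and the function names are all assumptions made for the example. In C, array[i][j] and array[i][j+1] are adjacent in memory, so the row-wise version touches consecutive addresses, while the column-wise version jumps N elements between consecutive accesses.

```c
#include <stddef.h>

#define N 512  /* hypothetical size; task1.c may use a different one */

/* Column-wise traversal: the inner loop varies the first index, so
   consecutive accesses are N * sizeof(int) bytes apart. */
void fill_by_columns(int array[N][N])
{
    for (size_t i = 0; i < N; i++)
        for (size_t j = 0; j < N; j++)
            array[j][i] = (int)(i + j);
}

/* Row-wise traversal: the inner loop varies the last index, so
   consecutive accesses are adjacent in memory (row-major order). */
void fill_by_rows(int array[N][N])
{
    for (size_t i = 0; i < N; i++)
        for (size_t j = 0; j < N; j++)
            array[i][j] = (int)(i + j);
}
```

Both functions leave the same values in the array; only the order in which memory is visited differs, which is what the cache counters reported by perf are sensitive to.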
8. Study the performance of the program without optimization (remove -O2 from the
Makefile) and with different levels of optimization (-O1, -O2, -O3).
temporal@cmult-25-67-217:~/profilers/task1$ perf stat -e cycles:u -e instructions:u -e l1-dcache-loads:u -e l1-dcache-load-misses:u -e l1-dcache-stores:u -e l1-dcache-store-misses:u -e llc-loads:u -e llc-load-misses:u -e llc-stores:u -e llc-store-misses:u ./task1
CPU = 19.900000 ms

 Performance counter stats for './task1':

     6067503773  cycles                   #      0.000 M/sec
     4118954929  instructions             #      0.679 IPC
      411692296  L1-dcache-loads          #      0.000 M/sec
      424837064  L1-dcache-load-misses    #      0.000 M/sec
      793457905  L1-dcache-stores         #      0.000 M/sec
              0  L1-dcache-store-misses   #      0.000 M/sec
       18391820  LLC-loads                #      0.000 M/sec
         199376  LLC-load-misses          #      0.000 M/sec
      401793291  LLC-stores               #      0.000 M/sec
        7865504  LLC-store-misses         #      0.000 M/sec

    1.998203944  seconds time elapsed
    (counts scaled from 39.96% to 60.04% of run time, depending on the event)

temporal@cmult-25-67-217:~/profilers/task1$ perf stat -e cycles:u -e instructions:u -e l1-dcache-loads:u -e l1-dcache-load-misses:u -e l1-dcache-stores:u -e l1-dcache-store-misses:u -e llc-loads:u -e llc-load-misses:u -e llc-stores:u -e llc-store-misses:u ./task1
CPU = 19.800000 ms

 Performance counter stats for './task1':

     6038718173  cycles                   #      0.000 M/sec
     4037654570  instructions             #      0.669 IPC
      395531227  L1-dcache-loads          #      0.000 M/sec
      418401802  L1-dcache-load-misses    #      0.000 M/sec
      801772715  L1-dcache-stores         #      0.000 M/sec
              0  L1-dcache-store-misses   #      0.000 M/sec
       18696232  LLC-loads                #      0.000 M/sec
         186117  LLC-load-misses          #      0.000 M/sec
      402380246  LLC-stores               #      0.000 M/sec
        7854278  LLC-store-misses         #      0.000 M/sec

    1.988734313  seconds time elapsed
    (counts scaled from 39.66% to 60.34% of run time, depending on the event)

temporal@cmult-25-67-217:~/profilers/task1$ perf stat -e cycles:u -e instructions:u -e l1-dcache-loads:u -e l1-dcache-load-misses:u -e l1-dcache-stores:u -e l1-dcache-store-misses:u -e llc-loads:u -e llc-load-misses:u -e llc-stores:u -e llc-store-misses:u ./task1
CPU = 19.800000 ms

 Performance counter stats for './task1':

     6053984600  cycles                   #      0.000 M/sec
     3987057427  instructions             #      0.659 IPC
      400200988  L1-dcache-loads          #      0.000 M/sec
      415827870  L1-dcache-load-misses    #      0.000 M/sec
      803989205  L1-dcache-stores         #      0.000 M/sec
              0  L1-dcache-store-misses   #      0.000 M/sec
       18570290  LLC-loads                #      0.000 M/sec
         184409  LLC-load-misses          #      0.000 M/sec
      404119786  LLC-stores               #      0.000 M/sec
        8084878  LLC-store-misses         #      0.000 M/sec

    1.991933125  seconds time elapsed
    (counts scaled from 39.82% to 60.18% of run time, depending on the event)

temporal@cmult-25-67-217:~/profilers/task1$ perf stat -e cycles:u -e instructions:u -e l1-dcache-loads:u -e l1-dcache-load-misses:u -e l1-dcache-stores:u -e l1-dcache-store-misses:u -e llc-loads:u -e llc-load-misses:u -e llc-stores:u -e llc-store-misses:u ./task1
CPU = 19.600000 ms

 Performance counter stats for './task1':

     6037123792  cycles                   #      0.000 M/sec
     4116785774  instructions             #      0.682 IPC
      435708410  L1-dcache-loads          #      0.000 M/sec
      420659108  L1-dcache-load-misses    #      0.000 M/sec
      777460939  L1-dcache-stores         #      0.000 M/sec
              0  L1-dcache-store-misses   #      0.000 M/sec
       18025137  LLC-loads                #      0.000 M/sec
         251399  LLC-load-misses          #      0.000 M/sec
      395535681  LLC-stores               #      0.000 M/sec
        8632229  LLC-store-misses         #      0.000 M/sec

    1.982170197  seconds time elapsed
    (counts scaled from 39.65% to 60.35% of run time, depending on the event)

temporal@cmult-25-67-217:~/profilers/task1$ perf stat -e cycles:u -e instructions:u -e l1-dcache-loads:u -e l1-dcache-load-misses:u -e l1-dcache-stores:u -e l1-dcache-store-misses:u -e llc-loads:u -e llc-load-misses:u -e llc-stores:u -e llc-store-misses:u ./task1
CPU = 19.800000 ms

 Performance counter stats for './task1':

     6035870937  cycles                   #      0.000 M/sec
     4117896047  instructions             #      0.682 IPC
      429597588  L1-dcache-loads          #      0.000 M/sec
      419548640  L1-dcache-load-misses    #      0.000 M/sec
      777307875  L1-dcache-stores         #      0.000 M/sec
              0  L1-dcache-store-misses   #      0.000 M/sec
       18108552  LLC-loads                #      0.000 M/sec
         214429  LLC-load-misses          #      0.000 M/sec
      396972530  LLC-stores               #      0.000 M/sec
        8410576  LLC-store-misses         #      0.000 M/sec

    1.987528678  seconds time elapsed
    (counts scaled from 39.63% to 60.37% of run time, depending on the event)

temporal@cmult-25-67-217:~/profilers/task1$
9. Write conclusions to lab report (homework).

10.

Without optimization
temporal@cmult-25-67-217:~/profilers/task1$ make
gcc -c task1.c
gcc -o task1 task1.o
temporal@cmult-25-67-217:~/profilers/task1$ perf stat -e cycles:u -e instructions:u -e l1-dcache-loads:u -e l1-dcache-load-misses:u -e l1-dcache-stores:u -e l1-dcache-store-misses:u -e llc-loads:u -e llc-load-misses:u -e llc-stores:u -e llc-store-misses:u ./task1
CPU = 52.400000 ms

 Performance counter stats for './task1':

    16007060492  cycles                   #      0.000 M/sec
     9205605320  instructions             #      0.575 IPC
     3598395908  L1-dcache-loads          #      0.000 M/sec
      987594656  L1-dcache-load-misses    #      0.000 M/sec
     1598375817  L1-dcache-stores         #      0.000 M/sec
              0  L1-dcache-store-misses   #      0.000 M/sec
      588372537  LLC-loads                #      0.000 M/sec
       92496263  LLC-load-misses          #      0.000 M/sec
      401075795  LLC-stores               #      0.000 M/sec
        7658290  LLC-store-misses         #      0.000 M/sec

    5.254612519  seconds time elapsed
    (counts scaled from 39.89% to 60.11% of run time, depending on the event)

temporal@cmult-25-67-217:~/profilers/task1$

Optimizing with -O1
temporal@cmult-25-67-217:~/profilers/task1$ perf stat -e cycles:u -e instructions:u -e l1-dcache-loads:u -e l1-dcache-load-misses:u -e l1-dcache-stores:u -e l1-dcache-store-misses:u -e llc-loads:u -e llc-load-misses:u -e llc-stores:u -e llc-store-misses:u ./task1
CPU = 52.400000 ms

 Performance counter stats for './task1':

    16023716367  cycles                   #      0.000 M/sec
     9200493948  instructions             #      0.574 IPC
     3597648558  L1-dcache-loads          #      0.000 M/sec
      988629682  L1-dcache-load-misses    #      0.000 M/sec
     1599266301  L1-dcache-stores         #      0.000 M/sec
              0  L1-dcache-store-misses   #      0.000 M/sec
      587307597  LLC-loads                #      0.000 M/sec
       62861608  LLC-load-misses          #      0.000 M/sec
      403045072  LLC-stores               #      0.000 M/sec
        7594972  LLC-store-misses         #      0.000 M/sec

    5.261701125  seconds time elapsed
    (counts scaled from 39.84% to 60.16% of run time, depending on the event)

temporal@cmult-25-67-217:~/profilers/task1$
Optimizing with -O2
temporal@cmult-25-67-217:~/profilers/task1$ perf stat -e cycles:u -e instructions:u -e l1-dcache-loads:u -e l1-dcache-load-misses:u -e l1-dcache-stores:u -e l1-dcache-store-misses:u -e llc-loads:u -e llc-load-misses:u -e llc-stores:u -e llc-store-misses:u ./task1
CPU = 52.400000 ms

 Performance counter stats for './task1':

    16020307101  cycles                   #      0.000 M/sec
     9209170086  instructions             #      0.575 IPC
     3599616932  L1-dcache-loads          #      0.000 M/sec
      987723501  L1-dcache-load-misses    #      0.000 M/sec
     1599231431  L1-dcache-stores         #      0.000 M/sec
              0  L1-dcache-store-misses   #      0.000 M/sec
      588535120  LLC-loads                #      0.000 M/sec
       86883798  LLC-load-misses          #      0.000 M/sec
      401122256  LLC-stores               #      0.000 M/sec
        8350090  LLC-store-misses         #      0.000 M/sec

    5.255650165  seconds time elapsed
    (counts scaled from 39.88% to 60.12% of run time, depending on the event)

temporal@cmult-25-67-217:~/profilers/task1$
Optimizing with -O3
temporal@cmult-25-67-217:~/profilers/task1$ perf stat -e cycles:u -e instructions:u -e l1-dcache-loads:u -e l1-dcache-load-misses:u -e l1-dcache-stores:u -e l1-dcache-store-misses:u -e llc-loads:u -e llc-load-misses:u -e llc-stores:u -e llc-store-misses:u ./task1
CPU = 52.900000 ms

 Performance counter stats for './task1':

    16137204247  cycles                   #      0.000 M/sec
     9209130186  instructions             #      0.571 IPC
     3600613819  L1-dcache-loads          #      0.000 M/sec
      986946125  L1-dcache-load-misses    #      0.000 M/sec
     1597655842  L1-dcache-stores         #      0.000 M/sec
              0  L1-dcache-store-misses   #      0.000 M/sec
      585647246  LLC-loads                #      0.000 M/sec
       97276744  LLC-load-misses          #      0.000 M/sec
      406561279  LLC-stores               #      0.000 M/sec
        8890869  LLC-store-misses         #      0.000 M/sec

    5.310164961  seconds time elapsed
    (counts scaled from 39.89% to 60.11% of run time, depending on the event)

temporal@cmult-25-67-217:~/profilers/task1$
Swapping indices, optimization -O3
temporal@cmult-25-67-217:~/profilers/task1$ perf stat -e cycles:u -e instructions:u -e l1-dcache-loads:u -e l1-dcache-load-misses:u -e l1-dcache-stores:u -e l1-dcache-store-misses:u -e llc-loads:u -e llc-load-misses:u -e llc-stores:u -e llc-store-misses:u ./task1
CPU = 5.700000 ms

 Performance counter stats for './task1':

     1740572548  cycles                   #      0.000 M/sec
     1443499389  instructions             #      0.829 IPC
       97700052  L1-dcache-loads          #      0.000 M/sec
      109374222  L1-dcache-load-misses    #      0.000 M/sec
      197872525  L1-dcache-stores         #      0.000 M/sec
              0  L1-dcache-store-misses   #      0.000 M/sec
        9433803  LLC-loads                #      0.000 M/sec
         216797  LLC-load-misses          #      0.000 M/sec
      102765566  LLC-stores               #      0.000 M/sec
        8736361  LLC-store-misses         #      0.000 M/sec

    0.578716525  seconds time elapsed
    (counts scaled from 38.72% to 61.28% of run time, depending on the event)

temporal@cmult-25-67-217:~/profilers/task1$
Swapping indices, optimization -O2
temporal@cmult-25-67-217:~/profilers/task1$ perf stat -e cycles:u -e instructions:u -e l1-dcache-loads:u -e l1-dcache-load-misses:u -e l1-dcache-stores:u -e l1-dcache-store-misses:u -e llc-loads:u -e llc-load-misses:u -e llc-stores:u -e llc-store-misses:u ./task1
CPU = 5.700000 ms

 Performance counter stats for './task1':

     1724945432  cycles                   #      0.000 M/sec
     1405792427  instructions             #      0.815 IPC
       96543406  L1-dcache-loads          #      0.000 M/sec
      108272949  L1-dcache-load-misses    #      0.000 M/sec
      199183731  L1-dcache-stores         #      0.000 M/sec
              0  L1-dcache-store-misses   #      0.000 M/sec
        9583820  LLC-loads                #      0.000 M/sec
         199921  LLC-load-misses          #      0.000 M/sec
      102350319  LLC-stores               #      0.000 M/sec
        8069378  LLC-store-misses         #      0.000 M/sec

    0.573821749  seconds time elapsed
    (counts scaled from 39.06% to 60.94% of run time, depending on the event)

temporal@cmult-25-67-217:~/profilers/task1$
Swapping indices, optimization -O1
temporal@cmult-25-67-217:~/profilers/task1$ perf stat -e cycles:u -e instructions:u -e l1-dcache-loads:u -e l1-dcache-load-misses:u -e l1-dcache-stores:u -e l1-dcache-store-misses:u -e llc-loads:u -e llc-load-misses:u -e llc-stores:u -e llc-store-misses:u ./task1
CPU = 5.700000 ms

 Performance counter stats for './task1':

     1739406471  cycles                   #      0.000 M/sec
     1419222785  instructions             #      0.816 IPC
       99982320  L1-dcache-loads          #      0.000 M/sec
      109200483  L1-dcache-load-misses    #      0.000 M/sec
      197092356  L1-dcache-stores         #      0.000 M/sec
              0  L1-dcache-store-misses   #      0.000 M/sec
        9261294  LLC-loads                #      0.000 M/sec
         194582  LLC-load-misses          #      0.000 M/sec
      102250007  LLC-stores               #      0.000 M/sec
        8148738  LLC-store-misses         #      0.000 M/sec

    0.577769968  seconds time elapsed
    (counts scaled from 38.79% to 61.21% of run time, depending on the event)

temporal@cmult-25-67-217:~/profilers/task1$
Swapping both indices, optimization -O2
temporal@cmult-25-67-217:~/profilers/task1$ perf stat -e cycles:u -e instructions:u -e l1-dcache-loads:u -e l1-dcache-load-misses:u -e l1-dcache-stores:u -e l1-dcache-store-misses:u -e llc-loads:u -e llc-load-misses:u -e llc-stores:u -e llc-store-misses:u ./task1
CPU = 5.400000 ms

 Performance counter stats for './task1':

     1661158018  cycles                   #      0.000 M/sec
     4032625761  instructions             #      2.428 IPC
      400393144  L1-dcache-loads          #      0.000 M/sec
       12301881  L1-dcache-load-misses    #      0.000 M/sec
      787831854  L1-dcache-stores         #      0.000 M/sec
              0  L1-dcache-store-misses   #      0.000 M/sec
        6146244  LLC-loads                #      0.000 M/sec
         126402  LLC-load-misses          #      0.000 M/sec
        6186194  LLC-stores               #      0.000 M/sec
         123017  LLC-store-misses         #      0.000 M/sec

    0.552338166  seconds time elapsed
    (counts scaled from 39.24% to 60.76% of run time, depending on the event)

temporal@cmult-25-67-217:~/profilers/task1$

Swapping both indices, optimization -O1


temporal@cmult-25-67-217:~/profilers/task1$ perf stat -e cycles:u -e instructions:u -e l1-dcache-loads:u -e l1-dcache-load-misses:u -e l1-dcache-stores:u -e l1-dcache-store-misses:u -e llc-loads:u -e llc-load-misses:u -e llc-stores:u -e llc-store-misses:u ./task1
CPU = 5.400000 ms

 Performance counter stats for './task1':

     1639985549  cycles                   #      0.000 M/sec
     3993195021  instructions             #      2.435 IPC
      393478492  L1-dcache-loads          #      0.000 M/sec
       12291323  L1-dcache-load-misses    #      0.000 M/sec
      792372646  L1-dcache-stores         #      0.000 M/sec
              0  L1-dcache-store-misses   #      0.000 M/sec
        6303632  LLC-loads                #      0.000 M/sec
         127549  LLC-load-misses          #      0.000 M/sec
        6224012  LLC-stores               #      0.000 M/sec
         117652  LLC-store-misses         #      0.000 M/sec

    0.545328157  seconds time elapsed
    (counts scaled from 38.44% to 61.56% of run time, depending on the event)

temporal@cmult-25-67-217:~/profilers/task1$

Swapping both indices, optimization -O3


temporal@cmult-25-67-217:~/profilers/task1$ perf stat -e cycles:u -e instructions:u -e l1-dcache-loads:u -e l1-dcache-load-misses:u -e l1-dcache-stores:u -e l1-dcache-store-misses:u -e llc-loads:u -e llc-load-misses:u -e llc-stores:u -e llc-store-misses:u ./task1
CPU = 5.400000 ms

 Performance counter stats for './task1':

     1648017829  cycles                   #      0.000 M/sec
     4010909417  instructions             #      2.434 IPC
      395102231  L1-dcache-loads          #      0.000 M/sec
       12261564  L1-dcache-load-misses    #      0.000 M/sec
      790829417  L1-dcache-stores         #      0.000 M/sec
              0  L1-dcache-store-misses   #      0.000 M/sec
        6264775  LLC-loads                #      0.000 M/sec
         129520  LLC-load-misses          #      0.000 M/sec
        6212471  LLC-stores               #      0.000 M/sec
         119045  LLC-store-misses         #      0.000 M/sec

    0.547834641  seconds time elapsed
    (counts scaled from 38.75% to 61.25% of run time, depending on the event)

temporal@cmult-25-67-217:~/profilers/task1$
Notes:
Before parallelizing, the program must be optimized so that it runs as fast as possible,
even when it is slower.
How to optimize the program before parallelizing it.
How the machine code is generated when the program is compiled.
Three optimization levels.
The compiler analyzes the program to make it run faster.
Changes the compiler makes to try to make it run faster.
How optimization affects the result.
PERF (II)


1. Go to directory profilers/task2 (matrix multiplication).


2. Edit and understand example task2.c.
3. Compile task2.c.
4. Run task2 with the performance analysis tool (perf).
temporal@cmult-25-67-217:~/profilers/task2$ perf stat -e cycles:u -e instructions:u -e l1-dcache-loads:u -e l1-dcache-load-misses:u -e l1-dcache-stores:u -e l1-dcache-store-misses:u -e llc-loads:u -e llc-load-misses:u -e llc-stores:u -e llc-store-misses:u ./task2
Equal 0
CPU = 3110.000000 ms

 Performance counter stats for './task2':

     9611683298  cycles                   #      0.000 M/sec
    12225779862  instructions             #      1.272 IPC
     5087796167  L1-dcache-loads          #      0.000 M/sec
     1015809083  L1-dcache-load-misses    #      0.000 M/sec
     1056402996  L1-dcache-stores         #      0.000 M/sec
              0  L1-dcache-store-misses   #      0.000 M/sec
     1018152124  LLC-loads                #      0.000 M/sec
       14834768  LLC-load-misses          #      0.000 M/sec
          62788  LLC-stores               #      0.000 M/sec
          59957  LLC-store-misses         #      0.000 M/sec

    3.158259454  seconds time elapsed
    (counts scaled from 40.02% to 59.98% of run time, depending on the event)

temporal@cmult-25-67-217:~/profilers/task2$
5. Implement a new version of the matrix multiplication function (Mult2) that takes
advantage of both the row-major order used in C and the cache hierarchy.
[Figure: Standard matrix multiplication (Mult1)]
[Figure: Optimized matrix multiplication (Mult2)]

6. Run task2 and verify that both functions (Mult1, Mult2) yield the same result.


7. Comment call to Mult1 in task2.c and run task2 with the performance analysis tool
(perf).
8. Compare the performance of both functions and write conclusions to lab report,
including the implementation of Mult2 (homework).
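One common way to make matrix multiplication friendlier to row-major storage is to interchange the two inner loops, so that the innermost loop walks along rows of both the second operand and the result. The sketch below shows that idea under several assumptions: square n x n matrices of double stored as flat row-major arrays, and the hypothetical names mult_ijk and mult_ikj. The actual signatures of Mult1 and Mult2 in task2.c may differ, and other approaches (for example transposing the second matrix, or blocking) are equally valid.

```c
#include <stddef.h>

/* Standard i-j-k order: the innermost loop reads b down a column,
   so consecutive reads of b are n elements apart. */
void mult_ijk(size_t n, const double *a, const double *b, double *c)
{
    for (size_t i = 0; i < n; i++)
        for (size_t j = 0; j < n; j++) {
            double sum = 0.0;
            for (size_t k = 0; k < n; k++)
                sum += a[i * n + k] * b[k * n + j];
            c[i * n + j] = sum;
        }
}

/* Interchanged i-k-j order: the innermost loop reads b and updates c
   along a row, so consecutive accesses are adjacent in memory. */
void mult_ikj(size_t n, const double *a, const double *b, double *c)
{
    for (size_t idx = 0; idx < n * n; idx++)
        c[idx] = 0.0;  /* c is accumulated into, so clear it first */

    for (size_t i = 0; i < n; i++)
        for (size_t k = 0; k < n; k++) {
            const double aik = a[i * n + k];
            for (size_t j = 0; j < n; j++)
                c[i * n + j] += aik * b[k * n + j];
        }
}
```

Both functions compute the same product; comparing their outputs element by element is one way to carry out the check requested in step 6.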
GPROF
1. Go to directory profilers/task3 (image processing algorithm).
2. Edit and understand structure of Makefile. Option -pg at compile time forces
compiler to generate profile data suitable for gprof.
3. Compile program:
make
4. Run program (it generates binary file gmon.out with profile data). Input and output
bitmap images can be viewed with any image visualization program (gimp, xv, etc.):
./algi channel1.bmp channel2.bmp
5. Run gprof generating profile file:
gprof algi > profile
6. Edit and understand the self-explained profile file.
7. Identify functions that consume a significant percentage of running time (bottlenecks):
these are the best candidates to be optimized / parallelized. Any improvement on them will
have a significant impact on the overall running time:
a. Large functions (big "self ms/call" value) with a big percentage of running time
(big "% time" value).
b. Small functions (low "self ms/call" value) that run very frequently (big "% time"
value).
8. Write conclusions to lab report, including description of bottleneck functions
(homework).
KPROF (graphical front-end to gprof)
1. Execute kprof.
2. Open profile file generated by gprof:
File > Open
3. Examine tabs Flat Profile, Hierarchical Profile and Graph View.


Laboratory 1: OpenMP (Compilation and basic directives)


Proceed through the following steps, completing the lab report as requested.
1. Obtain CPU information from the Linux kernel:
cat /proc/cpuinfo > cpuinfo.txt
gedit cpuinfo.txt
2. Identify the CPU's model in cpuinfo.txt. Example:
model name : Intel(R) Core(TM) i5 CPU 650 @ 3.20GHz

3. Identify the number of CPUs in cpuinfo.txt. The number of CPUs is equal to the number
of different physical identifiers of the available logical processors. Example for a single
CPU Intel Core i5:
processor   : 0    (first logical processor)
physical id : 0
processor   : 1
physical id : 0
processor   : 2
physical id : 0
processor   : 3
physical id : 0

4. Identify number of cores per CPU in cpuinfo.txt. Example for Intel Core i5 with 2 cores:
cpu cores   : 2

5. Identify number of hardware supported threads (i.e.: logical processors) per CPU in
cpuinfo.txt. If the number of supported threads is N times the number of cores, the CPU
supports hyper-threading and each core will be able to concurrently execute N of those
threads by sharing its internal resources (ALU, FPU, etc.). Example for Intel Core i5 with 2
cores and hyper-threading:
siblings    : 4

6. Identify what logical processors correspond to each CPU and core. Example for a single
Intel Core i5 in which the first two logical processors are mapped to the first core and the
last two logical processors to the second core, with a single CPU (physical processor):
processor   : 0
physical id : 0
core id     : 0
processor   : 1
physical id : 0
core id     : 0
processor   : 2
physical id : 0
core id     : 2
processor   : 3
physical id : 0
core id     : 2
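The relation between the siblings and cpu cores fields described in steps 4 and 5 can be written down as a small calculation. The following C sketch is only an illustration; the function names threads_per_core and has_hyperthreading are made up for this example, and the two arguments are the values read by hand from cpuinfo.txt.

```c
/* Hardware threads per core for one physical CPU:
   siblings (logical processors) divided by cpu cores. */
int threads_per_core(int siblings, int cpu_cores)
{
    return cpu_cores > 0 ? siblings / cpu_cores : 0;
}

/* Nonzero when each core runs more than one hardware thread,
   i.e. the CPU reports hyper-threading. */
int has_hyperthreading(int siblings, int cpu_cores)
{
    return threads_per_core(siblings, cpu_cores) > 1;
}
```

For the Intel Core i5 example (siblings 4, cpu cores 2), threads_per_core returns 2 and has_hyperthreading is true, which matches the mapping of two logical processors onto each core.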

7. Using a web browser, verify on the Internet the number of cores and threads per core
based on the CPU's model information. Write conclusions to lab report (homework).
8. Download associated material (openmp1.tar.gz) from Moodle's course page into personal
working directory.
9. Uncompress and untar associated material:


gunzip openmp1.tar.gz
tar xvf openmp1.tar
10. Go to directory openmp1/task1.
11. Edit and understand structure of Makefile. Option -fopenmp at compile time forces
compiler to understand OpenMP directives. Option -lgomp at link time forces linker to
include the OpenMP library for Linux (GOMP).
12. Edit and understand example task1.c.
13. In a new terminal, run the run-time CPU monitor mpstat (if not available, run
gnome-system-monitor instead):
xterm &            (create new terminal)
mpstat -P ALL 1    (run from the new terminal)

mpstat shows statistical information about each available logical processor, including
percentage of CPU load at the user level (%usr) and the system level (%sys).
14. In task1.c, set the number of OpenMP threads (constant NUM_THREADS) to 1. From
the initial terminal, compile the program and execute it, writing down the wall time (real
execution time). This wall time is the minimum sequential time (Ts) of the algorithm.
Observe in mpstat which logical processor is executing the program.
15. Set the number of threads to 2 in task1.c, recompile the program and run it, checking with
mpstat which logical processors are executing the two threads. Execute the program several
times. The operating system normally maps each thread to a different core. The logical
processor within the core may vary from one execution to the next. Write down the average
wall time of all executions, which corresponds to the parallel time for two cores (Tp).
16. Compute speedup (S = Ts / Tp) and efficiency (E = S / p, with p = 2) for two cores.
17. Force the mapping of threads to logical processors so that both OpenMP threads are
mapped to the first two logical processors. If the CPU supports hyper-threading, the first
two logical processors belong to the same core. Example (the quotes are needed because
the value contains a space):
export GOMP_CPU_AFFINITY="0 1"
18. Run the program several times and compute speedup and efficiency for two threads
executed by the same core. This measures the performance of hyper-threading for this
particular CPU-intensive application.
19. Force the mapping of threads to specific logical processors of different cores. Example:
export GOMP_CPU_AFFINITY="1 3"
20. Run the program several times and compute speedup and efficiency for two threads
executed in specific logical processors belonging to different cores. Compare that
performance with the one obtained through the automatic mapping of threads to logical
processors provided by the operating system.
21. In task1.c, set the number of OpenMP threads to 4 and force all threads to run on the
same logical processor. When several threads are assigned to the same logical processor, it
executes them with time-sharing. Example:
export GOMP_CPU_AFFINITY=3
22. Run the program several times and compute speedup and efficiency for four threads
executed by the same logical processor.
23. In task1.c, set the number of OpenMP threads to 4 and force all threads to run on the
logical processors belonging to the same core. Example:
export GOMP_CPU_AFFINITY="2 3"

24. Run the program several times and compute speedup and efficiency for four threads
executed by the logical processors within the same core.
25. Release the explicit mapping of threads to specific logical processors, so that the
mapping is left to the operating system again:
unset GOMP_CPU_AFFINITY
26. Write down the results and conclusions to lab report (homework).
27. Go to directory openmp1/task2.
28. Edit and understand example task2.c.
29. Compile and run task2 several times. Notice that the PID is always different at the
beginning of the parallel body, and the same at its end. Analyze and interpret this behavior.
30. Declare variable pid as private. Compile the program and run it again several times,
noticing that the PID is now always different. Analyze and interpret this behavior.
31. Write conclusions to lab report (homework).
32. Run task2 again and notice that private variable limit, which is initialized to -1 in its
program declaration, no longer holds -1 at the beginning of the parallel body (it typically
reads as zero, although the value of an uninitialized private copy is unspecified), whereas it
is -1 again when the master thread resumes its execution right after the parallel body.
Analyze this behavior by considering that every thread within a parallel region has a local
copy of all its private variables.
33. Replace the private clause with firstprivate, which initializes the local copies of private
variables to their original value. Compile and run the program again, noting the difference.
34. Write conclusions to lab report (homework).
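The effect of the firstprivate clause can be illustrated with a minimal sketch. This is not the course's task2.c: the function name is hypothetical, and only the variable name limit and its initial value -1 are taken from the steps above. It must be compiled with -fopenmp for the directive to take effect. Without that option the pragma is ignored and the body runs once in a single thread.

```c
#include <assert.h>

/*
 * Return 1 if every thread found the original value (-1) in its
 * firstprivate copy of limit, 0 otherwise.
 */
int firstprivate_copies_start_at_original_value(void)
{
    int limit = -1;   /* same initial value as in the handout       */
    int all_ok = 1;   /* combined over all threads by the reduction */

    /* With private(limit) instead, each copy would be uninitialized
       and reading it here would yield an unspecified value. */
    #pragma omp parallel firstprivate(limit) reduction(&&:all_ok)
    {
        all_ok = all_ok && (limit == -1);
    }

    assert(limit == -1);  /* the original variable is never modified */
    return all_ok;
}
```

The reduction clause is used only to collect one answer from all threads without a data race. It is a design choice of this sketch, not something the lab requires.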
35. Go to directory openmp1/task3.
36. Edit and understand example task3.c.
37. Compile and run task3 several times. Analyze why all threads run alternately.
38. Comment out both the omp_set_lock and the omp_unset_lock function calls. Compile and
run again several times. Analyze why the threads do not run alternately.
39. Enclose the critical region in a critical directive. Compile and run again several times.
The result is the same as when locks are used. Example:
#pragma omp critical
{
// Critical region: One thread at a time
...
}

40. Remove the critical directive. Insert a barrier synchronization right above the workload.
Compile and run again several times. Analyze why the threads run alternately again. Since
there is no critical region, the workloads of both threads run with global
synchronization but without mutual exclusion. Example:
// Critical region: One thread at a time
#pragma omp barrier
// Workload

41. Go back to the original task3.c with the wait and signal semaphore calls (omp_set_lock
and omp_unset_lock). Comment out the omp_unset_lock call (wait but no signal). Compile
and run it again. Analyze why the first thread runs only once and the program hangs. Press
Ctrl-C to stop the program.
42. Uncomment the omp_unset_lock call and insert a barrier synchronization right above the
workload. Compile and run again. Analyze why the program hangs right from the beginning.
Press Ctrl-C to stop the program.
43. Write conclusions to lab report (homework).
