A good reference book is “Real-Time Collision Detection” by Christer Ericson, Morgan Kaufmann, 2005. See chapter 13: “Optimization”.

Both the 360 and PS3 documentation provide an extensive description of the inner workings of their CPUs and optimization guidelines.

For Intel platforms and for optimization of C++ constructs a very good reference is Agner Fog’s manuals: http://www.agner.org/optimize/

We don’t even have the full processing power available for all 33 milliseconds: the console OS will reclaim some from time to time (e.g. to handle a background download), so we have to leave some slack.

"We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil." (Knuth, Donald. Structured Programming with go to Statements, ACM Journal Computing Surveys, Vol 6, No. 4, Dec. 1974. p.268.). Note that Knuth attributed the “premature optimization…” statement to Hoare (http://en.wikiquote.org/wiki/C A R

See “Mature Optimization” by Mick West, Game Developer Magazine, January 2006

http://cowboyprogramming.com/2007/01/04/mature-optimization-2/

Do not prematurely pessimize: write code avoiding common performance pitfalls

Aaaaaaaargh!

Code and screenshot from the “lattice” 256-byte intro.

Runtime dependencies: communication, ownership

•I.e. “game” pushes data to rendering; “rendering” does not talk to the game

Compile-time dependencies: types, libraries

•I.e. “rendering” data is made of game-defined types, so it statically depends on the game

More general systems = more complex, more code = harder to change

We should move towards hot-swappable services:

Even better: live-coding

http://en.wikipedia.org/wiki/Big_O_notation

Some algorithms may be better for large inputs but worse for the input sizes we are using in practice! It happens all the time.

E.g. Vector versus Map or Hashtable; insertion-sort versus merge- or quick-sort.

Also, cache efficiency is a very common issue that makes log(n) algorithms slower than linear ones on small input sizes.
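
As an illustration (a hypothetical sketch, not from the original notes), a linear scan over a contiguous std::vector often beats a std::map lookup for small element counts, despite the worse big-O, simply because the vector streams through a few cache lines while the map chases pointers:

```cpp
#include <map>
#include <utility>
#include <vector>

// O(n) scan, but over contiguous memory: very cache-friendly for small n.
int lookupVector(const std::vector<std::pair<int, int>>& v, int key)
{
    for (const auto& kv : v)
        if (kv.first == key) return kv.second;
    return -1;
}

// O(log n) lookup, but every node visited is likely a cache miss.
int lookupMap(const std::map<int, int>& m, int key)
{
    auto it = m.find(key);
    return it != m.end() ? it->second : -1;
}
```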

Multiple “hardware threads” per core: CPUs like to have multiple independent instruction paths so that, if they are stalled on an instruction in one path, they can use the other one to keep themselves busy.

Stalls are usually caused by memory accesses.

All cores of the 360 and the PS3 PPU have two hardware threads (so we have six hardware threads to use on the 360, and two PPU threads plus six SPU ones on the PS3).

See “Coding For Multiple Cores on Xbox 360 and Microsoft Windows” in your Xbox 360 SDK documentation!

Design subsystems with thread safety in mind. Minimize shared _mutable_ data. Const-correctness helps. http://en.wikipedia.org/wiki/Thread-safety
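
A minimal sketch of that idea (hypothetical example, not from the original notes): instead of many threads updating one shared, lock-protected accumulator, each worker only reads the shared data (never writes it) and writes only into its own private slot, so no locks are needed:

```cpp
#include <algorithm>
#include <numeric>
#include <thread>
#include <vector>

// Sums 'data' with 'workers' threads (workers >= 1 assumed).
long long parallelSum(const std::vector<int>& data, unsigned workers)
{
    std::vector<long long> partial(workers, 0);   // one private result slot per thread
    std::vector<std::thread> pool;
    const size_t chunk = (data.size() + workers - 1) / workers;

    for (unsigned w = 0; w < workers; ++w)
        pool.emplace_back([&, w] {
            const size_t begin = w * chunk;
            const size_t end   = std::min(data.size(), begin + chunk);
            long long local = 0;                  // accumulate locally, write once at the end
            for (size_t i = begin; i < end; ++i)
                local += data[i];                 // shared data is read-only: no lock needed
            partial[w] = local;
        });

    for (auto& t : pool) t.join();
    return std::accumulate(partial.begin(), partial.end(), 0LL);
}
```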

Some further reading:

•Stream processing (Stream Processing in General-Purpose Processors: http://www.cs.utexas.edu/users/skeckler/wild04/Paper14.pdf)

•Erlang: http://en.wikipedia.org/wiki/Erlang_(programming_language)

•How the GPU works: http://c0de517e.blogspot.com/2008/04/gpu-part-1.html

•Fibers: http://en.wikipedia.org/wiki/Fiber_(computer_science)

•Map/Reduce: http://en.wikipedia.org/wiki/MapReduce, http://en.wikipedia.org/wiki/Map_(higher-order_function)

•.Net Parallel FX PLINQ implementation: http://en.wikipedia.org/wiki/Task_Parallel_Library

•OpenMP: http://en.wikipedia.org/wiki/OpenMP

The GPU is another unit that executes in parallel and depends on the render thread. Usually the render thread prepares data for the next frame while the GPU is executing the previous one, much like the simulation thread prepares data for the render thread and pushes it into a buffer.
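
A minimal double-buffering sketch of that hand-off (hypothetical example, not from the original notes): the producer fills one buffer while the consumer reads the other, and the buffers swap roles once both have finished their frame:

```cpp
#include <array>

struct FrameData { /* matrices, visible objects, draw commands, ... */ };

class DoubleBuffer
{
public:
    FrameData&       writeBuffer()      { return buffers[writeIndex]; }      // filled for frame N+1
    const FrameData& readBuffer() const { return buffers[writeIndex ^ 1]; }  // consumed for frame N

    // Called once per frame, after both sides have reached the frame barrier.
    void flip() { writeIndex ^= 1; }

private:
    std::array<FrameData, 2> buffers;
    int writeIndex = 0;
};
```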

Most of our system libraries are not thread safe; thread safety should be ensured when using them, in our high-level implementation classes. This is done to maximize performance and not end up with locks everywhere (as the synchronized Java containers do, for example, with the catch that some JIT virtual machines can automatically avoid locks if needed).

For very simple data structures it’s possible to write thread-safe versions without locks (http://en.wikipedia.org/wiki/Lock-free_and_wait-free_algorithms). But lock-free programming is a nightmare, avoid it.

Purely functional (persistent) data structures (http://books.google.ca/books?id=SxPzSTcTalAC) can be of some usefulness too.

Slides from: bps10.idav.ucdavis.edu/talks/04-lefohn_ParallelProgrammingGraphics_BPS_SIGGRAPH2010.pdf

http://www.catonmat.net/blog/mit-introduction-to-algorithms-part-fourteen

The king of our modern CPU problems: CPUs and GPUs are becoming faster at a higher pace than memory is! This is also a limit to multithreading performance, as we have to fetch the data to be processed!

L1 cache hits (accesses to data that is in the L1 cache) have their cost hidden by pipeline latency (by the execution of other instructions between the loading of the memory into a register and the actual use of that register). L1 cache misses that hit in L2 cost more than 40 cycles, but they can be partially hidden by the execution of the instructions of the other hardware thread in the same core.

Beware of cache behaviour when multithreading: nearby data read and written by two different cores leads to bad performance (“false sharing”: the L1 caches of the cores will be invalidated each time the other core writes data), while if the same happens on two hardware threads of the same core executing the same code, then the L1 data and code caches will be used optimally (see “Caches and Multithreading” in your Xbox 360 SDK documentation).
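
A minimal sketch of the usual fix (hypothetical example, not from the original notes): pad or align per-core data to the cache line size (128 bytes on these CPUs, as noted below) so two cores never write into the same line:

```cpp
// Each counter gets its own 128-byte cache line, so one core writing its
// counter never invalidates the line another core is reading or writing.
struct alignas(128) PaddedCounter
{
    unsigned value = 0;
};

PaddedCounter perThreadCounters[6];   // e.g. one per hardware thread on the 360
```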

Refer to the Xbox SDK paper: “Xbox 360 CPU Caches”.

Some numbers: on the 360 the cache lines (the minimum amount of data that will be transferred to the cache due to a cache miss) are 128 bytes wide, L2 is 1MB, L1 is 32+32KB. The L1 cache is write-through (all stores will always, after the store-gathering buffer, go to update the L2 cache) and non-allocating (stores do not fill L1 cache lines if the address is not already there). Stores and loads go to queues; stores are further organized in store-gathering buffers to reorder scattered stores into linear ones to go into the caches. There’s no predictive prefetcher logic on the 360 and just basic predictive prefetching on the PS3 (way different from the x86 world; in general the PS3 and 360 CPUs have a lot of raw power and less logic, they’re made for experienced programmers, not to improve random code by clever rescheduling and out-of-order execution, and this seems to be the direction of the future anyway).

Another important concept is cache SETS. On the 360 Xenon CPU the L2 cache is 8-way associative; it means that the 1MB/128 bytes = 8192 cache lines are organized into 8192/8 = 1024 sets. Caching of a memory address goes into a given set using the formula set = (memory_address / line_size) % number_of_sets. So in our case the set number is (memory_address / 128) % 1024, which means that two addresses that are number_of_sets * line_size = 128KB apart fall into the same set (the critical stride). There is space for only 8 cache lines in each set, so if we have a loop where we read consecutively from 9 addresses, each one 128KB apart, we will have a cache miss on each iteration of the loop, even if the cache is not full! That issue is more serious on the L1 cache, which is 32KB for data and thus has 256 lines arranged in 4 ways, leading to 64 sets: the critical stride there is only 8KB! That can sometimes cause problems, e.g. in a two-dimensional array with rows of 8KB size, accessing 5 elements of a column always causes a cache miss.
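
A small sketch of that formula (hypothetical example, not from the original notes), using the 360 L2 numbers above:

```cpp
#include <cstdint>

constexpr uint32_t kLineSize = 128;    // bytes per cache line
constexpr uint32_t kNumSets  = 1024;   // 1MB / 128B per line / 8 ways

// Which L2 set an address maps to; addresses 128KB apart collide in the same set.
inline uint32_t l2Set(uint32_t address)
{
    return (address / kLineSize) % kNumSets;
}
```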

Cache-oblivious linearization of trees can be performed via the van Emde Boas layout (see http://en.wikipedia.org/wiki/Van_Emde_Boas_tree).

Within-vector operations are not common on SWAR (SIMD within a register) architectures as commonly found on CPUs. They are possible on the GPU SIMD processors.

To be efficiently loaded and stored, SIMD data should be 16-byte aligned (16 bytes = 4 floats).
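
A minimal sketch (hypothetical example, not from the original notes; alignas is the modern C++ spelling, older compilers use __declspec(align(16)) or __attribute__((aligned(16)))):

```cpp
// A 4-float vector forced to 16-byte alignment so it can be moved with a
// single aligned SIMD load/store.
struct alignas(16) Vec4
{
    float x, y, z, w;
};

static_assert(sizeof(Vec4) == 16, "Vec4 must map to one SIMD register");
```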


Moore’s law: it was originally about transistor count, and processors roughly managed to respect it. But CPUs are also respecting it in performance, which is odd, as performance should increase due to transistor count AND CPU frequency (faster!). GPUs are following Moore’s law in transistor count but beating it (as should be expected) when it comes to performance, though only on heavily data-parallel tasks where all the code runs in parallel (Amdahl’s law is the limiting factor there).

What’s on the die? (PC processors, from past to future)

8086 --- Mostly processing power: logic units.

386, 486 --- Processing power and caches: a bit of cache, FPUs, multiple pipelines.

Pentium2, Pentium3 --- Caches and scheduling logic: heavy instruction decode/reorder units, branch prediction, cache prediction, longer pipelines.

Pentium4, Core2, i7 --- Multicore + big caches.

Future --- Back to “pure” processing power: ALUs on most of the die (and cache). Manycore, small decode stages (in-order, shared between units) and caches (shared between units), wide hardware and logical SIMD, lower power/flops ratio (GPUs, Cell). Manycore (“GPU”) integrated with multicore (“CPU”), sharing a cache level or a direct bus interconnection (single die or fast paths between units: Xenon/Xenos, PS3 PPU/SPU).

www.gpgpu.org/static/s2007/slides/02-gpu-architecture-overview-s07.pdf

s09.idav.ucdavis.edu/talks/02_kayvonf_gpuArchTalk09.pdf

bps10.idav.ucdavis.edu/talks/03-fatahalian_gpuArchTeraflop_BPS_SIGGRAPH2010.pdf

The prediction is “easy”: again, we already do that on GPUs, even if shader languages are very constrained in terms of communication. CUDA and OpenCL are more general, but on the other hand they expose too much of the underlying hardware. We still have to improve our tools, but it will happen.

How much latency? On the 360 GPU, from the start of a shader (task) to the end (the write into the framebuffer) there are roughly 1000 GPU cycles of latency.

Just a few examples! There are many fast sequential sorts (e.g. Radix and the other “distribution” sorts), many are even faster if the sequence to sort has certain properties (e.g. uniform: Flash, almost sorted: Smooth) or if some given behaviour is desirable (e.g. cache-efficient: Funnel, few writes: Cycle, extracts the LIS: Patience, online: Library), and most of them can be parallelized (not only MergeSort). Also hybrids are often useful (e.g. radix sort plus parallel merge).
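
For reference, a minimal LSD radix sort over 32-bit keys (a hypothetical sketch, not from the original notes), one byte per pass, showing how a “distribution” sort avoids comparisons entirely:

```cpp
#include <cstdint>
#include <vector>

void radixSort(std::vector<uint32_t>& keys)
{
    std::vector<uint32_t> tmp(keys.size());
    for (int shift = 0; shift < 32; shift += 8)        // 4 passes, one byte each
    {
        size_t count[256] = {};
        for (uint32_t k : keys) ++count[(k >> shift) & 0xFF];               // histogram

        size_t offset[256];
        size_t sum = 0;
        for (int b = 0; b < 256; ++b) { offset[b] = sum; sum += count[b]; } // prefix sum

        for (uint32_t k : keys) tmp[offset[(k >> shift) & 0xFF]++] = k;     // stable scatter
        keys.swap(tmp);
    }
}
```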

www.cse.ohio-state.edu/~kerwin/MPIGpu.pdf

theory.csail.mit.edu/classes/6.895/fall03/projects/final/youn.ppt

The language examples are only a sample of what we can or could use for games.

OpenCL and Intel SPMD are data-parallel programming languages (stream-oriented? Not really, yet).

OCaml, Haskell and C# support functional programming (lambdas, closures). They also support data-parallel tasks (Data Parallel Haskell, Parallel FX) and coroutines (C#: only in the Mono runtime).

Go, Lua and Stackless Python are examples of languages implementing coroutines/continuations (fibers, cooperative threading).

Caching is another technique that can be useful to improve data locality. If the access to a big data array is random, but coherent in time, we can copy the last n accessed items into a buffer that holds them near each other in memory. Then the next time we can check the buffer first: if it still contains the data we need, we avoid performing random accesses and their cache misses.
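
A minimal sketch of such a software cache (hypothetical example, not from the original notes): a small direct-mapped buffer sitting in front of a big array, checked before touching the array itself:

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// T must be default-constructible; Slots should be small enough to stay cache-resident.
template <typename T, std::size_t Slots = 64>
class SoftwareCache
{
public:
    explicit SoftwareCache(const std::vector<T>& backing) : data(backing)
    {
        for (auto& idx : cachedIndex) idx = SIZE_MAX;   // mark every slot as empty
    }

    const T& get(std::size_t i)
    {
        const std::size_t slot = i % Slots;             // direct-mapped slot choice
        if (cachedIndex[slot] != i)
        {
            cachedValue[slot] = data[i];                // miss: one random access into the big array
            cachedIndex[slot] = i;
        }
        return cachedValue[slot];                       // hit: served from the small, hot buffer
    }

private:
    const std::vector<T>& data;
    std::size_t cachedIndex[Slots];
    T           cachedValue[Slots];
};
```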

This is applicable to n-dimensional arrays. The NxN blocks arrangement is cache-aware (it has to be tuned for a specific cache size); the space-filling curve approach is cache-oblivious (it works optimally, within a constant factor, for any cache size).
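
A minimal sketch of the space-filling curve idea (hypothetical example, not from the original notes): interleaving the bits of x and y gives the Morton (Z-order) index, which keeps 2D neighbours close in memory for any cache size:

```cpp
#include <cstdint>

// Spread the low 16 bits of v so there is a zero bit between each of them.
static uint32_t part1by1(uint32_t v)
{
    v &= 0x0000FFFF;
    v = (v | (v << 8)) & 0x00FF00FF;
    v = (v | (v << 4)) & 0x0F0F0F0F;
    v = (v | (v << 2)) & 0x33333333;
    v = (v | (v << 1)) & 0x55555555;
    return v;
}

inline uint32_t mortonIndex(uint32_t x, uint32_t y)
{
    return part1by1(x) | (part1by1(y) << 1);
}

// Usage: data[mortonIndex(x, y)] instead of data[y * width + x]
```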

In the end, most of the time, good design for performance equals good design, as the main thing we require in order to tune the code is ease of changing it. That’s way different from bad, premature optimization, which usually locks the code into a given “shape”. The main difference between generic design best practices and performance best practices is that, by being aware of some hardware details early on, it’s possible to nail a more optimal design from the start.

See:

http://www.multi.fi/~mbc/sources/fatmap.txt

http://en.wikipedia.org/wiki/Space-filling_curve

http://my.safaribooksonline.com/0201914654/ch14lev1sec6

The 80/20 rule works only if the code was written in a proper way. If you write code without any awareness of its performance you won’t find any significant hotspot to optimize, everything will be “bad”, especially in huge projects like ours!

Trivial functions should be inlined, otherwise the compiler can’t perform a huge number of optimizations, as it can’t know the implementation of a given function until link time (that can be avoided with bulk builds, or by enabling “link-time” or “whole-program” optimizations). Forcing complex functions to be inlined can lead to increased code size and thus decreased code cache efficiency. It should be done only in inner loops, probably unrolling them too, but only when tuning, and using a profiler to find the right inner loops to optimize!

The templates versus virtual function calls issue is nasty (dynamic versus static branching); it’s a design decision that’s hard to make early on, without profiling.
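
For illustration (a hypothetical sketch, not from the original notes), the same update loop written both ways; the virtual version dispatches at runtime per element, the template version is resolved statically and can be inlined:

```cpp
#include <vector>

struct IEntity
{
    virtual void update(float dt) = 0;
    virtual ~IEntity() {}
};

// Dynamic branching: one virtual call per element, hard for the compiler to inline.
void updateDynamic(std::vector<IEntity*>& entities, float dt)
{
    for (IEntity* e : entities) e->update(dt);
}

// Static branching: resolved at compile time, inlinable, but one instantiation per type.
template <typename Entity>
void updateStatic(std::vector<Entity>& entities, float dt)
{
    for (Entity& e : entities) e.update(dt);
}
```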

Using sized integers can be useful to save space and thus optimize for the cache, but this is something that can be done later, after profiling, without a big impact on the code design (if the proper getters and setters were created). The only thing that is worth doing early on is the use of bitfields to store multiple boolean values, as the standard bool type takes quite some space.
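
A minimal sketch of the bitfield idea (hypothetical example, not from the original notes):

```cpp
// Four flags packed into a single word, instead of four separate bools that
// would each take (at least) one byte plus padding.
struct EntityFlags
{
    unsigned visible     : 1;
    unsigned active      : 1;
    unsigned castsShadow : 1;
    unsigned isStatic    : 1;
};
```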

Usually static and global variables are slower to access than class members (which COULD live on the heap), which in turn are slower than local variables (which live on the stack; stack data is most probably in our caches).

See, in the Xbox 360 SDK documentation, the paper: “Xbox 360 CPU: Best Practices”.

Design your DATA first!