
Parallel Hashing, Compression and Encryption with OpenCL under OS X

Vasileios Xouris

Master of Science Computer Science School of Informatics University of Edinburgh 2010

Abstract
In this dissertation we examine the efficiency of GPUs with a limited number of stream processors (up to 32), as found in desktops and laptops, in the execution of algorithms for hashing (MD5, SHA1), encryption (Salsa20) and compression (LZ78). For the implementation, the OpenCL framework was used under OS X. The graphics cards tested were the NVIDIA GeForce 9400m and GeForce 9600m GT. We found an efficient block size for each algorithm that results in optimal GPU performance. The results show that encryption and hashing algorithms can be executed on these GPUs very efficiently and can replace or assist CPU computations. We achieved a throughput of 159 MB/s for Salsa20, 107.5 MB/s for MD5 and 123.5 MB/s for SHA1. Compression results showed a reduced compression ratio due to GPU memory limitations and reduced speed due to divergent code paths. The combined execution of encryption and compression on the GPU can improve execution times by reducing the latency caused by data transfers between CPU and GPU. In general, a GPU device with 32 stream processors can provide enough computation power to replace the CPU in the execution of data-parallel, computation-intensive algorithms.

Acknowledgements
I would like to thank my supervisor, Paul Anderson, for his invaluable help and guidance. I would also like to thank Dr. Zhang Le for his helpful remarks.

I would like to thank my family that always supports me in everything I do.

Finally, I would like to thank Stefania for being patient and supportive during this year.


Declaration
I declare that this thesis was composed by myself, that the work contained herein is my own except where explicitly stated otherwise in the text, and that this work has not been submitted for any other degree or professional qualification except as specified.

(Xouris Vasileios)


Table of Contents
Chapter 1  Introduction
Chapter 2  GPU and OpenCL
  2.1  GPU architecture
  2.2  Open Computing Language (OpenCL)
    2.2.1  Memory model
    2.2.2  Memory access patterns
    2.2.3  OpenCL execution model
Chapter 3  Encryption on GPU
  3.1  Background
  3.2  GPU advantages and disadvantages
  3.3  Relevant work
  3.4  Implementation of Salsa20 & Results
Chapter 4  Hashing on GPU
  4.1  Background
  4.2  GPU advantages and disadvantages
  4.3  Relevant work
  4.4  Implementation of MD5 and SHA1 & Results
Chapter 5  Compression on GPU
  5.1  Background
  5.2  GPU advantages and disadvantages
  5.3  Relevant Work
  5.4  Implementation of LZ78 & Results
Chapter 6  Putting it all together
Chapter 7  Discussion
  7.1  Project difficulties
  7.2  Future Work
Chapter 8  Conclusion
Bibliography

Chapter 1

Introduction

During the last few years, a lot of research has focused on efficient implementations of well-known algorithms optimized for execution on Graphics Processing Units (GPUs). GPUs offer an architecture that can exploit data parallelism very effectively. A mid-range GPU device can have around 64 stream processors, which provides considerable computation power. Entry-level and mid-range GPUs can be found in the laptops and desktops that are used every day. Of course, there are more specialized, high-end GPUs that offer far more stream processors and enormous computation power. In this dissertation, we investigate whether the GPUs located in ordinary desktops and laptops can efficiently execute computation-heavy operations such as hashing, encryption and compression. Until now, published work related to these operations has used powerful GPUs with hundreds of stream processors. The motivation for this dissertation was recent research on a fast and secure backup system for Mac laptops [1]. The main idea is to use the GPU for the computations included in a backup system, such as data hashing, encryption and compression. We implement specific algorithms for each field and examine whether we can get a speedup over a CPU implementation, or at least execution times that are acceptable and can assist the CPU where possible. For the testing of our implementations, entry-level and mid-range GPUs with up to 32 streaming processors are used, a number much smaller than that of high-end GPUs containing hundreds of streaming processors. GPUs of this kind are found in most laptops. The framework used to program our implementations is

the OpenCL (Open Computing Language) framework [16]. The advantage of OpenCL is that it gives the programmer the ability to control all available processing units in a system, including CPUs and GPUs. This dissertation is organized as follows. We start with a general background chapter on GPU architecture and a brief description of the OpenCL framework, its capabilities and restrictions. Then we examine the operations mentioned above (hashing, encryption and compression) one by one. For each of them, a brief background and relevant work on GPU implementations is given. We also discuss how each one fits the GPU architecture and mention its advantages and disadvantages. An implementation and results section with the outputs of our research is also provided for each case. After looking at each operation in isolation, we present some conclusions from the combined execution of encryption and compression on the GPU. In the final chapters we describe the difficulties that we faced during our research and implementation, and we propose some ideas for future work.

Chapter 2

GPU and OpenCL

2.1 GPU architecture


Graphics Processing Units (GPUs) are specialized processors originally designed to render 3-dimensional graphics. The main difference between the two kinds of processor is that a CPU core is designed to execute a single stream of instructions as fast as it can, while a GPU is designed to execute the same stream of instructions over multiple data in parallel. GPUs contain a number of streaming multiprocessors (SMs), and each SM contains 8 stream processor cores (SPs), 2 special function units, an instruction cache, a constant cache, a multithreaded instruction unit and a shared memory. GPUs have a parallel throughput architecture that allows many threads to be executed concurrently. They are designed to handle the complex computations of computer graphics quickly and efficiently, and they can operate on vectors of data very fast. Because of this nature, programmers started to use them to execute more general computation-intensive algorithms by taking advantage of data parallelism. With the introduction of frameworks such as OpenCL and CUDA, the development of GPU versions of general algorithms became easier. Until recently, the main problem of General-Purpose computing on Graphics Processing Units was that only floating point computations were supported inside pixel shaders. Fortunately, with the introduction of NVIDIA's G80 architecture, integer data types and bitwise operations are now available [21].

Figure 2.1.1 - CPU versus GPU design (source: [4])

In figure 2.1.1 we can see why GPUs are so powerful. GPUs sacrifice sophisticated control flow in order to fit many stream processors on chip. The cache memories are also much smaller, because GPUs hide memory latency by executing calculations while waiting for memory accesses, instead of relying on large caches.

2.2 Open Computing Language (OpenCL)


OpenCL was created originally by Apple Inc. and is now developed by the Khronos Group. OpenCL is a framework that allows applications to execute on the GPU device. In this section, we discuss how OpenCL maps to the GPU architecture. In OpenCL, a data-parallel function (kernel) is written in a specific language similar to C99. To create parallelism, OpenCL divides the total amount of work into workgroups, and each workgroup is further divided into work items (threads). Workgroups are executed on SMs, and each work item is executed by an SP. The total amount of work, called the N-D Range in OpenCL terminology, is a collection of workgroups that will be executed in parallel. The distribution of workgroups across the available SMs is handled dynamically by OpenCL itself. Threads are further organized by the SM into groups of 32 called warps, and all threads within a

warp are executed in parallel. When a warp is stalled for some reason, another warp is selected for execution in order to hide latency. Because of the SIMT (Single Instruction, Multiple Thread) nature of the SPs, all threads within a warp must execute the same instruction in order to take full advantage of parallelism [4]. GPUs also handle memory latency by switching between workgroups: thousands of threads are ready for execution at any time, and every time a group of threads needs to read data from memory, another group immediately takes its place. Unlike CPUs, where the cost of switching between threads is hundreds of clock cycles, on the GPU there is virtually no cost at all; GPU threads are very lightweight.

2.2.1 Memory model


There are several different memory spaces in the OpenCL architecture. A diagram of the memory spaces described in this section can be found in figure 2.2.1.1. The main and biggest memory in the GPU architecture is the global memory, which is off-chip and usually ranges from 128 MB up to several GB for high-end GPUs. Accessing global memory requires 400 to 600 cycles, which is why we must be careful when accessing it. Global memory can be accessed by all work items of all workgroups. A region of global memory is reserved to be used as read-only memory and is called constant memory. In contrast, OpenCL's local memory (referred to as shared memory in the CUDA world) is on-chip, which makes it extremely fast: accessing it usually takes 4 to 6 cycles. However, the size of this shared memory is very small, usually 16 KB, so it should be used to store data that are frequently read and updated during computations. The shared memory can be accessed by all work items of a workgroup, so it can also be used for communication between work items of the same workgroup. This feature is ideal when data needs to be shared among work items. Another useful memory space is the constant cache, a read-only memory which is usually 64 KB and is located on-chip. When many threads within a workgroup read the same constant cache address it takes just one transaction, while reads of different addresses are serialized. This memory space is used to speed up

reads from the constant memory by caching frequently used data. A similar cache, the texture cache, is also available and is used to speed up reads of image objects. Finally, the private memory (registers) is the fastest memory; it is distributed privately among the work items of a workgroup by the SM. The total register file is limited for each multiprocessor, between 8192 and 16384 32-bit registers (32 KB to 64 KB), and is partitioned among threads. If a workgroup needs more registers than are available, there will be a performance problem known as register pressure. Registers are the best place to store small amounts of data that need to be used frequently [12]. A short kernel sketch showing how local memory is typically used appears after figure 2.2.1.1.

Figure 2.2.1.1 - The different memory spaces of GPU (source: [4])
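As an illustration of the trade-off between global and local memory, the following hedged kernel sketch stages data into on-chip local memory so that the whole workgroup can reuse it; the kernel and buffer names are illustrative, not part of our implementation:

__kernel void stage_local(__global const float *in,
                          __global float *out,
                          __local float *tile)
{
    uint lid = get_local_id(0);
    uint gid = get_global_id(0);

    tile[lid] = in[gid];              // one coalesced global read per work item
    barrier(CLK_LOCAL_MEM_FENCE);     // wait until the whole workgroup has loaded

    // The workgroup can now reuse `tile` at on-chip latency (roughly 4-6 cycles).
    out[gid] = tile[lid] * 2.0f;
}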

2.2.2 Memory access patterns


The way a group of threads accesses the global GPU memory is very important. As mentioned before, each transaction with global memory can take 400 to 600 cycles, so it is important to group the memory transactions requested by different work items. GPU devices are capable of reading 4, 8 or 16 bytes in a single transaction, with the restriction that the data must be aligned to a multiple of the element size being read: data of type X must be stored at an address that is a multiple of sizeof(X) [4]. Half warps (groups of 16 threads) that are executed in parallel can be programmed to read global memory in a coalesced way. This happens when all threads of the half warp access different elements in an aligned segment of global memory (4-, 8- or 16-byte words), which can result in a single 64-byte transaction, a single 128-byte transaction or two 128-byte transactions. For NVIDIA GPU devices of compute capability 1.0 or 1.1, the accesses must also be ordered, so that threads access the elements of the segment in sequence: the first thread of the half warp must access the first element of the segment, the second thread the second element, and so on. For GPU devices of compute capability 1.2 or higher this restriction does not apply; threads within a half warp can access addresses within a segment in any order and still produce a single transaction. Accessing the shared memory requires slightly different behavior in order to achieve high bandwidth. Shared memory is split into 16 memory banks, and to achieve a single transaction each thread of a half warp needs to access a different memory bank. If two or more threads of a half warp request a transaction with the same memory bank, those accesses take place sequentially. Only when all threads of a half warp read the same address do we get a broadcast that takes place in just one transaction. At this point we should note that the graphic cards used for this dissertation (NVIDIA GeForce 9400m, NVIDIA GeForce 9600m GT) have a compute capability lower than 1.2. A small kernel sketch contrasting coalesced and strided access appears below.
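The following is a minimal, hedged sketch of the two access patterns (names are illustrative); the strided version is what we want to avoid:

__kernel void coalesced_copy(__global const float4 *src, __global float4 *dst)
{
    // Consecutive work items read consecutive 16-byte elements, so a half
    // warp touches one aligned segment and the reads coalesce into a
    // single transaction.
    size_t i = get_global_id(0);
    dst[i] = src[i];
}

__kernel void strided_copy(__global const float4 *src, __global float4 *dst,
                           uint stride)
{
    // With stride > 1, the work items of a half warp hit different segments,
    // so the reads are serialized into multiple transactions.
    size_t i = get_global_id(0);
    dst[i] = src[i * stride];
}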

2.2.3 OpenCL execution model


The OpenCL framework is responsible for the optimal execution of a data-parallel algorithm. The amount of work that needs to be executed is called the NDRange in OpenCL terminology. The NDRange is a grid of thread blocks (workgroups), and each workgroup contains a number of work items (threads) which are executed in parallel. The OpenCL framework discovers how many SMs are available on the current GPU and assigns workgroups to all available SMs, where they execute in turns, so algorithms can scale to a large number of SMs without problems. The NDRange has to be large enough, because the bigger it is, the easier it is to hide memory latency. Each SM executes one warp at a time in parallel, and all workgroups are divided into warps for execution. To keep track of different workgroups and work items during execution, each workgroup has a unique group id, and each work item has:

- a unique local id that distinguishes the current work item from the other work items of the same workgroup;
- a unique global id that distinguishes the current work item from all other work items in the NDRange.

A warp consists of threads with consecutive local and global ids. A representation of the NDRange appears in figure 2.2.3.1, and a short host-side sketch of launching an NDRange follows it.

Figure 2.2.3.1 - Representation of the NDrange (grid) of OpenCL (source: [4])
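On the host side, the NDRange and workgroup sizes are chosen when the kernel is enqueued. A hedged sketch using the standard OpenCL C API, assuming `queue` and `kernel` have been created during the usual platform setup (the sizes shown are illustrative):

size_t global_size = 65536;   /* total work items in the NDRange */
size_t local_size  = 128;     /* work items per workgroup        */

cl_int err = clEnqueueNDRangeKernel(queue, kernel,
                                    1,             /* one-dimensional NDRange */
                                    NULL,          /* no global offset        */
                                    &global_size,  /* NDRange size            */
                                    &local_size,   /* workgroup size          */
                                    0, NULL, NULL);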

Chapter 3

Encryption on GPU

Traditionally, since the appearance of General-Purpose computing on Graphics Processing Units (GPGPU), GPUs were mostly used for algorithms with a lot of computations on floating point data. Until recently, there was no integer support on the GPU, which made encryption algorithms very bad candidates for GPU execution, since these algorithms consist of complex operations on integer data types. In the last few years this is no longer a problem, and with the introduction of the G80 architecture, encryption algorithms were ready to be put to the test on GPUs [21].

3.1 Background
Encryption algorithms are used when there is a need to transfer a message through an unsafe communication channel. The output of the encryption process is an encrypted message, usually of the same size as the input, called the ciphertext. There are two kinds of encryption: symmetric and asymmetric. In symmetric encryption, a secret key that both communicating sides possess is used for encryption and decryption of the data. In asymmetric encryption, each user possesses a secret and a public key. If user A wants to transfer a message to user B, A uses B's public key to encrypt the message, and user B can then decrypt it with his own secret key that only he knows. In general, encryption algorithms break the original message into blocks of equal size and process them through a function that applies some bitwise operations to them. This function is usually repeated several times (rounds) on each block of data. The problem here is that if each block is encrypted independently of the others


with the same key, then the ciphertext of each block will always be the same, and this may be a serious security problem since it can lead to replay attacks: someone might reuse the same encrypted message and claim to be someone he isn't, or request a valid operation using the same valid encrypted message. Malicious users can also use a large number of blocks that were encrypted with the same key to find patterns that reveal information about the original message. For this reason, block ciphers have different modes of operation. A block cipher mode of operation has the responsibility of mixing each block's ciphertext with some kind of information in order to prevent replay attacks and keep encrypted data consistent. For example, Cipher Block Chaining (CBC) XORs the ciphertext of the previous block with the next block's plaintext (the first block is XORed with a nonce) and then starts the encryption process. The problem with CBC and similar modes of operation is that the original message must be processed sequentially. Fortunately, there exists a mode of operation that allows us to take advantage of data parallelism in encryption algorithms: Counter Mode (CTR). CTR uses a nonce, an initialization value that is different for each execution of the encryption algorithm, and a counter; it combines them in some way (usually by using XOR) and then encrypts the result using a secret key. The output of the encryption process (the keystream) is then XORed with the original message block, and the result of this operation is the ciphertext. So in this mode we are not actually encrypting the message; we add to it the noise that comes out of the encryption of the counter and nonce. The counter is simply a variable that is guaranteed to be unique for a large number of blocks, so the most popular option is an actual counter that starts from 0 and increases by 1 for each block. The nonce adds randomness to the output of the XOR operation with the counter and prevents replay attacks; it must be unique for each encryption process. For the decryption of encrypted data, the key and the nonce must be known. Every encryption algorithm that wants to operate in parallel, on multiple blocks at the same time, has to operate in CTR mode. So the information needed for parallel execution is the block number, the nonce, the key and the block of data. CTR is the mode that we will use for our implementation. A minimal sketch of the idea appears below, followed by a demonstration of CTR mode in figure 3.1.1.
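The following is a minimal C sketch of CTR-mode encryption. The function keystream_block is a hypothetical stand-in for the cipher core (for us, Salsa20) that expands (key, nonce, counter) into a 64-byte keystream block; its body is omitted:

#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the cipher core (e.g. Salsa20): expands
   (key, nonce, counter) into one 64-byte keystream block. */
void keystream_block(const uint8_t key[32], const uint8_t nonce[8],
                     uint64_t counter, uint8_t out[64]);

/* CTR mode: block i's keystream depends only on (key, nonce, i),
   so all blocks can be generated independently and in parallel. */
void ctr_encrypt(const uint8_t key[32], const uint8_t nonce[8],
                 const uint8_t *in, uint8_t *out, size_t len)
{
    uint8_t ks[64];
    for (size_t i = 0; i < len; i += 64) {
        keystream_block(key, nonce, i / 64, ks);    /* counter = block index */
        size_t n = (len - i < 64) ? (len - i) : 64;
        for (size_t j = 0; j < n; j++)
            out[i + j] = in[i + j] ^ ks[j];         /* ciphertext = plaintext XOR keystream */
    }
}

Decryption is the same operation: XORing the same keystream a second time recovers the plaintext.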


Figure 3.1.1 - The CTR mode that can process blocks in parallel for encryption (source: [9])

3.2 GPU advantages and disadvantages


First of all, we need to present the advantages and disadvantages of GPU implementations of encryption algorithms when compared to the CPU. The main disadvantage of a GPU implementation is that keystream data need to be repeatedly transferred from the GPU device to the host. In order to get good results, we need to make sure that the communication bandwidth over the PCI Express bus between the two devices is sufficient. The transfer operation is the bottleneck of many GPU algorithms because it is very costly. The initialization latency of a transfer is usually small, and the general trend is that the transfer time grows linearly with the size of the data. Moving data in very large amounts therefore has no real benefit, and it is also not possible because of the limited memory on GPUs [15]. Transferring data in very small amounts is also inefficient, because of the initialization latency mentioned above. Fortunately, when the encryption algorithm is executed in CTR mode, the only things that we have to transfer from the host to the GPU are the secret key, the nonce and a counter offset. This is because in CTR mode we do not encrypt the original message but the combination of the nonce with the counter offset. So the time needed to transfer this kind of data is insignificant. We


just need to transfer back to the host an encrypted sequence (keystream) for each block, which will then be combined with the original text on the CPU, usually by using XOR. GPUs are designed for fast, parallel operations on vectors of floating point data, and this is where they are really unbeatable. With the introduction of GPGPU-capable GPUs, the benefits of graphics operations could be applied to more general computations, including integer operations. The raw computation power of GPUs is far greater than that of CPUs. For example, the NVIDIA GeForce 9400m graphics card, which is found in most Mac Minis, delivers 54 GFLOPS (billions of floating point operations per second), while the Intel Core 2 Duo processor that accompanies it delivers about 25 GFLOPS. GPUs clearly outperform CPUs in computation power, and this is their biggest advantage. Another advantage is that encryption algorithms are very straightforward: they do not contain branches, and this makes them ideal for execution on GPU devices. As mentioned in previous chapters, all threads that are executed in parallel within a GPU workgroup must execute the same instruction in order to take full advantage of the parallelism offered. Since encryption algorithms do not contain branches, we can rest assured that at any given time all threads execute the same instruction, just on different data, so no thread will need to wait for other threads to finish executing a different code path.
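As an illustration of this branchless structure, the quarter-round of Salsa20, the cipher we implement in section 3.4, consists purely of 32-bit additions, XORs and rotations. A sketch following Bernstein's specification [5]:

#include <stdint.h>

/* Rotate a 32-bit word left by n bits. */
#define ROTL32(v, n) (((v) << (n)) | ((v) >> (32 - (n))))

/* Salsa20 quarter-round: only 32-bit addition, XOR and rotation,
   with no data-dependent branches. */
static void quarterround(uint32_t *y0, uint32_t *y1, uint32_t *y2, uint32_t *y3)
{
    *y1 ^= ROTL32(*y0 + *y3, 7);
    *y2 ^= ROTL32(*y1 + *y0, 9);
    *y3 ^= ROTL32(*y2 + *y1, 13);
    *y0 ^= ROTL32(*y3 + *y2, 18);
}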

3.3 Relevant work


Several encryption algorithms have been tested on GPUs in recent years, with various speedups. The results of these studies are very encouraging, and the GPU seems to be an ideal platform for the execution of encryption algorithms. Before the appearance of OpenCL and CUDA, the traditional OpenGL graphics pipeline was used to take advantage of GPU computation power. With the introduction of these frameworks things became easier: the GPU can now be treated as a device similar to the CPU, and through frameworks such as OpenCL and CUDA developers can distribute the encryption process without needing to know low-level graphics details. We will look briefly at some traditional graphics


pipeline implementations, and at some CUDA/OpenCL implementations in more detail, because the latter is our approach in this dissertation. Most GPU encryption implementations choose the AES algorithm. In [10], the Advanced Encryption Standard (AES) is implemented and tested on a GeForce 7900 GT, which results in a 5-6x speedup over a CPU implementation running on an Intel Core 2 Duo (1.86 GHz). The encryption rate achieved was 12 Mb/s. In [14], an AES encryption implementation is created by using the graphics pipeline and the Raster Operations Unit (ROP), which results in 108.86 Mb/s. Because fragment processors lacked XOR support on pre-DirectX10 hardware, the XOR operation in this case takes place in the ROP. These implementations follow the traditional way of programming the graphics pipeline and use the vertex and fragment processors for parallel computations. Data are passed as texture elements to each fragment processor for independent execution, and the results are stored to the screen frame buffer or to other textures. The OpenGL API is used for these operations. In [11], the Compute Unified Device Architecture (CUDA) of NVIDIA is used to create an implementation of AES-256 that gives a peak performance of 8.28 Gbit/s on a GeForce 8800 GTX, which contains 128 stream processors. They identified the bottleneck of their implementation to be the transfer of data between the GPU and the host device, due to the limited bandwidth of PCI Express. They chose a block size of 1024 bytes, and each processed block is loaded into shared memory for further parallel processing. Their implementation also appears to be faster when a large number of blocks is transferred to the GPU each time. A very similar AES implementation approach appears in [17], but this time the OpenCL framework is used. In that work a GeForce 8600 GT and an ATI FireStream 9270 (800 stream cores) are used, and the results show a speedup by a factor of 11 over a sequential implementation on a dual-core Intel E8500. In this dissertation, an algorithm with the same purpose as AES, Salsa20, is going to be implemented. Unfortunately, there are no relevant academic papers on GPU implementations of the Salsa20 algorithm, but the work presented in this section can be used as a starting point.


3.4 Implementation of Salsa20 & Results


For the purposes of this dissertation, we decided to use the Salsa20 encryption algorithm in CTR mode, optimized for execution on the GPU [5]. Salsa20 is a stream cipher developed by Daniel Bernstein. The reason we decided to implement Salsa20 is that it is faster than AES: in fact, the 20 rounds of Salsa20 are faster than the 14 rounds of AES. For example, Salsa20 requires 3.93 cycles/byte while AES requires 9.2 cycles/byte at its best reported performance [27]. This makes Salsa20 ideal for systems that require a high throughput, like backup systems. Salsa20 is also a stream cipher, which means that it produces encrypted output of exactly the same size as the input. AES is a block cipher, meaning that the input size must be a multiple of the block size; to satisfy this condition, we usually need to add padding to the last block. This can be a problem, because the encrypted output will be slightly bigger than the input, which can cause complications in systems that need to process a large number of files (like backup systems). Salsa20's basic operations are XOR, rotation and 32-bit addition. In order to operate, it needs a 128- or 256-bit key, a 64-bit initialization vector and a 64-bit counter, and it consists of 20 rounds of mixing operations. It can operate on different 64-byte blocks in parallel while running in the CTR mode described in the previous section. This feature, in addition to the fact that it contains a lot of arithmetic and bitwise operations and no branches, makes it ideal for execution on the GPU. The first step that we need to take is to split the Salsa20 code in two parts. The first part is the encryption process that creates a block keystream, and the second part is the actual mixing of the keystream with the original data (by using XOR). The keystream is independent of the original data, so we can calculate the keystream on the GPU and then transfer it back to the host in order to XOR it there. In this way we can keep the original data in CPU memory and reduce the transfer time by only transferring the generated keystream back to the host device. Because each work item needs to know its counter number, the designed kernel, apart from the nonce and the secret key, also takes as


parameters the following:

- A byte offset, which records how many bytes have been processed so far.
- A block size, which specifies how many bytes each work item is responsible for, so that it can create an appropriately sized keystream.
- The total number of bytes transferred to the GPU this time. This information is used when the total work is not divided exactly by the block size, in which case the last work item needs to produce a smaller keystream.

When a work item wants to calculate its block number, it first calculates its position among all work items of all workgroups, and then, taking the block size into account, it derives its block number by adding the block offset. A demonstration of this method appears below:

uint myGroupId = get_group_id(0);   // index of this workgroup
uint myLocalId = get_local_id(0);   // index of this work item within its workgroup
uint gsize = get_global_size(0);    // total number of work items (unused here)
uint lsize = get_local_size(0);     // work items per workgroup

// blocksize, totalbytes and bytesOffset are kernel arguments (see above).
uint groupBlockSize = lsize * blocksize;                        // bytes handled by one workgroup
uint from = myGroupId * groupBlockSize + myLocalId * blocksize; // first byte of this work item
uint to = from + blocksize - 1;                                 // last byte of this work item

if (from >= totalbytes) return;             // nothing left for this work item
if (to >= totalbytes) to = totalbytes - 1;  // clamp the final, partial block

// Counter offset: number of 64-byte Salsa20 blocks that precede this work item.
ulong myBlock = (bytesOffset / 64)
              + ((myGroupId * groupBlockSize) / 64)
              + ((myLocalId * blocksize) / 64);

Figure 3.4.1 - The calculation of the counter offset (myBlock) for each work item


The nonce, byte offset and block size are passed in global memory and are used by all work items. All work items of a workgroup can read the nonce from the same memory address, which results in just one transaction. Based on previous work on other encryption algorithms, which found a relatively small optimal block size of 1024 bytes [11], we chose to process data through registers and not shared memory for better performance. The results are written to global memory in chunks of 16 bytes (128 bits). A very important issue is that we need to determine an optimal block size; by block size we mean the amount of data that is distributed to each work item. A large block size will not cause private memory problems, since the keystream is generated in blocks of fixed size (64 bytes), written to global memory, and the same private memory is reused to generate the next keystream block. A large block size will, however, cause global memory problems because of the amount of generated keystream data. We need to try different block sizes and find the optimal one for this method. We should note here that the optimal block size also depends on the hardware, so it has to be decided at runtime after querying the GPU device for the maximum number of work items within a workgroup; for different GPUs this value may vary, but not significantly. For example, suppose that our GPU device supports a total of X parallel threads and we choose a block size of Y: this is a data size that can be executed in parallel. To hide latency, we need to pass to the GPU a multiple Z of this size. The product X*Y*Z must be less than the total amount of available GPU memory, or the allocation will fail. Finally, for the transfer of data between the host device and GPU global memory, pinned memory was used. Pinned memory can provide higher transfer rates between the two devices, which can reach 5 GB/s on PCIe x16 Gen2 cards [12]. To test our implementation of Salsa20 we used two different graphic cards, the NVIDIA GeForce 9400m and the GeForce 9600m GT. We should note that these cards can be found in the laptops and desktops of ordinary users. The results from these two cards were compared to a single-threaded and a multithreaded implementation on an Intel Core 2 Duo at 2.26 GHz. The specifications of the GeForce 9400m and 9600m GT appear below:


Model                    GeForce 9400m    GeForce 9600m GT
Streaming Processors     16               32
Memory                   256 MB           256 MB
Clock                    1100 MHz         1250 MHz

Table 3.4.2 - Technical specifications of the graphic cards used for testing

In the next figure we present the resulting times for the encryption of a 200 MB file. The times shown include tests that used different block sizes; the block size refers to the amount of data given to each work item for processing. Execution times were measured as the average of 10 executions. We tried different block sizes in the range of 64 bytes to 16 KB; bigger block sizes were not tested because of the limited GPU memory. In the next figure, and generally in all figures from now on, we will use the following abbreviations:

- GF 9600m GT - execution on the NVIDIA GeForce 9600m GT using OpenCL.
- GF 9400m - execution on the NVIDIA GeForce 9400m using OpenCL.
- CPU 1-thread - sequential execution of the algorithm on the CPU (Intel Core 2 Duo).
- CPU OpenCL - multi-threaded parallel execution on the CPU (Intel Core 2 Duo) using the OpenCL framework. As mentioned before, OpenCL can handle parallel execution on heterogeneous devices, including CPUs, by distributing work to the available cores. So, by using OpenCL for CPU execution we can take advantage of all available CPU cores and obtain maximum CPU performance.

[Chart omitted: Time (ms) versus Block Size (bytes) for GF 9600m GT, GF 9400m, CPU OpenCL and CPU 1-thread.]

Figure 3.4.3 - Execution times of Salsa20 for all devices using different block sizes

The results we obtained are very interesting. The performance of both GPU devices is better than that of the single-threaded CPU implementation, and the execution times of the 32 streaming processors of the 9600m GT can compete with those of the multithreaded CPU implementation. The important point in this graph is that GPU performance is maximized for relatively small block sizes between 64 and 256 bytes, but it remains acceptable for sizes up to 2048 bytes. For very large block sizes the performance drops considerably. The main reason is that with large block sizes each thread has to write more data to global memory, and it is more difficult to hide memory latency. By using smaller block sizes we can exploit data parallelism between work items more easily and make sure that we are not losing performance due to memory latency. The CPU implementations do not seem to be much affected by the block size. The best execution time of the 9600m GT is very close to the corresponding multithreaded CPU time. Finally, the best throughput achieved by a GPU implementation was that of the 9600m GT, equal to 159 MBytes/s. The corresponding value for the multithreaded CPU implementation was 180 MBytes/s, for the single-threaded CPU implementation 49 MBytes/s, and for the 9400m 93 MBytes/s. The throughput of the 9600m GT is almost double that of the 9400m, because the 9600m GT contains twice as many streaming processors.


Device               Throughput (MBytes/s)
GeForce 9400m        93
GeForce 9600m GT     159
CPU single thread    49
CPU OpenCL           180

Table 3.4.4 - Throughput measurements of the execution of Salsa20 on different devices

In general, our results show that GPUs with a small number of streaming processors can be used effectively to achieve a high throughput for the Salsa20 algorithm. The more stream processors we have available, the better the throughput we can achieve, and the results suggest that with more than 32 stream processors we could achieve even better times on the GPU. Finally, the results cannot be compared to related work, for two reasons: first, there is no published relevant work on the Salsa20 algorithm on the GPU, and second, the related work in this field uses GPUs with huge computation power and hundreds of cores, whereas we used GPUs with up to 32 stream processors. For example, in [11] a throughput of 1035 MB/s is achieved for AES-256 using a GPU with 128 stream processors; our best GPU used 32 stream processors for Salsa20 and achieved 159 MB/s.


Chapter 4

Hashing on GPU

4.1 Background
Hashing algorithms create a fixed-size data sequence from a variable-size data sequence. In this section, we deal with hashing algorithms that are used to compute a message digest (fingerprint) of data sequences. The main characteristics of hashing algorithms are that they can compute a fingerprint of a large data sequence quickly; that the reverse procedure is computationally infeasible; and that getting the same fingerprint from two different inputs is extremely unlikely. These algorithms can help us identify whether a transmission error or some other malfunction resulted in the alteration of the original data. For example, the digest of a file can be generated at some point; when someone else wants to copy or download this file, he can check whether the downloaded file has the same checksum as the original. If not, he knows that there was an error during transmission and he can try again. The digest doesn't have to be generated for a whole file; we can instead create and check the digests of different blocks of transmitted data. Another important point is that algorithms of this kind are not parallelizable. The reason is that in order to compute the message digest of a file, we need to process all data of the file through the hashing algorithm sequentially. So we are not allowed to split the file into blocks and process them independently in parallel. That would only work if we kept the digest of each processed block of data, which could result in a lot of disk space occupied by


checksums of different blocks of the same file, instead of a single fixed-size digest for the whole file. Of course, this attribute of hashing algorithms is desirable, because files with the same content in a different order must generate different digests; every block of data processed must take into account the output of the previous blocks of the same data stream. In general, the high-level structure of a hashing algorithm has this form:

1. Initialize the digest variables.
2. Read the next block of the data stream (fixed size, usually 512 bits).
3. Apply the hashing function to this block (which modifies the digest variables).
4. If there are more blocks to process from the same stream, go to step 2.
5. Output the digest variables (fixed size).
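In code form, this sequential dependence looks like the following hedged C sketch; digest_init, digest_block and digest_final are illustrative stand-ins for the algorithm-specific parts of MD5 or SHA1, and their bodies are omitted:

#include <stddef.h>
#include <stdint.h>

typedef struct { uint32_t state[5]; } digest_ctx;

void digest_init(digest_ctx *ctx);                            /* step 1 */
void digest_block(digest_ctx *ctx, const uint8_t block[64]);  /* step 3 */
void digest_final(const digest_ctx *ctx, uint8_t *out);       /* step 5 */

void hash_stream(const uint8_t *data, size_t nblocks, uint8_t *out)
{
    digest_ctx ctx;
    digest_init(&ctx);
    /* Steps 2-4: each 512-bit block updates the digest state, so the
       blocks of one stream cannot be processed in parallel. */
    for (size_t i = 0; i < nblocks; i++)
        digest_block(&ctx, data + i * 64);
    digest_final(&ctx, out);
}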

It is easy to see that we cannot parallelize this loop over different blocks, because each new block must use the modified variables of the previous block. So how can we take advantage of the parallel nature of GPUs in order to compute digests faster? There are two main approaches. The first is to give each GPU thread a different block of the same data stream in parallel and keep a digest for each of these blocks for later reference; as mentioned before, however, this would require a lot of extra disk space to store all the computed digests. The second is to use many independent data streams and let the GPU process one block from each data stream in parallel; at the next step, the next block from each data stream is processed. In this way, all blocks that depend on each other are processed sequentially, but at the same time we take advantage of GPU parallelism. Of course, this approach requires a large number of different data streams that can be processed in parallel, and in fact this number must be much larger than the maximum number of concurrent threads on the GPU device in order to help hide memory latency.
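A hedged OpenCL sketch of the second approach: each work item advances the digest state of its own stream by one block per kernel launch. The digest_ctx type and digest_block function are assumptions standing in for the real MD5/SHA1 state and compression function, whose bodies are omitted:

/* Assumed per-stream digest state (e.g. 5 words for SHA1). */
typedef struct { uint state[5]; } digest_ctx;

/* Assumed compression function; the real MD5/SHA1 body is omitted. */
void digest_block(digest_ctx *ctx, const __global uchar *block);

__kernel void hash_step(__global digest_ctx *contexts,
                        __global const uchar *blocks)
{
    uint gid = get_global_id(0);

    digest_ctx ctx = contexts[gid];          // load this stream's state
    digest_block(&ctx, blocks + gid * 64);   // process one 512-bit block
    contexts[gid] = ctx;                     // store the updated state back
}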


4.2 GPU advantages and disadvantages


For the purposes of this project, the MD5 and SHA1 algorithms [7][8] were chosen for testing on the GPU. The structure of these algorithms is similar to the one described in the encryption chapter: MD5 and SHA1 do not contain branches, and they are based on arithmetic and bitwise operations such as XOR, AND, OR, NOT, left bit rotation, right shifting and addition modulo 2^32. Another very important advantage of

hashing algorithms is that the output for a large data sequence is very small (128 to 512 bits, depending on the algorithm). This minimizes the time needed to transfer the results from the GPU device back to the host. In fact, the MD5 algorithm produces a 128-bit digest and the SHA1 algorithm a 160-bit digest. We already know that data transfers to and from the GPU device can be a bottleneck, but in the case of hashing we do not have to worry much about moving data back to the host, because the output for each block has a small fixed size. The biggest disadvantage of hashing algorithms is their sequential nature, which does not allow us to operate on different blocks of the same data stream in parallel; we can, however, operate on different data streams in parallel. An additional disadvantage is the large number of blocks that we need to transfer to the GPU: although we do not need to transfer back a lot of information, the amount of data transferred to the GPU can still be a bottleneck.

4.3 Relevant work


A lot of the work on GPU hashing in industry has focused on cracking digests. Many programs available on the Internet are able to use the available GPU devices of a system in order to crack MD5 and SHA1 password digests. A digest cracker tries to find a data sequence that results in a given digest when processed through a specific hashing function. The way to do this is to calculate the digests of many relatively small data sequences until one matches the given digest. This is the approach discussed in the previous section, which processes many different data streams in parallel. The most well-


known program available is the Lightning Hash Cracker by Elcomsoft, reaching a brute-force peak performance of 608 million passwords per second on a GeForce 9800 GX2 (2 x 128 stream processors) [19]. In the academic literature, there is a limited number of published papers on MD5 or SHA1 hashing on the GPU; most academic work so far on algorithms such as MD5 and SHA1 has followed an FPGA-based approach. In [20], there is a detailed implementation of the MD5 algorithm on the GPU which computes MD5 digests of small, equally sized blocks of data in parallel. Again, the main bottleneck of the implementation appears to be the small bandwidth of PCI Express compared to the computation power of the GPU device. Each thread is assigned a 512-bit space of shared memory that it uses to store each processed chunk of data for further processing. The main limitation of this approach is that, due to the limited shared memory (16 KB), the implementation can only be tested with workgroups of fewer than 256 work items; a bigger number of work items would require a bigger shared memory. The results of this work show a peak performance of 1400 Mbps for a large input size using an NVIDIA GeForce 9800 GTX+ (128 stream processors). Other implementations [18] use the constant memory, which can be fast because of the constant memory cache located on-chip. In that paper the SHA-1 algorithm is implemented on the GPU and achieves a rate of 2.5 GB/s on an NVIDIA GeForce 9800 GTX+ (128 stream processors).

4.4 Implementation of MD5 and SHA1 & Results


For the MD5 algorithm, the "RSA Data Security, Inc. MD5 Message Digest Algorithm" [7] was used as a starting point. Some modifications were needed in the code in order to compile for execution on the GPU. These modifications included the removal of not supported code and also a duplication of the Md5Update function so that it can support pointer parameters that refer to different address spaces (vector variables in registers and GPU global memory). For the SHA1 algorithm, a simple implementation was used that can be found in [26]. For both algorithms a similar approach is used. Data are passed to the GPUs global memory in large blocks. Then the hardware scheduler of the GPU creates


workgroups according to the given parameters. Each work item of a workgroup can identify its position in a similar way as in the encryption implementation described in the previous chapter. A large file of 200 MB was used to run the tests and to simulate parallel operation on multiple data streams. The modified code was compiled as an OpenCL kernel. We decided to use registers for the processing of our data. We knew from the beginning that this would force us to use small block sizes, but the parallel nature of the GPU can support this decision, and by using registers we are sure that we will have very low latency when reading our data. Each thread reads its assigned block in small pieces that it processes sequentially. The size we chose for these pieces was 16 bytes, because with this size we can use the built-in OpenCL vector type char16 and achieve aligned access to global memory. The same vector type was used when storing the calculated digest back to global memory: the digest of MD5 is exactly 16 bytes (128 bits), and the digest of SHA1 is 20 bytes (160 bits). Again, pinned memory was used for the transfer of data between the host device and GPU global memory, just like in the encryption implementation, since pinned memory can provide higher bandwidth. For the testing procedure, we used the same graphic cards and CPU as in the encryption chapter (GeForce 9400m, GeForce 9600m GT, Intel Core 2 Duo at 2.26 GHz); for specifications please refer to table 3.4.2. In the next figures we present the resulting times for the MD5 and SHA1 hashing of a 200 MB file. Please note that in both the GPU and the CPU implementations, each block of data was treated as a separate data stream in order to simulate an environment with multiple independent data streams. The times shown include tests that used different block sizes; the block size refers to the amount of data given to each work item for the calculation of an independent MD5 hash. The execution times were obtained as the average of 10 executions.


[Chart omitted: Time (ms) versus Block Size (bytes) for GF 9600m GT, GF 9400m, CPU OpenCL and CPU 1-thread.]

Figure 4.4.1 - Execution times of MD5 for all devices using different block sizes

In figure 4.4.1, the results of the MD5 algorithm are presented. We can see that for small block sizes, the single-threaded CPU implementation appears to be faster than the 9400m GPU. As the block size grows, the 9400m takes a significant lead over the single-threaded CPU implementation, although beyond a 4 KByte block size there is no significant further improvement for the 9400m. The CPU execution times appear to be almost independent of the block size: neither the single-threaded nor the OpenCL CPU implementation is much affected by it. The performance of the 9600m GT is almost 2 times better than that of the 9400m; this big difference comes from the difference in the number of stream processors (16 vs 32) and in clock frequency. Both GPU implementations are faster than the single-threaded CPU one. The multithreaded CPU implementation is the fastest, but a more powerful GPU with more stream processors should yield a speedup. From figure 4.4.1 we conclude that an optimal block size for each work item in the MD5 GPU implementation is between 1024 and 4096 bytes. Very small block sizes are not good for GPU implementations, because more and more work items then require transactions with the global memory in order to read their data. In that case, hiding latency is not very efficient, because of the small number of computations that each work item performs compared to the amount of data that is read from and written back to global memory. For example, a block size of


8192 bytes requires a single 128-bit write transaction for the whole 8192-byte block, while a block size of 64 bytes requires 128 times as many such transactions for the same amount of data. The difference between the hashing and the encryption implementation discussed in the previous chapter is that here each work item also needs to read its data from global memory, and this appears to be the bottleneck. In table 4.4.2, the achieved throughput is shown in MBytes/s. The maximum GPU throughput of 107.5 MBytes/s was observed with the GeForce 9600m GT.

Device               Throughput (MBytes/s)
GeForce 9400m        57.2
GeForce 9600m GT     107.5
CPU single thread    48.8
CPU OpenCL           190.5

Table 4.4.2 - Throughput measurements of the execution of MD5 on different devices

To conclude the MD5 section, we can say with confidence that GPU devices with a small number of stream processors, available in most desktops and laptops, can be used for MD5 computations efficiently, and can also be used in co-operation with the CPU for maximum results. At least 32 stream processors are desirable in order to achieve good performance. In figure 4.4.3 and table 4.4.4 below we present the results of the SHA1 implementation. The results are quite similar to those of MD5, which is natural since SHA1 is based on the principles of MD5, and the analysis is also similar. The general trend is that as the block size grows the execution times improve, but beyond a block size of 512 bytes there is no significant improvement. Again, the multithreaded CPU implementation is the fastest, but execution times on GPU devices improve as the number of stream processors grows (16 vs 32 stream processors for the 9400m and 9600m GT respectively). So a GPU device with 32 or more stream processors can really assist or


replace the CPU in SHA1 hashing computations; the 32 stream processors of the 9600m GT seem to be enough to replace the CPU in the calculation of SHA1 digests.

[Chart omitted: Time (ms) versus Block Size (bytes) for GF 9600m GT, GF 9400m, CPU OpenCL and CPU 1-thread.]

Figure 4.4.3 - Execution times of SHA1 for all devices using different block sizes

Device               Throughput (MBytes/s)
GeForce 9400m        51.9
GeForce 9600m GT     123.5
CPU single thread    30.4
CPU OpenCL           155

Table 4.4.4 - Throughput measurements of the execution of SHA1 on different devices


Chapter 5

Compression on GPU

5.1 Background
Compression is an essential operation: a lot of data are compressed every day in order to reduce their size and make them more suitable for transfer over the Internet. There are two different types of compression: lossy and lossless. Lossy compression refers to algorithms that reduce the size of a file at a cost to its quality; it is used on photos, sound, video and, more generally, on files whose main characteristics remain recognizable even at reduced quality. Lossless compression, on the other hand, refers to algorithms that reduce a file's size in such a way that decompression recovers exactly the file that was originally compressed. This kind of compression is mostly used on files such as text files, executables etc. In this section, we investigate the prospects of lossless data compression on the GPU. Many compression algorithms take advantage of the fact that data sequences contain large identical subsequences that can be encoded with smaller representations. We are going to implement the dictionary-based Lempel-Ziv 78 (LZ78) algorithm [13] for execution on the GPU, so this is a good place for a brief description of this algorithm. Dictionary-based algorithms are often used because of their simplicity, and simple algorithms work better on the GPU. The LZ78 algorithm uses a dictionary that is updated while traversing the available data, and it also keeps the longest sequence found so far in the dictionary (called the prefix). Input is processed byte by byte. Each time a new character


is read, a search takes place to find out whether the sequence {prefix + new character} is present in the dictionary. If it is present, we extend the prefix with the new character and keep reading characters, following the same procedure until a match in the dictionary can no longer be found. At that point, we add a new dictionary entry containing the sequence {prefix + new character}, we reset the prefix, and we output the pair {position of the prefix in the dictionary, new character}. This is a compressed sequence. The procedure continues, constantly updating the dictionary with new sequences and outputting references to it, until there is no more input. The opposite operation, decompression, follows the same technique by constructing an identical dictionary and following the references.
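To make the procedure concrete, here is a minimal C sketch of the LZ78 loop just described. It uses a fixed-size dictionary with a linear search and simply prints the output pairs, so it is illustrative rather than efficient; note that when the dictionary fills, this sketch just stops adding entries, whereas the GPU version described later replaces the oldest entry instead:

#include <stdint.h>
#include <stdio.h>

#define DICT_MAX 4096

/* Each dictionary entry is a (prefix index, character) pair;
   index 0 stands for the empty prefix. */
typedef struct { int prefix; uint8_t ch; } dict_entry;

static int dict_find(const dict_entry *d, int n, int prefix, uint8_t ch)
{
    for (int i = 1; i < n; i++)              /* linear search, for clarity only */
        if (d[i].prefix == prefix && d[i].ch == ch)
            return i;
    return 0;
}

void lz78_compress(const uint8_t *in, size_t len)
{
    dict_entry dict[DICT_MAX];
    int n = 1, prefix = 0;                   /* entry 0 = empty string */

    for (size_t i = 0; i < len; i++) {
        int match = dict_find(dict, n, prefix, in[i]);
        if (match) {
            prefix = match;                  /* extend the current prefix */
        } else {
            printf("(%d, 0x%02x)\n", prefix, in[i]);  /* emit (prefix, char) */
            if (n < DICT_MAX) {              /* add the new sequence if room */
                dict[n].prefix = prefix;
                dict[n].ch = in[i];
                n++;
            }
            prefix = 0;                      /* reset for the next phrase */
        }
    }
    if (prefix)                              /* flush a trailing match */
        printf("(%d)\n", prefix);
}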

5.2 GPU advantages and disadvantages


After getting a clearer understanding of how lossless compression algorithms work, we will present which of their characteristics prevent the full exploitation of the GPU's computation power, and how we can deal with these problems. We should note that the problems of moving data to and from the GPU, discussed in the chapters on encryption and hashing, also apply here.

- Synchronization. The main idea behind compression algorithms is to find repeated sequences of characters in a file and replace them with a shorter representation depending on their frequency in the file. This operation is optimized when we have a central dictionary structure that controls the execution of the algorithm and optimizes the compression ratio by keeping as much information as possible. As mentioned in previous chapters, the GPU likes to execute a lot of threads in parallel, which means that these threads must operate on independent data. This also means that each thread cannot make use of information gathered by other threads unless there is some kind of synchronization between them, which would slow down the whole procedure; the shared data would then have to be moved back and forth between the GPU and the host device in order to feed the next blocks, making the algorithm even more complex. The only efficient way to implement a lossless compression algorithm on the GPU is to sacrifice compression ratio in order to obtain the wanted parallelization. This can be done by compressing different blocks of


data independently, treating them as different streams of data. This will reduce our compression ratio a little, but it will speed up the whole procedure.

- Complex and branched algorithms. Compression algorithms contain a lot of branches in their code, many if and while statements that sometimes force different threads to follow different paths of execution. As a result, different threads execute different instructions, which leads to sequential execution between threads in some parts of the code. There is not much we can do to avoid this in a GPU implementation, so this is an important disadvantage. Compression algorithms also contain few arithmetic operations and are mostly about searching for patterns, so we cannot take advantage of the computation power of GPUs.

Another important issue when dealing with GPUs is the limited memory supplied and the restrictions on memory allocation in current parallel programming frameworks for GPUs like OpenCL and CUDA. Dynamic memory allocation is not supported in running kernels, so we need to know the size of the current block in advance. When dealing with compression and decompression, the amount of memory needed for the compressed/decompressed data is not always known in advance. A way to overcome this problem is to adopt some conventions; for example, the compressed output of a block of data can be capped at a maximum size equal to the original size plus some header information about the compressed block (see the sketch at the end of this section). To decompress a block, we need to know the size of the original block in advance by reading the appropriate header information, so that we can easily allocate the memory required for decompression. Apart from this, compression algorithms need to allocate memory for a number of sub-operations. This requires a re-implementation of the compression algorithm in order to follow the GPU framework standards. A successful GPU implementation must supply enough pre-allocated memory to the (de)compression kernel in order to successfully (de)compress all blocks without running out of memory resources. The limited GPU global memory and the large number of concurrent threads that deal with different blocks are an important problem that needs to be solved. All the problems discussed above, plus the complex nature of compression algorithms, must be taken into account: the main structure of the algorithm needs to be optimized and modified in order to satisfy all GPU restrictions and to take

31

advantage of all GPU benefits.
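
Continuing the sketch above, the worst-case convention lets the host pre-allocate every output buffer before any kernel runs. This assumes an existing context ctx and the BLOCK_SIZE/num_blocks values from the previous sketch; HEADER_BYTES is an illustrative constant for the per-block size field:

    #define HEADER_BYTES 4   /* room for one 32-bit compressed-size field */

    size_t slot_bytes        = BLOCK_SIZE + HEADER_BYTES; /* worst case per block */
    size_t worst_case_output = num_blocks * slot_bytes;
    cl_int err;

    /* sized for the worst case, so no kernel can run out of output room */
    cl_mem output_buf = clCreateBuffer(ctx, CL_MEM_WRITE_ONLY,
                                       worst_case_output, NULL, &err);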

5.3 Relevant Work


There are no directly relevant academic papers on lossless data compression on the GPU. In contrast, there are many research papers on lossy compression, especially lossy image compression on the GPU, because GPUs are optimized for handling image data. The absence of such papers can be explained by the nature of lossless compression algorithms: as described in the previous section, they do not fit the GPU architecture well. Nevertheless, there is relevant work on parallel block compression in general, which is the method we will use in the implementation part. In [22], a parallel block compression approach is used to speed up dictionary-based compression algorithms. Because processing blocks in parallel with independent dictionaries may reduce the compression ratio, a joint dictionary construction is proposed in which different compression processes reference a shared dictionary. A well-known block compression program is bzip2 [23], which combines several famous compression algorithms, including the Burrows-Wheeler transform [24] and Huffman coding [25]. It works on blocks and compresses each block independently. The problem is that it operates on large blocks, usually between 100 and 900 Kbytes, which makes it a poor candidate for GPUs due to their limited memory.

5.4 Implementation of LZ78 & Results

For the compression algorithm, LZ78 was chosen. Before making this choice, several compression libraries, such as bzip and gzip, were examined, but they proved too complicated for the GPU architecture: their code is large, with many branches and heavy memory operations. For this reason we decided to implement an LZ78 variant that fits the GPU well and then test it in practice. Dictionary-based compression algorithms are often used because of their simplicity. We must note that this implementation was created for the GPU architecture; CPU implementations can be much faster because of their large memory and freedom in memory allocation. For the purposes of this dissertation, we decided to create an implementation that fits the GPU architecture and to test it on several devices.

Our main concern was to speed up the compression process as much as possible. From the beginning it was clear that the bottleneck would be transferring data to and from the GPU. For this reason we have to choose a relatively large global size of data to be compressed at a time, with respect to the total available memory of the GPU device. Of course, these parameters depend on the hardware and the PCI Express bandwidth; on different systems we need to make sure that the full bandwidth is used.

The main idea for compression on the GPU is to split the data and process blocks in parallel, compressing them independently. We can follow two approaches here: either give a block of data to a workgroup, or give a block of data to each work item. The first approach can lead to a better compression ratio, but it needs some kind of synchronization between work threads. The idea is to create a shared dictionary for each workgroup that all work items within it can update and reference. The problem with this approach is that synchronization introduces delays and reduces the effective parallelism. It was not implemented here but can be considered as potential future work of this dissertation. The second approach, assigning an independent small block to each work item, is faster but results in a reduced compression ratio. For the implementation part, we use this approach.

Another issue is the dictionary size of each work item. LZ78 uses a dynamic dictionary built during compression, but because of GPU memory limits we need to cap its size. The bigger the dictionary, the better the compression ratio we can achieve, yet the large number of threads the GPU platform needs forces the dictionary to stay small. When the dictionary is full and we want to add a new entry, we do so by replacing the oldest entry of the dictionary with the new sequence, as the sketch below illustrates.
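
The following sketch shows this replace-the-oldest policy for a fixed-capacity, per-work-item dictionary. The entry layout and helper name are illustrative, not the exact code of our kernels:

    #define DICT_SIZE 256

    typedef struct {            /* illustrative entry layout */
        ushort prefix;          /* index of the prefix entry in the dictionary */
        uchar  ch;              /* final character of this sequence */
    } dict_entry;

    /* Append a new sequence; once the dictionary is full, recycle the
       oldest slot in ring-buffer fashion. dict[] lives in each work
       item's private memory. */
    void dict_add(dict_entry *dict, uint *count, uint *oldest,
                  ushort prefix, uchar ch)
    {
        uint slot;
        if (*count < DICT_SIZE) {
            slot = (*count)++;                      /* still room: append     */
        } else {
            slot = *oldest;                         /* full: overwrite oldest */
            *oldest = (*oldest + 1) % DICT_SIZE;
        }
        dict[slot].prefix = prefix;
        dict[slot].ch     = ch;
    }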


Instead of using registers to store the dictionary, we could also use the shared workgroup memory, which has a larger capacity, usually 16 KB, and can be as fast as registers when there are no memory bank conflicts between threads requesting a transaction. Shared memory, unlike global memory, can serve multiple transactions, up to 16, from different work items in parallel. For our implementation, we chose to bypass shared memory and copy small chunks of data into registers each time for faster execution, along the lines of the sketch below.
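
A minimal kernel fragment of this staging idea, assuming the dictionary matching happens elsewhere (CHUNK is an illustrative size that must stay small enough for the array to remain in registers):

    #define CHUNK 16

    __kernel void lz78_block(__global const uchar *in, uint block_size)
    {
        uint  base = get_global_id(0) * block_size;
        uchar window[CHUNK];              /* private memory, ideally registers */

        for (uint off = 0; off < block_size; off += CHUNK) {
            for (uint i = 0; i < CHUNK; i++)        /* manual copy: kernels */
                window[i] = (off + i < block_size)  /* provide no memcpy()  */
                          ? in[base + off + i] : 0;
            /* ... match window[] against the dictionary here ... */
        }
    }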

For the current implementation, we chose a small dictionary of 256 entries for a number of reasons:

1. The first and most important reason is the limited GPU memory. Each work item must have a small dictionary if we want to guarantee that we do not run out of memory.

2. A small dictionary needs few bits per reference: any of the 256 dictionary entries can be referenced with 8 bits.

3. Our implementation uses a sequential search to find a match in the dictionary, so a larger dictionary would mean longer search times. A sketch of this search appears below.
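
The sequential search is a plain linear scan; this sketch reuses the illustrative dict_entry layout from the earlier dictionary sketch:

    /* Return the dictionary index of (prefix, ch), or -1 if absent.
       With 256 entries the index always fits in 8 bits, so a reference
       costs exactly one byte in the output stream. */
    int dict_find(const dict_entry *dict, uint count, ushort prefix, uchar ch)
    {
        for (uint i = 0; i < count; i++)
            if (dict[i].prefix == prefix && dict[i].ch == ch)
                return (int)i;
        return -1;    /* caller emits a literal and then calls dict_add() */
    }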

As noted before, the OpenCL framework does not support dynamic memory allocation, and this is a problem for compression/decompression functions because the compressed and decompressed sizes are not known in advance. To bypass these allocation issues we adopt some conventions. When a work item completes the compression of a block of data, it also saves the compressed data size. At decompression time, the decompression function then knows that reading that many compressed bytes yields an uncompressed sequence of the fixed block size, so the required memory can be pre-allocated. For the compression output, buffers of the same size as the input data were pre-allocated, with the convention that if a block's compressed form turns out larger than the input, the input is stored unchanged instead. The fragment below sketches how each work item records the compressed size.
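
A fragment of how a work item could record that size, using the worst-case slot layout from the allocation sketch earlier (field widths and offsets are illustrative):

    /* Inside the compression kernel, after this work item has produced
       csize compressed bytes for its block: */
    __global uchar *slot = out + get_global_id(0) * slot_bytes;

    slot[0] = (uchar)(csize      );    /* 32-bit little-endian size header */
    slot[1] = (uchar)(csize >>  8);
    slot[2] = (uchar)(csize >> 16);
    slot[3] = (uchar)(csize >> 24);
    /* the compressed payload starts at slot + 4; decompression reads the
       header first, so the needed memory is known before unpacking */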


For the testing procedure, we used the same graphics cards and CPU as in the encryption chapter (GeForce 9400m, GeForce 9600m GT, Intel Core 2 Duo at 2.26 GHz); for specifications please refer to table 3.4.2. In this section we present the times for compressing a 9.3 MB file with our LZ78 implementation. The reported times cover tests with different block sizes, where the block size is the amount of data given to each work item for compressing an independent block. For all tests, a dictionary of 256 entries was used.

Figure 5.4.1 - Execution times of LZ78 for all devices using different block sizes (time in ms versus block size in bytes; series: CPU OpenCL, GF 9400m, GF 9600m GT, CPU 1-thread)

In figure 5.4.1, we can see the results of the implemented LZ78 algorithm. On the GPUs, execution time is lowest for relatively small block sizes between 128 and 1024 bytes. This is because small blocks don't contain enough information to take full advantage of the dictionary; as a result there are fewer replacements, which causes fewer threads to follow divergent paths. As the block size grows, more and more threads diverge. The 9400m is always slower than the sequential CPU implementation. The 9600m GT fares better: its execution times are roughly 50% lower than those of the 9400m, which again can be explained by the difference in stream processors (16 vs 32). The 9600m GT always beats the sequential CPU implementation, but for large block sizes its performance drops. In general, the LZ78 algorithm performs best as a multithreaded CPU program (OpenCL CPU). Before drawing conclusions, we have to look at how these block sizes affect the compression ratio. The next figure presents the compressed size achieved for the same 9.3 MB file after parallel block compression with different block sizes.

Figure 5.4.2 - Compressed size achieved with different block sizes using our LZ78 implementation with a small fixed-size dictionary (compressed size in MB versus block size in bytes)

We can see that for very small block sizes the compressed size is very large and nearly unaffected by the block size. Small blocks don't give the algorithm the chance to fill all available positions of the dictionary: with a 256-entry dictionary, blocks of 64, 128, 256 or 512 bytes cannot exploit it fully, because each entry can represent several bytes. Fewer used dictionary entries mean fewer possible compressed sequences, which is why we see an improvement beyond a block size of 512 bytes. A CPU implementation with an unbounded (or very large) dictionary would give much better compressed sizes. From figures 5.4.1 and 5.4.2 we can state that, for the specific parameters we selected, an optimal block size per work item lies between 512 and 1024 bytes, since these sizes give good execution times and a relatively good compression ratio. To conclude, the results show that GPU memory limitations can be very harmful to the resulting compressed size; moreover, the nature of compression algorithms does not allow the GPU's computational power to be exploited. GPUs are not yet ready for this task.


Chapter 6

Putting it all together

In this chapter, we examine how some of the algorithms discussed earlier can be combined on the GPU to process a single stream of data more efficiently. We already know that a stream of data can be divided into small blocks for parallel encryption and compression. Hashing algorithms, on the other hand, are strictly sequential and must process each block in order, and combining a sequential algorithm with parallel ones is not effective on the GPU. So in this section we discuss how compression and encryption can be combined on the GPU for maximum performance.

The idea is to move blocks to the GPU, compress them, then encrypt them, and finally transfer them back to the host. By combining these two operations on the GPU, we reduce the time spent transferring data between the host and the GPU compared with executing encryption and compression independently (figure 6.1). A compressed stream also leaves less data to encrypt; unfortunately, the exact compressed size cannot be known in advance, so buffers must be allocated, and data transferred back, for the worst possible case. In figure 6.1, the red arrows represent operations that require recurrent data transfers and make heavy use of the PCI Express bandwidth, while the green arrows indicate operations that happen directly on the device. From figures 6.1a and 6.1b it is clear that combining encryption and compression reduces the total time needed to move data between the two devices: heavy transfers across PCI Express drop from 3 to 2. The host-side sequence for the combined case is sketched below.
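
With an in-order OpenCL command queue, the combined round trip amounts to one upload, one kernel launch and one download; the kernel and buffer names here are illustrative:

    clEnqueueWriteBuffer(queue, in_buf, CL_FALSE, 0, input_size, input,
                         0, NULL, NULL);
    clEnqueueNDRangeKernel(queue, compress_encrypt_kernel, 1, NULL,
                           &global_size, NULL, 0, NULL, NULL);
    clEnqueueReadBuffer(queue, out_buf, CL_TRUE, 0, worst_case_output,
                        output, 0, NULL, NULL);
    /* run separately, compression and encryption would need an extra
       read-back and re-upload between the two kernels */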


Figure 6.1 - (a) Encryption and compression executed separately, (b) Combined execution

Our goal now is to choose an efficient block size that suits both encryption and compression. According to the results presented in the encryption chapter, small block sizes of up to 2048 bytes perform best. On the other hand, the compression results indicate that block sizes smaller than 1024 bytes suffer from a reduced compression ratio. In general, larger blocks improve the compression ratio, but if we take into account the GPU's inability to supply enough memory, we soon realize that very large blocks are impossible: we need a very large number of threads in flight, each assigned to its own block, so the limited GPU memory prevents us from satisfying both conditions. An efficient block size for both encryption and compression therefore seems to lie between 1024 and 2048 bytes.

The procedure for this combined operation appears in figure 6.2. Each work item is responsible for one block of the chosen size. It compresses the block and then encrypts the compressed output. It then stores the final output size and the compressed/encrypted block (C/E) at the appropriate place in global memory. The size information is needed because the host must know how many bytes each block produced in order to recover it; it is also needed for the decryption/decompression operation. A kernel-level sketch of one work item's job follows.
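
A per-work-item sketch of figure 6.2 is shown below. The helpers lz78_compress() and salsa20_encrypt() stand in for the routines developed in the earlier chapters; their signatures, and the fixed scratch size, are illustrative assumptions:

    __kernel void compress_then_encrypt(__global const uchar *in,
                                        __global uchar       *out,
                                        __global uint        *out_sizes,
                                        uint block_size, uint slot_bytes)
    {
        uint  gid = get_global_id(0);
        uchar scratch[2048];    /* worst case for the block sizes above */

        /* 1. compress this work item's block into private scratch space */
        uint csize = lz78_compress(in + gid * block_size, block_size, scratch);

        /* 2. encrypt the compressed bytes in place; gid feeds the nonce
              so every block gets a distinct keystream */
        salsa20_encrypt(scratch, csize, gid);

        /* 3. publish size and payload so the host (and later the
              decryption/decompression step) can recover each block */
        out_sizes[gid] = csize;
        __global uchar *slot = out + gid * slot_bytes;
        for (uint i = 0; i < csize; i++)
            slot[i] = scratch[i];
    }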


Figure 6.2 - Each work item (Wn) compresses a block and then encrypts the compressed output


Chapter 7

Discussion

In previous sections we examined several algorithms and developed GPU implementations of them. In this section we discuss the derived results in detail and evaluate them critically. The algorithms were of different natures, and some (the compression part in particular) had to be re-implemented from scratch to fit the GPU architecture. We investigated three categories of algorithms: hashing, encryption and compression. We also examined how encryption and compression can be executed on the GPU with a single call by determining a block size that suits both.

For hashing and encryption, the available algorithms are all quite similar, so the results generalize somewhat beyond Salsa20, MD5 and SHA1. The compression implementation, on the other hand, was the trickiest: many compression algorithms exist, each based on a different approach, so they may or may not fit the GPU. We chose an algorithm that was relatively simple and easy to parallelize, at the cost of speed and compression ratio.

The hashing and encryption results are very straightforward: the GPU implementations are much more effective than a single-threaded CPU version, and the results also show that more powerful GPUs can easily overtake a multithreaded CPU implementation. Mid-range GPUs can therefore be very efficient at these tasks and assist or replace CPUs. For Salsa20 we obtained good results for small block sizes between 64 and 2048 bytes; in fact, block sizes of 64 and 128 bytes appear optimal for our implementation. MD5 and SHA1 peaked at block sizes of 1024 or 2048 bytes, with acceptable performance in the range of 512 to 4096 bytes.


The results for the compression part are less encouraging. As explained in the relevant section, some characteristics of the algorithm, notably the compression ratio, had to be sacrificed. The block sizes that balanced speed and compression ratio were 1024 and 2048 bytes; bigger blocks improved the compression ratio but hurt speed.

Finally, the combined execution of encryption and compression improves performance. This is a natural result: every block of data stays on the GPU longer and is used for more computations, so the ratio of computation to data transferred increases, which is the whole point of GPU parallelism: move less data, compute more in parallel. It would be good if hashing could join the other two operations on the GPU, but as mentioned before its sequential nature prevents this: a large block of data can be divided into sub-blocks for independent encryption and compression, but not for computing a digest with a hashing algorithm.

At this point, we would also like to revisit our primary motivation, which was using GPU computation power to assist the CPU with the operations required by a backup system (hashing, encryption, compression). Hashing and encryption results were very promising on the GPU, but compression had many problems. An efficient backup system could therefore use the CPU to compress files and then send them to the GPU for encryption and hashing in a pipelined fashion. According to the results, an efficient system needs a GPU with 32 or more stream processors. In general, taking all the results of this dissertation into account, we can state that each algorithm's performance improves by nearly 50% when the number of available stream processors is doubled from 16 to 32.

7.1 Project difficulties


During the implementation phase of this project we ran into a number of difficulties. In this section we discuss the ones that proved most important.


For the purposes of this project we had to implement a number of algorithms of different kinds. We found existing implementations that we tried to adapt to the GPU architecture, but they were designed for optimal execution on a CPU, and the GPU compiler does not support the full C language. For example, memory functions such as memcpy are not available inside GPU kernels, so wherever memory had to be copied we did it manually, with loops like the sketch below. Another difficulty was that the GPU has several distinct address spaces (described in previous chapters), so for optimal execution we had to move data between them.

The debugging process also turned out to be much harder than we expected. GPU devices currently do not support output functions such as printf, so checking the contents of variables at runtime was not easy. We created an extra buffer in GPU global memory, stored any values we needed to inspect there, then transferred them back and printed them on the host. The weakness of this approach is that when a bug caused the kernel to crash, execution never reached the point where the data could be sent back for examination; in such cases we had to execute small parts of the kernel at a time until we reached the problem. As in most parallel and distributed systems, debugging many instances running in parallel was difficult: coordinating the execution of hundreds of threads was hard at first, but only until our first algorithm was running, since the same coordination and debugging method was reused for all algorithms. The compression algorithms were the hardest to adapt to the GPU because of their complex memory operations and large code size, which is why a simple implementation of the LZ78 compression algorithm was created.
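
The replacement copies were simple loops; a helper of the kind we used (the address-space combination shown is one of several we needed):

    /* Hand-rolled stand-in for memcpy(), which OpenCL kernels lack. */
    void copy_global_to_private(__private uchar *dst,
                                __global const uchar *src, uint n)
    {
        for (uint i = 0; i < n; i++)
            dst[i] = src[i];
    }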


7.2 Future Work


The subject of this dissertation spanned several areas of study: hashing, encryption and compression. We did our best to create algorithms that execute efficiently on the GPU, but there is always room for improvement. While studying the behavior of such algorithms on the GPU, we found very limited information and academic work on data compression on the GPU. Because of the limited time available for this project, we could not go very deep into that area, but we believe this research can serve as a starting point for future implementations. The approach proposed for LZ78 with a shared, synchronized dictionary per workgroup could be examined as future work of this dissertation. More efficient dictionary search techniques than sequential search, such as hash tables, could also be tried; the limited GPU memory and the inability to allocate memory dynamically prevented us from pursuing this. Research into building an efficient fixed-size hash table for the GPU platform would be very helpful for the LZ78 algorithm and could speed up the process by a large factor; a rough sketch of the idea follows this paragraph.

As further future work, these algorithms could be tested on more powerful, high-end GPUs. The GPUs used in our testing (NVIDIA GeForce 9400m, NVIDIA GeForce 9600m GT) were entry-level and mid-range devices, but they served the purpose of this dissertation, which was to examine whether laptop and desktop GPUs can speed up these operations. Another possible extension is to investigate in detail how the CPU and GPU can cooperate to achieve maximum performance for hashing, encryption and compression in a pipelined fashion: how these operations can be synchronized, and what speedups can be achieved over a pure CPU implementation.
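
As a purely speculative sketch of that future-work idea, a fixed-size open-addressing table avoids dynamic allocation entirely; the table size, hash mix and entry layout are all illustrative assumptions:

    #define TABLE_SIZE 256      /* power of two, so masking replaces modulo  */
    #define EMPTY      0xFFFF   /* prefix value reserved to mark unused slots */

    typedef struct { ushort prefix; uchar ch; } slot_t;

    uint dict_hash(ushort prefix, uchar ch)
    {
        uint h = ((uint)prefix << 8) ^ ch;
        return (h * 2654435761u) & (TABLE_SIZE - 1);  /* multiplicative mix */
    }

    /* Linear probing: everything stays in fixed-size arrays. */
    int table_find(const slot_t *t, ushort prefix, uchar ch)
    {
        uint i = dict_hash(prefix, ch);
        for (uint probe = 0; probe < TABLE_SIZE; probe++) {
            if (t[i].prefix == EMPTY) return -1;        /* not present */
            if (t[i].prefix == prefix && t[i].ch == ch) return (int)i;
            i = (i + 1) & (TABLE_SIZE - 1);
        }
        return -1;                                      /* table full */
    }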


Chapter 8

Conclusion

The computational power of GPU devices grows year by year, and as it grows, more and more computationally intensive fields adopt it to achieve greater speedups. Encryption and hashing algorithms had already been tried on the GPU architecture and showed great speedups, but most of those were achieved on expensive high-end GPUs with very large numbers of stream processors and high clock frequencies. In this dissertation we showed that even entry-level and mid-range GPUs can be used effectively for encryption and hashing; the results we obtained from the Salsa20 and MD5 algorithms are very encouraging. Unfortunately, there are fields, such as compression, that are not yet ready to take full advantage of GPU devices: compression algorithms must be implemented with many restrictions in mind in order to run on GPUs, and these restrictions cost both speed and compression ratio.

In general, we can say that GPUs with 32 or more stream processors can serve as a powerful computation device for any algorithm involving intensive computation. There is a great deal of unexploited computational power in most users' desktop and laptop GPUs at this moment, and as our results show, it can be used to improve the performance of many algorithms. In previous chapters, we referred many times to the limited GPU memory; we believe that in a few years this will no longer be a problem, as GPUs gain bigger and faster memories. As a result, we believe that in the near future the GPU will be an essential computational device in every user's computer, either assisting the CPU in computationally intensive problems or even replacing it.


Bibliography

[1] P. Anderson and L. Zhang, "Fast and Secure Laptop Backups with Encrypted De-duplication," to appear in Proceedings of the 24th Large Installation System Administration Conference (LISA 2010), San Jose, CA, November 7-12, 2010.

[2] Intel, "Intel microprocessor export compliance metrics," http://www.intel.com/support/processors/sb/cs-023143.htm

[3] NVIDIA Corporation, "Taking the Plunge into GPU Computing," GPU Gems 2, Chapter 32, 2009, http://http.developer.nvidia.com/GPUGems2/gpugems2_chapter32.html

[4] NVIDIA Corporation, OpenCL Programming Guide for the CUDA Architecture, Version 3.1, 2009.

[5] D.J. Bernstein, "The Salsa20 Family of Stream Ciphers," New Stream Cipher Designs: The eSTREAM Finalists, Springer-Verlag, 2008, pp. 84-97.

[6] T. Xie and D. Feng, "How to Find Weak Input Differences for MD5 Collision Attacks," Cryptology ePrint Archive, Report 2009/223, 2009.

[7] R. Rivest, "The MD5 Message-Digest Algorithm," RFC 1321, MIT and RSA Data Security, Inc., 1992.

[8] D. Eastlake and P. Jones, "US Secure Hash Algorithm 1 (SHA1)," RFC 3174, Motorola and Cisco Systems, 2001.

[9] Wikipedia, "Block cipher modes of operation," http://en.wikipedia.org/wiki/Block_cipher_modes_of_operation

[10] N. Pilkington and B. Irwin, "A Canonical Implementation Of The Advanced Encryption Standard On The Graphics Processing Unit," Innovative Minds Conference, Johannesburg, South Africa, July 7-9, 2008.

[11] S. Manavski, "CUDA Compatible GPU as an Efficient Hardware Accelerator for AES Cryptography," Signal Processing and Communications, 2007 (ICSPC 2007), IEEE International Conference on, 2007, pp. 65-68.

[12] NVIDIA Corporation, NVIDIA OpenCL Best Practices Guide, Version 1.0, 2009.

[13] J. Ziv and A. Lempel, "Compression of individual sequences via variable-rate coding," IEEE Transactions on Information Theory, vol. 24, 1978, pp. 530-536.

[14] O. Harrison and J. Waldron, "AES Encryption Implementation and Analysis on Commodity Graphics Processing Units," Proceedings of the 9th International Workshop on Cryptographic Hardware and Embedded Systems, Vienna, Austria, Springer-Verlag, 2007, pp. 209-226.

[15] Accelereyes, "GPU Memory Transfer," http://wiki.accelereyes.com/wiki/index.php/GPU_Memory_Transfer/

[16] Khronos Group, "OpenCL - The open standard for parallel programming of heterogeneous systems," www.khronos.org/opencl/

[17] O. Gervasi, D. Russo, and F. Vella, "The AES Implantation Based on OpenCL for Multi/many Core Architecture," Computational Science and Its Applications (ICCSA), 2010 International Conference on, 2010, pp. 129-134.

[18] Lin Zhou and Wenbao Han, "A Brief Implementation Analysis of SHA-1 on FPGAs, GPUs and Cell Processors," Engineering Computation, 2009 (ICEC '09), International Conference on, 2009, pp. 101-104.

[19] ElcomSoft Co. Ltd., "Lightning Hash Cracker," http://www.elcomsoft.com/lhc.html

[20] Guang Hu, Jianhua Ma, and Benxiong Huang, "High Throughput Implementation of MD5 Algorithm on GPU," Ubiquitous Information Technologies & Applications, 2009 (ICUT '09), Proceedings of the 4th International Conference on, 2009, pp. 1-5.

[21] NVIDIA, "EXT_gpu_shader4 OpenGL extension," 2007, http://developer.download.nvidia.com/opengl/specs/GL_EXT_gpu_shader4.txt

[22] P. Franaszek, J. Robinson, and J. Thomas, "Parallel compression with cooperative dictionary construction," Data Compression Conference, 1996 (DCC '96), Proceedings, 1996, pp. 200-209.

[23] Julian Seward, "bzip2 compression algorithm," http://www.bzip.org/

[24] M. Burrows and D.J. Wheeler, "A block-sorting lossless data compression algorithm," Technical Report 124, Digital Equipment Corporation, 1994.

[25] D. Huffman, "A Method for the Construction of Minimum-Redundancy Codes," Proceedings of the IRE, vol. 40, 1952, pp. 1098-1101.

[26] Packetizer Inc., "Secure Hashing Algorithm (SHA-1) C implementation," http://www.packetizer.com/security/sha1/

[27] D.J. Bernstein, "Why switch from AES to a new stream cipher?," http://cr.yp.to/streamciphers/why.html
