
2015 IEEE International Conference on Cluster Computing

An FPGA-based Accelerator for Neighborhood-based Collaborative Filtering Recommendation Algorithms

Xiang Ma, Chao Wang, Qi Yu, Xi Li, Xuehai Zhou


School of Computer Science
University of Science and Technology of China, Hefei, China
Email: {maxiang, yuiq1123}@mail.ustc.edu.cn; {cswang, llxx, xhzhou}@ustc.edu.cn

Abstract: Neighborhood-based Collaborative Filtering (CF) is a family of techniques in the field of recommendation algorithms and is widely used in personalized recommender systems. In the big data era, growing data volumes make these CF recommendation algorithms time-consuming and energy-hungry. At present, cloud computing and Graphics Processing Units (GPUs) are the two major platforms used to accelerate CF algorithms. However, both platforms have notable shortcomings in efficiency and power. To address these problems, we investigate three neighborhood-based CF algorithms and design a general and flexible accelerator for them based on a Field Programmable Gate Array (FPGA). The accelerator cooperates with the host CPU and accelerates the primary time-consuming parts that these algorithms share. Experimental results show that our accelerator significantly improves acceleration efficiency with affordable hardware cost and lower energy consumption.

Keywords: accelerator; FPGA; recommendation algorithm; neighborhood-based collaborative filtering; heterogeneous

I. INTRODUCTION

Neighborhood-based Collaborative Filtering (NCF) [1] is a primary family of techniques in the field of recommendation algorithms and is used in a wide range of personalized recommender systems. However, because NCF is both data-intensive and compute-intensive, it becomes time-consuming when dealing with vast amounts of data. To meet the real-time interaction requirement between users and recommender systems, it is therefore necessary to accelerate NCF algorithms.

Currently, cloud computing and GPUs are the two major platforms used to accelerate NCF algorithms. However, both platforms have notable shortcomings. For cloud platforms, the build and maintenance costs are relatively high, and the efficiency of each common CPU-based computing node may not be satisfactory. GPUs are usually much more efficient than a cloud computing node, but the energy consumption of each GPU board is considerable.

To improve the efficiency of a CPU-based computing node for NCF algorithms while reducing runtime energy consumption, in this paper we investigate three common neighborhood-based CF algorithms, User-based CF [2], Item-based CF [3] and SlopeOne [4], and propose a general FPGA-based accelerator architecture for them. The accelerator acts as a co-processor and accelerates the primary time-consuming parts that these algorithms share. To evaluate our design, we implement a prototype on a Xilinx ZYNQ SoC and run several experiments on it. Experimental results demonstrate that the accelerator improves speedup and acceleration efficiency with affordable hardware cost and lower energy consumption.

II. NEIGHBORHOOD-BASED COLLABORATIVE FILTERING

User-based CF, Item-based CF and SlopeOne are three representative NCF recommendation algorithms, and each can be divided into a training phase and a prediction phase. In the training phase, User-based CF takes user vectors and calculates the similarity between every pair of users; Item-based CF and SlopeOne take item vectors and calculate the similarity (Item-based CF) or difference (SlopeOne) between every pair of items. In the prediction phase, given an active user vector, all three algorithms use the information from the training phase to predict ratings for the items that the user has not rated. A user vector contains the IDs and ratings of all items the user has rated; an item vector contains the IDs and ratings of all users who have rated the item.

There are many similarity metrics, such as the Jaccard Coefficient, Cosine Similarity and Pearson Correlation. Most metrics operate on the IDs and ratings that the two vectors share, and the same holds for SlopeOne. These calculations mainly involve element-wise operations such as multiplication or subtraction on two vectors, followed by an accumulation.

The three algorithms thus have much in common and differ only slightly in their calculations. We can therefore build a general and flexible accelerator that provides all the needed function modules and uses instructions to compose these fine-grained functions into the different algorithms.

III. ACCELERATOR ARCHITECTURE

A. Overall architecture of CF accelerator

Figure 1. The overall architecture of NCF accelerator.
(Blocks labeled in the figure: Host CPU, DMA, DDR RAM, Data Bus, and, under "Accelerator and Peripherals", the CF Accelerator with its Instruction Buffer, Accelerator Controller, Control Interconnect, and Execution Units 1 to 4.)
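As an illustration of the observation in Section II that these metrics reduce to an element-wise multiply or subtract over the shared entries of two vectors followed by an accumulation, the following Python sketch may help. It is only a software sketch under stated assumptions, not the accelerator's implementation: the sparse-vector representation (a dict mapping ID to rating), all function names, and the choice to restrict Cosine and Pearson to the shared entries are assumptions made here, since the paper does not give the exact formulas.

```python
import math


def shared_ratings(x, y):
    """Return the paired ratings for the IDs present in both sparse vectors."""
    ids = sorted(x.keys() & y.keys())
    return [x[i] for i in ids], [y[i] for i in ids]


def accumulate(xs, ys, op):
    """Apply an element-wise operation to two lists, then sum the results."""
    return sum(op(a, b) for a, b in zip(xs, ys))


def jaccard(x, y):
    """Shared IDs divided by all IDs appearing in either vector."""
    union = len(x.keys() | y.keys())
    return len(x.keys() & y.keys()) / union if union else 0.0


def cosine(x, y):
    """Cosine similarity, restricted to shared entries (an assumption)."""
    xs, ys = shared_ratings(x, y)
    dot = accumulate(xs, ys, lambda a, b: a * b)
    norm_x = math.sqrt(accumulate(xs, xs, lambda a, b: a * b))
    norm_y = math.sqrt(accumulate(ys, ys, lambda a, b: a * b))
    return dot / (norm_x * norm_y) if norm_x and norm_y else 0.0


def pearson(x, y):
    """Pearson correlation over shared entries (means taken over those entries)."""
    xs, ys = shared_ratings(x, y)
    n = len(xs)
    if n == 0:
        return 0.0
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    dx = [a - mean_x for a in xs]
    dy = [b - mean_y for b in ys]
    cov = accumulate(dx, dy, lambda a, b: a * b)
    denom = math.sqrt(accumulate(dx, dx, lambda a, b: a * b)
                      * accumulate(dy, dy, lambda a, b: a * b))
    return cov / denom if denom else 0.0


def slope_one_deviation(x, y):
    """Average rating difference x - y over shared entries (SlopeOne training)."""
    xs, ys = shared_ratings(x, y)
    return accumulate(xs, ys, lambda a, b: a - b) / len(xs) if xs else 0.0


if __name__ == "__main__":
    # Two hypothetical item vectors: user ID -> rating.
    item_a = {1: 5.0, 2: 3.0, 3: 4.0}
    item_b = {2: 1.0, 3: 2.0, 4: 5.0}
    for metric in (jaccard, cosine, pearson, slope_one_deviation):
        print(metric.__name__, metric(item_a, item_b))
```

Every metric here goes through the same two helpers, an intersection step and an operate-then-accumulate step. That shared structure is what makes a single accelerator with composable function modules plausible for all three algorithms.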

978-1-4673-6598-7/15 $31.00 © 2015 IEEE
DOI 10.1109/CLUSTER.2015.79
The overall architecture of the NCF accelerator is illustrated in Fig. 1. The accelerator has an instruction buffer, a controller and multiple execution units, and it works in SIMD mode: all execution units execute the same instruction on different input vectors. To use the accelerator, the host CPU first instructs the DMA to transfer the instructions to the instruction buffer; the accelerator then reads the instructions one by one and executes the corresponding operations, such as load, store or vector accumulation.

B. Execution Units architecture

The architecture of the Execution Units is illustrated in Fig. 2. Each Execution Unit contains an I/O module, a calculation unit, a vector operation unit and a vector update unit. All modules and units are driven by instructions.

The input module executes load instructions to read two vectors from the DMA as streams and puts them into the input X and Y buffers. While reading the second input vector, the input module can perform the vector intersection at the same time, putting the shared elements of the two vectors into the temp X and Y buffers. In the training phase, the output module executes scalar store instructions to send scalars held in the result scalars, such as the similarity value, back to the DMA. In the prediction phase, the output module executes the corresponding vector store instructions to send back the output vector, which holds the predicted value for each item.

The vector operation unit applies an operation such as multiplication or subtraction to each pair of elements of two vectors, accumulates the results, and stores the accumulated value in the corresponding scalar buffer. It is implemented as several operation-adder trees for efficiency. The calculation unit usually calls the vector operation unit first to obtain the scalar values it needs and then performs the remaining metric-specific calculations. The resulting similarity or difference value is also stored in the result scalar buffer. The vector update unit works only in the prediction phase, where it performs a weighted accumulation and a plain accumulation for each item of every input vector. The weighted accumulation values for each item are stored in the output buffer, and the accumulation values are stored in temp buffer Z. After all input vectors have been processed, the update unit divides each item of the output buffer by the corresponding item of temp buffer Z and stores the final prediction values in the output buffer.

Figure 2. The architecture of Execution Unit.
(Blocks labeled in the figure: DMA, Input Module, Output Module, Accelerator Controller, Central Storage holding Input: Vector X, Input: Vector Y, Temp: Vector X, Temp: Vector Y, Temp: Vector Z, Output: Vector and Output: Result Scalars, Calculation Unit, Vector Operation Unit with its Operation-Adder Trees, and Vector Update Unit with its PEs.)

IV. EXPERIMENT

We build an accelerator prototype on the ZYNQ platform of the Digilent ZedBoard. ZYNQ is an embedded SoC that consists of two sections, the processing system (PS) and the programmable logic (PL). The PS integrates two ARM Cortex-A9 cores, and the PL consists of FPGA logic. Our prototype resides in the PL and has one execution unit. The execution unit has one 16-input operation-adder tree in the vector operation unit and 16 PEs in the vector update unit, and the calculation unit supports the Jaccard, Cosine, Pearson and SlopeOne metrics. The prototype runs at 101 MHz. The comparison platform is a 2.3 GHz Intel Xeon E5 processor, with all algorithms run in single-thread mode.

We choose MovieLens-1M [5] as the test dataset. It contains 3883 movies and 6040 users. Experimental results are shown in Fig. 3. For prediction, the reported speedup is the average value. Adding more operation-adder trees and PEs to each execution unit would improve the speedups further. Moreover, the power of our NCF accelerator prototype is 237 mW, far less than that of the Xeon and other CPUs/GPUs. The most heavily used resource is the LUT, at 75.9% of the total. Usage of the other resources averages about 41.8%.

Figure 3. Speedups of accelerator compared with Intel Xeon E5

V. CONCLUSION

In this paper, we design an FPGA-based accelerator that serves as a co-processor to accelerate NCF recommendation algorithms. Experimental results show that the accelerator significantly improves speedup with affordable hardware cost and lower energy consumption.

REFERENCES

[1] X. Su and T. M. Khoshgoftaar, "A survey of collaborative filtering techniques," Advances in Artificial Intelligence, vol. 2009, p. 4, 2009.
[2] D. Goldberg, D. Nichols, B. M. Oki, and D. Terry, "Using collaborative filtering to weave an information tapestry," Communications of the ACM, vol. 35, pp. 61-70, 1992.
[3] B. Sarwar, G. Karypis, J. Konstan, and J. Riedl, "Item-based collaborative filtering recommendation algorithms," in Proceedings of the 10th International Conference on World Wide Web, 2001, pp. 285-295.
[4] D. Lemire and A. Maclachlan, "Slope One Predictors for Online Rating-Based Collaborative Filtering," in SDM, 2005, pp. 1-5.
[5] http://grouplens.org/datasets/movielens/.
