article info
Article history:
Received 21 December 2014
Received in revised form 23 March 2015
Accepted 7 April 2015
Available online xxxx
Keywords:
Big data
Green computing
Efficient machine learning
Computational modeling
abstract
With the emerging technologies and all associated devices, it is predicted that a massive amount of data will be created in the next few years; in fact, as much as 90% of current data were created in the last couple of years, a trend that will continue for the foreseeable future. Sustainable computing studies the process by which computer engineers/scientists design computers and associated subsystems efficiently and effectively with minimal impact on the environment. However, current intelligent machine-learning systems are performance driven: the focus is on predictive/classification accuracy, based on known properties learned from the training samples. For instance, most machine-learning-based nonparametric models are known to require high computational cost in order to find the global optima. With the learning task in a large dataset, the number of hidden nodes within the network will therefore increase significantly, which eventually leads to an exponential rise in computational complexity. This paper thus reviews the theoretical and experimental data-modeling literature in large-scale data-intensive fields, relating to: (1) model efficiency, including computational requirements in learning and the structure and design of data-intensive areas, and introduces (2) new algorithmic approaches with the least memory requirements and processing to minimize computational cost, while maintaining/improving predictive/classification accuracy and stability.
© 2015 Elsevier Inc. All rights reserved.
1. Introduction
Today, it's no surprise that reducing energy costs is one of the top priorities for many energy-related businesses. The global information and communications technology (ICT) industry, which pumps out around 830 Mt of carbon dioxide (CO2) emissions, accounts for approximately 2 percent of global CO2 emissions [1]. ICT giants are constantly installing more servers so as to expand their capacity. The number of server computers in data centers has increased sixfold to 30 million in the last decade, and each server draws far more electricity than earlier models [2]. The aggregate electricity use for servers doubled between 2000 and 2005, most of which came from businesses installing large numbers of new servers [3]. This increase in energy consumption consequently results in higher carbon dioxide emissions, and hence an impact on the environment. Furthermore, most of these
http://dx.doi.org/10.1016/j.bdr.2015.04.001
note that for digital health data, integrity and security issues are of critical importance in the field. For instance, for the former, data compression techniques may not be used in many cases, as they may distort the data; and for the latter, the confidentiality of patient data is clearly cardinal in order to foster public confidence in such technologies.
2.3. Stars, galaxies, and the universe
The digital data volume from the stars, galaxies and universe has multiplied over the past decade due to the rapid development of new technologies such as new satellites, telescopes and other observatory instruments. Recently, the Visible and Infrared Survey Telescope for Astronomy (VISTA) [19] and the Dark Energy Survey (DES) [20], the largest universe survey projects initiated by two different consortiums of universities from the UK and the U.S., are expected to yield databases of 20–30 terabytes in size in the next decade.
According to DES, its observatory field is so large that a single image will record data from an area of the sky 20 times the size of the moon as seen from the earth [20]. The survey will image 5000 square degrees of the southern sky and will take about five years to complete. As for VISTA, its performance requirements were so challenging that it peaks at a 55 megabytes/second data rate with a maximum of 1.4 terabytes of data per night [19]. But these figures are now fairly commonplace. Many other astro-scientific databases, such as the Sloan Digital Sky Survey (SDSS), are already terabytes in size [21], and the Panoramic Survey Telescope-Rapid Response System (Pan-STARRS) is expected to produce a science database of more than 100 terabytes in size over the next five years [22]. Likewise, the Large Synoptic Survey Telescope (LSST) is expected to produce 30 terabytes of data per night, yielding a total database of about 150 petabytes [23]. As the data produced by these new telescopes become available on the Internet, this picture will change radically.
Many believe that the massive data volume and the ever-increasing computing power will dramatically change the way in which conventional science and technology are conducted. We believe that this surge in data will open up and challenge further research in each field, hence instigating the search for new approaches. Likewise, such challenges need to be addressed in the area of intelligent information science as well.
3. Sustainable data modeling and efficient learning
With consideration of the large influx of data, it is definitely necessary to improve the way in which conventional computational/analytic data models are designed and developed. Sustainable data modeling can be defined as a form of data-modeling technology aimed to make sense of the large amount of data associated with its own field, by discovering patterns and correlations in an effective and efficient way. Sustainable data modeling specifically focuses on 1) maximum learning accuracy with minimum computational cost, and 2) rapid and efficient processing of large volumes of data. Sustainable data modeling seems to be ideal because of the ease with which large quantities of data are handled efficiently, as well as the associated cost reduction observed in many cases. In a wider perspective, it entails a data-modeling revolution in e-sciences. In fact, these newly designed sustainable data models will effectively cope with the above data issues and, as a result, bring about benefits to the various e-science areas. Some of the excellent examples are well discussed in the articles of Patnaik et al., Sundaravaradan et al., and Marwah et al. [24–27]. Hence, in this section, we will give a few recommendations to green engineers/researchers on a few key mechanics of sustainable data modeling.
\[ y = \arg\max_{c_j \in C} \sum_{h_i \in H} P(c_j \mid h_i)\, P(T \mid h_i)\, P(h_i), \]
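The Bayes-combination rule above weights each class's probability by the posterior support of every hypothesis. As an illustrative sketch (not from the original paper), the following Python function implements that weighted vote; the hypothesis set, priors, and likelihood values below are hypothetical numbers chosen only for illustration:

```python
def bayes_optimal_class(classes, hypotheses, p_c_given_h, p_T_given_h, p_h):
    """Return the class maximizing sum_h P(c|h) * P(T|h) * P(h)."""
    scores = {}
    for c in classes:
        scores[c] = sum(p_c_given_h[c][h] * p_T_given_h[h] * p_h[h]
                        for h in hypotheses)
    return max(scores, key=scores.get)

# Hypothetical two-hypothesis, two-class illustration
classes = ["pos", "neg"]
hypotheses = ["h1", "h2"]
p_h = {"h1": 0.6, "h2": 0.4}            # hypothesis priors P(h)
p_T_given_h = {"h1": 0.9, "h2": 0.2}    # data likelihoods P(T|h)
p_c_given_h = {"pos": {"h1": 0.8, "h2": 0.1},
               "neg": {"h1": 0.2, "h2": 0.9}}

print(bayes_optimal_class(classes, hypotheses, p_c_given_h, p_T_given_h, p_h))
```

Here "pos" wins because the high-prior, high-likelihood hypothesis h1 strongly favors it, even though h2 favors "neg".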
\[ Z_i \exp\!\left( -\frac{(x - c_i)^T (x - c_i)}{2\sigma^2} \right), \qquad Z_i = \sum_{j=1}^{\ell} \exp\!\left( -\frac{(x - x_j)^T (x - x_j)}{2\sigma^2} \right), \]
\[ f(x) = \operatorname{sgn}\!\left( \sum_{i=1}^{\ell} y_i a_i \big\langle \Phi(x), \Phi(x_i) \big\rangle + b \right) = \operatorname{sgn}\!\left( \sum_{i=1}^{\ell} y_i a_i\, k(x, x_i) + b \right), \]
\[ \text{maximize } W(a) = \sum_{i=1}^{\ell} a_i - \frac{1}{2} \sum_{i,j=1}^{\ell} a_i a_j y_i y_j\, k(x_i, x_j), \]
subject to \( a_i \ge 0,\ i = 1, \ldots, \ell, \) and \( \sum_{i=1}^{\ell} a_i y_i = 0. \)
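Once the dual multipliers a_i and bias b are known, evaluating the kernel decision function is a simple weighted sum over the support vectors. The NumPy sketch below assumes an RBF kernel and hypothetical support vectors, labels, and multipliers (in practice they come from solving the dual above):

```python
import numpy as np

def rbf_kernel(x, z, gamma=1.0):
    # k(x, z) = exp(-gamma * ||x - z||^2)
    return np.exp(-gamma * np.sum((x - z) ** 2))

def svm_decision(x, support_vectors, labels, alphas, b, gamma=1.0):
    """f(x) = sgn(sum_i y_i * a_i * k(x, x_i) + b)."""
    s = sum(y * a * rbf_kernel(x, sv, gamma)
            for sv, y, a in zip(support_vectors, labels, alphas))
    return int(np.sign(s + b))

# Hypothetical support set: one prototype per class
svs = [np.array([0.0, 0.0]), np.array([2.0, 2.0])]
ys = [1, -1]
alphas = [0.5, 0.5]
b = 0.0
print(svm_decision(np.array([0.1, 0.1]), svs, ys, alphas, b))
```

A query near the positive prototype is classified +1; one near the negative prototype is classified -1. Note that the per-query cost scales with the number of support vectors, which is exactly what the quantization scheme discussed below reduces.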
\[ \min_{C, Z}\ \sum_{j=1}^{k} \sum_{i=1}^{n} Z_{i,j} \| X_i - C_j \|_2^2 + R \sum_{j=1}^{k} \left| \sum_{i=1}^{n} Z_{i,j} y_i \right|, \]
\[ \min_{Z, t}\ \sum_{j=1}^{k} \sum_{i=1}^{n} Z_{i,j} \| X_i - C_j \|_2^2 + R \sum_{j=1}^{k} t_j \]
\[ \text{s.t.}\quad -t_j \le \sum_{i=1}^{n} Z_{i,j} y_i \le t_j, \quad t_j \ge 0, \quad 0 \le Z_{i,j} \le 1, \quad \sum_{j=1}^{k} Z_{i,j} = 1, \quad i = 1, \ldots, n, \]
\[ Q_j(X_m) = C_j = \frac{\sum_{i=1}^{n} Z_{i,j} X_i}{\sum_{i=1}^{n} Z_{i,j}}, \quad j = 1, 2, \ldots, N. \]
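Assuming the clustering objective takes the form of a quantization (distortion) term plus a label-balance penalty, i.e. sum_j sum_i Z_{i,j}||X_i - C_j||^2 + R * sum_j |sum_i Z_{i,j} y_i|, it can be evaluated directly with NumPy. The function and the toy data below are hypothetical illustrations, not from the original paper:

```python
import numpy as np

def balanced_kmeans_objective(X, y, Z, C, R):
    """Distortion term plus R-weighted label-balance penalty.

    X: (n, d) samples; y: (n,) labels in {-1, +1};
    Z: (n, k) assignment weights; C: (k, d) centroids; R: penalty weight.
    """
    # squared distances between every sample and every centroid: (n, k)
    d2 = ((X[:, None, :] - C[None, :, :]) ** 2).sum(axis=2)
    distortion = (Z * d2).sum()
    # penalty is zero when each cluster holds equally many +1 and -1 labels
    balance_penalty = np.abs(Z.T @ y).sum()
    return distortion + R * balance_penalty

# Hypothetical toy instance: 4 points, 2 clusters, perfectly balanced labels
X = np.array([[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [5.0, 6.0]])
y = np.array([1, -1, 1, -1])
Z = np.array([[1, 0], [1, 0], [0, 1], [0, 1]], dtype=float)
C = np.array([[0.0, 0.5], [5.0, 5.5]])
print(balanced_kmeans_objective(X, y, Z, C, R=1.0))
```

With centroids at the cluster means and labels balanced within each cluster, the penalty vanishes and only the distortion remains.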
To find the optimal C and P, vector quantization uses a square-error distortion measure that specifies exactly how close the approximation is. The distortion measure is given as:
\[ d(x, c_i) = \sum_{j=1}^{k} (x_j - c_{ij})^2, \]
and the average distortion over the M training vectors is
\[ D_{ave} = \frac{1}{Mk} \sum_{m=1}^{M} \| X_m - Q(X_m) \|^2 . \]
If C and P are solution parameters to the minimization problem, they must satisfy two conditions: (1) nearest-neighbor and (2) centroid. The nearest-neighbor condition indicates that the subregion S_n should consist of all the vectors that are closer to c_n than to any of the other codevectors:
\[ S_n = \left\{ x : \| x - c_n \|^2 \le \| x - c_{n'} \|^2 \ \text{for all } n' \ne n \right\}, \quad n = 1, 2, \ldots, N. \]
Finally, the centroid condition indicates that the codevector c_n can be derived from the average of all the training vectors in its Voronoi region S_n:
\[ c_n = \frac{\sum_{X_m \in S_n} X_m}{\sum_{X_m \in S_n} 1}, \quad n = 1, 2, \ldots, N, \]
where each Voronoi region is defined as
\[ V_i = \left\{ x \in \mathbb{R}^k : \| x - c_i \| \le \| x - c_j \| \ \text{for all } j \ne i \right\}; \]
the training vectors falling into a particular region are approximated by a red dot associated with that region (Fig. 1).
Fig. 1. Two-dimensional (2D) vector quantization. (For interpretation of the references to color in this figure, the reader is referred to the web version of this article.)
With the codevectors in place of the original training vectors, the SVM decision function and its dual problem become:
\[ f(x) = \operatorname{sgn}\!\left( \sum_{i=1}^{\ell} y_i a_i\, k(x, c_i) + b \right), \]
\[ \text{maximize } W(a) = \sum_{i=1}^{\ell} a_i - \frac{1}{2} \sum_{i,j=1}^{\ell} a_i a_j y_i y_j\, k\big(Q_i(x), Q_j(x)\big), \]
subject to \( a_i \ge 0,\ i = 1, \ldots, \ell, \) and \( \sum_{i=1}^{\ell} a_i y_i = 0. \)
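Alternating the nearest-neighbor and centroid conditions yields the classic codebook-design procedure (Lloyd's algorithm). The following minimal NumPy sketch, with hypothetical synthetic data, shows one way this iteration can be implemented:

```python
import numpy as np

def lloyd_vq(X, n_codevectors, n_iters=50, seed=0):
    """Alternate the nearest-neighbor and centroid conditions (Lloyd's algorithm)."""
    rng = np.random.default_rng(seed)
    # initialize the codebook with randomly chosen training vectors
    C = X[rng.choice(len(X), n_codevectors, replace=False)]
    assign = np.zeros(len(X), dtype=int)
    for _ in range(n_iters):
        # nearest-neighbor condition: assign each vector to its closest codevector
        d2 = ((X[:, None, :] - C[None, :, :]) ** 2).sum(axis=2)
        assign = d2.argmin(axis=1)
        # centroid condition: each codevector becomes the mean of its region
        for n in range(n_codevectors):
            members = X[assign == n]
            if len(members) > 0:
                C[n] = members.mean(axis=0)
    return C, assign

# Two well-separated synthetic clouds; the codebook should land near their means
X = np.vstack([np.random.default_rng(1).normal(0.0, 0.1, (50, 2)),
               np.random.default_rng(2).normal(5.0, 0.1, (50, 2))])
C, assign = lloyd_vq(X, 2)
print(np.sort(C[:, 0]))
```

Each iteration first enforces the nearest-neighbor condition (re-partitioning the data into Voronoi regions) and then the centroid condition (recomputing each codevector as its region's mean), monotonically reducing the average distortion D_ave.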
Shallow learning models (e.g., SVM, MLP, and GMM) have been widely used in the literature to solve simple or well-constrained problems. However, their limited modeling and representational power does not support their use in solving more complex problems, such as natural language problems. In 2006, so-called deep learning (a.k.a. representation learning) emerged as a new area of ML research [43–45] that exploits multiple layers of information processing in a hierarchical architecture for pattern classification and/or representation learning (e.g., feed-forward neural networks) [46]. The emergence of deep learning is largely attributed to the drastically increased chip-processing abilities, the lowered cost of computing hardware, and the recent advances in ML.
\[ P_j = \frac{\exp(x_j)}{\sum_k \exp(x_k)}, \]
where P_j represents the class probability and x_j and x_k represent the total input to units j and k respectively. The cross entropy is defined as follows:
\[ C = -\sum_j d_j \log(p_j), \]
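The softmax output layer and its cross-entropy loss can be computed in a few lines. This sketch uses hypothetical logits and a one-hot target; the max-shift inside the softmax is a standard trick to avoid numerical overflow:

```python
import numpy as np

def softmax(x):
    # numerically stable: shift by the max before exponentiating
    e = np.exp(x - np.max(x))
    return e / e.sum()

def cross_entropy(d, p, eps=1e-12):
    # C = -sum_j d_j * log(p_j); eps guards against log(0)
    return -np.sum(d * np.log(p + eps))

logits = np.array([2.0, 1.0, 0.1])   # hypothetical total inputs x_j to the output units
p = softmax(logits)
target = np.array([1.0, 0.0, 0.0])   # one-hot desired distribution d
print(float(cross_entropy(target, p)))
```

Because the softmax probabilities sum to one, the cross entropy reduces here to the negative log-probability assigned to the correct class.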
[18] EMR, E. News, Dell launches new cloud-based services for hospitals and
physician practices, online, http://www.emrandhipaa.com/news/2011/02/21/
dell-launches-new-cloud-based-services-for-hospitals-and-physician-practice,
accessed on 3rd April 2011.
[19] A.M. McPherson, et al., VISTA: project status, in: Proc. SPIE 6267, Ground-based and Airborne Telescopes, 626707, 2006, http://spie.org/Publications/Proceedings/Paper/10.1117/12.671352.
[20] The dark energy survey, online, http://www.darkenergysurvey.org/, accessed on
20th April 2011.
[21] The sloan digital sky survey, http://www.sdss.org/, accessed on 20th April 2011.
[22] N. Kaiser, et al., The Pan-STARRS survey telescope project, Bull. Am. Astron. Soc. 37 (2007) 1409.
[23] SpaceRef, LSST receives $30 million from Charles Simonyi and Bill Gates, online, http://www.spaceref.com/news/viewpr.html?pid=24409, accessed on 25th April 2011.
[24] M. Marwah, A. Shah, C. Bash, C. Patel, N. Ramakrishnan, Using data mining to help design sustainable products, Computer 44 (8) (2011) 103–106.
[25] N. Sundaravaradan, D. Patnaik, N. Ramakrishnan, M. Marwah, A. Shah, Discovering life cycle assessment trees from impact factor databases, in: AAAI, 2011.
[26] N. Sundaravaradan, M. Marwah, A. Shah, N. Ramakrishnan, Data mining approaches for life cycle assessment, in: IEEE ISSST, vol. 11, 2011.
[27] D. Patnaik, M. Marwah, R.K. Sharma, N. Ramakrishnan, Data mining for modeling chiller systems in data centers, in: Advances in Intelligent Data Analysis, vol. IX, Springer, 2010, pp. 125–136.
[28] P. Yang, Y. Hwa Yang, B.B. Zhou, A.Y. Zomaya, A review of ensemble methods in bioinformatics, Current Bioinform. 5 (4) (2010) 296–308.
[29] D. Opitz, R. Maclin, Popular ensemble methods: an empirical study, J. Artif. Intell. Res. 11 (1999) 169–198.
[30] R. Polikar, Ensemble based systems in decision making, IEEE Circuits Syst. Mag. 6 (3) (2006) 21–45.
[31] Y. Tang, Y. Wang, K. Cooper, L. Li, Towards big data Bayesian network learning: an ensemble learning based approach, in: 2014 IEEE International Congress on Big Data (BigData Congress), June 2014, pp. 355–357.
[32] Wikipedia, Ensemble in numerical weather prediction, online, http://
en.wikipedia.org/wiki/Ensemble_forecasting, accessed on 23rd Feb. 2011.
[33] J.A.R. Blais, Estimation and Spectral Analysis, Univ. Calgary Press, Canada, 1988.
[34] C. Elkan, Naive Bayesian learning, Department of Computer Science and Engineering, University of California, San Diego, 1997.
[35] T. Joachims, Training linear SVMs in linear time, in: Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM, 2006, pp. 217–226.
[36] P.D. Yoo, Y.S. Ho, B.B. Zhou, A.Y. Zomaya, SiteSeek: post-translational modification analysis using adaptive locality-effective kernel methods and new profiles, BMC Bioinform. 9 (1) (2008) 272.
[37] L. Bottou, V. Vapnik, Local learning algorithms, Neural Comput. 4 (6) (1992) 888–900.
[38] N. Gilardi, S. Bengio, Local machine learning models for spatial data analysis, J. Geogr. Inform. Dec. Anal. 4 (2000) 11–28.
[39] H. Zhang, A.C. Berg, M. Maire, J. Malik, SVM-KNN: discriminative nearest neighbor classification for visual category recognition, in: 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, vol. 2, IEEE, 2006, pp. 2126–2136.
[40] K. Lau, Q. Wu, Local prediction of non-linear time series using support vector regression, Pattern Recognit. 41 (5) (2008) 1539–1547.
[41] P.D. Yoo, A.R. Sikder, J. Taheri, B.B. Zhou, A.Y. Zomaya, DomNet: protein domain boundary prediction using enhanced general regression network and new profiles, IEEE Trans. Nanobiosci. 7 (2) (2008) 172–181.
[42] A. Gersho, R.M. Gray, Vector Quantization and Signal Compression, Springer
Science & Business Media, 1992.
[43] G. Hinton, S. Osindero, Y.-W. Teh, A fast learning algorithm for deep belief nets, Neural Comput. 18 (7) (2006) 1527–1554.
[44] Y. Bengio, A. Courville, P. Vincent, Representation learning: a review and new perspectives, IEEE Trans. Pattern Anal. Mach. Intell. 35 (8) (2013) 1798–1828.
[45] X.-W. Chen, X. Lin, Big data deep learning: challenges and perspectives, IEEE Access 2 (2014) 514–525.
[46] L. Deng, A tutorial survey of architectures, algorithms, and applications for deep learning, APSIPA Trans. Signal Inform. Process. 3 (2014) e2.
[47] G.E. Hinton, R.R. Salakhutdinov, Reducing the dimensionality of data with neural networks, Science 313 (5786) (2006) 504–507.
[48] Y. Lv, Y. Duan, W. Kang, Z. Li, F.-Y. Wang, Traffic flow prediction with big data: a deep learning approach, IEEE Trans. Intell. Transp. Syst. PP (99) (2014) 1–9.
[49] J. Ngiam, A. Coates, A. Lahiri, B. Prochnow, Q.V. Le, A.Y. Ng, On optimization methods for deep learning, in: Proceedings of the 28th International Conference on Machine Learning (ICML-11), 2011, pp. 265–272.
[50] L. Deng, D. Yu, Deep convex net: a scalable architecture for speech pattern classification, in: Proceedings of the Interspeech, 2011.
[51] D.H. Wolpert, Stacked generalization, Neural Netw. 5 (2) (1992) 241–259.
[52] K.P. Murphy, Machine Learning: A Probabilistic Perspective, MIT Press, 2012.
[53] A. Baldominos, E. Albacete, Y. Saez, P. Isasi, A scalable machine learning online service for big data real-time analysis, in: 2014 IEEE Symposium on Computational Intelligence in Big Data (CIBD), Dec. 2014, pp. 1–8.
[54] H. Huang, H. Liu, Big data machine learning and graph analytics: current state and future challenges, in: 2014 IEEE International Conference on Big Data (Big Data), Oct. 2014, pp. 16–17.
[55] A. Bifet, G.D.F. Morales, Big data stream learning with SAMOA, in: 2014 IEEE International Conference on Data Mining Workshop (ICDMW), Dec. 2014, pp. 1199–1202.
[56] Apache, Hadoop, http://hadoop.apache.org.
[57] R. Gu, X. Yang, J. Yan, Y. Sun, B. Wang, C. Yuan, Y. Huang, SHadoop: improving MapReduce performance by optimizing job execution mechanism in Hadoop clusters, J. Parallel Distrib. Comput. 74 (3) (2014) 2166–2179.
[58] L. Ding, J. Xin, G. Wang, S. Huang, ComMapReduce: an improvement of MapReduce with lightweight communication mechanisms, in: Database Systems for Advanced Applications, Springer, 2012, pp. 150–168.
[59] M. Isard, M. Budiu, Y. Yu, A. Birrell, D. Fetterly, Dryad: distributed data-parallel programs from sequential building blocks, in: ACM SIGOPS Operating Systems Review, vol. 41, ACM, 2007, pp. 59–72.
[60] R. Power, J. Li, Piccolo: building fast, distributed programs with partitioned tables, in: OSDI, vol. 10, 2010, pp. 1–14.
[61] C.P. Chen, C.Y. Zhang, Data-intensive applications, challenges, techniques and technologies: a survey on big data, Inf. Sci. 275 (2014) 314–347.
[62] Q.V. Le, Building high-level features using large scale unsupervised learning, in: 2013 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), IEEE, 2013, pp. 8595–8598.
[63] K. Zhang, X. Chen, Large-scale deep belief nets with MapReduce, 2014.