Lab Guide
Summer 2014
This Certified Training Services Partner Program Guide (the Program Guide) is protected under
U.S. and international copyright laws and is the exclusive property of MapR Technologies, Inc.
© 2014 MapR Technologies, Inc. All rights reserved.
Contents

Administration of Hadoop Lab Guide
Get Started
    Get Started 1: Set up a lab environment in Amazon Web Services (AWS)
        Lab Procedure
            Create an AWS Account
            Configure Virtual Private Cloud (VPC) Networking
            Create AWS Virtual Machine Instances for Hadoop Installation
            Create an AWS VM Instance for NFS Access
            Log in to AWS Nodes
            Managing your Nodes
            Terminating Your Instances and EBS Storage
    Get Started 2: Set up passwordless ssh access between nodes
    Get Started 3: Log into the class cluster
        Lab Procedure
    Get Started 4: Explore the MapR Control System
        Lab Procedure
            Log on and explore different views of the cluster
            Identify Specific Characteristics of Your Cluster
        Conclusion
            Lessons Learned
Get Started
Get Started 1: Set up a lab environment in Amazon Web
Services (AWS)
This setup procedure shows you how to create your lab environment in AWS for the MapR
Hadoop Operations on-demand training. For a classroom or virtual instructor-led training
session, these AWS environments will already be set up for you, and your instructor will give
you further instructions on how to access your lab environment. Follow the steps below in
order to properly set up the AWS lab environment.
Lab Procedure
Create an AWS Account
You need an account on Amazon Web Services. If you already have an AWS account, you can
skip this task. Note that you will need to provide your email address, billing information
(credit card), and a phone number at which you can be contacted in order to create the
account.
1. Point your Web browser to http://aws.amazon.com
2. Click the "Sign Up" button at the top right-hand side of the Web page
3. Select the "I am a new user" radio button
4. Type your email address in the "My e-mail address is:" text field
5. Click the "Sign in using our secure server" button
6. Fill out the "Login Credentials" Web form and click the "Continue" button
7. Fill out the "Contact Information" Web form and click the "Create Account and Continue" button
8. Fill out the "Payment Information" Web form and click the "Continue" button
9. Fill out the "Identity Verification" Web form and click the "Call Me Now" button. Once you reply to the phone call using your 4-digit code from this Web form, click the "Continue to select your Support Plan" button
10. Fill out the "Support Plan" Web form. Note that you will not need support services from Amazon in order to run the labs in this class. Click the "Continue" button
11. Your AWS account is now provisioned and you can begin setting up the virtual machines for your class
Configure Virtual Private Cloud (VPC) Networking
AWS provides two types of network configurations: VPC and "classic". This lab guide has been
written using the recommended VPC network. The configuration steps are below.
1. Point your Web browser to http://aws.amazon.com
2. Select "AWS Management Console" from the "My Account / Console" drop-down list
3. Type your email address in the "My e-mail address is:" text field. Select the "I am a returning user and my password is:" radio button. Click the "Sign in using our secure server" button
4. In the "Compute & Networking" section of your AWS management console, click the "VPC" link
5. In the "Virtual Private Cloud" section of your navigation pane, click "Your VPCs"
6. Click the "Create VPC" button and fill out the Web form as follows:
    a. Name tag: mapr-odt-vpc
    b. CIDR block: 10.0.0.0/16
    c. Tenancy: Default
    d. Click the "Yes, Create" button
7. In the "Virtual Private Cloud" section of your navigation pane, click "Subnets"
8. Click the "Create Subnet" button and fill out the Web form as follows:
    a. Name tag: mapr-odt-subnet
    b. VPC: mapr-odt-vpc
    c. Availability Zone: No Preference
    d. CIDR block: 10.0.0.0/24
    e. Click the "Yes, Create" button
9. Select the "mapr-odt-subnet" checkbox and click the "Modify Auto-Assign Public IP" button, then:
    a. Select the "Enable auto-assign Public IP" checkbox
    b. Click the "Save" button
10. In the "Virtual Private Cloud" section of your navigation pane, click "Route Tables"
11. Click the "Create Route Table" button and fill out the Web form as follows:
    a. Name tag: mapr-odt-routes
    b. VPC: mapr-odt-vpc
    c. Click the "Yes, Create" button
12. In the "Virtual Private Cloud" section of your navigation pane, click "Internet Gateways"
13. Click the "Create Internet Gateway" button and fill out the Web form as follows:
    a. Name tag: mapr-odt-gw
    b. Click the "Yes, Create" button
    c. Select the checkbox next to the "mapr-odt-gw" object and click the "Attach to VPC" button
    d. Select "mapr-odt-vpc" from the "VPC" drop-down list and click the "Yes, Attach" button
14. In the "Virtual Private Cloud" section of your navigation pane, click "Route Tables"
15. Select the "mapr-odt-routes" object, select the "Routes" tab, and click the "Edit" button. Fill out the Web form as follows:
    a. Destination: 0.0.0.0/0
    b. Target: mapr-odt-gw
    c. Click the "Save" button
16. In the "Virtual Private Cloud" section of your navigation pane, click "Subnets"
17. Select the "mapr-odt-subnet" object and select the "Route Table" tab. Click the "Edit" button and fill out the form as follows:
    a. Select the "Change To" drop-down list and select "mapr-odt-routes"
    b. Click the "Save" button
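For reference, the same VPC setup can also be scripted with the AWS command-line interface instead of the console. The sketch below is not part of the official lab steps; it assumes the AWS CLI is installed and configured with your account credentials, and it captures the IDs returned by each call into shell variables:

# Create the VPC and subnet (names and CIDR blocks match the console steps above)
VPC_ID=$(aws ec2 create-vpc --cidr-block 10.0.0.0/16 --query 'Vpc.VpcId' --output text)
aws ec2 create-tags --resources "$VPC_ID" --tags Key=Name,Value=mapr-odt-vpc
SUBNET_ID=$(aws ec2 create-subnet --vpc-id "$VPC_ID" --cidr-block 10.0.0.0/24 --query 'Subnet.SubnetId' --output text)
aws ec2 modify-subnet-attribute --subnet-id "$SUBNET_ID" --map-public-ip-on-launch

# Internet gateway and default route out of the VPC
IGW_ID=$(aws ec2 create-internet-gateway --query 'InternetGateway.InternetGatewayId' --output text)
aws ec2 attach-internet-gateway --internet-gateway-id "$IGW_ID" --vpc-id "$VPC_ID"
RTB_ID=$(aws ec2 create-route-table --vpc-id "$VPC_ID" --query 'RouteTable.RouteTableId' --output text)
aws ec2 create-route --route-table-id "$RTB_ID" --destination-cidr-block 0.0.0.0/0 --gateway-id "$IGW_ID"
aws ec2 associate-route-table --route-table-id "$RTB_ID" --subnet-id "$SUBNET_ID"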
Create AWS Virtual Machine Instances for Hadoop Installation
    b. US West (Oregon)
    c. US West (N. California)
    d. EU (Ireland)
    e. Asia Pacific (Singapore)
    f. Asia Pacific (Tokyo)
    g. Asia Pacific (Sydney)
    h. South America (Sao Paulo)
6. In the "INSTANCES" section of the navigation pane on the left-hand side of the Web page, click the "Instances" link
7. Click the "Launch Instance" button
8. In the "Step 1: Choose an Amazon Machine Image" Web page, scroll down to the bottom of the page and select the 64-bit version of an image of Red Hat v6.4 or 6.5. Note: Red Hat 7.0 is NOT currently supported.
9. In the "Step 2: Choose an Instance Type" Web page, select the checkbox for the "m3.large" type and click the "Next: Configure Instance Details" button
10. In the "Step 3: Configure Instance Details" Web page, fill out the form as follows:
    a. Number of instances: 3
    b. Purchasing option: leave "Request Spot Instances" unchecked
    c. Network: mapr-odt-vpc
    d. Subnet: mapr-odt-subnet
    e. Auto-assign Public IP: enable
    f. IAM role: None
    g. Shutdown behavior: Stop
    h. Enable termination protection: check the "Protect against accidental termination" checkbox
    i. Monitoring: leave "Enable CloudWatch detailed monitoring" unchecked
    j. Tenancy: "Shared tenancy (multi-tenant hardware)"
    k. Click the "Next: Add Storage" button
11. In the "Step 4: Add Storage" Web page:
    a. Click the "Add New Volume" button
    b. Leave all the defaults except check the "Delete on termination" checkbox
    c. Repeat the above steps 2 more times to add a total of 3 EBS volumes to your instances
    d. Click the "Next: Tag Instance" button
12. In the "Step 5: Tag Instance" Web page, type "mapr-install-node" in the "Value" field and click the "Next: Configure Security Group" button
13. In the "Step 6: Configure Security Group" Web page, select the "Create a new security group" radio button, type "mapr-sg" in the "Security group name:" field, and perform the following steps:
    a. Click the "Add Rule" button
    b. Select "All TCP" from the "Type" drop-down list and select "Anywhere" from the "Source" drop-down list
    c. Click the "Add Rule" button
    d. Select "All UDP" from the "Type" drop-down list and select "Anywhere" from the "Source" drop-down list
    e. Click the "Add Rule" button
    f. Select "All ICMP" from the "Type" drop-down list and select "Anywhere" from the "Source" drop-down list
    g. Click the "Review and Launch" button
14. In the "Step 7: Review Instance Launch" Web page, review your instance launch details and click the "Launch" button
15. In the "Select an existing key pair or create a new key pair" pop-up window, perform one of the following steps:
    a. Select "Create a new key pair" and type "mapr-odt-keypair" in the "Key pair name" text field. Click the "Download Key Pair" button.
    OR
    b. Select "Select an existing key pair" and select the key pair from the "Key pair name" drop-down list
IMPORTANT NOTE: make sure you save a copy of the new or existing key pair file in a location where you can reference it throughout your training. If you lose this file, you will lose access to your AWS instances and will have to create new ones.
16. Click the "Launch Instances" button
17. In the "Launch Status" Web page, click the "View Instances" button
18. Wait for the instances to get to the "running" state and for the status checks to complete
19. Log the IP addresses of the VMs for use later.
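If you prefer to script the launch, an equivalent AWS CLI call might look like the sketch below. The AMI ID is a placeholder for a 64-bit Red Hat 6.4/6.5 image in your region, the security group and subnet IDs are the ones created earlier, and the 10 GB volume size is an assumption; treat this as an illustrative sketch rather than part of the official lab steps:

# Launch three m3.large instances with three extra EBS volumes each
aws ec2 run-instances \
    --image-id ami-xxxxxxxx \
    --count 3 \
    --instance-type m3.large \
    --key-name mapr-odt-keypair \
    --security-group-ids <mapr-sg-id> \
    --subnet-id <mapr-odt-subnet-id> \
    --associate-public-ip-address \
    --block-device-mappings '[
      {"DeviceName":"/dev/sdb","Ebs":{"VolumeSize":10,"DeleteOnTermination":true}},
      {"DeviceName":"/dev/sdc","Ebs":{"VolumeSize":10,"DeleteOnTermination":true}},
      {"DeviceName":"/dev/sdd","Ebs":{"VolumeSize":10,"DeleteOnTermination":true}}]'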
Create an AWS VM Instance for NFS Access
2. Select "AWS Management Console" from the "My Account / Console" drop-down list
3. Type your email address in the "My e-mail address is:" text field. Select the "I am a returning user and my password is:" radio button. Click the "Sign in using our secure server" button
4. In the "Compute & Networking" section of your AWS management console, click the "EC2" link
5. In the "INSTANCES" section of the navigation pane on the left-hand side of the Web page, click the "Instances" link
6. Click the "Launch Instance" button
7. In the "Step 1: Choose an Amazon Machine Image" Web page, scroll down to the bottom of the page and select the 64-bit version of an image of Red Hat v6.4 or 6.5. Note: Red Hat 7.0 is NOT currently supported.
8. In the "Step 2: Choose an Instance Type" Web page, select the checkbox for the "t1.micro" type and click the "Next: Configure Instance Details" button
9. In the "Step 3: Configure Instance Details" Web page, fill out the form as follows:
    a. Number of instances: 1
    b. Purchasing option: leave "Request Spot Instances" unchecked
    c. Network: mapr-odt-vpc
    d. Subnet: mapr-odt-subnet
    e. Auto-assign Public IP: enable
    f. IAM role: None
    g. Shutdown behavior: Stop
    h. Enable termination protection: select the "Protect against accidental termination" checkbox
    i. Monitoring: leave "Enable CloudWatch detailed monitoring" unchecked
    j. Tenancy: "Shared tenancy (multi-tenant hardware)"
    k. Click the "Next: Add Storage" button
10. In the "Step 4: Add Storage" Web page, click the "Next: Tag Instance" button
11. In the "Step 5: Tag Instance" Web page, type "MapR-NFS-node" in the "Value" field and click the "Next: Configure Security Group" button
12. In the "Step 6: Configure Security Group" Web page:
    a. Select the "Select an existing security group" radio button
    b. Select the "mapr-sg" checkbox
    c. Click the "Review and Launch" button
13. In the "Step 7: Review Instance Launch" Web page, review your instance launch details and click the "Launch" button
14. In the "Select an existing key pair or create a new key pair" pop-up window:
    a. Select "Select an existing key pair"
    b. Select the "mapr-odt-keypair" key pair from the "Key pair name" drop-down list
    c. Click the "I acknowledge that I have access to the selected private key file (name), and that without this file, I won't be able to log into my instance" checkbox
REMINDER: You must keep a copy of this key file in a location where you can reference it throughout your training. If you lose this file, you will lose access to your AWS instances and will have to create new ones.
15. Click the "Launch Instances" button
16. In the "Launch Status" Web page, click the "View Instances" button
17. Wait for the instance to get to the "running" state and for the status checks to complete
Log in to AWS Nodes
$ passwd mapr
then type the password for the mapr user when prompted
8. Set the root user password:
$ passwd root
then type the password for the root user when prompted
9. Allow password authentication to the VM:
$ vi /etc/ssh/sshd_config
Change PasswordAuthentication no to PasswordAuthentication yes, then save and exit vi.
10. Repeat steps 6-8 for all VM instances and log the hostname of each instance.
Now you have root access on your RHEL virtual machine instance, and you can proceed with
the MapR Hadoop Operations labs.
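If you prefer to script this node preparation, the sketch below captures the numbered steps above for a single node. It is illustrative only; it assumes you are logged in as root on a RHEL 6 instance, and it adds an sshd restart, which is not spelled out in the steps above but is needed for the PasswordAuthentication change to take effect:

# Set passwords for the mapr and root users (each command prompts interactively)
passwd mapr
passwd root

# Enable password authentication over ssh, then restart sshd to apply the change
sed -i 's/^PasswordAuthentication no/PasswordAuthentication yes/' /etc/ssh/sshd_config
service sshd restart

# Record this node's hostname for later steps
hostname >> /root/node-hostnames.txt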
Managing your Nodes
To restart the instances, repeat these steps and select Start in step 5. Remember, you should
check the Public IP settings of your VMs and note any changes to your IP addresses. The
internal IP addresses will remain consistent, so the passwordless ssh and Hadoop software will
still function normally.
Get Started 3: Log into the class cluster
Lab Procedure
Windows and Unix users use these initial instructions:
1. Download PuTTY if you are on a Windows machine:
http://www.chiark.greenend.org.uk/~sgtatham/putty/download.html
2. Connect to the MCS for your cluster. Go to the class doc link provided for this class.
(Example: http://doc.mapr.com/display/SE/San+Jose+2013Oct)
3. For Unix users, do the following in a terminal window:
ssh -i students07172012.pem ec2-user@<ec2-54-219-84-67.us-west-1.compute.amazonaws.com>
Note: be sure to use the DNS-resolvable names as indicated on your class webpage, or use an
outside IP address to get to the Amazon node. In this example it would be 54.219.84.67
4. For Windows users, take the following steps in a PuTTY window. Before clicking the OPEN
button, install the PPK key file; see the following images:
5. Log in as ec2-user; password = mapr
6. To become root, use: sudo -i
7. To become user02, use the su command
8. To log in to your MapR GUI, point your browser to the URL of the first node in your class
cluster. Don't forget to use https://<node-url>:8443 -- this is the port the MCS listens on for
MCS sessions
Click "I understand the risk", then add an exception. Confirm the security exception in the
popup window.
9. Log in as mapr; the password is mapr. You should see a view of the MCS GUI upon
successful login.
Get Started 4: Explore the MapR Control System
Lab Procedure
Log on and explore different views of the cluster
1. Connect to the MCS for your cluster.
2. From the Dashboard, identify node icons on the cluster heat map. Notice the designation
for racks, if shown.
3. From the Dashboard, discover the different views of the cluster, including:
    Health
    CPU utilization
    Memory utilization
    Disk space utilization
4. Find how to display the legend for each view.
5. In the Navigation pane, change the view from Dashboard to Nodes. You should see
something like this:
Identify Specific Characteristics of Your Cluster
9. CPU utilization?
10. Memory utilization?
11. What % of disk space is being used by the entire cluster?
12. Are any MapReduce jobs currently running on the cluster?
Conclusion
The MapR Control System is a convenient way to access, monitor, and perform administrative
tasks on your cluster.
Lessons Learned
Different filters can be applied to see different views of the cluster, including health,
CPU/memory utilization, and disk space utilization.
The left navigation bar directs you to different features or components of the system.
Lesson 1: Pre-install
Lab Overview
In this lesson you will learn where to download a collection of tools and scripts that we will
use to prepare the cluster hardware for the parallel execution of tests, and then to test and
measure the performance of the hardware components in our cluster to determine that they
are functioning properly and within the specifications for a Hadoop installation. We will also
identify the current firmware for each of the new hardware components in the cluster, and
update these components to make sure that they have matching firmware.
Lab Procedures
Lab 1.1: Pre-install validation
Note: One of the most common causes for a failure when installing Hadoop is that the hardware
is not within the necessary specifications. You can see a list of the current hardware and OS
specifications at: http://doc.mapr.com/display/MapR/Preparing+Each+Node
The Professional Services team at MapR has developed a collection of all of the tools and scripts
that we will need to validate our hardware and prepare it for installation.
1. Download the cluster-validation package onto your master node from:
https://github.com/jbenninghoff/cluster-validation/archive/master.zip
Extract master.zip and move the pre-install and post-install folders directly under /root for
simplicity.
2. Here, we will find two directories, pre-install and post-install. We will use the tools and
scripts inside the pre-install directory to validate our new hardware prior to installing
Hadoop. We will use tools and scripts in the post-install later, to test our new cluster
after we have completed our install.
Note: The tools and files in this collection are updated frequently, so we should always make
sure we download the latest package when preparing for a new Hadoop installation.
3. To prepare the cluster for these validation tests, choose one node on the cluster to be
your set up master node. Generate ssh keys on this node, and make sure that it has
passwordless ssh access to all other nodes on the cluster. You can find steps for how to
do this in your lab guide at the end of this guide.
4. Inside the pre-install directory is a clustershell rpm. Install this rpm on the master node
with passwordless ssh access to the rest of our cluster. We will be issuing all further
commands for this exercise from this master node, using clush to propagate those
commands throughout the rest of our hardware.
5. Once installed, update the /etc/clustershell/groups file to include an entry for all, followed
by the host names of the nodes we will use, such as:
all: node[0-19]
6. Once we have our node names listed, copy the /root/pre-install directory to all of our node
hardware (see the command sketch after step 8 below).
7. When that is complete, type the following to confirm that all of the nodes have a copy of the package:
# clush -Ba ls /root/pre-install
8. After we have a copy of the pre-install package on all nodes, we are ready to start our
hardware validation tests. First, we will run an audit of our hardware to see exactly what we
have on each node, and to verify that they all have a similar configuration. To run the
cluster-audit.sh script, type:
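Consult the README shipped with the cluster-validation package for the exact invocations; a plausible form of the copy and audit commands, assuming the clustershell group all is configured as in step 5, is:

# Copy the pre-install tools from the master node to every node in the "all" group
clush -a --copy /root/pre-install --dest /root

# Run the hardware audit on all nodes and collect the merged output
clush -aB /root/pre-install/cluster-audit.sh | tee cluster-audit.log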
Note that the audit output will give us deltas when looking at things like the RAM. It will tell
us the total amount of RAM, the number of slots, and the types of DIMMs found, but it will not
tell us which exact DIMMs are in which slots. Also, if only one DIMM type is listed, then all
slots have the same DIMM type.
This tests the memory performance of the cluster. The exact bandwidth of memory is
highly variable and is dependent on the speed of the DIMMs, the number of memory
channels and to a lesser degree, the CPU frequency.
4. Evaluate the raw disk performance. The disk-test.sh script will run IOzone on our
hard drives to test their performance.
Note: This process is destructive to any existing data, so make sure the drives do not have any
needed data on them, and that you do not run this test after you have installed MapR Hadoop
on the cluster.
Type:
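A plausible form of these commands follows; disk-test.sh is named in step 4 above, while the memory test script name is an assumption, so verify both against the README in the pre-install package before running (and remember that disk-test.sh destroys any existing data on the drives):

# Measure memory bandwidth on every node
clush -aB /root/pre-install/memory-test.sh | tee memory-test.log

# Run IOzone against the raw data drives on every node (destructive)
clush -aB /root/pre-install/disk-test.sh | tee disk-test.log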
Conclusion
Now that we have run all of our hardware tests, and compiled benchmarks for all of our
components, we have one final task to prepare our new hardware for installation.
The firmware for the new hardware must be up to date with vendor specifications and match
across each of the nodes of the same type. The BIOS versions and settings must also match for
similar nodes. In addition, the firmware for the management interfaces needs to be the same
on each of these nodes. Any other hardware components that we may have in our system, such
as NICs or onboard RAID controllers also need to have updated and matching firmware.
We will need to refer to the manual for each node vendor that we are including, and update the
firmware and BIOS according to their specifications. If there is a discrepancy in our BIOS or
firmware between nodes from the same vendor, then we can see inconsistent performance
across nodes.
Lab Procedure
Install a MapR cluster using the map-installer on the AWS environment
Note: Check the following requirements prior to installation:
1. Log into the master node of your cluster as described above, or as described by your
instructor.
2. Navigate to the /home/mapr directory:
$ cd /home/mapr
3. Download the mapr-setup package:
$ wget http://package.mapr.com/releases/v.3.1.1/<yourLinuxOS>/mapr-setup
MapReduce = true
YARN = false
HBase = false
M7 = true
ControlNodesAsDataNodes = true
WirelevelSecurity = false
LocalRepo = false

[Defaults]
ClusterName = <your_Team#_cluster>
User = mapr
Group = mapr
Password = mapr
UID = 2000
GID = 2000
Disks =
CoreRepoURL = http://package.mapr.com/releases
EcoRepoURL = http://package.mapr.com/releases/ecosystem
Version = <3.1.0>
MetricsDBHost = <node1 of classcluster_if setup_by_instructor>
MetricsDBUser = <mapr>
MetricsDBPassword = <mapr>
MetricsDBSchema = <metrics[1-6]>
[root@ip-10-170-125-38 bin]# bash /opt/mapr-installer/bin/install --help
Verifying install pre-requisites
  updating package cache...
  installing pre-requisite openssl098e
  installing pre-requisite sshpass
  ...
verified
======================================================================
                            MapR Installer
======================================================================
Version: 2.0.135
usage: mapr-install.py [-h] [-s] [-U SUDO_USER] [-u REMOTE_USER]
  --debug
  --password REMOTE_PASS
  --private-key PRIVATE_KEY_FILE
  --quiet
  --skip-checks
  --sudo-password SUDO_PASS
  -K, --ask-sudo-pass
  -h, --help
  -k, --ask-pass
  -s, --sudo
OR
B. If you are using a config file, run the installer to determine whether the parameters you
have specified are correct:
$ sudo /opt/mapr-installer/bin/install -K -s --cfg config.example \
    --private-key <yourPEMkey> -u ec2-user -U root --debug new
10. In the summary response area, choose (a)bort after examining your parameters.
11. Rerun the installer with the --quiet argument for non-interactive mode, and with & to
background the installer in case the window is lost or the laptop goes into hibernate mode.
This time, select (c) to continue with the install after reviewing the parameters.
A. $ sudo /opt/mapr-installer/bin/install -K -s --private-key <yourPEMkey> -u ec2-user -U root --debug --quiet new &
OR
B. $ sudo /opt/mapr-installer/bin/install --cfg config.example \
    --private-key students07172012.pem -u ec2-user -s -U root --debug --quiet new &
Note: View details about installing on an OS other than Red Hat, or more options for custom
installation, at:
http://www.mapr.com/doc/display/MapR/Preparing+Packages+and+Repositories
http://www.mapr.com/doc/display/MapR/Installing+MapR+Software
The administrative user who should be given full permission is mapr, and the user password is
mapr. When registering your cluster, select an M7 Trial license. Also, be sure to apply your M7
license before you close the License Management dialog.
12. Watch the installation process and look for the various packages being installed. After the
control nodes have been installed (usually 20-30 min), log into the MCS by pointing your
browser to the IP address of one of the control nodes, at port 8443:
https://ControlNodeIP:8443/
13. Accept the MapR agreement, and select the licenses link in the upper right corner.
14. Apply the temporary M7 license received when registering for the course. If you do not
have a temporary license, contact training@mapr.com or ask your instructor if you are taking
a classroom or virtual training class.
15. After you have successfully applied a trial license, you may notice that some of the nodes
in the cluster have orange icons in the heatmap, indicating that they have degraded service.
16. As the installer continues to install packages, and the warden service starts the services on
each node, we will begin to see the nodes turn green. Eventually all of the nodes will be green,
indicating that all nodes are active and healthy.
Conclusion
Discussion
1. Once you see that the cluster is active, try exploring the MCS by clicking on the different
links in the Navigation pane and on the Dashboard. What will you be able to monitor once
you begin to use your cluster?
2. What would your next step be after installing the cluster?
Lesson 3: Post-install
Lab Overview
If you remember, the package that we downloaded in our pre-install lesson contained a
post-install directory. That directory contains all of the tools and scripts that we need to run
post-install benchmarks to make sure our new cluster is performing as expected.
First, we will test the drive throughput. As with our pre-install tests, we will use clush to push
this test to all of the nodes on our cluster.
Lab Procedures
3.1 Run RWSpeedTest
1. Log into the master node that we used for our pre-install tests and navigate to the
directory /root/post-install. In here we will find the file runRWSpeedTest.sh.
2. Note: This script uses an HDFS API to stress-test the I/O subsystem. The output provides
an estimate of the maximum throughput the I/O subsystem can deliver. To begin the test,
type:
# clush -Ba /root/post-install/runRWSpeedTest.sh | tee
RWSpeedTest.log
3. After we run RWSpeedTest, we can compare our results to our pre-install IOzone tests.
We should expect to see similar results, within 10-15% of the pre-installation test.
3.2 TeraGen/TeraSort
Teragen is a map/reduce program that will generate 1GB of synthetic data, and Terasort
samples this data and uses map/reduce to sort it into a total order. These two tests together
will challenge the upper limits of our cluster's performance.
1. Type:
# maprcli volume create -name data1 -replication 1 -mount 1 -path /root/data1
# mkdir data1/out1
# mkdir data1/out2
3. This will create 1TB worth of small number data. Once teragen has finished, type the
following to sort the newly created data:
hadoop jar /opt/mapr/hadoop/hadoop-0.20.2/hadoop-0.20.2-dev-examples.jar terasort /data1/out1 /data1/out2
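For reference, the teragen invocation (step 2, which runs before the terasort command above) would look something like the sketch below. The row count is an assumption (teragen writes 100-byte rows, so 10,000,000 rows is roughly 1 GB) and should be adjusted to the amount of data you want to generate:

# Generate synthetic input data for terasort into /data1/out1
hadoop jar /opt/mapr/hadoop/hadoop-0.20.2/hadoop-0.20.2-dev-examples.jar \
    teragen 10000000 /data1/out1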
When we are running Terasort, we can use the MCS to watch the node usage. When we set the
heatmap to show Disk Usage, we can see the load on each node. We are looking for the load
to be spread evenly across our cluster. Hotspots suggest a problem with a hard drive or its
controller. We can change the view of our heatmap to look at the load of different resources of
our cluster as we run our tests.
In addition to the heatmap views, we can look at the services and jobs. Since we are using
synthetic code, we know that it functions properly. If we have a job or task failure, then we
have an issue with our hardware.
When Terasort is finished, we can compare the results with our RWSpeedTest results. We
should expect our Terasort throughput to be between 50% and 70% of our RWSpeedTest
throughput. Since we know the Terasort job code does not have any errors, if we see
performance that doesn't match our expectations, we know we have a problem with the
hardware in our cluster.
These labs provide insight into how data is managed in a MapR cluster, and give hands-on
experience configuring topologies, volumes, and quotas. You have a great degree of control over
your organization's MapR storage resources. Configuring a cluster with appropriate topologies
and volumes has long-term impacts on performance, reliability, and ease of management. This
lab is broken into three separate exercises that build on each other.
Lab Procedures
Always set up node topology before deploying the cluster. Never leave nodes in /data/default-rack.
Key Tips:
Create volumes to contain different types of data on the cluster before deploying the
cluster. (E.g., create one volume per user, one volume per project, distinct volumes for
production work and development work, etc.) Don't let data accumulate at the root
level of the cluster.
MapR separates the concepts of volume ownership and quota accounting. Project
members can have full ownership of files and folders for a project, while the collective
storage for the whole project is restricted by a quota independent of individual users.
Rack Layout
In this training lab environment, our physical rack layout is hypothetical. If you were
configuring node topology in a physical cluster environment, then you would coordinate with
the team responsible for the physical setup of the cluster to build a diagram of the physical rack
layout. For this lab, let's assume our cluster's nodes are contained in two racks.
Note: If applicable, you may need to coordinate your activities on your Team# cluster with the
other members of your team.
rack1/
    r1_nodeN
rack2/
    r2_node1
    r2_node2
    r2_nodeN
decommissioned/
    <nodes_to_remove>
Important!
Don't start using your cluster with nodes assigned to /data/default-rack. If you don't take the
time to set up topologies early on, you will have difficulty later taking advantage of MapR's HA
features.
In this exercise you will define two new topologies, /data/rack1 and /data/rack2, and
assign all nodes in the cluster to one of the two. The diagram below shows the logical
organization of our cluster's node topology.
Given this structure, assigning data to /data will distribute the data across all nodes in the
cluster. Assigning data to /data/rack1 will restrict data storage to the nodes in rack1. And, if
desired, assigning data to /data/rack2/node1 will restrict storage to that particular node.
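Topology can also be assigned from the command line. A minimal sketch, using placeholder server IDs (look up the real IDs for your own nodes first):

# List the nodes with their server IDs
maprcli node list -columns hostname,id

# Move nodes into the new rack topologies (the ID lists below are placeholders)
maprcli node move -serverids <serverID_list_rack1> -topology /data/rack1
maprcli node move -serverids <serverID_list_rack2> -topology /data/rack2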
5. Set the default physical topology using the CLI. You can change the default topology,
such that any new node added to the cluster will appear in the specified topology. In
this step, you are going to change the default topology to /data.
a. Open a SSH session with a node in the cluster.
b. Type the following command at a command line.
maprcli config load -json | grep default
c. Notice the default topology.
d. To change it you would do the following:
maprcli config save -values
'{"cldb.default.volume.topology":"/data"}'
6. Verify that all nodes are assigned to a physical topology.
a. In the MCS Navigation pane under the Cluster group, click Nodes.
b. Look at the Topology pane and confirm that each node in the cluster appears in a
specific rack, and that no nodes remain under /default-rack.
-g 8000 <USER_number>
Username/loginID    Groupname               Teamname/Clustername
user01              webcrawl_dev            Team1
user02              webcrawl_dev            Team1
user03              webcrawl_prod           Team1
user04              webcrawl_prod           Team1
user05              frauddetect_dev         Team2
user06              frauddetect_dev         Team2
user07              frauddetect_prod        Team2
user08              frauddetect_prod        Team2
user09              recommendations_dev     Team3
user10              recommendations_dev     Team3
user11              recommendations_prod    Team3
user12              recommendations_prod    Team3
user13              twittersentiment_dev    Team4
user14              twittersentiment_dev    Team4
user15              twittersentiment_prod   Team4
user16              twittersentiment_prod   Team4
user17              loganalysis_dev         Team5
user18              loganalysis_dev         Team5
user19              loganalysis_prod        Team5
user20              loganalysis_prod        Team5
Notice how many volumes are listed. Do these include system volumes? Hint: notice
whether or not the System check box is selected on the upper menu.
Display only the non-system volumes by de-selecting the System check box on the upper
menu.
Locate the New Volume button that lets you create a new volume.
What other volume actions are allowed in the Volume Actions / modify volume menu?
Look across the columns to find whether the volume of interest contains data, and if so, what
is the data size?
What is the replication factor listed for the volume you are examining?
2. Find more details for this volume on the Volumes Properties pane. Hint: Open the pane
by clicking the highlighted name of the volume.
Repeat the process in Step 1 to create a second user volume for your name.
Once again remove the filter so that you can view the full list of non-system
volumes.
1. Decide which of your own volumes you want to remove and select it by clicking the
check box by the volume name.
2. Select Remove on the Modify Volume menu. You will see this dialog box:
Make your choice for what style of removal you want and click the Remove Volume
button on the lower right.
Verify in the volumes list that one of your volumes has disappeared.
Create a volume for each user
In this step, you will create a home volume for all project members, if applicable. On each user
volume:
Restrict the volume to the /data/rack2 topology, which prevents users from consuming
storage resources on /data/rack1.
Assign the Accounting Entity of the user volume to the appropriate group for that user.
Assigning this Accounting Entity prevents the members of the group from collectively
overshooting a storage quota for the project.
Note: user17 and loganalysis_dev are used as examples below. Be sure to substitute the
appropriate user name and group when you create the volumes for your team members.
1. In the MCS, in the Navigation pane under the MapR-FS group, click Volumes.
2. In the Volumes tab click the New Volume button.
3. Following the example below, enter the volume settings for each user volume in the
New Standard Volume dialog box.
Volume Setup section
fc
4. Click OK.
Command Line
It is also possible to create a new volume at the command line. For example:
maprcli volume create -path /home/user17/vol \
-ae loganalysis_dev -aetype 1 -topology /data/rack2 \
-quota 128G -advisoryquota 100G \
-user user17:fc -name user17-homedir
Note: The maprcli volume create command requires specific ordering of
arguments. Make sure that the -name option comes last.
You can change quotas later at the command line. For example:
maprcli volume modify -quota 20G -advisoryquota 15G \
-name user17-homedir
5. Change ownership of the volume for the user. At a command line type:
chown user17 /mapr/<my.cluster.com>/home/user17/
Create a volume for your team project
In this step, you will create a volume for your team project, if applicable. Bear in mind the
following criteria for your project volume
Production volumes should be allowed to span the entire cluster, so they will have a
topology of /data
For development volumes, members of both prod and dev groups get full control
For production volumes, only members of the prod group get full control
Assign your group as the Accounting Entity
Note: loganalysis_dev is used in the examples below. Be sure to substitute the appropriate user
name and group when you create the volumes for your project.
Note: the example below is for a development group volume. If you are creating a
volume for a production group then the topology would be /data
Topology: /data/rack2
Permissions section
Note: the example below is for a development group volume. If you are creating a
volume for a production group, do not add permissions for the development group.
g:loganalysis_dev     fc
g:loganalysis_prod    fc
Usage Tracking
Group loganalysis_dev
Note: the examples below are for a development group volume. If you are creating a
volume for a production group the Advisory Quota is 19T and the Hard Quota is 20T.
5. Click OK.
6. Change ownership and permissions of the project volume. At a command line type:
chgrp loganalysis_dev /mapr/<my.cluster.com>/home/loganalysis_dev/vol
chmod g+rwx /mapr/<my.cluster.com>/home/loganalysis_dev/vol
1. In the MCS, in the Navigation pane under the MapR-FS group, click Volumes. The
Volumes view appears, listing all volumes in the cluster.
2. Confirm that all of the volumes you created are listed in the Volumes view. Other
volumes that are part of the default cluster configuration may also appear here. You can
use the Filter option to list, for example, only the volumes with a mount path matching
/home*, as shown below.
3. Navigate the volumes at the command line and verify that they have been mounted. For
example:
ls -al /mapr/<my.cluster.com>/home/
ls -al /mapr/<my.cluster.com>/home/loganalysis_dev/vol
You should see the volumes you just created in the previous steps mounted in these
locations.
By setting a quota on an Accounting Entity, we can make sure that all volumes assigned to the
Accounting Entity (including user volumes and project volumes) do not collectively overshoot a
project maximum.
1. In the MCS, in the Navigation pane under the MapR-FS group, click User Disk Usage.
The User Disk Usage panel displays all users and groups that have been assigned as an
Accounting Entity (e.g. loganalysis_dev).
2. Click on your project Accounting Entity. The Group Properties dialog box appears.
3. Following the example below, enter the quota settings for your project Accounting
Entity in the Usage Tracking section of the Group Properties dialog box.
Command Line
It is also possible to set the Accounting Entity quotas at the command line. For example:
maprcli entity modify -quota 10T -advisoryquota 9T \
-name loganalysis_dev -type 1
Conclusion
Before you begin adding data to your cluster or submitting jobs, make a decision about
topology (node/data placement) and implement this decision on your cluster.
Create volumes early and often. It is much easier to manage cluster data at a volume level
than to manage all of the data on the cluster as one enormous data set. Imagine trying to
manage petabytes of data!
Creating separate volumes provides flexibility of resource management by separating
ownership from accounting.
Do not use the / or /data/default-rack topology for data placement.
Snapshots
Mirrors
View and manipulate data directly on your cluster using standard Linux file commands via
NFS
Before you begin the lab steps, the cluster filesystem must be mounted on the data instance at:
/mapr/<TeamCluster3>
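If the mount is not already in place, mounting the cluster over NFS typically looks like the sketch below; the NFS gateway hostname is a placeholder for a node in your environment that runs the MapR NFS service:

# Create the mount point and mount the cluster filesystem through the MapR NFS gateway
mkdir -p /mapr
mount -o nolock <nfs_gateway_hostname>:/mapr /mapr

# The cluster then appears under its cluster name
ls /mapr/<TeamCluster3>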
3. Copy the data from the /etc directory on the data instance to the input directory on
your project volume that you created in the previous step
cp -v /etc/*.conf /mapr/<my.cluster.com>/home/loganalysis_dev/input
4. Verify that the data is now in the input directory on your cluster volume
ls /mapr/<my.cluster.com>/home/loganalysis_dev/input
You should see a collection of files that end in .conf
5. Verify that the data you moved from the data instance is now on the cluster in your
project volume
ls /mapr/<my.cluster.com>/home/loganalysis_dev/input
Note: The diff and vi you used above are standard Linux commands. Because the cluster
filesystem is mounted via NFS, any standard Linux programs that operate on text files (sed,
awk, grep, etc.) can be used with data on your cluster. This would not be possible without
NFS. You would need to copy the file out of the cluster first before performing your task and
then copy the resultant file back into the cluster.
Conclusion
In this lab you experienced copying data from an external data source to the cluster storage via
NFS. You were able to do so with standard Linux file commands that are familiar to system
administrators. This process would have been much more technically challenging and taken a
significantly longer time to perform without NFS.
1. Use ssh to log in to a node in your cluster. Use your own user id here.
$ ssh mapr@classnode-cluster
2. Change directory to your personal volume.
$ cd /mapr/<my.cluster.com>/snapshot_lab_mnt_user01
3. Create a data file called STATIC in your personal user-volume containing whatever
data you choose.
$ cat /etc/hosts > STATIC
Create a volume snapshot of your volume using MCS
Select New Snapshot from the pull-down menu under Modify Volume on the top bar, provide
a name for your snapshot, and click OK to create a snapshot of the selected volume, in this case
snapshot_lab_vol_user01. This will create a snapshot of the volume you have selected.
Verify the snapshot was made by clicking Snapshots in the navigation pane at the left side of
the window to see the snapshot name, mount path, and reference volume.
Create and view contents of a new snapshot
Use CLI to manually create a new snapshot and to see its contents for comparison to the source
volume.
1. Connect to your node via ssh and use CLI command to create a snapshot. Make sure
that the name you give to your snapshot does not have a dash in it.
$ maprcli volume snapshot create \
    -volume snapshot_lab_vol_user01 \
    -snapshotname snapshot2_user01
2. Change directory to your volume mount point, list the snapshots, and then list the
contents of the .snapshot directory
$ cd /mapr/<my.cluster.com>/snapshot_lab_mnt_user01
$ ls -al    (notice you don't see your snapshot)
STATIC
$ ls .snapshot
snapshot2_user01
SNP_of_lab_vol_user01_2013-07-16.12-31-44
$ ls .snapshot/snapshot2_user01/
STATIC
Notice that there is a directory with the name of your snapshot.
The contents of the .snapshot/ snapshot2_user01 directory will be identical to
contents of the volume at the time you took the snapshot.
$ while true; do
touch file-$(date +%T)
date >> log; sleep 13
done &
This creates a new file every 13 seconds as this script runs in the background. The file name of
each file will contain the time the file is created. The last command will also log the time each
file is created. This log file will look something like this:
Thu Dec 13 17:15:44 PST 2012
Thu Dec 13 17:15:57 PST 2012
Thu Dec 13 17:16:10 PST 2012
Thu Dec 13 17:16:23 PST 2012
The files created will look something like this:
$ ls
file-17:15:44
file-17:16:23
log
file-17:15:57
file-17:16:10
STATIC
$
Create a new snapshot, wait about 30 seconds, then create another snapshot
Note the time that was displayed in the original ssh window when you created each snapshot
by putting a line into the log file:
$ maprcli volume snapshot create -volume snapshot_lab_vol_user01 \
    -snapshotname snapshot3_user01; echo snapped $(date) >> log
1. Change directory into the mount point of the volume you created the snapshots for
earlier
2. List all files and directories there using "ls -a". Note that you won't see the .snapshot
directory because it is hidden. You can see the contents of the .snapshot directory if
you explicitly give its name, but you won't see it otherwise.
Even though you don't see the .snapshot directory using ls in the volume mount point, it is still
there and you can look inside. Do this:
$ ls -alh .snapshot
total 2.5K
drwxr-xr-x. 5 root root 3 Jul 16 12:58 .
drwxr-xr-x. 2 root root 2 Jul 16 12:57 ..
drwxr-xr-x. 2 root root 1 Jul 16 12:24 snapshot2_user01
drwxr-xr-x. 2 root root 2 Jul 16 12:57 snapshot3_user01
drwxr-xr-x. 2 root root 1 Jul 16 12:24 SNP_of_lab_vol_user01_2013-07-16.12-31-44
You should see the snapshots that you created earlier.
Note: You can also see a list of snapshots in the MCS along with details like when they were
created and when they will expire. You will not, however, be able to see the contents of the
snapshots from the MCS.
1. List the contents of each snapshot. You should see that more files appear in each
subsequent snapshot, like this:
$ ls .snapshot/*
.snapshot/snapshot1:
STATIC

.snapshot/snapshot2:
file-08:39:16  file-08:39:29  file-08:39:42  file-08:39:55  file-08:40:08
file-08:40:21  file-08:40:34  file-08:40:47  file-08:41:00  file-08:41:13
log  STATIC

.snapshot/snapshot3:
file-08:39:16  file-08:39:29  file-08:39:42  file-08:39:55  file-08:40:08
file-08:40:21  file-08:40:34  file-08:40:47  file-08:41:00  file-08:41:13
file-08:41:26  file-08:41:39  file-08:41:52  file-08:42:05  file-08:42:19
file-08:42:32  file-08:42:45  file-08:42:58  file-08:43:11  log  STATIC
You can also look at the contents of the log files in each snapshot. In the second and third
snapshots, you should see everything in the log file up to the moment the snapshot was taken.
That means that you will see the log line for the second snapshot in the third snapshotted
version of the log.
$ cat .snapshot/snapshot3_user01/log
Thu Dec 13 08:39:16 UTC 2012
Thu Dec 13 08:39:29 UTC 2012
...
Thu Dec 13 08:41:13 UTC 2012
snapped at Thu Dec 13 08:41:24 UTC 2012
Thu Dec 13 08:41:26 UTC 2012
Thu Dec 13 08:41:39 UTC 2012
...
Thu Dec 13 08:43:11 UTC 2012
$
The parent directory has continued to fill up with files due to the script that has been running all
this time. Note that each snapshot has all of the files that were created before the snapshot
was created, but it has nothing else. The snapshots preserve a view of the content as it was
when the snapshot was created.
$ kill %1
Remove all files except log and STATIC
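One simple way to do this, assuming the generated files all match the file-* pattern created by the script above:

$ rm file-*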
1. List the contents of your volume and compare the volume contents to the contents of
the last snapshot of the volume:
$ ls
log  STATIC
$ ls .snapshot/snapshot3
file-08:39:16  file-08:39:29  file-08:39:42  file-08:39:55  file-08:40:08
file-08:40:21  file-08:40:34  file-08:40:47  file-08:41:00  file-08:41:13
file-08:41:26  file-08:41:39  file-08:41:52  file-08:42:05  file-08:42:19
file-08:42:32  file-08:42:45  file-08:42:58  file-08:43:11  log  STATIC
$
Note that files that you deleted are still present in each snapshot made before the deletion.
Remember you can review the exact sequence of events that happened by looking at your log
file. Comparing the final version of the log with each snapshotted version is very instructive.
Now you should use the MCS to apply the custom schedule as a snapshot schedule for one of
your volumes:
1. Click Volumes under MapR-FS in the Navigation pane
2. Click the name of one of your volumes
3. Scroll down to the Snapshot Scheduling section
4. Select the custom schedule from the previous step. Click the OK button at the bottom
of the dialog.
Note that new snapshots are being created and that their content is a frozen view of the volume
as of the particular moment in time when each snapshot was created.
These new volumes will have names based on the time that they are created rather than
sequentially numbered names like the snapshots that you created before.
5. Verify that the new snapshots are being created according to the schedule you applied.
6. Using the MCS, list snapshots and notice that the ones created by schedule have an
expiration date, while the ones created manually do not.
2. Select New Volume from the top menu and fill in the template to make a local mirror
(mounting is optional).
Now you have created the mirror volume, but no data has been copied to it.
3. Verify your new mirror volume exists by selecting Mirror Volumes on left bar menu to
display names of all mirrors.
Copy data to your new mirror volume
1. Use the MCS to start mirroring by selecting this option from the Modify Volume
button drop down menu.
2. Verify that data are copied to your mirror volume by watching the display of mirror
volumes. If there is a lot of data, you will see an indication that the copying is in
progress:
CLI example:
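A minimal sketch of the equivalent CLI commands, with placeholder volume and cluster names (on some MapR versions the mirror type is given numerically, so check maprcli volume create -help on your own cluster):

# Create a local mirror volume whose source is an existing volume on this cluster
maprcli volume create -name <myvol>-mirror -type mirror -source <myvol>@<my.cluster.com>

# Start copying data from the source volume into the mirror
maprcli volume mirror start -name <myvol>-mirror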
Conclusion
The mirror volumes you created are located on the local cluster. They would be appropriate for
load balancing, for making a read-only version of data available, for isolating a copy of data from
ongoing activities or for deployment.
Remember these lessons learned:
When you make a mirror volume, you must reference the source volume by name.
The new mirror volume you create does not contain data until you start the mirroring
process or apply a schedule.
You will break into teams and, as a class, you will configure both clusters to be aware of each
other so data can be mirrored between them. Each team will configure one or two nodes on
each cluster so it is aware of the other cluster. Then one person will restart the Webserver on
each cluster. Then each team will create a mirror volume on the destination cluster that refers
back to a volume with data on the source cluster and initiate mirroring.
Before beginning the lab steps, you should verify that you have some data in a volume on the
source cluster. This data could be left over from a previous lab exercise (e.g. the NFS/Accessing
lab) or you can copy new data into a volume for the purpose of this lab. The instructor will
provide you with some test data to use if necessary.
Set Up
1. Verify all nodes in the source cluster have <Team name> for the source cluster (line 1)
and configure all nodes to be aware of the <destination Team cluster> (line 2)
2. SSH to the node you are configuring on the source cluster
3. Verify in /opt/mapr/conf/mapr-clusters.conf that the <Teamname> is there
Team1
4. Add a second line in /opt/mapr/conf/mapr-clusters.conf for the remote
cluster in the format:
<clusterTeam2> <CLDB1>:7222 <CLDB2>:7222
<CLDB1> and <CLDB2> are the CLDB nodes in the destination cluster
<CLDB_A> and <CLDB_B> are the CLDB nodes in the source cluster
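For example, after this change, mapr-clusters.conf on a Team1 source node might look like the following (the CLDB hostnames are the placeholders defined above):

Team1 <CLDB_A>:7222 <CLDB_B>:7222
Team2 <CLDB1>:7222 <CLDB2>:7222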
You should see confirmation at the top of the MCS indicating that the mirror volume was
created
Conclusion
In this lab you learned how to copy data from one cluster to another using remote mirroring. As you
learned earlier in this course, MapR volumes allow you a greater degree of control over how to
manage data in the cluster. Mirroring the volumes that contain your business-critical data to a
remote cluster can significantly reduce the amount of key data you would lose and the time it would
take to resume productivity in the event of a disaster.
Where Title and Topic are in column family info, and First and Last are in column family author:

ID   Title                                       Topic       First      Last
1                                                cloud       Diana      Truman
2    Enterprise Grade Solutions for HBase        highavail   Roopesh    Nair
3    A Comparison of NoSQL Database Platforms    nosql       Jonathan   Morgan
5. Count the number of rows. Make sure every row is printed to the screen as it is
counted.
6. Retrieve the entire record with ID '2'.
7. Retrieve only the title and topic for record with ID '3'.
8. Change the last name of the author with title "A Comparison of NoSQL Database
Platforms".
Display both the new and old value. Can you explain why both values are there?
A data store that will be accessed by large numbers of client requests, for example
thousands of reads per second.
3. Columns may be created when data is inserted; they don't have to be defined up front.
MapR can scale up to very large numbers of columns per column family. However, the table
name and column families have to be defined before data is inserted.
4. In addition to using the list command in the HBase shell, you can use standard Linux
ls to list all tables (and files) stored in a particular directory.
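For context, creating the table and inserting one of the earlier rows would look something like the sketch below in the HBase shell; this is an illustrative reconstruction that assumes the same table path and the column families info and author used in the following steps:

hbase> create '/user/user01/Blog', {NAME=>'info'}, {NAME=>'author'}
hbase> put '/user/user01/Blog','1','info:topic','cloud'
hbase> put '/user/user01/Blog','1','author:first','Diana'
hbase> put '/user/user01/Blog','1','author:last','Truman'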
hbase> put '/user/user01/Blog','3','author:first','Jonathan'
hbase> put '/user/user01/Blog','3','author:last','Morgan'
6. Count the number of rows of data inserted
hbase> count '/user/user01/Blog',INTERVAL=>1
8. Retrieve only the title and topic for record with ID '3'.
hbase> get '/user/user01/Blog','3',{COLUMNS=>['info:title','info:topic']}
9. The record with title "A Comparison of NoSQL Database Platforms" has ID 3. To update
its value execute a put operation with that ID.
hbase> put '/user/user01/Blog','3','author:last','Smith'
To verify the put worked, select the record:
hbase> get '/user/user01/Blog','3',{COLUMNS=>'author:last'}
To display both versions, specify the number of versions in a get operation:
hbase> get '/user/user01/Blog','3',{COLUMNS=>'author:last', VERSIONS=>3}
The reason we see the old value is that cells keep up to three versions by default in MapR
tables.
10. Display all the records.
hbase> scan '/user/user01/Blog'
11. Display the title and last name of all the records.
hbase> scan
'/user/user01/Blog',{COLUMNS=>['info:title','author:last']}
12. Display the title and topic of the first two records.
hbase> scan '/user/user01/Blog',
{COLUMNS=>['info:title','info:topic'],LIMIT=>2}
13. The record with title "Enterprise Grade Solutions for HBase" has record ID '2'; delete all
columns for record with ID '2':
hbase> delete '/user/user01/Blog','2','info:title'
hbase> delete '/user/user01/Blog','2','info:topic'
hbase> delete '/user/user01/Blog','2','author:first'
hbase> delete '/user/user01/Blog','2','author:last'
14. To delete a table in HBase shell, the table must first be disabled, and then you can drop
it.
hbase> disable '/user/user01/Blog'
hbase> drop '/user/user01/Blog'
Troubleshooting
put '/home/user01/Blog','3','author:last','Morgan'
count '/home/user01/Blog',INTERVAL=>1
get '/home/user01/Blog','2'
get '/home/user01/Blog','3',{COLUMNS=>['info:title','info:topic']}
put '/home/user01/Blog', '3','author:last','Smith'
get '/home/user01/Blog','3', {COLUMNS=>'author:last'}
get '/home/user01/Blog','3', {COLUMNS=>'author:last', VERSIONS=>3}
scan '/home/user01/Blog'
scan '/home/user01/Blog', {COLUMNS=>['info:title','author:last']}
scan '/home/user01/Blog',
{COLUMNS=>['info:title','info:topic'],LIMIT=>2}
delete '/home/user01/Blog','2','info:title'
delete '/home/user01/Blog','2','info:topic'
delete '/home/user01/Blog','2','author:first'
delete '/home/user01/Blog','2','author:last'
#disable '/home/user01/Blog'
#drop '/home/user01/Blog'
##########################################################
# Additional commands to experiment with
# NOTE: You can copy-paste multiple lines at a time
#
into HBase shell. Or, you can source a script.
#
Example: hbase> source "hbase_script.txt"
##########################################################
# add content column-family to table
alter '/home/user01/Blog', {NAME=>'content'}
# insert row 1
put '/home/user01/Blog', 'Diana-001', 'info:title', 'MapR M7 is Now
Available on Amazon EMR'
put '/home/user01/Blog', 'Diana-001', 'info:author', 'Diana'
put '/home/user01/Blog', 'Diana-001', 'info:date', '2013.05.06'
put '/home/user01/Blog', 'Diana-001', 'content:post', 'Lorem ipsum
dolor sit amet, consectetur adipisicing elit'
# insert row 2
put '/home/user01/Blog', 'Diana-002', 'info:title', 'Implementing
Timeouts with FutureTask'
put '/home/user01/Blog', 'Diana-002', 'info:author', 'Diana'
put '/home/user01/Blog', 'Diana-002', 'info:date', '2011.02.14'
put '/home/user01/Blog', 'Diana-002', 'content:post', 'Sed ut
perspiciatis unde omnis iste natus error sit'
# insert row 3
get '/home/user01/Blog', 'Jonathan-004', {COLUMN=>'info:date', VERSIONS=>3}
get '/home/user01/Blog', 'Jonathan-004', {COLUMN=>'info:date', VERSIONS=>2}
get '/home/user01/Blog', 'Jonathan-004', {COLUMN=>'info:date', VERSIONS=>1}
# selects 1 by default
get '/home/user01/Blog', 'Jonathan-004', {COLUMN=>'info:date'}
The columns (Start Key, End Key, Physical Size, Logical size, etc.) represent
meaningful data about the table regions
Node hostnames are hyperlinks to the node details page for a given node
Click the New Column Family button to see the options available for a new column
family (name, Max Versions, Min Versions, etc.)
Click Cancel; we do not want to create a new column family at this time
2. Observe the new table you have created in the UI and at the CLI (your choice)
3. Locate the sample data file (the instructor can point you to the right directory)
4. cd to the sample data directory.
5. Run the hadoop command with the importtsv flag
[root@CentOS001 data2]# su mapr -c 'hadoop jar \
    /opt/mapr/hbase/hbase-0.94.9/hbase-0.94.9-mapr-1308.jar importtsv \
    -Dimporttsv.columns=HBASE_ROW_KEY,cf1:name,cf2:age,cf2:party,cf3:contribution_amount,cf3:voter_number \
    /mapr/<Cluster_name>/data2/<userX_voter_data_table> \
    /mapr/<Cluster_name>/data2/voter10M'
If necessary, scroll to the right so you can see the Gets, Puts and Scans columns.
You should see a large number of puts across several nodes while your import is
processing
Click the name of the table you used for the import under Recently opened tables
You should see that your table automatically split into a number of regions during
the import
Not all HBase shell commands are applicable to MapR tables. Consult MapR documentation
for the list of supported commands.
hbase(main):001:0> scan '/<data2>/<userX_voter_data_table>', LIMIT => 5
In this table we will import the same data and create a new HBASE_ROW_KEY from field
position 6.
hadoop jar /opt/mapr/hbase/hbase-0.94.9/hbase-0.94.9-mapr-1308.jar \
importtsv -Dimporttsv.columns=cf1:number,cf1:name,cf2:age,cf2:party,\
cf3:contribution_amount,HBASE_ROW_KEY \
/mapr/<Cluster_name>/data2/<userX_voter_data_table>2 \
/mapr/<Cluster_name>/data2/voter10M
10. Check the job progress in the MCS as you did in the previous step.
11. Enter an HBase shell.
Notice the data is in a new position (a scan sketch follows below). This may or may not improve
future scanning operations. You now have a technique to import tab-separated data and to
control the definition and position of the ROW_KEY.
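To compare the two layouts, you can scan the second table just as you scanned the first. A minimal sketch follows; the path simply mirrors the placeholders used in the import command above:

hbase(main):001:0> scan '/<data2>/<userX_voter_data_table>2', LIMIT => 5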
12. Create a presplit table (optional); a shell sketch follows this step.
Observe your newly created presplit table in the MCS or at the command line.
Rerun steps 9 and 10, substituting your newly created presplit table as the
destination table.
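If you are unsure how to presplit a table from the HBase shell, here is a minimal sketch. The table path, column-family names, and split points below are illustrative assumptions; choose split points that match the distribution of your own row keys:

create '/<data2>/<userX_voter_data_table>_presplit', 'cf1', 'cf2', 'cf3',
  {SPLITS => ['2000000', '4000000', '6000000', '8000000']}

Presplitting spreads the initial load of a bulk import across several regions (and therefore several nodes) instead of funneling every put through a single region that must split as it grows.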
Labs Overview
Lab 6.2: Set up SMTP to configure the cluster to use your SMTP server to send mail.
Lab 6.3: Metrics, Monitoring & Troubleshooting in MCS. Explore the various metrics
available through the MCS, then monitor and assess a set of MapReduce jobs.
Lab 6.4: Managing Services: Practical Exercises. In this lab you will get practical
experience with managing services and nodes from the MCS and the CLI.
Lab 6.5: OPTIONAL - Decommissioning vs. Maintenance. In this lab you will experience
what happens when one or more nodes are moved out of the /data topology to
/offline before being decommissioned, and contrast that behavior with what
happens when you temporarily shut down Warden on a node so you can maintain it.
In the Configure Email Addresses dialog, you can specify whether MapR gets user email
addresses from an LDAP directory, or uses a company domain:
1. Use Company Domain - specify a domain to append after each username to
complete each user's email address.
2. Use LDAP - obtain each user's email address from an LDAP server.
3. Click Test SMTP Connection. If there is a problem, check the fields to make sure the
SMTP information is correct.
4. Once the SMTP connection is successful, click Save to save the settings.
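If you prefer to script these settings rather than use the dialog, the same values can usually be applied with maprcli config save. A hedged sketch is below; the mapr.smtp.* parameter names and the values shown are assumptions based on common MapR configuration keys, so verify them against the documentation for your release before relying on them:

maprcli config save -values '{"mapr.smtp.server":"smtp.example.com","mapr.smtp.port":"25","mapr.smtp.sender.email":"mapr-alerts@example.com","mapr.smtp.sslrequired":"false"}'
maprcli config load -keys mapr.smtp.server,mapr.smtp.port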
Along with setting up email addresses, you need to set up your SMTP server so you can send
emails. Start by accessing the MapR Control System (MCS) from your browser, then follow these
steps:
Troubleshooting
The table of jobs should be narrowed down to just those that had a duration in the range
you selected. A filter setting above the histogram expresses this limitation, and the bar
you used to filter the table will be highlighted in yellow.
4. Remove the filter by clicking on the minus signs at the right of each filter expression.
5. Change the filter expression to be anything you like.
6. Try filtering on user name or job name.
7. Examine a particular set of jobs in more detail by clicking on Zoom instead of Filter.
The filter expression will be set as before but the histogram is limited to just those jobs
that match the filter. The horizontal axis of the histogram will be expanded
appropriately.
8. Explore a single job
The name of each job in the jobs table is highlighted in blue to indicate that it is an active link. If
you click on the name of a job, a new tab opens with information about that job. Whereas the
job metrics page has information about jobs, this new page has information about the tasks for
a single job. You can explore the tasks of a job just as you were able to previously explore all
jobs.
With tasks, you can control whether map tasks, reduce tasks or setup tasks are shown.
NOTE: One common thing to look for in a job is to see if there are map or reduce tasks that take
significantly longer to complete than others. To find such anomalous map tasks, display only
map tasks by checking the appropriate boxes above the histogram. For the job shown below,
there are 3 map tasks that took considerably longer than most tasks.
Isolate those tasks by adding a filter that only shows tasks with a duration > 3 seconds.
Common causes of slow tasks include a malfunctioning node (all or most of the tasks would be
on one node) or tasks that are reading data from other nodes rather than from a local replica
(the Local column in the table shows this for map tasks).
Use the MCS to monitor the progress of these jobs to answer the following questions:
a. How many jobs completed in less than 1 minute? How many jobs completed
between 1 and 2 minutes?
b. Did any jobs fail?
c. Click the name of a job to display information about the job's tasks. What do you
observe?
d. Which jobs had the longest duration? The shortest duration? The most map tasks?
e. Find how to drill down to tasks and task attempts.
Try showing both map and reduce tasks. Sort the task list by task duration and scroll
down the list. Do you see a pattern in the arrangement of map and reduce
tasks? Do you see a difference between map tasks with local data and those with non-local data?
Troubleshoot Jobs
You can look for failed jobs by creating a filter. Go back to the Jobs page by clicking on the left
Navigation panel. Reset any filters by making sure that the filters check box is unchecked as
shown below:
Now check the filter check box again and add a filter to find jobs with Job Failed Task Count
greater than zero, as shown below:
If you click on a job that had failed tasks, you will see tasks with a variety of coded squares next
to them:
The green tasks completed successfully. The bright red tasks failed. The dark red tasks were
killed by the job tracker when the job could not be completed due to a persistent error.
You can find out more by clicking on the failed task. This will show you all of the attempts to run
this task. This should look something like this:
Clicking further on one of these task attempts takes you to a page that describes all that is
known about this task attempt. This includes all of the counters generated by this task as well
as a link in the upper right of the page that allows you to get a stack trace from the failed
process:
Note: you may need to modify the URL by replacing the internal hostname with the external
hostname.
In this case, here is the stack trace for this task attempt. It looks very similar to the stack traces
for the other attempts.
In this case, this task died because your trainer removed permissions on the output directory. In
real life, the problems are not as simple as that. Often, the next task for you will be to look at
the source code of the program that failed.
Note: Warden and ZooKeeper services are not displayed on the MCS.
Figure 2: Nodes view showing services configured on and running on one node
Hint:
You can also click on the numbers in the various columns to view information about the
associated nodes. For example, if you wish to view information on the Standby JobTracker from
Figure 1, click on the number 1 in the Stby column for JobTracker.
Hint: You can also click on the icon for your team node in the heatmap to display the node
details.
2. The node details view contains a section called Manage Node Services. Select a service
to display the available options at the bottom of the Manage Node Services pane.
Hint: If you had selected more than one node in the previous step then any action you take
in the Manage Node Services dialog would affect all of those nodes simultaneously. The
hostname for each node would be displayed in the Nodes affected by service changes
section.
Notice that all TaskTrackers are once again running on the cluster
Hint: Alternatively, you could have started the TaskTracker from the Manage Node Services
pane in the Node details view for your node.
Figure 11: Start TaskTracker from Manage Node Services pane in node details view
Are there any services displayed by jps that are not displayed in the MCS?
Are there any services displayed in the MCS that are not displayed by jps?
Can you explain the differences between these two methods for viewing
services?
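One convenient way to make this comparison is from a shell on the node itself. A minimal sketch, assuming you are logged in as root or the mapr user on the node in question:

# JVM processes as reported by the JDK
jps

# Services reported by MapR for the same node
maprcli service list -node `hostname`

Comparing the two lists should help you reason about which processes (for example, Warden and ZooKeeper, which are not displayed on the MCS) are managed outside the MCS services view.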
What happens when you stop the JobTracker service on the active JobTracker?
How long does it take for the standby JobTracker to become the active
JobTracker?
How long does it take for the restarted JobTracker to appear in the Services pane of
the Dashboard?
Where does this service now appear in the Services pane of the Dashboard?
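If you want to drive this exercise from the command line instead of the MCS, a hedged sketch of the equivalent calls is below; <jt_hostname> is a placeholder for the node currently running the active JobTracker:

maprcli node services -nodes <jt_hostname> -jobtracker stop
maprcli node services -nodes <jt_hostname> -jobtracker start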
Decommissioning
A node is placed into the /offline topology (a topology that has no volumes associated with it) so
that it can be removed from the cluster. The intention in this scenario is that the node will no
longer be used in the MapR cluster.
Maintenance
The Warden is shut down on a node so that it can be taken offline temporarily, possibly for
maintenance. The intention is that the Warden will be started once again in a relatively short
time (less than 1 hour) and that the node will return to the MapR cluster.
Both of these scenarios are likely to occur at some point in the lifecycle of a MapR cluster. The
frequency depends upon the size of the cluster and other factors such as disk failure rate, etc.
While these two scenarios may seem similar, the data container replication behavior is quite
different, as you will observe.
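For reference, here is a minimal sketch of how the decommissioning scenario is typically driven from the command line; the server ID is a placeholder you would look up first, the column names are as we recall them for this release, and in the lab you may perform the same move from the MCS instead:

# find the server ID of the node to be decommissioned
maprcli node list -columns id,hostname

# move the node into the /offline topology
maprcli node move -serverids <server_id> -topology /offline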
4. Log onto the CLDB master node and switch to the root or mapr user
Navigate to the MapR logs directory and monitor the CLDB log file
1. Navigate to /opt/mapr/logs
Notice that there are many log files in this location. This is where all MapR services
write their logs. You will also notice that some of the log files roll over periodically. For
example, you will see 10 files in the format warden.log.<date> in addition to the
current warden.log file. Each day a new log file is created, the previous day's log file
is renamed with the date, and the oldest log file is deleted. This helps to keep the log
files manageable and keeps them from filling up too much disk space. (A quick
directory listing, sketched after these steps, will confirm this behavior.)
2. Begin to monitor the CLDB log file
# tail -f cldb.log
3. Keep this window open so you can observe changes as they are recorded in the CLDB
log file
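If you want to confirm the rollover behavior described in step 1, a quick listing of the log directory (sorted by modification time, run in a separate window) will show the current and rolled-over files:

ls -lt /opt/mapr/logs | head -20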
Step through the servicing scenario by stopping Warden on a node that has data containers
(see the Warden commands sketched below). How are the entries in the CLDB log file
different from the decommissioning scenario?
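On a typical MapR 3.x node running CentOS, Warden is usually stopped and restarted with the standard service commands, run as root on the node itself (this is a sketch; service management may differ in your environment):

service mapr-warden stop
# ... perform maintenance, then bring the node back ...
service mapr-warden start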