All rights reserved. This document contains proprietary and confidential material, and is only for use by
licensees of DMExpress. This publication may not be reproduced in whole or in part, in any form,
except with written permission from Syncsort Incorporated. Syncsort is a registered trademark and
DMExpress is a trademark of Syncsort, Incorporated. All other company and product names used
herein may be the trademarks of their respective owners.
The accompanying DMExpress program and the related media, documentation, and materials
("Software") are protected by copyright law and international treaties. Unauthorized reproduction or
distribution of the Software, or any portion of it, may result in severe civil and criminal penalties, and
will be prosecuted to the maximum extent possible under the law.
The Software is a proprietary product of Syncsort Incorporated, but incorporates certain third-party
components that are each subject to separate licenses and notice requirements. Note, however, that
while these separate licenses cover the respective third-party components, they do not modify or form
any part of Syncsort's SLA. Refer to the Third-party license agreements topic in the online help for
copies of respective third-party license agreements referenced herein.
Table of Contents
1 Introduction
1 Introduction
DMX-h ETL is Syncsort's high-performance ETL software for Hadoop. It combines
powerful ETL data processing capabilities with the enhanced performance and
scalability of Hadoop, without the need to learn complex MapReduce programming.
A downloadable package of use case accelerators demonstrates how common
ETL applications, easily developed in DMExpress, can be run in the Hadoop
environment.
Installing the DMX-h software and setting up the use case accelerators in your
Hortonworks Sandbox VM is fast and easy. Just follow the instructions in this
document, and try out DMX-h for yourself.
2 Getting Started
2.1 Getting the DMX-h Components
The following components are included in the downloadable zip file,
DMX-h_<DMX-h version>_Hortonworks_Sandbox.zip, from www.syncsort.com/hortonworks:
dmexpress_<DMExpress version>_en_linux_2-6_x86-64_64bit.tar
dmexpress_<DMExpress version>_windows_x86.exe
1. Be sure that the network adapter on the VM is configured correctly for SSH
connectivity. Refer to
http://hortonworks.com/wp-content/uploads/2013/03/InstallingHortonworksSandboxonWindowsUsingVMwarePlayerv2.pdf
for details.
2. From your desktop, use scp (Secure Copy) via PuTTY or WinSCP to copy (in
binary mode) the file dmexpress_<DMExpress version>_en_linux_2-6_x86-64_64bit.tar
to the /root directory on the Sandbox VM, logging in as root with
password hadoop.
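For those working from a command line instead of WinSCP, step 2 can be sketched as a single scp invocation; the IP address below is an example, and the command is only echoed (a dry run) so it can be checked before being executed:

```shell
#!/bin/sh
# Dry-run sketch of step 2 as one scp command. The IP address is an
# example -- use the address your Sandbox VM actually reports. The
# command is echoed rather than run so it can be reviewed first.
VM_IP=192.168.137.128
TARBALL='dmexpress_<DMExpress version>_en_linux_2-6_x86-64_64bit.tar'
CMD="scp $TARBALL root@$VM_IP:/root/"
echo "$CMD"
```

Removing the echo (or pasting the printed line into a terminal) performs the actual copy.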
3. Log into the Sandbox VM as root with password hadoop using ssh via PuTTY or
another terminal emulator. This will put you in the /root folder.
4. Run the following commands to extract the DMExpress software into a subfolder
named dmexpress and start the installer:
tar -xvf dmexpress_<DMExpress version>_en_linux_2-6_x86-64_64bit.tar
cd dmexpress
./install
a. When prompted, select the option to start a free trial. The trial has a duration of
30 days, starting from the first time you run DMExpress.
b. Choose /usr/local/dmexpress as the installation directory.
c. When prompted to install the service, choose Yes.
6. Create a new user named dmxdemo for running DMX-h jobs (this will automatically
create the folder /home/dmxdemo on the VM), and give that user a home directory
in HDFS, as follows:
useradd dmxdemo
passwd dmxdemo    # follow the prompts to set the password to dmxdemo
su - hdfs
hadoop fs -mkdir /user/dmxdemo
hadoop fs -chown dmxdemo /user/dmxdemo
8. Switch user back to dmxdemo (su - dmxdemo), then edit the dmxdemo user's
.bash_profile and add the following lines, adjusting the JVM library path as
needed for your installation:
DMX_HOME=/usr/local/dmexpress
PATH=$PATH:$HOME/bin:$DMX_HOME/bin
LD_LIBRARY_PATH=/usr/lib/jvm/java-1.7.0-openjdk-1.7.0.95.x86_64/jre/lib/amd64/server/:$LD_LIBRARY_PATH:$DMX_HOME/lib
export DMX_HOME
export PATH
export LD_LIBRARY_PATH
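One way to sanity-check these additions before relying on them is to source them from a scratch file and confirm the result; this is a minimal sketch, with the JVM path taken verbatim from the lines above (adjust it to match the JDK actually installed on your VM):

```shell
#!/bin/sh
# Sketch: try out the .bash_profile additions by sourcing them from a
# scratch file, then confirm DMX_HOME is set and PATH picked up the
# DMExpress bin directory.
cat > profile.demo <<'EOF'
DMX_HOME=/usr/local/dmexpress
PATH=$PATH:$HOME/bin:$DMX_HOME/bin
LD_LIBRARY_PATH=/usr/lib/jvm/java-1.7.0-openjdk-1.7.0.95.x86_64/jre/lib/amd64/server/:$LD_LIBRARY_PATH:$DMX_HOME/lib
export DMX_HOME
export PATH
export LD_LIBRARY_PATH
EOF
. ./profile.demo
echo "DMX_HOME=$DMX_HOME"
case ":$PATH:" in
  *":$DMX_HOME/bin:"*) echo "PATH includes $DMX_HOME/bin" ;;
  *) echo "PATH is missing $DMX_HOME/bin" ;;
esac
```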
To simplify creating remote server and file browsing connections, edit the Windows
hosts file as an Admin user and add an entry for the IP address of the VM (shown
when you connect to the VM) with the hostname sandbox.hortonworks.com. For
example:
192.168.137.128 sandbox.hortonworks.com
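A small sketch of that edit, operating on a scratch copy of the hosts file so nothing is touched by accident (the real Windows file lives at C:\Windows\System32\drivers\etc\hosts and must be edited as Administrator; the IP address is an example):

```shell
#!/bin/sh
# Sketch: append the Sandbox entry to a hosts file only if it is not
# already present. HOSTS points at a scratch copy for demonstration;
# substitute the real hosts file path when applying this for real.
HOSTS=./hosts.demo
ENTRY='192.168.137.128 sandbox.hortonworks.com'
touch "$HOSTS"
grep -q 'sandbox\.hortonworks\.com' "$HOSTS" || printf '%s\n' "$ENTRY" >> "$HOSTS"
```

The grep guard makes the edit idempotent, so running it twice does not duplicate the entry.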
This will create Data, Jobs, and bin directories under /home/dmxdemo, which you
will later designate as the value of $DMXHADOOP_EXAMPLES_DIR.
4. Save the configuration, and allow Ambari to restart the MR2 services if prompted.
4 Using DMX-h
4.1 Use Case Accelerators
Syncsort provides a set of use case accelerators that cover a variety of common ETL
use cases to quickly and easily demonstrate both the development and running of
DMX-h ETL jobs in Hadoop.
A brief description of each use case accelerator is provided below, with links to more
detailed descriptions:
Change Data Capture (CDC)
  CDC Single Output: Performs change data capture (CDC) against two large input
  files, producing a single output file marking records as inserted, deleted,
  or updated.
  CDC Distributed Output: Same as CDC Single Output, except that it produces
  three separate output files for the inserted, deleted, and updated records.
  Mainframe Extract + CDC: Same as CDC Single Output, but also converts and
  loads mainframe data to HDFS before passing the HDFS data to the CDC job.

Joins and Lookups
  Join Large Side | Small Side: Performs an inner join between a small
  distributed cache file and a large HDFS file.
  Join Large Side | Large Side: Performs a join of two large files stored in
  HDFS.
  File Lookup: Performs a lookup in a small distributed cache file while
  processing a large HDFS file.

Aggregations
  Web Logs Aggregation: Calculates the total number of visits per site in a set
  of web logs using aggregate tasks.
  Lookup + Aggregation: Performs a lookup followed by an aggregation.

Connectivity
  Direct Mainframe Extract & Load: Loads two files residing on a remote
  mainframe system to HDFS, converting to ASCII displayable text.
  Mainframe Redefine File Load: Same as Direct Mainframe Redefine, except that
  the mainframe file is loaded to HDFS from the local file system.
  HDFS Load: Same as HDFS Extract, but loads data to HDFS.
  HDFS Load Parallel: Same as HDFS Load, but splits the data into multiple
  partitions and loads to HDFS in parallel.
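To make the CDC idea concrete, here is an illustrative sketch of the same insert/delete/update classification using plain coreutils; it shows only the concept, not how DMX-h implements it, and all file names and records are made up:

```shell
#!/bin/sh
# Illustrative CDC sketch with sort/comm/join; not DMX-h itself.
# Records are "key,value" lines; old.txt is the prior snapshot.
printf '1,a\n2,b\n3,c\n' > old.txt
printf '2,b\n3,x\n4,d\n' > new.txt

sort -t, -k1,1 old.txt > old.sorted
sort -t, -k1,1 new.txt > new.sorted
cut -d, -f1 old.sorted > old.keys
cut -d, -f1 new.sorted > new.keys

comm -13 old.keys new.keys > inserted.keys   # keys only in new
comm -23 old.keys new.keys > deleted.keys    # keys only in old
# Keys present in both snapshots whose value changed => updated.
join -t, old.sorted new.sorted | awk -F, '$2 != $3 {print $1}' > updated.keys

echo "inserted: $(cat inserted.keys)"
echo "deleted:  $(cat deleted.keys)"
echo "updated:  $(cat updated.keys)"
```

With the sample data above, key 4 is classified as inserted, key 1 as deleted, and key 3 as updated.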
1. Log into the VM as dmxdemo and set the following environment variables as
indicated in order to run the prep script:
export DMXHADOOP_EXAMPLES_DIR=/home/dmxdemo
export LOCAL_SOURCE_DIR=$DMXHADOOP_EXAMPLES_DIR/Data/Source
export LOCAL_TARGET_DIR=$DMXHADOOP_EXAMPLES_DIR/Data/Target
export HDFS_SOURCE_DIR=<HDFS directory under which to copy the sample source data, such as /user/dmxdemo/source/>
export HDFS_TARGET_DIR=<HDFS directory under which to write the target data, such as /user/dmxdemo/target/>
export LOCAL_TEMP_DATA_DIR=$DMXHADOOP_EXAMPLES_DIR/Data/Temp
a. To prepare all of the use case accelerators at once:
$DMXHADOOP_EXAMPLES_DIR/bin/prep_dmx_example.sh ALL
b. Or it can be done for the specified space-separated list of folder names under
$DMXHADOOP_EXAMPLES_DIR/Jobs. For example:
$DMXHADOOP_EXAMPLES_DIR/bin/prep_dmx_example.sh FileCDC WebLogAggregation
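Before running the prep script, it can help to confirm that none of the step 1 variables was missed; a minimal pre-flight sketch (the values are the examples used above):

```shell
#!/bin/sh
# Sketch: fail fast if any variable required by prep_dmx_example.sh is
# unset. The assigned values are the example values from the text.
export DMXHADOOP_EXAMPLES_DIR=/home/dmxdemo
export LOCAL_SOURCE_DIR=$DMXHADOOP_EXAMPLES_DIR/Data/Source
export LOCAL_TARGET_DIR=$DMXHADOOP_EXAMPLES_DIR/Data/Target
export HDFS_SOURCE_DIR=/user/dmxdemo/source/
export HDFS_TARGET_DIR=/user/dmxdemo/target/
export LOCAL_TEMP_DATA_DIR=$DMXHADOOP_EXAMPLES_DIR/Data/Temp

missing=0
for v in DMXHADOOP_EXAMPLES_DIR LOCAL_SOURCE_DIR LOCAL_TARGET_DIR \
         HDFS_SOURCE_DIR HDFS_TARGET_DIR LOCAL_TEMP_DATA_DIR; do
  eval "val=\${$v}"
  if [ -z "$val" ]; then
    echo "unset: $v"
    missing=1
  fi
done
[ "$missing" -eq 0 ] && echo "all variables set"
# Then run, e.g.: $DMXHADOOP_EXAMPLES_DIR/bin/prep_dmx_example.sh ALL
```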
4. On the Windows Workstation, start the DMExpress Job Editor, and run the desired
use case accelerator(s) as follows:
a. Select File->Open Job, click on the Remote Servers tab, click on New file
browsing connection, specify the connection as follows, and click OK:
Server: sandbox.hortonworks.com
Connection type: Secure FTP
Authentication: Password
User name: dmxdemo
Password: dmxdemo
b. Open the desired job as follows:
i. Browse to the location of the job you want to run in one of the following
folders as described earlier:
$DMXHADOOP_EXAMPLES_DIR/Jobs/<JobName>/DMXStandardJobs
$DMXHADOOP_EXAMPLES_DIR/Jobs/<JobName>/DMXUserDefinedMRJobs
$DMXHADOOP_EXAMPLES_DIR/Jobs/<JobName>/DMXHDFSJobs
d. Click on the Run button in the Job Editor toolbar. In the Run Job dialog, select
Cluster in the Run on section and click OK.
e. This will bring up the DMExpress Server dialog, which will show the progress
of the running job. Upon completion, select the job and click on Detail to see
Hadoop messages and statistics. (The SRVCDFL warning message about the
data directory can be safely ignored.) To view the Hadoop logs, see Where to
Find DMExpress Hadoop Logs on YARN (MRv2).
For information on how to develop your own DMExpress Hadoop solutions, see
DMX-h ETL in the DMExpress Help, accessible via the DMExpress GUI (Job Editor
or Task Editor).
Syncsort provides enterprise software that allows organizations to collect, integrate, sort, and distribute
more data in less time, with fewer resources and lower costs. Thousands of customers in more than
85 countries, including 87 of the Fortune 100 companies, use our fast and secure software to optimize
and offload data processing workloads. Powering over 50% of the world's mainframes, Syncsort software
provides specialized solutions spanning Big Iron to Big Data, including next-gen analytical platforms
such as Hadoop, cloud, and Splunk. For more than 40 years, customers have turned to Syncsort's
software and expertise to dramatically improve performance of their data processing environments, while
reducing hardware and labor costs. Experience Syncsort at www.syncsort.com.
Syncsort Inc. 50 Tice Boulevard, Suite 250, Woodcliff Lake, NJ 07677 201.930.8200