
IBM InfoSphere BigInsights

Version 3.0 

Tutorials

GC19-4104-03
Note
Before using this information and the product that it supports, read the information in "Notices and trademarks."

© Copyright IBM Corporation 2013, 2014.


US Government Users Restricted Rights – Use, duplication or disclosure restricted by GSA ADP Schedule Contract
with IBM Corp.
Contents

Chapter 1. InfoSphere BigInsights tutorials

Chapter 2. Tutorial: Managing your big data environment
  Lesson 1: Starting to use the InfoSphere BigInsights Console
  Lesson 2: Exploring the InfoSphere BigInsights Console
  Summary of managing your big data environment

Chapter 3. Tutorial: Importing data for analysis
  Lesson 1: Managing your data
  Lesson 2: Importing data by using the BoardReader application
  Lesson 3: Importing data by using the Distributed File Copy application
  Summary of importing data to the distributed file system

Chapter 4. Tutorial: Analyzing big data with BigSheets
  Lesson 1: Creating master workbooks from social media data
  Lesson 2: Tailoring your data by creating child workbooks
  Lesson 3: Combining the data from two workbooks
  Lesson 4: Creating columns by grouping data
  Lesson 5: Viewing data in BigSheets diagrams
  Lesson 6: Visualizing and refining the results in charts
  Lesson 7: Exporting data from your workbooks
  Summary of analyzing data with BigSheets tutorial

Chapter 5. Tutorial: Developing your first big data application
  Lesson 1: Creating an InfoSphere BigInsights project
  Lesson 2: Creating and populating a Jaql file with application logic
  Lesson 3: Testing your application
  Lesson 4: Publishing your application in the InfoSphere BigInsights applications catalog
  Lesson 5: Deploying and running your application on the cluster
  Lesson 6: Making your application more dynamic
  Summary of developing your first big data application

Chapter 6. Tutorial: Developing Big SQL queries to analyze big data
  Setting up the Big SQL tutorial environment
    Creating a directory in the distributed file system to hold your samples
    Getting the sample data
      Accessing the Big SQL sample data installed with InfoSphere BigInsights
      Downloading sample data from a developerWorks source
    Creating a project and tables, and loading sample data
    Optional: Changing the default Eclipse SQL Results view to see more results
  Module 1: Creating and running SQL script files
    Lesson 1.1: Creating an SQL script file
    Lesson 1.2: Creating and running a simple query to begin data analysis
    Lesson 1.3: Creating a view that represents the inventory shipped by branch
    Lesson 1.4: Analyzing products and market trends with Big SQL Joins and Predicates
    Lesson 1.5: Creating advanced Big SQL queries that include common table expressions, aggregate functions, and ranking
    Lesson 1.6: Running Big SQL queries in the Big SQL Console
    Lesson 1.7: Advanced: Creating a user defined function to return total units sold and price with a discount
    Lesson 1.8: Advanced: Creating and running a simple Big SQL query from a JDBC client application
  Module 2: Analyzing big data by using Big SQL and BigSheets
    Lesson 2.1: Preparing queries to export to BigSheets that examine the results of sales by year
    Lesson 2.2: Exporting Big SQL data about total sales by year to BigSheets
    Lesson 2.3: Creating tables for BigSheets from other tables
    Lesson 2.4: Exporting BigSheets data about IBM Watson blogs to Big SQL tables
    Lesson 2.5: Creating a catalog table from BigSheets Watson blog data to use in Big SQL
  Module 3: Analyzing Big SQL data in a client spreadsheet program
    Lesson 3.1: Installing the IBM Data Server Driver Package for the client ODBC drivers
    Lesson 3.2: Importing Big SQL data in a client spreadsheet program
  Summary of developing Big SQL queries to analyze big data

Chapter 7. Tutorial: Analyzing big data with IBM InfoSphere BigInsights Big R
  Lesson 1: Uploading the airline data set to InfoSphere BigInsights server with Big R
  Lesson 2: Exploring the structure of the data set with IBM InfoSphere BigInsights Big R
  Lesson 3: Analyzing data with IBM InfoSphere BigInsights Big R
  Lesson 4: Visualizing big data with IBM InfoSphere BigInsights Big R
  Lesson 5: Creating a predictive model with IBM InfoSphere BigInsights Big R
  Summary of analyzing data with IBM InfoSphere BigInsights Big R tutorial

Chapter 8. Tutorial: Creating an extractor to derive valuable insights from text documents
  Lesson 1: Setting up your project
  Lesson 2: Selecting input documents and labeling examples
  Lesson 3: Writing and testing AQL
    Summary - the basic lessons
  Lesson 4: Writing and testing AQL for candidates
  Lesson 5: Writing and testing final AQL
  Lesson 6: Finalizing and exporting the extractor
  Lesson 7: Publishing the AQL module
  Summary of creating your first Text Analytics application

Chapter 9. Tutorial: Identifying and analyzing errors in machine data
  Lesson 1: Downloading the sample data
  Lesson 2: Extracting the sample data
    Lesson checkpoint
  Lesson 3: Indexing the sample data
  Lesson 4: Identifying frequent sequences of errors
  Lesson 5: Viewing frequent sequence results
  Lesson 6: Identifying event significance on errors
  Lesson 7: Viewing significance analysis results
  Summary: Analyzing machine data errors

Chapter 10. Tutorial: Identifying user feedback in social data
  Lesson 1: Downloading the sample data
  Lesson 2: Configuring the sample data
    Lesson checkpoint
  Lesson 3: Analyzing the sample data
    Lesson checkpoint
  Lesson 4: Viewing analysis results
  Summary: Analyzing social data feedback

Notices and trademarks

Providing comments on the documentation


Chapter 1. InfoSphere BigInsights tutorials
Learn how to use InfoSphere® BigInsights™ by completing these tutorials, which
use real data and teach you to run applications. Complete the tutorials in any
order. If you are not familiar with InfoSphere BigInsights, consider completing the
Managing your big data environment tutorial as your first exercise.

If you are using the InfoSphere BigInsights Quick Start Edition VMWare image,
you will find pre-populated Eclipse projects installed with the Eclipse client. Use
these projects to validate your progress in the Eclipse-related tutorials.

Manage
Within minutes, dive into the world of big data with robust, browser-based control.

Import
Collect and import data for exploration and analysis that helps you make sense of seemingly unrelated data.

Analyze
Delve into BigSheets, an intuitive spreadsheet-like tool, to create analytic queries without any previous programming experience.

Develop
Easily develop your first big data application by using the InfoSphere BigInsights Eclipse plugin.

Query
Quickly master the intricacies of SQL queries for Hadoop with IBM® Big SQL.

Predict
Explore, visualize, and model big data with IBM InfoSphere BigInsights Big R using the interactive R language.

Extract
Discover the power of Text Analytics by creating extractors to derive valuable insights from text documents.

Accelerate machine data
Use IBM Accelerator for Machine Data Analytics to import, extract, index, search, and analyze your machine data files. Not available in the InfoSphere BigInsights Quick Start Edition.

Accelerate social data
Use IBM Accelerator for Social Data Analytics to download, import, and analyze your social data files. Not available in the InfoSphere BigInsights Quick Start Edition.

Chapter 2. Tutorial: Managing your big data environment
Learn how to use the InfoSphere BigInsights Console to check the status of
services, view the health of your system, and monitor the status of applications.

Within minutes, you will be able to quickly navigate and use the InfoSphere
BigInsights Console to manage your big data environment.

This tutorial does not cover real-time monitoring of dashboards or application linking. These topics are for more advanced users.

Learning objectives

After completing the lessons in this tutorial, you will have learned how to
complete the following tasks:
v Use the InfoSphere BigInsights Console to inspect the status of your cluster, start
and stop components, and access tools that are available for open source
components.
v Work with the distributed file system. You will explore the distributed file
system (DFS) directory structure, create subdirectories, and upload files to
HDFS.
v Launch applications and inspect their status. You will also learn how to view
output in BigSheets, a spreadsheet-like tool.

Time required

This module should take approximately 20 minutes to complete.

Lesson 1: Starting to use the InfoSphere BigInsights Console


In this lesson, you log in to the InfoSphere BigInsights Console to explore the
Welcome page and ensure that all InfoSphere BigInsights nodes are running.

For InfoSphere BigInsights to function correctly, nodes such as MapReduce, Hadoop Distributed File System (HDFS), or General Parallel File System (GPFS) are required. You can start, stop, and manage these nodes directly from the InfoSphere BigInsights Console. In addition, you can use the InfoSphere BigInsights Console to view the health of your cluster, deploy applications, manage your files and cluster instances, and schedule workflows, jobs, and tasks from a single location.
1. Log in to the InfoSphere BigInsights Console.

v In a non-SSL installation: Enter the following URL in your browser: http://host_name:8080
  host_name is the name of the host where the InfoSphere BigInsights Console is running, and 8080 is the default port.
v In an SSL installation: Enter the following URL in your browser: https://host_name:8443
  host_name is the name of the host where the InfoSphere BigInsights Console is running, and 8443 is the default port.

2. Explore each section of the Welcome tab to learn more about the tasks and
resources that are available.

v Understand IBM Big Data Tools: An interactive model that provides an overview of the product capabilities in the InfoSphere BigInsights Knowledge Center.
v Tasks: Quick access to commonly used InfoSphere BigInsights tasks.
v Quick Links: Links to internal and external resources and downloads to enhance your environment.
v Learn More: Online resources available to learn more about InfoSphere BigInsights.

Lesson 2: Exploring the InfoSphere BigInsights Console


In this lesson, you navigate the different sections of the InfoSphere BigInsights
Console to familiarize yourself with its capabilities.

Administrators use the InfoSphere BigInsights Console to inspect the overall health
of the system. They also use the InfoSphere BigInsights Console to complete basic
functions such as starting and stopping specific servers and components, and
adding nodes to the cluster. Other users can interact with files in the distributed
file system and manage applications.
1. On the Welcome tab, select Access secure cluster servers under the Quick
Links section.
A pop-up window appears with a list of URLs and the alias for each URL. For
example, click the hive link, which opens the Hive Web Interface in a new
browser window. You see an open source tool that is provided with Hive for
administration purposes, such as browsing the database schema and creating a
session. Close the browser window to return to the InfoSphere BigInsights
Console home page.
2. On the Cluster Status tab, ensure that all InfoSphere BigInsights nodes are
running. If any node is not running, select it, and then click Start. If you would
like to see more information about a node, select it. From this view, you can
also stop a node if it is running. After you start these nodes, they should
remain active for the remainder of this tutorial.
By default, monitoring is unavailable to optimize performance.
3. To explore your distributed file system, select the Files tab. Here, you can see
contents of the distributed file system, create new subdirectories, upload small
files for test purposes, and complete other file-related functions.
4. Become familiar with the functions that are provided by using the icons at the
top of the pane in the Files page. These icons are used throughout the tutorials.
Hover over an icon with your cursor to learn its function.

5. Expand the directory tree in the left navigation. Here, you can locate files that were uploaded and explore existing files. To learn how to upload files, see the "Importing data for analysis" tutorial.
6. On the Applications tab, click Manage. Here, you can view applications that
are available in your cluster, deploy applications to the cluster, and delete
applications that you no longer need.
7. In the Manage applications panel, deploy an application. For example, you can
use the BoardReader application in a later module to import current web data
into the distributed file system. Select the BoardReader application and then
click Deploy. In the Deploy Application window, click Deploy. The status of
this application changes to deployed, and the application is available for use
from the Run applications panel.
8. To view the status of applications, click the Applications Status tab. If this is
the first use of the InfoSphere BigInsights Console, no applications, workflows
or jobs are listed. After you run applications, workflows, or jobs, you can view
their status from this page.

Summary of managing your big data environment


In this tutorial, you learned about the InfoSphere BigInsights Console and how you
can use it to start managing your big data environment.

Lessons learned

You now have a good understanding of the following tasks:


v Getting started with common tasks in the InfoSphere BigInsights Console
v Starting and stopping InfoSphere BigInsights services
v Managing and interacting with files in the distributed file system
v Managing applications in the cluster

Additional resources

To learn more about the tasks that you can complete by using InfoSphere
BigInsights, use the interactive conceptual models. These models provide insight
into some of the other tutorials that you can complete by using the product.
v Overview of InfoSphere BigInsights
v Developing applications by using the InfoSphere BigInsights Tools for Eclipse
v Creating text extractors by using Text Analytics

You can also access the following resources:


v Getting started with BigInsights video
v IBM InfoSphere BigInsights Enterprise Edition: Efficiently manage and mine big
data for valuable insights

Chapter 3. Tutorial: Importing data for analysis
Learn how to import data into your distributed file system from your local system
or network by using the InfoSphere BigInsights Console and IBM-provided
applications.

Business data is stored in various formats and comes from many sources. Before you import your
data into the InfoSphere BigInsights distributed file system, you must determine
what questions you want to answer through analysis, identify the data type of
your sources, and use the tools and procedures that best fit your business need.
You can use InfoSphere BigInsights with your existing infrastructure or data
warehouse to import data and content in its original formats, or you can import
huge volumes of at-rest (static) data or incoming data in motion (continually
updated data). After you import your data, you can explore the data separately or
combine the data to complete exploration and analysis.

Many businesses might want to examine the popularity of a specific brand or service in social media. The data that is provided for this lesson is the result of a BoardReader application search for instances of the phrase "IBM Watson™" on the Internet. This search is detailed in the developerWorks® article, Analyzing social media and structured data with InfoSphere BigInsights: Get a quick start with BigSheets. IBM Watson is a research project that uses complex analytics to answer questions that are presented in a natural language.

For this tutorial, and the related tutorial on BigSheets, only news and blog data
that was returned by the search is used. The returned data was slightly modified
to contain only a subset of the information that the BoardReader application
collects from blogs and news feeds. The full-text/HTML content of posts, news
items, and certain metadata, was removed to keep the size of each file manageable.

The BoardReader application requires a license for use. If you have a license, you
can choose to follow the steps in the lesson on using the BoardReader application
(Lesson 2), or download the data to your computer and import it to the InfoSphere
BigInsights distributed file system for use with the Distributed File Copy
application (Lesson 3). To obtain a license, see the BoardReader website.

Learning objectives

After you complete the lessons in this tutorial, you will understand the concepts
and know how to:
v Create a folder for your sample data in the InfoSphere BigInsights distributed
file system.
v Collect and import data by using the BoardReader application.
v Import data from your local system or network by using the Distributed File
Copy application.
v Locate imported data in the distributed file system for use in BigSheets, Big
SQL, and Text Analytics.

Time required

The time required to complete this tutorial depends on which method you choose to import your data, the cluster configuration, and the number of nodes available for your use. If you choose to complete the BoardReader lesson, this tutorial will take approximately 20 minutes to complete. If you use only the Distributed File Copy application, this tutorial will take approximately 5 minutes to complete.

Prerequisites

Before you begin this tutorial, ensure that you installed the InfoSphere BigInsights
tools for Eclipse, and that you have access to the application through the
InfoSphere BigInsights Console.


Lesson 1: Managing your data


Before you import your data, it is important to determine how you want to
manage your data in the InfoSphere BigInsights distributed file system.

For this module, there are two options for gathering your data. However, to best
manage your information you should first create a folder to store the data.
1. Open the InfoSphere BigInsights Console.
2. From the Files tab, select the DFS Files tab.
3. Create a directory to store this data in the distributed file system. Click the Create Directory folder icon in the Files toolbar.


4. Name your directory. For this lesson, in the DFS Files tab, create the directory
bi_sample_data. If you are using the InfoSphere BigInsights Quick Start
Edition, the home directory is /user/biadmin/.
5. In the bi_sample_data directory, create a subdirectory named bigsheets where
you can store and access this same IBM Watson data for the BigSheets tutorial.

You now have a directory to store all of your source data files and application
results.
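
If you prefer to work from a command line instead of the console, you can create the same directory structure directly in the distributed file system with the hadoop fs utility. This is only an equivalent sketch; the paths assume the InfoSphere BigInsights Quick Start Edition layout that is used in this lesson:
hadoop fs -mkdir -p /user/biadmin/bi_sample_data/bigsheets
hadoop fs -ls /user/biadmin/bi_sample_data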

Lesson 2: Importing data by using the BoardReader application
The data that is used in this tutorial is gathered by using the BoardReader
application. This application is just one method of collecting data and importing it
into the InfoSphere BigInsights distributed file system.

To use the BoardReader application, each customer must contact BoardReader to obtain a valid license key. To obtain a license, see the BoardReader website. If you do not have access to a BoardReader license, you can follow along to learn the steps to use the application, or skip to the next lesson to download the finished data by using the Distributed File Copy application.

The following is an example of what the properties file for BoardReader might look like, where your_key_value is your license key:
boardreaderkey=your_key_value

You must create a credential file with the BoardReader key. There are private and
public files in the credentials store. The private credentials store contains your
private information in the /user/username/credstore/private directory.
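
One way to stage such a properties file, shown here only as a sketch with BoardReader.properties as an illustrative file name, is to copy it from your local system into the private credentials store directory in the distributed file system:
hadoop fs -put BoardReader.properties /user/biadmin/credstore/private/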

If you want to import data by using an SFTP or FTP connection, make sure that
this connection is running on your system.

Collecting social media data can be challenging because each site can hold different
information and use varying data structures. Also, visiting numerous sites to
gather your information is a time-consuming process. For this lesson, the
BoardReader sample application that is provided with InfoSphere BigInsights can
search blogs, news feeds, discussion boards, and video sites.
1. Deploy the BoardReader application to make it available for your use.
a. In the InfoSphere BigInsights Console, in the Applications tab, click
Manage.
b. From the navigation tree, expand the Import directory.
c. Select the BoardReader application, and click the Deploy button.
d. In the Deploy Application window, select Deploy.
2. From the toolbar on the top of the hierarchy tree window, select Run.
3. Select the BoardReader application.
4. Define the Execution name of your project. This step creates a project, and you
can track the results and reuse the project later. For example, enter the
Execution name br_ibmwatson.
5. Define your application parameters.
a. In the Results path field, specify the directory for the application's output.
Use the Browse button to locate the file /bi_sample_data/bigsheets in the
Hadoop Distributed File System (HDFS) directory. If you are using the
InfoSphere BigInsights Quick Start Edition, the directory is
/user/biadmin/bi_sample_data/bigsheets.
b. Define the Maximum matches that you want to be returned from the
search. Since you want to be able to use this data for full scale analysis, use
the range 1,000.
c. Select a Start date and an End date. Define a specific past time frame for
the BoardReader to search. To search for this Watson data, define the start
date as January 1, 2011. Define the end date as March 31, 2012.
d. Select a Properties file. The Properties file references the file in the
InfoSphere BigInsights credentials store that was populated with the
BoardReader license key.
e. In the Search terms field, enter the term "IBM Watson" as the subject of this
search. This string causes the BoardReader application to search for any
instance of both terms appearing together.
6. Select Run to run the search in the BoardReader application. The data is
imported to the specified results path.
7. Verify that the BoardReader application conducted a successful search. You can examine the status in the Application History panel. Return to the Files tab, and locate the /bi_sample_data/bigsheets directory to locate your search results. If you are using the InfoSphere BigInsights Quick Start Edition, the directory is /user/biadmin/bi_sample_data/bigsheets.
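
You can also confirm from a command line that the application wrote its output, assuming the Quick Start Edition paths that are used above:
hadoop fs -ls /user/biadmin/bi_sample_data/bigsheets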

Lesson 3: Importing data by using the Distributed File Copy application

The Distributed File Copy application copies files between a remote source and the InfoSphere BigInsights distributed file system by using Hadoop Distributed File System (HDFS), GPFS™, FTP, or SFTP. You can also copy files to and from your local file system.

To use the Distributed File Copy application with SFTP, you can create a credential
file. There are private and public files in the credentials store. The private
credentials store contains the private information for each user that is in the
/user/username/credstore/private directory. The following is an example of what
the properties file for SFTP might look like:
database=db2inst2
dbuser=pascal
password=[base64]LDo8LTor

Note: The Distributed File Copy application is designed to move large amounts of
data. This application is designed to run on a Linux platform. To upload smaller
data sets (less than 2 GB), you can use the Upload function from the Files tab in the
InfoSphere BigInsights Console.

For this lesson, you will download the IBM Watson data that was the result of the
BoardReader application search to your local system, and then upload it to the file
system for analysis.

Before you begin, you must first download the data to your local system. The data
is in the Download section of the developerWorks article, "Analyzing social media
and structured data with InfoSphere BigInsights: Get a quick start with BigSheets".
Accept the terms and conditions and save the file article_sampleData to your local
system. After you unzip the file, the article_sampleData folder should contain the
files RDBMS_data.csv, blogs-data.txt, news-data.txt, and a README.txt file that
details the data output.
1. Deploy the Distributed File Copy application to make it available for your use.
a. In the InfoSphere BigInsights Console, in the Applications tab, click
Manage.
b. From the navigation tree, expand the Import directory.
c. Select the Distributed File Copy application, and click the Deploy button.
d. In the Deploy Application window, select Deploy.
2. From the toolbar on the top of the hierarchy tree window, select Run.
3. Select the Distributed File Copy application.
4. Define your application parameters.
a. Specify an Execution name. This step creates a project, and you can track
the results and reuse the project later. Name the execution dc_ibmwatson.
b. In the Input path field, specify the fully qualified path to the article_sampleData file on your local file system. For example, sftp://username:password@localhost/file/path/article_sampleData/blogs-data.txt. If you provide just the directory name as input, all of the files in the local directory will be uploaded. The default is HDFS if an SFTP or FTP connection, or a GPFS file system, is not specified.
c. In the Output path field, specify the fully qualified path to where you want
to store the data, for example /bi_sample_data/bigsheets/
article_sampleData. If you are using the InfoSphere BigInsights Quick Start
Edition, the directory is /user/biadmin/bi_sample_data/bigsheets/
article_sampleData. Make sure to include the name of the file that you
want to import in the file path to prevent the folder from being mistaken as
the name of the data file.
d. Optional: If you are using SFTP to connect to your local file system, use the
Browse button to specify the fully qualified path to your properties file in
the InfoSphere BigInsights credentials store.
5. Select Run to import the file.
6. Repeat steps 4 and 5 for the news-data.txt file.
7. Verify that the Distributed File Copy application conducted a successful import.
To verify the import, you can examine the status in the Application History
panel. Return to the Files tab, and locate the /bi_sample_data/bigsheets
directory to locate your import results. If you are using the InfoSphere
BigInsights Quick Start Edition, the directory is /user/biadmin/bi_sample_data/
bigsheets/article_sampleData.

Summary of importing data to the distributed file system


In this module, you learned how to use the Distributed File Copy
application and the BoardReader application to import data into the InfoSphere
BigInsights distributed file system.

Lessons learned

You now have a good understanding of the following tasks:


v Creating a new directory in the InfoSphere BigInsights distributed file system.
v Deploying an IBM-provided application.
v Collecting and importing data with the BoardReader application.
v Importing data with the Distributed File Copy application.
v Locating your data for use with BigSheets, Big SQL, or Text Analytics.

Chapter 4. Tutorial: Analyzing big data with BigSheets
Learn how to use BigSheets, a browser-based tool that is included in the
InfoSphere BigInsights Console, to analyze and visualize big data.

BigSheets uses a spreadsheet-like interface that can model, filter, combine, and
chart data collected from multiple sources, such as an application that collects
social media data by crawling the Internet.

Data is categorized and formatted by creating master workbooks, read-only representations of your complete original data set. From these master workbooks, you can derive child workbooks, editable versions of the master workbooks in which you can create specific sheets to manipulate and analyze your data.

In this tutorial, you link social media data about IBM Watson with simulated internal IBM data about media outreach efforts. Your goal is to analyze the visibility, coverage, and sentiment around IBM Watson, a common requirement for data analysts who report on their own products.

This tutorial teaches you the key aspects of BigSheets so that you can quickly
begin analyzing your own big data.

Learning objectives

After you complete the lessons in this module, you will understand the concepts
and processes associated with:
v Creating master workbooks from files that you upload into your distributed file
system cluster
v Creating child workbooks to tailor and explore data
v Merging data from two sources into one workbook
v Creating columns to group and sort data
v Viewing data in diagrams to see the history of a workbook and relationships
between workbooks
v Charting and refining the results of your analysis
v Exporting your results

Time required

This module takes approximately 60 minutes to complete.


Lesson 1: Creating master workbooks from social media data


In this lesson, you upload two social media data files from the Internet to your
cluster and use these files to create two new master workbooks.

Note: For the purposes of this tutorial, you are uploading sample data files that
are less than 2 GB. To load files larger than 2 GB, you must use the Import feature.
For more information, see Tutorial: Importing data for analysis.

Master workbooks protect and preserve the raw data in its original form. If, during
your data explorations, you accidentally remove a column, you can create a new
child workbook from the master workbook without reloading the original data.

Master workbooks also model the data format. This format is determined by
applying a reader, a data format translator that maps data into the spreadsheet-like
structure necessary for BigSheets. BigSheets provides several built-in readers for
working with common data formats.
1. Collect the social media files:
a. In your web browser, enter the following URL: http://www.ibm.com/
developerworks/data/library/techarticle/dm-1206socialmedia/. This
URL takes you to a BigSheets article on IBM developerWorks.
b. Scroll down until you see the Download section. Click the sampleData.zip
file, review the terms and conditions, and then click I ACCEPT THE
TERMS AND CONDITIONS.

c. In the opening sampleData.zip window, select Save File, and click OK.
The sampleData.zip file is saved to the default location of your
downloaded files. For example, on a Windows system, the default
download directory is often C:\Documents and Settings\Administrator\My
Documents\Downloads.
2. Extract and upload the files to your cluster:
Typically, you create master workbooks on the BigSheets tab from files that
are already in your cluster.
a. Navigate to the location of the downloaded sampleData.zip file, and
extract the sampleData.zip file to a local directory. For example, on a
Windows system, you may extract the files to C:\temp.
b. Open the InfoSphere BigInsights Console by pointing your browser to
http://host:port/, and then click the Files tab.
c. Expand the main hdfs:// directory and navigate to the biginsights >
sheets directory by expanding the tree next to each directory.
d. Make sure that the sheets directory is highlighted, and click the Create Directory icon.
e. In the Name field of the Create Directory window, enter Watson_data, and
click OK.

f. Click the Upload icon.


g. In the Upload window, click Browse, and navigate to the extracted files
location. Select the SampleData/article_sampleData/blogs-data.txt file,
and click Open. The blogs-data.txt file is listed under Files to Upload.
h. Click Browse again, select the news-data.txt file, and click Open. The
news-data.txt file is listed under Files to Upload.
i. In the Upload Files window, click OK to upload the files. It might take a
minute to load the files. The window refreshes, and you can see the
blogs-data.txt and news-data.txt files in the Watson_data directory.
3. Select the blogs-data.txt file, and click the Sheet radio button. In the
Preview area of the window, you see that the data is not displayed properly.
It is formatted in a JSON Array structure.
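
Learn more about the JSON Array format: Each entry in blogs-data.txt is a JSON array of records rather than delimited text, which is why the default reader cannot lay it out in columns. A single record might look something like the following sketch; the field names match columns that you work with later in this tutorial, but the values shown here are purely illustrative:
[{"Country":"us","Language":"English","Published":"2011-02-16","SubjectHtml":"IBM Watson wins on Jeopardy!","Tags":"IBM Watson","Type":"blog","Url":"http://example.com/posts/123","IsAdult":0,"PostSize":2048}]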
4. Apply a different reader to map the data to the spreadsheet format:

a. Click the Edit icon.


b. Select JSON Array from the drop-down list, and click the green check mark. You immediately see the data map to the columns and rows of the spreadsheet-like interface in the Preview area.
c. Since the data columns exceed the viewing space, click Fit column(s). The
first eight columns display in the Preview area.

Note: Depending on the size of your web browser window, you might
need to scroll to see Fit column(s).
d. Click Save as Master Workbook.
e. In the Name field, enter Watson_Blogs. Spaces are valid characters for
workbook names.
f. In the Description field, enter Watson blog data from blogs-data.txt,
then click Save.

5. Click the Workbooks link in the breadcrumb at the top of the window. You
are moved to the BigSheets tab, and you see your new master workbook,
Watson_Blogs.
6. Click New Workbook.
7. In the Name field, enter Watson_News.
8. In the Description field, enter Watson news feed data from news-data.txt.
9. Under DFS File, navigate to the /biginsights/sheets/Watson_data directory,
and select the news-data.txt file. The right side of the window displays the
file name and contents. This data is also in JSON Array format.

10. Click the Edit icon, select JSON Array from the drop-down list, and click the green check mark to apply the reader.


11. Since the data columns exceed the viewing space, click Fit column(s). The first
eight columns display in the Preview area.

12. Save the master workbook by clicking the green check mark in the lower right corner of the screen.

Note: Depending on the size of your web browser window, you might need to scroll to see the green check mark.


You are moved to the BigSheets tab, and you see your new workbook,
Watson_News.
13. View both new master workbooks by clicking the Workbooks link in the
breadcrumb at the top of the window.

You are now ready to explore the data that you loaded.

Lesson 2: Tailoring your data by creating child workbooks


Typically, before you analyze and explore data, you must tailor its format and
content. In this lesson, you create child workbooks from each master workbook
and remove unwanted columns to refine the amount and type of your data.

In addition to protecting the original data, master workbooks set the data format
(including the data types for the columns). Therefore, you must create child
workbooks in which to modify your data. Child workbooks inherit their format
and data from their master workbooks, but you can tailor their attributes to
display only necessary data.
1. From the BigSheets tab of the InfoSphere BigInsights Console, select the
Watson_News master workbook.
2. Click Build new workbook.
A new workbook is created with the name: Watson_News(1).

3. Rename the workbook by clicking the Edit icon, entering Watson News Revised, and clicking the green check mark.


4. To see columns A through H within your web browser, click Fit column(s).

5. For your analysis, you do not need the IsAdult column (column E). Remove it
by clicking the down arrow in the column heading and selecting Remove.

Learn more about column actions: Notice all the column actions that are
available to you in the drop-down list. You can rename, hide, and remove a
column; insert a new column; sort the data in a column; and organize the
columns.
When you remove columns from a child workbook, you delete only the data
from the child workbook. The master workbook on which this child workbook
is based always contains the original data as it was loaded. If you decide later
that you want the IsAdult data in your analysis, you can create another child
workbook from the Watson_News master workbook.

Why not just hide the IsAdult column?: When you hide a column, the data
in that column is still included when you run the workbook or create a chart.
The only way to remove the data from the analysis or chart is to remove the
column.
6. As you review the data in this Watson News Revised child workbook, you
decide that you do not need several other columns. You can use the same
method, as in the previous step, to remove them one at a time or remove
multiple columns at once:
a. Click the down arrow in any column heading, and select Organize
Columns.
b. Click the red X next to the following columns to mark them for
removal:
v Crawled
v Inserted
v MoveoverUrl
v PostSize

c. Click the green check mark to remove the columns.

Tip: If you accidentally remove more columns than you intend, you can
click Undo to undo your last action.
7. Click Fit column(s) to resize the remaining columns. You now see columns A
through H:
Table 1. View of the table after you click Fit column(s)
A B C D E F G H
Country FeedInfo Language Published SubjectHtml Tags Type Url

8. Save and exit the workbook by clicking Save and selecting Save & Exit. If you
are prompted with a Save workbook window, you can save the workbook
with or without entering a description.
9. You are prompted with the message This workbook has never been run.
Press Run to run it or Close to dismiss this message. Click Run. You see
a progress indicator in the upper right corner of the window.
Until now, you have been working with a subset of the Watson and internal
IBM data. BigSheets keeps only a limited number of rows in memory. The
lower right corner displays a message that indicates you are seeing only a
simulated sample of 50 rows of data. When you run the workbook, you apply to the full data set all of the changes that you made since the last time you saved the workbook.

The progress bar monitors the progress of the job. Behind the scenes, Pig
scripts initiate MapReduce jobs. The runtime performance depends upon the
volume of data that is associated with your data collection and the system
resources that are available.
10. Now, create a child workbook from the Watson_Blogs master workbook, and
remove the columns that are not needed for your analysis:
a. To return to the page that displays all your workbooks, click the
Workbooks link.
b. Select the Watson_Blogs master workbook, and click Build new workbook.
A new workbook is created with the name: Watson_Blogs(1).

Learn more about the differences in icons for master workbooks and child workbooks: Notice that the Watson News Revised workbook has a child workbook icon that looks like a mini spreadsheet next to it, whereas the Watson_Blogs and Watson_News master workbooks have a different icon that looks like a lock over the spreadsheet image, indicating that the master workbook is read-only. You can quickly distinguish master workbooks from child workbooks by these icons.

c. Rename the new child workbook by clicking the Edit icon, typing Watson Blogs Revised, and clicking the green check mark.


d. Use the Organize Columns function to remove the following columns:
v Crawled
v Inserted
v IsAdult
v PostSize
Remember to select the green check mark in the Organize Columns
window. Now, the Watson News Revised and Watson Blogs Revised
workbooks contain the same columns. To merge workbooks, each
workbook must contain the same data types and columns, or schema.
e. Save and exit the workbook.
f. When prompted, click Run to apply the changes that you made to the
child workbook.

Because both new child workbooks have the same schema, you can merge them
into a new workbook, where you can explore and analyze your data.

Lesson 3: Combining the data from two workbooks


In this lesson, you combine the data from the two child workbooks into a single
data collection. By merging the data, you have a central place to explore, analyze,
and chart the coverage of the IBM Watson data.

To merge the data, create a new workbook from an existing workbook, then load
the data from the second workbook into the new workbook.
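
If you are familiar with databases, the Load and Union sheets that you use in this lesson behave much like concatenating two identically structured tables. A rough SQL sketch of the same idea (the table names are illustrative, and BigSheets actually generates Pig and MapReduce jobs rather than running SQL) would be:
SELECT * FROM watson_news_revised
UNION ALL
SELECT * FROM watson_blogs_revised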

1. In the InfoSphere BigInsights Console, click the BigSheets tab, and select the
Watson News Revised workbook.
2. Click Build new workbook. The name of the workbook is Watson News
Revised(1), indicating that it is a child workbook of Watson News Revised.
You change the name of this workbook later when you save and exit the
workbook.
3. Click Add sheets, and select Load.

Learn more about types of sheets: Each type of sheet provides different
predefined logic for analyzing data. Use the Load sheet to include the data of
another workbook as a sheet in the current workbook.
4. In the Load window, select the Watson Blogs Revised workbook link from the
list of existing workbooks.
5. In the Sheet Name field, enter Watson Blogs Revised. In the Load window,
you see details of the columns and the first few rows of data in that
workbook.

6. Click the green check mark. At the bottom of your workbook, you see
two tabs, Watson News Revised and Watson Blogs Revised.
7. Click Add sheets, and select Union.
8. In the Sheet Name field of the New sheet: Union dialog, enter News and Blogs
to indicate that this sheet contains the merged data.
9. From the Select sheet drop-down list, select the Watson News Revised sheet, and click the green plus sign to add it; the sheet moves to the bottom of the dialog. Repeat for the Watson Blogs Revised sheet. Then click the green check mark to add both sheets. Your workbook now displays the new tab, News and Blogs, at the bottom of your screen.
10. Click Save. When prompted for a name and description, enter Watson News
Blogs in the Name field and Combined news and blogs data in the
Description text box, and click Save.

You successfully combined the blog and news data into one workbook, where you
can analyze and explore the data. Next, you group similar data from multiple
columns into one column.

Lesson 4: Creating columns by grouping data


In this lesson, you learn how to create columns by grouping similar information.
You want to discover how many news articles and blog posts are written in each
language. You accomplish this goal by using the Group sheet and its functions to
combine, calculate, and sort the language data.

First, use the Calculate function to count the number of articles and posts by language. Then, sort the results by the number of posts to display the most popular languages first.
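
Conceptually, the Group and Calculate steps that follow behave like a grouped count. A minimal SQL sketch of the equivalent aggregation (the names are illustrative, and BigSheets runs Pig and MapReduce jobs behind the scenes rather than SQL):
SELECT Language, COUNT(Language) AS NumberArticlesandPosts
FROM news_and_blogs
GROUP BY Language
ORDER BY NumberArticlesandPosts DESC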

1. Make sure that the Watson News Blogs workbook is open. If the workbook is
not open, from the BigSheets tab, click the Edit link next to Watson News Blogs.
Clicking the Edit link opens the selected workbook in edit mode, so you do not have to click the Edit icon.


2. Click Add Sheets, and select Group.
3. In the New sheet: Group window, complete the required information:
a. In the Sheet name field, enter Group by language.
b. From the Group by columns drop-down list, select Language, and click the
green plus sign to add the column. The Language column name
displays in the bottom of the dialog.
c. At the bottom of the window, click the Calculate tab.
d. In the Create columns based on groups text box, enter NumberArticlesandPosts, and click the green plus sign.


e. From the NumberArticlesandPosts drop-down list, select COUNT.
f. From the Column drop-down list, select Language, then click the green check mark.
On the Group by language sheet, you see two columns, Language and
NumberArticlesandPosts. The Language column displays all the languages
from the News and Blogs sheet. The NumberArticlesandPosts column counts
the number of posts in each language.
4. To see the most common languages for posts about IBM Watson, sort the
Group sheet by the number of posts. Click the drop-down arrow to the right of
the NumberArticlesandPosts column, select Sort, and select Descending. You
see that English is the most popular language with 3169 posts, followed by
Russian, Spanish, and Chinese - Simple. But notice that Chinese (spelling) and
Chinese - Traditional are also near the top of the list. You combine these values
into one Chinese language value later when you create a chart.
5. Click Save > Save & Exit to save and close the workbook.
6. Click Run to save, sort, and process the entire data set for the workbook. You
see a progress indicator in the upper right corner of the window. After you run
the workbook, you see different results for the number of English posts in the
NumberArticlesandPosts column, 5464.

Next, view your workbooks and sheets in the BigSheets diagrams. Then, visualize the results of your analysis by creating and refining charts.

Lesson 5: Viewing data in BigSheets diagrams


In this lesson, you view diagrams in BigSheets to understand the relationships
between workbooks and sheets and the process of modifying data in a workbook.

The Workbook Diagram beside a child workbook shows the sheets and
processes that created the selected workbook, the relationships between
workbooks, and on which master or child workbook the current workbook is
based.

The Workflow Diagram shows how you used the Watson Blogs Revised and
Watson News Revised child workbooks to create the Watson News Blogs workbook.
You can also see the source (master workbook) for each child workbook.
1. From the Watson News Blogs workbook, click the Workflow Diagram icon.
2. When you are finished looking at the diagram, click the red X in the upper right corner.

3. Click the Workbook Diagram icon. In the diagram, you can see the
types of sheets and history of the current workbook.
4. When you are finished looking at the diagram, click the red X in the upper
right corner.

You now know how the two diagrams in BigSheets can help you visualize the
relationships between workbooks and sheets and the process of creating a
workbook. Now you are ready to explore and analyze the data in the Watson News Blogs workbook.

Lesson 6: Visualizing and refining the results in charts


In this lesson, you visualize the results of the sorted and combined Watson blogs
and news data in the Watson News Blogs workbook by creating a simple horizontal
bar chart. Then, you refine your chart to improve your data.

BigSheets provides various charts, clouds, and maps. A chart plots data points in a grid,
such as a typical pie or bar chart. A cloud shows the importance of values by
displaying the size of the words relative to their importance. A map contains charts
that represent geographic data, such as a heat map that shows the concentration of
data points geographically.
1. Open the Watson News Blogs workbook, click Add chart, and then select chart
> Horizontal Bar.
2. In the New chart: Horizontal Bar window, enter or select the following values:
a. In the Chart Name field, enter Language Coverage. The chart name is the
name that displays on the tab at the bottom of the worksheet.
b. In the Title field, enter IBM Watson Coverage by Language. The title of the
chart displays at the top of the chart.
c. From the X Axis drop-down list, select NumArticlesandPosts.
d. In the X Axis Label, enter Number of posts.
e. From the Y Axis drop-down list, select Language.
f. In the Y Axis Label, enter Language of post.
g. From the Sort By drop-down list, select X Axis. You want to sort by the
number of posts.
h. From the Occurrence Order drop-down list, select Descending. You want
to see the language with the highest number of posts first.
i. In the Limit field, enter 12. You want to see only the top 12 languages by
the number of posts.
j. Leave the Template and Style default values.

k. Click the green check mark to preview the chart with sample data.
3. Click Run to generate the chart from the full set of workbook data. Even
though you see the preview chart immediately, the actual chart is not displayed until you see 100% on the progress bar. It might take some time to
generate the chart from the full set of data. Use the progress bar to monitor
the status of the completed chart.
After the bar chart is generated, you can see that Russian is the second most
popular language for posts. You also see that the fifth and sixth most popular
languages are variations on the Chinese language. By combining these values,
Chinese is actually the second most popular language for posts. This situation
is common, especially when you combine data from various sources such as
different social media sites.
4. To clean up the data, combine the Chinese languages and post numbers:
a. Click Edit. You return to the Group by language sheet.
b. Select the News and Blogs sheet by clicking the tab name at the bottom of
the window.
c. Insert a new column by clicking the down arrow next to the Language
column and selecting Insert Right > New Column.
d. Enter Language_Revised for the name of the new column, and then click
the green check mark to create the column. Your cursor moves to
the fx (or function) area, where you provide the function to generate the
contents of the new column.
e. Enter the following formula as the function: IF(SEARCH('Chin*',
#Language) >0, 'Chinese', #Language), and click the green check mark
to apply the formula and generate the values for the new
Language_Revised column.
This formula searches the Language column (indicated by #column_name)
for any value that starts with Chin and combines those values into one
value in the Language_Revised column. The wildcard asterisk character
ensures that all variations of the Chinese language, regardless of spelling
or words that follow the word Chinese (such as Chinese Simple), are
included. If the value does not start with Chin, then the formula copies the
value, as is, into the Language_Revised column.

Learn more about BigSheets functions and formulas: To understand how
to use the BigSheets functions, and to see some examples of using
formulas, see Formulas.
f. Change the settings for the Group by language sheet to use the new
column by clicking the down arrow next to the Group by language sheet
and selecting Sheet Settings.
g. In the Group window, from the Group by Columns drop-down list, select
Language_Revised, and click the green plus sign to add the column.
h. Click the red X next to the Language column to remove it. You want
to group and calculate the number of posts by the Language_Revised
column instead of the Language column.
i. Click the Calculate tab. In the Column drop-down list, select
Language_Revised.

j. Click the green check mark to apply your changes.



The new Language_Revised column replaces the Language column to the
right of the NumArticlesandPosts column. Click the B at the top of the
Language_Revised column and drag it to the left of the NumArticlesandPosts
column.
5. Click Save > Save & Exit, then click Run to update the entire data set for the
workbook.
6. Click the Language Coverage sheet, and you see a message that An error
occurred while sampling the chart. You updated the Group by language
sheet to use the Language_Revised column, but the current Language
Coverage chart is based on the Language column.
7. Click OK to close the error message.
8. Delete the previous Language Coverage chart by clicking the triangle to the
right of the Language Coverage sheet, and selecting Delete chart.
9. Click Add chart, and select chart > Horizontal Bar to create another chart that
is based on the updated data.
10. In the New chart: Horizontal Bar window, enter or select the following values:
a. In the Chart Name field, enter Language Coverage. The chart name is the
name that displays on the tab at the bottom of the worksheet.
b. In the Title field, enter IBM Watson Coverage by Language. The title of the
chart displays at the top of the chart.
c. From the X Axis drop-down list, select NumArticlesandPosts.
d. In the X Axis Label, enter Number of posts.
e. From the Y Axis drop-down list, select Language_Revised.
f. In the Y Axis Label, enter Language of post.
g. From the Sort By drop-down list, select X Axis. You want to sort by the
number of posts.
h. From the Occurrence Order drop-down list, select Descending. You want
to see the language with the highest number of posts first.
i. In the Limit field, enter 12.

j. Click the green check mark to preview the chart with sample data.
11. Click Run to generate a new chart. After the chart completes, all Chinese
languages are combined into one bar in the bar chart, which shows Chinese as
the second most popular language for posts and Russian as the third. If you
hover over the bars in the chart, you can see the actual numbers of posts.

You used BigSheets to generate a simple horizontal bar chart from your social
media data collections. You also analyzed the bar chart and refined the data to
determine the 12 most commonly used languages to generate posts about IBM
Watson.

Next, you learn how to export data from your workbooks.

Lesson 7: Exporting data from your workbooks


In this lesson, you export the data from the Watson News Blogs workbook into
a web browser tab and a CSV file.

You might want to share the results of your BigSheets analysis with colleagues
who do not have direct access to IBM InfoSphere BigInsights. You can export your
analysis results in various data formats, including CSV (comma-separated values),
JSON Array, and TSV (tab-separated values).



You can also export the data to a new web browser tab or to the Hadoop
Distributed File System (HDFS). Or you can save the data, if your privileges
include saving files to the cluster.
1. If the Watson News Blogs workbook is not open, open it. From the BigSheets
tab, select Watson News Blogs. Do not click Edit to open the workbook in Edit
mode. You cannot export workbooks from Edit mode. If you do open the
workbook in Edit mode, you see Add sheets instead of Export data.
2. Export your data to a browser tab:
a. Click Export data. The Export to option is set by default to Browser Tab.
b. Click OK to export the Result sheet data into a new tab on your web
browser.
3. Click the IBM InfoSphere BigInsights tab in your web browser to return to
the InfoSphere BigInsights Console and the Watson News Blogs workbook.
4. Export your data to a CSV file:
a. Click Export data again.
b. In the Format Type drop-down list, select CSV, which produces a
comma-separated value file.
c. In the Export data option, select File.
d. Click Browse and enter or select the following parameters:
1) Set the path by opening the main hdfs:// folder and selecting the tmp
folder.
2) In the file name text box at the bottom of the window, enter
watson_news_blogs as the file name, then click OK. File names can
contain spaces. Avoid entering special characters in the file name.
e. Select the Include Headers check box to include column names in the file.
This option is available only if you selected either CSV or TSV format.
f. Click OK. You receive a message that the Workbook has been successfully
exported. Click OK to close the message. You can check your results by
clicking the Files tab and opening the main hdfs:// folder and then the tmp
folder. You see the watson_news_blogs.csv file listed.

You just exported the results of the Watson News Blogs workbook into both a web
browser tab and a CSV file on your distributed file system (DFS) cluster.

You can use the exported data in worksheets or in Big SQL.

Summary of analyzing data with BigSheets tutorial


In this tutorial, you analyzed social media data from two sources: you created
master and child workbooks, tailored that data to your analysis goals, generated
charts to visualize and refine your results, and exported your results.

Lessons learned

You now have a good understanding of how to:


v Create master workbooks from data files on your cluster
v Create child workbooks from both master workbooks and other child workbooks
v Tailor and explore workbook data by removing unneeded columns and
combining two workbooks using both the Load and Union sheets
v Group similar data and create columns that calculate and sort the data using the
Group sheet
v View BigSheets diagrams and see relationships between workbooks and the
operations that you used to modify a workbook
v Visualize your analysis results in a simple horizontal bar chart
v Export BigSheets workbook data into both a web browser tab and a CSV
file

Extra resources

To learn more about how to use BigSheets to analyze your big data, see the
following resources:
v Overview of BigSheets
v Analyzing data with BigSheets


Chapter 5. Tutorial: Developing your first big data application
Learn how to develop your first big data application, which writes data to a file
and stores the results in your distributed file system.

Learn more about the sample code: The code that is shown in this tutorial is
intended for educational purposes only, and is not intended for use in a
production application.

You develop the application in Jaql, a query and scripting language that uses a
data model based on the JavaScript Object Notation (JSON) format. You learn how
to create, publish, and deploy the application by using the InfoSphere BigInsights
Tools for Eclipse so that you can run the application from the InfoSphere
BigInsights Console. This module does not cover the full syntax and usage of the
Jaql language, which is explained in the InfoSphere BigInsights Information Center.
However, you can apply many of the application development techniques in this
module to other applications.

Learning objectives

After completing the lessons in this tutorial, you will have learned how to
complete the following tasks:
v Create an InfoSphere BigInsights project.
v Create and populate a Jaql file with application logic.
v Test your application.
v Publish your application to the InfoSphere BigInsights catalog.
v Deploy and run your application on the cluster.
v Upgrade your application to accept input parameters.

Time required

This tutorial should take approximately 40 minutes to complete.

Prerequisites

The InfoSphere BigInsights Tools for Eclipse must be installed in your Eclipse
environment.

Experience with Eclipse is not required, but understanding the concepts and the
development environment might be helpful when working with the InfoSphere
BigInsights Tools for Eclipse.

Lesson 1: Creating an InfoSphere BigInsights project


In this lesson, you create an InfoSphere BigInsights project by using the InfoSphere
BigInsights Tools for Eclipse.

The project that you create will contain the files, applications, programs, and
modules that your application requires to run. After you create a project, you can
create an InfoSphere BigInsights program.
1. Open Eclipse.
2. Set the perspective to BigInsights.
a. Click Window > Open Perspective > Other.
b. Select BigInsights, and then click OK.
3. Click Help > Task Launcher for Big Data to open the Task Launcher for Big
Data.
4. From the Develop tab, under Quick Links, click Create a new BigInsights
project.
5. Enter WriteMessage as the project name, and then click Finish.

The project that you created, WriteMessage, displays in the Project Explorer pane.

Now that your project is created, you can create a program. In this module, you
are creating a Jaql application.

Lesson 2: Creating and populating a Jaql file with application logic


In this lesson, you create a Jaql file within your new project and include Jaql
statements that determine how the application logic operates. You create a simple
Jaql application with no input parameters.

Before you begin, ensure that the InfoSphere BigInsights Tools for Eclipse is open.

You should also ensure that you have write access to the biadmin directory. You
can check access by opening the InfoSphere BigInsights Console, and then selecting
the Files tab. Under DFS Files, select server name > user > biadmin. In the right
pane of the biadmin folder, view the permission column to verify your write
access.
1. From the Task Launcher for Big Data, click the Develop tab.
2. Under Tasks, click Create a BigInsights program.
3. In the Create a BigInsights program window, select JAQL Script, and then click
OK.
4. Select the parent folder, WriteMessage, enter MyJaql.jaql as the file name, and
then click Finish.
Your new file, MyJaql.jaql, opens in an editor within Eclipse.
5. Copy and paste the following code into the MyJaql.jaql file.
The following code writes the results of the Jaql query as a text file (myMsg.txt)
to the /user/biadmin/sampleData directory in your distributed file system. You
might need to modify the specified directory to match your environment.
Ensure that your user ID has write access to the directory that you specify.

// Sample message.
term='Hello World';

// Location of the output file. Modify this location to fit your environment.
output='/user/biadmin/sampleData/myMsg.txt';

// Write the sample message as a text file in HDFS.
write ([term], lines(location=output));

// Alternatively, you can hard code input parameters
// as shown in the following example.
// write (['Hello World!'],
//   lines(location="/user/biadmin/sampleData/myMsg.txt"));
6. Save and close the MyJaql.jaql file.



Now that your Jaql application contains logic, you can test it to ensure that the
program runs as designed.

Lesson 3: Testing your application


In this lesson, you define a connection to an existing InfoSphere BigInsights cluster
and configure the runtime properties for your application.

Before you begin this lesson, ensure that the InfoSphere BigInsights Tools for
Eclipse is open.
1. Create a server connection to the InfoSphere BigInsights Console. Because Jaql
is run from the Eclipse environment, your operating system user ID from
where you run the Jaql shell must be the same as your InfoSphere BigInsights
user ID.
a. In the Overview tab of the Task Launcher for Big Data, under First Steps,
click Create a BigInsights server connection.

Tip: If you upgrade the InfoSphere BigInsights server to a newer version,
or upgrade your InfoSphere BigInsights Tools for Eclipse, you might need to
refresh the configuration files of your InfoSphere BigInsights server in
Eclipse. Go to the InfoSphere BigInsights server view in Eclipse, and either
delete and register the InfoSphere BigInsights server again, or expand
the InfoSphere BigInsights server node, right-click Configuration Files and
select Refresh from the server.
b. Enter the URL for your InfoSphere BigInsights Console, including the server
name, user ID, and password.
c. Click Test connection to verify your server connection.
d. When you see a message indicating that the system successfully tested the
connection, click Finish.
2. In the Task Launcher for Big Data, from the Develop tab, under Tasks, click
Create a configuration and run a BigInsights program.
3. Select JAQL as the program type that you want to create, and then click OK.
4. Enter the input parameters for your program.
a. Enter MyJaqlProgram for the name of your program.
b. Select WriteMessage as the project name.
c. Select MyJaql.jaql as the Jaql script.
d. Select the server that you want to connect to.
5. Click Apply and then Run.
6. Review the output in the Eclipse Console pane and verify that no errors were
reported. If errors were reported, fix the errors and run the application again.
7. Log in to the InfoSphere BigInsights Console.

Option                      Description
In a non-SSL installation   Enter the following URL in your browser:
                            http://host_name:8080
                            host_name is the name of the host where the
                            InfoSphere BigInsights Console is running, and
                            8080 is the default port.
In an SSL installation      Enter the following URL in your browser:
                            https://host_name:8443
                            host_name is the name of the host where the
                            InfoSphere BigInsights Console is running, and
                            8443 is the default port.

8. In the InfoSphere BigInsights Console, from the Files tab, expand the directory
from step 5 in Lesson 2 in your distributed file system tree to locate the .txt
file that your application created.

Tip: After you identify a server connection and create a Jaql program
configuration, you can test Jaql statements directly from your MyJaql.jaql file.
Highlight the statement that you want to run, right click, and select Run the
JAQL statement.

Now that your Jaql program is working, you can publish it as an application in the
InfoSphere BigInsights applications catalog.

Lesson 4: Publishing your application in the InfoSphere BigInsights applications catalog
In this lesson, you publish your application in the InfoSphere BigInsights
applications catalog. When packaging and publishing, you identify the icon and
name for your application, define the application workflow, and complete related
tasks.
1. In the Task Launcher for Big Data, under the Publish and run tab, click
Publish a BigInsights application.
2. On the Location panel of the BigInsights Application Publish wizard, select the
WriteMessage project, specify the server that you want to publish the
application to, and then click Next.
3. On the Application panel, select the Create New Application radio button.
Optionally, enter a description for your application, provide a custom icon file,
and specify a category for your application (such as test). Click Next.
4. On the Type panel, select Workflow for your Application Type, and then click
Next.
5. On the Workflow panel, select the Create a new single action workflow.xml
file radio button, and select Jaql as your Action Type.
a. In the Properties table, select the script property, and then click Edit.
b. Accept the supplied value (script) that is shown in the Name field.
c. Enter MyJaql.jaql as the name of your Jaql file in the Value field.
d. Click OK and then click Next.
6. On the Parameters panel, accept the default properties, and then click Next.
The Parameters panel will be empty because your application currently does
not have any input parameters.
7. On the Publish panel, verify that the MyJaql.jaql file displays under the
workflow folder of your application package, and then click Finish.

Your application is published to the InfoSphere BigInsights applications catalog.



Now that your application is published to the applications catalog, you can deploy
it on the cluster so that other users can access the application from the InfoSphere
BigInsights Console.

Lesson 5: Deploying and running your application on the cluster


In this lesson, you deploy your application to the cluster so that other users can
run it from the InfoSphere BigInsights Console.
1. Log in to the InfoSphere BigInsights Console in a web browser. Ensure that the
user that you log in as has administrative access for deploying and running
applications.
2. Click the Applications tab, and then click Manage.
3. Search for and select the WriteMessage application that you created from the
list of applications, and then click Deploy. In the Deploy Applications window,
click Deploy. Your application is now deployed to the cluster and can be run.
4. On the Applications tab, click Run.
5. Select the WriteMessage application.
6. In the Execution Name field, enter WriteMessageTrial, and then click Run.
The application displays in the Application History pane, and shows the
progress of the application run.
7. When the application finishes running, click the arrow icon in the Details
column to display further information about the workflow and your
application run.

Lesson 6: Making your application more dynamic


In this lesson, you will make your application more flexible by modifying the code
to accept two input parameters.
1. Log in to the InfoSphere BigInsights Console.
a. Click Applications, and then click Manage.
b. Search for and select your application, and then click Undeploy. You must
undeploy applications to replace them with newer versions.
2. In the InfoSphere BigInsights Tools for Eclipse, right-click the project name,
click Copy, then right-click, and click Paste. In the Copy Project box, name
the project WriteMessageBackUp. Click OK.
3. Expand your WriteMessage project and then open the MyJaql.jaql file.
4. Delete the contents of the MyJaql.jaql file, and then copy and paste the
following code into the file. Block 1 declares two external variables, TERM and
OUTPUT, that represent input parameters that will become part of the user
interface for your application. Each parameter is assigned to a Jaql variable
(term and output).
Block 2 uses the output file information that you provide to write the results
to HDFS.
// Block 1
// Define the search parameters
extern TERM;
extern OUTPUT;

// Search term that the user enters as input.
term=[TERM];

// The full path and file name that the user enters for the output.
output=[OUTPUT];

// Block 2
// The following statement writes the input message as a text file in HDFS.
write ([term[0]],lines(location=output[0]));
5. Test your Jaql script locally with the parameters that you added.
a. In Eclipse, click Run > Run Configurations.
b. From the list of configurations, expand JAQL, and then select
MyJaqlProgram.
c. In the Run Configurations window, click the Arguments tab.
d. Copy the following statement and paste it into the Program arguments
field.
-e "TERM=’Hello World!’;OUTPUT=’/user/biadmin/sampleData/myMsg.txt’"
e. Click Run to test your application.
6. Publish your application to the InfoSphere BigInsights Console applications
catalog.
a. In the Project Explorer pane, right-click your project and select BigInsights
Application Publish.
b. Select the same InfoSphere BigInsights server that you used when
publishing your application previously, and then click Next.
c. On the Specify Applications panel, ensure that the Replace Existing
Application check box is selected. Accept the existing values for the
remaining items, and then click Next.
d. On the Type panel, select Workflow, and then click Next.
e. On the Workflow panel, select Create a new single action workflow.xml
file, and select Jaql as your Action Type. Because you are introducing new
parameters, you cannot accept the default setting to use the existing
workflow.
1) In the Properties table, select the script property, and then click Edit.
2) Accept the supplied value (script) that is shown in the Name field.
3) Enter MyJaql.jaql as the name of your Jaql file in the Value field.
f. Click New to create a new property. In the New Property window, select
eval for the Name field from the dropdown menu, and then enter the
following statement in the Value field.
TERM="${jsonEncode(term)}";OUTPUT="${jsonEncode(output)}";
When you run your application, you want to provide values for the TERM
and OUTPUT variables. To provide these values, you enter the previous
statement to assign an Oozie variable to each Jaql variable. Oozie is the
workflow engine that runs the application, and each Oozie parameter is
enclosed within a dollar symbol and braces, ${}.
To easily correlate the Oozie variables with the Jaql parameters, the same
variable names are used for the Oozie parameters (term and output). The
jsonEncode() function is used to escape special characters and avoid code
injection when users enter input in the InfoSphere BigInsights Console.
g. Click OK and then click Next. On the Parameters panel, all Oozie
parameters that you specified in the workflow are listed. You must select
each parameter and edit its properties to provide information about how
each parameter displays in the InfoSphere BigInsights Console.
h. For the term parameter, set the display name to Search term and the type
to string. Enter Hello World! as the default value, provide a brief
description, ensure that the Required check box is selected, and then click
OK.



i. For the output parameter, set the display name to Output file and the type
to File Path. Enter a path name for the default value, provide a brief
description, ensure that the Required check box is selected, and then click
OK.
j. On the Publish panel, verify your parameters, click Next, and then click
Finish.
7. In the InfoSphere BigInsights Console, on the Applications page, click
Manage, refresh the applications catalog, select your application, and then
click Deploy.
8. In the Configuration column, click the settings icon. Under Security, select all
available groups and then click Save.
9. On the Applications panel, click Run, and then select your application from
the list.
Your application prompts the user to specify a search term or phrase and
includes the default value that you specified in your input parameters. You
can browse to select an existing file in the distributed file system, or enter a
new location for the output file.
Accept the default search term, and then under Execution, click Run.
10. After your application runs, open the InfoSphere BigInsights Console, click
Files, and then navigate to the location that you specified for the output file.

You might need to refresh the navigation view by clicking the Refresh icon.
11. Optional: In the InfoSphere BigInsights Tools for Eclipse, expand the
WriteMessage project. The InfoSphere BigInsights Tools for Eclipse publication
wizard generated the workflow.xml file and the application.xml file.
a. To see the generated workflow, expand BIApp > workflow. Double-click
the workflow.xml file to open it in the InfoSphere BigInsights workflow
editor. From this editor, you can change the workflow without writing
XML code.
b. To view the values that you set for your input parameters, expand BIApp
> application. Double-click the application.xml file to open it in the XML
editor.
c. On the Design tab of the XML editor, expand application-template >
properties > property to view the values that you set.

Summary of developing your first big data application


In this tutorial, you created your first application, published it to the InfoSphere
BigInsights applications catalog, deployed your application to the cluster, and then
ran the application from the InfoSphere BigInsights Console.

Lessons learned

You now have a good understanding of the following tasks:


v Creating an InfoSphere BigInsights project by using the InfoSphere BigInsights
Tools for Eclipse.
v Establishing a server connection and testing your application.
v Publishing your application to the InfoSphere BigInsights applications catalog.
v Deploying your application to the cluster and running it in the InfoSphere
BigInsights Console.
v Upgrading your application to accept input parameters and redeploying it to the
InfoSphere BigInsights Console.

Additional resources

For more information about developing applications with InfoSphere BigInsights,
see this video on the IBM Big Data channel on YouTube.



Chapter 6. Tutorial: Developing Big SQL queries to analyze
big data
Learn how to use Big SQL, an SQL language processor, to summarize, query, and
analyze data in an Apache Hadoop distributed file system.

Big SQL provides SQL access to data that is stored in InfoSphere® BigInsights™ by
using JDBC, ODBC, and other connections. Big SQL supports large ad hoc queries
by using IBM SQL/PL support, SQL stored procedures, SQL functions, and IBM
Data Server drivers. These low-latency queries return information quickly, which
reduces response time and improves access to data.

The Big SQL server is installed with IBM InfoSphere BigInsights.

This tutorial uses data from the fictional Sample Outdoor Company. The Sample
Outdoor Company began as a business-to-business operation. It does not
manufacture its own products. The products are manufactured by a third party
and are sold to third-party retailers. The company has a presence on the web and
sells directly to consumers through the online store. For the last several years, the
company has steadily grown into a worldwide operation, selling its line of
products to retailers in nearly every part of the world.

You will learn more about the products and sales of the Sample Outdoor Company
by running Big SQL queries and analyzing the data in the following lessons.

Learning objectives

You will use the InfoSphere BigInsights Tools for Eclipse and the Big SQL Console
to create Big SQL queries so that you can extract large subsets of data for analysis.
In one lesson, you will export your query results to an open source spreadsheet to
see how you can bring your analysis down to a smaller environment.

After you complete the lessons in this module, you will understand the concepts
and know how to do the following actions:
v Use the InfoSphere BigInsights Tools for Eclipse to connect to the Big SQL
server.
v Use the InfoSphere BigInsights Tools for Eclipse to load sample data and to
create and run queries.
v Use BigSheets to analyze data that is generated from Big SQL queries and to
create Big SQL tables.
v Use the InfoSphere BigInsights Tools for Eclipse to export data.

Time required

Each module in this tutorial takes approximately 1 hour to complete, depending on
whether you also complete the optional lessons.

Skill level

Some familiarity with SQL is assumed. This tutorial includes lessons that are
relevant to both new and advanced Big SQL users.



Setting up the Big SQL tutorial environment
Before you can complete any of the lessons in this tutorial, you must set up your
Big SQL environment.

About this task

For the purposes of this tutorial, Eclipse is used as a client for the Big SQL server.
A few of the lessons also use Java™ SQL Shell (JSqsh), an open source
command-line client.

Procedure
1. Verify that Big SQL is started.
2. Verify that Eclipse is installed and that you have added the InfoSphere
BigInsights Tools for Eclipse.
3. Create a connection to a Big SQL server.
4. Connect to an InfoSphere BigInsights server.

Results

You have now established communication between Big SQL on InfoSphere
BigInsights and the Eclipse client environment.



For information about how to create a simple JDBC application that opens a
database connection, and runs a Big SQL query, see “Lesson 1.8: Advanced:
Creating and running a simple Big SQL query from a JDBC client application” on
page 63.

Creating a directory in the distributed file system to hold your samples


Before you obtain any data, you need a place on the Hadoop distributed file
system to hold the data and samples.

Procedure
1. In the InfoSphere BigInsights Console, click the Files tab.
2. Click the DFS Files tab, and then create a directory that you can use to hold
the SQL files that contain the DDL you must use:
a. Open the biadmin directory in this path: hdfs://<server-name>:9000/user/
biadmin/.



b. If the bi_sample_data directory does not exist in the /user/biadmin/ path,
click the parent directory, biadmin, and click the Create Directory icon.
c. In the Create Directory dialog window, type bi_sample_data as the new
directory name, and then click OK.

Getting the sample data


The Big SQL tutorials use sample data from two sources. You must access both
sources to do all of the lessons.

Accessing the Big SQL sample data installed with InfoSphere BigInsights
The Big SQL tutorial uses sample data that is provided in the $BIGSQL_HOME/
samples directory on the Linux file system of the IBM InfoSphere BigInsights
server. By default, the $BIGSQL_HOME environment variable is set to the installed
location, which is /opt/ibm/biginsights/bigsql/.

Before you begin

Quick Start Edition VM Users: If you are running the tutorial with the IBM
InfoSphere BigInsights Quick Start Edition VM image, skip this task, and instead,
follow the steps to access the data in “Quick Start Edition VM Image Users Only”
on page 41.

Because you will use the Distributed File Copy application with the SFTP protocol,
you must create a credential file that you store in the distributed file system (DFS).
A credential properties file contains the IBM Big SQL server credentials. If you do
not already have a credential file stored in the distributed file system, do the
following tasks:
1. Create a credentials property file by running the credstore.sh utility from
$BIGINSIGHTS_HOME/bin. For information about the credential store utility, see
Loading and storing credentials in the credentials store. Name the property file
bigsql.prop. The following is an example of the command to run for this
tutorial:
./credstore.sh store -pub bigsql.prop port=51000 \
  user=bigsql password=bigsql server=my.abc.com

Remove the -pub parameter to store the credentials in a private directory. If you
used a different user and password for the Big SQL administrator, update those
fields. Remember that the user name must have database administration
privileges. Modify the server information to match the host name of the cluster
from which you are running IBM InfoSphere BigInsights.
2. Open the InfoSphere BigInsights Console and click the Files tab.
3. Verify that your credentials file is in the distributed file system.

About this task

The Distributed File Copy application copies files to and from a remote source to
the InfoSphere BigInsights distributed file system by using Hadoop Distributed
File System (HDFS), GPFS, FTP, or SFTP. You can also copy files to and from your
local Linux file system. In this tutorial, the examples assume that you are using a
Hadoop distributed file system, and that you installed a user account called
biadmin.

Procedure
1. Deploy the Distributed File Copy application to make it available for your use:
a. In the InfoSphere BigInsights Console, click the Applications tab, and then
click Manage.
b. From the navigation tree, expand the Import directory.
c. Select the Distributed File Copy application, and click Deploy.
d. In the Deploy Application window, select Deploy.
2. From the toolbar on the navigation tree window, select Run.
3. Select the Distributed File Copy application.
4. Define your application parameters:
a. Type get_samples in the Execution name field. You are creating an instance
of a job that you can run. You can track the results by this Execution name
and then reuse the job later.



b. In the Input path field, specify the fully qualified path to the
$BIGSQL_HOME/samples/queries directory on your local Linux file
system. The Big SQL samples are installed with InfoSphere BigInsights in
the queries directory on your Linux file system that is connected to the
cluster: /opt/ibm/biginsights/bigsql/samples/queries. Include the user
name and password of that cluster in the Input path. For example, if the
user name is bigsql and the password is bigsql, the Input path field will
contain the following path:
sftp://bigsql:bigsql@my.server.com:22
/opt/ibm/biginsights/bigsql/samples/queries

The port number must be available, so do not use 8080 as the port number.
Make sure that you change my.server.com to the correct server name.
c. In the Output path field, specify the fully qualified path to where you want
to store the data on the distributed file system. For the purposes of this
lesson, specify the following path:
/user/biadmin/bi_sample_data/

The input path points to a directory of SQL (queries), so the output path
contains a directory that is named queries.
d. Use the Browse button to specify the fully qualified path to your credential
properties file in the InfoSphere BigInsights credentials store.
user/biadmin/credstore/public/bigsql.prop
5. Click Run. The queries directory, which is the directory that you requested in
the input path of the Distributed File Copy application, is uploaded to the
distributed file system.
6. Now that you have the statements to DROP, CREATE and LOAD tables in the
distributed file system, get the data that you will use to load the tables.
a. While still in the Distributed File Copy application, type get_data in the
Execution name field.
b. In the Input path field, specify the fully qualified path to the
$BIGSQL_HOME/samples/data directory on your local Linux file
system.
sftp://bigsql:bigsql@my.server.com:22
/opt/ibm/biginsights/bigsql/samples/data
c. In the Output path field, specify the same fully qualified path to where you
want to store the data on the distributed file system that you did for the
queries:
/user/biadmin/bi_sample_data/
d. Make sure the Credential file path for SFTP field still contains the path to
the bigsql.prop file:
user/biadmin/credstore/public/bigsql.prop
e. Click Run. The data directory, which is the directory that you requested in
the input path of the Distributed File Copy application, is uploaded to the
distributed file system.
7. From the queries directory in the distributed file system, download the
GOSALESDW_drop.sql, GOSALESDW_ddl.sql, and GOSALESDW_load.sql files to a
local directory so that you can import the files to Eclipse in a later lesson.

Results

These three GOSALESDW_*.sql files are the only SQL scripts that you will use to
drop the tables, or create and populate the tables. More SQL scripts exist in this
directory to use when you want to explore more samples.

Quick Start Edition VM Image Users Only


If you are running this tutorial with the IBM InfoSphere BigInsights Quick Start
Edition VM image, use the Big SQL Eclipse projects that are included with the
image to create the Big SQL tables and load the data.

Procedure
1. Open the InfoSphere BigInsights Tools for Eclipse that is installed with your
VM image.
2. Select the project template called myBigSQL_Tutorial5_setup in the project
explorer.
3. Right-click the project and select Open Project.
4. The project contains three SQL files. You will run these files one at a time:
GOSALESDW_drop.sql
You can skip this SQL file if you have never created the tables in the
GOSALESDW schema. Otherwise, open the GOSALESDW_drop.sql file,
and click the Run SQL icon.
Eclipse returns results for each statement. When all of the statements in
the GOSALESDW_drop.sql file are completed successfully, continue to the
next file.
The GOSALESDW_drop.sql file contains SQL statements that drop any
tables in the GOSALESDW schema that might have already been
created.
GOSALESDW_ddl.sql

Open the GOSALESDW_ddl.sql file, and click the Run SQL icon.
Eclipse returns results for each statement. When all of the statements in
the GOSALESDW_ddl.sql file are completed successfully, continue to the
next file.
The GOSALESDW_ddl.sql file contains SQL statements to create the
schema and the tables. The first line of this file creates the
GOSALESDW schema. A Big SQL schema is a way to logically group
objects, such as tables or functions. The second line of this file (the USE
clause) declares a default schema for the session. All unqualified table
names that are referenced in Big SQL statements and DDL statements
default to this schema. If no USE clause is present, the default is your
User ID on the cluster.
In the later lessons, the USE clause is not used, because all of the tables
that are referenced are fully qualified, which means that you include an
unambiguous schema name as part of the table name. Therefore,
instead of running the statement as in Example 1, use the fully
qualified reference as in Example 2:
Example 1: With no schema qualification
SELECT * FROM
go_region_dim;
Example 2: Fully qualified table name
SELECT * FROM
GOSALESDW.go_region_dim;



In this example, GOSALESDW is the schema name.
GOSALESDW_load.sql
Open the GOSALESDW_load.sql file. This file contains the correct path to
the data, assuming that you are using the VM image, and the host
name is bivm. Edit the path of each LOAD statement to change the host
name if needed. Click the Run SQL icon.
When all of the statements in the GOSALESDW_load.sql file are
completed successfully, each table is populated with data.
The GOSALESDW_load.sql file contains SQL statements to load the data
in the GOSALESDW tables.

Learn more about loading data:

The Big SQL LOAD HADOOP statement offers a powerful way to
import data into your tables. The following is a simple example of the
LOAD HADOOP statement.
v This example shows a load from a DB2® table.
LOAD HADOOP USING JDBC CONNECTION URL
'jdbc:db2://myhost:51000/SAMPLE'
WITH PARAMETERS (
'user' = 'myuser', 'password' = 'mypassword')
FROM SQL QUERY
'SELECT * FROM STAFF WHERE YEARS > 5
AND $CONDITIONS'
SPLIT COLUMN ID INTO TABLE STAFF_Q APPEND
;
5. If the results of the CREATE and LOAD statements are successful, view the tables
that you just created on the server. Because you connected to an InfoSphere
BigInsights server when you first opened the file, the tables are created on that
server:
a. Click the Files tab in the InfoSphere BigInsights Console. Then click the
DFS Files page.
b. Look for the GOSALESDW schema by expanding the directories in the
following HDFS navigation path: hdfs://<server-name>:9000/
biginsights/hive/warehouse/gosalesdw.db.



6. In the SQL Results view in the current perspective of the InfoSphere
BigInsights Tools for Eclipse, view the results of each Big SQL statement or
script. Any errors are listed and any result tables are displayed.

Learn more about the SQL Results view:

The SQL Results view contains the results of your SQL statements or scripts.
You can change the display of the results page, and also the number of rows
that are returned from each query (the default is 500).

By default, the view contains two panes.


v The left pane contains the Status column and the Operation column.
v The right pane contains three tabs: Status, Parameters, and Result1.

Downloading sample data from a developerWorks source


Some of the lessons in this tutorial use sample data that was part of a
developerWorks article about analyzing social media.

About this task

For the module on Publishing an IBM Big SQL application, you need to download
data that was created for a developerWorks article. This article contains data about
the occurrences of the phrase IBM Watson in various social media sources. It will
also be used to demonstrate some of the interaction that is possible between Big
SQL and BigSheets.

Procedure
1. Download the IBM Watson data to your local file system.
The data is in the Download section of the developerWorks article, "Analyzing
social media and structured data with InfoSphere BigInsights: Get a quick start
with BigSheets". Accept the terms and conditions and save the file
article_sampleData to your local system.
2. Extract the file to the following path on your local file system:
/home/biadmin/samples_tutorial.
The article_sampleData directory contains the following files:
v RDBMS_data.csv
v blogs-data.txt
v news-data.txt
v README.txt
3. Note the path to which you extracted the directory. For example, if you
extracted to your Linux directory called samples_tutorial, the full path is
/home/biadmin/samples_tutorial/article_sampleData.
4. Upload the article_sampleData directory to the distributed file system in the
InfoSphere BigInsights Console by using the upload icon from the
distributed file system Files tab (one file at a time), or by using the Distributed
File Copy application that you deployed in an earlier lesson.
a. Click Run on the Applications tab and double-click the Distributed File
Copy application.
b. Type get_socialMedia in the Execution name field.



c. In the Input path field, type the fully qualified path to the
/home/biadmin/samples_tutorial/article_sampleData directory on your
local Linux file system, which should be the path you used from Step 2 on
page 43.
Include the user name and password of the InfoSphere BigInsights cluster in
the Input path. For example, if the user name is biadmin and the password
is biadmin, the Input path field contains the following path:
sftp://biadmin:biadmin@my.server.com:22
/home/biadmin/samples_tutorial/article_sampleData
d. In the Output path field, type the fully qualified path to where you want to
store the data on the distributed file system. For the purposes of this lesson,
type the following path:
/user/biadmin/bi_sample_data/

The input path points to a directory of data (article_sampleData), so the
output path contains a directory named article_sampleData.
e. Click Browse to find and select the fully qualified path to your bigsql.prop
properties file in the InfoSphere BigInsights credentials store.
f. Click Run. The data from the social media report is loaded.

Creating a project and tables, and loading sample data


The examples in this tutorial use IBM Big SQL tables and data. You use predefined
scripts that contain the statements to create the tables, and load the data. As you
gain familiarity with the tools and product features that are used in this tutorial,
you will run queries and create reports about the Sample Outdoor Company.

About this task

The time range of the fictional Sample Outdoor Company data is three years and
seven months, starting January 1, 2004 and ending July 31, 2007. The 43-month
period reflects the history that you will analyze.

Quick Start Edition VM Users: If you are running the tutorial with the IBM
InfoSphere BigInsights Quick Start Edition VM image, you can skip these steps and
proceed to “Module 1: Creating and running SQL script files” on page 48.

Procedure
1. Open the InfoSphere BigInsights Tools for Eclipse that you installed on your
workstation environment.
2. Create an IBM InfoSphere BigInsights project in Eclipse:
a. From the Eclipse menu bar, click File > New > Other.
b. In the Select a wizard window, expand the BigInsights directory, select
BigInsights Project, and then click Next.
c. Type myBigSQL in the Project name field, and then click Finish.
d. If you are not already in the BigInsights perspective, in the message that
displays, click Yes to switch to the BigInsights perspective.
3. Import the SQL scripts into the Eclipse project:
a. From the Eclipse Project Explorer view, right-click the myBigSQL project
and click Import.
b. In the Import window, select General > File System and click Next.



c. Click Browse to find the directory where you downloaded the SQL scripts
and click OK.
d. Select the three SQL scripts and click Finish.

4. From the Project Explorer in the Eclipse BigInsights perspective, expand the
myBigSQL project. Double-click the appropriate *.sql file to open it in the Big
SQL editor. You can then run the statements in the file from the editor. You are
going to run each file in order to drop tables, create tables, and load data into
the tables:

Learn more about the SQL editor window: In the SQL editor window, you can
run SQL statements as you edit them, select connection profiles, and import or
export SQL statements.

You can use the context assistant to help you complete statements. A syntax
checker adds a red indicator next to any invalid line. You can hover over that
indicator to see the reason for the problem.
a. You can skip this step if you have never created the tables in the
GOSALESDW schema. Open the GOSALESDW_drop.sql file, and click the Run
SQL icon.
Eclipse returns results for each statement. When all of the statements in the
GOSALESDW_drop.sql file are completed successfully, continue to the next file.
The GOSALESDW_drop.sql file contains SQL statements that drop any tables
in the GOSALESDW schema that might have already been created.

b. Open the GOSALESDW_ddl.sql file, and click the Run SQL icon.
Eclipse returns results for each statement. When all of the statements in the
GOSALESDW_ddl.sql file are completed successfully, continue to the next file.
The GOSALESDW_ddl.sql file contains SQL statements to create the schema
and the tables. The first line of this file creates the GOSALESDW schema. A
Big SQL schema is a way to logically group objects, such as tables or
functions. The second line of this file (the USE clause) declares a default
schema for the session. All unqualified table names that are referenced in
Big SQL statements and DDL statements default to this schema. If no USE
clause is present, the default is your User ID on the cluster.



In the later lessons, the USE clause is not used, because all of the tables that
are referenced are fully qualified, which means that you include an
unambiguous schema name as part of the table name. Therefore, instead of
running the statement as in Example 1, use the fully qualified reference as
in Example 2:
Example 1: With no schema qualification
SELECT * FROM
go_region_dim;
Example 2: Fully qualified table name
SELECT * FROM
GOSALESDW.go_region_dim;

In this example, GOSALESDW is the schema name.
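To picture how these pieces fit together, the following sketch combines a
schema definition, a USE clause, and a table definition in one file. The schema
name, table name, and columns here are invented for illustration; they are not
the contents of the GOSALESDW_ddl.sql file.
-- Illustration only: the schema, table, and column names below are assumptions.
CREATE SCHEMA DEMO_SCHEMA;
USE DEMO_SCHEMA;
CREATE HADOOP TABLE demo_orders (
  order_id INT,
  order_note VARCHAR(100)
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t';
-- With USE in effect, an unqualified name such as demo_orders resolves to
-- DEMO_SCHEMA.demo_orders.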


c. Open the GOSALESDW_load.sql file. You must edit the file to specify the
correct source location of the data from which you will load the tables. You
get better LOAD performance when you copy the data you are going to
load to the distributed file system. You did that copy in the previous task,
Importing sample data by using the Distributed File Copy application,
where you set the path to the data as /user/biadmin/bi_sample_data/data.
Now, in the GOSALESDW_load.sql file, replace every instance of url
'file:///opt/ibm/biginsights/bigsql/samples/data/ with url
'/user/biadmin/bi_sample_data/data/.
1) In Eclipse, press Ctrl-F to open the Find/Replace dialog.
2) In the Find field, type url 'file:///opt/ibm/biginsights/bigsql/
samples/data/. In the Replace with field, type url '/user/biadmin/
bi_sample_data/data/. Make sure Case sensitive is selected.
3) Click Replace all. You should see a message that 65 matches were
replaced. Close the Find/Replace dialog and save the SQL file.
4) Click the Run SQL icon to run all of the LOAD statements.
When all of the statements in the GOSALESDW_load.sql file are completed
successfully, each table is populated with data.

Tip: To conserve space in your distributed file system, you can delete the
data folder from its download location, /user/biadmin/bi_sample_data/
data/. Click the data directory in that path, and then click the Remove icon.
Click Yes at the confirmation window.


The GOSALESDW_load.sql file contains SQL statements to load the data in the
GOSALESDW tables.

Learn more about loading data:

The Big SQL LOAD HADOOP statement offers a powerful way to import
data into your tables. The following is a simple example of the LOAD
HADOOP statement.
v This example shows a load from a DB2 table.
LOAD HADOOP USING JDBC CONNECTION URL
'jdbc:db2://myhost:51000/SAMPLE'
WITH PARAMETERS (
'user' = 'myuser', 'password' = 'mypassword')
FROM SQL QUERY
'SELECT * FROM STAFF WHERE YEARS > 5
AND $CONDITIONS'
SPLIT COLUMN ID INTO TABLE STAFF_Q APPEND
;
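Because the GOSALESDW tables in this lesson are loaded from files in the
distributed file system rather than from a JDBC source, a file-based variant of
the statement is closer to what the GOSALESDW_load.sql file does. The following
sketch is an illustration only; the file name and delimiter are assumptions, and
the actual statements are in the GOSALESDW_load.sql file.
-- Illustration only: adjust the file URL and delimiter to match your data.
LOAD HADOOP USING FILE URL
'/user/biadmin/bi_sample_data/data/GO_REGION_DIM.txt'
WITH SOURCE PROPERTIES ('field.delimiter' = '\t')
INTO TABLE GOSALESDW.GO_REGION_DIM OVERWRITE;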
5. If the results of the CREATE and LOAD statements are successful, view the tables
that you just created on the server. Because you connected to an InfoSphere
BigInsights server when you first opened the file, the tables are created on that
server:
a. Click the Files tab in the InfoSphere BigInsights Console. Then click the
DFS Files page.
b. Look for the GOSALESDW schema by expanding the directories in the
following HDFS navigation path: hdfs://<server-name>:9000/
biginsights/hive/warehouse/gosalesdw.db.

6. In the SQL Results view in the current perspective of the InfoSphere
BigInsights Tools for Eclipse, view the results of each Big SQL statement or
script. Any errors are listed and any result tables are displayed.

Learn more about the SQL Results view:

The SQL Results view contains the results of your SQL statements or scripts.
You can change the display of the results page, and also the number of rows
that are returned from each query (the default is 500).

By default, the view contains two panes.


v The left pane contains the Status column and the Operation column.
v The right pane contains three tabs: Status, Parameters, and Result1.

Optional: Changing the default Eclipse SQL Results view to see more results
You can change the display of the Eclipse SQL Results page that is part of the IBM
Big SQL editor. You can also change the number of rows that get returned from
each query.



About this task

When you run statements or scripts in the SQL Editor in the IBM InfoSphere
BigInsights perspective in Eclipse, the default number of rows that is returned is
500. Follow these steps if you want to change the number of rows that get
returned:

Procedure
1. From the Eclipse menu bar, click Window > Preferences.
2. From the Preferences window, click Data Management > SQL Development >
SQL Results View Options.
3. In the SQL Results View Options window, find the Max row count field and
increase the value from the default of 500. This value controls the number of
rows that are retrieved. A value of zero retrieves all rows.
4. In the Max display row count field, increase the value from the default of 500.
This value controls the number of rows that you see. A value of zero displays
all rows. Be aware that making this number too large can produce performance
problems.
5. Click OK to save your changes.

Module 1: Creating and running SQL script files


In this module, you will explore some of the basic IBM Big SQL queries, and begin
to understand the sample data from the Sample Outdoor Company.

Now that you have loaded the data from the Sample Outdoor Company, you are
ready to explore the sales figures and product activity.

This module teaches some of the basic statements of IBM Big SQL, and some of the
different environments that you can use to create Big SQL objects and run queries.

Learning objectives

After completing the lessons in this module you will know how to do the
following tasks:
v Create scripts to run Big SQL statements.
v Create a view.
v Create queries that help you analyze the financial data from the Sample Outdoor
Company.
v Run queries from the InfoSphere BigInsights Console, from InfoSphere
BigInsights Tools for Eclipse, and from open source spreadsheets.

Time required

This module should take approximately 45 minutes to complete.

Prerequisites

You must complete the tasks to set up your environment.



Lesson 1.1: Creating an SQL script file
The SQL script file is a container for SQL statements or commands. When you run
SQL statements from a client such as Eclipse, the script file is a convenient way of
manipulating large numbers of statements.

You already know how to run the predefined SQL scripts from the tasks to set up
your environment. In this lesson, you will create your own script.

The script file can contain one or more SQL statements or commands. Within IBM
Big SQL in the Eclipse SQL editor window, you can run the entire file, or any
highlighted part of the file.
1. If you have not already created the myBigSQL project in Eclipse, do the
following steps:
a. From the Eclipse menu bar, click File > New > Other.
b. In the Select a wizard window, expand the BigInsights directory, select
BigInsights Project, and then click Next.
c. Type myBigSQL in the Project name field, and then click Finish.
d. If you are not already in the BigInsights perspective, in the message that
displays, click Yes to switch to the BigInsights perspective.
2. From the Eclipse menu bar, click File > New > Other.
3. In the Select a wizard window, expand the BigInsights directory, and select
SQL Script, and then click Next.
4. In the New SQL File window, in the Enter or select the parent directory field,
select myBigSQL. Your new SQL file is stored in this project directory.
5. In the File name field, type aFirstFile. The .sql file extension is added
automatically.
6. Click Finish.
7. After you create or open an SQL script for the first time, you must specify the
Big SQL connection for your SQL script file:
a. In the Select Connection Profile window, select the Big SQL connection. The
properties of the selected connection display in the Properties field. The Big
SQL database-specific context assistant and syntax checks are now activated
in the editor that is used to edit your SQL file.
b. Click Finish to close the Select Connection Profile window.
8. In the SQL Editor that opens with the aFirstFile.sql file that you created, add
the following Big SQL comments:
--This is a beginning SQL script
--These are comments. Any line that begins with two
-- dashes is a comment line,
-- and is not part of the processed SQL statements.
9. Save the aFirstFile.sql file by using the keyboard shortcut CTRL-S.
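For illustration, a script that mixes comments with a statement might look like
the following lines; any line that does not begin with two dashes is processed as
SQL. You add a real query to the aFirstFile.sql file in the next lesson, so you do
not need to add this statement now.
--This is a beginning SQL script
--The next line is processed as an SQL statement.
SELECT * FROM GOSALESDW.GO_REGION_DIM;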

Lesson 1.2: Creating and running a simple query to begin data analysis
In this lesson, you explore some basic Big SQL queries. The goal is to learn how to
query data to analyze it.

The schema that is used in this tutorial is the GOSALESDW schema. It contains
fact tables for the following topics:
v Distribution
v Finance
v Geography
v Marketing
v Organization
v Personnel
v Products
v Retailers
v Sales
v Time

The analysis that you will do will reference parts of each of those topics. You will
examine product inventory, distribution, sales, and employee data.
1. From the Eclipse Project Explorer, open the myBigSQL project, and double-click
the aFirstFile.sql file.
2. In the SQL editor pane, type the following statement:
SELECT * FROM GOSALESDW.GO_REGION_DIM;
Each complete SQL statement must end with a semicolon. The statement
selects, or fetches, all the rows that exist in the GO_REGION_DIM table, which is
one of the tables in the GOSALESDW schema.

Learn more about SELECT statements:

The SELECT statement is used to select data from a table. The result is stored
in a result table, which is called the result-set. It can be part of another query
or subquery.
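
For example, a SELECT statement can be nested in the FROM clause of another
query. The following statement is a small sketch, not a tutorial step, that
counts the rows that an inner SELECT returns:
SELECT COUNT(*) AS row_count
FROM (SELECT * FROM gosalesdw.go_region_dim) AS all_regions;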

3. Click the Run SQL icon ( ).


Depending on how much data is in the table, a SELECT * statement might take
some time to complete. For this statement, your result should contain 21
records or rows. In Figure 1, you see part of the Eclipse output that is
displayed:

Figure 1. Sample output in the SQL Results view

You might have a script that contains several queries. When you want to run
the entire script, click the Run SQL icon or press F5 with nothing highlighted.
When you want to run a specific statement, or set of statements, and you
include the schema name with the table name (gosalesdw.go_region_dim),
highlight the statement or statements that you want to run, and press F5.
4. Improve the SELECT * statement by adding a predicate to the statement to
return fewer rows. A predicate is a condition on a query that reduces and
narrows the focus of the result. A predicate on a query with a multi-way join
can improve the performance of the query. For example, add the WHERE
region_en LIKE 'Amer%' predicate to your original SELECT * statement.
SELECT * FROM gosalesdw.go_region_dim
WHERE region_en LIKE 'Amer%';

Learn more about the WHERE clause: You can filter results from an SQL
query by using a WHERE clause. The WHERE clause specifies a result table
that contains those rows for which the search condition is true. The syntax
looks like the following code:
WHERE search-condition
5. Run the entire script. This query results in four records or rows.

Figure 2. Sample output of a statement that includes a predicate

6. You can learn about the structure of the table GO_REGION_DIM, with some
queries to the syscat schema catalog tables. The Big SQL catalog tables
provide metadata support to the database. For more information about the Big
SQL catalog views, see Hadoop Catalog Views and Catalog Views. Type or
copy the following query, select the statement, and run just this statement:
SELECT * FROM syscat.columns
WHERE tabname='GO_REGION_DIM'
AND tabschema='GOSALESDW';
The output from the catalog tables is folded to upper case. No rows are
returned if you use lower case in the catalog query (for example,
tabname='go_region_dim').
This query uses two predicates in a WHERE clause. The query finds all of the
information from the syscat.columns table when the tabname is
'GO_REGION_DIM' and the tabschema is 'GOSALESDW'. Because you are
using an AND operator, both predicates must be true to return a row. Use
single quotation marks around string values.
The result of the query to the syscat.columns table is the metadata, or the
structure of the table. The SQL Results tab in Eclipse shows 54 rows as your
output. That means that there are 54 columns in the table GO_REGION_DIM.



Figure 3. SQL editor results view

7. Run a query that returns the number of rows in a table. Type or copy the
following query, select the statement, and then run the query.
SELECT COUNT(*) FROM gosalesdw.go_region_dim;
The COUNT aggregate function returns the number of rows in the table, or the
number of rows that satisfy the WHERE clause in the SELECT statement
when a WHERE clause is part of the statement. The result is the number of
rows in the set. A row that includes only null values is included in the count.
In this example, there are 21 rows in the go_region_dim table.

Learn more about aggregate functions: The COUNT and COUNT(*)


statements are part of a group of statements that are called the aggregate
functions. Aggregate functions return a single value per group. Other
aggregate functions are AVG, MAX, MIN, and SUM. These functions are all
built-in functions, and generally do calculations on data. For example, to
calculate the average of a series of numbers in a table, you might run the
following SQL statement:
SELECT AVG(unit_price) FROM stock
WHERE stock_num = 110;
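
Note that COUNT(*) counts every row, but COUNT(column-name) skips rows in
which that column is NULL. The following statement is a small sketch that you
can add to your script to compare the two forms on the same table:
SELECT COUNT(*) AS all_rows,
COUNT(region_en) AS non_null_regions
FROM gosalesdw.go_region_dim;
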
8. Use the COUNT (distinct <expression>) statement to determine the number
of unique values in a column. Run this statement in your SQL file:
SELECT COUNT (distinct region_en) FROM gosalesdw.go_region_dim;
The result is 5. This result means that there are five unique region names in
English (the column name region_en).
9. Use the FETCH FIRST .. ONLY clause to specify a limit on the number of
output rows that are produced by the SELECT statement. Type this statement in
your SQL file, select the statement, and then run it:
SELECT * FROM GOSALESDW.DIST_INVENTORY_FACT fetch first 50 rows only;
The statement returns 53837 rows without the FETCH FIRST ... ONLY clause,
but only 50 rows with the FETCH FIRST ... ONLY clause.
If there are fewer rows than the FETCH FIRST value, then all of the rows are
returned. The FETCH FIRST clause is useful to see a sample or subset of the
output.
10. Save your SQL file.



Lesson 1.3: Creating a view that represents the inventory
shipped by branch
You can represent data with a view, which is the result of queries from one or
more tables. In this lesson, you are going to create a view that is the result of a
query that joins two tables.

You can retrieve data from views just like you do from tables. However, views can
be more efficient because they do not require permanent storage.

You might want to create a view to organize the way users see the data, or to
restrict certain information to a defined set of users.

You are going to create a view that gives you information about the quantity of
products that are shipped by branch. The first table (gosalesdw.go_branch_dim)
contains information about the branches of the Sample Outdoor Company. The
second table (gosalesdw.dist_inventory_fact) contains information about the
inventory, including the amount of product that is shipped.
1. Right-click the myBigSQL project, and select New > SQL Script. Name the
new file GOSALESDW_viewddl.sql in project myBigSQL.
2. In the file GOSALESDW_viewddl.sql, type or copy the following code:
CREATE SCHEMA myschema;
USE myschema;

CREATE VIEW myschema.myInventory_view
AS
SELECT if.product_key, if.quantity_shipped
FROM gosalesdw.go_branch_dim AS bd,
gosalesdw.dist_inventory_fact AS if
WHERE if.branch_key = bd.branch_key
AND bd.branch_code > 20;
The CREATE SCHEMA and USE statements establish that the schema,
myschema, is the default schema for the current session.
The view, myschema.myInventory_view, contains the results of a join of tables
gosalesdw.go_branch_dim and gosalesdw.dist_inventory_fact. The join is
based on the branch key, which is a column in both tables. The query from
which the view is made is filtered by the branch code.
3. Save the file and run it to create the view. After the view is created, you can
query the view just as you would for a table.
4. Type or copy the following SELECT statement as the last line in the
GOSALESDW_viewddl.sql file:
SELECT * FROM myschema.myInventory_view;
5. Save the file, and then highlight the complete SELECT statement, to run that
statement only to query the new view.
6. Click the Result1 tab in the SQL Results View to see the output.
The following screen capture displays a portion of the output.

Chapter 6. Tutorial: Developing Big SQL queries to analyze big data 53


Unlike the tables that you created in previous lessons, you will not find the
view in the DFS tab. Remember, at the start of this lesson, the statement was
made that views do not require permanent storage. They are virtual tables. The
contents of views are not materialized until query run time. The DFS file tab
contains only folders and files that represent physical structures, like tables.
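
If you want to confirm that the view exists, you can query the catalog instead
of the DFS tab. The following statement is a sketch that assumes the
SYSCAT.VIEWS catalog view, which you query in the same way as the
SYSCAT.COLUMNS view from Lesson 1.2:
SELECT viewschema, viewname
FROM syscat.views
WHERE viewschema = 'MYSCHEMA';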

Lesson 1.4: Analyzing products and market trends with Big SQL Joins and Predicates
In this lesson, you create and run Big SQL queries that will analyze the market and
financial data in the GOSALESDW schema.

You are querying this data so that you can better understand the products and
market trends of the fictional Sample Outdoor Company. You are going to examine
the records of the products that are ordered, the quantities that are ordered, and
the order methods.
1. Right-click the MyBigSQL project and click New > SQL Script. Name it
companyInfo.sql.
2. Your immediate goal is to learn what products were ordered from the fictional
Sample Outdoor Company, and by what method they were ordered. To achieve
your goal, you must join information from multiple tables in the gosalesdw
schema, because it is a relational database where not everything is in one table:
a. Type or copy the following comments and statement into the
companyInfo.sql file:
--Fetch the product name and the quantity and
-- the order method.
--Product name has a key that is part of other
-- tables that we can use as a join predicate.
--The order method has a key that we can use
-- as another join predicate.

--Query 1
SELECT pnumb.product_name, sales.quantity,
meth.order_method_en
FROM
gosalesdw.sls_sales_fact sales,
gosalesdw.sls_product_dim prod,
gosalesdw.sls_product_lookup pnumb,
gosalesdw.sls_order_method_dim meth
WHERE
pnumb.product_language='EN'
AND sales.product_key=prod.product_key
AND prod.product_number=pnumb.product_number
AND meth.order_method_key=sales.order_method_key;
Because there is more than one table that is referenced in the FROM clause,
the query can join rows from those tables. A join predicate specifies a
relationship between at least one column from each table to be joined.
v The use of the predicate, prod.product_number=pnumb.product_number
helps to narrow the results to product numbers that match in two tables.
v This query also uses an alias in the SELECT and FROM clauses,
pnumb.product_name. The pnumb reference is the alias for the
gosalesdw.sls_product_lookup table. That alias can now be used in the
where clause so that you do not need to repeat the complete table name.
And, the WHERE clause is not ambiguous.
v The use of the predicate pnumb.product_language=’EN’ helps to further
narrow the result to English output only. This database contains
thousands of rows of data in various languages, so restricting the
language provides a more focused set of results.
b. Run the statement by selecting the statement, beginning with the keyword
SELECT, and ending with the semicolon, and then pressing F5.
3. Review the results in the SQL Results page. You can now begin to see what
products are sold, and how they are ordered by customers.

By default, the Eclipse SQL Results page limits the output to 500 rows. You
can change that value in the Data Management preferences.
4. To find out how many rows the query returns in a full Big SQL environment,
type the following query into the companyInfo.sql file, then select the query,
and then press F5:
--Query 2
SELECT COUNT(*)
--(SELECT pnumb.product_name, sales.quantity,
-- meth.order_method_en
FROM
gosalesdw.sls_sales_fact sales,
gosalesdw.sls_product_dim prod, gosalesdw.sls_product_lookup pnumb,
gosalesdw.sls_order_method_dim meth
WHERE
pnumb.product_language='EN'
AND sales.product_key=prod.product_key
AND prod.product_number=pnumb.product_number
AND meth.order_method_key=sales.order_method_key;
The result for the query is 446,023 rows.
5. Update the query that is labeled --Query 1 to restrict the order method to equal
'Sales visit' only. Add the following string just before the semicolon:
AND order_method_en='Sales visit'

Now, --Query 1 should look like the following query:


--query 1
SELECT pnumb.product_name, sales.quantity,
meth.order_method_en
FROM
gosalesdw.sls_sales_fact sales,
gosalesdw.sls_product_dim prod,
gosalesdw.sls_product_lookup pnumb,
gosalesdw.sls_order_method_dim meth
WHERE
pnumb.product_language='EN'
AND sales.product_key=prod.product_key
AND prod.product_number=pnumb.product_number
AND meth.order_method_key=sales.order_method_key
AND order_method_en='Sales visit';
6. Run the entire modified --Query 1 statement by selecting it and pressing F5.
The results in the SQL Results page now show the product and the quantity
that is ordered by customers that visit a retail shop.

7. To find out which purchase method of all the methods has the greatest quantity
of orders, you must add a GROUP BY clause (GROUP BY pll.product_line_en,
md.order_method_en). You will also use a SUM aggregate function
(SUM(sf.quantity)) to total the orders by product and method. In addition, you
can clean up the output to substitute a more readable column header by adding
AS Product in the SELECT statement.
SELECT pll.product_line_en AS Product,
md.order_method_en AS Order_method,
SUM(sf.QUANTITY) AS total
FROM gosalesdw.sls_order_method_dim AS md,
gosalesdw.sls_product_dim AS pd,
gosalesdw.sls_product_line_lookup AS pll,
gosalesdw.sls_product_brand_lookup AS pbl,
gosalesdw.sls_sales_fact AS sf
WHERE
pd.product_key = sf.product_key
AND md.order_method_key = sf.order_method_key
AND pll.product_line_code = pd.product_line_code
AND pbl.product_brand_code = pd.product_brand_code
GROUP BY pll.product_line_en, md.order_method_en;
8. Run the complete statement by selecting it and pressing F5.
Your results in the SQL Results page should include 35 rows:

9. Save the companyInfo.sql file.
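
Optionally, if you want to see only the combinations with large totals, you can
add a HAVING clause, which filters the grouped rows after aggregation. The
following statement is a sketch that extends the query from step 7; the
threshold of 100000 is an arbitrary example value:
SELECT pll.product_line_en AS Product,
md.order_method_en AS Order_method,
SUM(sf.quantity) AS total
FROM gosalesdw.sls_order_method_dim AS md,
gosalesdw.sls_product_dim AS pd,
gosalesdw.sls_product_line_lookup AS pll,
gosalesdw.sls_product_brand_lookup AS pbl,
gosalesdw.sls_sales_fact AS sf
WHERE pd.product_key = sf.product_key
AND md.order_method_key = sf.order_method_key
AND pll.product_line_code = pd.product_line_code
AND pbl.product_brand_code = pd.product_brand_code
GROUP BY pll.product_line_en, md.order_method_en
HAVING SUM(sf.quantity) > 100000;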

Lesson 1.5: Creating advanced Big SQL queries that include common table expressions, aggregate functions, and ranking
In this lesson, you create Big SQL queries with common table expressions,
aggregate functions, and ranking to determine how many product units were
shipped and how many units were sold.

Your goal in this lesson is to understand how the Sample Outdoor Company
products that are sold rank in comparison with the products that are shipped. You
are going to write SQL statements to analyze the data in the GOSALESDW schema
to achieve this goal.
1. Create an SQL file that is named advanced.sql in the myBigSQL project.
2. To open the advanced.sql file, double-click it.
3. Type or copy the following statement into the advanced.sql file:
WITH
sales AS
(SELECT sf.*
FROM gosalesdw.sls_order_method_dim AS md,
gosalesdw.sls_product_dim AS pd,
gosalesdw.emp_employee_dim AS ed,
gosalesdw.sls_sales_fact AS sf
WHERE pd.product_key = sf.product_key
AND pd.product_number > 10000
AND pd.base_product_key > 30
AND md.order_method_key = sf.order_method_key
AND md.order_method_code > 5
AND ed.employee_key = sf.employee_key
AND ed.manager_code1 > 20),
inventory AS
(SELECT if.*
FROM gosalesdw.go_branch_dim AS bd,
gosalesdw.dist_inventory_fact AS if
WHERE if.branch_key = bd.branch_key
AND bd.branch_code > 20)
SELECT sales.product_key AS PROD_KEY,
SUM(CAST (inventory.quantity_shipped AS BIGINT)) AS INV_SHIPPED,
SUM(CAST (sales.quantity AS BIGINT)) AS PROD_QUANTITY,
RANK() OVER ( ORDER BY SUM(CAST (sales.quantity AS BIGINT)) DESC) AS PROD_RANK
FROM sales, inventory
WHERE sales.product_key = inventory.product_key
GROUP BY sales.product_key;
By using a common table expression, you define two tables (sales and inventory)
with unique names that you can use in a FROM clause.

Learn more about WITH clauses: The WITH clause introduces one or more common
table expressions, each of which defines a named result table that can be
referenced as a table name in any FROM clause of the fullselect that follows.
Multiple common table expressions can be specified following a single WITH
keyword. Each common table expression can also be referenced by name in the
FROM clause of subsequent common table expressions.

A common table expression can be used in place of a view to avoid creating the
view. It can also be used when the same result table must be shared in a
fullselect.
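
As an illustration, the following sketch, which is not part of the lesson,
chains two common table expressions; the second expression refers to the first
one by name:
WITH amer_regions AS
(SELECT * FROM gosalesdw.go_region_dim WHERE region_en LIKE 'Amer%'),
amer_count AS
(SELECT COUNT(*) AS region_rows FROM amer_regions)
SELECT * FROM amer_count;
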
The example also shows multiple tables that are joined together. In most cases,
Big SQL joins the tables together in the order that they are provided in the
statement. In the example, the gosalesdw.sls_order_method_dim table is
accessed by Big SQL first.
When you choose the order of the tables in the query, remember to eliminate
rows as early as possible. Tables that use predicates that filter out many rows,
or those tables with rows that are removed as a result of the join should be
located early in the query. These tables are considered highly selective.
Ordering the tables in this way reduces the number of rows that must be
moved to the next step of the query.

4. Click the Run SQL icon ( ). The result contains 165 rows. The output shows
the product by its product key, by the number of units that were shipped, and
by the number of units that were sold.

Figure 4. Partial results of how many units were shipped and sold.

The INV_SHIPPED column is derived from the SUM aggregate function and a
CAST function as shown in the following Big SQL statement:
SUM(CAST (INVENTORY.QUANTITY_SHIPPED AS BIGINT))
AS INV_SHIPPED...

The example is resolved internally by the following flow:



a. The original column, QUANTITY_SHIPPED is created as an integer. The
CAST function converts the output to another data type, in this case a
BIGINT.
b. The SUM function returns a single summed value for the column.
The PROD_RANK column is derived from the RANK function.
RANK() OVER ( ORDER BY SUM(CAST (SALES.QUANTITY AS BIGINT)) DESC) AS PROD_RANK

The example is resolved internally by the following flow:


a. The sales.quantity column is CAST from an integer to a BIGINT.
b. The SUM function is used on that column.
c. The sales.quantity column is then sorted in descending order.
d. The RANK function produces a number that is based on the sorted order.
5. Save the advanced.sql file.

Lesson 1.6: Running Big SQL queries in the Big SQL Console
In this lesson, you learn how to run queries in the IBM Big SQL Console of IBM
InfoSphere BigInsights.

The Big SQL console is a built-in part of the InfoSphere BigInsights Console, and is
available to all users of the console. No additional setup is required.

If you log into the InfoSphere BigInsights Console as the bigsql user, you can run
all of the statements in the Big SQL console that you ran in the previous lessons.
The Big SQL Console runs as the InfoSphere BigInsights Console logged-in user,
and therefore has the authorizations of that user.
1. Open the InfoSphere BigInsights Console. Click the Welcome tab.
2. In the Quick Links pane, click Run BigSQL Queries.
The Big SQL Console opens in your browser where you can enter one or more
queries. Make sure that the Big SQL radio button is selected.

Figure 5. BigInsights Console Welcome page
3. In the query entry field, type the following statements:
CREATE HADOOP TABLE new4Console (
ProductName VARCHAR(100), Quantity BIGINT, ProductCode int);
INSERT INTO new4Console VALUES ('Weezers',522,1);
INSERT INTO new4Console VALUES ('Somers',3566,5);
INSERT INTO new4Console VALUES ('Gowzers',3566,5);
SELECT * FROM new4Console;
SELECT * FROM new4Console WHERE ProductCode >0 Order by ProductName;



Since you did not type a schema name, the table is created in the default
schema, which is the user name that you used to log into the InfoSphere
BigInsights Console. The schema is biadmin if you logged in as biadmin.
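
If you want the table to be created in a specific schema regardless of the user
that you logged in as, qualify the table name with a schema name. The following
statement is a sketch; the schema name is only an example:
CREATE HADOOP TABLE myschema.new4Console2 (
ProductName VARCHAR(100), Quantity BIGINT, ProductCode INT);
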
4. Click Run. You can type each statement individually, and click Run for each,
but this step shows you that you can run many statements together. When you
run multiple statements at one time, for each query that produces output, you
see a separate Result page.

Figure 6. Big SQL web console page

The output appears in the lower half of the console window, in the Result tab.
One Status tab exists to display the success or failure of every result. The
contents of each result are limited to 200 rows.
5. To see statements that you ran previously, expand the list box of previously run
statements, in the top pane that is above your current statement. To rerun one
of those statements, click the statement to place it back into the current
statement window. Then, you can click Run to do the query again.
6. View the new table and its contents in the distributed file system by opening
the Files tab in the InfoSphere BigInsights Console. Click the DFS Files tab and
follow the path to the biadmin schema to find the new4Console table and its
contents.

Lesson 1.7: Advanced: Creating a user defined function to return total units sold and price with a discount
This optional lesson shows you how to define and register your own function that
generates a scalar value that represents item totals.



User-defined functions (UDFs) are extensions or additions to the existing built-in
functions of the Big SQL language. The Big SQL scalar functions are implemented
as static methods on a class.

An external routine is a function whose logic is implemented in an external
programming language, such as C, C++, or Java.

In this lesson, you are going to write a small Java application that contains code to
implement a scalar function that returns total units sold.
1. In the Eclipse client, create a Java project:
a. In the IBM InfoSphere BigInsights Eclipse environment, click File > New >
Project. From the New Project window, select Java Project. Click Next.
b. Type MyUDFProject in the Project Name field. Click Next.
c. Open the Libraries tab, and click Add External Jars. Select the appropriate
JDBC drivers from your local path, which by default includes these two JAR
files:
db2jcc_license_cu.jar
db2jcc4.jar
d. Click Finish. Click No when you are asked if you want to open a different
perspective.
2. Create a Java class:
a. Right-click the MyUDFProject project, and click File > New > Java >
Package. In the Name field, in the New Java Package window, type udf.
Click Finish.
b. Right-click the udf package, and click File > New > Java > Class.
c. In the New Java Class window, type MyUdf in the Name field. Select the
public static void main(String[] args) check box. Click Finish.
3. Copy the following JAVA code into the MyUdf.java file:
package udf;
public final class MyUdf {
public static double getItemTotal
(int units,
double price,
int discount
)
{
if (
units <= 0 ||
price <= 0 ||
discount < 0 ||
discount > 100
)
{
return -1;
}
else
{
return units *
price *
((100 - discount) /100.0);
}
}
}
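
Before you export the JAR file, you might want to verify the arithmetic
locally. The following class is an optional sketch; it is not part of the
tutorial steps and is not exported with the JAR file:
package udf;
public class TestMyUdf {
public static void main(String[] args) {
// 100 units at 9.99 each with a 10% discount = 899.1
System.out.println(MyUdf.getItemTotal(100, 9.99, 10));
// An invalid discount (greater than 100) returns -1
System.out.println(MyUdf.getItemTotal(100, 9.99, 150));
}
}
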
4. Save the file and then, right-click the MyUdf.java file and click Export. Expand
the Java category and select JAR file. Click Next.
5. In the Select the resources to export pane, select the udf package. You see that
MyUdf.java is also selected. In the Select the export destination field, specify
or browse to a directory on your Linux local file system to hold the JAR file.
The JAR file must be accessible by the bigsql database. Name the JAR file
tot_JAR.jar. Click Finish.
6. Install the JAR file by using the open source JSqsh client:
One way to organize the classes for a Java™ routine is to collect those classes
into a JAR file. If you do this, you need to install the JAR file into the database
catalog. Use the JSqsh install command \install-jar to put the JAR file into
the database. This procedure creates a new definition of a JAR file in the local
database catalog.
a. Open JSqsh with the bigsql connection. This command assumes that you
already created a connection profile called bigsql.
$JSQSH_HOME/bin/jsqsh bigsql
b. Type the following command:
\install-jar --sqlj=/home/bigsql/tot_JAR.jar --id=My_Jar

The sqlj parameter takes the full path to the JAR file. The id parameter
means that subsequent SQL commands that use the tot_JAR.jar file can refer
to it with the name 'My_Jar'.
7. Register the function by running a CREATE FUNCTION statement from your
Eclipse or JSqsh client:
CREATE FUNCTION gosalesdw.getItemTotal
(INT,DOUBLE,INT)
RETURNS DOUBLE
NO SQL
LANGUAGE JAVA
EXTERNAL NAME 'My_Jar:udf.MyUdf!getItemTotal'
PARAMETER STYLE JAVA
;
In the above example, My_Jar is the short name that you defined in the install
command. It represents the JAR file that contains the function class,
tot_JAR.jar. The package name is udf and MyUdf is the class name. The
function name is the method name, getItemTotal. The function refers to the
three input types: an INTEGER, a DOUBLE, and an INTEGER. The output is a
DOUBLE.
The Java routine does not need to exist before you run the CREATE statement.
But the routine must be accessible at the time that you use the function in a
query.
8. Now, use the function:
SELECT EMPLOYEE_KEY,
gosalesdw.getItemTotal(QUANTITY, UNIT_PRICE, 10)
AS "the function result"
FROM GOSALESDW.SLS_SALES_FACT fetch first 5 rows only;



9. Optionally, if you want to remove the function, follow these steps:
a. Drop the registered function:
DROP FUNCTION gosalesdw.getItemTotal;
b. Remove the JAR file in the JSqsh client with the following command:
\remove-jar --sqlj JAR_ID

where JAR_ID is the id you used to install the JAR.

Lesson 1.8: Advanced: Creating and running a simple Big SQL query from a JDBC client application
This optional lesson shows you how to create a simple JDBC application to run Big
SQL queries.

You can create a JDBC application to open a database connection, run a Big SQL
query, and then display the results of the query.
1. Create a Java project:
a. In the IBM InfoSphere BigInsights Eclipse environment, click File > New >
Project. From the New Project window, select Java Project. Click Next.
b. Type MyJavaProject in the Project Name field. Click Next.
c. Open the Libraries tab, and click Add External Jars. Select the Big SQL
JDBC driver from your local path, which by default includes these two JAR
files:
db2jcc_license_cu.jar
db2jcc4.jar
d. Click Finish. Click No when you are asked if you want to open a different
perspective.
2. Create a Java class:
a. Right-click the MyJavaProject project, and click File > New > Package. In
the Name field, in the New Java Package window, type aJavaPackage4me.
Click Finish.
b. Right-click the aJavaPackage4me package, and click File > New > Class.
c. In the New Java Class window, type SampApp in the Name field. Select the
public static void main(String[] args) check box. Click Finish.



3. Copy or type the following code into the SampApp.java file. Make sure that you
modify the value in the static final String db = variable to reflect the host
name that you are using. The code assumes that the user ID is bigsql and the
password is bigsql. Remember, the user ID must have database administrator
privileges.
package aJavaPackage4me;
//Import required packages
import java.sql.*;

public class SampApp {

/**
* @param args
*/
// set JDBC and database information

static final String db = "jdbc:db2://abc.com:51000/bigsql";


static final String user = "bigsql";
static final String pwd = "bigsql";

public static void main(String[] args) {


Connection conn = null;
Statement stmt = null;
System.out.println("Started sample JDBC application.");
try{
// Register JDBC driver
Class.forName("com.ibm.db2.jcc.DB2Driver");

// Get a connection
conn = DriverManager.getConnection(db, user, pwd);
System.out.println("Connected to the database.");

// Execute a query
stmt = conn.createStatement();
System.out.println("Created a statement.");
String sql;
sql = "select * from gosalesdw.sls_product_dim " +
"where product_key=30001";
ResultSet rs = stmt.executeQuery(sql);
System.out.println("Executed a query.");

// Obtain results
System.out.println("Result set: ");
while(rs.next()){
//Retrieve by column name
int product_key = rs.getInt("product_key");
int product_number = rs.getInt("product_number");
//Display values
System.out.print("* Product Key: " + product_key + "\n");
System.out.print("* Product Number: " + product_number + "\n");
}
// Close open resources
rs.close();
stmt.close();
conn.close();
}
catch(SQLException sqlE){
// Process SQL errors
sqlE.printStackTrace();
}
catch(Exception e){
// Process other errors
e.printStackTrace();
}finally{
// Ensure resources are closed before exiting
try{

if(stmt!=null)
stmt.close();
}catch(SQLException sqle2){
// nothing we can do
}
try{
if(conn!=null)
conn.close();
}catch(SQLException sqlE){
sqlE.printStackTrace();
}
}// end finally block
System.out.println("Application complete");
}// end main method
}// end class SampApp
a. The Java code must first declare the package. Then, you include the
packages that contain the JDBC classes that you need for database
programming.
b. Set up the required database information, including a user name and
password, so that you can refer to it.
c. You must register the JDBC driver so that you can open a communications
channel with the database.
d. Open the connection with the getConnection(db, user, pwd) method. You
pass the variables that you created in an earlier step.
e. Run a query by submitting an SQL statement to the database:
sql =
"select * from gosalesdw.sls_product_dim " +
"where product_key=30001";
f. You extract the data from the result set by issuing the getInt method. You
display the output by using the print method.
g. Clean up the environment by closing all of the database resources.
4. Save the file, and right-click the Java file, and click Run as > Java Application.

The results show in the Console view of Eclipse:

You have now experienced another way to run queries.
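
As a variation on the query in the sample application, you could use a
parameterized statement instead of building the SQL string with a literal
value. The following fragment is a sketch that replaces the statement-creation
and query-execution part of the sample; the rest of the program stays the same:
// Sketch: run the same query with a PreparedStatement and a bound parameter
String sql = "select * from gosalesdw.sls_product_dim where product_key = ?";
PreparedStatement pstmt = conn.prepareStatement(sql);
pstmt.setInt(1, 30001); // bind the product key
ResultSet rs = pstmt.executeQuery();
while (rs.next()) {
System.out.println("* Product Key: " + rs.getInt("product_key"));
System.out.println("* Product Number: " + rs.getInt("product_number"));
}
rs.close();
pstmt.close();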



Module 2: Analyzing big data by using Big SQL and BigSheets
This module teaches you how to use the IBM InfoSphere BigInsights BigSheets
component with Big SQL to expand the use of the data.

You can use BigSheets and Big SQL together to read data, and then create tables
from that data.

In Lesson 2.1, Lesson 2.2, and Lesson 2.3 you will use data from the Sample
Outdoor Company to examine the result of sales by year, and then export the data
to BigSheets to create a chart that reflects the total sales by year.

In Lesson 2.4 and Lesson 2.5, you will use data from the occurrences of IBM
Watson in social media to illustrate some additional features of working with Big
SQL and BigSheets.

Learning objectives

After you complete the lessons in this module you will understand the concepts
and know how to do the following tasks:
v Create a BigSheets workbook.
v Import and export data to and from Big SQL.
v Create tables from other tables in Big SQL.

Time required

This module should take approximately 60 minutes to complete.

Lesson 2.1: Preparing queries to export to BigSheets that examine the results of sales by year
In this lesson, you run queries and then export the results to other applications,
such as BigSheets. You will also see how to use a BigSheets workbook as input to
an IBM Big SQL table.

Each feature of InfoSphere BigInsights provides powerful insights for manipulating
and analyzing data. But even more powerful is the working relationship between
the BigSheets and Big SQL features.
1. Open InfoSphere BigInsights Tools for Eclipse and create an SQL file that is
called org.sql in the myBigSQL project directory.
2. Double-click the file to open it in the SQL editor.
3. Type or copy the following code into the org.sql file:
WITH SALES
(YEAR, TOTAL_SALES, RANKED_SALES)
AS
(
SELECT CAST(ORDER_DAY_KEY AS VARCHAR(4)) AS YEAR,
SUM (SALE_TOTAL) AS TOTAL_SALES,
RANK() OVER (ORDER BY SUM(SALE_TOTAL) DESC) AS RANKED_SALES
FROM GOSALESDW.SLS_SALES_FACT GROUP BY CAST(ORDER_DAY_KEY AS VARCHAR(4))
)
SELECT YEAR, total_sales, ranked_sales FROM sales
ORDER BY YEAR, ranked_sales DESC;

To see the result of sales by year, the statement uses some of the features of Big
SQL that you used in earlier lessons. Use the WITH clause to create an inline
table. Then, in the same statement, use the RANK() OVER function to rank the
sale_total column values.

Learn more about the RANK function: The RANK function is one of the
On-Line Analytical Processing (OLAP) functions that provide the ability to
return ranking, row numbering and existing aggregate function information as
a scalar value in a query result. For more information about the OLAP
functions, see Olap Specification.
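
For example, the following statement is a sketch, not part of the lesson, that
compares ROW_NUMBER and RANK over the same ordering; ROW_NUMBER always produces
distinct numbers, while RANK assigns the same value to tied rows:
SELECT product_key, quantity,
ROW_NUMBER() OVER (ORDER BY quantity DESC) AS row_num,
RANK() OVER (ORDER BY quantity DESC) AS sales_rank
FROM gosalesdw.sls_sales_fact
FETCH FIRST 10 ROWS ONLY;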

4. Save the file and then click the Run SQL icon ( ). The statement shows the
inline view, sales, which simplifies the final SELECT statement. In addition,
the nested aggregate functions demonstrate how the data types and
presentation can be manipulated.

Figure 7. Results of a query to obtain the top sales by year

To do more analysis on the results, you can use BigSheets to create charts that you
can show on the dashboard in your InfoSphere BigInsights server.

The next three lessons show you how to share data between BigSheets and Big
SQL.
v In “Lesson 2.2: Exporting Big SQL data about total sales by year to BigSheets”
on page 68, you will export a CSV file to BigSheets and create a workbook and a
chart reflecting the total quantity sold by product name.

v In “Lesson 2.4: Exporting BigSheets data about IBM Watson blogs to Big SQL
tables” on page 71, you read the blogs-data.txt file that you downloaded in an
earlier lesson, into a BigSheets workbook using the JSON Array reader.

v In “Lesson 2.5: Creating a catalog table from BigSheets Watson blog data to use
in Big SQL” on page 76, you use the Common Catalog feature of the InfoSphere
BigInsights Console to create tables from a BigSheets workbook.

Lesson 2.2: Exporting Big SQL data about total sales by year
to BigSheets
In this lesson, you will export your data about total sales by year from Big SQL to
BigSheets.

BigSheets is capable of reading many data types. For this lesson, you will be
exporting output from Big SQL as a comma-separated value (CSV) type.

1. Export the output from the query in the previous lesson, to a CSV file so that
you can use BigSheets to analyze the data:
a. In the SQL Results page, open the Result1 tab. Select at least one row, and
then right-click and select Export > Current Result.
b. In the Select Export Format window, click the Browse button to locate a
destination directory in your local system. The default name is
<path>/result.<filetype> in a Linux environment, or <path>\
result.<filetype> in a Windows environment. Change the
result.<filetype> file name to SampleResults.
c. In the Format field, select CSV file (*.csv).
BigSheets can handle different readers. Select the correct reader when you
are in BigSheets to see the data in tabular format.
d. Click Finish.
2. To make this SampleResults.csv file available to BigSheets, upload it to the
InfoSphere BigInsights server:
a. Open the InfoSphere BigInsights Console, and click the Files page.
b. To create a directory on the server, select the tmp directory, and click the

Create Directory icon .


c. In the Create Directory window, name the directory SamplesOutput, and
click OK.

d. Select the SamplesOutput directory, and click the Upload icon .


e. In the Upload window, click Browse and navigate to the SampleResults.csv
file. Click Open. Click OK to upload the file to the server.
The file is available to users on the cluster.
3. Create a BigSheets workbook:
a. Click the SampleResults.csv. The contents of the file is displayed in the
right pane on the web console.



b. Click the Sheets radio button to change the display format to a BigSheets
format.

c. Click the Line Reader edit icon ( ) to change the reader format. Select
Comma Separated Value (CSV) Data from the drop-down list. And then
click the green check mark.

Figure 8. Selecting the CSV reader format

The contents of the file now appear as a table with three columns.
d. Click Save as Master Workbook. In the Name field type SampleResults. In
the Description field, type From a CSV file. Click Save. The BigSheets tab
of the InfoSphere BigInsights Console opens in the View Results page.
From there, you can continue with BigSheets functions.
4. Optional: Create a chart from the data to illustrate the sales quantity by year:
a. Click Add Chart.
b. Select Chart and then select Bar.
c. In the Chart Name field, type Totals by year.
d. In the Title field, type Year totals.
e. In the X-Axis field, select YEAR.
f. In the X-Axis Label field, type Year.
g. In the Y-Axis field, select TOTAL_SALES.
h. In the Y-Axis Label field, type Total sales.
i. In the Sort By field, select X Axis.
j. Click the green check mark. Then click Run.
When the processing is complete, you have a visual representation of the total
sales by YEAR. You can see that 2006 represents the most sales.



Lesson 2.3: Creating tables for BigSheets from other tables
The CREATE TABLE AS... clause can be a powerful tool to create a table from
another table.

Instead of using the Export and Upload features that are described in the previous
steps, use the CREATE TABLE AS... clause of IBM Big SQL.
1. From your Eclipse environment, in the same org.sql file, add the following
lines in front of the statement that contains the WITH clause:
CREATE HADOOP TABLE gosalesdw.myprod_sales_tot
(Year varchar(4), Sales_tot float, Rank int)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
AS

The new CREATE statement should look like the following statement:
CREATE HADOOP TABLE gosalesdw.myprod_sales_tot
(Year varchar(4), Sales_tot float, Rank int)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
AS
WITH SALES
(YEAR, TOTAL_SALES, RANKED_SALES)
AS
(
SELECT CAST(ORDER_DAY_KEY AS VARCHAR(4)) AS YEAR,
SUM (SALE_TOTAL) AS TOTAL_SALES,
RANK() OVER (ORDER BY SUM(SALE_TOTAL) DESC) AS RANKED_SALES
FROM GOSALESDW.SLS_SALES_FACT GROUP BY CAST(ORDER_DAY_KEY AS VARCHAR(4))
)
SELECT YEAR, total_sales, ranked_sales FROM sales
ORDER BY YEAR, ranked_sales DESC;



2. Click the Run SQL icon. The new table is created in the GOSALESDW schema.
You might see a warning message that the Year data is truncated, but this is
acceptable in this case.
3. Open the InfoSphere BigInsights Console, and click the Files page.
4. Locate your new table in hdfs/biginsights/hive/warehouse/gosalesdw.db/
myprod_sales_tot/.
5. Click the file to see the columns in the new table. Click the Sheet radio button.

6. Click the Line Reader edit icon ( ) to change the reader format. Select Tab
Separated Value (TSV) Data from the drop-down list. Clear the check mark in
the Headers Included check box. And then click the green check mark. The
contents of the file now appear as a table with three columns.
7. Click Save as Master Workbook. In the Name field of the dialog, type
TSV_MyTotals. The BigSheets tab of the InfoSphere BigInsights Console opens in
the View Results page. From there, you can continue with BigSheets functions.

Lesson 2.4: Exporting BigSheets data about IBM Watson blogs to Big SQL tables
In this lesson, you use workbooks that you create in BigSheets as input to Big SQL
tables. This lesson shows you how to export to a tab-separated value (TSV) file,
and then use the output as input to a Big SQL table.

You are going to use the blogs-data.txt file, which is one of the files you
downloaded when you set up your tutorial environment.

The data in the blogs-data.txt file comes from blogs that reference the term IBM
Watson. In this lesson you are going to turn that text data into a BigSheets
workbook, and then use the functions in BigSheets to format the data into
something that is easier to understand. To examine the blogs data in the
blogs-data.txt file, you create a workbook and use that data for a new Big SQL
table.

This lesson introduces a way of creating tables from data that you analyze by
using BigSheets, with both a TSV reader format and a JSON Array format.

Task 2.4.1: Creating and modifying a BigSheets workbook from JSON Array formatted data
The data in the blogs-data.txt file is formatted in a JSON Array structure. You
will select a BigSheets reader that conforms to that format.
1. Open the InfoSphere BigInsights Console.
2. Create and modify a BigSheets workbook from the blogs-data.txt file:
a. From the InfoSphere BigInsights Console, click the BigSheets tab.
b. Click New Workbook.
c. In the Name field, type WatsonBlogData.
d. In the DFS File tab, expand the Hadoop Distributed File System (HDFS)
directories until you find the /user/biadmin/bi_sample_data/blogs-data.txt
file. Select this file.



e. In the Preview area of the screen, click the edit icon ( ).
f. From the Select a reader list, select the JSON Array, and click the check
mark to apply the reader.

Figure 9. Changing the BigSheets reader

g. Because the data columns exceed the viewing space, click Fit column(s).
The first eight columns display in the Preview area.
h. Click the check mark to save the workbook.
i. In the View Results page of BigSheets, click Build new workbook. Rename
the workbook by clicking the edit icon, entering the new name of
WatsonBlogDataRevised, and clicking the green check mark.
j. To more easily see the columns, click Fit column(s), in the
WatsonBlogDataRevised workbook. Now columns A through H fit within
the width of the sheet.
k. You do not need to use all of the columns in your IBM Big SQL table.
Remove multiple columns by following these steps:
1) Click the down arrow in any column heading and select Organize
columns.
2) Click the X next to the following columns to mark them for removal:
v Crawled
v Inserted
v IsAdult
v PostSize
3) Click the green check mark to remove the marked columns.



l. Click Save > Save to save the workbook. In the Save workbook dialog, click
Save. Click Exit to start the run process. Click Run to run the workbook.

Task 2.4.2: Exporting the BigSheets blog data workbook to a TSV file
You can export your BigSheets workbook to a file. Then, use that file to analyze
the data in IBM Big SQL.
1. In the menu bar of the WatsonBlogDataRevised workbook, click Export data.
2. In the drop-down window, select TSV in the Format Type field.
3. In the Export to radio buttons, select File as the export target.
4. Click Browse to select a destination directory in the distributed file system.
Select your path, and then type WatsonBlogs as the name of the file. Click OK.
5. Make sure that the Include Headers check box is cleared. Click OK.
6. A message dialog shows that the workbook is successfully exported. Click OK
to close that dialog.
7. Make a note of the column names and the type of data from the BigSheets
workbook that you want to define in Big SQL. You exported these columns
from BigSheets:
v Country - contains a two-letter country identifier.
v FeedInfo - contains information from web feeds, with varying lengths.
v Language - contains the string that identifies the language of the feed.
v Published - contains a date and time stamp.
v SubjectHtml - contains a subject that is of varying length.
v Tags - contains a string of varying length that provides categories.
v Type - contains the source of the web feed, whether a news blog or a public
feed.
v URL - contains the web address of the feed, with varying length.

Task 2.4.3: Creating a Big SQL script that creates Big SQL tables from the exported TSV file
In this task, you create an SQL script to create Big SQL queries based on the
BigSheets blogs data workbook.
1. In the InfoSphere BigInsights Eclipse environment, create a project that is
named MyBigSheetsAnalysis, and a new SQL script named NewsBlogs.
2. In the NewsBlogs.sql file, copy or type the following code:
CREATE SCHEMA IF NOT EXISTS BigSheetsAnalysis;
USE BigSheetsAnalysis;

CREATE HADOOP TABLE BigSheetsAnalysis.sheetsOut
(country VARCHAR(2), FeedInfo VARCHAR(300),
language VARCHAR(25), published VARCHAR(25),
subject VARCHAR(300), tags VARCHAR(100),
type VARCHAR(20), url VARCHAR(100))
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';

LOAD HADOOP USING FILE URL
'/<distributed file system path>/WatsonBlogs.tsv'
with SOURCE PROPERTIES ('field.delimiter'='\t')
INTO TABLE BigSheetsAnalysis.sheetsOut OVERWRITE;

SELECT * FROM BigSheetsAnalysis.sheetsOut;


Replace the distributed file system path with the path of your uploaded data.



3. Click Run ( ). The following screen capture shows a portion of the output
from the BigSheets workbook after it is used in a Big SQL table:

Figure 10. A portion of the output from the BigSheets workbook after it is selected from a Big
SQL table

4. Query the table to get the feed information, publication dates, and URLs of
English-based blog posts about IBM Watson:
SELECT feedinfo, published, url
from BigSheetsAnalysis.sheetsOut WHERE language='English';

A portion of the output is shown in the following example:

You started with data in a JSON array format and read it into a BigSheets
workbook. Then you updated the workbook to show only the columns in which
you had an interest. Then you exported the data to the distributed file system and
used that data in a Big SQL table.

Task 2.4.4: Exporting the BigSheets workbook as a JSON Array for use with a SerDe application in Big SQL
In this lesson, you download a SerDe JAR file that you can use in Big SQL to
process JSON data. The SerDe can be used to transform a JSON record into
something that Hive and then Big SQL can process. Then, you export the Watson
blog data as a JSON Array format from BigSheets.

You can export data from a BigSheets workbook as a JSON Array and then use a
SerDe application (Serializer/Deserializer) to process the JSON data. You then
make the data available to a Big SQL table.

By using the SerDe interface, you instruct Hive as to how a record is processed.
You can write your own SerDe for processing JSON data, or you can use a package
that is available from the web. For the purposes of this example, download a JAR
file that helps you with the conversion. SerDe applications (JAR files) can be
downloaded from any open source host.
1. Use a web search engine, and locate and download a SerDe .jar file. Search for
the string JSON Serde.
2. Add the SerDe .jar file to the $BIGSQL_HOME/userlib directory and to the
$HIVE_HOME/lib directory.
3. Open the JAR file and note the class file name so that you can add that to your
CREATE statement.
4. Stop and restart the Big SQL service from the Linux command line:
a. Stop the Big SQL server by running the following command from the
$BIGINSIGHTS_HOME/bin directory:
./stop.sh bigsql
b. Restart the Big SQL server by running the following command from the
$BIGINSIGHTS_HOME/bin directory:
./start.sh bigsql
The .jar file is available to the Big SQL JVM and the MapReduce JVMs.
5. In the menu bar of the WatsonBlogDataRevised workbook, which you created
in “Task 2.4.1: Creating and modifying a BigSheets workbook from JSON Array
formatted data” on page 71, click Export data:
a. In the drop-down window, select the JSON Array type in the Format Type
field. Select File as the target in Export to. Then, click Browse to select a
destination path in the distributed file system.
b. Name the file WatsonBlogsData and click OK.

Task 2.4.5: Creating a Big SQL table in Eclipse using the SerDe application to process the Watson blog data
In this lesson, you create a table and a query to access the BigSheets JSON array
data.
1. In the InfoSphere BigInsights Eclipse environment, open the NewsBlogs.sql file,
and create a table that accesses the appropriate data in the JSON output from
BigSheets, and that uses the SerDe class. Type or copy the following code:
CREATE HADOOP TABLE BigSheetsAnalysis.watson_json (
Country STRING,
FeedInfo STRING,
Language STRING,
Published STRING,
SubjectHtml STRING,
Tags STRING,
Type STRING,
Url STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.JsonSerde'
;
2. Select the CREATE TABLE statement and press F5.
3. In the InfoSphere BigInsights Console, in the Files tab, locate the
WatsonBlogsData.json file that you created in the previous lesson and select it.

a. Click the Copy button ( ).


b. Select the schema and table as the destination:



4. Type the following select statement in the NewsBlogs.sql file, and then
highlight the SELECT statement and press F5.
SELECT * FROM BIGSHEETSANALYSIS.WATSON_JSON;

Your output should look similar to the output from the TSV file.

Figure 11. JSON array output in a Big SQL result

Lesson 2.5: Creating a catalog table from BigSheets Watson blog data to use in Big SQL
In this lesson, you will analyze data in BigSheets and then create a Big SQL table
from the Common Catalog.

As you have already learned, you can use BigSheets and Big SQL together to read
data, and then create a table from that data.

In this lesson, you learn another way to create a table from a BigSheets workbook.

Task 2.5.1: Creating a common catalog table from a master workbook
You can work with the common catalog table, which is available from the Files tab,
and its data just as any other table in Big SQL.



1. Create a BigSheets workbook from the Watson blogs data.
a. Click the BigSheets tab in the InfoSphere BigInsights Console.
b. Click New Workbook.
c. In the New Workbook window, type MyBlogsWB as the name of the
workbook. In the Description field, type The Blogs data.
d. Expand the directories in the DFS File tab /user/biadmin/bi_sample_data/
article_sampleData to the blogs-data.txt file, and select the file. The data
from the blogs-data.txt file opens in right pane of the New Workbook
window.

e. Click the Edit workbook reader icon ( ) and select JSON Array as the
reader type.
f. Click the green check mark to save the workbook and open it in the View
Results window.
2. Build a new workbook:
a. Click Build new workbook. Rename the workbook by clicking the edit
icon, entering the new name MyBlogsWBRevised, and clicking the green check
mark.
b. Click Fit column(s) to fit the columns within the width of the sheet.
c. Remove some columns that you do not need to use in your IBM Big SQL
table. Remove multiple columns by following these steps:
1) Click the down arrow in any column heading and select Organize
columns.
2) Click the X next to the following columns to mark them for removal:
v Crawled
v Inserted
v IsAdult
v PostSize
3) Click the green check mark to remove the marked columns
3. Click Save > Save to save the workbook. In the Save workbook dialog, click
Save. Click Exit to start the run process. Click Run to run the workbook.
4. Click the Create Table button to save the workbook as a common catalog table
and as a table in the distributed file system:

Figure 12. BigSheets Create Table

a. In the Target Schema field, keep the default sheets schema name.



b. In the Table Name field, type MyBlogsTable.
c. Click Confirm.
5. Click the Files tab, and then open the Catalog Tables page. Expand the sheets
schema, to find the new table, MyBlogsTable.

Task 2.5.2: Reviewing the common catalog table in the Eclipse client
This part of the lesson demonstrates some useful IBM Big SQL commands. By
using these commands, you can verify the structure and contents of tables that you
create.
1. To verify that your new table exists and is usable, open the InfoSphere
BigInsights Eclipse environment. In the MyBigSheetsAnalysis project, create an
SQL file called common_table.sql.
2. In the common_table.sql file, type the following statement:
SELECT * FROM sheets.MyBlogsTable;

Save the file, and click the Run SQL icon ( ).


3. The table is a Hive external table that has the same schema definition as it did
in BigSheets. Type the following HCAT_DESCRIBETAB command to see the
definition of the Hadoop table as defined by the Hive catalogs.
SELECT SYSHADOOP.HCAT_DESCRIBETAB(
'SHEETS','MYBLOGSTABLE')
FROM sysibm.sysdummy1;

The command produces the following output:

Hive schema : sheets
Hive name : myblogstable
Type : EXTERNAL_TABLE
Table params :
EXTERNAL = TRUE
avro.schema.literal = {
"type":"record",
"name":"TUPLE_9",
"fields":[{
"name":"Country",
"type":["null","string"],
"doc":"autogenerated from Pig Field Schema"},
{"name":"FeedInfo","type":["null","string"],
"doc":"autogenerated from Pig Field Schema"},
{"name":"Language","type":["null","string"],
"doc":"autogenerated from Pig Field Schema"},
{"name":"Published","type":["null","string"],
"doc":"autogenerated from Pig Field Schema"},
{"name":"SubjectHtml","type":["null","string"],
"doc":"autogenerated from Pig Field Schema"},
{"name":"Tags","type":["null","string"],
"doc":"autogenerated from Pig Field Schema"},
{"name":"Type","type":["null","string"],
"doc":"autogenerated from Pig Field Schema"},
{"name":"Url","type":["null","string"],
"doc":"autogenerated from Pig Field Schema"}]}
sheets.collection.id = 2
transient_lastDdlTime = 1401831107
SerDe : null
SerDe lib : org.apache.hadoop.hive.serde2.bigsheetsavro.AvroSerDe
SerDe params :
Location : hdfs://my.abc.com:9000/biginsights/sheets/col_2/data
Inputformat : org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat
Outputformat: org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat
Columns :
Name : country Type : string Comment : null
Name : feedinfo Type : string Comment : null
Name : language Type : string Comment : null
Name : published Type : string Comment : null
Name : subjecthtml Type : string Comment : null
Name : tags Type : string Comment : null
Name : type Type : string Comment : null
Name : url Type : string Comment : null

You now see that you can analyze your data in both BigSheets and in Big SQL.
You can easily change the metadata of the table, reformat the columns, and
manipulate the output to satisfy many goals, moving the data from Big SQL to
BigSheets, or from BigSheets to Big SQL.

As you will see in the next module, you can make subsets of your information
from Big SQL and use it in open source spreadsheet applications.

Module 3: Analyzing Big SQL data in a client spreadsheet program


In this optional module, you learn how to import your Big SQL query results into
one of the widely distributed spreadsheet applications. Then, you can examine
data in that application.

There are many spreadsheet applications that you can use. This lesson assumes
that you have the Microsoft Excel spreadsheet application. Depending on the
spreadsheet application and version that you use, you might see differences in the
interface controls that are mentioned in these lessons.

Big SQL provides connectivity for some applications through either a 32-bit or a
64-bit ODBC driver, on either Linux (Red Hat Enterprise Linux (RHEL) 6 or SUSE
Linux Enterprise Server (SLES) 11) or Windows (Microsoft Windows 7 or Microsoft
Windows Server 2008). The Big SQL connectivity conforms to the Microsoft Open
Database Connectivity 3.0.0 specification.

Depending on the spreadsheet application that you use, you might need to select
the ODBC driver that you install from the operating system, or from the
spreadsheet application itself. Refer to information in your particular spreadsheet
application about importing data from external data sources.

Learning objectives

After you complete the lessons in this module, you will know how to do the
following tasks:
v Install the IBM Data Server Driver Package to access the ODBC drivers.
v Import data into your client spreadsheet.
v Query the data in your client spreadsheet.

Time required

This module should take approximately 30 minutes to complete.

Lesson 3.1: Installing the IBM Data Server Driver Package for
the client ODBC drivers
In this lesson, you install the ODBC drivers that you must use with the client
spreadsheet application.



For more information about the Data Server Driver Package, see Validating IBM
Data Server Driver Package (Windows) installation.

Attention: For Linux users, or users of the IBM InfoSphere BigInsights Quick
Start Edition, if you must attach to a remote ODBC client from your Linux
machine, follow these steps:
1. From a Linux command line, type the following command to determine the IP
address of your current InfoSphere BigInsights cluster:
cat /etc/hosts
2. Open a browser outside of the cluster environment by typing the following in
the URL address field:
<ip address>:8080
or
<ip address>:8443 if you are running secure protocol

The InfoSphere BigInsights Console opens in the non-cluster location. You can
continue with the steps to download the driver and attach the ODBC driver to
the correct location.
1. Download the 64-bit driver package:
a. From the InfoSphere BigInsights Console Welcome Page Quick Links pane,
click Download client library and development software.
b. Select Big SQL clients and drivers and click Download.
c. From the Download Fix Packs by version for IBM Data Server Client
Packages, select the correct IBM Data Server Driver Package for your
operating system. For the purposes of this tutorial, select the package for
Windows 64-bit, DSClients-ntx64-dsdriver-10.5.300.125-FP003. This package
works with a 32-bit or 64-bit spreadsheet client.
d. Click Close in the Download client library and development software
window.
2. Right-click the v10.5fp3_ntx64_dsdriver_EN.exe file and select Run as
administrator to install the package. This package is installed by default in
C:\Program Files\IBM\IBM DATA SERVER DRIVER\.
As part of the installation, configuration files are created in
C:\ProgramData\IBM\DB2\IBMDBCL1\cfg. For other operating systems or
versions, the installed location, and the configuration file location might be
different.
3. When the install is complete, navigate to the directory that contains the
sample configuration files, and copy db2dsdriver.cfg.sample to
db2dsdriver.cfg. Copy db2cli.ini.sample to db2cli.ini.
4. Edit the db2dsdriver.cfg file so that the result looks like the following file:
<configuration>
<dsncollection>
<dsn alias="MyDSN" name="bigsql" host="abc.com" port="51000"/>
</dsncollection>
<databases>
<database name="bigsql" host="abc.com" port="51000">
</database>
</databases>
</configuration>

Make sure that you update the values for host in two places in the file.



5. Validate the DSN alias that you created. From the bin directory of the installed
location of the driver package, C:\Program Files\IBM\IBM DATA SERVER
DRIVER\bin, type the following command:
db2cli validate -dsn MyDSN
If your entries in the configuration file are correct, validation is successful
with the following final output on your Windows screen: The validation is
completed.
6. Test the CLI connection that uses the DSN and database section entries in the
db2dsdriver.cfg file. From the same driver installed location, type the
following command:
db2cli validate -dsn MyDSN
-connect -user bigsql -passwd bigsql

If you have a successful connection, you will see the following in the screen
output:
Connection attempt for data source name "MyDSN":
====================================================

[SUCCESS]
7. Create an ODBC DSN to the alias that you just validated. The example
spreadsheet client in this tutorial is a 32-bit Microsoft Excel application. Use
the db2cli32 command, instead of the db2cli command, if you are using a
32-bit IBM Data Server Driver along with the 64-bit installer in a 64-bit
Windows computer. Type the following command if you are using a 64-bit
application:
db2cli registerdsn -add MyDSN -system

This command creates a system data source name that you can see in the
ODBC administrator tool.
8. Start the ODBC administrator tool.
For a 64-bit driver
a. Select the Control Panel from the Start menu.
b. Select Administrative Tools.
c. Click Data Sources (ODBC) for the 64-bit binary.
For a 32-bit driver in a 64-bit machine
Right-click the ..\Windows\SysWOW64\odbcad32.exe file and select Run
as administrator.
9. The ODBC Data Source Administrator opens. Click the System DSN tab.
10. Select MyDSN, and click Configure.
11. Type bigsql in the user name field, and bigsql in the password field. Click
Connect. The message "Connection tested successfully" is displayed.

Lesson 3.2: Importing Big SQL data in a client spreadsheet program
In this lesson, you will use the spreadsheet to run Big SQL queries.

Accessing an IBM Big SQL table from a client spreadsheet application is the
equivalent of running the query select * from <table_name> in InfoSphere
BigInsights. The table can be used in your client application as you would use any
spreadsheet data.
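
For reference, the import that the wizard performs in the following steps
corresponds to a query like the following sketch. The sheetsOut table name comes
from the earlier lesson in this tutorial; the schema qualifier, if any, depends on
where you created the table, so adjust the statement for your environment.

-- Hypothetical sketch: the spreadsheet import retrieves every row of the table,
-- just as this query does.
SELECT * FROM sheetsOut;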
1. Open the client spreadsheet application.



2. Click Data > Import External Data > New Database Query. The Choose Data
Source window opens with a list of the ODBC DSNs.

3. Select MyDSN to connect to it, and provide the login details. The list of tables
in the database is displayed.

4. Select the table sheetsOut, which you created in a previous lesson. Click Next.
5. You can continue to click Next in each window, or select filtering and sorting
attributes. In the Query Wizard - Filter Data window, click column <?> to
specify which rows to include in the data. Click Next.
6. In the Query Wizard - Sort Order window, select a column to sort by. Click
Next.
7. In the Query Wizard - Finish window, select the Return Data to Microsoft
Office Excel radio button. Click Finish.
8. In the Import Data window, select the Existing worksheet radio button to put
the data in the current worksheet. Click OK. The data is imported.

The spreadsheet application contains the result of the table that you selected.

Summary of developing Big SQL queries to analyze big data


This tutorial demonstrated some Big SQL statements and some techniques for
analyzing data.



Lessons learned

You now have a good understanding of the following tasks:


v How to create an InfoSphere BigInsights project in Eclipse.
v How to create Big SQL tables and load data into them.
v How to use the Big SQL editor in Eclipse to create queries.
v How to see the results of queries.
v How to export the output of a query for use in other applications.
v How to use Big SQL data in BigSheets and how to use BigSheets data in Big
SQL.
v How to access Big SQL data in client spreadsheet applications.

Additional resources

Read the pertinent articles on IBM developerWorks:


v What's the big deal about Big SQL? Introducing relational DBMS users to IBM's
SQL technology for Hadoop.

Chapter 7. Tutorial: Analyzing big data with IBM InfoSphere
BigInsights Big R
Learn how to use IBM InfoSphere BigInsights Big R to analyze, manipulate, and
visualize big data. You must download, license, and install the appropriate R
software before using Big R.

Big R uses the open source R language to enable rich statistical analysis. You can
use Big R to manipulate data by running a combination of R and Big R functions.
Big R functions are similar to existing R functions, but are designed specifically for
analyzing big data. You can use Big R to analyze data located on the InfoSphere
BigInsights server with an R environment.

This tutorial uses a sample data set that is included in the Big R package. The 11.8
MB sample data set is a random sample of 22 years of flight arrival and departure
information.

This tutorial requires basic R knowledge. To get started with R, view the course on
R programming on the Big Data University website.

The fictional Sample Outdoor Company is creating a complete outdoor package
trip that sends customers to outdoor destinations. To ensure satisfaction with every
aspect of the customer experience, the Sample Outdoor Company is looking into
airline data to choose airlines to partner with and when to schedule flights. You
will learn more about the airlines and travel times with the fewest delays
by running Big R code, and doing the analysis on the airline data set in the
following lessons.

Learning objectives

After completing the lessons in this module you will understand the concepts and
know how to do the following actions with Big R:
v Use Big R functions
v Connect to InfoSphere BigInsights data sources from the R user interface
v Create visualizations
v Create predictive models

Time required

This module should take approximately 1 hour to complete.

Prerequisites

Attention: R is licensed by the R project under the GNU General Public License.
IBM does not provide R and is not responsible for it or your use of it in any way.

Before you begin this tutorial, ensure that you have downloaded, licensed, and
installed the appropriate R software and the Big R package.

Quick Start Edition VM Users: If you are running the tutorial with the IBM
InfoSphere BigInsights Quick Start Edition VM image, to install R and Big R:
1. Double-click the Install BigR file from the Desktop.



2. When the installer runs, complete the following steps:
a. Enter solution 1 for both GCC Fortran problems.
b. Enter y to upgrade packages.

Lesson 1: Uploading the airline data set to InfoSphere BigInsights server with Big R
In this lesson, you upload the sample airline data set to the InfoSphere BigInsights
server, and then you access it as a bigr.frame object. A bigr.frame is the Big R
equivalent of an R data.frame, and it serves as a proxy for the underlying data set.

Load the Big R package, connect to the InfoSphere BigInsights server, and then
confirm that it is connected. Update the following example for your environment
settings, then run the code in your R environment.
library(bigr)
bigr.connect(host="host_name",
port=7052, database="default",
user="biadmin", password="password")
is.bigr.connected()

host_name is the host name of the node where your InfoSphere BigInsights Console
is installed.

biadmin is the name of the InfoSphere BigInsights administrator.

Quick Start Edition VM Users: If you are running the IBM InfoSphere BigInsights
Quick Start Edition VM image, run the following code in the RGui R Console:
library(bigr)
bigr.connect(host="bivm",
port=7052, database="default",
user="biadmin", password="biadmin")
is.bigr.connected()

For this tutorial, you use R to extract an airline data set from the Big R package
and import it into your distributed file system for further analysis. For typical
examples of importing data, see the tutorial on Importing data for analysis.
1. Run the following code to extract the airline data set from the bigr package
directory, and then create a data frame named airR:
airfile <- system.file("extdata", "airline.zip", package="bigr")
airfile <- unzip(airfile, exdir = tempdir())
airR <- read.csv(airfile, stringsAsFactors=F)
2. Convert airR to a bigr.frame named air, then move the data set to the
InfoSphere BigInsights server, and then show the sample airline data set from
the InfoSphere BigInsights server:
air <- as.bigr.frame(airR)
bigr.persist(air, dataSource="DEL",
dataPath="airline_demo.csv", header=T,
delimiter=",", useMapReduce=F)

Important: Moving the file to the InfoSphere BigInsights server can take a few
minutes.
The parameter useMapReduce is true by default. The sample airline data set is
not large, so setting the parameter to false makes this operation run faster.

Expected output:



...
47 0 0
48 0 0
49 0 0
50 0 0
... showing first 50 rows only.
3. Run the following code to access the same data set as a bigr.frame object
named air directly from the InfoSphere BigInsights Console:
air <- bigr.frame(dataSource="DEL",
dataPath="airline_demo.csv", delimiter=",",
header=T, useMapReduce=F)

After uploading the sample data set, the file exists in the following location in the
InfoSphere BigInsights Console: user/bigsql/airline_demo.csv.

Lesson 2: Exploring the structure of the data set with IBM InfoSphere
BigInsights Big R
In this lesson, you learn how to review the structure of the data set.

Complete “Lesson 1: Uploading the airline data set to InfoSphere BigInsights
server with Big R” on page 86.
1. In your R environment, become familiar with the bigr.frame object named air
created in the last lesson, by exploring the column names and column types by
running the following code:
colnames(air)
coltypes(air)

By default, Big R sets all column types to character. However, for this data set,
all column types need to be integers, except for: UniqueCarrier, TailNum,
Origin, Dest, and CancellationCode.

Expected output:
[1] "Year" "Month" "DayofMonth" "DayOfWeek"
[5] "DepTime" "CRSDepTime" "ArrTime" "CRSArrTime"
[9] "UniqueCarrier" "FlightNum" "TailNum" "ActualElapsedTime"
[13] "CRSElapsedTime" "AirTime" "ArrDelay" "DepDelay"
[17] "Origin" "Dest" "Distance" "TaxiIn"
[21] "TaxiOut" "Cancelled" "CancellationCode" "Diverted"
[25] "CarrierDelay" "WeatherDelay" "NASDelay" "SecurityDelay"
[29] "LateAircraftDelay"
> coltypes(air)
[1] "character" "character" "character" "character" "character" "character" "character"
[8] "character" "character" "character" "character" "character" "character" "character"
[15] "character" "character" "character" "character" "character" "character" "character"
[22] "character" "character" "character" "character" "character" "character" "character"
[29] "character"

2. Run the following code to assign type integer to all column types except for the
columns listed in the previous step, and then display the updated column
types:
coltypes(air) <- ifelse(1:29 %in% c(9,11,17,18,23), "character", "integer")
coltypes(air)

Expected output:
[1] "integer" "integer" "integer" "integer" "integer" "integer" "integer"
[8] "integer" "character" "integer" "character" "integer" "integer" "integer"
[15] "integer" "integer" "character" "character" "integer" "integer" "integer"
[22] "integer" "character" "integer" "integer" "integer" "integer" "integer"
[29] "integer"



3. Optional: Continue to explore the data set by running some of the following
functions.

Important: To avoid long processing times, limit the number of observations
to display, for example, by using the bigr.setRowLimit function.

Option                      Description
nrow(air)                   Number of flights (number of rows)
ncol(air)                   Number of flight attributes (number of columns)
dim(air)                    Data dimensions (rows x columns)
str(air)                    Structure of the data set, including sample data
head(air, 5)                First five rows
tail(air, 7)                Last seven rows
print(air$UniqueCarrier)    Carrier codes of all flights
print(air$Dest)             Destination cities of all flights

Lesson 3: Analyzing data with IBM InfoSphere BigInsights Big R


In this lesson, you analyze flight delay information.
Before you begin, complete the following lessons:
v Lesson 1: Connecting to InfoSphere BigInsights and uploading data
v Lesson 2: Exploring the structure of the data set with Big R

The fictional Sample Outdoor Company wants to partner with an airline that has
few delays.
1. Run the following code to attach the air data set to the R search path.
attach(air)

The attach() function makes referencing variables easier. For example,
length(air$UniqueCarrier) computes the number of flights. The attach(air)
function allows you to drop the air$ prefix. Now length(UniqueCarrier) can
compute the number of flights in the air data set.
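
For example, you can confirm the effect of attach(air) by running the following
two lines; both should return the same count of flights:
# Referencing the column with and without the air$ prefix
length(air$UniqueCarrier)   # explicit reference
length(UniqueCarrier)       # works after attach(air)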
2. To create a subset of the data set named airSubset that contains flight
information for flights that are delayed by 15 minutes or more, run the
following code:
airSubset <- air[Cancelled == 0 & (DepDelay >= 15 | ArrDelay >= 15),
c("UniqueCarrier", "Origin", "Dest", "DepDelay", "ArrDelay")]
3. To find the overall ratio of flights that were delayed, run the following code:
nrow(airSubset) / nrow(air)

Expected output:
[1] 0.2269586
4. To see how a fictional HA airline compares to flight delays overall, run the
following code:
nrow(airSubset[airSubset$UniqueCarrier == c("HA"),]) /
nrow(air[UniqueCarrier == "HA",])

Expected output:
[1] 0.07434944



The ratio of flight delays for the fictional HA airline is much smaller than the
ratio for all airlines as a whole.
5. Optional: Continue to analyze flight delay information by running some of the
following functions.

Option                                          Description
cor(air[, c("Distance", "DepDelay",             Correlation between columns
  "ArrDelay", "DepTime", "ArrTime",
  "ActualElapsedTime")])
max(air$DepDelay)                               Maximum departure delay (in minutes)
summary(mean(ArrDelay) + max(ArrDelay) ~ .,     Average and maximum arrival delay (in minutes)
  object = air)

Because the data analysis has shown that the fictional HA airline has fewer delays
than other airlines, the fictional HA airline is a strong candidate for the Sample
Outdoor Company.

Lesson 4: Visualizing big data with IBM InfoSphere BigInsights Big R


In this lesson, you create visualizations by using the data set to find trends in
flight delays.
Before you begin, complete the following prerequisites:
v Lesson 1: Connecting to InfoSphere BigInsights and uploading data
v Lesson 2: Exploring the structure of the data set with Big R
v Install the ggplot2 data visualization package in your R environment with the
following code:
install.packages("ggplot2")
v Install the makeR package:
1. Download the makeR data visualization package from the CRAN archives.

Quick Start Edition VM Users: If you are running the tutorial with the IBM
InfoSphere BigInsights Quick Start Edition VM image, save the makeR file in
the /home/biadmin/ directory.
2. Install the package from the downloaded file using the following R command:
install.packages("makeR_1.0.2.tar.gz", repos = NULL, type = "source")
3. Load the library using the following R command:
library(makeR)

The fictional Sample Outdoor Company wants to find the times and days with the
fewest delays.
1. Load the ggplot2 library by running the following command:
library(ggplot2)

With the ggplot2 library you can create bar charts and other types of plots.
2. Create a bar chart that shows the number of delays for each hour, by running
the following code:
a. Create a working copy of the air data frame by running the following code.
bf <- air



b. Add columns for the number of flights delayed and the departure times by
hour to the bf data frame by running the following code.
bf$FlightDelayed <- ifelse(bf$DepDelay >= 15, 1, 0)
bf$DepHour <- bf$DepTime / 100
c. Compute the distribution of the number of flights in each hour.
df <- summary(sum(FlightDelayed) ~ DepHour, object = bf)
d. Remove flight data that is missing information, and then calculate the
number of flights for each time interval. "NA" values in the data frame
represent values that are missing or not available.
df <- na.omit(df)
df <- df[order(df[,1]),]
colnames(df) <- c("DepHour", "FlightCount")
deps <- df[,2]
names(deps) <- as.vector(df[,1])
e. Label and create the bar chart.
barplot(deps, main="Airline Delays By Hour",
xlab="Hour", ylab="# of Flights Delayed")

Expected output: a bar chart titled "Airline Delays By Hour", with Hour on the
x-axis and # of Flights Delayed on the y-axis.

The chart shows delays increasing as the day progresses. This is probably due
to an increase in flights as the day progresses.
3. Create a bar chart that shows the number of flights for each hour, by running
the following script:
bigr.histogram(air$ArrTime, nbins=24) +
labs(title = "Flight Volume (Arrival) by Hour")

Expected output: a histogram titled "Flight Volume (Arrival) by Hour" with 24
bins of flight arrival times.



When you compare the delay chart with this flight volume chart, you see that
flights between 09:00-11:00 a.m. have a flight frequency similar to flights
between 12:00-06:00 p.m.; however, the morning flights appear to experience
fewer delays.
There seems to be a correlation between the number of flights in a day and the
number of delays. If the proportion of delays decreases with respect to the
number of flights, finding days with fewer flights might be useful.
4. Create a calendar heat map that shows the number of flights on each day, by
running the following code:
library(makeR)
bf <- air[air$Cancelled == 0,]
df <- summary(count(Month) ~ Year + Month + DayofMonth, object = bf)
df$DateStr <- paste(df$Year, df$Month, df$DayofMonth, sep="-");
df2011 <- df[df$Year == 2000 | df$Year == 2001 | df$Year == 2002,]
calendarHeat(df2011$DateStr, df2011[,4] * 100, varname="Flight Volume")

Expected output: a calendar heat map of flight volume by day for the years 2000,
2001, and 2002.

In the month of February, Saturdays seem to have fewer flights on average. The
Sample Outdoor Company might want to schedule February flights between
09:00-11:00 a.m. on Saturday.



Lesson 5: Creating a predictive model with IBM InfoSphere BigInsights
Big R
In this lesson, you create a decision-tree model and make predictions.
Before you begin, complete the following prerequisites:
v Lesson 1: Connecting to InfoSphere BigInsights and uploading data
v Lesson 2: Exploring the structure of the data set with Big R
v Install the DMwR package in your R environment with the following code:
install.packages("DMwR")
v Install the rpart package in your R environment with the following code:
install.packages("rpart")

Important: Make sure that R is installed on each node of the cluster (NameNode
and DataNode), and that rpart is installed on each node.

To communicate expected flight delay information to customers, you will create a
decision-tree model and then try to predict flight arrival delays for the fictional
HA and UA airlines.
1. Create a decision-tree by running the following code.
a. Create a subset of the data set named bf that contains flight information for
only HA and UA airlines, by running the following code:
bf <- air[air$UniqueCarrier %in% c("HA", "UA"),]
b. Split the data into a train and a test data set, and then test to see if the split
percentages are accurate, by running the following code:
splits <- bigr.sample(bf, c(0.7, 0.3))
class(splits)
train <- splits[[1]]
test <- splits[[2]]
nrow(train) / nrow(bf)
nrow(test) / nrow(bf)

When building models, you can use a subset of the data to train the model,
and use the remaining data to test and validate your model.

Expected output (output may vary):
[1] 0.70582
[1] 0.29418
c. Build a decision-tree model for the HA and UA airlines, and then view the
two models.
models <- groupApply(data = train,
groupingColumns=list(train$UniqueCarrier),
rfunction = function(df) {
library(rpart)
predcols <- c('ArrDelay', 'DepDelay', 'DepTime',
'CRSArrTime', 'Distance')
m <- rpart(ArrDelay ~ ., df[,predcols])
m
})
print(models)

The models will predict flight arrival delay using departure delay, departure
time, scheduled arrival time, and distance as predictor variables.

Expected output:



bigr.list
group1 status
1 HA OK
2 UA OK
d. Get the contents of the HA model and name it modelHA, then print the
contents of modelHA.
modelHA <- bigr.pull(models$HA)
print(modelHA)

Expected output:
n=196 (2 observations deleted due to missingness)

node), split, n, deviance, yval


* denotes terminal node

1) root 196 30254.750 -1.1071430


2) DepDelay< 5.5 178 11398.300 -3.5337080
4) DepDelay< -3.5 100 4077.790 -6.3900000
8) CRSArrTime< 713 7 1055.429 -15.2857100 *
9) CRSArrTime>=713 93 2426.731 -5.7204300 *
5) DepDelay>=-3.5 78 5458.718 0.1282051
10) Distance< 2582.5 69 2243.072 -1.1159420
20) DepDelay< 0.5 51 1244.510 -2.5686270 *
21) DepDelay>=0.5 18 586.000 3.0000000 *
11) Distance>=2582.5 9 2290.000 9.6666670 *
3) DepDelay>=5.5 18 7443.778 22.8888900 *
e. Create a decision-tree for HA airline.
library(DMwR)
prettyTree(modelHA)

Expected output: a plot of the decision tree for modelHA.

The decision tree uses the column DepDelay more than other columns, which
shows that there is a strong relationship between departure delay and arrival delay.
2. Use the models that you created in step 1 to make arrival delay predictions.



a. Score the data set. Create columns to compare departure delay, actual
arrival delay, and predicted arrival delay.
preds <- groupApply(test,
list(test$UniqueCarrier),
function(df, models) {
library(rpart)
carrier <- df$UniqueCarrier[1]
m <- bigr.pull(models[carrier])
data.frame(carrier, df$DepDelay, df$ArrDelay, predict(m, df))
},
signature=data.frame(carrier='Carrier', DepDelay=1.0,
ArrDelay=1.0, ArrDelayPred=1.0, stringsAsFactors=F),
models)
b. Show 20 predictions.
head(preds, 20)

Expected output:
carrier DepDelay ArrDelay ArrDelayPred
1 UA -5 -14 -3.353143
2 UA -5 -6 -3.353143
3 UA -5 -9 -3.353143
4 UA -2 -8 -3.353143
5 UA -2 -7 -3.353143
6 UA 25 15 20.429878
7 UA -3 -22 -3.353143
8 UA 3 31 6.742727
9 UA 11 -8 6.742727
10 UA -3 0 -3.353143
11 UA -2 3 -3.353143
12 UA 0 14 -3.353143
13 UA -2 -14 -3.353143
14 UA -4 -13 -3.353143
15 UA 1 -5 -3.353143
16 UA -5 2 -3.353143
17 UA -7 -7 -3.353143
18 UA -5 -20 -3.353143
19 UA 0 -38 -3.353143
20 UA 2 -9 -3.353143

Examining row six in the output, the predicted arrival delay is 20.429878;
however, the actual arrival delay is 15 minutes. As expected, there are
discrepancies between the predicted and actual results. It is important to
test the quality of your model to see where predictions are wrong or
different from the actual results.
3. Check the quality of your model.
a. Use the root mean squared deviation (RMSD) error metric.
rmsd <- sqrt(sum((preds$ArrDelay - preds$ArrDelayPred) ^ 2) / nrow(preds))
print(rmsd)

Expected output:
[1] 15.20358

The RMSD shows that the model has a high prediction error. To improve the
model, you can add more predictors, such as departure and arrival cities (a
sketch follows these steps).
b. Examine the rows where your model gave the worst predictions.
preds$error <- abs(preds$ArrDelay - preds$ArrDelayPred)
head(sort(preds, by=preds$error, decreasing=T))

Expected output:



carrier DepDelay ArrDelay ArrDelayPred error
1 UA 34 217 42.058116 174.9419
2 UA 179 346 180.755102 165.2449
3 UA 13 152 6.742727 145.2573
4 UA 0 141 -3.353143 144.3531
5 UA 9 128 6.742727 121.2573
6 UA 4 117 6.742727 110.2573

The error is very high for the model's worst predictions. The long delays are
probably due to plane maintenance and repair. The top errors might be
outliers, because the range from the largest error to the sixth largest error is
over 60 minutes.
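
As suggested above, one way to try to improve the model is to add more
predictors. The following sketch (not part of the tutorial's expected output)
rebuilds the per-carrier models with the Origin and Dest columns added as
predictors. It assumes the train data set and the groupApply pattern from step 1;
the extra character columns are converted to factors so that rpart can split on
them.

# Sketch: rebuild the per-carrier models with two extra predictors.
models2 <- groupApply(data = train,
    groupingColumns = list(train$UniqueCarrier),
    rfunction = function(df) {
        library(rpart)
        predcols <- c("ArrDelay", "DepDelay", "DepTime",
                      "CRSArrTime", "Distance", "Origin", "Dest")
        df <- df[, predcols]
        # Convert the character columns to factors for rpart
        df$Origin <- as.factor(df$Origin)
        df$Dest <- as.factor(df$Dest)
        rpart(ArrDelay ~ ., df)
    })
print(models2)

You can then score the test data set with these models, recompute the RMSD as
in step 3, and compare it with the value that you obtained earlier.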

Summary of analyzing data with IBM InfoSphere BigInsights Big R tutorial
This tutorial demonstrated some IBM InfoSphere BigInsights Big R functions and
some techniques for analyzing data.

Lessons learned

You now have a good understanding of the following tasks:


v How to use basic Big R functions
v How to format data
v How to create visualizations
v How to create predictive models
v How to test predictive models

Chapter 8. Tutorial: Creating an extractor to derive valuable
insights from text documents
Learn how to use IBM InfoSphere BigInsights Text Analytics, an information
extraction system, to extract information from IBM quarterly reports.

Using InfoSphere BigInsights Text Analytics you define programs written in
Annotation Query Language (AQL) to extract structured information from
unstructured and semi-structured documents. You can apply Text Analytics to big
data at rest in InfoSphere BigInsights and big data in motion in IBM InfoSphere
Streams.

By using text analytics tooling, you can develop, run, and publish extractors that
glean structured information from unstructured documents. The extracted
information can then be analyzed, aggregated, joined, filtered, and managed by
using other InfoSphere BigInsights tools.

In this tutorial, you will extract business information from a series of IBM
quarterly reports, such as the revenue for each IBM division. You can then use that
information in other tools, such as BigSheets, to understand and analyze trends,
and visualize the results in charts or graphs.

The Welcome page of the InfoSphere BigInsights Console includes information
about how to enable your Eclipse environment for developing Text Analytics. For
more information, click Help in the InfoSphere BigInsights Eclipse tools.

We will extract useful information from text documents by using a 5-step process.
The tasks that are associated with this process are supported in the Extraction
Tasks view in the Eclipse tools. This gives us a workflow that we can follow as we
build extractors. The following steps are included in these lessons:
1. Identify the collection of documents from which you want to extract
information.
2. Analyze the documents to identify examples of the information that you want
to extract.
3. Write AQL statements to extract the identified information.
4. Test and refine the AQL statements.
5. Export the final extractor and deploy to a runtime environment such as
InfoSphere BigInsights or IBM InfoSphere Streams.

Lessons 1, 2, and 3 will introduce you to the Text Analytics features and tooling.
These introductory lessons teach you how to use some basic AQL statements, and
how to manipulate the Text Analytics Workflow perspective and the Extraction
Plan. In the more advanced lessons (Lessons 4, 5, 6, and 7), you refine the AQL,
finalize the extractor, and export the Text Analytics Module (TAM) so that it is
ready to deploy to a runtime system.

Learning objectives

After you complete the lessons in this tutorial, you will understand the concepts
and know how to do the following actions:
v Navigate a Text Analytics project in Eclipse.



v Import documents into a project.
v Understand the Text Analytics development process.
v Use the tooling to write and test AQL statements.
v Export an extractor ready to be deployed to a runtime system.

Time required

Allow 30 minutes to complete the basic parts of this tutorial. Allow another 45
minutes to complete the more advanced lessons.

Lesson 1: Setting up your project


In this lesson, you explore the InfoSphere BigInsights Text Analytics Workflow
perspective in Eclipse, including the Extraction Task view and the Extraction Plan
view. You also create a Text Analytics project and set up the input data.
1. Make sure that you install Eclipse, and enable it to work with InfoSphere
BigInsights. See Installing the InfoSphere BigInsights Tools for Eclipse.
2. Download the sample data by following these steps:
a. Open the Welcome page of the InfoSphere BigInsights Console.
b. In the Quick Links section, click Download applications (Eclipse projects).
c. Select SampleTextAnalyticsProject_eclipse.zip. Click Download. Select to
save the file in your local system. Click Close to close the Map Reduce
Sample Applications window.
d. Navigate to the download location and extract the compressed file.

You will now create a text analytics project and import the documents.
1. From your desktop, start Eclipse. Click OK to use the default workspace. The
Task Launcher for Big Data opens.
2. Close the Help pane for now if it is visible. You can always get help by
pressing F1, or by selecting Help > Help Contents from the menu bar.
3. In the Task Launcher for Big Data, click the Develop tab, and click Create a
text extractor from the Tasks panel.
4. Create a project called TA_Training.
a. In the New BigInsights Project window, specify TA_Training as the project
name, then click Finish.
b. Click Yes in the message box to switch to the InfoSphere BigInsights
perspective. The Extraction Plan pane is usually visible on the right of the
window. It is the design pane for Text Analytics projects. The Extraction
Tasks pane is usually visible on the left of the window. It is the workflow
for Text Analytics projects. The actual location of the views might depend
on your Eclipse environment.

Learn more about adding views: If the Extraction Tasks view is not visible,
add that view. From the Eclipse menu click Window > Show view >
Extraction Tasks. You can follow the same steps for the Extraction Plan if it
is not visible.
5. Before you can start working with the sample documents, you must bring them
into Eclipse. Open the Project Explorer and expand the TA_Training project.
a. From the Eclipse menu bar select Window > Show view > Project Explorer.
b. Expand the project TA_Training and open the folder textAnalytics.
c. Access the input documents in one of the following ways:



Importing
1) Right-click the project TA_Training and click File > Import.
2) In the Select window, click General > File System. Click Next.
3) In the From directory field, click Browse. Locate and select the
ibmQuarterlyReports folder that you downloaded and extracted
at the beginning of this lesson. Click OK.
4) In the File System window, select the ibmQuarterlyReports folder to
access all of the files within it, then click Finish.
Dragging and dropping
1) In your local file system, navigate to the
SampleTextAnalyticsProject_eclipse/data/ folder that you
downloaded and extracted at the beginning of this lesson.
2) Open the folder and drag the ibmQuarterlyReports folder from
your local file system onto the textAnalytics folder in your
Eclipse project. Specify Copy files and folders and click OK in
the File and Folder Operation dialog.

Lesson 2: Selecting input documents and labeling examples


In this lesson, you analyze the input documents to identify examples of the
information to be extracted and to add examples to the Extraction Plan.

By labeling examples, you also start creating an extraction plan, which is a view of
the design of your extractor. In the extraction plan, you identify, organize, manage,
and navigate elements of the extractor.

As you create the extraction plan by labeling the spans of text (sometimes referred
to as snippets) of interest and their associated clues, you are developing an
understanding of the input documents. It is a good idea to work with someone
who is familiar with the documents during this part of the process.

Since you are interested in extracting revenue by division, you must search the
documents for spans of text that contain this information. As you find and label
examples, be aware of patterns and clues in the text that can help improve the
accuracy of the extractor.

An example that you might find is a phrase such as Revenues from Software were
$3.9 billion. If you labeled this example, you might notice that it has three
important features:
v The term "Software", which is a division name.
v The term “$3.9 billion”, which is a revenue amount.
v The term revenue.
You will use all of these features as context to identify instances of revenue by
division.

Labels are meaningful identifiers of the text that you want to extract. Labels also
serve to categorize various clues that help you develop an extractor. There are two
types of labels:
Top level or parent
A span of text that contains the information that you want to extract. An

example of a top level identifier is Revenues from Software were $3.9 billion,
which contains clues to a division and the revenue that is associated with
it.
Clues You decompose the top level identifiers into features and clues. Basic
features are usually parts of the top level or parent that you must extract.
A clue is typically supporting text that provides additional context. In our
example, we would consider the word revenue to be a clue and the division
names and revenue amount would be features.

The process of labeling the document is an iterative process. It helps if you can
work with a subject matter expert who can help you decide if you have identified
enough examples, features, and clues to reliably extract the required information. It
would be unusual to find the same information presented the same way across a
broad set of documents. More often than not, something causes things to change. It
might be a change in the business, a change in regulations or reporting
requirements, a change of writer or editor, a new template for the document, or
simply a change in writing style. Ideally your subject matter expert can alert you
to the changes and variations that you must cover with your text analytics code.
When you read some of the sample input documents, you will see that you have
two basic patterns to deal with: revenues for division were $x.x and division
revenues were $x.x. There are a number of additional variations in the
information around the basic features and clues, but only two basic patterns.
1. Before you start your analysis, you must set up the input documents in the
Extraction Tasks view.
a. Click the Extraction Tasks tab in the left pane of the Text Analytics
Workflow perspective.
b. Expand Step 1 of the Extraction Tasks wizard, Select Data Collection. Click
Browse Workspace and navigate to the ibmQuarterlyReports folder in your
project (TA_Training/textAnalytics/ibmQuarterlyReports). Select the
ibmQuarterlyReports folder, and click OK.
c. From the Language list, select en.
d. Select 4Q2006.txt in the Extraction Tasks wizard. Click Open.
2. Examine the text in the document you just opened by looking for examples that
report revenue by division.
3. Identify RevenueByDivision as the first clue in which you are interested.
a. Search the file until you see the phrase Revenues from the Software segment
were $5.6 billion. Highlight that phrase, right-click, and click Add example
with New Label.
b. In the Add New Label window, type RevenueByDivision in the Label Name
field and leave the Parent Label field blank to make RevenueByDivision the
top level label.
c. Click Finish.
4. Look again at the text from the 4Q2006.txt file. Search for the phrase Revenues
from the Systems and Technology Group (S&TG) segment totaled $7.1 billion. Add it as
another example.
a. Right-click that phrase and click Label Example As.
b. Select RevenueByDivision.
5. You have found two examples of the pattern revenues for division were
$x.x. Now, find an example that refers to the other pattern in which you were
interested. Search for and highlight Global Financing segment revenues
increased 3 percent (flat, adjusting for currency) in the fourth quarter
to $620 million. in the 4Q2006.txt file.



a. Right-click that phrase and click Label Example As.
b. Select RevenueByDivision.
6. If you look at the Extraction Plan view, you see the three examples that you
labeled. If you click any of the examples under the parent label, such as
Revenues from the Software segment were $5.6 billion, the text from which it
came is highlighted. These spans of text contain some useful clues for
extraction, such as revenues, division names such as Systems and Technology Group
(S&TG), and amounts such as $5.6 billion. Now you want to record clues from
these examples as additional labels in the Extraction Plan.

Learn more about the Extraction Plan: You can think of the Extraction Plan as
an interactive design view of your extractor. It helps you to identify, organize,
and navigate the elements that you want to extract. It also helps you write the
associated AQL statements, which makes the Extraction Plan a powerful part of
the design and development process.
a. In the 4Q2006.txt file, select the span of text Revenues from the Software
segment were $5.6 billion. Highlight and right-click the term Revenues, and
select Add Example with New Label.
b. In the Add New Label window, type revenues in the Label Name field.
Type RevenueByDivision as the parent label. Click Finish.
c. In the same span of text, find the phrase $5.6 billion. Right-click that phrase
and click Add example with New Label.
d. In the Add New Label window, type Money in the Label Name field. Type
RevenueByDivision as the parent label. You can also double-click the
RevenueByDivision parent label to use that name as the parent. Click Finish.
7. It is a good idea to decompose clues to the lowest level. In this way, you can let
the powerful text analytics engine and optimizer do more of the work, rather
than writing complex expressions in your code. This action of decomposing
clues can also give you a more robust and flexible solution. Money, which you
labeled in the previous step, is a good example. Money has three basic features:
a currency sign, followed by a number, followed by a quantifier such as million
or billion. Go ahead and create labels for these three features:
a. In the 4Q2006.txt file, find the span of text $5.6 billion which
was part of the original phrase in a previous step. You have already labeled
this phrase Money.
b. Right-click only the currency symbol, $, and click Add example with New
Label. Type Currency in the Label Name field. In the Parent Label field,
type Money.
c. Right-click 5.6 of the same phrase, and select Add example with New
Label and type Number in the Label Name field. In the Parent Label field,
type or select Money.
d. Right-click billion and select Add example with New Label and type
Quantifier in the Label Name field. In the Parent Label field, type or select
Money.
8. You would usually continue analyzing documents, labeling additional examples
and clues until you had seen enough to be confident that you understood the
features, clues, and patterns for which you will code. To save time with the
additional examples and clues that you should label, use Table 2 on page 102 as
a guide. Search the documents that are identified and add the labels, noting
which parent each label is a child of.
a. Open the document that is listed in the File column of the table in the
editor.

b. Press Ctrl+F to search for the string that is listed in the Search term column
of the table.
c. For each clue to add as a label, right-click the word or phrase and click Add
example with New Label. Specify the suggested label name in the Label
name column of the table, type the appropriate parent label name, and click
Finish. If you already added the label and want to add an example of the
label, click Label Example As.
d. Close the file.
Table 2. Additional clues to strengthen your extractor
File         Search term                            Label name (as a child to RevenueByDivision unless otherwise noted)
4Q2006.txt   $7.1 billion                           Money
4Q2006.txt   Systems and Technology Group (S&TG)    Division
4Q2006.txt   Global Technology Services®            Division
4Q2006.txt   million                                Quantifier, as a child to Money
4Q2007.txt   12.5                                   Number, as a child to Money
4Q2009.txt   27.2                                   Number, as a child to Money
4Q2010.txt   Revenue                                Metric
4Q2010.txt   $29.0 billion                          Money
4Q2010.txt   8.7                                    Number, as a child to Money
4Q2010.txt   5.3                                    Number, as a child to Money

Lesson 3: Writing and testing AQL


In this lesson, you write AQL statements to extract the basic low-level features that
you identified in the previous lesson.

Now you are going to write AQL statements to extract the basic features that you
identified during the document analysis process. You will see how you can use a
simple pattern to put the basic features in context to give you candidates. In
subsequent lessons, you use similar techniques to combine features to create
concepts, and expand your AQL to further consolidate and filter the results.

Extractors are written in the Annotation Query Language (AQL), which is the core
of text analytics in InfoSphere BigInsights and InfoSphere Streams. You code
custom extractors in AQL. Text Analytics also includes a library of pre-built
extractors and a sophisticated set of tools. The AQL language was designed by
using SQL-like expressions, which makes it familiar and easy to learn.

Learn more about writing AQL: An extractor is a program that is written in AQL
that extracts structured information from unstructured or semistructured text. AQL
is a declarative language, with a syntax that is similar to that of the Structured
Query Language (SQL). For more information about writing AQL, see the AQL
Reference.

If you look at the labels that you created in the Extraction Plan, you see that the
lowest level basic features that you labeled are the three elements of Money: the
currency symbol, a number, and a quantifier. You are now going to write AQL
statements to extract those elements by using simple extract syntax with
dictionaries and regular expressions. As you will see, AQL allows you to create
views by using extract and select statements. Views, extract statements, and select
statements are the three fundamental elements of AQL. So, it is worth repeating:
by using AQL statements, your data is managed through views, and views are
created by using extract and select statements. In addition, your input data set is
referenced as a view called Document, and its contents are referenced as a column
called text.
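
Before you work through the steps, the following minimal sketch (not part of the
extractor that you build in this tutorial; the module, dictionary, and view names
are made up for illustration) shows those elements together: a dictionary, a view
that is created with an extract statement over the Document view, a view that is
created with a select statement, and the output view statements that materialize
the results.

module ExampleModule;

-- A dictionary with one inline entry (illustrative only).
create dictionary SampleDict
as ('revenue');

-- Extract every dictionary match from the input documents.
create view SampleMatches as
extract dictionary 'SampleDict' on D.text as match
from Document D;

-- Select from an existing view.
create view SampleCopy as
select M.match as match
from SampleMatches M;

output view SampleMatches;
output view SampleCopy;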
1. You will now create views that use extract statements. You create one view for
each of the three basic features of Money.
a. In the extraction plan, right-click the Currency label that you created in the
previous lesson.
b. From the menu, select New AQL Statement > Basic Feature AQL
Statement.
c. In the Create AQL Statement dialog, in the View Name field, specify
Currency.
d. In the AQL Module field, select RevenuebyDivision_BasicFeatures.
e. In the AQL script, specify RevenueBasic.aql for the name of the AQL
script that you will be writing.
f. In the Type field, select Dictionary.
g. Select the Output view check box.
h. Click OK.
2. The RevenueBasic AQL file opens in the editor. The file is populated with
templates to create a dictionary and a view.

Learn more about views: Views are the primary data structures that are used
with AQL statements. AQL statements create views by selecting, extracting,
and transforming information from other views. AQL views are like the views
in a relational database. They have rows and columns just like a database
view. However, AQL views are not materialized by default. In other words,
the result of the views is not viewable output. To see your output, you must
include an output view statement. You reference input data as a view called
Document with one column called text. Think of each document in your input
data set as one row in the Document view with the document content
mapped onto the text column.
3. Complete the AQL template to create the dictionary and the view.
a. In the create dictionary line, type or copy the following code to replace
the dictionary template:
create dictionary CurrencyDict
as ('$');

Make sure that you delete the lines that begin with from file and with
language.

Learn more about dictionaries:

To extract elements from text, you can use regular expressions and
dictionaries. When you want to match text that is based on a pattern, you
use a regular expression. When you can match on defined words, use a
dictionary.

AQL dictionaries are more efficient than regular expressions, so it is a


good idea to use dictionaries whenever possible, even in cases where there
is just a single entry. You usually create dictionaries with more entries that

are stored in an external file, which makes it easier to add and change
entries without having to open up the code.
End each AQL statement with a semi-colon. You are changing the
dictionary declaration to be a simple inline declaration. In the example,
when the statement is run, the string is the entry in the CurrencyDict
dictionary.

Learn more about another way to add terms to a dictionary: Instead of


typing each entry manually, you can use the features of the Extraction Plan
view to add terms into a dictionary file:
1) In the Extraction Plan, expand the top-level label, RevenueByDivision,
and expand Labels. Inside that label, expand the Currency label.
2) Click Examples to open that folder. You see the clues for Currency that
you labeled in the previous lesson.
3) Select all of the entries in the Examples folder, and right-click, and
select Add to dictionary.
4) In the Select Dictionary window, click Browse Workspace.
5) In the Select a file window, select the src/
RevenuebyDivision_BasicFeatures folder in the TA_Training project,
and click Create Dictionary. The NewDictionary.dict file is created.
6) Click OK. Then, click OK to close the Select Dictionary window. The
terms are now added into a dictionary file that you can use in an
extract statement.
7) Save the file.
8) You can rename the file by selecting NewDictionary.dict in the Project
Explorer, and clicking F2. In the Rename Resource window, type a new
name for the file.

By using a dictionary file instead of inline terms, you can more easily
modify terms without modifying the code. The create dictionary
statement would change as follows:
create dictionary CurrencyDict
from file 'NewDictionary.dict';
b. In the create view template, replace the template with the following code:
create view Currency as
extract
dictionary 'CurrencyDict' on R.text as match
from Document R;
The create view statement uses an extract statement that finds all matches
of terms in the dictionary that you created. The dictionary matches are
stored in a column named match.
4. Do not change the output view line.
The output view statement materializes the view. By default, views are not
materialized. They are also likely to be removed when you optimize for better
performance. But, during development, you are likely to want to look at the
contents of intermediate views like this one for debugging purposes, then
later you can comment out the output view statements that are not required.
5. Click File > Save from the menu to save your changes. Verify that your AQL
looks like the following code:
module RevenueByDivision_BasicFeatures;

create dictionary CurrencyDict
as ('$');

create view Currency as
extract
dictionary 'CurrencyDict' on R.text as match
from Document R;

output view Currency;


6. You are now ready to test the extractor. You run the AQL queries and then
view the results. There are three primary options to run the extractor. You can
run against all of the documents, on selected documents only, or on
documents with labeled information only. In the Extraction Plan, right-click
RevenueByDivision and click Run > Run the extraction plan on the entire
document collection.
7. When the run is complete the results are shown in the Annotation Explorer.
The Annotation Explorer shows each extracted field in the Span Attribute
Value column. You can also see the text from the left and right of the
extracted text, which is known as the left and right context. Double-click one
of the rows to see the extracted text in the original document in the edit pane.
The Span Attribute Value column in the middle of the Annotation Explorer
shows the basic features that are picked up by the extractor. The output that
shows in the Annotation Explorer is by view name.
8. After you complete the code for the Currency view, right-click the Currency
label in the Extraction Plan and select Mark Completed. The label icon
changes to a check mark. This marker is a visual reminder that you have
created a view for this label. This process of checking as you progress through
the Extraction Plan is part of the workflow of text analytics. You are building
up from the simpler labels by creating views that extract the information you
need.
9. Enhance the RevenueBasic.aql file by adding views for the additional basic
features of Money. Use the Number and Quantifier labels in new views
within the same module. For each view that you create, begin by
right-clicking the label that corresponds to the view that you want to create in
the Extraction Plan. Then, from the menu, select New AQL Statement > Basic
Feature AQL Statement.

create view Number
View Name: Number
AQL Module: RevenuebyDivision_BasicFeatures
AQL script: RevenueBasic.aql
Type: Regular expression
Output view: Enabled

create view Quantifier
View Name: Quantifier
AQL Module: RevenuebyDivision_BasicFeatures
AQL script: RevenueBasic.aql
Type: Dictionary
Output view: Enabled

10. Modify the RevenueBasic.aql file to correct the two templates that were
added.
a. Update the Number view to look like the following code:
create view Number as
extract regex /\d+(\.\d+)?/
on R.text as match
from Document R;

output view Number;

Learn more about another way to add regular expressions: Instead of


typing the regular expression manually, you can use the features of the
Extraction Plan view to add an expression into your statement:
1) In the Extraction Plan, expand the top-level label, RevenueByDivision,
and expand Labels. Inside that label, expand the Number label.
2) Click Examples to open that folder. You see the clues for numbers that
you labeled in the previous lesson.
3) Select all of the entries in the Examples folder, and right-click, and
select Generate Regular Expression.
4) In the Regular Expression Generator window, the samples you selected
are already loaded in the Samples pane. Click Generate regular
expression.
5) You might get several suggestions, but in this case, there is one
suggestion that is based on the samples:
(\d{1,2})?(\.)?\d
6) Click Next. In the Regular Expression Generator window, you can
refine the expression. You might find that because of the clues that you
labeled, the generated expression is more or less restrictive than you
want. Experiment with the options, and when you are satisfied, click
Finish.
7) A confirmation window shows the expression that was generated. The
expression is placed on the clipboard. Click OK.
8) Navigate to the statement that begins extract regex in the
RevenueBasic.aql file, and right-click, and select Paste to add the
generated expression to your code.
b. Update the Quantifier view by first using the Extraction Plan menu to
create a dictionary file, instead of creating an inline dictionary.
1) In the Extraction Plan, expand the Quantifier label and then expand
Examples. Highlight all of the entries in the Examples folder.



2) Right-click and select Add to Dictionary.
3) In the Select Dictionary window, click Browse Workspace.
4) Select the /textAnalytics/src/RevenuebyDivision_BasicFeatures, and
click Create Dictionary.
5) A NewDictionary.dict file is created. Click OK.
6) In the Select Dictionary, click OK. The clues that you labeled as
Quantifier entries are now in the dictionary file. There should be at
least two entries from the labeling that you did in the previous lesson:
million and billion. Save the file.
7) Open the Project Explorer view, and right-click the dictionary file. Click
Rename. In the Rename Resource window, type Quantifier.dict in
the New Name field. Click OK.
8) In the RevenueBasic.aql file, in the create dictionary template for
QuantifierDict, replace the code <path to your dictionary here>
with the name of the dictionary that you just created:
from file 'Quantifier.dict'

If you put the dictionary file in a location outside of the module folder,
then you must include the path relative to the project name.
Complete the create view statement by pointing to the dictionary that you
just created, and ensure that the view is case insensitive. Adding the
IgnoreCase parameter ensures that the terms million and Million are both
found. The create dictionary and the create view statements should
look like the following code:
create dictionary QuantifierDict
from file 'Quantifier.dict'
with language as 'en';

create view Quantifier as
extract dictionary 'QuantifierDict' with flags 'IgnoreCase'
on R.text as match
from Document R;

output view Quantifier;


c. Click File > Save.
11. Run the extractor.
a. In the Extraction Plan, right-click RevenueByDivision and click Run >
Run the extraction plan on the entire document collection.
b. View the results in the Annotation Explorer.
When the run completes successfully, the three views that you materialized
with output view statements are displayed in the Annotation Explorer. Select the view that you
want to see in the result table view from the list in the header of the
Annotation Explorer.

12. To mark the Number and Quantifier labels as complete in the extraction plan,
right-click each label (Number and Quantifier) in the Extraction Plan and
select Mark Completed.
13. You are now going to extract instances where these three basic features occur
together, which gives you Money. You will do that extraction by using a
pattern to extract candidates for revenue. In the Create AQL Statement dialog,
complete the fields necessary to create a view:
a. Right-click the Money label, select New AQL Statement > Candidate
Generation AQL statement. AQL is modular, which means that you can
organize your statements into modules that can be packaged and reused.
One way to modularize your code is by the type of AQL statement. By
using this design, you would package all basic feature statements in one
module and all candidate generation statements in another module. The
Text Analytics tooling creates default modules to support this
type of modularization. But since the extractor that you are building in
this tutorial is simple, you will package all of your statements into the
RevenuebyDivision_BasicFeatures module.

Learn more about AQL modules: For more information about modules,
see AQL modules
b. Type Money in the View Name field.
c. In the AQL Module field, make sure to specify
RevenuebyDivision_BasicFeatures as the module name.
d. In the AQL script field, type or select RevenueBasic.
e. Specify Pattern in the Type field.
f. Specify the Output view check box.
g. Click OK.
14. You are going to use the Currency, Number and Quantifier views in this new
view, and you will reference those views by assigning the variables C, N, and
Q to the Currency, Number, and Quantifier views in the FROM clause. The
pattern specification looks for the currency symbol, followed by a number,
followed by a unit. As a result, the view contains the following code:
create view Money as
extract pattern <C.match> <N.match> <Q.match>
return group 0 as match
from Currency C, Number N, Quantifier Q;

output view Money;

Learn more about patterns: For more information about patterns in AQL, see
Sequence patterns
15. Save the file and run the extractor in the usual way.
a. In the Extraction Plan, right-click RevenueByDivision and click Run >
Run the extraction plan on the entire document collection.
b. View the results in the Annotation Explorer.
You see the Money view, with sequential occurrences of a currency sign,
followed by a number, followed by a unit. You extracted entities by using a
pattern over the input document and the existing annotations. The Money
view returned 333 rows.
16. To mark the Money label as complete in the extraction plan, right-click the
label Money in the Extraction Plan and select Mark Completed.
17. Optional: From the Annotation Explorer, you can export the extracted views
as HTML or CSV files, and you can highlight any of the extracted entities in
the annotated document view and get the drilldown of the views to which
they belong.
a. Click the Export Results icon in the Annotation Explorer.
b. In the Export Results dialog, in the Path for the exported results field,
type the name of a valid directory, or click Browse File System to
designate a target output location, and click Finish.
In the target directory, a CSV folder and an HTML folder are created. The CSV
folder contains a <view name>.csv file for each view. The HTML folder
contains a <view name>.html file for each view. That file also contains the
input document.
You can upload these simple CSV files to an IBM InfoSphere BigInsights
server and use them in another component of InfoSphere BigInsights.
If you choose to continue with the advanced lessons, you will learn to create
and publish a deployable extractor that can be used by anyone with access to
your server.

Summary - the basic lessons


At this point, you have successfully extracted three basic features (currency,
number, and quantifier) by using dictionaries and a regular expression. Then, you
extracted instances of money by using a pattern to put the basic features in
context. In the next lesson, you will use this same technique to extract candidates
of revenue by division. You will do that by putting instances of money into
context with instances of the basic features of revenue and division. You can use
this pattern in many situations.

If you decide not to continue with the more advanced lessons, you have learned
how to extract basic features and how to use a pattern to extract candidates. In
these first three lessons, you extracted the basic features of money: a currency
symbol, a number, and a quantifier. You used dictionaries, regular expressions, and
patterns. You created and output views, ran the extractor, and examined the
output in the Annotation Explorer.

With these lessons, you were introduced to the fundamentals of Text Analytics and
some key AQL statements. You have also successfully used the tools to identify
instances of Money in IBM quarterly reports.

Lesson 4: Writing and testing AQL for candidates


In this lesson, you write AQL queries that add context to the basic features.

You will build on the basic features that you defined in previous lessons to extract
revenue by division that is based on the two patterns that you identified during
your initial analysis: revenues for division were $x.x and division revenues
were $x.x.

So far, you successfully extracted all instances of Money. Now you will extract the
basic features of revenue and division.

To generate candidates, you use the extract pattern statement, and build on the
code that you created in the previous lessons.
1. In the previous lesson you extracted Money. Now, you need to extract
instances of revenue and divisions. You extract these basic features by using
dictionaries:
a. Right-click the revenues label and click New AQL Statement. Select Basic
Features AQL statement.
b. Type Revenue in the View Name field.
c. In the AQL Module field, make sure to specify
RevenuebyDivision_BasicFeatures as the module name.
d. In the AQL script field, type RevenueBasic.
e. Specify Dictionary in the Type field.
f. Specify the Output view check box.
g. Click OK.
Copy or type the following code to replace the template:
create dictionary RevenueDict
as ('revenues', 'revenue');

create view Revenue as
extract dictionary 'RevenueDict'
with flags 'IgnoreCase'
on R.text as match
from Document R;

output view Revenue;


2. Run the extractor in the usual way. The output from the Revenue view
should be limited to those spans of information that contain the terms revenue
or revenues, regardless of case.
3. Next, you want to use a dictionary to extract division names:
a. Right-click the Division label and click New AQL Statement. Select Basic
Features AQL statement.
b. Type Division in the View Name field.
c. In the AQL Module field, make sure to specify
RevenuebyDivision_BasicFeatures as the module name.
d. In the AQL script field, type RevenueBasic.
e. Specify Dictionary in the Type field.
f. Specify the Output view check box.
g. Click OK.
4. Copy or type the following code to complete the Division view:
create dictionary DivisionDict
as ('Global Technology Services','Systems and Technology',
'S&TG','Software','Global Financing','Global Business Services');

create view Division as
extract dictionary 'DivisionDict'
on R.text as match
from Document R;

output view Division;


5. Save the file and run the extractor in the usual way. The output from the
Division view should contain references to the division names in the inline
dictionary. This view contains 139 rows.
6. Notice in the Annotation Explorer view that in the Division view, the terms
software and global financing are being picked up incorrectly as division names.
Since these terms are in lowercase, the chances are good that they do not
represent division names. This problem can be fixed by modifying the create
dictionary statement to use exact case matching (the with case exact clause) to
ensure that the text string matches the dictionary entry exactly, including case.
Modify the create dictionary statement for Division in the RevenueBasic.aql
script so that it looks like the following code:
...
create dictionary DivisionDict with case exact as
('Global Services','Global Technology Services',
'S&TG segment','Software','Global Financing',
'Systems and Technology Group');

...
7. Save the file, and run the extractor in the usual way. In the Annotation
Explorer, the division names now look correct. There are now 95 rows
returned.
8. Mark the labels revenues and Division as complete.
9. You have now extracted the three key basic features: money, revenue, and
division. The next step is to extract candidates that match the two patterns
that you identified earlier.
10. You will use patterns in your code to put the information from the three
views Money, Revenue, and Division in context. If you remember, in “Lesson 2:
Selecting input documents and labeling examples” on page 99, part of your
goal was to find both of the following patterns: revenues for division were
$x.x and division revenues were $x.x. The first pattern looks for examples
where the word revenue is followed by a division name and then a money
amount, with some number of tokens in between each basic feature. For
example, Revenues from the System and Technology Group (S&TG) segment
totaled $7.1 billion
extract pattern <R.match><Token>{1,2}<D.match><Token>{1,20}<M.match>

The second pattern looks for examples where a division name is followed by
the word revenue and a money amount, with some number of tokens in
between each basic feature. For example, Global Financing segment revenues
increased 3 percent (flat, adjusting for currency) in the fourth
quarter to $620 million.
extract pattern <D.match><Token>{1,3}<R.match><Token>{1,30}<M.match>

After you have matched both patterns and have a full set of candidates, you
can union them together into a single view.
a. Right-click the RevenueByDivision label and click New AQL Statement.
Select Candidate Generation AQL statement. Complete the Create AQL
Statement dialog with the following information:
View name
RevenueAndDivision
Module name
RevenuebyDivision_BasicFeatures
Script name
RevenueCandidate.aql

Note: You will be using a new script to contain your candidate
views, but you can continue to use the same module for all scripts.
Type Pattern
Output view
Specify the Output view check box.
b. Click OK.
c. Copy or type the following code to replace the template:
create view RevenueAndDivision as
extract pattern
<R.match> <Token>{1,2}
(<D.match>) <Token>{1,20} (<M.match>)
return group 0 as match
and group 1 as Division and group 2 as Amount
from Revenue R, Division D, Money M;

output view RevenueAndDivision;


d. Save the file and run the extractor in the usual way.
e. Create the view for the second pattern. Right-click the RevenueByDivision
label and click New AQL Statement. Select Candidate Generation AQL
statement. Complete the Create AQL Statement dialog with the following
information:
View name
DivisionAndRevenue
Module name
RevenuebyDivision_BasicFeatures
Script name
RevenueCandidate.aql
Type Pattern
Output view
Specify the Output view check box.
f. Click OK.
g. Copy or type the following code to replace the template:
create view DivisionAndRevenue as extract pattern
(<D.match>) <Token>{1,3}
<R.match> <Token>{1,30} (<M.match>)
return group 0 as match and group 1 as Division
and group 2 as Amount
from Revenue R, Division D, Money M;

output view DivisionAndRevenue;


When you find all three clues (revenue, division, and money) in close
proximity, there is a good chance that you have found your goal, which is
revenue by division. The Token keyword and the minimum and maximum
arguments limit the gaps between the revenue feature, the division feature,
and the money feature. The specific words in the gaps are not important
for the purposes of this lesson. But you do need to limit the number of
tokens in the gap to make sure that the revenue feature, the division
feature, and the money feature are in close proximity.
h. Save the file, and run the extractor.
11. Now you are ready to see the results of the combination of the two patterns:
a. Right-click the RevenueByDivision label and click New AQL Statement.
Select Candidate Generation AQL statement. Complete the Create AQL
Statement dialog with the following information:
View name
AllRevenueByDivision
Module name
RevenuebyDivision_BasicFeatures
Script name
RevenueCandidate.aql
Type Union all
Output view
Specify the Output view check box.
b. Click OK.
Copy or type the following code to replace the template:
create view AllRevenueByDivision as
(select DR.* from DivisionAndRevenue DR)
union all (select RD.* from RevenueAndDivision RD);

output view AllRevenueByDivision;


12. Click File > Save.
13. Run the extractor. In the result, you see mentions of division names and their
revenues. The next step in the development of this extractor is to finalize the
output, such as removing duplicates and unnecessary numbers.

Learn more about the value of the AQL templates: The AQL templates
reduce the need to look up syntax, retype the same expressions multiple
times, and debug spelling mistakes.
14. As you finalize the extractor, you no longer need the intermediate views. If
users of your AQL module need to materialize any of your exported views by
using output view statements, they can do so in their own code. You can
comment out the output view statements for the intermediate views so that the
optimizer knows that they do not need to be materialized. From the Project Explorer, edit the
RevenueBasic.aql and the RevenueCandidate.aql files and comment out the
output view statements:
a. From the Project Explorer, find the RevenueBasic.aql file and open it.
b. Add two dashes before the words output view for each of those
statements. This comments out the entire line so that it is not compiled
(a minimal example follows these steps).
c. Click File > Save.
d. From the Project Explorer, find the RevenueCandidate.aql file and repeat
the process of adding comments in front of the output view statements.
e. Click File > Save.
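As a minimal illustration only (the view names in your own files will differ),
a commented-out output view statement looks like the following; AQL treats
everything after the two dashes as a comment:
-- output view Number;
-- output view Quantifier;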

Learn more about extending your extractor with pre-built extractors: To
see examples of extending your extractor with the pre-built extractors, see
Pre-built extractor libraries. You can use the pre-built extractor libraries to
enhance your custom extractors. For example, to use a view that is
exported by the pre-built extractors, do the following steps:
1) Inside your InfoSphere BigInsights Eclipse environment, on a file
system that is connected to the InfoSphere BigInsights cluster,
right-click the TA_Training project, and select Properties.
2) In the Properties for TA_Training window, click BigInsights > Text
Analytics.
3) On the General tab, click Browse File System to specify your extractor
libraries.
4) Select the path to $TEXTANALYTICS_HOME/data/tam/ to select
BigInsightsWesternNERMultilingual.jar. To find the
$TEXTANALYTICS_HOME path, from your local file system that is
associated with your InfoSphere BigInsights cluster, type echo
$TEXTANALYTICS_HOME.
5) Click OK.
6) After you specify the pre-built extractor libraries, you can extend your
Revenue extractor by using one of the Named entity views, such as
Organization. Include the following statement at the top of the
RevenueCandidate.aql script, immediately after the module declaration:
import view Organization
from module BigInsightsExtractorsExport
as Organization;
The Organization extractor identifies mentions of organization names.
After importing the view, add this view to your
RevenueCandidate.aql script:
create view myOrg as
select GetText(R.organization) as TheOrg
from Organization R;

output view myOrg;

The result shows you all of the organizations that are mentioned in the
input text.

Lesson 5: Writing and testing final AQL


In this lesson, you further refine the AQL to deliver the required results.
1. The next step in AQL development is to consolidate and filter the candidates
that you just extracted. The first part of consolidating the output is to get rid of
duplicate information. From the Extraction Plan, right-click the
RevenueByDivision label. Click New AQL Statement > Filter and Consolidate
AQL Statement.
2. Complete the Create AQL Statement dialog to create a view:
a. In the View Name field, type RevenuePerDivision.
b. In the AQL Module field, make sure that the name is
RevenuebyDivision_BasicFeatures.
c. In the AQL script field, type RevenueFilter as the new AQL file name.
d. In the Type field, specify Consolidate.
e. Enable the Output view check box.
f. Click OK. The AQL file opens in the editor pane.
3. Type or copy the following code to replace the template:
create view RevenuePerDivision as
select R.* from AllRevenueByDivision R
consolidate on R.match using 'NotContainedWithin';

output view RevenuePerDivision;

In this code, you are consolidating the output from the view
AllRevenueByDivision to remove duplicate entries.

Learn more about consolidation: You use consolidation strategies to refine
candidate results by removing invalid annotations and by resolving overlap between
annotations. The consolidate on clause specifies how overlapping spans are
resolved across tuples that are output by a select or extract statement.
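For illustration only, and not a step in this tutorial: the same select
statement could instead use the default ContainedWithin policy, which
discards any match that is wholly contained within another match and so
keeps the longer span:
-- Illustration only: keep the containing (longer) span when matches overlap
create view RevenuePerDivisionLongest as
select R.* from AllRevenueByDivision R
consolidate on R.match using 'ContainedWithin';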
4. Click File > Save.
5. From the Extraction Plan, right-click the parent label, and click Run the
extraction plan on the entire document collection.
The output from this view shows some results for an entire year, which is
duplicating the quarterly results. Now you need to filter by using a
select statement with a predicate.
6. Create a view that filters out amounts that are not relevant for the quarterly
numbers.
a. From the Extraction Plan, right-click the RevenueByDivision label. Click
New AQL Statement > Filter and Consolidate AQL Statement.
b. In the View Name field, type RevenueByDivision.
c. In the AQL Module field, make sure that the name is
RevenueByDivision_BasicFeatures.
d. In the AQL script field, select RevenueFilter as file name.
e. In the Type field, specify Predicate-based filter.
f. Enable the Output view check box.
g. Click OK. The AQL file opens in the editor pane.
Type or copy the following code to replace the template:
create view RevenueByDivision as
select R.* from RevenuePerDivision R
where Not(ContainsRegex(/Full-Year \d{4} Results/,
LeftContextTok(R.Amount,1200)));
output view RevenueByDivision;
7. Save the file and run the extractor as usual.

The output should show 25 rows. This view contains exactly the information that
you need for further analysis. When you apply text analytics to more complex
documents, and when you are extracting more sophisticated information, you
would expect to spend time improving the precision and recall of your extractor.
You can also profile your extractor to understand and improve its performance
characteristics. There are utilities in the Text Analytics Workflow perspective to
help with both of these tasks.

Learn more about some of the Text Analytics utilities: In the InfoSphere
BigInsights Eclipse Text Analytics Workflow perspective, you can find help with
several of the Text Analytics utilities. The following is a list of some of the utilities
that you might want to explore in the Help contents:
Annotation Difference Viewer
Displays a side-by-side comparison of the extracted results from the same
input file. You can use the Annotation Difference Viewer to understand
how modifying the AQL statements in an extractor affects the results. Also,
you can use the Annotation Difference Viewer to understand how the
extracted results compare with a labeled data collection.
Provenance View
Displays the results from viewing the lineage of analysis results and is
useful for understanding the results of an extractor. It explains in detail the
provenance, or lineage of an output tuple, that is, how that output tuple is
generated by the extractor. You access the Provenance View through the
Result Table View.
Profiler View
Helps you to troubleshoot performance problems in the AQL code. The
Profiler also calculates the throughput of the extractor (in KB/seconds) by
dividing the size of the data that was processed by the total duration of
the Profiler execution.
Pattern Discovery View
Displays results from discovering patterns in text input. Pattern discovery
identifies contextual clues from documents in the data collection that help
you refine the accuracy and coverage of an extractor.
Explain Module View
Displays the metadata of the module and the compiled form of the
extractor.

Lesson 6: Finalizing and exporting the extractor


In this lesson, you finalize and export your extractor.

You now want to export the AQL module to make it available.


1. Before you export, in the RevenueFilter.aql file, comment out the output view
statement in the RevenuePerDivision view. Keep the output view statement in
the RevenueByDivision view. You now have one view that you want to
materialize. Save the file.
2. Export the module.
a. In Project Explorer, right-click the project TA_Training, and click Export.
b. In the Export dialog, expand BigInsights.
c. Click Export Text Analytics Extractor and click Next.
d. In the Modules to be exported list, select the module
RevenueByDivision_BasicFeatures.
e. Select the Export dependent modules check box.
f. Click Browse File System and select a destination for the export.
g. Specify Export to a jar or zip archive under the destination directory. Type
RevenueByDivision.zip in the File Name field.
h. Click Finish.
3. If the export process was successful, you see a confirmation message. Click OK
to close the message.

You can now use this extractor in new modules that you create in the Eclipse
environment. You can also publish and then deploy the extractor to the InfoSphere
BigInsights Console, as you will learn in the next lesson.

Lesson 7: Publishing the AQL module


In this lesson, you will learn to publish your extractor to the InfoSphere
BigInsights Console so that it can be deployed and run by anyone with access to
that server.

Complete “Lesson 6: Finalizing and exporting the extractor,” and have access to an
IBM InfoSphere BigInsights server.
1. From the Text Analytics Workflow perspective, open the Project Explorer.
2. Right-click the TA_Training project.
3. Click BigInsights Application Publish.
4. In the BigInsights Application Publish wizard, complete the workflow
information:
a. In the Location page, specify an IBM InfoSphere BigInsights server as the
destination of the extractor. If you did not register a server, you can click
Create to create a connection. Click Next.
b. In the Application page, specify Create New Application.
c. In the Name field, type a unique application name, such as TA_TrainingV1.
d. Optional: In the Description field, enter text that you want the users of
your application to see. This text might be instructions or hints on how to
start.
e. Optional: Select an icon that you want to associate with your application. A
default application icon is used if you select nothing.
f. In the Categories field, type a keyword by which you can identify this
application. For the purposes of this tutorial, type extractors.
g. Click Next.
h. On the Type page, specify Text Analytics as the application type. Click
Next.
i. In the Text Analytics page, select the AQL module to publish, and the output
views to use. For the purposes of this tutorial, select
RevenueByDivision_BasicFeatures in the Module field. Select
RevenueByDivision_BasicFeatures.RevenueByDivision in the Output
Views field.
j. In the BigSheets page, specify the plug-in definitions. Accept the defaults,
and click Next.
k. In the Parameters page, locate any valid parameters that will be used in the
InfoSphere BigInsights Console, and add values if they are needed as
defaults. For the purposes of this tutorial, accept the defaults. Click Next.
l. In the Publish page, click Add to select the TA_Training project.
m. Click Finish.

The application is placed in the IBM InfoSphere BigInsights server. Open the
InfoSphere BigInsights Console and open the Applications tab. Click Manage to
find your application.

Summary of creating your first Text Analytics application


In this module, you analyzed text from the IBM quarterly reports in various
input files.

These steps summarize what you did to complete your extractor.


1. You first identified the collection of documents from which you wanted to
extract information.
2. You analyzed the documents to identify examples of the information that
you wanted to extract.
3. You wrote AQL statements to extract the identified information.
4. You tested and refined the AQL statements.
5. Then, you exported the final extractor and published the extractor to the
InfoSphere BigInsights Console.

Lessons learned

You now have a good understanding of the following tasks:


v How to create a Text Analytics project in Eclipse.
v How to use the Text Analytics development process and the supporting tools.
v How to analyze text documents to populate an extraction plan by identifying
interesting text and clues.
v How to create and test AQL scripts to extract candidates.
v How to create AQL statements to filter the candidates to extract useful insights.

Additional resources

There are articles on IBM developerWorks that give you further information about
Text Analytics.
v Analyzing social media and structured data with InfoSphere BigInsights
v Analyze text from social media sites with InfoSphere BigInsights: Use
Eclipse-based tools to create, test, and publish text extractors

Chapter 9. Tutorial: Identifying and analyzing errors in
machine data
Learn how to use IBM Accelerator for Machine Data Analytics to import, extract,
index, search, and analyze your machine data files.

IBM Accelerator for Machine Data Analytics is a component of IBM InfoSphere
BigInsights and consists of a set of prepackaged applications. You can use these
applications to import your machine data, prepare it for analysis, search it for
keywords, and then analyze it for patterns and significance.

In this tutorial, you download and run sample machine data from the servers that
host the IBM Watson website. This Watson data represents a controlled data set
and is composed of raw web access log files that contain HTTP requests to the
Watson site. Each log record contains information about the request, for example
the date and time of the request, the requested page, and the result of the request.

This tutorial guides you through a typical use case of downloading, importing, and
extracting raw sample data files and then indexing, searching, and analyzing those
files to understand patterns of errors and events. You can apply the same process
of importing, preparing, and analyzing to your own machine data.

Learning objectives

After completing the lessons in this module, you will understand the concepts and
processes associated with:
v Identifying and preparing your data
v Extracting meaningful information from your data
v Indexing and searching your data
v Understanding the impact of events (for example, time periods or outages) on
errors
v Viewing the results of your analysis in workbooks, charts, and dashboards.

Time required

This module takes approximately 75 minutes to complete.

Prerequisites

You must install IBM InfoSphere BigInsights and IBM Accelerator for Machine
Data Analytics.

Lesson 1: Downloading the sample data


In this lesson, you download and prepare the sample data that you will use during
this tutorial. Downloading the data is the first step in the process of preparing the
data for analysis. This sample data contains raw log records and configuration files
that will be analyzed in this tutorial by the IBM Accelerator for Machine Data
Analytics applications.

The sample data for this tutorial is available on the IBM developerWorks website.
This batch of Watson machine data represents log files of one type and contains a
metadata file that describes the key characteristics and assumptions that are
inherent in the files.

Learn more about the Data Download application: The Data Download
application is a sample application that ships with InfoSphere BigInsights. It
downloads sample data sets that are used in tutorials from IBM developerWorks.
For more information about this application, see the Data Download application.
1. Open the InfoSphere BigInsights Console.
2. Select the Applications tab.
3. Locate and select the Data Download application:
a. In the Search field, type Data Download.
b. Optional: If the Data Download application has not been run, it may not be
available in the list, and you need to deploy it before you can run it.
4. Complete the required application parameters:
a. In the Execution name field, enter watson as the name for this execution of
the application. An execution name saves the parameter values for this run of
the application so that you can run the application again with the same
parameters.
b. To accept the download terms and conditions, select the Agree to terms
check box.
c. From the Select data set drop-down menu, select Sample log data set.
d. In the Target directory field, enter /watson as the distributed file system
directory where you want to save the output file.
5. To run the application, click Run.
6. In the Application History panel in the lower half of the window, monitor the
progress of the application.

The sample data and metadata file are downloaded and uncompressed to the
/watson/input/batch_watson directory, and the sample configuration files are
downloaded to the /watson/config directory.

View the downloaded sample data. On the Files tab, navigate to the
/watson/input/batch_watson directory. If the Files tab is already open, you might
need to click the Refresh icon. The batch_watson directory contains two
files:
log.txt
Contains the contents of the downloaded sample data files, by default, in
text format. For each HTTP request to the Watson site, the sample data
contains a line that shows information about that request, for example the
IP address of proxy server, the date and time of the request, the request
itself, the path of the requested page, the result of the request, and the
requesting client. To view the data in a grid-like format, click the Sheet
radio button:

Tip: To ensure that the data displays across the entire viewing pane, click
Fit Column(s).

metadata.json
Contains the metadata for the sample machine data, which includes the
type of log file, batch ID, server name, and datetime format. To view the
file in a grid-like format, click the Sheet radio button:

Lesson 2: Extracting the sample data


In this lesson, you run the Extraction application to prepare the sample data files
that you downloaded to be indexed and searched. The extraction process prepares
the raw data files for indexing by splitting the sample data into records, extracting
fields from those records, normalizing the time stamps, and examining and
enriching the metadata.

The /watson/config/extract_config/extract.config configuration file defines the
fields that the application extracts from the sample data. This configuration file is
specific to the sample data.

Learn more about the Extraction application: The Extraction application is an
application that ships with IBM Accelerator for Machine Data Analytics, and it:
v Analyzes the metadata of machine data
v Splits the data into records
v Extracts data fields
v Normalizes the time stamps of the data files in a batch
v Enriches the records with metadata
v Saves the machine data files to a user-specified directory on the distributed file
system.
For more information about this application, see Running the Extraction
application.
1. In the InfoSphere BigInsights Console, select the Applications tab.
2. To locate and select the Extraction application, type Extraction in the Search
field. If the Extraction application is not available, deploy the application.
3. Complete the required application parameters:
a. In the Execution name field, enter watson.
b. In the Source directory field, enter /watson/input as the top-level directory
that contains the subdirectory that holds your batch data.
c. In the Output path field, enter /watson/output/extract_out as the directory
that will contain the copies of the source directory and subdirectories.
d. In the Extract.config file field, point to the /watson/config/extract_config/
extract.config file.
4. To run the application, click Run.

View the results of the extraction:


1. On the Files tab, navigate to and select the /watson/output_sample/
extract_out/batch_watson.csv file.
The batch_watson.csv file contains the results of the extraction, by default, in
text format. Each row of the extracted results is a record, and each field in the
record is a column when you view it with the CSV reader.
2. To organize the information into rows, with one record per line, click the Sheet
radio button:

3. View your output as comma-separated value (CSV) data:

a. To edit the collection reader, click the Pencil icon.
b. From the Select a reader drop-down menu, select Comma Separated Value
(CSV) Data.
c. Select the Headers included? check box, then click the green check mark.
This view organizes the information into a grid-like format with rows for each
record and columns for each aspect of the file. Notice the normalized DateTime
format, URI paths, and CodesAndValues data:

Lesson checkpoint
The application read the files in the input directory, extracted fields from your
sample data according to the specifications in the extract.config file, and
generated output files to the specified output directory. Then, you viewed the

extracted data in text, sheet, and CSV formats in the InfoSphere BigInsights
Console. Next, you index the extracted fields, making them searchable.

Lesson 3: Indexing the sample data


In this lesson, you use the Index application to create facets, which are entries for
each field in the machine data records. These facets are created from the fields that
were extracted when you ran the Extraction application, and they enable the
machine data to be keyword-searchable.
1. On the Files tab of the InfoSphere BigInsights Console, navigate to the
/user/applications/MDA/sample_logmonitoring_connections.properties file.
2. Save a copy of the sample_logmonitoring_connections.properties file, and
update the copy with the InfoSphere BigInsights administrative user name and
password for the node where the InfoSphere BigInsights Console is installed:

a. Click the Copy icon, navigate to the location where you want to
save the copy, then click OK.
b. Navigate to the copy of the sample_logmonitoring_connections.properties
file, and click Edit.
c. Update the file with the password, user name, and host name of the node
where the InfoSphere BigInsights Console is installed, for example:
#BigInsights Credential Store file
#Contains the Console Node Host ID,
#the login Username/Password for the console node
password=mypassword
username=biadmin
host=ConsoleNodeHostID
d. Click Save.

The /watson/config/index_config/index.config configuration file is specific to the
sample data and defines how the application indexes each field in the machine
data.

Learn more about the Index application: The Index application is an application
that ships with IBM Accelerator for Machine Data Analytics. It creates facets for
each field in a batch of extracted machine data and uses the facets to create
indexes of the batches. For more information about this application, see Running
the Index application.
1. In the InfoSphere BigInsights Console, select the Applications tab.
2. To locate and select the Index application, type Index in the Search field. If the
Index application is not available, deploy the application.
3. Complete the required application parameters:
a. In the Execution name field, enter watson.
b. In the Operations drop-down menu, select All.
c. In the Source Directory field, enter /watson/output/extract_out as the
top-level directory that contains the subdirectory that holds your batch data.
d. In the Output Directory field, enter /watson/output/index_out as the
directory that will contain the output of the Index application.
e. In the Index.config File field, point to the /watson/config/index_config/
index.config file.
f. In the Credentials File Path field, specify the path for the modified
sample_logmonitoring_connections.properties file.
g. In the Log Workflow field, select Other Logs. This option indicates that the
log files are not InfoSphere BigInsights log files. In this case, they are log
files from the servers that host the IBM Watson website.
h. Deselect the Recreate Indexes check box, and select the Index Only New
Logs check box.
4. To run the application, click Run.

View your indexed results. On the Welcome tab of the InfoSphere BigInsights
Console, click Search machine data under the Quick Links section. The Faceted
Search user interface opens in a new window and displays 31,469 results:

Learn more about Faceted Search: The Faceted Search UI enhances search results
by:
v Identifying the file name, type, batch ID, URL, timestamp, and other pertinent
information about the files that contain your search term
v Filtering entries by text, time range, and date range across servers and time
zones
v Displaying the results in graph form, categorized by the months that you specify
in the filter
v Aiding in analysis and troubleshooting.

Lesson 4: Identifying frequent sequences of errors


In this lesson, you run the Time Window Transformation - Frequent Sequence
Analysis chained application to organize your indexed and prepared data into
sessions and then identify events that occur just before errors in the data. These
errors and their associated events identify potential causes and patterns of errors in
the data. This process of identifying and understanding errors can help you better
understand your own machine data.

Sessions are sets of data that are grouped by the following characteristics:
v A partition key, which identifies and divides a data file
v A time window, or time gap, that exceeds user-specified thresholds.

A chained application is two or more linked, or chained, applications, such that the
output of one application is the input to the next.

The Time Window Transformation application groups your machine data records
into sessions. Then, the Frequent Sequence Analysis application analyzes those
sessions for frequently occurring sequences of events to illustrate potential causes
and patterns of errors.

Learn more about the Time Window Transformation - Frequent Sequence
Analysis chained application: The Time Window Transformation - Frequent
Sequence Analysis chained application is an application that ships with IBM
Accelerator for Machine Data Analytics. The chained application first groups your
machine data records into sessions. In the sample machine data, the partition key
is UserAgent, the time gap threshold is set to 20 seconds, and the UserAgent refers
to the browser that sends the HTTP request. Therefore, when the Time Window
Transformation application is run, the records are first divided by UserAgent, and
when the application detects a gap of 20 seconds or more in log activity, the
records are further divided by combining log entries until the time gap threshold is
reached.

Next, the application analyzes the transformed sessions and identifies frequently
occurring sequences of events to illustrate potential causes and patterns of errors.

For more information about this application, see Running the Time Window
Transformation - Frequent Sequence Analysis chained application.
1. In the InfoSphere BigInsights Console, select the Applications tab.
2. To locate and select the Time Window Transformation - Frequent Sequence
Analysis chained application, type Time Window Transformation - Frequent
Sequence Analysis in the Search field. If the Time Window Transformation -
Frequent Sequence Analysis application is not available, deploy the application.
3. Complete the required application parameters:
a. In the Execution name field, enter watson as the name for this run of the
application.
b. In the Applications drop-down menu, select All.
c. In the Time Window input path field, enter /watson/output/extract_out as
the directory of the output of the Extraction application.
d. In the Time Window output path field, enter /watson/output/timegap_out
as the directory where you want to save the output of the Time Window
Transformation.
e. In the Time Window configuration file field, point to the
/watson/config/timeGapTransform_config/timeGapTransform.config file.
This configuration file is specific to the sample data and controls how the
application analyzes the time windows.
f. In the Frequent Sequence output path field, enter /watson/output/
output_freqseq_out as the directory where you want to save the output file
for the Time Window Transformation - Frequent Sequence Analysis chained
application.
g. In the Frequent Sequence configuration file field, point to the
/watson/config/frequentSequence_config/frequentSequenceMiner.config
file. This configuration file is specific to the sample data and controls how
the application analyzes the frequently occurring sequences of events.
4. Add a new row to the Frequent Sequences workbook to contain the output of
this application:

Learn more about updating workbooks: A workbook is a collection of data
from a results file, which is represented in a grid-like format with rows and
columns. This step adds a parameter to this execution name so the application
updates the FrequentSequences workbook with the output path of the Frequent
Sequence Analysis application. For more information about InfoSphere
BigInsights workbook types, see Master workbooks, workbooks, and sheets.
a. Under Schedule and Advanced Settings, select the Update Workbook check
box.
b. From the Select Workbook drop-down menu, select FrequentSequences.
c. From the Select Output drop-down menu, select Frequent Sequences
output path, and click Add row.
5. Scroll to the top of the Applications tab, and click Run. Because this
application is processing parameters from two separate applications and
because of the complexity of the transformation and analysis, the application
might take several minutes to run.

Lesson 5: Viewing frequent sequence results


In this lesson, you view the results of the Time Window Transformation - Frequent
Sequence Analysis chained application in a workbook, chart, and dashboard in the
InfoSphere BigInsights Console. Each visualization option offers different
advantages, depending on how you want to view and filter your analysis results.
These visualizations create consumable output for machine data that can be
difficult to read without proper formatting.

On the BigSheets tab of the InfoSphere BigInsights Console, you can view data in
a workbook and chart to visually represent your data. On the Dashboard tab, you
can create a dashboard to see multiple graphical charts simultaneously and
compare results across various views.

Learn more about the visualization options: Workbooks display data from a
results file in sheet-like representations with rows for each record and columns for
different aspects of the data. Charts display data either in a grid with X and Y axes
or in a pie chart. Dashboards can be customized to show multiple charts if, for
example, you want to compare results from multiple runs over time or different
application runs after you update the application configuration files.
1. On the BigSheets tab of the InfoSphere BigInsights Console, select the
FrequentSequences workbook. This workbook shows the results of the Time
Window Transformation - Frequent Sequence Analysis chained application,
which are frequently occurring sequences of events, in a grid-like format:

Tip: You may have to click Fit column(s) and scroll to the right to see all the
workbook columns.

2. Click the FrequentSequences tab at the bottom of the workbook. You see a bar
chart that shows frequently occurring sequences of events. Viewing your results
as a chart allows you to see a graphic representation of your data. Hover over
the bars in the chart to see details about the particular data that the bar
represents:

3. On the Dashboard tab, select Machine Data Accelerator from the Select
dashboard drop-down menu. Right now, you see only one widget in the
dashboard because you have run only one application. In the next lesson, you
run another application and then view the results of both applications. At that
point, the dashboard will have two widgets:

Lesson 6: Identifying event significance on errors


In this lesson, you run the Join Transformation - Significance Analysis chained
application to identify correlations between uniform resource identifiers (URIs), which
are character strings that identify a record (in the sample machine data, the URIs are
seed events), and CodesAndValues, which are fields that contain an HTTP response
code and an associated value. This analysis helps you understand the significance of events
on errors in the sample data and is a process that you can apply to analysis of
your own machine data.

The Join Transformation application joins your machine data records into sessions,
and the Significance Analysis application identifies correlations within those
sessions.

Learn more about the Join Transformation - Significance Analysis chained
application: The Join Transformation - Significance Analysis chained application is
an application that ships with IBM Accelerator for Machine Data Analytics. The
chained application first joins the sessions in your machine data records, analyzes
the URIs, joins URI and CodesAndValues data based on user-specified join
conditions (like time window), and associates errors with the URIs that precede
them.

Next, the application reads the output of the Join Transform application and finds
correlations between the URIs and CodesAndValues. These correlations indicate
the significance of URI events and errors.

For more information about this application, see Running the Join Transformation -
Significance Analysis chained application.
1. In the InfoSphere BigInsights Console, select the Applications tab.
2. To locate and select the Join Transformation - Significance Analysis chained
application, type Join Transformation - Significance Analysis
in the Search field. If the Join Transformation - Significance Analysis
application is not available, deploy the application.
3. Complete the required application parameters:
a. In the Execution name field, enter watson.
b. In the Applications drop-down menu, select All.
c. In the Event log and Context log fields, enter /watson/output/extract_out
as the output directory of the Extraction application.
d. In the Join output path field, enter /watson/output/join_out as the
directory where you want to save the output file for the Join Transformation
application.
e. In the Join configuration file field, point to the /watson/config/
joinTransform_config/joinTransform.config file. This configuration file is
specific to the sample data and controls how the application joins the
sessions in the machine data.
f. In the Significance output path field, enter /watson/output/
significance_out as the directory where you want to save the output for
the Significance Analysis application.
g. In the Significance configuration file field, point to the
/watson/config/significanceAnalysis_config/
significanceAnalysis.config file. This configuration file is specific to the
sample data and controls how the application analyzes correlations between
URIs and CodesAndValues.
4. Add a new row to the Significance Analysis workbook to contain the output of
this application:
a. Under Schedule and Advanced Settings, select the Update Workbook check
box.
b. From the Select Workbook drop-down menu, select SignificanceAnalysis.
c. From the Select Output drop-down menu, select Significance Output Path,
and click Add row.
5. Scroll to the top of the Applications tab, and click Run. Because this
application is processing parameters from two separate applications and
because of the complexity of the transformation and analysis, it might take
several minutes to run.

Lesson 7: Viewing significance analysis results


In this lesson, you view the results of the Join Transformation - Significance
Analysis chained application in a workbook, chart, and dashboard in the
InfoSphere BigInsights Console.

1. On the BigSheets tab of the InfoSphere BigInsights Console, select the
SignificanceAnalysis workbook. This workbook shows the results of the Join
Transformation - Significance Analysis chained application, which are
correlations between URIs and CodesAndValues, in a grid-like format:

2. Create a bar chart for your data:


a. Click Add chart > chart > Bar.
b. In the Chart Name field, enter WatsonSignificanceAnalysisBar.
c. In the Title field, enter Watson Significance Analysis.
d. In the X Axis field, enter Watson Significance Analysis.
e. In the X Axis drop-down menu, select discriminative_power.
f. In the X Axis Label field, enter Discriminative Power.
g. In the Y Axis Series drop-down menu, select Count occurrences of X axis
values.
h. In the Y Axis Label, enter Count.
i. In the Sort By drop-down menu, select Y Axis.
j. In the Occurrence Order drop-down menu, select Descending.
k. In the Limit field, enter 20.
l. In the Template drop-down menu, select One UI.
m. In the Style drop-down menu, select Regular Bar.

n. To save the chart settings, click the green checkmark, then click
Run. You see a bar chart that calculates the total number of each Watson
feature (column 2 in the SignificanceAnalysis workbook) in the sample
data:

3. View the results of the Join Transformation - Significance Analysis and Time
Window Transformation - Frequent Sequence Analysis chained applications in
the Machine Data Accelerator dashboard:
a. On the Dashboard tab, select Machine Data Accelerator from the Select
dashboard drop-down menu. You see the Frequent Sequences chart
displayed as one widget.
b. In the upper right corner of the Dashboard pane, click Add Widget.
c. Locate Watson Significance Analysis, click Add Widget below the name,
then close the Add Widget window by clicking the red X in the upper
right corner.
d. Either drag and drop the widgets so you can view them both, or click
Arrange to fit the widgets in the window. You now see the two bar charts
that you created during this tutorial:

e. To save the Machine Data Accelerator dashboard, click the Save icon.

Summary: Analyzing machine data errors


In this tutorial, you identified and analyzed errors in the sample machine data, and
then you viewed the frequent sequences and significance of those errors.

These steps summarize your process of analyzing errors in machine data.


1. You downloaded the batch of sample machine data.
2. You extracted fields from the sample data.
3. You created facets for each field in the sample data.
4. You viewed the results of the index in the Faceted Search UI.
5. You identified and viewed frequently occurring sequences of errors in the
sample data.
6. You identified and viewed the significance of events on errors in the sample
data.
7. You viewed the results of your analysis in workbook, chart, and
dashboard formats.

Lessons learned

You now have a good understanding of the following tasks:


v How to prepare and analyze your machine data.

v How to specify application parameters to customize your analysis.
v How to access the Faceted Search UI.
v How to view your analysis results in a workbook, chart, and dashboard.
v How to add charts to an existing dashboard.

Chapter 10. Tutorial: Identifying user feedback in social data
Learn how to use IBM Accelerator for Social Data Analytics to import and analyze
your social data files. IBM Accelerator for Social Data Analytics includes a rich set
of prepackaged text analytics extractors for IBM InfoSphere BigInsights and IBM
InfoSphere Streams that enable enterprises to accurately assess how consumers
perceive their products and offerings, as well as their competitors and their
offerings.

In this tutorial, you download and analyze sample social data. The sample data
contains tweets about Sample Outdoor Company, a fictional company that sells
outdoor recreation equipment, sporting goods, and clothing, in Twitter format.
Your goal is to identify consumer perceptions of Sample Outdoor Company
products.

To help you manage brands by understanding consumer sentiment, the
analytics applications include:
v Analytics to identify a snapshot of a user profile, for example demographic or
behavioral, based on a particular chunk of information over a particular interval
of time, called local analysis
v Analytics that combine these snapshots to produce a global profile of that user,
called global analysis

Learning objectives

After completing the lessons in this module, you will understand the concepts and
processes associated with:
v Identifying and preparing your data
v Identifying meaningful information in your data
v Viewing your analysis in a workbook, chart, and dashboard.

Time required

This module should take approximately 60 minutes to complete.

Prerequisites
1. Install IBM InfoSphere BigInsights.
2. Install IBM InfoSphere Streams.
3. Install IBM Accelerator for Social Data Analytics.

Lesson 1: Downloading the sample data


In this lesson, you download and prepare sample data that you will use during
this tutorial.

The sample data for this tutorial is available on the IBM developerWorks website.
This batch of sample social data represents a set of text files that contains tweets
about a fictional company, Sample Outdoor Company.

Learn more about Gnip: Gnip is a company that collects, normalizes, and then
delivers web data, including social data, in a specific, detailed format. Each record,
or line, in a data file includes the tweet text and other information about that tweet
and the person who posted it, such as the date and time of the tweet and, in some
cases, the location where the tweet was posted. You can view all of this
information as a worksheet by using BigSheets in InfoSphere BigInsights.

Learn more about the Data Download application: The Data Download
application is a sample application that ships with InfoSphere BigInsights. It
downloads sample data sets that are used in tutorials from IBM developerWorks.
For more information about this application, see the Data Download application.
1. Open the InfoSphere BigInsights Console.
2. Select the Applications tab.
3. Locate and select the Data Download application:
a. In the Search field, type Data Download.
b. Optional: If the Data Download application has not been run, it may not be
available in the list, and you need to deploy it before you can run it.
4. Complete the required application parameters:
a. In the Execution name field, enter SocialSampleDataDownload as the name
for this execution of the application. An execution name saves the parameter
values for this run of the application so that you can run the application
again with the same parameters.
b. Select the Agree to terms check box to accept the download terms and
conditions.
c. From the Select Data Set drop-down list, select Sample social data set.
d. In the Target directory field, enter /BigOut/data/raw_tweets/ as the
distributed file system directory where you want to save the output file.
5. Click Run to run the application.
6. In the Application History panel in the lower half of the window, monitor the
progress of the application.

Result

The sample data files are downloaded and uncompressed to the
/BigOut/data/raw_tweets/input/batch_tweets directory, and the sample
configuration files are downloaded to the /BigOut/data/raw_tweets/config
directory.
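
To confirm the download outside the console, you can list the target directories
from a command line. The following Python sketch is only an illustration; it
assumes that the hadoop command-line client is available on the machine where
you run it and that the download completed to the paths shown above.

import subprocess

# Directories created by the SocialSampleDataDownload run described above.
paths = [
    "/BigOut/data/raw_tweets/input/batch_tweets",
    "/BigOut/data/raw_tweets/config",
]

for path in paths:
    # "hadoop fs -ls" lists the contents of a distributed file system directory.
    subprocess.run(["hadoop", "fs", "-ls", path], check=True)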

View the downloaded sample data:


1. Click the Files tab, and then navigate to the /BigOut/data/raw_tweets/input
directory. If the Files tab is already open, you may need to click the Refresh
icon to view the downloaded sample data files.


The batch_tweets directory contains three files:
All_Products_tweets1
Contains the contents of the downloaded All_Products_tweets1 sample
data file, by default, in text format. The file contains Twitter
information about Sample Outdoor Company products, for example, the
date and time of the message, any hashtags or URLs that the message
contains, and whether the message has been added to a favorites list.
Click the Sheet radio button to view the data in a grid-like format:

Tip: You may have to click Fit Column(s) so the data displays across
the entire viewing pane.

CourseProGolfBag_Neg_tweets3.dat
Contains the contents of the downloaded
CourseProGolfBag_Neg_tweets3.dat sample data file, by default, in text
format. The file contains negative feedback from Twitter about the
Sample Outdoor Company Course Pro Golf Bag. This file contains the
same details about the Twitter message that the All_Products_tweets1
file does, but it is focused on negative buzz, sentiment, and feedback
about a particular product, the Course Pro Golf Bag. Click the Sheet
radio button to view the data in a grid-like format:

CourseProGolfBag_Pos_tweets3.dat
Contains the contents of the downloaded
CourseProGolfBag_Pos_tweets3.dat sample data file, by default, in text
format. The file contains positive feedback from Twitter about the
Sample Outdoor Company Course Pro Golf Bag. This file contains the
same details about the Twitter message that the All_Products_tweets1
and CourseProGolfBag_Neg_tweets3.dat files do, but it is focused on
positive buzz, sentiment, and feedback about a particular product, the
Course Pro Golf Bag. Click the Sheet radio button to view the data in a
grid-like format:
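
To inspect the raw tweet files outside the console, you can count the records in
each file with a short script. The following Python sketch is only an illustration;
it assumes that you first copied the batch_tweets directory to the local file
system (for example with hadoop fs -get) and that each nonblank line of a file is
one tweet record, as described in the Gnip format note earlier in this lesson.

import os

# Local copy of /BigOut/data/raw_tweets/input/batch_tweets, for example created with:
#   hadoop fs -get /BigOut/data/raw_tweets/input/batch_tweets .
local_dir = "batch_tweets"

for name in sorted(os.listdir(local_dir)):
    with open(os.path.join(local_dir, name), encoding="utf-8", errors="replace") as f:
        # One tweet record per nonblank line, so the line count is the record count.
        records = sum(1 for line in f if line.strip())
    print(name, records, "records")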

Lesson 2: Configuring the sample data
In this lesson, you edit the sample data files to prepare them to be analyzed and to
optimize the results of your analysis.

Due to the inherently short length of tweets, Twitter users often abbreviate
common terms, phrases, and proper names. Tweets also often contain words that
might be extraneous to the analytical context. To ensure that the IBM Accelerator
for Social Data Analytics applications properly analyze the tweets, it is important
that you define and update the search objects file and the aliases file.
1. On the Files tab of the InfoSphere BigInsights Console, navigate to and select
the /BigOut/data/raw_tweets/config/searchobjects_BigOutdoors.csv file. If
the Files tab is already open in another window, you may need to click the
Refresh icon to view the downloaded sample data files.

Learn more about the search objects file: The searchobjects_BigOutdoors.csv
file lists search objects and their features for which the sentiment, intent, and
buzz are extracted from tweets, boards, and blogs and then analyzed. A search
object is a product or company name that you want to enter into a search, and a
search object feature is the brand, family, subfamily, or product that is associated
with your search object. The first line of the file contains the following schema,
or pattern, that the rest of the file follows:
SearchObject | Category | Format | Brand | Family | SubFamily | Product

The rest of the file contains information about the search object and its features,
one per line, following the schema. A minimal script that checks this structure,
and the structure of the aliases file, is sketched at the end of this procedure.
2. Click Edit.
3. Add the following text at the end of the file:
TRAILCHEF TENT|CAMPING EQUIPMENT||TRAILCHEF|CAMPING GEAR||TENT

This line adds the TRAILCHEF TENT search object to the file. In the next lesson, the Brand
Management Retail Configure – Local – Global Analysis chained application
searches for and returns matches to these search objects.
4. Click Save.
5. On the Files tab, navigate to and select the /BigOut/data/raw_tweets/config/
aliases_BigOutdoors.csv file.

Learn more about the aliases file: The aliases_BigOutdoors.csv file maps
abbreviations and alternative references to a brand (company or product) to the
full brand name. Each line of the file is a set of aliases separated by a caret (^)
and, at the end of the line, the pipe character (|) identifies the actual search
object or search object feature (brand, family, sub-family, or product), for
example:
Golf Set^Tee Set^Tee Sets^Golf Sets|Golf and Tee Set
6. Note the last line of the file:
Great Outdoor^GreatOutdoors|Great Outdoors

This line adds the Great Outdoors brand name to the aliases file, enabling the
Brand Management Retail Configure – Local – Global Analysis chained
application to capture all references to the Great Outdoors brand.
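
Optionally, you can sanity-check the structure of both configuration files before
you run the analysis. The following Python sketch is not part of the accelerator;
it assumes that you copied the two CSV files to a local working directory (for
example with hadoop fs -get) and it checks only the structure described above:
seven pipe-separated fields per search object line, and a pipe character before
the canonical name on each alias line.

# Minimal structural checks for the two configuration files described in this lesson.
# Assumes local copies of the files from /BigOut/data/raw_tweets/config.

def check_search_objects(path):
    with open(path, encoding="utf-8") as f:
        for number, line in enumerate(f, start=1):
            line = line.rstrip("\n")
            if not line:
                continue
            # Schema: SearchObject|Category|Format|Brand|Family|SubFamily|Product
            fields = line.split("|")
            if len(fields) != 7:
                print(f"{path}, line {number}: expected 7 fields, found {len(fields)}")

def check_aliases(path):
    with open(path, encoding="utf-8") as f:
        for number, line in enumerate(f, start=1):
            line = line.rstrip("\n")
            if not line:
                continue
            # Each line: alias^alias^...^alias|canonical search object or feature
            if "|" not in line:
                print(f"{path}, line {number}: missing the pipe before the canonical name")

check_search_objects("searchobjects_BigOutdoors.csv")
check_aliases("aliases_BigOutdoors.csv")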

Lesson checkpoint
By modifying and noting changes to the CSV files, you successfully configured
how your applications will analyze the sample data. Next, you run the Brand
Management Retail Configure – Local – Global Analysis chained application to
identify user sentiment in the data files.

Lesson 3: Analyzing the sample data


In this lesson, you analyze the social data to understand consumer sentiment,
buzz, intent, and ownership of all key products of the company.

Before you begin, configure security for the applications.

A chained application is two or more linked, or chained, applications, such that the
output of one application is the input to the next.

The Configuration application enables you to set parameters and search terms for
feedback analysis, or analysis of expressions of sentiment, buzz, intent to purchase,
and other dimensions of your analysis.

The Local Analysis application performs narrow but deep data analysis, running
all the annotators and extractors that were compiled by the Configuration
application on one tweet, blog post, or board post at a time and generating
feedback and profile information.

The Global Analysis application aggregates the results of multiple runs of the Local
Analysis application to generate either a global profile of a user or a global profile
of a user associated with feedback of that user. The global profile is based on
boards, blogs, and tweets and includes both demographic and behavioral
information.

Learn more about the Configure - Local - Global chained application: The
Configure - Local - Global chained application links the Configuration, Local
Analysis, and Global Analysis applications, enabling you to configure, or
customize, your analysis and then run local and global analysis at the same time.
For more information about this chained application, see Running the Configure -
Local - Global chained application.
1. In the InfoSphere BigInsights Console, select the Applications tab.
2. To locate and select the Brand Management Retail Configure – Local – Global
Analysis chained application, type Brand Management Retail Configure –
Local – Global Analysis in the Search field. If the Brand Management Retail
Configure – Local – Global Analysis chained application has not been run, it
may not be available in the list, and you need to deploy it before you can run
it.
3. Complete the required application parameters:
a. In the Execution name field, enter BigOut1.
b. From the Applications drop-down menu, select All.
c. In the Scenario Name field, enter BigOut-BigPicture-Run1.
d. In the Objects File field, enter /BigOut/data/raw_tweets/config/
searchobjects_BigOutdoors.csv. This searchobjects_BigOutdoors.csv file is
the same file that you modified in the previous lesson of the tutorial.
e. In the Alias File field, enter /BigOut/data/raw_tweets/config/
aliases_BigOutdoors.csv. This aliases_BigOutdoors.csv file is the same
file that you viewed in the previous lesson of the tutorial.
f. Leave the Companies file, Positive filter file, and Negative filter file fields
blank.
g. Keep the en_US and Twitter default values for the Language and Source
Type fields, respectively.
h. In the Input Path field, enter /BigOut/data/raw_tweets/input/
batch_tweets.
i. In the Start Date field, select October 1, 2009.
j. Accept the default values for the Start Time, End Date, and End Time
fields.
k. In the Local output path field, enter /BigOut/results/localanalysisoutput.
The output is in CSV format.
l. In the Global output path field, enter /BigOut/results/
globalanalysisoutput. The output is in CSV format.
m. Leave the Generate User Profiles check box unchecked, and ensure that
the Generate User Profiles with Feedback check box is selected.
4. Add a new row to the Brand Management Retail Global Analysis Feedback
workbook to contain the output of this application:

Learn more about updating workbooks: A workbook is a collection of data
from a results file, represented in a grid-like format with rows and columns.
This step adds a parameter to this execution name so the application updates
the BM_Retail_GA_Feedback workbook with the Global Analysis application
output path. For more information about InfoSphere BigInsights workbook
types, see Master workbooks, workbooks, and sheets.
a. Under Schedule and Advanced Settings, select the Update Workbook check
box.
b. From the Select Workbook drop-down menu, select
BM_Retail_GA_Feedback.
c. From the Select Output drop-down menu, select Global Output path, and
click Add row.
5. Scroll to the top of the window, and click Run. Because this application is a
chained application and is processing parameters from three separate
applications and because of the complexity of the analysis, it might take several
minutes to run.

View the results of the Brand Management Retail Configure – Local – Global
Analysis chained application:
1. On the Files tab, navigate to and select the /BigOut/results/
globalanalysisoutput/globalFeedback.csv file.
This file contains the results of the Global Analysis application, by default, in
text format. Each row of the extracted results is a Twitter message, and each
field in the record is a column when you view it with the CSV reader.
2. To organize the information into rows, with one record per line, click the Sheet
radio button:

3. View your output as comma-separated value (CSV) data:
a. Click the Pencil icon to edit the collection reader.
b. From the Select a reader drop-down menu, select Comma Separated Value
(CSV) Data.
c. Select the Headers included? check box, then click the green check mark.
This view organizes the information into a grid-like format with rows for each
record and columns for each aspect of the file. You may have to click Fit
column(s) to arrange the columns more evenly across the viewing pane. Notice
the columns for the search object and the category, brand, and product of the
search object. If you scroll to the right of the file, you see additional feedback
dimensions, like whether the feedback contains buzz, sentiment, and ownership
information, the date and time of the message, and the original message text:

4. Navigate to and select the /BigOut/results/localanalysisoutput/
localFeedback.csv file.
This file contains the results of the Local Analysis application, by default, in
text format. Each row of the extracted results is a Twitter message, and each
field in the record is a column when you view it with the CSV reader.
5. To organize the information into rows, with one record per line, click the Sheet
radio button:

6. View your output as comma-separated value (CSV) data:
a. Click the Pencil icon to edit the collection reader.
b. From the Select a reader drop-down menu, select Comma Separated Value
(CSV) Data.
c. Select the Headers included? check box, then click the green check mark.
This view organizes the information into a grid-like format with rows for each
record and columns for each aspect of the file. Notice the columns for the
search object and the category, brand, and product of the search object. If you
scroll to the right of the file, you see additional feedback dimensions, like
whether the feedback contains buzz, sentiment, and ownership information, the
date and time of the message, and the original message text:
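
Because the Local and Global Analysis output is plain CSV, you can also process
it with a script outside the console. The following Python sketch is only an
illustration; it assumes a local copy of globalFeedback.csv, that the file includes
a header row, and that one header name contains the word PRODUCT. The actual
column names in your output might differ, so adjust the script accordingly.

import csv
from collections import Counter

# Local copy of /BigOut/results/globalanalysisoutput/globalFeedback.csv,
# for example created with hadoop fs -get.
with open("globalFeedback.csv", newline="", encoding="utf-8") as f:
    reader = csv.DictReader(f)
    headers = reader.fieldnames or []
    # Pick the first header whose name mentions PRODUCT; adjust if your columns differ.
    product_column = next((name for name in headers if "PRODUCT" in name.upper()), None)
    if product_column is None:
        raise SystemExit("No product column found; check the header names in the file.")
    counts = Counter(row[product_column] for row in reader if row.get(product_column))

# Feedback volume per product, highest first.
for product, count in counts.most_common():
    print(product, count)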

Lesson checkpoint
You successfully ran the analysis, according to the configurations in the
searchobjects_BigOutdoors.csv and aliases_BigOutdoors.csv files, on the sample
data. In the next lesson, you view general Twitter feedback about Sample Outdoor
Company products and positive and negative user feedback about the Course Pro
Golf Bag in the BM_Retail_GA_Feedback workbook.

Lesson 4: Viewing analysis results


In this lesson, you view the results of the Brand Management Retail Configure –
Local – Global Analysis chained application on the BigSheets or Dashboard tab of
the InfoSphere BigInsights Console, depending on your preferred output format, to
identify positively and negatively perceived products as well as the products and
brands about which the marketplace is saying the most.

On the BigSheets tab, you can view data in a workbook and create line and bar
charts to visually represent your data. On the Dashboard tab, you can also create a
dashboard to see multiple graphical charts simultaneously and compare results
across various views.

Learn more about dashboards: Dashboards can be customized to show multiple
charts if, for example, you want to compare results from multiple runs over time
or different application runs after you update the application configuration files.
These visualizations create consumable output from social data that can be
difficult to read without proper formatting.
1. View your results as a workbook:
a. On the BigSheets tab, select the BM_Retail_GA_Buzz workbook.
b. Click Run to update the workbook data with the results of the application.
Each row of the workbook contains a search object along with information
about that search object, such as its category, brand, and product, the date
and time of the message, and the text of the message.

Tip: You may have to click Fit column(s) and scroll to the right to see all
the workbook columns.

2. View your results as a chart:


a. When the run is complete (100%), click the BM Retail Popular Brands tab
at the bottom of the workbook.
b. You see a message that the chart has not been run. Click Run to update the
BM Retail Popular Brands chart. You see a progress indicator in the upper
right corner of the window. When the chart is finished running, you see the
most popular brands, based on the positive sentiment and the number of
tweets that mention those brands:

c. Optional: You can repeat this process to view the results in the following
charts:

v The BM Retail Most Interesting Products tab in the
BM_Retail_GA_Intent workbook shows the most interesting products,
based on the number of tweets that mention those products, in order
from most to least interesting.
v The BM Retail Positive Sentiment tab in the
BM_Retail_GA_PositiveSentiment workbook shows positive sentiment by
product, reflecting the products with the highest volume of positive
words in positive contexts, in order from highest to lowest volume.
v The BM Retail Negative Sentiment tab in the
BM_Retail_GA_NegativeSentiment workbook shows negative sentiment
by product, reflecting the products with the highest volume of negative
words in negative contexts, in order from highest to lowest volume.
3. View your results as a dashboard:
a. On the Dashboard tab, select Social Data Accelerator from the Select
dashboard drop-down menu, then click the Brand Management Retail tab at
the bottom of the dashboard. The dashboard contains the following charts:
BM Retail Popular Brands
The most popular brands, based on the positive sentiment and the
number of tweets that mention those brands
BM Retail Most Interesting Products
The most interesting products, based on the number of tweets that
mention those products, in order from most to least interesting
BM Retail Positive Sentiment
Positive sentiment by product, reflecting the products with the
highest volume of positive words in positive contexts, in order from
highest to lowest volume
BM Retail Negative Sentiment
Negative sentiment by product, reflecting the products with the
highest volume of negative words in negative contexts, in order
from highest to lowest volume

4. Optional: If you want to see the workbook from which any of these charts are
derived, click the blue arrow in the upper-right toolbar of the chart to
open the chart on the BigSheets tab. You see the output of the Global Analysis
application that was used to create the chart as a worksheet. Scroll through the
columns to see the data that is associated with each search object, the polarity
(positive or negative) of the tweet, the tweet text, and information about the
person who sent the tweet, such as screen name and geography. This
information is the beginning portion of the consumer profile.

Sample Outdoor Company analysts can use the charts in the Social Data
Accelerator dashboard to identify how customers perceive its products and brands.
By running the analysis applications once and viewing the output, analysts view a
snapshot, or point in time, of the sentiment and popularity of the products and
brands. Sample Outdoor Company can use these results to ensure the analyses are
correct by checking tweet text against aliases and other configuration settings.

Summary: Analyzing social data feedback


In this tutorial, you identified and analyzed user feedback in the sample social
data, and then you viewed the results of the analysis applications to understand
user perceptions of Sample Outdoor Company products.

These steps summarize your process of analyzing feedback in social data.


1. You downloaded the batch of sample social data.
2. You specified analysis variables in the search object and alias configuration
files.
3. You used the Configure – Local – Global Analysis chained application to
understand consumer perceptions of Sample Outdoor Company and its
products.
4. You identified and viewed the results of your analysis in chart, dashboard, and
workbook formats.

Lessons learned

You now have a good understanding of the following tasks:


v How to prepare and analyze your social data.
v How to specify application parameters to customize your analysis.
v How to identify and understand user perceptions of your brand.
v How to view your analysis results in a chart, dashboard, and workbook.

Notices and trademarks
This information was developed for products and services offered in the U.S.A.
This material may be available from IBM in other languages. However, you may be
required to own a copy of the product or product version in that language in order
to access it.

Notices

IBM may not offer the products, services, or features discussed in this document in
other countries. Consult your local IBM representative for information on the
products and services currently available in your area. Any reference to an IBM
product, program, or service is not intended to state or imply that only that IBM
product, program, or service may be used. Any functionally equivalent product,
program, or service that does not infringe any IBM intellectual property right may
be used instead. However, it is the user's responsibility to evaluate and verify the
operation of any non-IBM product, program, or service.

IBM may have patents or pending patent applications covering subject matter
described in this document. The furnishing of this document does not grant you
any license to these patents. You can send license inquiries, in writing, to:

IBM Director of Licensing
IBM Corporation
North Castle Drive
Armonk, NY 10504-1785 U.S.A.

For license inquiries regarding double-byte character set (DBCS) information,
contact the IBM Intellectual Property Department in your country or send
inquiries, in writing, to:

Intellectual Property Licensing
Legal and Intellectual Property Law
IBM Japan Ltd.
19-21, Nihonbashi-Hakozakicho, Chuo-ku
Tokyo 103-8510, Japan

The following paragraph does not apply to the United Kingdom or any other
country where such provisions are inconsistent with local law:
INTERNATIONAL BUSINESS MACHINES CORPORATION PROVIDES THIS
PUBLICATION "AS IS" WITHOUT WARRANTY OF ANY KIND, EITHER
EXPRESS OR IMPLIED, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED
WARRANTIES OF NON-INFRINGEMENT, MERCHANTABILITY OR FITNESS
FOR A PARTICULAR PURPOSE. Some states do not allow disclaimer of express or
implied warranties in certain transactions, therefore, this statement may not apply
to you.

This information could include technical inaccuracies or typographical errors.
Changes are periodically made to the information herein; these changes will be
incorporated in new editions of the publication. IBM may make improvements
and/or changes in the product(s) and/or the program(s) described in this
publication at any time without notice.

Any references in this information to non-IBM Web sites are provided for
convenience only and do not in any manner serve as an endorsement of those Web
sites. The materials at those Web sites are not part of the materials for this IBM
product and use of those Web sites is at your own risk.

IBM may use or distribute any of the information you supply in any way it
believes appropriate without incurring any obligation to you.

Licensees of this program who wish to have information about it for the purpose
of enabling: (i) the exchange of information between independently created
programs and other programs (including this one) and (ii) the mutual use of the
information which has been exchanged, should contact:

IBM Corporation
J46A/G4
555 Bailey Avenue
San Jose, CA 95141-1003 U.S.A.

Such information may be available, subject to appropriate terms and conditions,
including in some cases, payment of a fee.

The licensed program described in this document and all licensed material
available for it are provided by IBM under terms of the IBM Customer Agreement,
IBM International Program License Agreement or any equivalent agreement
between us.

Any performance data contained herein was determined in a controlled
environment. Therefore, the results obtained in other operating environments may
vary significantly. Some measurements may have been made on development-level
systems and there is no guarantee that these measurements will be the same on
generally available systems. Furthermore, some measurements may have been
estimated through extrapolation. Actual results may vary. Users of this document
should verify the applicable data for their specific environment.

Information concerning non-IBM products was obtained from the suppliers of
those products, their published announcements or other publicly available sources.
IBM has not tested those products and cannot confirm the accuracy of
performance, compatibility or any other claims related to non-IBM products.
Questions on the capabilities of non-IBM products should be addressed to the
suppliers of those products.

All statements regarding IBM's future direction or intent are subject to change or
withdrawal without notice, and represent goals and objectives only.

This information is for planning purposes only. The information herein is subject to
change before the products described become available.

This information contains examples of data and reports used in daily business
operations. To illustrate them as completely as possible, the examples include the
names of individuals, companies, brands, and products. All of these names are
fictitious and any similarity to the names and addresses used by an actual business
enterprise is entirely coincidental.

COPYRIGHT LICENSE:

This information contains sample application programs in source language, which
illustrate programming techniques on various operating platforms. You may copy,
modify, and distribute these sample programs in any form without payment to
IBM, for the purposes of developing, using, marketing or distributing application
programs conforming to the application programming interface for the operating
platform for which the sample programs are written. These examples have not
been thoroughly tested under all conditions. IBM, therefore, cannot guarantee or
imply reliability, serviceability, or function of these programs. The sample
programs are provided "AS IS", without warranty of any kind. IBM shall not be
liable for any damages arising out of your use of the sample programs.

Each copy or any portion of these sample programs or any derivative work, must
include a copyright notice as follows:

© (your company name) (year). Portions of this code are derived from IBM Corp.
Sample Programs. © Copyright IBM Corp. _enter the year or years_. All rights
reserved.

If you are viewing this information softcopy, the photographs and color
illustrations may not appear.

Privacy policy considerations

IBM Software products, including software as a service solutions, (“Software
Offerings”) may use cookies or other technologies to collect product usage
information, to help improve the end user experience, to tailor interactions with
the end user or for other purposes. In many cases no personally identifiable
information is collected by the Software Offerings. Some of our Software Offerings
can help enable you to collect personally identifiable information. If this Software
Offering uses cookies to collect personally identifiable information, specific
information about this offering’s use of cookies is set forth below.

Depending upon the configurations deployed, this Software Offering may use
session or persistent cookies. If a product or component is not listed, that product
or component does not use cookies.
Table 3. Use of cookies by InfoSphere Information Server products and components

Product module: Any (part of InfoSphere Information Server installation)
Component or feature: InfoSphere Information Server web console
Type of cookie that is used: Session, Persistent
Collect this data: User name
Purpose of data: Session management, Authentication
Disabling the cookies: Cannot be disabled

Product module: Any (part of InfoSphere Information Server installation)
Component or feature: InfoSphere Metadata Asset Manager
Type of cookie that is used: Session, Persistent
Collect this data: No personally identifiable information
Purpose of data: Session management, Authentication, Enhanced user usability, Single sign-on configuration
Disabling the cookies: Cannot be disabled

Product module: InfoSphere DataStage®
Component or feature: Big Data File stage
Type of cookie that is used: Session, Persistent
Collect this data: User name, Digital signature, Session ID
Purpose of data: Session management, Authentication, Single sign-on configuration
Disabling the cookies: Cannot be disabled

Product module: InfoSphere DataStage
Component or feature: XML stage
Type of cookie that is used: Session
Collect this data: Internal identifiers
Purpose of data: Session management, Authentication
Disabling the cookies: Cannot be disabled

Product module: InfoSphere DataStage
Component or feature: IBM InfoSphere DataStage and QualityStage® Operations Console
Type of cookie that is used: Session
Collect this data: No personally identifiable information
Purpose of data: Session management, Authentication
Disabling the cookies: Cannot be disabled

Product module: InfoSphere Data Click
Component or feature: InfoSphere Information Server web console
Type of cookie that is used: Session, Persistent
Collect this data: User name
Purpose of data: Session management, Authentication
Disabling the cookies: Cannot be disabled

Product module: InfoSphere Data Quality Console
Type of cookie that is used: Session
Collect this data: No personally identifiable information
Purpose of data: Session management, Authentication, Single sign-on configuration
Disabling the cookies: Cannot be disabled

Product module: InfoSphere QualityStage Standardization Rules Designer
Component or feature: InfoSphere Information Server web console
Type of cookie that is used: Session, Persistent
Collect this data: User name
Purpose of data: Session management, Authentication
Disabling the cookies: Cannot be disabled

Product module: InfoSphere Information Governance Catalog
Type of cookie that is used: Session, Persistent
Collect this data: User name, Internal identifiers, State of the tree
Purpose of data: Session management, Authentication, Single sign-on configuration
Disabling the cookies: Cannot be disabled

Product module: InfoSphere Information Analyzer
Component or feature: Data Rules stage in the InfoSphere DataStage and QualityStage Designer client
Type of cookie that is used: Session
Collect this data: Session ID
Purpose of data: Session management
Disabling the cookies: Cannot be disabled

If the configurations deployed for this Software Offering provide you as customer
the ability to collect personally identifiable information from end users via cookies
and other technologies, you should seek your own legal advice about any laws
applicable to such data collection, including any requirements for notice and
consent.

For more information about the use of various technologies, including cookies, for
these purposes, see IBM’s Privacy Policy at http://www.ibm.com/privacy and
IBM’s Online Privacy Statement at http://www.ibm.com/privacy/details the
section entitled “Cookies, Web Beacons and Other Technologies” and the “IBM
Software Products and Software-as-a-Service Privacy Statement” at
http://www.ibm.com/software/info/product-privacy.

Trademarks

IBM, the IBM logo, and ibm.com® are trademarks or registered trademarks of
International Business Machines Corp., registered in many jurisdictions worldwide.
Other product and service names might be trademarks of IBM or other companies.
A current list of IBM trademarks is available on the Web at www.ibm.com/legal/
copytrade.shtml.

The following terms are trademarks or registered trademarks of other companies:

Adobe is a registered trademark of Adobe Systems Incorporated in the United
States, and/or other countries.

Intel and Itanium are trademarks or registered trademarks of Intel Corporation or
its subsidiaries in the United States and other countries.

Linux is a registered trademark of Linus Torvalds in the United States, other
countries, or both.

Microsoft, Windows and Windows NT are trademarks of Microsoft Corporation in
the United States, other countries, or both.

UNIX is a registered trademark of The Open Group in the United States and other
countries.

Java and all Java-based trademarks and logos are trademarks or registered
trademarks of Oracle and/or its affiliates.

The United States Postal Service owns the following trademarks: CASS, CASS
Certified, DPV, LACSLink, ZIP, ZIP + 4, ZIP Code, Post Office, Postal Service, USPS
and United States Postal Service. IBM Corporation is a non-exclusive DPV and
LACSLink licensee of the United States Postal Service.

Other company, product or service names may be trademarks or service marks of
others.

Providing comments on the documentation
You can provide comments to IBM about this information or other documentation.

About this task

Your feedback helps IBM to provide quality information. You can use any of the
following methods to provide comments:

Procedure
v Send your comments by using the online readers' comment form at
www.ibm.com/software/awdtools/rcf/.
v Send your comments by e-mail to comments@us.ibm.com. Include the name of
the product, the version number of the product, and the name and part number
of the information (if applicable). If you are commenting on specific text, include
the location of the text (for example, a title, a table number, or a page number).



Printed in USA

GC19-4104-03
