IBM InfoSphere BigInsights
Version 3.0
Tutorials
GC19-4104-03
Note
Before using this information and the product that it supports, read the information in “Notices and trademarks” on page 145.
If you are using the InfoSphere BigInsights Quick Start Edition VMWare image,
you will find pre-populated Eclipse projects installed with the Eclipse client. Use
these projects to validate your progress in the Eclipse-related tutorials.
Within minutes, dive into the world of big data with robust, browser-based control.

Collect and import data for exploration and analysis that helps you make sense of seemingly unrelated data.

Delve into BigSheets, an intuitive spreadsheet-like tool, to create analytic queries without any previous programming experience.
Develop
Easily develop your first big data application by using the InfoSphere BigInsights Eclipse plugin.

Query
Quickly master the intricacies of SQL queries for Hadoop with IBM® Big SQL.

Predict
Explore, visualize, and model big data with IBM InfoSphere BigInsights Big R using the interactive R language.
Extract
Discover the power of Text Analytics by creating extractors to derive valuable insights from text documents.

Accelerate machine data
Use IBM Accelerator for Machine Data Analytics to import, extract, index, search, and analyze your machine data files. Not available in the InfoSphere BigInsights Quick Start Edition.

Accelerate social data
Use IBM Accelerator for Social Data Analytics to download, import, and analyze your social data files. Not available in the InfoSphere BigInsights Quick Start Edition.
Within minutes, you will be able to navigate and use the InfoSphere BigInsights Console to manage your big data environment.
Learning objectives
After completing the lessons in this tutorial, you will have learned how to
complete the following tasks:
v Use the InfoSphere BigInsights Console to inspect the status of your cluster, start
and stop components, and access tools that are available for open source
components.
v Work with the distributed file system. You will explore the distributed file
system (DFS) directory structure, create subdirectories, and upload files to
HDFS.
v Launch applications and inspect their status. You will also learn how to view
output in BigSheets, a spreadsheet-like tool.
Time required
In a non-SSL installation
    Enter the following URL in your browser: http://host_name:8080
2. Explore each section of the Welcome tab to learn more about the tasks and
resources that are available.
Understand IBM Big Data Tools
    An interactive model that provides an overview of the product capabilities in the InfoSphere BigInsights Knowledge Center.
Tasks
    Quick access to commonly used InfoSphere BigInsights tasks.
Quick Links
    Links to internal and external resources and downloads to enhance your environment.
Learn More
    Online resources to learn more about InfoSphere BigInsights.
Administrators use the InfoSphere BigInsights Console to inspect the overall health
of the system. They also use InfoSphere BigInsights Console to complete basic
functions such as starting and stopping specific servers and components, and
adding nodes to the cluster. Other users can interact with files in the distributed
file system and manage applications.
1. On the Welcome tab, select Access secure cluster servers under the Quick
Links section.
A pop-up window appears with a list of URLs and the alias for each URL. For
example, click the hive link, which opens the Hive Web Interface into a new
browser window. You see an open source tool that is provided with Hive for
administration purposes, such as browsing the database schema and creating a
session. Close the browser window to return to the InfoSphere BigInsights
Console home page.
2. On the Cluster Status tab, ensure that all InfoSphere BigInsights nodes are
running. If any node is not running, select it, and then click Start. If you would
like to see more information about a node, select it. From this view, you can
also stop a node if it is running. After you start these nodes, they should
remain active for the remainder of this tutorial.
By default, monitoring is disabled to optimize performance.
3. To explore your distributed file system, select the Files tab. Here, you can see
contents of the distributed file system, create new subdirectories, upload small
files for test purposes, and complete other file-related functions.
4. Become familiar with the functions that are provided by using the icons at the
top of the pane in the Files page. These icons are used throughout the tutorials.
Hover over an icon with your cursor to learn its function.
Lessons learned
Additional resources
To learn more about the tasks that you can complete by using InfoSphere
BigInsights, use the interactive conceptual models. These models provide insight
into some of the other tutorials that you can complete by using the product.
v Overview of InfoSphere BigInsights
v Developing applications by using the InfoSphere BigInsights Tools for Eclipse
v Creating text extractors by using Text Analytics
Business data is stored in various formats and sources. Before you import your
data into the InfoSphere BigInsights distributed file system, you must determine
what questions you want to answer through analysis, identify the data type of
your sources, and use the tools and procedures that best fit your business need.
You can use InfoSphere BigInsights with your existing infrastructure or data
warehouse to import data and content in its original formats, or you can import
huge volumes of at-rest (static) data or incoming data in motion (continually
updated data). After you import your data, you can explore the data separately or
combine the data to complete exploration and analysis.
For this tutorial, and the related tutorial on BigSheets, only the news and blog data that was returned by the search is used. The returned data was slightly modified to contain only a subset of the information that the BoardReader application collects from blogs and news feeds. The full-text/HTML content of posts and news items, and certain metadata, were removed to keep the size of each file manageable.
The BoardReader application requires a license for use. If you have a license, you
can choose to follow the steps in the lesson on using the BoardReader application
(Lesson 2), or download the data to your computer and import it to the InfoSphere
BigInsights distributed file system for use with the Distributed File Copy
application (Lesson 3). To obtain a license, see the BoardReader website.
Learning objectives
After you complete the lessons in this tutorial, you will understand the concepts
and know how to:
v Create a folder for your sample data in the InfoSphere BigInsights distributed
file system.
v Collect and import data by using the BoardReader application.
v Import data from your local system or network by using the Distributed File
Copy application.
v Locate imported data in the distributed file system for use in BigSheets, Big
SQL, and Text Analytics.
Time required
The time required to complete this tutorial depends on which method you choose to use to import your data, and on the cluster configuration and the number of nodes available for your use. If you choose to complete the BoardReader lesson, this tutorial takes approximately 20 minutes. If you use only the Distributed File Copy application, it takes approximately 5 minutes.
Prerequisites
Before you begin this tutorial, ensure that you installed the InfoSphere BigInsights
tools for Eclipse, and that you have access to the application through the
InfoSphere BigInsights Console.
For this module, there are two options for gathering your data. However, to best
manage your information you should first create a folder to store the data.
1. Open the InfoSphere BigInsights Console.
2. From the Files tab, select the DFS Files tab.
3. Create a directory to store this data in the distributed file system. Click the Create Directory icon ( ), enter a name for the directory in the Create Directory window, and click OK.
You now have a directory to store all of your source data files and application results.
You must create a credential file with the BoardReader key. There are private and
public files in the credentials store. The private credentials store contains your
private information in the /user/username/credstore/private directory.
If you want to import data by using an SFTP or FTP connection, make sure that
this connection is running on your system.
Collecting social media data can be challenging because each site can hold different information and use varying data structures. Also, visiting numerous sites to gather your information is a time-consuming process. For this lesson, you use the BoardReader sample application that is provided with InfoSphere BigInsights to search blogs, news feeds, discussion boards, and video sites.
1. Deploy the BoardReader application to make it available for your use.
a. In the InfoSphere BigInsights Console, in the Applications tab, click
Manage.
b. From the navigation tree, expand the Import directory.
c. Select the BoardReader application, and click the Deploy button ( ).
d. In the Deploy Application window, select Deploy.
2. From the toolbar on the top of the hierarchy tree window, select Run.
3. Select the BoardReader application.
4. Define the Execution name of your project. This step creates a project, and you
can track the results and reuse the project later. For example, enter the
Execution name br_ibmwatson.
5. Define your application parameters.
a. In the Results path field, specify the directory for the application's output.
Use the Browse button to locate the file /bi_sample_data/bigsheets in the
Hadoop Distributed File System (HDFS) directory. If you are using the
InfoSphere BigInsights Quick Start Edition, the directory is
/user/biadmin/bi_sample_data/bigsheets.
b. Define the Maximum matches that you want the search to return. Since you want to be able to use this data for full-scale analysis, enter 1,000.
c. Select a Start date and an End date. Define a specific past time frame for
the BoardReader to search. To search for this Watson data, define the start
date as January 1, 2011. Define the end date as March 31, 2012.
d. Select a Properties file. The Properties file references the file in the
InfoSphere BigInsights credentials store that was populated with the
BoardReader license key.
e. In the Search terms field, enter the term "IBM Watson" as the subject of this
search. This string causes the BoardReader application to search for any
instance of both terms appearing together.
6. Select Run to run the search in the BoardReader application. The data is
imported to the specified results path.
7. Verify that the BoardReader application conducted a successful search. You can examine the status in the Application History panel. Then, return to the Files tab to locate the imported data in the results path.
To use the Distributed File Copy application with SFTP, you can create a credential
file. There are private and public files in the credentials store. The private
credentials store contains the private information for each user that is in the
/user/username/credstore/private directory. The following is an example of what
the properties file for SFTP might look like:
database=db2inst2
dbuser=pascal
password=[base64]LDo8LTor
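The [base64] prefix in the example above indicates that the password value is stored base64-encoded rather than in plain text. As a hedged illustration only (this is not the product's own credential tooling, and the sample password is hypothetical), the notation can be mimicked with Python's standard library:

```python
import base64

def encode_password(plain):
    """Base64-encode a password string, mimicking the [base64] notation."""
    return "[base64]" + base64.b64encode(plain.encode("utf-8")).decode("ascii")

def decode_password(value):
    """Decode a value that carries the [base64] prefix back to plain text."""
    assert value.startswith("[base64]")
    return base64.b64decode(value[len("[base64]"):]).decode("utf-8")

# Hypothetical example; not the key from the properties file above.
encoded = encode_password("secret")
print(encoded)                   # [base64]c2VjcmV0
print(decode_password(encoded))  # secret
```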
Note: The Distributed File Copy application is designed to move large amounts of data and to run on a Linux platform. To upload smaller data sets (less than 2 GB), you can use the Upload function from the Files tab in the InfoSphere BigInsights Console.
For this lesson, you will download the IBM Watson data that was the result of the
BoardReader application search to your local system, and then upload it to the file
system for analysis.
Before you begin, you must first download the data to your local system. The data
is in the Download section of the developerWorks article, "Analyzing social media
and structured data with InfoSphere BigInsights: Get a quick start with BigSheets".
Accept the terms and conditions and save the file article_sampleData to your local
system. After you unzip the file, the article_sampleData folder should contain the
files RDBMS_data.csv, blogs-data.txt, news-data.txt, and a README.txt file that
details the data output.
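Before you upload the files, it can help to sanity-check them locally. The following is a minimal sketch, assuming Python is available on your workstation; the inline sample rows and column names are invented for illustration and are not the real contents of RDBMS_data.csv:

```python
import csv, io

# In practice you would use open("RDBMS_data.csv"); a small inline sample
# keeps this sketch self-contained. The columns here are hypothetical.
sample = io.StringIO(
    "id,channel,date\n"
    "1,press release,2011-02-14\n"
    "2,blog,2011-02-16\n"
)

rows = list(csv.DictReader(sample))   # one dict per data row
print(len(rows), "rows")              # 2 rows
print(sorted(rows[0].keys()))         # ['channel', 'date', 'id']
```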
1. Deploy the Distributed File Copy application to make it available for your use.
a. In the InfoSphere BigInsights Console, in the Applications tab, click
Manage.
b. From the navigation tree, expand the Import directory.
c. Select the Distributed File Copy application, and click the Deploy button (
).
d. In the Deploy Application window, select Deploy.
2. From the toolbar on the top of the hierarchy tree window, select Run.
3. Select the Distributed File Copy application.
4. Define your application parameters.
a. Specify an Execution name. This step creates a project, and you can track
the results and reuse the project later. Name the execution dc_ibmwatson.
b. In the Input path field, specify the fully qualified path to the
article_sampleData file on your local file system. For example,
sftp://username:password@localhost/file/path/article_sampleData/
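The input path above follows standard URL structure. As an illustrative aside (not part of the product), Python's urllib shows how the pieces break down; the credentials and path below are placeholders matching the pattern in the step:

```python
from urllib.parse import urlsplit

# Placeholder URL mirroring the pattern shown in the step above.
url = "sftp://username:password@localhost/file/path/article_sampleData/"
parts = urlsplit(url)

print(parts.scheme)    # sftp
print(parts.username)  # username
print(parts.hostname)  # localhost
print(parts.path)      # /file/path/article_sampleData/
```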
Lessons learned
BigSheets uses a spreadsheet-like interface that can model, filter, combine, and
chart data collected from multiple sources, such as an application that collects
social media data by crawling the Internet.
In this tutorial, you link social media data about IBM Watson with simulated internal IBM data about media outreach efforts. Your goal is to analyze the visibility, coverage, and sentiment around IBM Watson, a common requirement for data analysts regarding their products.
This tutorial teaches you the key aspects of BigSheets so that you can quickly
begin analyzing your own big data.
Learning objectives
After you complete the lessons in this module, you will understand the concepts
and processes associated with:
v Creating master workbooks from files that you upload into your distributed file
system cluster
v Creating child workbooks to tailor and explore data
v Merging data from two sources into one workbook
v Creating columns to group and sort data
v Viewing data in diagrams to see the history of a workbook and relationships
between workbooks
v Charting and refining the results of your analysis
v Exporting your results
Time required
Note: For the purposes of this tutorial, you are uploading sample data files that
are less than 2 GB. To load files larger than 2 GB, you must use the Import feature.
For more information, see Tutorial: Importing data for analysis.
Master workbooks protect and preserve the raw data in its original form. If, during
your data explorations, you accidentally remove a column, you can create a new
child workbook from the master workbook without reloading the original data.
Master workbooks also model the data format. This format is determined by
applying a reader, a data format translator that maps data into the spreadsheet-like
structure necessary for BigSheets. BigSheets provides several built-in readers for
working with common data formats.
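Conceptually, a reader such as the JSON Array reader flattens each record into a row and each field into a column. The following rough Python sketch shows that mapping; the two sample records are invented, although the field names echo columns that appear later in this tutorial:

```python
import json

# A tiny stand-in for a JSON Array data file: one object per record.
data = json.loads("""
[
  {"Country": "us", "Language": "English", "Type": "blog"},
  {"Country": "fr", "Language": "French",  "Type": "news"}
]
""")

columns = sorted(data[0].keys())                   # column headings
rows = [[rec[c] for c in columns] for rec in data] # one row per record

print(columns)  # ['Country', 'Language', 'Type']
print(rows[0])  # ['us', 'English', 'blog']
```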
1. Collect the social media files:
a. In your web browser, enter the following URL: http://www.ibm.com/
developerworks/data/library/techarticle/dm-1206socialmedia/. This
URL takes you to a BigSheets article on IBM developerWorks.
b. Scroll down until you see the Download section. Click the sampleData.zip
file, review the terms and conditions, and then click I ACCEPT THE
TERMS AND CONDITIONS.
d. Click the Create Directory icon ( ).
e. In the Name field of the Create Directory window, enter Watson_data, and
click OK.
Click the green check mark ( ). You immediately see the data map to the columns and
rows of the spreadsheet-like interface in the Preview area.
c. Since the data columns exceed the viewing space, click Fit column(s). The
first eight columns display in the Preview area.
Note: Depending on the size of your web browser window, you might
need to scroll to see Fit column(s).
d. Click Save as Master Workbook.
e. In the Name field, enter Watson_Blogs. Spaces are valid characters for
workbook names.
f. In the Description field, enter Watson blog data from blogs-data.txt,
then click Save.
10. Click the Edit icon ( ), select JSON Array from the drop-down list, and
12. Save the master workbook by clicking the green check mark ( ) in the
lower right corner of the screen.
Note: Depending on the size of your web browser window, you might need
You are now ready to explore the data that you loaded.
In addition to protecting the original data, master workbooks set the data format
(including the data types for the columns). Therefore, you must create child
workbooks in which to modify your data. Child workbooks inherit their format
and data from their master workbooks, but you can tailor their attributes to
display only necessary data.
1. From the BigSheets tab of the InfoSphere BigInsights Console, select the
Watson_News master workbook.
2. Click Build new workbook.
A new workbook is created with the name: Watson_News(1).
3. Rename the workbook by clicking the Edit icon ( ) and entering Watson News Revised.
Learn more about column actions: Notice all the column actions that are
available to you in the drop-down list. You can rename, hide, and remove a
column; insert a new column; sort the data in a column; and organize the
columns.
When you remove columns from a child workbook, you delete only the data
from the child workbook. The master workbook on which this child workbook
is based always contains the original data as it was loaded. If you decide later
that you want the IsAdult data in your analysis, you can create another child
workbook from the Watson_News master workbook.
Why not just hide the IsAdult column?: When you hide a column, the data
in that column is still included when you run the workbook or create a chart.
The only way to remove the data from the analysis or chart is to remove the
column.
6. As you review the data in this Watson News Revised child workbook, you
decide that you do not need several other columns. You can use the same
method, as in the previous step, to remove them one at a time or remove
multiple columns at once:
a. Click the down arrow in any column heading, and select Organize
Columns.
b. Click the red X ( ) next to the following columns to mark them for
removal:
v Crawled
v Inserted
v MoveoverUrl
v PostSize
Tip: If you accidentally remove more columns than you intend, you can
click Undo to undo your last action.
7. Click Fit column(s) to resize the remaining columns. You now see columns A
through H:
Table 1. View of the table after you click Fit column(s)
A        B         C         D          E            F     G     H
Country  FeedInfo  Language  Published  SubjectHtml  Tags  Type  Url
8. Save and exit the workbook by clicking Save and selecting Save & Exit. If you
are prompted with a Save workbook window, you can save the workbook
with or without entering a description.
9. You are prompted with the message This workbook has never been run.
Press Run to run it or Close to dismiss this message. Click Run. You see
a progress indicator in the upper right corner of the window.
Until now, you have been working with a subset of the Watson and internal
IBM data. BigSheets keeps only a limited number of rows in memory. The
lower right corner displays a message that indicates you are seeing only a
simulated sample of 50 rows of data. When you run the data, you apply all
changes that you made since the last time you saved the workbook to the full
data set.
Learn more about the differences in icons for master workbooks and
child workbooks: Notice that the Watson News Revised workbook has a
different icon from the master workbook, whose icon ( ) shows a lock over
the spreadsheet image, indicating that the master workbook is read-only.
You can quickly distinguish master workbooks from child workbooks by
these icons.
c. Rename the new child workbook by clicking the Edit icon ( ), typing Watson Blogs Revised.
Because both new child workbooks have the same schema, you can merge them
into a new workbook, where you can explore and analyze your data.
To merge the data, create a new workbook from an existing workbook, then load
the data from the second workbook into the new workbook.
Learn more about types of sheets: Each type of sheet provides different
predefined logic for analyzing data. Use the Load sheet to include the data of
another workbook as a sheet in the current workbook.
4. In the Load window, select the Watson Blogs Revised workbook link from the
list of existing workbooks.
5. In the Sheet Name field, enter Watson Blogs Revised. In the Load window,
you see details of the columns and the first few rows of data in that
workbook.
6. Click the green check mark ( ). At the bottom of your workbook, you see
two tabs, Watson News Revised and Watson Blogs Revised.
7. Click Add sheets, and select Union.
8. In the Sheet Name field of the New sheet: Union dialog, enter News and Blogs
to indicate that this sheet contains the merged data.
9. From the Select sheet drop-down list, select the Watson News Revised sheet,
click the green plus sign ( ) to add the sheet (you see the sheet move to the
bottom of the dialog). From the Select sheet drop-down list, select the Watson
Blogs Revised sheet, click the green plus sign ( ) to add the sheet (you see
the sheet move to the bottom of the dialog). Then click the green check mark (
) to add both sheets. Your workbook now displays the new tab, News
and Blogs, at the bottom of your screen.
10. Click Save. When prompted for a name and description, enter Watson News
Blogs in the Name field and Combined news and blogs data in the
Description text box, and click Save.
You successfully combined the blog and news data into one workbook, where you
can analyze and explore the data. Next, you group similar data from multiple
columns into one column.
First, use the Calculate function to count the number of articles and posts by
language. Then, sort the column by language to display the most popular
languages first.
Click the green check mark ( ).
On the Group by language sheet, you see two columns, Language and
NumberArticlesandPosts. The Language column displays all the languages
from the News and Blogs sheet. The NumberArticlesandPosts column counts
the number of posts in each language.
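The Group and Calculate steps behave like a classic group-by-and-count followed by a descending sort. A rough Python equivalent is sketched below; the post languages are sample values, not the real data set:

```python
from collections import Counter

# Sample values standing in for the Language column of the News and Blogs sheet.
posts = ["English", "Russian", "English", "Spanish", "English", "Russian"]

counts = Counter(posts)  # group by language and count posts
ranked = sorted(counts.items(), key=lambda kv: kv[1], reverse=True)

print(ranked)  # [('English', 3), ('Russian', 2), ('Spanish', 1)]
```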
4. To see the most common languages for posts about IBM Watson, sort the
Group sheet by the number of posts. Click the drop-down arrow to the right of
the NumberArticlesandPosts column, select Sort, and select Descending. You
see that English is the most popular language with 3169 posts, followed by
Russian, Spanish, and Chinese - Simple. But notice that Chinese (spelling) and
Chinese - Traditional are also near the top of the list. You combine these values
into one Chinese language value later when you create a chart.
5. Click Save > Save & Exit to save and close the workbook.
6. Click Run to save, sort, and process the entire data set for the workbook. You
see a progress indicator in the upper right corner of the window. After you run
the workbook, you see different results for the number of English posts in the
NumberArticlesandPosts column, 5464.
View your workbooks and sheets in the BigSheets diagrams. There, you visualize
the results of your analysis by creating and refining charts.
The Workbook Diagram ( ) beside a child workbook shows the sheets and
processes that created the selected workbook, the relationships between
workbooks, and on which master or child workbook the current workbook is
based.
3. Click the Workbook Diagram icon ( ). In the diagram, you can see the
types of sheets and history of the current workbook.
4. When you are finished looking at the diagram, click the red X ( ) in the upper
right corner.
You now know how the two diagrams in BigSheets can help you visualize the
relationships between workbooks and sheets and the process of creating a
workbook. Now you are ready to explore and analyze the data in the Watson News
and Blogs workbook.
BigSheets provides various charts and maps. A chart plots data points in a grid,
such as a typical pie or bar chart. A cloud shows the importance of values by
displaying the size of the words relative to their importance. A map contains charts
that represent geographic data, such as a heat map that shows the concentration of
data points geographically.
1. Open the Watson News Blogs workbook, click Add chart, and then select chart
> Horizontal Bar.
2. In the New chart: Horizontal Bar window, enter or select the following values:
a. In the Chart Name field, enter Language Coverage. The chart name is the
name that displays on the tab at the bottom of the worksheet.
b. In the Title field, enter IBM Watson Coverage by Language. The title of the
chart displays at the top of the chart.
c. From the X Axis drop-down list, select NumberArticlesandPosts.
d. In the X Axis Label, enter Number of posts.
e. From the Y Axis drop-down list, select Language.
f. In the Y Axis Label, enter Language of post.
g. From the Sort By drop-down list, select X Axis. You want to sort by the
number of posts.
h. From the Occurrence Order drop-down list, select Descending. You want
to see the language with the highest number of posts first.
i. In the Limit field, enter 12. You want to see only the top 12 languages by
the number of posts.
j. Leave the Template and Style default values.
k. Click the green check mark ( ) to preview the chart with sample data.
3. Click Run to generate the chart from the full set of workbook data. Even
though you see the preview chart immediately, the actual chart is not
generated until the run completes.
Click the green check mark ( ) to create the column. Your cursor moves to
the fx (or function) area, where you provide the function to generate the
contents of the new column.
e. Enter the following formula as the function: IF(SEARCH('Chin*',
#Language) > 0, 'Chinese', #Language), and click the green check mark (
) to apply the formula and generate the values for the new
Language Revised column.
This formula searches the Language column (indicated by #column_name)
for any value that starts with Chin and combines those values into one
value in the Language Revised column. The wildcard asterisk character
ensures that all variations of the Chinese language, regardless of spelling
or words that follow the word Chinese (such as Chinese Simple), are
included. If the value does not start with Chin, then the formula copies the
value, as is, into the Language Revised column.
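The effect of the formula can be mirrored in ordinary code: any value matching the Chin* wildcard collapses to Chinese, and everything else passes through unchanged. A minimal Python sketch of the same wildcard logic:

```python
from fnmatch import fnmatch

def revise_language(language):
    """Mirror IF(SEARCH('Chin*', #Language) > 0, 'Chinese', #Language)."""
    return "Chinese" if fnmatch(language, "Chin*") else language

print(revise_language("Chinese - Simple"))       # Chinese
print(revise_language("Chinese - Traditional"))  # Chinese
print(revise_language("English"))                # English
```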
Language_Revised, and click the green plus sign ( ) to add the column.
h. Click the red X ( ) next to the Language column to remove it. You want
to group and calculate the number of posts by the Language_Revised
column instead of the Language column.
i. Click the Calculate tab. In the Column drop-down list, select
Language_Revised.
j. Click the green check mark ( ) to preview the chart with sample data.
11. Click Run to generate a new chart. After the chart completes, all Chinese
languages are combined into one bar in the bar chart, which shows Chinese as
the second most popular language for posts and Russian as the third. If you
hover over the bars in the chart, you can see the actual numbers of posts.
You used BigSheets to generate a simple horizontal bar chart from your social
media data collections. You also analyzed the bar chart and refined the data to
determine the 12 most commonly used languages to generate posts about IBM
Watson.
You might want to share the results of your BigSheets analysis with colleagues
who do not have direct access to IBM InfoSphere BigInsights. You can export your
analysis results in various data formats, including CSV (comma-separated values),
JSON Array, and TSV (tab-separated values).
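The three export formats carry the same rows with different delimiters and structure. The sketch below shows what one row looks like in each format; the column names come from this tutorial, but the row values are samples:

```python
import csv, io, json

columns = ["Language", "NumberArticlesandPosts"]
row = ["English", 5464]

# CSV and TSV differ only in the delimiter character.
for delim in (",", "\t"):
    buf = io.StringIO()
    writer = csv.writer(buf, delimiter=delim, lineterminator="\n")
    writer.writerow(columns)
    writer.writerow(row)
    print(buf.getvalue())

# JSON Array wraps each row as an object keyed by column name.
print(json.dumps([dict(zip(columns, row))]))
# [{"Language": "English", "NumberArticlesandPosts": 5464}]
```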
You just exported the results of the Watson News Blogs workbook into both a web
browser tab and a CSV file on your distributed file system (DFS) cluster.
Lessons learned
Additional resources
To learn more about how to use BigSheets to analyze your big data, see the
following resources:
v Overview of BigSheets
v Analyzing data with BigSheets
Learn more about the sample code: The code that is shown in this tutorial is
intended for educational purposes only, and is not intended for use in a
production application.
You develop the application in Jaql, a query and scripting language that uses a
data model based on the JavaScript Object Notation (JSON) format. You learn how
to create, publish, and deploy the application by using the InfoSphere BigInsights
Tools for Eclipse so that you can run the application from the InfoSphere
BigInsights Console. This module does not cover the full syntax and usage of the
Jaql language, which is explained in the InfoSphere BigInsights Information Center.
However, you can apply many of the application development techniques in this
module to other applications.
Learning objectives
After completing the lessons in this tutorial, you will have learned how to
complete the following tasks:
v Create an InfoSphere BigInsights project.
v Create and populate a Jaql file with application logic.
v Test your application.
v Publish your application to the InfoSphere BigInsights catalog.
v Deploy and run your application on the cluster.
v Upgrade your application to accept input parameters.
Time required
Prerequisites
The InfoSphere BigInsights Tools for Eclipse must be installed in your Eclipse
environment.
Experience with Eclipse is not required, but understanding the concepts and the
development environment might be helpful when working with the InfoSphere
BigInsights Tools for Eclipse.
The project that you create will contain the files, applications, programs, and
modules that your application requires to run. After you create a project, you can
create an InfoSphere BigInsights program.
1. Open Eclipse.
The project that you created, WriteMessage, displays in the Project Explorer pane.
Now that your project is created, you can create a program. In this module, you
are creating a Jaql application.
Before you begin, ensure that the InfoSphere BigInsights Tools for Eclipse is open.
You should also ensure that you have write access to the biadmin directory. You
can check access by opening the InfoSphere BigInsights Console, and then selecting
the Files tab. Under DFS Files, select server name > user > biadmin. In the right
pane of the biadmin folder, view the permission column to verify your write
access.
1. From the Task Launcher for Big Data, click the Develop tab.
2. Under Tasks, click Create a BigInsights program.
3. In the Create a BigInsights program window, select JAQL Script, and then click
OK.
4. Select the parent folder, WriteMessage, enter MyJaql.jaql as the file name, and
then click Finish.
Your new file, MyJaql.jaql, opens in an editor within Eclipse.
5. Copy and paste the following code into the MyJaql.jaql file.
The following code writes the results of the Jaql query as a text file (myMsg.txt)
to the /user/biadmin/sampleData directory in your distributed file system. You
might need to modify the specified directory to match your environment.
Ensure that your user ID has write access to the directory that you specify.
// sample message
term='Hello World';
// Location of the output file. Modify this location to fit your environment.
output='/user/biadmin/sampleData/myMsg.txt';
Before you begin this lesson, ensure that the InfoSphere BigInsights Tools for
Eclipse is open.
1. Create a server connection to the InfoSphere BigInsights Console. Because Jaql
is run from the Eclipse environment, your operating system user ID from
where you run the Jaql shell must be the same as your InfoSphere BigInsights
user ID.
a. In the Overview tab of the Task Launcher for Big Data, under First Steps,
click Create a BigInsights server connection.
In a non-SSL installation, enter the following URL in your browser:
http://host_name:8080
8. In the InfoSphere BigInsights Console, from the Files tab, expand the directory
from step 5 in Lesson 2 in your distributed file system tree to locate the .txt
file that your application created.
Tip: After you identify a server connection and create a Jaql program
configuration, you can test Jaql statements directly from your MyJaql.jaql file.
Highlight the statement that you want to run, right-click, and select Run the
JAQL statement.
Now that your Jaql program is working, you can publish it as an application in the
InfoSphere BigInsights applications catalog.
// The full path and file name that the user enters for the output.
output=[OUTPUT];
You might need to refresh the navigation view by clicking the Refresh
icon.
11. Optional: In the InfoSphere BigInsights Tools for Eclipse, expand the
WriteMessage project. The InfoSphere BigInsights Tools for Eclipse publication
wizard generated the workflow.xml file and the application.xml file.
a. To see the generated workflow, expand BIApp > workflow. Double-click
the workflow.xml file to open it in the InfoSphere BigInsights workflow
editor. From this editor, you can change the workflow without writing
XML code.
b. To view the values that you set for your input parameters, expand BIApp
> application. Double-click the application.xml file to open it in the XML
editor.
c. On the Design tab of the XML editor, expand application-template >
properties > property to view the values that you set.
Lessons learned
Additional resources
Big SQL provides SQL access to data that is stored in InfoSphere® BigInsights™ by
using JDBC, ODBC, and other connections. Big SQL supports large ad hoc queries
by using IBM SQL/PL support, SQL stored procedures, SQL functions, and IBM
Data Server drivers. These queries are low-latency queries that return information
quickly to reduce response time and provide improved access to data.
This tutorial uses data from the fictional Sample Outdoor Company. The Sample
Outdoor Company began as a business-to-business operation. It does not
manufacture its own products. The products are manufactured by a third party
and are sold to third-party retailers. The company has a presence on the web and
sells directly to consumers through the online store. For the last several years, the
company has steadily grown into a worldwide operation, selling its line of
products to retailers in nearly every part of the world.
You will learn more about the products and sales of the Sample Outdoor Company
by running Big SQL queries and analyzing the data in the following lessons.
Learning objectives
You will use the InfoSphere BigInsights Tools for Eclipse and the Big SQL Console
to create Big SQL queries so that you can extract large subsets of data for analysis.
In one lesson, you will export your query results to an open source spreadsheet to
see how you can bring your analysis down to a smaller environment.
After you complete the lessons in this module, you will understand the concepts
and know how to do the following actions:
v Use the InfoSphere BigInsights Tools for Eclipse to connect to the Big SQL
server.
v Use the InfoSphere BigInsights Tools for Eclipse to load sample data and to
create and run queries.
v Use BigSheets to analyze data that is generated from Big SQL queries and to
create Big SQL tables.
v Use the InfoSphere BigInsights Tools for Eclipse to export data.
Time required
Skill level
Some familiarity with SQL. This tutorial includes lessons that are relevant to a new
Big SQL user and an advanced Big SQL user.
For the purposes of this tutorial, Eclipse is used as a client for the Big SQL server.
A few of the lessons also use Java™ SQL Shell (JSqsh), an open source
command-line client.
Procedure
1. Verify that Big SQL is started.
2. Verify that Eclipse is installed and that you have added the InfoSphere
BigInsights Tools for Eclipse.
3. Create a connection to a Big SQL server.
4. Connect to an InfoSphere BigInsights server.
Results
You have now established communication between the Big SQL server, InfoSphere
BigInsights, and the Eclipse client environment.
Procedure
1. In the InfoSphere BigInsights Console, click the Files tab.
2. Click the DFS Files tab, and then create a directory that you can use to hold
the SQL files that contain the DDL you must use:
a. Open the biadmin directory in this path: hdfs://<server-name>:9000/user/biadmin/.
b. Click the Create Directory icon ( ).
c. In the Create Directory dialog window, type bi_sample_data as the new
directory name, and then click OK.
Quick Start Edition VM Users: If you are running the tutorial with the IBM
InfoSphere BigInsights Quick Start Edition VM image, skip this task, and instead,
follow the steps to access the data in “Quick Start Edition VM Image Users Only”
on page 41.
Because you will use the Distributed File Copy application with the SFTP protocol,
you must create a credential file that you store in the distributed file system (DFS).
A credential properties file contains the IBM Big SQL server credentials. If you do
not already have a credential file stored in the distributed file system, do the
following tasks:
1. Create a credentials property file by running the credstore.sh utility from
$BIGINSIGHTS_HOME/bin. For information about the credential store utility, see
Remove the -pub parameter to store the credentials in a private directory. If you
used a different user and password for the Big SQL administrator, update those
fields. Remember that the user name must have database administration
privileges. Modify the server information to match the host name of the cluster
from which you are running IBM InfoSphere BigInsights.
2. Open the InfoSphere BigInsights Console and click the Files tab.
3. Verify that your credentials file is in the distributed file system.
The Distributed File Copy application copies files to and from a remote source to
the InfoSphere BigInsights distributed file system by using Hadoop Distributed
File System (HDFS), GPFS, FTP, or SFTP. You can also copy files to and from your
local Linux file system. In this tutorial, the examples assume that you are using a
Hadoop distributed file system, and that you installed a user account called
biadmin.
Procedure
1. Deploy the Distributed File Copy application to make it available for your use:
a. In the InfoSphere BigInsights Console, click the Applications tab, and then
click Manage.
b. From the navigation tree, expand the Import directory.
c. Select the Distributed File Copy application, and click Deploy (
)
d. In the Deploy Application window, select Deploy.
2. From the toolbar on the navigation tree window, select Run.
3. Select the Distributed File Copy application.
4. Define your application parameters:
a. Type get_samples in the Execution name field. You are creating an instance
of a job that you can run. You can track the results by this Execution name
and then reuse the job later.
The port number must be available, so do not use 8080 as the port number.
Make sure that you change my.server.com to the correct server name.
c. In the Output path field, specify the fully qualified path to where you want
to store the data on the distributed file system. For the purposes of this
lesson, specify the following path:
/user/biadmin/bi_sample_data/
The input path points to a directory of SQL (queries), so the output path
contains a directory that is named queries.
d. Use the Browse button to specify the fully qualified path to your credential
properties file in the InfoSphere BigInsights credentials store.
user/biadmin/credstore/public/bigsql.prop
5. Click Run. The queries directory, which is the directory that you requested in
the input path of the Distributed File Copy application, is uploaded to the
distributed file system.
6. Now that you have the statements to DROP, CREATE, and LOAD tables in the
distributed file system, get the data that you will use to load the tables.
a. While still in the Distributed File Copy application, type get_data in the
Execution name field.
b. In the Input path field, specify the fully qualified path to the
$BIGSQL_HOME/bigsql/samples/data directory on your local Linux file
system.
sftp://bigsql:bigsql@my.server.com:22
/opt/ibm/biginsights/bigsql/samples/data
c. In the Output path field, specify the same fully qualified path to where you
want to store the data on the distributed file system that you did for the
queries:
/user/biadmin/bi_sample_data/
d. Make sure the Credential file path for SFTP field still contains the path to
the bigsql.prop file:
user/biadmin/credstore/public/bigsql.prop
e. Click Run. The data directory, which is the directory that you requested in
the input path of the Distributed File Copy application, is uploaded to the
distributed file system.
7. From the queries directory in the distributed file system, download the
GOSALESDW_drop.sql, GOSALESDW_ddl.sql, and GOSALESDW_load.sql files to a
local directory so that you can import the files to Eclipse in a later lesson.
Results
These three GOSALESDW_*.sql files are the only SQL scripts that you will use to
drop the tables, or create and populate the tables. More SQL scripts exist in this directory.
Procedure
1. Open the InfoSphere BigInsights Tools for Eclipse that is installed with your
VM image.
2. Select the project template called myBigSQL_Tutorial5_setup in the project
explorer.
3. Right-click the project and select Open Project.
4. The project contains three SQL files. You will run these files one at a time:
GOSALESDW_drop.sql
You can skip this SQL file if you have never created the tables in the
GOSALESDW schema. Otherwise, open the GOSALESDW_drop.sql file,
and click the Run SQL icon ( ).
Eclipse returns results for each statement. When all of the statements in
the GOSALESDW_drop.sql file are completed successfully, continue to the
next file.
The GOSALESDW_drop.sql file contains SQL statements that drop any
tables in the GOSALESDW schema that might have already been
created.
GOSALESDW_ddl.sql
Open the GOSALESDW_ddl.sql file, and click the Run SQL icon ( ).
Eclipse returns results for each statement. When all of the statements in
the GOSALESDW_ddl.sql file are completed successfully, continue to the
next file.
The GOSALESDW_ddl.sql file contains SQL statements to create the
schema and the tables. The first line of this file creates the
GOSALESDW schema. A Big SQL schema is a way to logically group
objects, such as tables or functions. The second line of this file (the USE
clause) declares a default schema for the session. All unqualified table
names that are referenced in Big SQL statements and DDL statements
default to this schema. If no USE clause is present, the default is your
User ID on the cluster.
In the later lessons, the USE clause is not used, because all of the tables
that are referenced are fully qualified, which means that you include an
unambiguous schema name as part of the table name. Therefore,
instead of running the statement as in Example 1, use the fully
qualified reference as in Example 2:
Example 1: With no schema qualification
SELECT * FROM
go_region_dim;
Example 2: Fully qualified table name
SELECT * FROM
GOSALESDW.go_region_dim;
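The effect of schema qualification can be mimicked outside of Big SQL. This Python sketch uses SQLite purely as a stand-in (an attached database plays the role of the GOSALESDW schema, and the table contents are invented):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Attach a second in-memory database named gosalesdw to act as a schema.
conn.execute("ATTACH ':memory:' AS gosalesdw")
conn.execute(
    "CREATE TABLE gosalesdw.go_region_dim (region_key INTEGER, region TEXT)")
conn.execute(
    "INSERT INTO gosalesdw.go_region_dim VALUES (1, 'Americas'), (2, 'Europe')")

# Example 2 style: the fully qualified name is unambiguous.
qualified = conn.execute(
    "SELECT region FROM gosalesdw.go_region_dim ORDER BY region_key").fetchall()
print(qualified)  # [('Americas',), ('Europe',)]
```

The unqualified form works only while a default schema (or, in SQLite, the name-resolution search order) happens to point at the right place; the qualified form always resolves to the same table.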
The SQL Results view contains the results of your SQL statements or scripts.
You can change the display of the results page, and also the number of rows
that are returned from each query (the default is 500).
For the module on Publishing an IBM Big SQL application, you need to download
data that was created for a developerWorks article. This article contains data about
the occurrences of the phrase IBM Watson in various social media sources. It will
also be used to demonstrate some of the interaction that is possible between Big
SQL and BigSheets.
Procedure
1. Download the IBM Watson data to your local file system.
The data is in the Download section of the developerWorks article, "Analyzing
social media and structured data with InfoSphere BigInsights: Get a quick start
with BigSheets". Accept the terms and conditions and save the file
article_sampleData to your local system.
2. Extract the file to the following path on your local file system:
/home/biadmin/samples_tutorial.
The article_sampleData directory contains the following files:
v RDBMS_data.csv
v blogs-data.txt
v news-data.txt
v README.txt
3. Note the path to which you extracted the directory. For example, if you
extracted to your Linux directory called samples_tutorial, the full path is
/home/biadmin/samples_tutorial/article_sampleData.
4. Upload the article_sampleData directory to the distributed file system in the
The time range of the fictional Sample Outdoor Company data is three years and
seven months, starting January 1, 2004 and ending July 31, 2007. The 43-month
period reflects the history that you will analyze.
Quick Start Edition VM Users: If you are running the tutorial with the IBM
InfoSphere BigInsights Quick Start Edition VM image, you can skip these steps and
proceed to “Module 1: Creating and running SQL script files” on page 48.
Procedure
1. Open the InfoSphere BigInsights Tools for Eclipse that you installed on your
workstation environment.
2. Create an IBM InfoSphere BigInsights project in Eclipse:
a. From the Eclipse menu bar, click File > New > Other.
b. In the Select a wizard window, expand the BigInsights directory, select
BigInsights Project, and then click Next.
c. Type myBigSQL in the Project name field, and then click Finish.
d. If you are not already in the BigInsights perspective, in the message that
displays, click Yes to switch to the BigInsights perspective.
3. Import the SQL scripts into the Eclipse project:
a. From the Eclipse Project Explorer view, right-click the myBigSQL project
and click Import.
b. In the Import window, select General > File System and click Next.
4. From the Project Explorer in the Eclipse BigInsights perspective, expand the
myBigSQL project. Double-click the appropriate *.sql file to open it in the Big
SQL editor. You can then run the statements in the file from the editor. You are
going to run each file in order to drop tables, create tables, and load data into
the tables:
Learn more about the SQL editor window: In the SQL editor window, you can
run SQL statements as you edit them, select connection profiles, and import or
export SQL statements.
You can use the context assistant to help you complete statements. A syntax
checker adds a red indicator next to any invalid line. You can hover over that
indicator to see the reason for the problem.
a. You can skip this step if you have never created the tables in the
GOSALESDW schema. Open the GOSALESDW_drop.sql file, and click the Run
SQL icon ( ).
Eclipse returns results for each statement. When all of the statements in the
GOSALESDW_drop.sql file are completed successfully, continue to the next file.
The GOSALESDW_drop.sql file contains SQL statements that drop any tables
in the GOSALESDW schema that might have already been created.
b. Open the GOSALESDW_ddl.sql file, and click the Run SQL icon ( ).
Eclipse returns results for each statement. When all of the statements in the
GOSALESDW_ddl.sql file are completed successfully, continue to the next file.
The GOSALESDW_ddl.sql file contains SQL statements to create the schema
and the tables. The first line of this file creates the GOSALESDW schema. A
Big SQL schema is a way to logically group objects, such as tables or
functions. The second line of this file (the USE clause) declares a default
schema for the session. All unqualified table names that are referenced in
Big SQL statements and DDL statements default to this schema. If no USE
clause is present, the default is your User ID on the cluster.
Tip: To conserve space in your distributed file system, you can delete the
data folder from its download location, /user/biadmin/bi_sample_data/
data/. Click the data directory in that path, and then click the Remove icon ( ).
The Big SQL LOAD HADOOP statement offers a powerful way to import
data into your tables. The following is a simple example of the LOAD
HADOOP statement.
v This example shows a load from a DB2 table.
LOAD HADOOP USING JDBC CONNECTION URL
'jdbc:db2://myhost:51000/SAMPLE'
WITH PARAMETERS (
'user' = 'myuser', 'password' = 'mypassword')
FROM SQL QUERY
The SQL Results view contains the results of your SQL statements or scripts.
You can change the display of the results page, and also the number of rows
that are returned from each query (the default is 500).
When you run statements or scripts in the SQL Editor in the IBM InfoSphere
BigInsights perspective in Eclipse, the default number of rows that is returned is
500. Follow these steps if you want to change the number of rows that get
returned:
Procedure
1. From the Eclipse menu bar, click Window > Preferences.
2. From the Preferences window, click Data Management > SQL Development >
SQL Results View Options.
3. In the SQL Results View Options window, find the Max row count field and
increase the value from the default of 500. This value controls the number of
rows that are retrieved. A value of zero retrieves all rows.
4. In the Max display row count field, increase the value from the default of 500.
This value controls the number of rows that you see. A value of zero displays
all rows. Be aware that making this number too large can produce performance
problems.
5. Click OK to save your changes.
Now that you have loaded the data from the Sample Outdoor Company, you are
ready to explore the sales figures and product activity.
This module teaches some of the basic statements of IBM Big SQL, and some of the
different environments that you can use to create Big SQL objects and run queries.
Learning objectives
After completing the lessons in this module you will know how to do the
following tasks:
v Create scripts to run Big SQL statements.
v Create a view.
v Create queries that help you analyze the financial data from the Sample Outdoor
Company.
v Run queries from the InfoSphere BigInsights Console, from InfoSphere
BigInsights Tools for Eclipse, and from open source spreadsheets.
Time required
Prerequisites
You already know how to run the predefined SQL scripts from the tasks to set up
your environment. In this lesson, you will create your own script.
The script file can contain one or more SQL statements or commands. Within IBM
Big SQL in the Eclipse SQL editor window, you can run the entire file, or any
highlighted part of the file.
1. If you have not already created the myBigSQL project in Eclipse, do the
following steps:
a. From the Eclipse menu bar, click File > New > Other.
b. In the Select a wizard window, expand the BigInsights directory, select
BigInsights Project, and then click Next.
c. Type myBigSQL in the Project name field, and then click Finish.
d. If you are not already in the BigInsights perspective, in the message that
displays, click Yes to switch to the BigInsights perspective.
2. From the Eclipse menu bar, click File > New > Other.
3. In the Select a wizard window, expand the BigInsights directory, and select
SQL Script, and then click Next.
4. In the New SQL File window, in the Enter or select the parent directory field,
select myBigSQL. Your new SQL file is stored in this project directory.
5. In the File name field, type aFirstFile. The .sql file extension is added
automatically.
6. Click Finish.
7. After you create or open an SQL script for the first time, you must specify the
Big SQL connection for your SQL script file:
a. In the Select Connection Profile window, select the Big SQL connection. The
properties of the selected connection display in the Properties field. The Big
SQL database-specific context assistant and syntax checks are now activated
in the editor that is used to edit your SQL file.
b. Click Finish to close the Select Connection Profile window.
8. In the SQL Editor that opens with the aFirstFile.sql file that you created, add
the following Big SQL comments:
--This is a beginning SQL script
--These are comments. Any line that begins with two
-- dashes is a comment line,
-- and is not part of the processed SQL statements.
9. Save the aFirstFile.sql file by using the keyboard shortcut CTRL-S.
The schema that is used in this tutorial is the GOSALESDW schema. It contains
fact tables for the following topics:
v Distribution
v Finance
The analysis that you will do will reference parts of each of those topics. You will
examine product inventory, distribution, sales, and employee data.
1. From the Eclipse Project Explorer, open the myBigSQL project, and double-click
the aFirstFile.sql file.
2. In the SQL editor pane, type the following statement:
SELECT * FROM GOSALESDW.GO_REGION_DIM;
Each complete SQL statement must end with a semicolon. The statement
selects, or fetches, all the rows that exist in the GO_REGION_DIM table, which is
one of the tables in the GOSALESDW schema.
The SELECT statement is used to select data from a table. The result is stored
in a result table, which is called the result-set. It can be part of another query
or subquery.
You might have a script that contains several queries. When you want to run
the entire script, click the Run SQL icon or press F5 with nothing highlighted.
When you want to run a specific statement, or set of statements, highlight the
text, and then click the Run SQL icon or press F5.
Learn more about the WHERE clause: You can filter results from an SQL
query by using a WHERE clause. The WHERE clause specifies a result table
that contains those rows for which the search condition is true. The syntax
looks like the following code:
WHERE search-condition
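The WHERE clause behaves the same way in any SQL dialect. Here is a minimal runnable sketch that uses SQLite as a stand-in for Big SQL (the table shape and region values are invented for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE go_region_dim (region_key INTEGER, region_en TEXT)")
conn.executemany(
    "INSERT INTO go_region_dim VALUES (?, ?)",
    [(710, 'Americas'), (720, 'Central Europe'),
     (730, 'Northern Europe'), (740, 'Asia Pacific')])

# Only rows for which the search condition is true appear in the result table.
rows = conn.execute(
    "SELECT region_en FROM go_region_dim "
    "WHERE region_key > 720 ORDER BY region_key").fetchall()
print(rows)  # [('Northern Europe',), ('Asia Pacific',)]
```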
5. Run the entire script. This query results in four records or rows.
6. You can learn about the structure of the table GO_REGION_DIM, with some
queries to the syscat schema catalog tables. The Big SQL catalog tables
provide metadata support to the database. For more information about the Big
SQL catalog views, see Hadoop Catalog Views and Catalog Views. Type or
copy the following query, select the statement, and run just this statement:
SELECT * FROM syscat.columns
WHERE tabname='GO_REGION_DIM'
AND tabschema='GOSALESDW';
The output from the catalog tables is folded to upper case. No rows are
returned if you use lower case in the catalog query (for example,
tabname='go_region_dim').
This query uses two predicates in a WHERE clause. The query finds all of the
information from the syscat.columns table when the tabname is
'GO_REGION_DIM' and the tabschema is 'GOSALESDW'. Because you are
using an AND operator, both predicates must be true to return a row. Use
single quotation marks around string values.
The result of the query to the syscat.columns table is the metadata, or the
structure of the table. The SQL Results tab in Eclipse shows 54 rows as your
output. That means that there are 54 columns in the table GO_REGION_DIM.
7. Run a query that returns the number of rows in a table. Type or copy the
following query, select the statement, and then run the query.
SELECT COUNT(*) FROM gosalesdw.go_region_dim;
The COUNT aggregate function returns the number of rows in the table, or the
number of rows that satisfy the WHERE clause in the SELECT statement
when a WHERE clause is part of the statement. The result is the number of
rows in the set. A row that includes only null values is included in the count.
In this example, there are 21 rows in the go_region_dim table.
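The counting rules described above are easy to verify: COUNT(*) counts every row, including rows that contain only null values, while COUNT(column) skips nulls. A quick check, with SQLite standing in for Big SQL and an invented three-row table:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (c TEXT)")
conn.executemany("INSERT INTO t VALUES (?)", [('a',), (None,), ('b',)])

total = conn.execute("SELECT COUNT(*) FROM t").fetchone()[0]     # counts all rows
non_null = conn.execute("SELECT COUNT(c) FROM t").fetchone()[0]  # skips the null
print(total, non_null)  # 3 2
```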
You can retrieve data from views just like you do from tables. However, views
store only their query definition, so they do not require permanent storage for the data.
You might want to create a view to organize the way users see the data, or to
restrict certain information to a defined set of users.
You are going to create a view that gives you information about the quantity of
products that are shipped by branch. The first table (gosalesdw.go_branch_dim)
contains information about the branches of the Sample Outdoor Company. The
second table (gosalesdw.dist_inventory_fact) contains information about the
inventory, including the amount of product that is shipped.
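The kind of view described here, joining a branch table to an inventory table to report shipped quantities, can be sketched generically. SQLite stands in for Big SQL, and the branch names and quantities are invented:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE go_branch_dim (branch_key INTEGER, branch_name TEXT)")
conn.execute(
    "CREATE TABLE dist_inventory_fact (branch_key INTEGER, quantity_shipped INTEGER)")
conn.executemany("INSERT INTO go_branch_dim VALUES (?, ?)",
                 [(1, 'Boston'), (2, 'Lyon')])
conn.executemany("INSERT INTO dist_inventory_fact VALUES (?, ?)",
                 [(1, 100), (1, 40), (2, 75)])

# The view stores only the query definition, not a copy of the data.
conn.execute("""CREATE VIEW shipped_by_branch AS
                SELECT b.branch_name, SUM(f.quantity_shipped) AS shipped
                FROM go_branch_dim b JOIN dist_inventory_fact f
                  ON b.branch_key = f.branch_key
                GROUP BY b.branch_name""")

rows = conn.execute(
    "SELECT * FROM shipped_by_branch ORDER BY branch_name").fetchall()
print(rows)  # [('Boston', 140), ('Lyon', 75)]
```

Users can query shipped_by_branch as if it were a table, without seeing or touching the underlying fact table.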
1. Right-click the myBigSQL project, and select New > SQL Script. Name the
new file GOSALESDW_viewddl.sql in project myBigSQL.
2. In the file GOSALESDW_viewddl.sql, type or copy the following code:
CREATE SCHEMA myschema;
USE myschema;
You are querying this data so that you can better understand the products and
market trends of the fictional Sample Outdoor Company. You are going to examine
the records of the products that are ordered, the quantities that are ordered, and
the order methods.
1. Right-click the myBigSQL project and click New > SQL Script. Name it
companyInfo.sql.
2. Your immediate goal is to learn what products were ordered from the fictional
Sample Outdoor Company, and by what method they were ordered. To achieve
your goal, you must join information from multiple tables in the gosalesdw
schema, because it is a relational database where not everything is in one table:
a. Type or copy the following comments and statement into the
companyInfo.sql file:
--Fetch the product name and the quantity and
-- the order method.
--Product name has a key that is part of other
-- tables that we can use as a join predicate.
--The order method has a key that we can use
-- as another join predicate.
--Query 1
By default, the Eclipse SQL Results page limits the output to 500 rows. You
can change that value in the Data Management preferences.
4. To find out how many rows the query returns in a full Big SQL environment,
type the following query into the companyInfo.sql file, then select the query,
and then press F5:
--Query 2
SELECT COUNT(*)
--(SELECT pnumb.product_name, sales.quantity,
-- meth.order_method_en
FROM
gosalesdw.sls_sales_fact sales,
gosalesdw.sls_product_dim prod, gosalesdw.sls_product_lookup pnumb,
gosalesdw.sls_order_method_dim meth
7. To find out which purchase method of all the methods has the greatest quantity
of orders, you must add a GROUP BY clause (GROUP BY pll.product_line_en,
md.order_method_en). You will also use a SUM aggregate function
(SUM(sf.quantity)) to total the orders by product and method. In addition, you
can clean up the output to substitute a more readable column header by adding
AS Product in the SELECT statement.
SELECT pll.product_line_en AS Product,
md.order_method_en AS Order_method,
SUM(sf.QUANTITY) AS total
FROM gosalesdw.sls_order_method_dim AS md,
gosalesdw.sls_product_dim AS pd,
gosalesdw.sls_product_line_lookup AS pll,
gosalesdw.sls_product_brand_lookup AS pbl,
gosalesdw.sls_sales_fact AS sf
WHERE
pd.product_key = sf.product_key
AND md.order_method_key = sf.order_method_key
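The GROUP BY, SUM, and column-alias pattern in step 7 can be reduced to a self-contained example. SQLite stands in for Big SQL, and the products and quantities are invented:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (product TEXT, method TEXT, quantity INTEGER)")
conn.executemany("INSERT INTO orders VALUES (?, ?, ?)",
                 [('Tents', 'Web', 50), ('Tents', 'Web', 30),
                  ('Tents', 'Fax', 10), ('Lanterns', 'Web', 20)])

# One result row per (product, method) group, with quantities totalled
# and readable column headers supplied by AS.
rows = conn.execute("""SELECT product AS Product, method AS Order_method,
                              SUM(quantity) AS total
                       FROM orders
                       GROUP BY product, method
                       ORDER BY total DESC""").fetchall()
print(rows)  # [('Tents', 'Web', 80), ('Lanterns', 'Web', 20), ('Tents', 'Fax', 10)]
```

Sorting by the aggregate makes the answer to "which method has the greatest quantity of orders" the first row of the result.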
Your goal in this lesson is to understand how the Sample Outdoor Company
products that are sold rank in comparison with the products that are shipped. You
are going to write SQL statements to analyze the data in the GOSALESDW schema
to achieve this goal.
1. Create an SQL file that is named advanced.sql in the myBigSQL project.
2. To open the advanced.sql file, double-click it.
3. Type or copy the following statement into the advanced.sql file:
WITH
sales AS
(SELECT sf.*
FROM gosalesdw.sls_order_method_dim AS md,
gosalesdw.sls_product_dim AS pd,
gosalesdw.emp_employee_dim AS ed,
gosalesdw.sls_sales_fact AS sf
WHERE pd.product_key = sf.product_key
AND pd.product_number > 10000
AND pd.base_product_key > 30
AND md.order_method_key = sf.order_method_key
AND md.order_method_code > 5
AND ed.employee_key = sf.employee_key
AND ed.manager_code1 > 20),
inventory AS
(SELECT if.*
FROM gosalesdw.go_branch_dim AS bd,
gosalesdw.dist_inventory_fact AS if
WHERE if.branch_key = bd.branch_key
AND bd.branch_code > 20)
SELECT sales.product_key AS PROD_KEY,
Learn more about WITH clauses: The WITH clause is a type of common table
expression that allows defining a result table with a table-name that can be
specified as a table name in any FROM clause of the fullselect that follows.
Multiple common table expressions can be specified following a single WITH
keyword. Each common table expression can also be referenced by name in the
FROM clause of subsequent common table expressions.
A common table expression can be used in place of a view to avoid creating the
view. It can also be used when the same result table must be shared in a
fullselect.
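In other words, a common table expression behaves like an inline, single-statement view. A minimal sketch, with SQLite standing in for Big SQL and invented sales data:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sls_sales_fact (product_key INTEGER, quantity INTEGER)")
conn.executemany("INSERT INTO sls_sales_fact VALUES (?, ?)",
                 [(30001, 5), (30001, 7), (30002, 2)])

# The WITH clause names a result table (sales) that the final SELECT reuses,
# without creating a permanent view.
rows = conn.execute("""WITH sales AS
                         (SELECT product_key, SUM(quantity) AS sold
                          FROM sls_sales_fact
                          GROUP BY product_key)
                       SELECT * FROM sales WHERE sold > 3""").fetchall()
print(rows)  # [(30001, 12)]
```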
The example also shows multiple tables that are joined together. In most cases,
Big SQL joins the tables together in the order that they are provided in the
statement. In the example, the gosalesdw.sls_order_method_dim table is
accessed by Big SQL first.
When you choose the order of the tables in the query, remember to eliminate
rows as early as possible. Tables that use predicates that filter out many rows,
or those tables with rows that are removed as a result of the join should be
located early in the query. These tables are considered highly selective.
Ordering the tables in this way reduces the number of rows that must be
moved to the next step of the query.
4. Click the Run SQL icon ( ). The result contains 165 rows. The output shows
the product by its product key, by the number of units that were shipped, and
by the number of units that were sold.
Figure 4. Partial results of how many units were shipped and sold.
The INV_SHIPPED column is derived from the SUM aggregate function and a
CAST function as shown in the following Big SQL statement:
SUM(CAST (INVENTORY.QUANTITY_SHIPPED AS BIGINT))
AS INV_SHIPPED...
Lesson 1.6: Running Big SQL queries in the Big SQL Console
In this lesson, you learn how to run queries in the IBM Big SQL Console of IBM
InfoSphere BigInsights.
The Big SQL console is a built-in part of the InfoSphere BigInsights Console, and is
available to all users of the console. No additional setup is required.
If you log into the InfoSphere BigInsights Console as the bigsql user, you can run
all of the statements in the Big SQL console that you ran in the previous lessons.
The Big SQL Console runs as the InfoSphere BigInsights Console logged-in user,
and therefore has the authorizations of that user.
1. Open the InfoSphere BigInsights Console. Click the Welcome tab.
2. In the Quick Links pane, click Run BigSQL Queries.
The Big SQL Console opens in your browser where you can enter one or more
queries. Make sure that the Big SQL radio button is selected.
3. In the query entry field, type the following statements:
CREATE HADOOP TABLE new4Console (
ProductName VARCHAR(100), Quantity BIGINT, ProductCode INT);
INSERT INTO new4Console VALUES ('Weezers',522,1);
INSERT INTO new4Console VALUES ('Somers',3566,5);
INSERT INTO new4Console VALUES ('Gowzers',3566,5);
SELECT * FROM new4Console;
SELECT * FROM new4Console WHERE ProductCode > 0 ORDER BY ProductName;
4. Click Run. The output appears in the lower half of the console window, in the
Result tab. A Status tab displays the success or failure of each statement. The
contents of each result are limited to 200 rows.
5. To see statements that you ran previously, expand the list box of previously run
statements, in the top pane that is above your current statement. To rerun one
of those statements, click the statement to place it back into the current
statement window. Then, you can click Run to do the query again.
6. View the new table and its contents in the distributed file system by opening
the Files tab in the InfoSphere BigInsights Console. Click the DFS Files tab and
follow the path to the biadmin schema to find the new4Console table and its
contents.
In this lesson, you are going to write a small Java application that contains code to
implement a scalar function that returns total units sold.
1. In the Eclipse client, create a Java project:
a. In the IBM InfoSphere BigInsights Eclipse environment, click File > New >
Project. From the New Project window, select Java Project. Click Next.
b. Type MyUDFProject in the Project Name field. Click Next.
c. Open the Libraries tab, and click Add External Jars. Select the appropriate
JDBC drivers from your local path, which by default includes these two JAR
files:
db2jcc_license_cu.jar
db2jcc4.jar
d. Click Finish. Click No when you are asked if you want to open a different
perspective.
2. Create a Java class:
a. Right-click the MyUDFProject project, and click File > New > Java >
Package. In the Name field, in the New Java Package window, type udf.
Click Finish.
b. Right-click the udf package, and click File > New > Java > Class.
c. In the New Java Class window, type MyUdf in the Name field. Select the
public static void main(String[] args) check box. Click Finish.
3. Copy the following JAVA code into the MyUdf.java file:
package udf;
public final class MyUdf {
public static double getItemTotal
(int units,
double price,
int discount
)
{
if (
units <= 0 ||
price <= 0 ||
discount < 0 ||
discount > 100
)
{
return -1;
}
else
{
return units *
price *
((100 - discount) /100.0);
}
}
}
4. Save the file and then, right-click the MyUdf.java file and click Export. Expand
the Java category and select JAR file. Click Next.
5. In the Select the resources to export pane, select the udf package. You see that
MyUdf.java is also selected. In the Select the export destination field, specify
the destination path and file name for the JAR file, tot_JAR.jar.
The sqlj parameter takes the full path to the JAR file. The id parameter
means that subsequent SQL commands that use the tot_JAR.jar file can refer
to it with the name 'My_Jar'.
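The installation step itself is truncated in this text. In DB2-compatible Big SQL, a JAR file is typically registered with the SQLJ.INSTALL_JAR procedure; the following is a sketch only, and the path /home/biadmin/tot_JAR.jar is an assumed location rather than one given by this tutorial:

```sql
-- Register tot_JAR.jar under the short name 'My_Jar'.
-- The first parameter (sqlj) is the full path to the JAR file;
-- the second parameter (id) is the name that later statements use.
CALL SQLJ.INSTALL_JAR('file:/home/biadmin/tot_JAR.jar', 'My_Jar');
```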
7. Register the function by running a CREATE FUNCTION statement from your
Eclipse or JSqsh client:
CREATE FUNCTION gosalesdw.getItemTotal
(INT,DOUBLE,INT)
RETURNS DOUBLE
NO SQL
LANGUAGE JAVA
EXTERNAL NAME 'My_Jar:udf.MyUdf!getItemTotal'
PARAMETER STYLE JAVA
;
In the above example, My_Jar is the short name that you defined in the install
command. It represents the JAR file that contains the function class,
tot_JAR.jar. The package name is udf and MyUdf is the class name. The
function name is the method name, getItemTotal. The function takes three
input types: an INTEGER, a DOUBLE, and an INTEGER. The output is a
DOUBLE.
The Java routine does not need to exist before you run the CREATE statement.
But the routine must be accessible at the time that you use the function in a
query.
8. Now, use the function:
SELECT EMPLOYEE_KEY,
gosalesdw.getItemTotal(QUANTITY, UNIT_PRICE, 10)
AS "the function result"
FROM GOSALESDW.SLS_SALES_FACT fetch first 5 rows only;
You can create a JDBC application to open a database connection, run a Big SQL
query, and then display the results of the query.
1. Create a Java project:
a. In the IBM InfoSphere BigInsights Eclipse environment, click File > New >
Project. From the New Project window, select Java Project. Click Next.
b. Type MyJavaProject in the Project Name field. Click Next.
c. Open the Libraries tab, and click Add External Jars. Select the Big SQL
JDBC driver from your local path, which by default includes these two JAR
files:
db2jcc_license_cu.jar
db2jcc4.jar
d. Click Finish. Click No when you are asked if you want to open a different
perspective.
2. Create a Java class:
a. Right-click the MyJavaProject project, and click File > New > Package. In
the Name field, in the New Java Package window, type aJavaPackage4me.
Click Finish.
b. Right-click the aJavaPackage4me package, and click File > New > Class.
c. In the New Java Class window, type SampApp in the Name field. Select the
public static void main(String[] args) check box. Click Finish.
3. Copy the following code into the SampApp.java file. The lettered comments
correspond to the explanations that follow:
package aJavaPackage4me;
// a. Declare the package, then include the JDBC classes
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

public class SampApp {

    /**
     * @param args
     */
    public static void main(String[] args) {
        // b. Set JDBC and database information; update the host name,
        //    user name, and password for your environment
        String db = "jdbc:db2://<host_name>:51000/bigsql";
        String user = "bigsql";
        String pwd = "bigsql";
        Connection conn = null;
        Statement stmt = null;
        try {
            // c. Register the JDBC driver
            Class.forName("com.ibm.db2.jcc.DB2Driver");
            // d. Get a connection
            conn = DriverManager.getConnection(db, user, pwd);
            System.out.println("Connected to the database.");
            // e. Execute a query
            stmt = conn.createStatement();
            System.out.println("Created a statement.");
            String sql;
            sql = "select * from gosalesdw.sls_product_dim " +
                  "where product_key=30001";
            ResultSet rs = stmt.executeQuery(sql);
            System.out.println("Executed a query.");
            // f. Obtain results
            System.out.println("Result set: ");
            while (rs.next()) {
                // Retrieve by column name
                int product_key = rs.getInt("product_key");
                int product_number = rs.getInt("product_number");
                // Display values
                System.out.print("* Product Key: " + product_key + "\n");
                System.out.print("* Product Number: " + product_number + "\n");
            }
            // g. Close open resources
            rs.close();
            stmt.close();
            conn.close();
        }
        catch (SQLException sqlE) {
            // Process SQL errors
            sqlE.printStackTrace();
        }
        catch (Exception e) {
            // Process other errors
            e.printStackTrace();
        }
        finally {
            // Ensure resources are closed before exiting
            try {
                if (stmt != null) stmt.close();
                if (conn != null) conn.close();
            }
            catch (SQLException sqlE) {
                sqlE.printStackTrace();
            }
        }
    }
}
a. The Java code must first declare the package. Then, you include the
packages that contain the JDBC classes that you need for database
programming.
b. Set up the required database information, including a user name and
password, so that you can refer to it.
c. You must register the JDBC driver so that you can open a communications
channel with the database.
d. Open the connection with the getConnection(db, user, pwd) method. You
pass the variables that you created in an earlier step.
e. Run a query by submitting an SQL statement to the database:
sql =
"select * from gosalesdw.sls_product_dim " +
"where product_key=30001";
f. You extract the data from the result set by issuing the getInt method. You
display the output by using the print method.
g. Clean up the environment by closing all of the database resources.
4. Save the file, and right-click the Java file, and click Run as > Java Application.
You can use BigSheets and Big SQL together to read data, and then create tables
from that data.
In Lesson 2.1, Lesson 2.2, and Lesson 2.3 you will use data from the Sample
Outdoor Company to examine the result of sales by year, and then export the data
to BigSheets to create a chart that reflects the total sales by year.
In Lesson 2.4 and Lesson 2.5, you will use data from the occurrences of IBM
Watson in social media to illustrate some additional features of working with Big
SQL and BigSheets.
Learning objectives
After you complete the lessons in this module you will understand the concepts
and know how to do the following tasks:
v Create a BigSheets workbook.
v Import and export data to and from Big SQL
v Create tables from other tables in Big SQL.
Time required
To see the result of sales by year, the statement uses some of the features of Big
SQL that you used in earlier lessons. Use the WITH clause to create an inline
Learn more about the RANK function: The RANK function is one of the
On-Line Analytical Processing (OLAP) functions that provide the ability to
return ranking, row numbering and existing aggregate function information as
a scalar value in a query result. For more information about the OLAP
functions, see the OLAP specification.
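As a brief illustration of RANK in the style of this lesson's statement (the column names follow the sample gosalesdw.sls_sales_fact table; treat the query as a sketch rather than part of the tutorial script):

```sql
-- Rank order methods by their total sale amount, highest first.
SELECT order_method_key,
       SUM(sale_total) AS total_sales,
       RANK() OVER (ORDER BY SUM(sale_total) DESC) AS ranked_sales
FROM gosalesdw.sls_sales_fact
GROUP BY order_method_key;
```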
4. Save the file and then click the Run SQL icon ( ). The statement shows the
inline view, sales, which simplifies the final SELECT statement. In addition,
the nested aggregate functions demonstrate how the data types and
presentation can be manipulated.
To do more analysis on the results, you can use BigSheets to create charts that you
can show on the dashboard in your InfoSphere BigInsights server.
The next three lessons show you how to share data between BigSheets and Big
SQL.
v In “Lesson 2.2: Exporting Big SQL data about total sales by year to BigSheets”
on page 68, you will export a CSV file to BigSheets and create a workbook and a
chart reflecting the total quantity sold by product name.
v In “Lesson 2.4: Exporting BigSheets data about IBM Watson blogs to Big SQL
tables” on page 71, you read the blogs-data.txt file that you downloaded in an
earlier lesson, into a BigSheets workbook using the JSON Array reader.
v In “Lesson 2.5: Creating a catalog table from BigSheets Watson blog data to use
in Big SQL” on page 76, you use the Common Catalog feature of the InfoSphere
Lesson 2.2: Exporting Big SQL data about total sales by year
to BigSheets
In this lesson, you will export your data about total sales by year from Big SQL to
BigSheets.
BigSheets is capable of reading many data types. For this lesson, you will be
exporting output from Big SQL as a comma-separated value (CSV) type.
1. Export the output from the query in the previous lesson, to a CSV file so that
you can use BigSheets to analyze the data:
a. In the SQL Results page, open the Result1 tab. Select at least one row, and
then right-click and select Export > Current Result.
b. In the Select Export Format window, click the Browse button to locate a
destination directory in your local system. The default name is
<path>/result.<filetype> in a Linux environment, or <path>\
result.<filetype> in a Windows environment. Change the
result.<filetype> file name to SampleResults.
c. In the Format field, select CSV file (*.csv).
BigSheets can handle different readers. Select the correct reader when you
are in BigSheets to see the data in tabular format.
d. Click Finish.
2. To make this SampleResults.csv file available to BigSheets, upload it to the
InfoSphere BigInsights server:
a. Open the InfoSphere BigInsights Console, and click the Files page.
b. To create a directory on the server, select the tmp directory, and click the
create directory icon.
c. Click the Line Reader edit icon ( ), to change the reader format. Select
Comma Separated Value (CSV) Data from the drop-down list. And then
click the green check mark.
The contents of the file now appear as a table with three columns.
d. Click Save as Master Workbook. In the Name field type SampleResults. In
the Description field, type From a CSV file. Click Save. The BigSheets tab
of the InfoSphere BigInsights Console opens in the View Results page.
From there, you can continue with BigSheets functions.
4. Optional: Create a chart from the data to illustrate the sales quantity by year:
a. Click Add Chart.
b. Select Chart and then select Bar.
c. In the Chart Name field, type Totals by year.
d. In the Title field, type Year totals.
e. In the X-Axis field, select YEAR.
f. In the X-Axis Label field, type Year.
g. In the Y-Axis field, select TOTAL_SALES.
h. In the Y-Axis Label field, type Total sales.
i. In the Sort By field, select X Axis.
j. Click the green check mark. Then click Run.
When the processing is complete, you have a visual representation of the total
sales by YEAR. You can see that 2006 represents the most sales.
Instead of using the Export and Upload features that are described in the previous
steps, use the CREATE TABLE AS... clause of IBM Big SQL.
1. From your Eclipse environment, in the same org.sql file, add the following
lines in front of the statement that contains the WITH clause:
CREATE HADOOP TABLE gosalesdw.myprod_sales_tot
(Year varchar(4), Sales_tot float, Rank int)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
AS
The new CREATE statement should look like the following statement:
CREATE HADOOP TABLE gosalesdw.myprod_sales_tot
(Year varchar(4), Sales_tot float, Rank int)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
AS
WITH SALES
(YEAR, TOTAL_SALES, RANKED_SALES)
AS
(
SELECT CAST(ORDER_DAY_KEY AS VARCHAR(4)) AS YEAR,
SUM (SALE_TOTAL) AS TOTAL_SALES,
RANK() OVER (ORDER BY SUM(SALE_TOTAL) DESC) AS RANKED_SALES
FROM GOSALESDW.SLS_SALES_FACT GROUP BY CAST(ORDER_DAY_KEY AS VARCHAR(4))
)
SELECT YEAR, total_sales, ranked_sales FROM sales
ORDER BY YEAR, ranked_sales DESC;
6. Click the Line Reader edit icon ( ), to change the reader format. Select Tab
Separated Value (TSV) Data from the drop-down list. Clear the check mark in
the Headers Included check box. And then click the green check mark. The
contents of the file now appear as a table with three columns.
7. Click Save as Master Workbook. In the Name field of the dialog, type
TSV_MyTotals. The BigSheets tab of the InfoSphere BigInsights Console opens in
the View Results page. From there, you can continue with BigSheets functions.
You are going to use the blogs-data.txt file, which is one of the files you
downloaded when you set up your tutorial environment.
The data in the blogs-data.txt file comes from blogs that reference the term IBM
Watson. In this lesson you are going to turn that text data into a BigSheets
workbook, and then use the functions in BigSheets to format the data into
something that is easier to understand. To examine the blogs data in the
blogs-data.txt file, you create a workbook and use that data for a new Big SQL
table.
This lesson introduces a way of creating tables from data that you analyze by
using BigSheets, with either a TSV reader format or a JSON Array reader format.
g. Because the data columns exceed the viewing space, click Fit column(s).
The first eight columns display in the Preview area.
h. Click the check mark to save the workbook.
i. In the View Results page of BigSheets, click Build new workbook. Rename
the workbook by clicking the edit icon, entering the new name of
WatsonBlogDataRevised, and clicking the green check mark.
j. To more easily see the columns, click Fit column(s), in the
WatsonBlogDataRevised workbook. Now columns A through H fit within
the width of the sheet.
k. You do not need to use all of the columns in your IBM Big SQL table.
Remove multiple columns by following these steps:
1) Click the down arrow in any column heading and select Organize
columns.
2) Click the X next to the following columns to mark them for removal:
v Crawled
v Inserted
v IsAdult
v PostSize
3) Click the green check mark to remove the marked columns.
Task 2.4.3: Creating a Big SQL script that creates Big SQL
tables from the exported TSV file
In this task, you create an SQL script that creates Big SQL tables based on the
BigSheets blogs data workbook.
1. In the InfoSphere BigInsights Eclipse environment, create a project that is
named MyBigSheetsAnalysis, and a new SQL script named NewsBlogs.
2. In the NewsBlogs.sql file, copy or type the following code:
CREATE SCHEMA IF NOT EXISTS BigSheetsAnalysis;
USE BigSheetsAnalysis;
Figure 10. A portion of the output from the BigSheets workbook after it is selected from a Big
SQL table
4. Query the table to get the feed information, publication dates, and URLs of
English-based blog posts about IBM Watson:
SELECT feedinfo, published, url
from BigSheetsAnalysis.sheetsOut WHERE language='English';
You started with data in a JSON array format and read it into a BigSheets
workbook. Then you updated the workbook to show only the columns in which
you had an interest. Then you exported the data to the distributed file system and
used that data in a Big SQL table.
You can export data from a BigSheets workbook as a JSON Array and then use a
SerDe application (Serializer/Deserializer) to process the JSON data. You then
make the data available to a Big SQL table.
By using the SerDe interface, you instruct Hive as to how a record is processed.
You can write your own SerDe for processing JSON data, or you can use a package
Task 2.4.5: Creating a Big SQL table in Eclipse using the SerDe
application to process the Watson blog data
In this lesson, you create a table and a query to access the BigSheets JSON array
data.
1. In the InfoSphere BigInsights Eclipse environment, open the NewsBlogs.sql file,
and create a table that accesses the appropriate data in the JSON output from
BigSheets, and that uses the SerDe class. Type or copy the following code:
CREATE HADOOP TABLE BigSheetsAnalysis.watson_json (
Country STRING,
FeedInfo STRING,
Language STRING,
Published STRING,
SubjectHtml STRING,
Tags STRING,
Type STRING,
Url STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.JsonSerde'
;
2. Select the CREATE TABLE statement and press F5.
3. In the InfoSphere BigInsights Console, in the Files tab, locate the
WatsonBlogsData.json file that you created in the previous lesson and select it.
Your output should look similar to the output from the TSV file.
As you have already learned, you can use BigSheets and Big SQL together to read
data, and then create a table from that data.
In this lesson, you learn another way to create a table from a BigSheets workbook.
e. Click the Edit workbook reader icon ( ) and select JSON Array as the
reader type.
f. Click the green check mark to save the workbook and open it in the View
Results window.
2. Build a new workbook:
a. Click Build new workbook. Rename the workbook by clicking the edit
icon, entering the new name MyBlogsWBRevised, and clicking the green check
mark.
b. Click Fit column(s) to fit the columns within the width of the sheet.
c. Remove some columns that you do not need to use in your IBM Big SQL
table. Remove multiple columns by following these steps:
1) Click the down arrow in any column heading and select Organize
columns.
2) Click the X next to the following columns to mark them for removal:
v Crawled
v Inserted
v IsAdult
v PostSize
3) Click the green check mark to remove the marked columns.
3. Click Save > Save to save the workbook. In the Save workbook dialog, click
Save. Click Exit to start the run process. Click Run to run the workbook.
4. Click the Create Table button to save the workbook as a common catalog table
and as a table in the distributed file system:
a. In the Target Schema field, keep the default sheets schema name.
You now see that you can analyze your data in both BigSheets and in Big SQL.
You can easily change the metadata of the table, reformat the columns, and
manipulate the output to satisfy many goals, moving the data from Big SQL to
BigSheets, or from BigSheets to Big SQL as needed.
As you will see in the next module, you can make subsets of your information
from Big SQL and use it in open source spreadsheet applications.
There are many open source spreadsheet applications. This lesson assumes that
you have the Microsoft Excel spreadsheet application. Depending on the
spreadsheet application and version that you use, you might see differences in the
interface controls that are mentioned in these lessons.
Big SQL provides connectivity for some applications through either a 32-bit or a
64-bit ODBC driver, on either Linux (Red Hat Enterprise Linux (RHEL) 6 or SUSE
Linux Enterprise Server (SLES) 11) or Windows (Microsoft Windows 7 or Microsoft
Windows Server 2008). The Big SQL connectivity conforms to the Microsoft Open
Database Connectivity 3.0.0 specification.
Depending on the spreadsheet application that you use, you might need to select
the ODBC driver that you install from the operating system, or from the
spreadsheet application itself. Refer to information in your particular spreadsheet
application about importing data from external data sources.
Learning objectives
After you complete the lessons in this module, you will know how to do the
following tasks:
v Install the IBM Data Server Driver Package to access the ODBC drivers.
v Import data into your client spreadsheet.
v Query the data in your client spreadsheet.
Time required
Lesson 3.1: Installing the IBM Data Server Driver Package for
the client ODBC drivers
In this lesson, you install the ODBC drivers that you must use with the open
source client spreadsheet.
Attention: For Linux users, or users of the IBM InfoSphere BigInsights Quick
Start Edition, if you must attach to a remote ODBC client from your Linux
machine, follow these steps:
1. From a Linux command line, type the following command to determine the IP
address of your current InfoSphere BigInsights cluster:
cat /etc/hosts
2. Open a browser outside of the cluster environment by typing the following in
the URL address field:
<ip address>:8080
or
<ip address>:8443 if you are running secure protocol
The InfoSphere BigInsights Console opens in the non-cluster location. You can
continue with the steps to download the driver and attach the ODBC driver to
the correct location.
1. Download the 64-bit IBM Data Server Driver Package:
a. From the InfoSphere BigInsights Console Welcome Page Quick Links pane,
click Download client library and development software.
b. Select Big SQL clients and drivers and click Download.
c. From the Download Fix Packs by version for IBM Data Server Client
Packages page, select the correct IBM Data Server Driver Package for your
operating system. For the purposes of this tutorial, select the package for
Windows 64-bit, DSClients-ntx64-dsdriver-10.5.300.125-FP003. This package
works with a 32-bit or 64-bit spreadsheet client.
d. Click Close in the Download client library and development software
window.
2. Right-click the v10.5fp3_ntx64_dsdriver_EN.exe file and select to Run as
administrator to install the package. This package is installed by default in
C:\Program Files\IBM\IBM DATA SERVER DRIVER\.
As part of the installation, configuration files are created in
C:\ProgramData\IBM\DB2\IBMDBCL1\cfg. For other operating systems or
versions, the installed location, and the configuration file location might be
different.
3. When the install is complete, navigate to the directory that contains the
sample configuration files, and copy db2dsdriver.cfg.sample to
db2dsdriver.cfg. Copy db2cli.ini.sample to db2cli.ini.
4. Edit the db2dsdriver.cfg file so that the result looks like the following file:
<configuration>
<dsncollection>
<dsn alias="MyDSN" name="bigsql" host="abc.com" port="51000"/>
</dsncollection>
<databases>
<database name="bigsql" host="abc.com" port="51000">
</database>
</databases>
</configuration>
Make sure that you update the values for host in two places in the file.
If you have a successful connection, you will see the following in the screen
output:
Connection attempt for data source name "MyDSN":
====================================================
[SUCCESS]
7. Create an ODBC DSN to the alias that you just validated. The example
spreadsheet client in this tutorial is a 32-bit Microsoft Excel application. Use
the db2cli32 command, instead of the db2cli command, if you are using a
32-bit IBM Data Server Driver along with the 64-bit installer in a 64-bit
Windows computer. Type the following command if you are using a 64-bit
application:
db2cli registerdsn -add MyDSN -system
This command creates a system data source name that you can see in the
ODBC administrator tool.
8. Start the ODBC administrator tool
For a 64-bit driver
a. Select the Control Panel from the Start menu.
b. Select Administrative Tools.
c. Click Data Sources (ODBC) for 64 bit binary.
For a 32-bit driver in a 64-bit machine
Right-click the ..\Windows\SysWOW64\odbcad32.exe file and select Run
as administrator
9. The ODBC Data Source Administrator opens. Click the System DSN tab.
10. Select MyDSN, and click Configure.
11. Type bigsql in the user name field, and bigsql in the password field. Click
Connect. The message "Connection tested successfully" is displayed.
Accessing the IBM Big SQL table in an open source spreadsheet application is the
equivalent of running the query select * from <table_name> against InfoSphere
BigInsights. The table can be used in your client application as you would use
any spreadsheet data.
1. Open the client spreadsheet application.
3. Select MyDSN to connect to it, and provide the login details. The list of tables
is displayed in the database.
4. Select the table sheetsOut, which you created in a previous lesson. Click Next.
5. You can continue to click Next in each window, or select filtering and sorting
attributes. In the Query Wizard - Filter Data window, click column <?> to
specify which rows to include in the data. Click Next.
6. In the Query Wizard - Sort Order window, select a column to sort by that
column. Click Next.
7. In the Query Wizard - Finish window, select the Return Data to Microsoft
Office Excel radio button. Click Finish.
8. In the Import Data window, select the Existing worksheet radio button to put
the data in the current worksheet. Click OK. The data is imported.
The spreadsheet application contains the result of the table that you selected.
Additional resources
Big R uses the open source R language to enable rich statistical analysis. You can
use Big R to manipulate data by running a combination of R and Big R functions.
Big R functions are similar to existing R functions, but are designed specifically for
analyzing big data. You can use Big R to analyze data located on the InfoSphere
BigInsights server with an R environment.
This tutorial uses a sample data set that is included in the Big R package. The 11.8
MB sample data set is a random sample of 22 years of flight arrival and departure
information.
This tutorial requires basic R knowledge. To get started with R, view the course on
R programming on the Big Data University website.
Learning objectives
After completing the lessons in this module you will understand the concepts and
know how to do the following actions with Big R:
v Use Big R functions
v Connect to InfoSphere BigInsights data sources from the R user interface
v Create visualizations
v Create predictive models
Time required
Prerequisites
Attention: R is licensed by the R project under the GNU General Public License.
IBM does not provide R and is not responsible for it or your use of it in any way.
Quick Start Edition VM Users: If you are running the tutorial with the IBM
InfoSphere BigInsights Quick Start Edition VM image, to install R and Big R:
1. Double-click the Install BigR file from the Desktop.
Load the Big R package, connect to the InfoSphere BigInsights server, and then
confirm that it is connected. Update the following example for your environment
settings, then run the code in your R environment.
library(bigr)
bigr.connect(host="host_name",
port=7052, database="default",
user="biadmin", password="password")
is.bigr.connected()
host_name is the host name of the node where your InfoSphere BigInsights Console
is installed.
Quick Start Edition VM Users: If you are running the IBM InfoSphere BigInsights
Quick Start Edition VM image, run the following code in the RGui R Console:
library(bigr)
bigr.connect(host="bivm",
port=7052, database="default",
user="biadmin", password="biadmin")
is.bigr.connected()
For this tutorial, you use R to extract an airline data set from the Big R package
and import it into your distributed file system for further analysis. For typical
examples of importing data, see the tutorial on Importing data for analysis.
1. Run the following code to extract the airline data set from the bigr package
directory, and then create a data frame named airR:
airfile <- system.file("extdata", "airline.zip", package="bigr")
airfile <- unzip(airfile, exdir = tempdir())
airR <- read.csv(airfile, stringsAsFactors=F)
2. Convert airR to a bigr.frame named air, then move the data set to the
InfoSphere BigInsights server, and then show the sample airline data set from
the InfoSphere BigInsights server:
air <- as.bigr.frame(airR)
bigr.persist(air, dataSource="DEL",
dataPath="airline_demo.csv", header=T,
delimiter=",", useMapReduce=F)
Important: Moving the file to the InfoSphere BigInsights server can take a few
minutes.
The useMapReduce parameter is true by default. The sample airline data set is
not large, so setting the parameter to false makes the upload run faster.
Expected output:
After uploading the sample data set, the file exists in the following location in the
InfoSphere BigInsights Console: user/bigsql/airline_demo.csv.
Lesson 2: Exploring the structure of the data set with IBM InfoSphere
BigInsights Big R
In this lesson, you learn how to review the structure of the data set.
By default, Big R sets all column types to character. However, for this data set,
all column types need to be integers, except for: UniqueCarrier, TailNum,
Origin, Dest, and CancellationCode.
Expected output:
[1] "Year" "Month" "DayofMonth" "DayOfWeek"
[5] "DepTime" "CRSDepTime" "ArrTime" "CRSArrTime"
[9] "UniqueCarrier" "FlightNum" "TailNum" "ActualElapsedTime"
[13] "CRSElapsedTime" "AirTime" "ArrDelay" "DepDelay"
[17] "Origin" "Dest" "Distance" "TaxiIn"
[21] "TaxiOut" "Cancelled" "CancellationCode" "Diverted"
[25] "CarrierDelay" "WeatherDelay" "NASDelay" "SecurityDelay"
[29] "LateAircraftDelay"
> coltypes(air)
[1] "character" "character" "character" "character" "character" "character" "character"
[8] "character" "character" "character" "character" "character" "character" "character"
[15] "character" "character" "character" "character" "character" "character" "character"
[22] "character" "character" "character" "character" "character" "character" "character"
[29] "character"
2. Run the following code to assign type integer to all column types except for the
columns listed in the previous step, and then display the updated column
types:
coltypes(air) <- ifelse(1:29 %in% c(9,11,17,18,23), "character", "integer")
coltypes(air)
Expected output:
[1] "integer" "integer" "integer" "integer" "integer" "integer" "integer"
[8] "integer" "character" "integer" "character" "integer" "integer" "integer"
[15] "integer" "integer" "character" "character" "integer" "integer" "integer"
[22] "integer" "character" "integer" "integer" "integer" "integer" "integer"
[29] "integer"
Option                      Description
nrow(air)                   Number of flights (number of rows)
ncol(air)                   Number of flight attributes (number of columns)
dim(air)                    Data dimensions (rows x columns)
str(air)                    Structure of the data set, including sample data
head(air, 5)                First five rows
tail(air, 7)                Last seven rows
print(air$UniqueCarrier)    Carrier codes of all flights
print(air$Dest)             Destination cities of all flights
The fictional Sample Outdoor Company wants to partner with an airline that has
few delays.
1. Run the following code to attach the air data set to the R search path.
attach(air)
Expected output:
[1] 0.2269586
4. To see how the fictional HA airline's delays compare to flight delays overall,
run the following code:
nrow(airSubset[airSubset$UniqueCarrier == "HA",]) /
nrow(air[UniqueCarrier == "HA",])
Expected output:
[1] 0.07434944
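The two ratios can now be compared directly: roughly 22.7% of all flights in the sample are delayed, versus roughly 7.4% of HA flights. The following Python sketch illustrates the comparison; the HA counts are hypothetical values chosen only to reproduce the ratio, not numbers from the air data set:

```python
# Compare the overall delay ratio with a (hypothetical) HA delay ratio.
overall_ratio = 0.2269586        # from the earlier expected output
ha_delayed, ha_total = 20, 269   # illustrative counts, not real data
ha_ratio = ha_delayed / ha_total

print(ha_ratio < overall_ratio)  # True: HA is delayed less often than average
```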
Because the data analysis shows that the fictional HA airline has fewer delays
than other airlines, it is a strong candidate partner for the Sample Outdoor
Company.
Quick Start Edition VM Users: If you are running the tutorial with the IBM
InfoSphere BigInsights Quick Start Edition VM image, save the makeR file in
the /home/biadmin/ directory.
2. Install the package from the local file by using the following R command:
install.packages("makeR_1.0.2.tar.gz", repos = NULL, type = "source")
3. Load the library using the following R command:
library(makeR)
The fictional Sample Outdoor Company wants to find the times and days with the
fewest delays.
1. Connect to the ggplot2 library by running the following command:
library(ggplot2)
With the ggplot2 library you can create bar charts and other types of plots.
2. Create a bar chart that shows the number of delays for each hour, by running
the following code:
a. Create a working copy of the air data frame by running the following code.
bf <- air
Expected output:
The chart shows delays increasing as the day progresses, probably because the
number of flights also increases throughout the day.
3. Create a bar chart that shows the number of flights for each hour, by running
the following script:
bigr.histogram(air$ArrTime, nbins=24) +
labs(title = "Flight Volume (Arrival) by Hour")
Expected output:
In the month of February, Saturdays seem to have fewer flights on average. The
Sample Outdoor Company might want to schedule February flights between
9:00 and 11:00 a.m. on Saturdays.
Important: Make sure that R is installed on each node of the cluster (NameNode
and DataNode), and that rpart is installed on each node.
When building models, you can use a subset of the data to train the model,
and use the remaining data to test and validate your model.
The models will predict flight arrival delay using departure delay, departure
time, air travel time, and distance as predicting variables.
Expected output:
n=196 (2 observations deleted due to missingness)
Expected output:
The decision tree splits on the DepDelay column more than on any other column,
which shows that there is a strong relationship between arrival delay and
departure delay.
2. Use the models that you created in step 1 to make arrival delay predictions.
Expected output:
carrier DepDelay ArrDelay ArrDelayPred
1 UA -5 -14 -3.353143
2 UA -5 -6 -3.353143
3 UA -5 -9 -3.353143
4 UA -2 -8 -3.353143
5 UA -2 -7 -3.353143
6 UA 25 15 20.429878
7 UA -3 -22 -3.353143
8 UA 3 31 6.742727
9 UA 11 -8 6.742727
10 UA -3 0 -3.353143
11 UA -2 3 -3.353143
12 UA 0 14 -3.353143
13 UA -2 -14 -3.353143
14 UA -4 -13 -3.353143
15 UA 1 -5 -3.353143
16 UA -5 2 -3.353143
17 UA -7 -7 -3.353143
18 UA -5 -20 -3.353143
19 UA 0 -38 -3.353143
20 UA 2 -9 -3.353143
In row six of the output, the predicted arrival delay is 20.429878 minutes;
however, the actual arrival delay is 15 minutes. As expected, there are
discrepancies between the predicted and actual results. It is important to
test the quality of your model to see where predictions are wrong or
different from the actual results.
3. Check the quality of your model.
a. Use the root mean squared deviation (RMSD) error metric.
rmsd <- sqrt(sum((preds$ArrDelay - preds$ArrDelayPred) ^ 2) / nrow(preds))
print(rmsd)
Expected output:
[1] 15.20358
The RMSD shows that the model has a high prediction error. To improve the
model, you can add more predictors, such as the departure and arrival cities.
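As a cross-check, the RMSD formula from step 3a can be evaluated by hand on a few rows. The values below are the first six rows of the prediction table shown earlier; the RMSD over only those rows is naturally smaller than the 15.20358 computed over the full data set:

```python
# RMSD: square root of the mean squared difference between actual and
# predicted arrival delays.
import math

arr_delay = [-14, -6, -9, -8, -7, 15]  # actual ArrDelay, rows 1-6
pred = [-3.353143] * 5 + [20.429878]   # ArrDelayPred, rows 1-6

rmsd = math.sqrt(sum((a - p) ** 2 for a, p in zip(arr_delay, pred)) / len(arr_delay))
print(round(rmsd, 2))  # about 6.01 for these six rows
```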
b. Examine the rows where your model gave the worst predictions.
preds$error <- abs(preds$ArrDelay - preds$ArrDelayPred)
head(sort(preds, by=preds$error, decreasing=T))
Expected output:
The error is very high for the model's worst predictions. The long delays are
probably due to plane maintenance and repair, and the top errors might be
outliers, because the range from the largest error to the sixth-largest error
is over 60 minutes.
Lessons learned
By using text analytics tooling, you can develop, run, and publish extractors that
glean structured information from unstructured documents. The extracted
information can then be analyzed, aggregated, joined, filtered, and managed by
using other InfoSphere BigInsights tools.
In this tutorial, you will extract business information from a series of IBM
quarterly reports, such as the revenue for each IBM division. You can then use that
information in other tools, such as BigSheets, to understand and analyze trends,
and visualize the results in charts or graphs.
You will extract useful information from text documents by using a five-step
process. The tasks that are associated with this process are supported in the
Extraction Tasks view in the Eclipse tools, which gives you a workflow to
follow as you build extractors. The following steps are included in these
lessons:
1. Identify the collection of documents from which you want to extract
information.
2. Analyze the documents to identify examples of the information that you want
to extract.
3. Write AQL statements to extract the identified information.
4. Test and refine the AQL statements.
5. Export the final extractor and deploy to a runtime environment such as
InfoSphere BigInsights or IBM InfoSphere Streams.
Lessons 1, 2, and 3 will introduce you to the Text Analytics features and tooling.
These introductory lessons teach you how to use some basic AQL statements, and
how to manipulate the Text Analytics Workflow perspective and the Extraction
Plan. In the more advanced lessons (Lessons 4, 5, 6, and 7), you refine the AQL,
finalize the extractor, and export the Text Analytics Module (TAM) so that it is
ready to deploy to a runtime system.
Learning objectives
After you complete the lessons in this tutorial, you will understand the concepts
and know how to do the following actions:
v Navigate a Text Analytics project in Eclipse.
Time required
Allow 30 minutes to complete the basic parts of this tutorial. Allow another 45
minutes to complete the more advanced lessons.
You will now create a text analytics project and import the documents.
1. From your desktop, start Eclipse. Click OK to use the default workspace. The
Task Launcher for Big Data opens.
2. Close the Help pane for now if it is visible. You can always get help by
pressing F1, or by selecting Help > Help Contents from the menu bar.
3. In the Task Launcher for Big Data, click the Develop tab, and click Create a
text extractor from the Tasks panel.
4. Create a project called TA_Training.
a. In the New BigInsights Project window, specify TA_Training as the project
name, then click Finish.
b. Click Yes in the message box to switch to the InfoSphere BigInsights
perspective. The Extraction Plan pane is usually visible on the right of the
window. It is the design pane for Text Analytics projects. The Extraction
Tasks pane is usually visible on the left of the window. It is the workflow
for Text Analytics projects. The actual location of the views might depend
on your Eclipse environment.
Learn more about adding views: If the Extraction Tasks view is not visible,
add that view. From the Eclipse menu click Window > Show view >
Extraction Tasks. You can follow the same steps for the Extraction Plan if it
is not visible.
5. Before you can start working with the sample documents, you must bring them
into Eclipse. Open the Project Explorer and expand the TA_Training project.
a. From the Eclipse menu bar select Window > Show view > Project Explorer.
b. Expand the project TA_Training and open the folder textAnalytics.
c. Access the input documents in one of the following ways:
By labeling examples, you also start creating an extraction plan, which is a view of
the design of your extractor. In the extraction plan, you identify, organize, manage,
and navigate elements of the extractor.
As you create the extraction plan by labeling the spans of text (sometimes referred
to as snippets) of interest and their associated clues, you are developing an
understanding of the input documents. It is a good idea to work with someone
who is familiar with the documents during this part of the process.
Since you are interested in extracting revenue by division, you must search the
documents for spans of text that contain this information. As you find and label
examples, be aware of patterns and clues in the text that can help improve the
accuracy of the extractor.
An example that you might find is a phrase such as Revenues from Software were
$3.9 billion. If you labeled this example, you might notice that it has three
important features:
v The term "Software", which is a division name.
v The term "$3.9 billion", which is a revenue amount.
v The term revenue.
You will use all of these features as context to identify instances of revenue by
division.
Labels are meaningful identifiers of the text that you want to extract. Labels also
serve to categorize various clues that help you develop an extractor. There are two
types of labels:
Top level or parent
A span of text that contains the information that you want to extract. An
Chapter 8. Tutorial: Creating an extractor to derive valuable insights from text documents 99
example of a top level identifier is Revenues from Software were $3.9 billion,
which contains clues to a division and the revenue that is associated with
it.
Clues
You decompose the top level identifiers into features and clues. Basic
features are usually parts of the top level or parent that you must extract.
A clue is typically supporting text that provides additional context. In our
example, we would consider the word revenue to be a clue and the division
names and revenue amount would be features.
The process of labeling the document is an iterative process. It helps if you can
work with a subject matter expert who can help you decide if you have identified
enough examples, features, and clues to reliably extract the required information. It
would be unusual to find the same information presented the same way across a
broad set of documents. More often than not, something causes things to change. It
might be a change in the business, a change in regulations or reporting
requirements, a change of writer or editor, a new template for the document, or
simply a change in writing style. Ideally your subject matter expert can alert you
to the changes and variations that you must cover with your text analytics code.
When you read some of the sample input documents, you will see that you have
two basic patterns to deal with: revenues for division were $x.x and division
revenues were $x.x. There are a number of additional variations in the
information around the basic features and clues, but only two basic patterns.
1. Before you start your analysis, you must set up the input documents in the
Extraction Tasks view.
a. Click the Extraction Tasks tab in the left pane of the Text Analytics
Workflow perspective.
b. Expand Step 1 of the Extraction Tasks wizard, Select Data Collection. Click
Browse Workspace and navigate to the ibmQuarterlyReports folder in your
project (TA_Training/textAnalytics/ibmQuarterlyReports). Select the
ibmQuarterlyReports folder, and click OK.
c. From the Language list, select en.
d. Select 4Q2006.txt in the Extraction Tasks wizard. Click Open.
2. Examine the text in the document you just opened by looking for examples that
report revenue by division.
3. Identify RevenueByDivision as the first clue in which you are interested.
a. Search the file until you see the phrase Revenues from the Software segment
were $5.6 billion. Highlight that phrase, right-click, and click Add example
with New Label.
b. In the Add New Label window, type RevenueByDivision in the Label Name
field and leave a Parent Label field blank to make RevenueByDivision the
top level label.
c. Click Finish.
4. Look again at the text from the 4Q2006.txt file. Search for the phrase Revenues
from the Systems and Technology Group (S&TG) segment totaled $7.1 billion, and add
it as another example.
a. Right-click that phrase and click Label Example As.
b. Select RevenueByDivision.
5. You have found two examples of the pattern revenues for division were
$x.x. Now, find an example that refers to the other pattern in which you were
interested. Search for and highlight Global Financing segment revenues
increased 3 percent (flat, adjusting for currency) in the fourth quarter
to $620 million. in the 4Q2006.txt file.
Learn more about the Extraction Plan: You can think of the Extraction Plan as
an interactive design view of your extractor. It helps you to identify, organize,
and navigate the elements that you want to extract. It also helps you write the
associated AQL statements, which makes the Extraction Plan a powerful part of
the design and development process.
a. In the 4Q2006.txt file, select the span of text Revenues from the Software
segment were $5.6 billion. Highlight and right-click the term Revenues, and
select Add Example with New Label.
b. In the Add New Label window, type revenues in the Label Name field.
Type RevenueByDivision as the parent label. Click Finish.
c. In the same span of text, find the phrase $5.6 billion. Right-click that phrase
and click Add example with New Label.
d. In the Add New Label window, type Money in the Label Name field. Type
RevenueByDivision as the parent label. You can also double-click the
RevenueByDivision parent label to use that name as the parent. Click Finish.
7. It is a good idea to decompose clues to the lowest level. In this way, you can let
the powerful text analytics engine and optimizer do more of the work, rather
than writing complex expressions in your code. This action of decomposing
clues can also give you a more robust and flexible solution. Money, which you
labeled in the previous step, is a good example. Money has three basic features:
a currency sign, followed by a number, followed by a quantifier such as million
or billion. Go ahead and create labels for these three features:
a. In the 4Q2006.txt file, find the span of text $5.6 billion which
was part of the original phrase in a previous step. You have already labeled
this phrase Money.
b. Right-click only the currency symbol, $, and click Add example with New
Label. Type Currency in the Label Name field. In the Parent Label field,
type Money.
c. Right-click 5.6 of the same phrase, and select Add example with New
Label and type Number in the Label Name field. In the Parent Label field,
type or select Money.
d. Right-click billion and select Add example with New Label and type
Quantifier in the Label Name field. In the Parent Label field, type or select
Money.
8. You would usually continue analyzing documents, labeling additional examples
and clues until you had seen enough to be confident that you understood the
features, clues, and patterns that you will code for. To save time, use Table 2
on page 102 as a guide to the additional examples and clues that you should
label. Search the documents that are identified and add the labels, noting
which parent each label is a child of.
a. Open the document that is listed in the File column of the table in the
editor.
b. Press Ctrl+F to search for the string that is listed in the Search term column
of the table.
c. For each clue to add as a label, right-click the word or phrase and click Add
example with New Label. Specify the suggested label name in the Label
name column of the table, type the appropriate parent label name, and click
Finish. If you already added the label and want to add an example of the
label, click Label Example As.
d. Close the file.
Table 2. Additional clues to strengthen your extractor

File        Search term                           Label name (as child to RevenueByDivision
                                                  unless otherwise noted)
4Q2006.txt  $7.1 billion                          Money
4Q2006.txt  Systems and Technology Group (S&TG)   Division
4Q2006.txt  Global Technology Services®           Division
4Q2006.txt  million                               Quantifier, as a child to Money
4Q2007.txt  12.5                                  Number, as a child to Money
4Q2009.txt  27.2                                  Number, as a child to Money
4Q2010.txt  Revenue                               Metric
4Q2010.txt  $29.0 billion                         Money
4Q2010.txt  8.7                                   Number, as a child to Money
4Q2010.txt  5.3                                   Number, as a child to Money
Now you are going to write AQL statements to extract the basic features that you
identified during the document analysis process. You will see how you can use a
simple pattern to put the basic features in context to give you candidates. In
subsequent lessons, you use similar techniques to combine features to create
concepts, and expand your AQL to further consolidate and filter the results.
Extractors are written in the Annotation Query Language (AQL), which is the core
of text analytics in InfoSphere BigInsights and InfoSphere Streams. You code
custom extractors in AQL. Text Analytics also includes a library of pre-built
extractors and a sophisticated set of tools. The AQL language is designed
around SQL-like expressions, which makes it familiar and easy to learn.
Learn more about writing AQL: An extractor is a program that is written in AQL
that extracts structured information from unstructured or semistructured text. AQL
is a declarative language, with a syntax that is similar to that of the Structured
Query Language (SQL). For more information about writing AQL, see the AQL
Reference.
If you look at the labels that you created in the Extraction Plan, you see that the
lowest level basic features that you labeled are the three elements of Money: the
currency symbol, a number, and a quantifier. You are now going to write AQL
statements to extract those elements, by using simple extract syntax with
dictionaries and regular expressions. As you will see, AQL allows you to create
views by using extract and select statements; views, extract statements, and
select statements are the three fundamental elements of AQL. It is worth
repeating: in AQL, your data is managed through views, and views are created by
using extract and select statements. In addition, your input data set is
referenced as a view called Document, and its contents are referenced as a
column called text.
1. You will now create views that use extract statements. You create one view for
each of the three basic features of Money.
a. In the extraction plan, right-click the Currency label that you created in the
previous lesson.
b. From the menu, select New AQL Statement > Basic Feature AQL
Statement.
c. In the Create AQL Statement dialog, in the View Name field, specify
Currency.
d. In the AQL Module field, select RevenuebyDivision_BasicFeatures.
e. In the AQL script, specify RevenueBasic.aql for the name of the AQL
script that you will be writing.
f. In the Type field, select Dictionary.
g. Select the Output view check box.
h. Click OK.
2. The RevenueBasic AQL file opens in the editor. The file is populated with
templates to create a dictionary and a view.
Learn more about views: Views are the primary data structures that are used
with AQL statements. AQL statements create views by selecting, extracting,
and transforming information from other views. AQL views are like the views
in a relational database. They have rows and columns just like a database
view. However, AQL views are not materialized by default. In other words,
the result of the views is not viewable output. To see your output, you must
include an output view statement. You reference input data as a view called
Document with one column called text. Think of each document in your input
data set as one row in the Document view with the document content
mapped onto the text column.
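A minimal mental model of the Document view, sketched here in Python rather than AQL (the rows are illustrative snippets in the style of the sample reports):

```python
# Each input document is one row; its content sits in a single 'text' column.
document_view = [
    {"text": "Revenues from the Software segment were $5.6 billion."},
    {"text": "Global Financing segment revenues were $620 million."},
]

# An extract statement conceptually scans the 'text' column of every row.
rows_with_amounts = [row["text"] for row in document_view if "$" in row["text"]]
print(len(rows_with_amounts))  # 2
```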
3. Complete the AQL template to create the dictionary and the view.
a. In the create dictionary line, type or copy the following code to replace
the dictionary template:
create dictionary CurrencyDict
as ('$');
Make sure that you delete the template lines that begin with from file and
with language.
To extract elements from text, you can use regular expressions and
dictionaries. When you want to match text that is based on a pattern, you
use a regular expression. When you can match on defined words, use a
dictionary.
By default, dictionary entries are stored in an external file, which makes it
easier to add and change entries without having to open up the code.
End each AQL statement with a semi-colon. You are changing the
dictionary declaration to be a simple inline declaration. In the example,
when the statement is run, the string is the entry in the CurrencyDict
dictionary.
By using a dictionary file instead of inline terms, you can more easily
modify terms without modifying the code. The create dictionary
statement would change as follows:
create dictionary CurrencyDict
from file 'NewDictionary.dict';
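The difference between the two mechanisms can be sketched outside AQL. In this illustrative Python snippet, a dictionary matches a fixed set of terms by membership, while a regular expression matches any token that fits a pattern:

```python
import re

currency_dict = {"$"}                     # dictionary: defined words
quantifier_dict = {"million", "billion"}  # dictionary: defined words
number_re = re.compile(r"^\d+(\.\d+)?$")  # regex: pattern-based matching

labels = []
for tok in "$ 5.6 billion".split():
    if tok in currency_dict:
        labels.append("Currency")
    elif number_re.match(tok):
        labels.append("Number")
    elif tok.lower() in quantifier_dict:
        labels.append("Quantifier")

print(labels)  # ['Currency', 'Number', 'Quantifier']
```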
b. In the create view template, replace the template with the following code:
create view Currency as
extract
dictionary ’CurrencyDict’ on R.text as match
from Document R;
The create view statement uses an extract statement that finds all matches
of terms in the dictionary that you created. The dictionary matches are
stored in a column named match.
4. Do not change the output view line.
The output view statement materializes the view. By default, views are not
materialized, and output view statements are typically removed when you
optimize for performance. During development, however, you often want to look
at the contents of intermediate views like this one for debugging purposes;
later, you can comment out the output view statements that are not required.
5. Click File > Save from the menu to save your changes. Verify that your AQL
looks like the following code:
module RevenueByDivision_BasicFeatures;
create view Number

Option        Description
View Name     Number
AQL Module    RevenuebyDivision_BasicFeatures
AQL script    RevenueBasic.aql
Type          Regular expression
Output view   Enabled
create view Quantifier

Option        Description
View Name     Quantifier
AQL Module    RevenuebyDivision_BasicFeatures
AQL script    RevenueBasic.aql
Type          Dictionary
Output view   Enabled
10. Modify the RevenueBasic.aql file to correct the two templates that were
added.
a. Update the Number view to look like the following code:
create view Number as
extract regex /\d+(\.\d+)?/
on R.text as match
from Document R;
If you put the dictionary file in a location outside of the module folder,
then you must include the path relative to the project name.
Complete the create view statement by pointing to the dictionary that you
just created, and ensure that dictionary matching is case insensitive. Adding
the IgnoreCase flag ensures that the terms million and Million are both
found. The create dictionary and the create view statements should
look like the following code:
create dictionary QuantifierDict
from file 'Quantifier.dict'
with language as 'en';

create view Quantifier as
extract dictionary 'QuantifierDict'
with flags 'IgnoreCase'
on R.text as match
from Document R;
12. To mark the Number and Quantifier labels as complete in the extraction plan,
right-click each label (Number and Quantifier) in the Extraction Plan and
select Mark Completed.
13. You are now going to extract instances where these three basic features occur
together, which gives you Money. You will do that extraction by using a
pattern to extract candidates for revenue. In the Create AQL Statement dialog,
complete the fields necessary to create a view:
a. Right-click the Money label, select New AQL Statement > Candidate
Generation AQL statement. AQL is modular, which means that you can
package your statements into modules that can then be packaged and
reused. One way to modularize your code is by the type of AQL
statement. By using this design, you would package all basic feature
statements in one module and all candidate generation statements in another
module. The Text Analytics tooling creates default modules to support this
type of modularization. But since the extractor that you are building in
this tutorial is simple, you will package all of your statements into the
RevenuebyDivision_BasicFeatures module.
Learn more about AQL modules: For more information about modules,
see AQL modules
b. Type Money in the View Name field.
c. In the AQL Module field, make sure to specify
RevenuebyDivision_BasicFeatures as the module name.
d. In the AQL script field, type or select RevenueBasic.
e. Specify Pattern in the Type field.
f. Select the Output view check box.
g. Click OK.
14. You are going to use the Currency, Number and Quantifier views in this new
view, and you will reference those views by assigning the variables C, N, and
Q to the Currency, Number, and Quantifier views in the FROM clause. The
pattern specification looks for the currency symbol, followed by a number,
followed by a unit. As a result, the view contains the following code:
create view Money as
extract pattern <C.match> <N.match> <Q.match>
return group 0 as match
from Currency C, Number N, Quantifier Q;
Learn more about patterns: For more information about patterns in AQL, see
Sequence patterns
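As a rough analogy only (a plain Python regular expression, not AQL), the Money sequence pattern corresponds to matching the three features in immediate succession:

```python
# Currency symbol, then a number, then a quantifier, in sequence.
import re

money = re.compile(r"\$\d+(\.\d+)?\s(million|billion)", re.IGNORECASE)
text = "Revenues from the Software segment were $5.6 billion."

print(money.search(text).group(0))  # $5.6 billion
```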
15. Save the file and run the extractor in the usual way.
a. In the Extraction Plan, right-click RevenueByDivision and click Run >
Run the extraction plan on the entire document collection.
b. View the results in the Annotation Explorer.
You see the Money view, with sequential occurrences of a currency sign,
followed by number, followed by a unit. You extracted entities by using a
pattern over the input document and the existing annotations. The Money
view returned 333 rows.
16. To mark the Money label as complete in the extraction plan, right-click the
label Money in the Extraction Plan and select Mark Completed.
17. Optional: From the Annotation Explorer, you can export the extracted views
as HTML or CSV files, and you can highlight any of the extracted entities in
the annotated document view and get the drilldown of the views to which
they belong.
a. Click the Export Results icon in the Annotation Explorer.
If you decide to not continue with the more advanced lessons, you have learned
how to extract basic features and how to use a pattern to extract candidates. In
these first three lessons, you extracted the basic features of money: a currency
symbol, a number, and a quantifier. You used dictionaries, regular expressions, and
patterns. You created and output views and ran the extractor and examined the
output in the annotation explorer.
With these lessons, you were introduced to the fundamentals of Text Analytics and
some key AQL statements. You have also successfully used the tools to identify
instances of Money in IBM quarterly reports.
You will build on the basic features that you defined in previous lessons to extract
revenue by division that is based on the two patterns that you identified during
your initial analysis: revenues for division were $x.x and division revenues
were $x.x.
So far, you successfully extracted all instances of Money. Now you will extract the
basic features of revenue and division.
To generate candidates, you use the extract pattern statement, and build on the
code that you created in the previous lessons.
1. In the previous lesson you extracted Money. Now, you need to extract
instances of revenue and divisions. You extract these basic features by using
dictionaries:
a. Right-click the revenues label and click New AQL Statement. Select Basic
Features AQL statement.
b. Type Revenue in the View Name field.
c. In the AQL Module field, make sure to specify
RevenuebyDivision_BasicFeatures as the module name.
d. In the AQL script field, type RevenueBasic.
e. Specify Dictionary in the Type field.
f. Select the Output view check box.
g. Click OK.
Copy or paste the following code to replace the template:
create dictionary RevenueDict
as (’revenues’, ’revenue’);
...
7. Save the file, and run the extractor in the usual way. In the Annotation
Explorer, the division names now look correct. There are now 95 rows
returned.
8. Mark the labels revenues and Division as complete.
9. You have now extracted the three key basic features: money, revenue, and
division. The next step is to extract candidates that match the two patterns
that you identified earlier.
10. You will use patterns in your code to put the information from the three
views Money, Revenue, and Division in context. If you remember, in “Lesson 2:
Selecting input documents and labeling examples” on page 99, part of your
goal was to find both of the following patterns: revenues for division were
$x.x and division revenues were $x.x. The first pattern looks for examples
where the word revenue is followed by a division name and then a money
amount, with some number of tokens in between each basic feature. For
example, Revenues from the Systems and Technology Group (S&TG) segment
totaled $7.1 billion.
extract pattern <R.match><Token>{1,2}<D.match><Token>{1,20}<M.match>
The second pattern looks for examples where a division name is followed by
the word revenue and a money amount, with some number of tokens in
between each basic feature. For example, Global Financing segment revenues
increased 3 percent (flat, adjusting for currency) in the fourth
quarter to $620 million.
extract pattern <D.match><Token>{1,3}<R.match><Token>{1,30}<M.match>
After you have matched both patterns and have a full set of candidates, you
can union them together into a single view.
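As a plain-Python analogy (not AQL), the two candidate patterns can be approximated with bounded word gaps standing in for <Token>{1,n}; the division names here are illustrative assumptions:

```python
import re

money = r"\$\d+(?:\.\d+)?\s(?:million|billion)"
division = r"(?:Software|Global Financing)"  # illustrative division list
gap = r"(?:\S+\s){0,20}?"                    # bounded token gap

# Pattern 1: revenues ... division ... $x.x
p1 = re.compile(rf"[Rr]evenues\s{gap}({division})\s{gap}({money})")
# Pattern 2: division ... revenues ... $x.x
p2 = re.compile(rf"({division})\s{gap}revenues\s{gap}({money})")

s1 = "Revenues from the Software segment were $5.6 billion."
s2 = "Global Financing segment revenues increased 3 percent in the fourth quarter to $620 million."

# The union of both patterns' results gives the full candidate set.
candidates = [m for s in (s1, s2) for m in (p1.search(s), p2.search(s)) if m]
print([m.group(2) for m in candidates])  # ['$5.6 billion', '$620 million']
```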
a. Right-click the RevenueByDivision label and click New AQL Statement.
Select Candidate Generation AQL statement. Complete the Create AQL
Statement dialog with the following information:
View name
RevenueAndDivision
Module name
RevenuebyDivision_BasicFeatures
Script name
RevenueCandidate.aql
create view RevenueAndDivision as
extract pattern
<R.match> <Token>{1,2}
(<D.match>) <Token>{1,20} (<M.match>)
return group 0 as match
and group 1 as Division and group 2 as Amount
from Revenue R, Division D, Money M;
Learn more about the value of the AQL templates: The AQL templates
reduce the need to look up syntax, retype the same expressions multiple
times, and debug spelling mistakes.
14. As you finalize the extractor, you no longer need the intermediate views. If
users of your AQL module need to materialize any of your exported views, they
can add output view statements in their own code. You can comment out the
output view statements so that the optimizer knows that the intermediate views do
not need to be materialized. From the Project Explorer, edit the
RevenueBasic.aql and the RevenueCandidate.aql files and comment out the
output view statements:
a. From the Project Explorer, find the RevenueBasic.aql file and open it.
b. Add two dashes before the words output view for each of those
statements. This comments out the entire line so that it is not compiled.
c. Click File > Save.
d. From the Project Explorer, find the RevenueCandidate.aql file and repeat
the process of adding comments in front of the output view statements.
e. Click File > Save.
5) Click OK.
6) After you specify the pre-built extractor libraries, you can extend your
Revenue extractor by using one of the Named entity views, such as
Organization. Include the following statement at the top of the
RevenueCandidate.aql script, immediately after the module declaration:
import view Organization
from module BigInsightsExtractorsExport
as Organization;
The Organization extractor identifies mentions of organization names.
After importing the view, then add this view in your
RevenueCandidate.aql script:
create view myOrg
as
select
GetText (R.organization) as TheOrg
from Organization R;
output view myOrg;
The result shows you all of the organizations that are mentioned in the
input text.
In this code, you are consolidating the output from the view
AllRevenueByDivision to remove duplicate entries.
The output should show 25 rows. This view contains exactly the information that
you need for further analysis. When you apply text analytics to more complex
documents, and when you are extracting more sophisticated information, you
would expect to spend time improving the precision and recall of your extractor.
You can also profile your extractor to understand and improve its performance
characteristics. There are utilities in the Text Analytics Workflow perspective to
help with both of these tasks.
Learn more about some of the Text Analytics utilities: In the InfoSphere
BigInsights Eclipse Text Analytics Workflow perspective, you can find help with
several of the Text Analytics utilities. The following is a list of some of the utilities
that you might want to explore in the Help contents:
Annotation Difference Viewer
Displays a side-by-side comparison of the extracted results from the same
input file. You can use the Annotation Difference Viewer to understand
how modifying the AQL statements in an extractor affects the results. Also,
you can use the Annotation Difference Viewer to understand how the
extracted results compare with a labeled data collection.
Provenance View
Displays the results from viewing the lineage of analysis results and is
useful for understanding the results of an extractor. It explains in detail the
provenance, or lineage of an output tuple, that is, how that output tuple is
generated by the extractor. You access the Provenance View through the
Result Table View.
Profiler View
Helps you to troubleshoot performance problems in the AQL code. The
Profiler also calculates the throughput of the extractor (in KB/second) by
dividing the size of the data that was processed by the total duration of
the Profiler execution.
Pattern Discovery View
Displays results from discovering patterns in text input. Pattern discovery
identifies contextual clues from documents in the data collection that help
you refine the accuracy and coverage of an extractor.
Explain Module View
Displays the metadata of the module and the compiled form of the
extractor.
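The throughput figure that the Profiler reports can be sketched as a simple calculation (data size divided by total run duration; the Profiler's internal accounting may differ):

```python
def throughput_kb_per_sec(bytes_processed: int, duration_seconds: float) -> float:
    """Throughput as data size divided by total run duration, in KB/second."""
    return (bytes_processed / 1024) / duration_seconds

# A 10 MB run that takes 20 seconds works out to 512.0 KB/second.
print(throughput_kb_per_sec(10 * 1024 * 1024, 20))
```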
You can now use this extractor in new modules that you create in the Eclipse
environment. You can also publish and then deploy the extractor to the InfoSphere
BigInsights Console, as you will learn in the next lesson.
Before you begin, complete “Lesson 6: Finalizing and exporting the extractor,”
and ensure that you have access to an IBM InfoSphere BigInsights server.
1. From the Text Analytics Workflow perspective, open the Project Explorer.
2. Right-click the TA_Training project.
3. Click BigInsights Application Publish.
4. In the BigInsights Application Publish wizard, complete the workflow
information:
The application is placed in the IBM InfoSphere BigInsights server. Open the
InfoSphere BigInsights Console and open the Applications tab. Click Manage to
find your application.
Lessons learned
v How to use the Text Analytics development process and the supporting tools.
v How to analyze text documents to populate an extraction plan by identifying
interesting text and clues.
v How to create and test AQL scripts to extract candidates.
v How to create AQL statements to filter the candidates to extract useful insights.
Additional resources
There are articles on IBM developerWorks that give you further information about
Text Analytics.
v Analyzing social media and structured data with InfoSphere BigInsights
v Analyze text from social media sites with InfoSphere BigInsights: Use
Eclipse-based tools to create, test, and publish text extractors
In this tutorial, you download and run sample machine data from the servers that
host the IBM Watson website. This Watson data represents a controlled data set
and is composed of raw web access log files that contain HTTP requests to the
Watson site. Each log record contains information about the request, for example
the date and time of the request, the requested page, and the result of the request.
This tutorial guides you through a typical use case of downloading, importing, and
extracting raw sample data files and then indexing, searching, and analyzing those
files to understand patterns of errors and events. You can apply the same process
of importing, preparing, and analyzing to your own machine data.
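A record like the ones described above can be pulled apart with a short sketch. The line below is a made-up example in common web access log format; the actual layout of the Watson sample files may differ:

```python
import re

# Hypothetical request line; the real Watson sample logs may use another format.
LINE = '203.0.113.7 - - [12/Mar/2014:10:15:32 -0500] "GET /watson/index.html HTTP/1.1" 200 5123'

PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<datetime>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) \S+" (?P<status>\d{3}) (?P<size>\d+)'
)

record = PATTERN.match(LINE).groupdict()
# record now holds the date and time, the requested page, and the result code.
print(record['datetime'], record['path'], record['status'])
```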
Learning objectives
After completing the lessons in this module, you will understand the concepts and
processes associated with:
v Identifying and preparing your data
v Extracting meaningful information from your data
v Indexing and searching your data
v Understanding the impact of events (for example, time periods or outages) on
errors
v Viewing the results of your analysis in workbooks, charts, and dashboards.
Time required
Prerequisites
You must install IBM InfoSphere BigInsights and IBM Accelerator for Machine
Data Analytics.
Learn more about the Data Download application: The Data Download
application is a sample application that ships with InfoSphere BigInsights. It
downloads sample data sets that are used in tutorials from IBM developerWorks.
For more information about this application, see the Data Download application.
1. Open the InfoSphere BigInsights Console.
2. Select the Applications tab.
3. Locate and select the Data Download application:
a. In the Search field, type Data Download.
b. Optional: If the Data Download application has not been run, it may not be
available in the list, and you need to deploy it before you can run it.
4. Complete the required application parameters:
a. In the Execution name field, enter watson as the name for this execution of
the application. An execution name saves the parameter values for this run of
the application so that you can run the application again with the same
parameters.
b. To accept the download terms and conditions, select the Agree to terms
check box.
c. From the Select data set drop-down menu, select Sample log data set.
d. In the Target directory field, enter /watson as the distributed file system
directory where you want to save the output file.
5. To run the application, click Run.
6. In the Application History panel in the lower half of the window, monitor the
progress of the application.
The sample data and metadata file are downloaded and uncompressed to the
/watson/input/batch_watson directory, and the sample configuration files are
downloaded to the /watson/config directory.
View the downloaded sample data. On the Files tab, navigate to the
/watson/input/batch_watson directory. If the Files tab is already open, you might
need to click the Refresh icon ( ). The batch_watson directory contains two
files:
log.txt
Contains the contents of the downloaded sample data files, by default, in
text format. For each HTTP request to the Watson site, the sample data
contains a line that shows information about that request, for example the
IP address of the proxy server, the date and time of the request, the request
itself, the path of the requested page, the result of the request, and the
requesting client. To view the data in a grid-like format, click the Sheet
radio button:
Tip: To ensure that the data displays across the entire viewing pane, click
Fit Column(s).
This view organizes the information into a grid-like format with rows for each
record and columns for each aspect of the file. Notice the normalized DateTime
format, URI paths, and CodesAndValues data:
Lesson checkpoint
The application read the files in the input directory, extracted fields from your
sample data according to the specifications in the extract.config file, and
generated output files to the specified output directory. Then, you viewed the
extracted results.
a. Click the Copy icon ( ), navigate to the location where you want to
save the copy, then click OK.
b. Navigate to the copy of the sample_logmonitoring_connections.properties
file, and click Edit.
c. Update the file with the password, user name, and host name of the node
where the InfoSphere BigInsights Console is installed, for example:
#BigInsights Credential Store file
#Contains the Console Node Host ID,
#the login Username/Password for the console node
password=mypassword
username=biadmin
host=ConsoleNodeHostID
d. Click Save.
Learn more about the Index application: The Index application is an application
that ships with IBM Accelerator for Machine Data Analytics. It creates facets for
each field in a batch of extracted machine data and uses the facets to create
indexes of the batches. For more information about this application, see Running
the Index application.
1. In the InfoSphere BigInsights Console, select the Applications tab.
2. To locate and select the Index application, type Index in the Search field. If the
Index application is not available, deploy the application.
3. Complete the required application parameters:
a. In the Execution name field, enter watson.
b. In the Operations drop-down menu, select All.
c. In the Source Directory field, enter /watson/output/extract_out as the
top-level directory that contains the subdirectory that holds your batch data.
d. In the Output Directory field, enter /watson/output/index_out as the
directory that will contain the output of the Index application.
e. In the Index.config File field, point to the /watson/config/index_config/
index.config file.
f. In the Credentials File Path field, specify the path for the modified
sample_logmonitoring_connections.properties file.
View your indexed results. On the Welcome tab of the InfoSphere BigInsights
Console, click Search machine data under the Quick Links section. The Faceted
Search user interface opens in a new window and displays 31,469 results:
Learn more about Faceted Search: The Faceted Search UI enhances search results
by:
v Identifying the file name, type, batch ID, URL, timestamp, and other pertinent
information about the files that contain your search term
v Filtering entries by text, time range, and date range across servers and time
zones
v Displaying the results in graph form, categorized by the months that you specify
in the filter
v Aiding in analysis and troubleshooting.
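The core of faceted filtering can be sketched in a few lines; the field names and records here are invented for illustration, not the accelerator's actual schema:

```python
from collections import Counter

def facet_counts(records, field):
    """Count the distinct values of one field; a faceted UI shows these
    counts next to each filter choice."""
    return Counter(r[field] for r in records if field in r)

def apply_facets(records, **selected):
    """Keep only the records that match every selected facet value."""
    return [r for r in records if all(r.get(f) == v for f, v in selected.items())]

logs = [
    {"type": "error", "host": "web1"},
    {"type": "error", "host": "web2"},
    {"type": "ok", "host": "web1"},
]
print(facet_counts(logs, "type"))                      # error: 2, ok: 1
print(apply_facets(logs, type="error", host="web1"))   # one matching record
```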
Sessions are sets of data that are grouped by the following characteristics:
v A partition key, which identifies and divides a data file
v A time window, or time gap, that exceeds user-specified thresholds.
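The two grouping rules above can be sketched as follows; the 30-minute default gap is an assumption for illustration, not the accelerator's actual threshold:

```python
from datetime import datetime, timedelta

def sessionize(events, gap=timedelta(minutes=30)):
    """Group (partition_key, timestamp) events into sessions: a new session
    starts whenever the time since the previous event for the same key
    exceeds the gap threshold."""
    sessions = {}
    for key, ts in sorted(events):
        runs = sessions.setdefault(key, [])
        if runs and ts - runs[-1][-1] <= gap:
            runs[-1].append(ts)   # within the window: extend the session
        else:
            runs.append([ts])     # gap exceeded (or first event): new session
    return sessions

events = [("web1", datetime(2014, 1, 1, 10, 0)),
          ("web1", datetime(2014, 1, 1, 10, 10)),
          ("web1", datetime(2014, 1, 1, 11, 0))]
print(sessionize(events))   # two sessions: the 50-minute gap starts a new one
```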
A chained application is two or more linked, or chained, applications, such that the
output of one application is the input to the next.
Next, the application analyzes the transformed sessions and identifies frequently
occurring sequences of events to illustrate potential causes and patterns of errors.
For more information about this application, see Running the Time Window
Transformation - Frequent Sequence Analysis chained application.
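A much-simplified sketch of frequent sequence counting, using contiguous pairs of events as a stand-in for the accelerator's actual mining algorithm:

```python
from collections import Counter

def frequent_sequences(sessions, length=2, min_support=2):
    """Count contiguous event subsequences of a given length across sessions
    and keep those that occur at least min_support times."""
    counts = Counter()
    for events in sessions:
        for i in range(len(events) - length + 1):
            counts[tuple(events[i:i + length])] += 1
    return {seq: n for seq, n in counts.items() if n >= min_support}

# Invented event names for illustration only.
sessions = [
    ["login", "search", "error404"],
    ["login", "search", "checkout"],
]
print(frequent_sequences(sessions))   # ('login', 'search') occurs in both
```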
1. In the InfoSphere BigInsights Console, select the Applications tab.
2. To locate and select the Time Window Transformation - Frequent Sequence
Analysis chained application, type Time Window Transformation - Frequent
Sequence Analysis in the Search field. If the Time Window Transformation -
Frequent Sequence Analysis application is not available, deploy the application.
3. Complete the required application parameters:
a. In the Execution name field, enter watson as the name for this run of the
application.
b. In the Applications drop-down menu, select All.
c. In the Time Window input path field, enter /watson/output/extract_out as
the directory of the output of the Extraction application.
d. In the Time Window output path field, enter /watson/output/timegap_out
as the directory where you want to save the output of the Time Window
Transformation.
e. In the Time Window configuration file field, point to the
/watson/config/timeGapTransform_config/timeGapTransform.config file.
This configuration file is specific to the sample data and controls how the
application analyzes the time windows.
f. In the Frequent Sequence output path field, enter /watson/output/
output_freqseq_out as the directory where you want to save the output file
for the Time Window Transformation - Frequent Sequence Analysis chained
application.
g. In the Frequent Sequence configuration file field, point to the
/watson/config/frequentSequence_config/frequentSequenceMiner.config
file. This configuration file is specific to the sample data and controls how
the application analyzes the frequently occurring sequences of events.
4. Add a new row to the Frequent Sequences workbook to contain the output of
this application:
On the BigSheets tab of the InfoSphere BigInsights Console, you can view data in
a workbook and chart to visually represent your data. On the Dashboard tab, you
can create a dashboard to see multiple graphical charts simultaneously and
compare results across various views.
Learn more about the visualization options: Workbooks display data from a
results file in sheet-like representations with rows for each record and columns for
different aspects of the data. Charts display data either in a grid with X and Y axes
or in a pie chart. Dashboards can be customized to show multiple charts if, for
example, you want to compare results from multiple runs over time or different
application runs after you update the application configuration files.
1. On the BigSheets tab of the InfoSphere BigInsights Console, select the
FrequentSequences workbook. This workbook shows the results of the Time
Window Transformation - Frequent Sequence Analysis chained application,
which are frequently occurring sequences of events, in a grid-like format:
Tip: You may have to click Fit column(s) and scroll to the right to see all the
workbook columns.
3. On the Dashboard tab, select Machine Data Accelerator from the Select
dashboard drop-down menu. Right now, you see only one widget in the
dashboard because you have run only one application. In the next lesson, you
run another application and then view the results of both applications. At that
point, the dashboard will have two widgets:
The Join Transformation application joins your machine data records into sessions,
and the Significance Analysis application identifies correlations within those
sessions.
Next, the application reads the output of the Join Transform application and finds
correlations between the URIs and CodesAndValues. These correlations indicate
the significance of URI events and errors.
For more information about this application, see Running the Join Transformation -
Significance Analysis chained application.
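One simple way to score such correlations is lift, sketched below; the accelerator's actual significance measure is not documented here, so this is illustrative only:

```python
from collections import Counter

def cooccurrence_lift(pairs):
    """Score how strongly each (URI, code) pair co-occurs relative to what
    independent occurrence would predict; lift > 1 suggests a correlation."""
    total = len(pairs)
    uri_counts = Counter(u for u, _ in pairs)
    code_counts = Counter(c for _, c in pairs)
    return {
        (u, c): (n / total) / ((uri_counts[u] / total) * (code_counts[c] / total))
        for (u, c), n in Counter(pairs).items()
    }

# Invented observations: each pair is (requested URI, result code).
observations = [("/search", "500"), ("/search", "500"),
                ("/home", "200"), ("/home", "200")]
print(cooccurrence_lift(observations)[("/search", "500")])   # 2.0
```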
1. In the InfoSphere BigInsights Console, select the Applications tab.
2. To locate and select the Join Transformation - Significance Analysis chained
application, type Join Transformation - Significance Analysis
in the Search field. If the Join Transformation - Significance Analysis
application is not available, deploy the application.
3. Complete the required application parameters:
a. In the Execution name field, enter watson.
b. In the Applications drop-down menu, select All.
c. In the Event log and Context log fields, enter /watson/output/extract_out
as the output directory of the Extraction application.
d. In the Join output path field, enter /watson/output/join_out as the
directory where you want to save the output file for the Join Transformation
application.
e. In the Join configuration file field, point to the /watson/config/
joinTransform_config/joinTransform.config file. This configuration file is
specific to the sample data and controls how the application joins the
sessions in the machine data.
f. In the Significance output path field, enter /watson/output/
significance_out as the directory where you want to save the output for
the Significance Analysis application.
g. In the Significance configuration file field, point to the
/watson/config/significanceAnalysis_config/
significanceAnalysis.config file. This configuration file is specific to the
sample data and controls how the application analyzes correlations between
URIs and CodesAndValues.
4. Add a new row to the Significance Analysis workbook to contain the output of
this application:
a. Under Schedule and Advanced Settings, select the Update Workbook check
box.
b. From the Select Workbook drop-down menu, select SignificanceAnalysis.
c. From the Select Output drop-down menu, select Significance Output Path,
and click Add row.
5. Scroll to the top of the Applications tab, and click Run. Because this
application is processing parameters from two separate applications and
because of the complexity of the transformation and analysis, it might take
several minutes to run.
n. To save the chart settings, click the green checkmark ( ), then click
Run. You see a bar chart that calculates the total number of each Watson
feature (column 2 in the SignificanceAnalysis workbook) in the sample
data:
e. To save the Machine Data Accelerator dashboard, click the Save icon (
).
Lessons learned
In this tutorial, you download and analyze sample social data. The sample data
contains tweets, in Twitter format, about Sample Outdoor Company, a fictional
company that sells outdoor recreation equipment, sporting goods, and clothing.
Your goal is to identify consumer perceptions of Sample Outdoor Company
products.
Learning objectives
After completing the lessons in this module, you will understand the concepts and
processes associated with:
v Identifying and preparing your data
v Identifying meaningful information in your data
v Viewing your analysis in a workbook, chart, and dashboard.
Time required
Prerequisites
1. Install IBM InfoSphere BigInsights.
2. Install IBM InfoSphere Streams.
3. Install IBM Accelerator for Social Data Analytics.
The sample data for this tutorial is available on the IBM developerWorks website.
This batch of sample social data represents a set of text files that contains tweets
about a fictional company, Sample Outdoor Company.
Learn more about the Data Download application: The Data Download
application is a sample application that ships with InfoSphere BigInsights. It
downloads sample data sets that are used in tutorials from IBM developerWorks.
For more information about this application, see the Data Download application.
1. Open the InfoSphere BigInsights Console.
2. Select the Applications tab.
3. Locate and select the Data Download application:
a. In the Search field, type Data Download.
b. Optional: If the Data Download application has not been run, it may not be
available in the list, and you need to deploy it before you can run it.
4. Complete the required application parameters:
a. In the Execution name field, enter SocialSampleDataDownload as the name
for this execution of the application. An execution name saves the parameter
values for this run of the application so that you can run the application
again with the same parameters.
b. Select the Agree to terms check box to accept the download terms and
conditions.
c. From the Select Data Set drop-down list, select Sample social data set.
d. In the Target directory field, enter /BigOut/data/raw_tweets/ as the
distributed file system directory where you want to save the output file.
5. Click Run to run the application.
6. In the Application History panel in the lower half of the window, monitor the
progress of the application.
Result
Tip: You may have to click Fit Column(s) so the data displays across
the entire viewing pane.
CourseProGolfBag_Neg_tweets3.dat
Contains the contents of the downloaded
CourseProGolfBag_Neg_tweets3.dat sample data file, by default, in text
format. The file contains negative feedback from Twitter about the
Sample Outdoor Company Course Pro Golf Bag. This file contains the
same details about the Twitter message that the All_Products_tweets1
file does, but it is focused on negative buzz, sentiment, and feedback
about a particular product, the Course Pro Golf Bag. Click the Sheet
radio button to view the data in a grid-like format:
CourseProGolfBag_Pos_tweets3.dat
Contains the contents of the downloaded
CourseProGolfBag_Pos_tweets3.dat sample data file, by default, in text
format. The file contains positive feedback from Twitter about the
Sample Outdoor Company Course Pro Golf Bag. This file contains the
same details about the Twitter message that the All_Products_tweets1
and CourseProGolfBag_Neg_tweets3.dat files do, but it is focused on
positive buzz, sentiment, and feedback about a particular product, the
Course Pro Golf Bag. Click the Sheet radio button to view the data in a
grid-like format:
Due to the inherently short length of tweets, Twitter users often abbreviate
common terms, phrases, and proper names. Tweets also often contain words that
might be extraneous to the analytical context. To ensure that the IBM Accelerator
for Social Data Analytics applications properly analyze the tweets, it is important
that you define and update the search objects file and the aliases file.
1. On the Files tab of the InfoSphere BigInsights Console, navigate to and select
the /BigOut/data/raw_tweets/config/searchobjects_BigOutdoors.csv file. If
the Files tab is already open in another window, you may need to click the
Refresh icon ( ).
The rest of the file contains information about the search object and its features,
one per line, following the schema.
2. Click Edit.
3. Add the following text at the end of the file:
TRAILCHEF TENT|CAMPING EQUIPMENT||TRAILCHEF|CAMPING GEAR||TENT
This line adds the tent search object to the file. In the next lesson, the Brand
Management Retail Configure – Local – Global Analysis chained application
searches for and returns matches to these search objects.
4. Click Save.
5. On the Files tab, navigate to and select the /BigOut/data/raw_tweets/config/
aliases_BigOutdoors.csv file.
Learn more about the aliases file: The aliases_BigOutdoors.csv file lists aliases,
separated by a caret (^), followed at the end of the line by the actual search
object or search object feature, separated by the pipe character (|). An
aliases file maps abbreviations and alternative references to the brand (company
This line adds the Great Outdoors brand name to the alias file, enabling the
Brand Management Retail Configure – Local – Global Analysis chained
application to capture all references to the Great Outdoors brand.
Lesson checkpoint
By modifying and noting changes to the CSV files, you successfully configured
how your applications will analyze the sample data. Next, you run the Brand
Management Retail Configure – Local – Global Analysis chained application to
identify user sentiment in the data files.
A chained application is two or more linked, or chained, applications, such that the
output of one application is the input to the next.
The Configuration application enables you to set parameters and search terms for
feedback analysis, or analysis of expressions of sentiment, buzz, intent to purchase,
and other dimensions of your analysis.
The Local Analysis application performs narrow but deep data analysis, running
all the annotators and extractors that were compiled by the Configuration
application on one tweet, blog post, or board post at a time and generating
feedback and profile information.
The Global Analysis application aggregates the results of multiple runs of the Local
Analysis application to generate either a global profile of a user or a global profile
of a user associated with feedback of that user. The global profile is based on
boards, blogs, and tweets and includes both demographic and behavioral
information.
Learn more about the Configure - Local - Global chained application: The
Configure - Local - Global chained application links the Configuration, Local
Analysis, and Global Analysis applications, enabling you to configure, or
customize your analysis, and then run local and global analysis at the same time.
For more information about this chained application, see Running the Configure -
Local - Global chained application.
1. In the InfoSphere BigInsights Console, select the Applications tab.
2. To locate and select the Brand Management Retail Configure – Local – Global
Analysis chained application, type Brand Management Retail Configure –
Local – Global Analysis in the Search field. If the Brand Management Retail
Configure – Local – Global Analysis chained application has not been run, it
This view organizes the information into a grid-like format with rows for each
record and columns for each aspect of the file. You may have to click Fit
column(s) to arrange the columns more evenly across the viewing pane. Notice
the columns for the search object and the category, brand, and product of the
search object. If you scroll to the right of the file, you see additional feedback
dimensions, like whether the feedback contains buzz, sentiment, and ownership
information, the date and time of the message, and the original message text:
This view organizes the information into a grid-like format with rows for each
record and columns for each aspect of the file. Notice the columns for the
search object and the category, brand, and product of the search object. If you
scroll to the right of the file, you see additional feedback dimensions, like
whether the feedback contains buzz, sentiment, and ownership information, the
date and time of the message, and the original message text:
Lesson checkpoint
You successfully ran the analysis, according to the configurations in the
searchobjects_BigOutdoors.csv and aliases_BigOutdoors.csv files, on the sample
data. In the next lesson, you view general Twitter feedback about Sample Outdoor
Company products and positive and negative user feedback about the Course Pro
Golf Bag in the BM_Retail_GA_Feedback workbook.
Tip: You may have to click Fit column(s) and scroll to the right to see all
the workbook columns.
c. Optional: You can repeat this process to view the results in the following
charts:
4. Optional: If you want to see the workbook from which any of these charts are
derived, click the blue arrow in the upper-right toolbar of the chart name to
open the chart on the BigSheets tab. You see the output of the Global Analysis
application that was used to create the chart as a worksheet. Scroll through the
columns to see the data that is associated with each search object, the polarity
(positive or negative) of the tweet, the tweet text, and information about the
person who sent the tweet, such as screen name and geography. This
information is the beginning portion of the consumer profile.
Lessons learned
Notices
IBM may not offer the products, services, or features discussed in this document in
other countries. Consult your local IBM representative for information on the
products and services currently available in your area. Any reference to an IBM
product, program, or service is not intended to state or imply that only that IBM
product, program, or service may be used. Any functionally equivalent product,
program, or service that does not infringe any IBM intellectual property right may
be used instead. However, it is the user's responsibility to evaluate and verify the
operation of any non-IBM product, program, or service.
IBM may have patents or pending patent applications covering subject matter
described in this document. The furnishing of this document does not grant you
any license to these patents. You can send license inquiries, in writing, to:
The following paragraph does not apply to the United Kingdom or any other
country where such provisions are inconsistent with local law:
INTERNATIONAL BUSINESS MACHINES CORPORATION PROVIDES THIS
PUBLICATION "AS IS" WITHOUT WARRANTY OF ANY KIND, EITHER
EXPRESS OR IMPLIED, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED
WARRANTIES OF NON-INFRINGEMENT, MERCHANTABILITY OR FITNESS
FOR A PARTICULAR PURPOSE. Some states do not allow disclaimer of express or
implied warranties in certain transactions, therefore, this statement may not apply
to you.
IBM may use or distribute any of the information you supply in any way it
believes appropriate without incurring any obligation to you.
Licensees of this program who wish to have information about it for the purpose
of enabling: (i) the exchange of information between independently created
programs and other programs (including this one) and (ii) the mutual use of the
information which has been exchanged, should contact:
IBM Corporation
J46A/G4
555 Bailey Avenue
San Jose, CA 95141-1003 U.S.A.
The licensed program described in this document and all licensed material
available for it are provided by IBM under terms of the IBM Customer Agreement,
IBM International Program License Agreement or any equivalent agreement
between us.
All statements regarding IBM's future direction or intent are subject to change or
withdrawal without notice, and represent goals and objectives only.
This information is for planning purposes only. The information herein is subject to
change before the products described become available.
This information contains examples of data and reports used in daily business
operations. To illustrate them as completely as possible, the examples include the
names of individuals, companies, brands, and products. All of these names are
fictitious and any similarity to the names and addresses used by an actual business
enterprise is entirely coincidental.
COPYRIGHT LICENSE:
Each copy or any portion of these sample programs or any derivative work, must
include a copyright notice as follows:
© (your company name) (year). Portions of this code are derived from IBM Corp.
Sample Programs. © Copyright IBM Corp. _enter the year or years_. All rights
reserved.
If you are viewing this information softcopy, the photographs and color
illustrations may not appear.
Depending upon the configurations deployed, this Software Offering may use
session or persistent cookies. If a product or component is not listed, that product
or component does not use cookies.
Table 3. Use of cookies by InfoSphere Information Server products and components

Product module: Any (part of InfoSphere Information Server installation)
Component or feature: InfoSphere Information Server web console
Type of cookie that is used: Session; Persistent
Collect this data: User name
Purpose of data: Session management; Authentication
Disabling the cookies: Cannot be disabled

Product module: Any (part of InfoSphere Information Server installation)
Component or feature: InfoSphere Metadata Asset Manager
Type of cookie that is used: Session; Persistent
Collect this data: No personally identifiable information
Purpose of data: Session management; Authentication; Enhanced user usability; Single sign-on configuration
Disabling the cookies: Cannot be disabled
If the configurations deployed for this Software Offering provide you as customer
the ability to collect personally identifiable information from end users via cookies
and other technologies, you should seek your own legal advice about any laws
applicable to such data collection, including any requirements for notice and
consent.
For more information about the use of various technologies, including cookies, for
these purposes, see IBM's Privacy Policy at http://www.ibm.com/privacy, IBM's
Online Privacy Statement at http://www.ibm.com/privacy/details (in the section
entitled “Cookies, Web Beacons and Other Technologies”), and the “IBM
Software Products and Software-as-a-Service Privacy Statement” at
http://www.ibm.com/software/info/product-privacy.
IBM, the IBM logo, and ibm.com® are trademarks or registered trademarks of
International Business Machines Corp., registered in many jurisdictions worldwide.
Other product and service names might be trademarks of IBM or other companies.
A current list of IBM trademarks is available on the Web at www.ibm.com/legal/
copytrade.shtml.
UNIX is a registered trademark of The Open Group in the United States and other
countries.
Java and all Java-based trademarks and logos are trademarks or registered
trademarks of Oracle and/or its affiliates.
The United States Postal Service owns the following trademarks: CASS, CASS
Certified, DPV, LACSLink, ZIP, ZIP + 4, ZIP Code, Post Office, Postal Service, USPS
and United States Postal Service. IBM Corporation is a non-exclusive DPV and
LACSLink licensee of the United States Postal Service.
Your feedback helps IBM to provide quality information. You can use any of the
following methods to provide comments:
Procedure
v Send your comments by using the online readers' comment form at
www.ibm.com/software/awdtools/rcf/.
v Send your comments by e-mail to comments@us.ibm.com. Include the name of
the product, the version number of the product, and the name and part number
of the information (if applicable). If you are commenting on specific text, include
the location of the text (for example, a title, a table number, or a page number).
Printed in USA
GC19-4104-03