with Informatica
PowerCenter 7.1.2
This document discusses configuration and how-tos for PowerCenter 7.1.2 with NCR’s Teradata
RDBMS. It covers Teradata basics and describes some “tweaks” that experience has shown
may be necessary to deal with some of the “common” practices one may encounter at
a Teradata account. The Teradata documentation (especially the MultiLoad, FastLoad and TPump
references) is highly recommended reading, as is the “External Loader” section of the
PowerCenter Server Manager Guide.
Additional Information: All Teradata documentation can be downloaded from the NCR web site
(http://www.info.ncr.com/Teradata/eTeradata-BrowseBy.cfm); it is also available on the Informatica
Tech Support website (tsspider.informatica.com/Docs/page1.html). There is a nice Teradata FAQ in
the Informatica Tech Support knowledge base (it contains a section on how to handle “timestamp”
columns). Finally, there is a “Teradata Forum” that provides a wealth of sometimes useful
information (http://www.Teradataforum.com).
Teradata Basics
Teradata is a relational database management system from NCR. It offers high performance for
very large database tables because of its highly parallel architecture. It is a major player in the
retail space. While Teradata can run on other platforms, it is predominantly found on NCR hardware
(which runs NCR’s version of Unix). It is very fast and very scalable.
Teradata Hardware
The NCR computers on which Teradata runs support both MPP (Massively Parallel Processing) and
SMP (Symmetric Multi-Processing). Each MPP “node” (or semi-autonomous processing unit) can
support SMP.
Teradata can be configured to communicate directly with a mainframe’s I/O channel; this is known
as “channel attached”. Alternatively, it can be “network attached”, that is, configured to
communicate via TCP/IP over a LAN. Since PowerCenter runs on Unix, most of the time you will be
dealing with a “network attached” configuration. Once in a while, however, a client will want to use
an existing “channel attached” configuration on the grounds of better performance. Do not
assume that “channel attached” is always faster than “network attached”: similar
performance has been observed over a channel attachment and over a 100MB LAN. In addition,
“channel attached” requires an extra sequential data move, because the data must first be moved
from the PowerCenter server to the mainframe before it crosses the mainframe
channel to Teradata.
Teradata Software
In the Teradata world, there are Teradata Director Program Ids (TDPIDs), databases and users. The
TDPID is simply the name that one uses to connect from a Teradata client to a Teradata server (think
of an Oracle “tnsnames.ora” entry). Teradata also treats databases and users as nearly
synonymous. A user has a userid, a password and space to store tables. A database is basically a
user without a login and password (or a user is a database with a userid and password).
Teradata AMPs are Access Module Processors. Think of AMPs as Teradata’s parallel database
engines. Although they are strictly software (“virtual processors” according to NCR terminology),
Teradata folks often seem to use AMP and hardware “node” interchangeably because in the “old
days” an AMP was a piece of hardware.
The hosts file entries tell the Teradata client software that when a client tool references the
instance “demo1099”, it should direct requests to “localhost” (IP address 127.0.0.1), and that
when a client tool references the instance “p”, it is located on the server “curly” (IP address
192.168.80.113). There is no tie here to any kind of database-server-specific information. This
is not similar to Oracle’s instance id: a tdpid is not an Oracle instance id. The tdpid is used
strictly to define the name a client uses to connect to a server. You can call a server whatever
you want; Teradata does not care. It simply takes the name you specify, looks in the “hosts” file
to map <name>cop1 (or cop2, etc.) to an IP address, and then attempts to establish a connection
with Teradata at that IP address.
Sometimes you’ll see multiple entries in a hosts file with similar tdpids:
This setup allows load balancing of clients among multiple Teradata nodes. Most Teradata
systems have many nodes, and each node has its own IP address. Without the multiple hosts file
entries, every client connects to one node, and eventually this node will be doing more than its
“fair share” of client processing. With multiple hosts file entries, if the node specified with
the “cop1” suffix (i.e. curly_1) takes too long to respond to the client’s request to connect to
“p”, the client automatically attempts to connect to the node with the “cop2” suffix (i.e.
curly_2), and so forth.
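The lookup and fallback described above can be sketched in a few lines of Python. This is only an illustration of the naming convention, not Teradata's client code: the function names are invented for the example, the hosts entries reuse the names from the text, and the second node address (192.168.80.114) is a hypothetical value.

```python
# Illustrative only: expands a tdpid to <tdpid>cop1, <tdpid>cop2, ... and looks
# each name up in hosts-file text. The 192.168.80.114 address is hypothetical.

HOSTS_TEXT = """
127.0.0.1       localhost  demo1099cop1
192.168.80.113  curly_1    pcop1
192.168.80.114  curly_2    pcop2
"""


def parse_hosts(text):
    """Return a dict mapping every host name or alias to its IP address."""
    names = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and blank lines
        if not line:
            continue
        ip, *aliases = line.split()
        for alias in aliases:
            names[alias.lower()] = ip
    return names


def cop_addresses(tdpid, names):
    """List the IP addresses for <tdpid>cop1, <tdpid>cop2, ... in order."""
    addresses = []
    n = 1
    while f"{tdpid.lower()}cop{n}" in names:
        addresses.append(names[f"{tdpid.lower()}cop{n}"])
        n += 1
    return addresses


def first_responding(tdpid, names, can_connect):
    """Try each cop entry in turn; return the first address that accepts."""
    for ip in cop_addresses(tdpid, names):
        if can_connect(ip):
            return ip
    return None


names = parse_hosts(HOSTS_TEXT)
print(cop_addresses("p", names))
```

Here `can_connect` stands in for whatever connection attempt the real client makes; passing a function keeps the sketch free of any network access.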
ODBC: Teradata provides 32-bit ODBC drivers for Windows and Unix platforms. If possible, use the
ODBC driver from Teradata’s TTU7 release (or above) of their client software because this version
supports “array reads”. Tests have shown these “new” drivers (3.02) can be 20%-30% faster than
the “old” drivers (3.01). The latest release, TTU8.0, uses ODBC v3.0421. Teradata’s
ODBC is on a performance par with Teradata’s SQL CLI. In fact, ODBC is Teradata’s recommended
SQL interface for their partners.
Do not use ODBC to write to Teradata unless you’re writing very small data sets (and even then,
you should probably use TPump, described later, instead): Teradata’s ODBC is optimized for
query access, not for writing data. ODBC is good for sourcing and lookups.
PowerCenter Designer uses Teradata’s ODBC to import Source and Target tables.
ODBC Unix
When the PowerCenter server is running on Unix, ODBC is required to read (both sourcing and
lookups) from Teradata.
As with all Unix ODBC drivers, the key to configuring the Unix ODBC driver is adding the appropriate
entries to the “.odbc.ini” file. To correctly configure the “.odbc.ini” file, there must be an entry
under [ODBC Data Sources] that points to the Teradata ODBC driver shared library (tdata.sl on
HP-UX; the standard shared library extension on other flavors of Unix).
The following example shows the required entries from an actual “.odbc.ini” file (note the path to
the driver may be different on each computer):
[TeraTest]
Driver=/usr/odbc/drivers/tdata.sl
Description=Teradata Test System
DBCName=148.162.247.34
Similar to the client “hosts” file setup, one can specify multiple IP addresses for the DBCName to
balance the client load across multiple Teradata nodes. Consult with the Teradata administrator for
exact details on this, or copy the entries from the PC client’s “hosts” file (see section Client
Configuration Basics).
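A quick sanity check of the “.odbc.ini” entries described above can be sketched in Python with the standard `configparser` module. This is an assumption-laden illustration rather than an Informatica or Teradata tool: the function name `check_dsn` is made up, and the `TeraTest=tdata.sl` line under [ODBC Data Sources] is an assumed form of the entry, since the example above shows only the [TeraTest] section.

```python
import configparser

# Sample text: the [TeraTest] section is taken from the example above; the
# [ODBC Data Sources] line is an assumed form of the required entry.
SAMPLE_ODBC_INI = """
[ODBC Data Sources]
TeraTest=tdata.sl

[TeraTest]
Driver=/usr/odbc/drivers/tdata.sl
Description=Teradata Test System
DBCName=148.162.247.34
"""


def check_dsn(ini_text, dsn):
    """Return a list of problems found for the given DSN (empty if none)."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.optionxform = str  # keep key case exactly as written in the file
    parser.read_string(ini_text)

    problems = []
    if not parser.has_option("ODBC Data Sources", dsn):
        problems.append(f"{dsn} is not listed under [ODBC Data Sources]")
    if not parser.has_section(dsn):
        problems.append(f"section [{dsn}] is missing")
        return problems
    for key in ("Driver", "DBCName"):
        if not parser.get(dsn, key, fallback="").strip():
            problems.append(f"[{dsn}] has no {key} entry")
    return problems


print(check_dsn(SAMPLE_ODBC_INI, "TeraTest"))
```

To check a real file, read its text (for example from `$HOME/.odbc.ini`) and pass it in place of the sample string.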
Important note: Make sure that the Merant ODBC path precedes the Teradata ODBC path
information in the PATH and SHLIB_PATH (or LD_LIBRARY_PATH, etc.) environment variables. This is
because both sets of ODBC software use some of the same file names. PowerCenter should use the
Merant files because this is the software that has been certified.
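The ordering rule above is easy to get wrong after an environment change, so it can help to check it mechanically. The following Python sketch does that for any PATH-style variable; the function name and both directory names are hypothetical and must be replaced with the actual Merant and Teradata ODBC locations on the server.

```python
def precedes(path_value, first_dir, second_dir, sep=":"):
    """True if first_dir appears before second_dir in a PATH-style string.

    Also True when first_dir is present and second_dir is absent;
    False when first_dir is not present at all.
    """
    entries = [e.rstrip("/") for e in path_value.split(sep) if e]
    first, second = first_dir.rstrip("/"), second_dir.rstrip("/")
    if first not in entries:
        return False
    if second not in entries:
        return True
    return entries.index(first) < entries.index(second)


# Hypothetical install locations, for illustration only.
MERANT_LIB = "/opt/informatica/odbc/lib"
TERADATA_LIB = "/usr/odbc/lib"

sample = "/usr/lib:/opt/informatica/odbc/lib:/usr/odbc/lib"
print(precedes(sample, MERANT_LIB, TERADATA_LIB))
```

In practice one would pass `os.environ.get("SHLIB_PATH", "")` (or `LD_LIBRARY_PATH`, or `PATH`) as the first argument, from the same environment that starts the PowerCenter server.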
All of the Teradata loader connections require a value for the TDPID attribute. Refer to the first
section of this document to understand how to enter the value correctly. All of these loaders
require:
All of these loaders will also produce a log file. This log file will be the means to debug the loader if
something goes wrong. As these are external loaders, PowerCenter will only receive back from the
loader whether it ran successfully or not.
To land the input flat file that the loaders need to disk, the “is staged” attribute must be checked. If
the “is staged” attribute is not set, then the file will be piped/streamed to the loader.
If one selects the non-staged mode for a loader, one should also set the “checkpoint” property to 0.
This effectively turns off “checkpoint” processing, which is used for recovery/restart of FastLoad
and MultiLoad sessions. When the input is a named pipe rather than a physical file (as is the case
in “streaming” mode), the recovery/restart mechanism of the loaders does not work, so a checkpoint
buys nothing. Worse, checkpoint processing is not free and adds unnecessary overhead, and a
non-zero checkpoint value will sometimes cause seemingly random errors and session failures when
used with named pipe input.
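The rule above amounts to a simple consistency check between two loader attributes. A minimal Python sketch, assuming the two settings have already been read into plain values (the function name and the message text are invented for illustration and are not PowerCenter output):

```python
def loader_setting_warnings(is_staged, checkpoint):
    """Return warnings for the 'is staged' / 'checkpoint' combination."""
    warnings = []
    if not is_staged and checkpoint != 0:
        warnings.append(
            "non-staged (named pipe) input with checkpoint="
            f"{checkpoint}: set checkpoint to 0, restart cannot work on a pipe"
        )
    return warnings


print(loader_setting_warnings(is_staged=False, checkpoint=15000))
```

A staged load may keep a non-zero checkpoint, because the landed flat file can be re-read on restart.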
This starts the pmserver using the “pmserver.cfg” config file and points stdout and stderr to the file
“pmserver.out”. In this way, stderr and stdout will be defined even after the terminal session logs
out. Important note: There are no spaces in the token “2>&1”. This tells Unix to point stderr to
the same place stdout is pointing.
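For readers less familiar with shell redirection, the effect of sending stdout to a file and pointing stderr at the same place can be reproduced in Python. This sketch only illustrates the redirection itself; it uses a stand-in child process rather than pmserver, does not deal with surviving a terminal logout, and the helper name is made up.

```python
import os
import subprocess
import sys
import tempfile


def run_with_merged_output(command, out_path):
    """Run command with stdout written to out_path and stderr sent to stdout."""
    with open(out_path, "w") as out:
        # stderr=subprocess.STDOUT plays the role of "2>&1" in the shell.
        return subprocess.run(command, stdout=out, stderr=subprocess.STDOUT).returncode


# Stand-in child process that writes one line to each stream.
child = [
    sys.executable,
    "-c",
    "import sys; print('to stdout'); print('to stderr', file=sys.stderr)",
]

out_path = os.path.join(tempfile.mkdtemp(), "pmserver.out")
run_with_merged_output(child, out_path)

with open(out_path) as f:
    captured = f.read()
print(captured)
```

Both lines end up in the one output file, which is the behavior the redirection is meant to guarantee for the server's console messages.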
As an alternative to this method, you can specify the console output file name in the pmserver.cfg
file. That is, information written to “standard output” and “standard error” will go to the file specified
as follows:
ConsoleOutputFilename=<FILE_NAME>
With this entry in the pmserver.cfg file, one can start the pmserver normally (i.e.
./pmserver).
Partitioned Loading
With PowerCenter v7.x, if one sets a “round robin” partition point on the target definition and sets
each target instance to be loaded using the same loader connection instance, then PowerCenter
automatically writes all data to the first partition and only starts one instance of FastLoad or
MultiLoad. You will know you are getting this behavior if you see the following entry in the session
log:
MAPPING> DBG_21684 Target [TD_INVENTORY] does not support multiple partitions. All
data will be routed to the first partition.
If you do not see this message, then chances are the session will fail with the following error:
WRITER_1_*_1> WRT_8240 Error: The external loader [Teradata Mload Loader] does not
support partitioned sessions.
WRITER_1_*_1> Thu Jun 16 11:58:21 2005
WRITER_1_*_1> WRT_8068 Writer initialization failed. Writer terminating.
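When many sessions are involved, it can be convenient to scan session logs for these two message codes instead of reading each log by hand. A minimal Python sketch, assuming the session log is available as text; the function name and the returned labels are invented, and only the message codes DBG_21684 and WRT_8240 come from the log excerpts above.

```python
SAMPLE_LOG = (
    "MAPPING> DBG_21684 Target [TD_INVENTORY] does not support multiple "
    "partitions. All data will be routed to the first partition.\n"
)


def partition_outcome(session_log):
    """Classify a session log by the partition-related message it contains."""
    if "WRT_8240" in session_log:
        return "failed: loader does not support partitioned sessions"
    if "DBG_21684" in session_log:
        return "ok: all data routed to the first partition"
    return "unknown: neither message found"


print(partition_outcome(SAMPLE_LOG))
```

The failure code is checked first so that a log containing both messages is still reported as a failure.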
Cleaning up after a failed MultiLoad: MultiLoad supports sophisticated error recovery. That is, it allows load
jobs to be restarted without having to redo all of the prior work. However, for the types of problems normally
encountered during a POC (loading null values into a column that does not support nulls, incorrectly formatted
date columns), the error recovery mechanisms tend to get in the way. To learn about MultiLoad’s sophisticated
error recovery read the MultiLoad manual. To learn how to work around the recovery mechanisms to restart a
failed MultiLoad script from scratch, read this section.
To recover from a failed MultiLoad, one must “release” the target table from the “MultiLoad” state
and also drop the MultiLoad log table. One can do this using BTEQ or QueryMan to issue the
following commands:
Note: The “drop table” command assumes that you’re recovering from a MultiLoad script
generated by PowerCenter (PowerCenter always names the MultiLoad log table “mldlog_<table
name>”). If you’re working with a hand-coded MultiLoad script, the name of the MultiLoad log table
could be anything.
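The two cleanup steps described above (release the target table, then drop the log table) follow a fixed naming pattern for PowerCenter-generated scripts, so the statements can be generated. The Python sketch below builds them as text to paste into BTEQ or Queryman. It is an illustration under assumptions: the function name is invented, the statement forms `RELEASE MLOAD` and `DROP TABLE` should be confirmed against the MultiLoad manual for your release, and the log table is assumed to live in the same database as the target.

```python
import re

# Conservative check for simple, unquoted object names.
_NAME = re.compile(r"[A-Za-z_][A-Za-z0-9_$#]*")


def multiload_cleanup_statements(database, table):
    """Build the statements that reset a table after a failed MultiLoad.

    Assumes a PowerCenter-generated script, i.e. a log table named
    mldlog_<table name> in the same database as the target table.
    """
    for name in (database, table):
        if not _NAME.fullmatch(name):
            raise ValueError(f"unexpected object name: {name!r}")
    return [
        f"RELEASE MLOAD {database}.{table};",
        f"DROP TABLE {database}.mldlog_{table};",
    ]


for statement in multiload_cleanup_statements("infatest", "td_test"):
    print(statement)
```

For a hand-coded MultiLoad script, the log table name must be taken from that script's `.LOGTABLE` line instead of being derived from the table name.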
Here is the actual text from a BTEQ session which cleans up a failed load to the table “td_test”
owned by the user “infatest”:
1) Use a dummy session (i.e. set test rows to 1 and target a test database) to generate
MultiLoad control files for each of the targets.
Here’s an example of a control file merged from two default control files:
.DATEFORM ANSIDATE;
.LOGON demo1099/infatest,infatest;
.LOGTABLE infatest.mldlog_TD_TEST;
c:\LOGS\TgtFiles\td_test.out.ldrlog ;
.Layout InputFileLayout1;
.Layout InputFileLayout2;
.END MLOAD;
.LOGOFF;
FastLoad
As the name suggests, FastLoad is a very fast utility; it is the fastest method
to load data into Teradata. However, there is one major restriction: the target table must be empty.
PushDown
This is also commonly referred to as ELT. The Zeus release of PowerCenter will be able to
create SQL that executes in the Teradata database server in place of certain transformations
that would normally run in the PowerCenter Server. This is critical for Teradata, as it is common for
the Source and Target to reside in the same database.