

What is the lookup function used to retrieve duplicate data records in a lookup file?
swarna Abinitio Interview Questions
Jun 29th, 2009
Use lookup_count to find how many records match, and lookup_next to retrieve each duplicate:
if lookup_count(string file_label, [expression, ...]) > 0, call
lookup_next(lookup_identifier_type lookup_id, string lookup_template) repeatedly.
Nov 16th, 2006
The lookup_next function is used to retrieve the duplicate records for a particular key.
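Since lookup files are just keyed datasets, the lookup_count / lookup_next pattern can be sketched in Python (an analogy only; the names and data here are hypothetical, not Ab Initio's actual implementation):

```python
from collections import defaultdict

# Build a keyed "lookup file" that allows duplicate keys.
lookup_data = defaultdict(list)
for rec in [("A", 1), ("B", 2), ("A", 3), ("A", 4)]:
    lookup_data[rec[0]].append(rec)

def lookup_count(key):
    """Number of records matching the key (like lookup_count)."""
    return len(lookup_data[key])

def lookup_all(key):
    """Yield every matching record (what repeated lookup_next calls return)."""
    yield from lookup_data[key]

dups = []
if lookup_count("A") > 0:           # guard, as in the answer above
    dups = list(lookup_all("A"))    # retrieves all three "A" duplicates
```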

A wrapper script executes multiple graphs (jobs); that is, multiple .ksh jobs are executed from it, where each .ksh executes a single component process. In other words, every component process has its own .ksh process.
A fact table that does not have any measures (like Quantity) is called a factless fact table.
In the GDE, choose File > Preferences to open the Preferences dialog.
Click the Advanced tab, select the Documentation Transforms checkbox, and click OK.
To set the doc_transform parameter:
1. On the Parameters tab, click doc_transform. A transform editor opens, displaying the input and output fields of the component.
2. In the transform editor, connect the inputs to the outputs with transformation rules to describe which output fields are computed from which input fields. For an example, see "Use the doc_transform parameter to declare dependencies".
3. (Optional) Add comments in the Documentation box of the transform editor.
What is the need of config variables in Ab Initio (AB_JOB, AB_MAX_CORE), and where are they defined?
Hi, they are all defined in stdenv. Thanks, Anne
The Reformat component will have only one output field, i.e. XML_DATA.
Place an output file to capture all 4 input records as 4 output records in XML format.


output_index sends each record to exactly one transform/out port.
output_indexes sends each record to one or more transform/out ports.
Dependency analysis is used to produce data lineage, which is defined as "a data life cycle that includes the data's origins and where it moves over time." It describes what happens to data as it goes through diverse processes, helps provide visibility into the analytics pipeline, and simplifies tracing errors back to their sources.
Both Aggregate and Rollup generate records that summarize groups of records, but Rollup provides more control over record selection, grouping, and aggregation.
Broadcast - Takes data from multiple inputs, combines it, and sends it to all the output ports. For example, you have 2 incoming flows (this can be data parallelism or component parallelism) on a Broadcast component, one with 10 records and the other with 20 records. Then all the outgoing flows (any number of flows) will each have 10 + 20 = 30 records.
Replicate - Replicates the data of each partition and sends it out to multiple out ports of the component, but maintains partition integrity. For example, suppose your incoming flow to Replicate has a data parallelism level of 2, with one partition having 10 records and the other having 20 records, and you have 3 output flows from Replicate. Then each output flow will have 2 data partitions with 10 and 20 records respectively.
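The difference between the two can be sketched in Python, modeling each flow as a list of partitions (an analogy, not Ab Initio code):

```python
def broadcast(partitions, n_out):
    """Broadcast: combine all input partitions into one stream and send
    the full combined stream to every output flow (partitioning is lost)."""
    combined = [r for p in partitions for r in p]
    return [list(combined) for _ in range(n_out)]

def replicate(partitions, n_out):
    """Replicate: copy each partition as-is to every output flow
    (partition boundaries are preserved)."""
    return [[list(p) for p in partitions] for _ in range(n_out)]

inp = [list(range(10)), list(range(20))]  # 2 partitions: 10 and 20 records

b = broadcast(inp, 3)   # each of the 3 output flows gets all 30 records
r = replicate(inp, 3)   # each output flow keeps 2 partitions of 10 and 20
```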

use ""reinterpret_as" function to convert string to decimal,or decimal to string

syntax: To convert decimal onto string
reinterpret_as(ebcdic string(13),(ebcdic decimal(13))(in.cust_amount))
decimal_lrepad(string name, decimaldata type length)
decimal_lpad(string name, decimaldata type length)
A .dbc file has the information required for Ab Initio to connect to the database to extract or load tables or views, while a .cfg file is the table configuration file created by db_config when using components like Load DB Table.

cfg means configuration file; it can be used to configure anything. For example, if you want to set the value of a variable and export it, just set the value in the .cfg file and run the .cfg file in the start script.

.cfg files reside in the config directory; .dbc files reside in the db directory.
In API mode, data processing (load/update/insert/delete) is slow, but other processes can access the database tables during the update.
By comparison, Utility mode processing (load/update/insert/delete) goes faster, as it handles data records in large chunks; however, during that time no other process can access the database table, i.e., the process running in Utility mode locks the table and takes exclusive ownership of that database instance.

In cross-functional and largely distributed organizations, API mode is recommended over Utility mode, even after considering the performance aspect. However, for one-time loads or initialization of huge data volumes in tables, Utility mode can be used.

When you load data into a table in Utility mode, all the constraints are disabled before the data is loaded, which makes the load faster.
In API mode, the constraints remain enabled, so the load is slower.
A lookup is basically a keyed dataset. It can be used to map values based on the data present in a particular file (serial or multifile). The dataset can be static as well as dynamic (e.g., when the lookup file is generated in a previous phase and used as a lookup in the current phase).
What is Range Lookup?
May 2nd, 2010
It returns the first data record whose key falls within the range defined by the lower and upper bound arguments.
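A minimal Python sketch of the idea, using a hypothetical rate table (the data and function names are illustrative, not Ab Initio's API):

```python
# Hypothetical rate table: (lower_bound, upper_bound, label)
ranges = [(0, 100, "low"), (100, 500, "mid"), (500, 10_000, "high")]

def lookup_range(value):
    """Return the first record whose [lower, upper) range contains value."""
    for lo, hi, label in ranges:
        if lo <= value < hi:
            return (lo, hi, label)
    return None          # no range matched

match = lookup_range(250)   # falls in the 100-500 band
```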

It reduces the number of processes created by the graph and can enhance graph performance. Several graph components are combined into a group at run time; the group runs as a single process and uses less memory during runtime.

Use a Rollup component to count the number of records in a flat file, with { } (an empty key) as the key specifier. It will treat all the records as one group and count the total number of records.
Rollup generates data records that summarize groups of data records; it can be used to perform data aggregation like sum, avg, max, min, etc. Scan generates a series of cumulative summary records, such as successive year-to-date totals, for groups of data records.
The finalize function of the Scan component is called for each record in a group, whereas for the Rollup component it is called only once per group.
A "pipeline broken" error actually indicates the failure of a downstream component. It normally occurs when the database runs out of memory, which makes the database components in the graph unavailable.
1. All its parameters are initialized
2. The start script is called
3. The graph starts
4. The end script is executed

What command would you use to inspect the users currently logged into SQL Server?
Aug 14th, 2009
In SQL Server, use sp_who or query sys.dm_exec_sessions. (Note: "select username from V$SESSION;" is Oracle syntax, not SQL Server.)
The output_index function is used in a Reformat with multiple output ports to direct which record goes to which out port.
For example, for a Reformat with 3 out ports, such a function could be:
if (value == 'A') 1 else if (value == 'B') 2 else 3
which basically means that if the field 'value' of a record evaluates to 'A' in the transform function, the record will come out of port 1 only, and not from port 2 or 3.
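The routing behaviour can be sketched in Python, with 0-based list indices standing in for the out ports (an analogy only; Ab Initio's own port numbering follows its documentation):

```python
def output_index(rec):
    """Return the out-port index for a record, mirroring the rule above."""
    if rec["value"] == "A":
        return 0
    elif rec["value"] == "B":
        return 1
    return 2

ports = [[], [], []]  # out0, out1, out2 of a 3-output Reformat
for rec in [{"value": "A"}, {"value": "B"}, {"value": "C"}, {"value": "A"}]:
    ports[output_index(rec)].append(rec)  # each record lands on exactly one port
```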

The following are some possible strategies for dealing with this problem:
1) If a job does a bulk load into a table and you know the table was empty before the job began, you can truncate the table before recovering the job.
2) If a job does a bulk load into a table and the table was not empty before the job began, you can solve the problem by having the graph attach a load ID to the rows it loads. You can do this by putting a component, such as a Reformat, in front of the component that loads the table to attach the ID to each row. Then, before you recover and rerun the job, you can delete the rows containing that load ID.
3) Use an API-mode component with a commitTable parameter, such as Update Table. Such a component keeps track of the records it commits to the database. When you recover and rerun the job, it will skip over the records it has already committed.
For example, suppose you configure an Update Table component to commit every 10,000 rows and a system failure occurs after it has processed 133,753 rows. When you restart the job, the loader will know that the first 130,000 rows have been committed and skip them.
If you use this solution, you must design the graph so that the input data to the Update Table arrives in the same order every time you run the graph.
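The skip-over-committed-rows logic of strategy 3 can be sketched in Python (a simplified analogy of the checkpointing idea, not the actual Update Table implementation):

```python
def load_with_commits(rows, commit_every, already_committed):
    """On a rerun, skip the rows committed before the failure, then
    resume loading, committing in fixed-size batches."""
    committed = already_committed
    for i, _row in enumerate(rows):
        if i < already_committed:
            continue                  # skip rows committed before the crash
        # ... load the row into the table here ...
        if (i + 1) % commit_every == 0:
            committed = i + 1         # checkpoint advances in whole batches
    return committed

# 133,753 rows processed before a crash, committing every 10,000 rows:
# only whole batches are durable, so exactly 130,000 rows were committed.
rows_committed_before_crash = (133_753 // 10_000) * 10_000
```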

There are many ways to improve the performance of a graph; the right ones depend on the particular graph and the components used in it.
In general, the following tips can be used to improve performance:
1) Try to use partitioning in the graph.
2) Try to minimize the number of components.
3) Maintain lookups for better efficiency.
4) Components like Join and Rollup should have the option "Input must be sorted" if they are placed after a Sort component.
5) If a component has the "In memory: Input need not be sorted" option selected, use the MAX-CORE parameter value efficiently.
6) Use phasing of the graph efficiently.
7) In all graphs where RDBMS tables are used as input, ensure the join condition is on indexed columns.
8) Try to perform sort or aggregation operations on the source tables at the database server itself, instead of doing them in Ab Initio.
There are many other ways the performance of a graph can be improved:
1) Use a limited number of components in a particular phase.
2) Use optimum max-core values for Sort and Join components.
3) Minimize the number of Sort components.
4) Minimize sorted joins and, if possible, replace them with in-memory (hash) joins.
5) Use only the required fields in Sort, Reformat, and Join components.
6) Use phasing/flow buffers in the case of merges and sorted joins.
7) If the two inputs are huge, use a sorted join; otherwise use a hash join with the proper driving port.
8) For large datasets, don't use Broadcast as a partitioner.
9) Minimize the use of regular expression functions like re_index in transform functions.
10) Avoid repartitioning data unnecessarily.
In addition to the above-mentioned cases, we can also include some more:
1. If a Sort component is used and the sort keys are the same for the next Sort component that follows 2 or 3 components later, then instead of using a Sort component again it is preferable to use a Sort within Groups component, with those shared keys as major keys and the remaining keys as minor keys. In this case it assumes the major keys are already sorted and needs to sort only on the minor keys.
E.g., the first Sort component uses keys a, b, c, and a second Sort component 2-3 components later (in the same flow) uses keys a, b, e, f. In that case, use Sort within Groups in the second case, keeping a, b as the major keys and e, f as the minor keys.
2. When splitting records into more than two flows, prefer a Reformat rather than a Broadcast component.
3. For joining records from 2 flows, use a Concatenate component only when there is a need to follow some specific order in joining records. If no order is required, it is preferable to use a Gather component.
4. Instead of too many Reformat components consecutively one after the other, use the output_indexes parameter in the first Reformat component and mention the condition there. For detailed information on this concept, refer to the Help.
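The Sort within Groups idea from tip 1 can be sketched in Python: the input is assumed already sorted on the major keys, so only each major-key group is re-sorted on the minor keys (an analogy, not Ab Initio DML):

```python
from itertools import groupby
from operator import itemgetter

def sort_within_groups(records, major, minor):
    """Re-sort each major-key group on the minor keys only; the input
    must already be sorted on the major keys."""
    out = []
    for _key, group in groupby(records, key=itemgetter(*major)):
        out.extend(sorted(group, key=itemgetter(*minor)))
    return out

# Already sorted on a, b by the first Sort; now the order a, b, e, f is needed:
recs = [
    {"a": 1, "b": 1, "e": 2, "f": 9},
    {"a": 1, "b": 1, "e": 1, "f": 3},
    {"a": 1, "b": 2, "e": 5, "f": 0},
]
sorted_recs = sort_within_groups(recs, major=["a", "b"], minor=["e", "f"])
```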

The max-core parameter is found in the Sort, Join, and Rollup components. There is no single optimal value for max-core, since a good value depends on your particular graph and the environment in which it runs.
The Sort component works in memory, and the Rollup and Join components have the option to do so. These components have a parameter called max-core that determines the maximum amount of memory they will consume per partition before they spill to disk. When the value of max-core is exceeded in any of the in-memory components, all of the inputs are dropped to disk. This can have a dramatic impact on performance, but it does not mean that it is always better to increase the value of max-core.
The higher you set the value of max-core, the more memory the component can use. Using more memory generally improves performance, up to a point. Beyond this point, performance will not improve and may even decrease. If the value of max-core is set too high, operating system swapping can occur, and the graph may even fail if memory on the machine is exhausted.
Sort Component
For the Sort component, 100 MB is the default value for max-core. This default covers a wide variety of situations and may not be ideal for your particular circumstances. Increasing the value of max-core will not increase performance unless the full dataset can be held in memory, or the data volume is so large that a reduction in the number of temporary files improves performance. You can estimate the number of temporary files by multiplying the data volume being sorted by three and dividing by the value of max-core, since data is written to disk in blocks that are one third the size of the max-core setting. This number should be less than 1000. For example, suppose you are sorting 1 GB of data with the default max-core setting of 100 MB and the process is running in serial. The number of temporary files that will be created is:
3 x 1,000 MB / 100 MB = 30 files
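That estimate is easy to turn into a small Python helper (a sketch of the rule of thumb above, not an official formula from the product documentation):

```python
def sort_temp_files(data_mb, max_core_mb):
    """Estimate temporary files created by Sort: data spills in blocks of
    one third of max-core, so roughly 3 * volume / max-core files."""
    return 3 * data_mb / max_core_mb

files = sort_temp_files(1_000, 100)   # the 1 GB / 100 MB serial-sort case
```

If the estimate approaches 1000 files, that is the signal to raise max-core rather than relying on the default.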
You should decrease the value of a Sort component's max-core if an in-memory Rollup or Join component in the same phase would benefit from the additional memory; the net performance gain will be greater.

One example: if you want to use an Oracle hint in the SELECT query of an Input Table component, you should use AB_LOCAL(table) so that the query is parsed by the database SQL engine rather than by the Ab Initio parser.
It is mainly used for parallel unload.
siddharth ranga
Sep 29th, 2005
There are several ways to do this:
1) Move the table within the same (or to another) tablespace and rebuild all the indexes on the table:
alter table <table_name> move <tablespace_name>;
This activity reclaims the defragmented space in the table. Then run
analyze table <table_name> compute statistics;
to capture the updated statistics.
2) A reorg could also be done by taking a dump of the table, truncating the table, and importing the dump back into the table.

Scan gives a cumulative summary for each record; Rollup gives a group summary.
Scan produces as many output records as there are input records, while Rollup produces only the group summary records.
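The contrast can be illustrated with a small Python sketch over made-up sales data (an analogy of the two components' output shapes, not Ab Initio DML):

```python
from itertools import accumulate, groupby
from operator import itemgetter

sales = [("east", 10), ("east", 20), ("west", 5), ("west", 7)]

# Scan: one output record per input record, carrying a running total.
scan_out = []
for key, group in groupby(sales, key=itemgetter(0)):
    amounts = [amt for _, amt in group]
    scan_out += [(key, total) for total in accumulate(amounts)]
# scan_out: [("east", 10), ("east", 30), ("west", 5), ("west", 12)]

# Rollup: one output record per group, carrying only the group summary.
rollup_out = [
    (key, sum(amt for _, amt in group))
    for key, group in groupby(sales, key=itemgetter(0))
]
# rollup_out: [("east", 30), ("west", 12)]
```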

What is Conduct>It in Ab Initio?

Conduct>It is an environment for creating enterprise Ab Initio data integration systems. Its main role is to create Ab Initio plans, a special type of graph constructed from other graphs and scripts. Ab Initio provides both a graphical and a command-line interface to Conduct>It.
What is the Data Profiler in Ab Initio?
The Data Profiler is a graphical data analysis tool that runs on top of the Co>Operating System. It can be used to characterize data range, scope, distribution, variance, and quality.
Describe the Evaluation of Parameters order. (05-28-2012)

Ex: Say the destination field is decimal(5); then use
out.field :: (decimal(5))string_lrtrim(string_substring(in.field, 1, 5))
Following is the order of evaluation:
1. The host setup script is executed first.
2. All common (included) parameters are evaluated.
3. All sandbox parameters are evaluated.
4. The project script project-start.ksh is executed.
5. All form parameters are evaluated.
6. Graph parameters are evaluated.
7. The Start Script of the graph is executed.
Truncate: a DDL command used to remove all rows from a table or cluster. Since it is a DDL command, it auto-commits and a rollback cannot be performed. It is faster than delete.
Delete: a DML command, generally used to delete records from a table or cluster. A rollback can be performed to retrieve the deleted rows. To make the deletion permanent, the "commit" command should be used.
We can avoid duplicates by using the "key_change" function of the Rollup component.
The code would look like:
out :: key_change(prev, curr) =
begin
  out :: curr != prev;
end;
out :: rollup(in) =
begin
  out :: in;
end;
Feb 2nd, 2009
Use the Dedup Sorted component to remove duplicates.