
Ab Initio

Ab Initio means “from the beginning” in Latin. Ab Initio software works with the client-server
model.

The client is called the “Graphical Development Environment” (you can call it GDE). It resides on
the user's desktop. The server, or back end, is called the “Co>Operating System”. The Co>Operating
System can reside on a mainframe or a remote UNIX machine.

The Ab Initio code is called a graph, which has a .mp extension. The graph built in the GDE must
be deployed as a corresponding .ksh script. On the Co>Operating System, that .ksh script is run to
do the required job.

How an Ab Initio Job Is Run

What happens when you push the “Run” button?

•Your graph is translated into a script that can be executed in a shell environment.

•This script and any metadata files stored on the GDE client machine are shipped (via FTP) to
the server.

•The script is invoked (via REXEC or TELNET) on the server.

•The script creates and runs a job that may run across many hosts.

•Monitoring information is sent back to the GDE client.

Ab Initio Environment

The advantage of Ab Initio code is that it can run in both serial and multifile-system
environments. Serial environment: the normal UNIX file system. Multifile system: a multifile
system (mfs) is meant for parallelism. In an mfs, a particular file is physically stored across
different partitions of a machine, or even across different machines, but is pointed to by a
single logical file stored in the Co>Operating System. This logical file is the control file,
which holds the pointers to the physical locations.
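As an illustration, multifile systems are usually created and inspected with the m_* commands; a hedged sketch (node names and paths are made up, and the exact m_mkfs syntax should be checked against your Co>Operating System documentation):

    # Create a 4-way multifile system: control directory first, then the partitions.
    m_mkfs //node1/u/mfs/mfs4way \
           //node1/disk1/p0 //node1/disk2/p1 //node2/disk1/p2 //node2/disk2/p3
    m_ls //node1/u/mfs/mfs4way    # m_* commands treat the multifile like an ordinary file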

About Ab Initio Graphs: An Ab Initio graph comprises a number of components serving different
purposes. Data is read or written by a component according to its DML record format (not to be
confused with a database's “data manipulation language”). The most commonly used components are
described in the following sections.

Co Operating System

Co>Operating System is a program provided by Ab Initio which operates on top of the
operating system and is the base for all Ab Initio processes. It provides additional features,
known as air commands, and can be installed on a variety of system environments such as Unix,
HP-UX, Linux, IBM AIX, and Windows. The Ab Initio Co>Operating System provides the following
features: - Manages and runs Ab Initio graphs and controls the ETL processes - Provides Ab Initio
extensions to the operating system

- ETL process monitoring and debugging
- Metadata management and interaction with the EME

AbInitio GDE (Graphical Development Environment)

GDE is a graphical application for developers which is used for designing and running AbInitio
graphs. The ETL process in AbInitio is represented by AbInitio graphs; graphs are formed by
components (from the standard components library or custom), flows (data streams) and
parameters. The GDE also provides:

- A user-friendly front end for designing Ab Initio ETL graphs
- The ability to run and debug Ab Initio jobs and trace execution logs
- A graph compilation process that generates a UNIX shell script which may be executed on a
machine without the GDE installed

AbInitio EME

Enterprise Meta Environment (EME) is an AbInitio repository and environment for storing and
managing metadata. It provides the capability to store both business and technical metadata. EME
metadata can be accessed from the Ab Initio GDE, a web browser, or the AbInitio Co>Operating
System command line (air commands).

Conduct It

Conduct It is an environment for creating enterprise Ab Initio data integration systems. Its main
role is to create AbInitio plans, a special type of graph constructed from other graphs and
scripts. AbInitio provides both graphical and command-line interfaces to Conduct It.

Data Profiler

The Data Profiler is an analytical application that can specify data range, scope, distribution,
variance, and quality. It runs in a graphical environment on top of the Co>Operating System.

Component Library

The Ab Initio Component Library is a set of reusable software modules for sorting, data
transformation, and high-speed database loading and unloading. It is a flexible and extensible
tool which adapts at runtime to the formats of the records it is given, and it allows the
creation and incorporation of new components obtained from any program, permitting integration
and reuse of external legacy code and storage engines.
1. We know the rollup component in Abinitio is used to summarize groups of data records;
then why do we use aggregation?

Aggregation and Rollup, both are used to summarize the data.

Rollup is much better and convenient to use.

Rollup can perform some additional functionality, like input filtering and output filtering of
records.

Aggregate does not display the intermediate results in main memory, whereas Rollup can.

Analyzing a particular summarization is much simpler with Rollup than with Aggregate.

2.Mention what is Abinitio?

“Abinitio” is a Latin word meaning “from the beginning.” Abinitio is a tool used to extract,
transform and load data. It is also used for data analysis, data manipulation, batch processing,
and graphical user interface based parallel processing.

3.What are the operations that support avoiding duplicate record?

Duplicate records can be avoided by using the following:

Using Dedup sort

Performing aggregation

Utilizing the Rollup component

4.Mention what is Rollup Component?

The Roll-up component enables the users to group the records on certain field values. It is a
multistage transform function and consists of initialize, rollup, and finalize functions.

5.What kind of layouts does Abinitio support?

Abinitio supports serial and parallel layouts.

A graph layout supports both serial and parallel layouts at a time.

The parallel layout depends on the degree of data parallelism.

If a multifile system is 4-way parallel, a component running in that layout can run 4 ways
parallel.


6. Explain what is the architecture of Abinitio?

Architecture of Abinitio includes

GDE (Graphical Development Environment)

Co-operating System

Enterprise meta-environment (EME)

Conduct-IT

7.What is MAX CORE of a component?

MAX CORE is the amount of memory a component consumes for its calculations.

Each component has a different MAX CORE.

Component performance is influenced by the MAX CORE setting.

The process may slow down or speed up if a wrong MAX CORE is set.

8.Explain what is de-partition in Abinitio?

De-partitioning is done in order to read data from multiple flows or operations, and is used to
re-join data records from different flows. There are several de-partition components available,
which include Gather, Merge, Interleave, and Concatenate.

9.How do you add default rules in transformer?

The following is the process to add default rules in the transformer:

Double click on the transform parameter in the parameter tab page in component properties

Click on Edit menu in Transform editor

Select Add Default Rules from the dropdown list box.

It shows Match Names and Wildcard options. Select either of them.

10.Mention what is the role of Co-operating system in Abinitio?

The Abinitio co-operating system provides features like

Manage and run Abinitio graph and control the ETL processes

Provide Abinitio extensions to the operating system

ETL processes monitoring and debugging


Meta-data management and interaction with the EME

11.State the first_defined function with an example.

This function is similar to the NVL() function in the Oracle database.

It returns the first value that is not NULL among the values passed to it, and that value is
assigned to the variable.

Example: a set of variables, say v1, v2, v3, v4, v5, v6, are assigned NULL.


Another variable num is assigned the value 340 (num = 340).
num = first_defined(NULL, v1, v2, v3, v4, v5, v6, num)
The result of num is 340.
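Assuming the standard m_eval utility behaves as documented, the same behaviour can be checked from the shell with a one-liner (a hedged sketch):

    m_eval 'first_defined(NULL, 340)'    # prints 340, the first non-NULL argument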

12.Explain what is SANDBOX?

A SANDBOX refers to a collection of graphs and related files that are saved in a single
directory tree and behave as a group for the purposes of navigation, version control, and
migration.

13.How to run a graph infinitely?

To run a graph infinitely…

The .ksh graph file should be called by the end script in the graph.

If the graph name is abc.mp then the graph should call the abc.ksh file.
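A hedged sketch of that end-script trick, using the abc.mp/abc.ksh names from the answer above (whether to background the call depends on your scheduling setup):

    # End script of abc.mp: re-invoke the graph's own deployed script,
    # so a new run starts as soon as the current one finishes.
    nohup ksh abc.ksh &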

Explain what does dependency analysis mean in Abinitio?

In Abinitio, dependency analysis is a process through which the EME examines a project entirely
and traces how data is transferred and transformed- from component-to-component, field-by-
field, within and between graphs.

14.Explain PDL with an example?

PDL is used to make a graph behave dynamically.

Suppose there is a need to add a dynamic field to a predefined DML while executing the graph.

Then a graph-level parameter can be defined,

and this parameter is used while embedding the DML in the output port.

For example: define a parameter named mystring with the value string("|") name;
Use ${mystring} at the time of embedding the DML in the out port.

Use $substitution as the interpretation option.
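A hedged sketch of how those pieces fit together (the parameter name, fields, and record format are illustrative only):

    # Graph-level parameter mystring (interpretation: $ substitution), value:
    #     string("|") name;
    # The embedded DML on the out port then reads:
    #     record
    #       ${mystring}
    #       string("\n") id;
    #     end;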

15.Mention what dedup-component and replicate component does?

Dedup component: It is used to remove duplicate records

Replicate component: It combines the data records from the inputs into one flow and writes a
copy of that flow to each of its output ports

16.What is a local lookup?

A local lookup file has records which can be placed in main memory.

The transform function retrieves records from it much faster than retrieving them from disk.

17.Mention how can you connect EME to Abinitio Server?

To connect with Abinitio Server, there are several ways like

Set AB_AIR_ROOT

Login to the EME web interface: http://serverhost:[serverport]/abinitio

Through GDE, you can connect to EME data-store

Through air commands (see the sketch after this list)
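A hedged command-line sketch of the AB_AIR_ROOT/air-command route (the host, path, and project names are assumptions):

    export AB_AIR_ROOT=//emehost/u/eme/repo    # assumed EME datastore location
    air object ls /Projects                    # air commands now talk to that datastore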

18.Describe the Evaluation of Parameters order.

Following is the order of evaluation:

Host setup script will be executed first

All common parameters (that is, included parameters) are evaluated

All Sandbox parameters are evaluated

The project script – project-start.ksh is executed

All form parameters are evaluated

Graph parameters are evaluated

The start script of the graph is executed

19.Explain what is Sort Component in Abinitio?


The Sort Component in Abinitio re-orders the data. It has two parameters, “Key” and
“Max-core”.

Key: It is one of the parameters for sort component which determines the collation order

Max-core: This parameter controls how often the sort component dumps data from memory to
disk

20.what is a ramp limit?

A limit is an integer parameter which represents a number of reject events

Ramp parameter contain a real number representing a rate of reject events of certain processed
records

The formula is – No. of bad records allowed = limit + no. of records x ramp

A ramp is a percentage value from 0 to 1.

These two provides the threshold value of bad records.
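A small worked example under that formula: with limit = 10 and ramp = 0.01, after 1,000 processed records the threshold is 10 + 0.01 x 1000 = 20 tolerated reject events.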

21. Mention what information does a .dbc file extension provide to connect to the database?

The .dbc file provides the GDE with the following information to connect to the database:

Name and version number of the data-base to which you want to connect

Name of the computer on which the data-base instance or server to which you want to connect
runs, or on which the database remote access software is installed

Name of the server, database instance or provider to which you want to link

22.Explain the methods to improve the performance of a graph?

The following are the ways to improve the performance of a graph:

Make sure that a limited number of components are used in a particular phase

Use the optimum value of max core for the sorting and joining components.

Utilize the minimum number of sort components

Utilize the minimum number of sorted join components and replace them with in-memory join/hash
join, if needed and possible

Restrict only the needed fields in sort, reformat, join components


Utilize phasing or flow buffers for merges or sorted joins

Use sorted join when the two inputs are huge; otherwise use hash join

Informatica vs Ab Initio

About tool: Ab Initio is code-based ETL; Informatica is engine-based ETL.

Parallelism: Ab Initio supports three types of parallelism; Informatica supports one type.

Scheduler: Ab Initio has no scheduler; in Informatica, scheduling through script is available.

Error handling: Ab Initio can attach error and reject files; Informatica uses one file for all.

Robustness: Ab Initio offers robustness by function comparison; Informatica is basic in terms of
robustness.

Feedback: Ab Initio provides performance metrics for each component executed; Informatica offers
a debug mode, but with slow implementation.

Delimiters while reading: Ab Initio supports multiple delimiters; Informatica supports only a
dedicated delimiter.

Q. What is the relation between eme, gde and co-operating system?


EME stands for Enterprise Metadata Environment, GDE for Graphical Development Environment, and
the Co>Operating System can be thought of as the Abinitio server. The relation between them is as
follows: the Co>Operating System is the Abinitio server; it is installed on a particular OS
platform, called the native OS. The EME is just like the repository in Informatica: it holds the
metadata, transformations, dbconfig files, and source and target information. The GDE is the
end-user environment where we develop graphs (a graph is like a mapping in Informatica); the
designer uses the GDE to design graphs and saves them to the EME or a sandbox. The GDE is on the
user side, whereas the EME is on the server side.
Q. What are the benefits of data processing according to you?
Well, processing of data yields a very large number of benefits. Users can separate out the
factors that matter to them. In addition, with the help of this approach, one can easily keep up
the pace simply by deriving data into different structures from a totally unstructured format.
Processing is also useful in eliminating various bugs that are often associated with the data and
cause problems later. It is for no other reason than this that data processing has wide
application in a number of tasks.
Q. What exactly do you understand with the term data processing and businesses can trust
this approach?
Processing is basically a procedure that simply converts data from a useless form into a useful
one without a lot of effort. However, the approach may vary depending on factors such as the size
of the data and its format. A sequence of operations is generally carried out to perform this
task and, depending on the type of data, this sequence can be automatic or manual. Because in the
present scenario most of the devices that perform this task are PCs, the automatic approach is
more popular than ever before. Users are free to obtain data in forms such as tables, vectors,
images, graphs, charts and so on. This is something that business owners can simply enjoy.
Q. How data is processed and what are the fundamentals of this approach?
There are certain activities which require the collection of data, and processing largely depends
on that collection in many cases. The fact is data needs to be stored and analyzed before it is
actually processed. This task depends on some major factors; they are:
1. Collection of Data
2. Presentation
3. Final Outcomes
4. Analysis
5. Sorting
These are also regarded as the basic fundamentals that can be trusted to keep up the pace in this
matter.

Q. What would be the next step after collecting the data?


Once the data is collected, the next important task is to enter it into the concerned machine or
system. Well, gone are the days when storage depended on paper. In the present time, data sizes
are very large and the task needs to be performed in a reliable manner. The digital approach is a
good option for this, as it simply lets users perform the task easily and without compromising on
anything. A large set of operations then needs to be performed for meaningful analysis. In many
cases, conversion also matters a great deal, and users are always free to consider the outcomes
which best meet their expectations.
Q. What is a data processing cycle and what is its significance?
Data often needs to be processed continuously, and it is used at the same time; this is known as
the data processing cycle. The cycle provides results which are quick, or which may take extra
time, depending on the type, size and nature of the data. This boosts the complexity of the
approach, and thus there is a need for methods that are more reliable and advanced than the
existing ones. The data cycle simply makes sure that complexity is avoided to the extent possible
and without doing much.
Q. What are the factors on which storage of data depends?
Basically, it depends on the sorting and filtering. In addition to this, it largely depends on the
software one uses.
Q. Do you think effective communication is necessary in the data processing? What is your
strength in terms of same?
The biggest ability that one could have in this domain is the ability to rely on the data or the
information. Of course, communication matters a lot in accomplishing several important tasks,
such as the representation of information. There are many departments in an organization, and
communication makes sure things are good and reliable for everyone.
Q. Suppose we assign you a new project. What would be your initial point and the key steps
that you follow?
The first thing that matters greatly is defining the objective of the task and then engaging the
team in it. This provides a solid direction for the accomplishment of the task. This is important
when one is working on a set of data which is completely unique or fresh. After this, the next
big thing that needs attention is effective data modeling, which includes finding the missing
values and data validation. The last thing is to track the results.
Q. Suppose you find the term Validation mentioned with a set of data, what does that
simply represent?
It represents that the concerned data is clean and correct, and can thus be used reliably without
worrying about anything. Data validation is widely regarded as one of the key points in the
processing system.
Q. What do you mean by data sorting?
It is not always necessary that data remains in a well-defined sequence. In fact, it is always a
random collection of objects. Sorting is nothing but arranging the data items in desired sets or in
sequence.
Q. Name the technique which you can use for combining the multiple data sets simply?
It is known as Aggregation
Q. How scientific data processing is different from commercial data processing?
Scientific data processing simply means data with a great amount of computation, i.e. arithmetic
operations. In this, a limited amount of data is provided as input and bulk data is there at the
outcome. Commercial data processing, on the other hand, is different: the outcome is limited as
compared to the input data, and the computational operations are limited in commercial data
processing.
Q. What are the benefits of data analyzing?
It makes sure of the following:
1. Explanations of developments related to the core tasks can be assured
2. Hypotheses can be tested with an integrated approach
3. Patterns can be detected in a reliable manner
Q. What are the key elements of a data processing system?
These are Converter, Aggregator, Validator, Analyzer, Summarizer, and a sorter
Q. Name any two stages of the data processing cycle and provide your answer in terms of a
comparative study of them?
The first is Collection and the second is Preparation of data. Of course, collection is the first
stage and preparation the second in a cycle dealing with data processing. The first stage
provides the baseline for the second, and the success and simplicity of the second depend on how
accurately the first has been accomplished. Preparation is mainly the manipulation of important
data. Collection breaks data sets apart, while Preparation joins them together.
Q. What do you mean by the overflow errors?
While processing data, bulky calculations are often present, and it is not always guaranteed that
the results fit in the memory allocated for them. If a value larger than the allocated storage
(for example, a character of more than 8 bits) is stored there, an overflow error results.
Q. What are the facts that can compromise data integrity?
There are several errors that can cause this issue and can turn into many other problems. These
are:
1. Bugs and malwares
2. Human error
3. Hardware error
4. Transfer errors which generally include data compression beyond a limit.
Q. What is data encoding?
Data needs to be kept confidential in many cases, and this can be done through encoding. It
simply makes sure the information remains in a form which no one other than the sender and the
receiver can understand.
Q. What does EDP stand for?
It means Electronic Data Processing
Q. Name one method which is generally considered by remote workstation when it comes
to processing
Distributed processing
Q. What do you mean by a transaction file and how it is different from that of a Sort file?
The Transaction file is generally considered to hold input data for the time when a transaction
is under process. All the master files can be updated with it simply. Sorting, on the other hand,
is done to assign a fixed location to the data files.
Q. What is the use of aggregation when we have rollup? As we know, the rollup component in
Abinitio is used to summarize groups of data records. Then where will we use aggregation?
Aggregation and Rollup can both summarize the data, but Rollup is much more convenient to use,
and understanding how a particular summarization happened is much more explanatory with Rollup
than with Aggregate. Rollup can also do some other things, like input and output filtering of
records. Aggregate and Rollup perform the same action, but Rollup can display intermediate
results in main memory, whereas Aggregate does not support intermediate results.
Q. What are kinds of layouts does ab initio supports?
Basically, there are serial and parallel layouts supported by AbInitio. A graph can have both at
the same time. The parallel one depends on the degree of data parallelism. If the multi-file
system is 4-way parallel, then a component in a graph can run 4 ways parallel if its layout is
defined with the same degree of parallelism.
Q. How do you add default rules in transformer?
Double click on the transform parameter on the parameter tab page of the component properties; it
will open the transform editor. In the transform editor, click on the Edit menu and then select
Add Default Rules from the dropdown. It will show two options: 1) Match Names 2) Use Wildcard.
Q. Do you know what a local lookup is?
If your lookup file is a multifile, partitioned/sorted on a particular key, then the local lookup
function can be used ahead of the lookup function call; see the sketch below. It is local to a
particular partition, depending on the key.
A lookup file consists of data records which can be held in main memory. This lets the transform
function retrieve the records much faster than retrieving them from disk, and it allows the
transform component to process the data records of multiple files quickly.
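A hedged sketch of how the two calls differ inside a transform (the lookup-file label and field names are illustrative):

    out.desc :: lookup("MyLookup", in.code).description;        // searches the whole lookup file
    out.desc :: lookup_local("MyLookup", in.code).description;  // searches only this partition's data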
Q. What is the diff b/w look-up file and look-up, with a relevant example?
Generally, Lookup file represents one or more serial files (Flat files). The amount of data is small
enough to be held in the memory. This allows transform functions to retrieve records much more
quickly than it could retrieve from Disk.
Q. How many components in your most complicated graph?
It depends on the type of components you use. Usually, avoid using very complicated transform
functions in a graph.
Q. Have you worked with packages?
Multistage transform components use packages by default. However, a user can create his own set
of functions in a transform function and include it in other transform functions.
Q. Can sorting and storing be done through single software or you need different for these
approaches?
Well, it actually depends on the type and nature of the data. Although it is possible to
accomplish both these tasks through the same software, many software packages have their own
specialization, and it would be good to adopt such an approach to get quality outcomes. There are
also some pre-defined sets of modules and operations that matter greatly. If the conditions
imposed by them are met, users can perform multiple tasks with the same software. The output file
is provided in various formats.
Q. What are the different forms of output that can be obtained after processing of data?
These are
1.Tables
2. Plain Text files
3. Image files
4. Maps
5. Charts
6. Vectors
7. Raw files
Sometime data is required to be produced in more than one format and therefore the software
accomplishing this task must have features available in it to keep up the pace in this matter.

Q. Give one reason when you need to consider multiple data processing?
When the files obtained are not the complete outcomes that are required and need further
processing.
Q. What are the types of data processing you are familiar with?
The very first one is the manual data approach. In this, the data is generally processed without
the help of a machine, and thus it contains several errors. In the present time, this technique
is rarely followed, or only a limited amount of data is processed this way. The second type is
mechanical data processing, in which mechanical devices play an important role; when the data is
a combination of different formats, this approach is adopted. The next approach is electronic
data processing, which is regarded as the fastest and is widely adopted in the current scenario.
It has top accuracy and reliability.
Q. Name the different type of processing based on the steps that you know about?
They are:
1. Real-Time processing
2. Multiprocessing
3. Time Sharing
4. Batch processing
5. Adequate Processing
Q. Why do you think data processing is important?
The fact is that data is generally collected from different sources, so it may vary greatly in a
number of ways. This data needs to pass through various analyses and other processes before it is
stored, and that process is not as easy as it seems in most cases. Thus, processing matters. A
lot of time can be saved by processing the data to accomplish the various tasks that matter
greatly, and the dependency on various factors for reliable operation can also be avoided to a
good extent.
Q. What is common among data validity and Data Integrity?
Both these approaches deal with detecting errors and make sure of the smooth flow of the
operations that matter.
Q. What do you mean by the term data warehousing? Is it different from Data Mining?
Many times there is a need for data retrieval; warehousing can simply be considered a way to
assure the same without affecting the efficiency of operational systems. It supports decision
making and always works alongside the business applications, Customer Relationship Management,
and the warehouse architecture. Data mining is closely related to this approach: it assures
simple finding of the required information from the warehouse.
Q. What exactly do you know about the typical data analysis?
It generally involves the organization as well as the collection of data in the form of important
files. The main aim is to know the exact relation between the full industrial data and the
portion which is analyzed. Some experts also call it one of the best available approaches for
finding errors, as it entails the ability to spot problems and enables the operator to find the
root causes of the errors.
Q. Have you used rollup component? Describe how?
If the user wants to group the records on particular field values, then rollup is the best way to
do that. Rollup is a multi-stage transform function and it contains the following mandatory
functions:
1. Initialize
2. Rollup
3. Finalize
You also need to declare a temporary variable if you want to get counts of a particular group.
For each group, it first calls the initialize function once, followed by rollup function calls
for each of the records in the group, and finally calls the finalize function once at the end of
the last rollup call.
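A hedged sketch of such a multistage rollup transform in Ab Initio DML; the temporary count field and the key name are illustrative, not from the answer above:

    type temporary_type =
    record
      decimal("") cnt;              // temporary variable holding the group count
    end;

    temp :: initialize(in) =
    begin
      temp.cnt :: 0;                // called once per group
    end;

    temp :: rollup(temp, in) =
    begin
      temp.cnt :: temp.cnt + 1;     // called once per record in the group
    end;

    out :: finalize(temp, in) =
    begin
      out.key :: in.key;            // the grouping key
      out.cnt :: temp.cnt;          // called once when the group ends
    end;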
Q. How to add default rules in transformer?
Add Default Rules opens the Add Default Rules dialog. Select one of the following: Match Names,
which generates a set of rules that copies input fields to output fields with the same name, or
Use Wildcard (.*) Rule, which generates one wildcard rule that copies input fields to output
fields with the same name.
1) If it is not already displayed, display the Transform Editor Grid.
2) Click the Business Rules tab if it is not already displayed.
3) Select Edit > Add Default Rules.
In the case of a Reformat, if the destination field names are the same as, or a subset of, the
source fields, then there is no need to write anything in the reformat xfr, unless you want to
apply a real transform beyond reducing the set of fields or splitting the flow into a number of
flows.
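For illustration, the generated wildcard rule amounts to a single DML assignment; a hedged sketch, assuming the output fields are a same-named subset of the input fields:

    out :: reformat(in) =
    begin
      out.* :: in.*;    // wildcard rule: copy every same-named field
    end;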
Q. What is the difference between partitioning with key and round robin?
Partition by Key (hash partition): this is a partitioning technique used to partition data when
the keys are diverse. If one key value is present in large volume, there can be a large data
skew. This method is used most often for key-based parallel data processing.
Round-robin partition is another partitioning technique, used to distribute the data uniformly
across the destination data partitions. The skew is zero when the number of records is divisible
by the number of partitions. A real-life example is how a pack of 52 cards is distributed among 4
players in a round-robin manner.
Q. How do you improve the performance of a graph?
There are many ways the performance of the graph can be improved.
1) Use a limited number of components in a particular phase
2) Use optimum value of max core values for sort and join components
3) Minimize the number of sort components
4) Minimize sorted join component and if possible replace them by in-memory join/hash join
5) Use only required fields in the sort, reformat, join components
6) Use phasing/flow buffers in case of merge, sorted joins
7) If the two inputs are huge then use sorted join, otherwise use hash join with proper driving
port
8) For large dataset don’t use broadcast as partitioner
9) Minimize the use of regular expression functions like re_index in the transfer functions
10) Avoid repartitioning of data unnecessarily
Try to run the graph as long as possible in MFS. For these input files should be partitioned and if
possible output file should also be partitioned.
Q. How do you truncate a table?
From Abinitio, run the Run SQL component using the DDL “truncate table <table>”, or use the
Truncate Table component in Ab Initio.
Q. Have you ever encountered an error called “depth not equal”?
When two components are linked together if their layout does not match then this problem can
occur during the compilation of the graph. A solution to this problem would be to use a
partitioning component in between if there was change in layout.
Q. What are primary keys and foreign keys?
In an RDBMS, the relationship between two tables is represented as a primary key and foreign key
relationship. The primary key table is the parent table and the foreign key table is the child
table. The criterion for both tables is that there should be a matching column.
Q. What is an outer join?
An outer join is used when one wants to select all the records from a port – whether it has
satisfied the join criteria or not.

1. What are the components or functions available in ab initio?

Answer:

The main components in ab initio are listed below.

Dedup: removes duplicate records.

Join: joins multiple input datasets based on a common key value.

Sort: reorders the data; the key determines the collation order, and max-core controls when it
dumps data from memory to disk.

Filter: any condition-based removal of data.

Replicate: writes a copy of the input flow to each of its output ports; the additional copies are
useful for parallel branches of a graph.

Merge: combines multiple input flows into one.

2. What are the types of parallel processing?

Answer:

This is the common Ab initio Interview questions asked in an interview. Different types of
parallel processing are,

Component parallelism

Data parallelism

Pipeline parallelism

Component parallelism: an application that has multiple components running on the system
simultaneously, but on separate data. This is achieved through component-level parallel
processing.

Data parallelism: data is split into segments and the operations run simultaneously on each
segment. This kind of processing is achieved using data parallelism.


Pipeline parallelism: An application with multiple components but running on the same dataset.
This uses pipeline parallelism.

3. What is the different way to achieve the partitions?

Answer:

There are multiple ways to do the partitioning.

Expression: data is split according to a DML expression.

Key: data is grouped by specified keys.

Load balance: dynamic load balancing.

Percentage: data is segregated so that the output sizes are fractions of 100.

Range: data is split evenly, based on a key and a set of ranges, among the nodes.

Round robin: data is distributed evenly, in block-size chunks, across the output partitions.

Let us move to the next Ab initio interview Questions.

4. What is a multifile system?

Answer:

A multifile is a set of directories on different nodes in a cluster that possess an identical
directory structure. The multifile system leads to better performance, as processing is parallel
and the data resides on multiple disks.

It is created with the control partition on one node and data partitions on the other nodes,
distributing the processing in order to improve performance.
Part 2 – Ab initio Interview Questions (Advanced)

Let us now have a look at the advanced Ab initio Interview Questions.

6. What kind of layouts does Ab initio support?

Answer:

Ab initio supports serial and parallel layouts.

A graph layout supports both serial and parallel layouts at a time.

If a multifile system is 4-way parallel, a component running in that layout can run 4 ways
parallel.

7. What is the relation between Enterprise metadata environment (EME), the Graphical
development environment (GDE) and co-operating system?

Answer:

CoOperating System: it operates on top of the operating system, is provided by Ab Initio, and is
the base for all Ab Initio processes. Air commands are one of its features, and it can be
installed on different operating systems like UNIX, Linux, IBM AIX, etc.

It provides the following features:

- Manages and runs Ab Initio graphs and controls the ETL processes

- Provides Ab Initio extensions to the operating system

- ETL process monitoring and debugging

- Metadata management and interaction with the EME

GDE: it is the design component, used to build and run Ab Initio graphs.
Graphs are formed from components (predefined or user-defined), flows, and parameters. It
represents the ETL process in Ab Initio as graphs.


It also gives the ability to run and debug jobs and trace execution logs.

Enterprise Meta-Environment (EME): an environment for storage and metadata management (both
business and technical metadata). The metadata is accessed from the graphical development
environment, a web browser, or the Co>Operating System command line. It is the Ab Initio
repository.

Let us move to the next Ab initio interview questions.

8.How data is processed and what are the fundamentals of this approach?

Answer:

There are certain activities which require the collection of data, and processing largely depends
on that collection in many cases. Before processing, the data has to reside in some placeholder,
like well-defined storage. This task depends on some major factors; they are:

1. Collection of Data

2. Presentation

3. Final Outcomes

4. Analysis

5. Sorting
9. What is the difference between partitioning with key and round robin?

Answer :

This is an advanced Ab initio interview question asked in an interview.

Partition by key: in this, we have to specify the key based on which the partitioning will occur.
It results in well-balanced data due to the key-based partitions. It is useful for key-dependent
parallelism.

Partition by round robin: in this, the records are distributed evenly, in block-size chunks, in a
sequential way across the output partitions. It is not key-based, and it results in well-balanced
data, especially with a block size of 1. It is useful for record-independent parallelism.

10. How do you improve the performance of a graph?

Answer:

There are many ways the performance of the graph can be improved.

1) Reduce the number of components used in a particular phase.

2) Use a refined, well-chosen value of max core for the sort and join components.

3) Minimize the use of regular expression functions like re_index in the transform functions.

4) Minimize sorted join components and, if possible, replace them with in-memory join/hash join.

5) Use only the required fields in the sort, reformat, and join components.

6) Use phasing or flow buffering in cases of merge or sorted joins.

7) Use hash join if the two input sets are small; otherwise choose the sorted join for huge input
sizes.

8) For large datasets, do not use broadcast as a partitioner.

9) Reduce the number of sort components used while processing.

10) Avoid repartitioning data unnecessarily.


1. Mention what information does a .dbc file extension provide to connect to the database?
Answer:
The .dbc file provides the GDE with the following information to connect to the database:
• Name and version number of the data-base to which you want to connect
• Name of the computer on which the data-base instance or server to which you want to connect
runs, or on which the database remote access software is installed
• Name of the server, database instance or provider to which you want to link.

Ab Initio Scenario Based Interview Questions

2. What is a data processing cycle and what is its significance ?


Answer:
Data often needs to be processed continuously, and it is used at the same time; this is known as
the data processing cycle. The cycle provides results which are quick, or which may take extra
time, depending on the type, size and nature of the data. This boosts the complexity of the
approach, and thus there is a need for methods that are more reliable and advanced than the
existing ones. The data cycle simply makes sure that complexity is avoided to the extent possible
and without doing much.

3. Suppose we assign you a new project. What would be your initial point and the key steps
that you follow ?
Answer:
The first thing that matters greatly is defining the objective of the task and then engaging the
team in it. This provides a solid direction for the accomplishment of the task. This is important
when one is working on a set of data which is completely unique or fresh. After this, the next
big thing that needs attention is effective data modeling, which includes finding the missing
values and data validation. The last thing is to track the results.

4. What do you mean by the term data warehousing? Is it different from Data Mining ?
Answer:
Many times there is a need for data retrieval; warehousing can simply be considered a way to
assure the same without affecting the efficiency of operational systems. It supports decision
making and always works alongside the business applications, Customer Relationship Management,
and the warehouse architecture. Data mining is closely related to this approach: it assures
simple finding of the required information from the warehouse.

5. Have you ever encountered an error called “depth not equal” ?


Answer:
When two components are linked together if their layout does not match then this problem can
occur during the compilation of the graph. A solution to this problem would be to use a
partitioning component in between if there was change in layout.
6. What is a cursor? Within a cursor, how would you update fields on the row just fetched
?
Answer:
A cursor is the work area that the Oracle engine uses for internal processing in order to execute
a SQL statement. There are two types of cursors: implicit and explicit. An implicit cursor is
used for internal processing, while an explicit cursor is declared and opened by the user when
data is required. To update fields on the row just fetched, use UPDATE ... WHERE CURRENT OF
<cursor_name> on a cursor declared FOR UPDATE.

7. What are Cartesian joins ?


Answer:
A Cartesian join gets you a Cartesian product: you join every row of one table to every row of
another table. You can also get one by joining every row of a table to every row of itself.

8. Can anyone give an example of a start script in a graph ?


Answer:
Here is a simple example of using a start script in a graph.
In the start script, put:
export DT=$(date '+%m%d%y')
Now the variable DT will hold today's date before the graph is run.
Somewhere in the graph transform we can then use this variable as:
out.process_dt :: $DT;
which provides the value from the shell.


9. What is skew and skew measurement ?


Answer:
Skew is the measure of data flow to each partition.
Suppose the input comes from 4 files and the total size is 1 GB:
1 GB = 100 MB + 200 MB + 300 MB + 500 MB
average per partition = 1000 MB / 4 = 250 MB
skew of the 100 MB partition = (100 - 250) / 500 = -0.3
Calculate similarly for the 200, 300, and 500 MB partitions.
A skew value close to zero is desirable.
Skew is an indirect measure of graph performance.

10. Do you think effective communication is necessary in the data processing? What is your
strength in terms of same ?
Answer:
The biggest ability that one could have in this domain is the ability to rely on the data or the
information. Of course, communication matters a lot in accomplishing several important tasks,
such as the representation of information. There are many departments in an organization, and
communication makes sure things are good and reliable for everyone.
11. Describe in detail about lookup ?
Answer:
A group of keyed datasets is called a lookup. The datasets in a lookup can be classified into two
types: static and dynamic. In the case of a dynamic dataset, the lookup file is generated in a
previous phase and used in the current phase. Lookups can be used to map values against the data
present in a particular multifile or serial file.

12. What kind of layouts does Abinitio support ?


Answer:
• Abinitio supports serial and parallel layouts.
• A graph layout supports both serial and parallel layouts at a time.
• The parallel layout depends on the degree of data parallelism.
• If a multifile system is 4-way parallel, a component running in that layout can run 4 ways
parallel.

13. What is a local lookup ?


Answer:
• A local lookup file has records which can be placed in main memory.
• The transform function retrieves records from it much faster than retrieving them from disk.

14. Mention what is the role of Co-operating system in Abinitio ?


Answer:
The Abinitio co-operating system provides features like:
Manage and run Abinitio graphs and control the ETL processes
Provide Abinitio extensions to the operating system
ETL process monitoring and debugging
Metadata management and interaction with the EME.
15. Mention what is Abinitio ?
Answer:
“Abinitio” is a Latin word meaning “from the beginning.” Abinitio is a tool used to extract,
transform and load data. It is also used for data analysis, data manipulation, batch processing,
and graphical user interface based parallel processing.

16. Mention what is Rollup Component ?


Answer:
The Roll-up component enables the users to group the records on certain field values. It is a
multistage transform function and consists of initialize, rollup, and finalize functions.


17. What is the importance of EME in abinitio ?


Answer:
EME is a repository in Ab Initio; it is used for check-in and check-out of graphs and also
maintains graph versions.

18. What are steps to create repository in AB Initio ?


Answer:
If you have installed AB Initio on a standalone machine, there is no need to create a separate
repository, as one is created automatically during the installation process. You will be able to
view the newly created repository under the AB Initio folder.

19. What would be the next step after collecting the data ?
Answer:
Once the data is collected, the next important task is to enter it into the concerned machine or
system. Well, gone are the days when storage depended on paper. In the present time, data sizes
are very large and the task needs to be performed in a reliable manner. The digital approach is a
good option for this, as it simply lets users perform the task easily and without compromising on
anything. A large set of operations then needs to be performed for meaningful analysis. In many
cases, conversion also matters a great deal, and users are always free to consider the outcomes
which best meet their expectations.

20. Suppose you find the term Validation mentioned with a set of data, what does that simply
represent ?
Answer:
It represents that the concerned data is clean and correct, and can thus be used reliably without
worrying about anything. Data validation is widely regarded as one of the key points in the
processing system.

21. How scientific data processing is different from commercial data processing ?
Answer:
Scientific data processing simply means data with a great amount of computation, i.e. arithmetic
operations. In this, a limited amount of data is provided as input and bulk data is there at the
outcome. Commercial data processing, on the other hand, is different: the outcome is limited as
compared to the input data, and the computational operations are limited.

22. Name any two stages of the data processing cycle and provide your answer in terms of a
comparative study of them ?
Answer:
The first is Collection and the second is Preparation of data. Of course, collection is the first
stage and preparation the second in a cycle dealing with data processing. The first stage
provides the baseline for the second, and the success and simplicity of the second depend on how
accurately the first has been accomplished. Preparation is mainly the manipulation of important
data. Collection breaks data sets apart, while Preparation joins them together.

23. What do you mean by a transaction file and how it is different from that of a Sort file ?
Answer:
The Transaction file is generally considered to hold input data for the time when a transaction
is under process. All the master files can be updated with it simply. Sorting, on the other hand,
is done to assign a fixed location to the data files.

24. Do you know what a local lookup is ?


Answer:
If your lookup file is a multifile, partitioned/sorted on a particular key, then the local lookup
function can be used ahead of the lookup function call. It is local to a particular partition,
depending on the key.
A lookup file consists of data records which can be held in main memory. This lets the transform
function retrieve the records much faster than retrieving them from disk, and it allows the
transform component to process the data records of multiple files quickly.

25. How many components in your most complicated graph ?


Answer:
It depends on the type of components you use. Usually, avoid using very complicated transform
functions in a graph.

26. Have you worked with packages ?


Answer:
Multistage transform components use packages by default. However, a user can create his own set
of functions in a transform function and include it in other transform functions.

27. What are the different forms of output that can be obtained after processing of data ?
Answer:
These are

1. Tables
2. Plain Text files
3. Image files
4. Maps
5. Charts
6. Vectors
7. Raw files

Sometime data is required to be produced in more than one format and therefore the software
accomplishing this task must have features available in it to keep up the pace in this matter.

28. What exactly do you know about the typical data analysis ?
Answer:
It generally involves the organization as well as the collection of important files in the form of
important files. The main aim is to know the exact relation among the industrial data or the full
data and the one which is analyzed. Some experts also call it as one of the best available
approaches to find errors. It entails the ability to spot problems and enable the operator to find
out root causes of the errors.

29. How to add default rules in transformer ?


Answer:
Add Default Rules opens the Add Default Rules dialog. Select one of the following: Match Names,
which generates a set of rules that copies input fields to output fields with the same name, or
Use Wildcard (.*) Rule, which generates one wildcard rule that copies input fields to output
fields with the same name.
1) If it is not already displayed, display the Transform Editor Grid.
2) Click the Business Rules tab if it is not already displayed.
3) Select Edit > Add Default Rules.
In the case of a Reformat, if the destination field names are the same as, or a subset of, the
source fields, then there is no need to write anything in the reformat xfr, unless you want to
apply a real transform beyond reducing the set of fields or splitting the flow into a number of
flows.
30. How do you truncate a table ?
Answer:
From Abinitio, run the Run SQL component using the DDL “truncate table <table>”, or use the
Truncate Table component in Ab Initio.

31. Describe the Grant/Revoke DDL facility and how it is implemented ?


Answer:
Basically, this is part of the DBA's responsibilities. GRANT gives permissions, for example
GRANT CREATE TABLE, CREATE VIEW, and many more.
REVOKE cancels granted permissions. Both the GRANT and REVOKE commands are issued by the DBA.


32. How would you find out whether a SQL query is using the indices you expect ?
Answer:
The explain plan can be reviewed to check the execution plan of the query. This shows whether the
expected indexes are used or not.

33. What is the purpose of having stored procedures in a database ?


Answer:
The main purpose of stored procedures is to reduce network traffic; all the SQL statements
execute on the server side, so the speed is high.

34. How do you convert 4-way MFS to 8-way mfs ?


Answer:
To convert 4 way to 8 way partition we need to change the layout in the partioning component.
There will be seperate parameters for each and every type of partioning eg. AI_MFS_HOME,
AI_MFS_MEDIUM_HOME, AI_MFS_WIDE_HOME etc.
The appropriate parameter need to be selected in the component layout for the type of partioning.

35. What is $mpjret? Where it is used in ab-initio ?


Answer:
$mpjret holds the return status of the graph. You can use it in the end script like:
if [ $mpjret -eq 0 ]; then
    echo "success"
else
    mailx -s "[graphname] failed" mailid
fi

36. What is the difference between a Scan component and a RollUp component ?
Answer:
Rollup is for group-by and Scan is for successive (running) totals. Basically, when we need to
produce a cumulative summary we use Scan; Rollup is used to aggregate data per group.

37. What is the Difference between DML Expression and XFR Expression ?
Answer:
The main difference between DML and XFR is that DML represents the record format of the metadata,
whereas XFR represents the transform functions, which contain the business rules.

38. How can I run the 2 GUI merge files ?


Answer:
If you mean merging GUI map files in WinRunner: merging GUI map files in the GUI map editor does
not create a corresponding test script, and without a test script you cannot run a file. So it is
impossible to run a file by merging 2 GUI map files.

39. What is the difference between rollup and scan ?


Answer:
By using rollup we cannot generate cumulative summary records; for that we use scan.


40. What is common among data validity and Data Integrity ?


Answer:
Both these approaches deal with detecting errors and make sure of the smooth flow of the
operations that matter.

41. Name the different type of processing based on the steps that you know about ?
Answer:
They are:

1. Real-Time processing
2. Multiprocessing
3. Time Sharing
4. Batch processing
5. Adequate Processing
42. What is the diff b/w look-up file and look-up, with a relevant example ?
Answer:
Generally, Lookup file represents one or more serial files (Flat files). The amount of data is small
enough to be held in the memory. This allows transform functions to retrieve records much more
quickly than they could be retrieved from disk.
43. How to run a graph infinitely ?
Answer:
To run a graph infinitely, the graph's end script should call the graph's own .ksh file.
If the graph name is abc.mp, then the graph should call the abc.ksh file.
44. Mention how can you connect EME to Abinitio Server ?
Answer:
To connect with Abinitio Server, there are several ways like
• Set AB_AIR_ROOT
• Login to EME web interface- http://serverhost:[serverport]/abinitio
• Through GDE, you can connect to EME data-store
• Through air-command

45. How can you force the optimizer to use a particular index ?
Answer:
Use hints (/*+ ... */); these act as directives to the optimizer.

46. What are the operations that support avoiding duplicate record ?
Answer:
Duplicate records can be avoided by using the following:
• Using Dedup sort
• Performing aggregation
• Utilizing the Rollup component

47. What is m_dump ?


Answer:
The m_dump command prints the data in a file in a formatted, human-readable way, given its record
format.
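A hedged usage sketch (the file names are illustrative; m_dump needs the DML record format plus the data file):

    m_dump my_format.dml my_data.dat    # prints the records of my_data.dat using that record format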


48. What is the latest version that is available in Ab-initio ?


Answer:
The latest version of the GDE is 1.15 and of the Co>Operating System is 2.14.

49. What are differences between different versions of Co-op ?


Answer:
1.10 is a non-key version and the rest are key versions.
A lot of components were added and revised in the following versions.

50. Explain about AB Initio’s dependency analysis ?


Answer:
Dependency analysis in AB Initio is closely associated with data lineage. Data lineage provides
the source of the data, and through dependency analysis the types of applications dependent on
the data can be identified. Dependency analysis also helps to carry out maximum retrieval
operations (from existing data) by the use of a surrogate key. New records can be generated when
using scan or next_in_sequence/reformat sequences.

51. informatica vs ab initio ?


Answer:
About tool: Ab Initio is code-based ETL; Informatica is engine-based ETL.
Parallelism: Ab Initio supports three types of parallelism; Informatica supports one type.
Scheduler: Ab Initio has no scheduler; in Informatica, scheduling through script is available.
Error handling: Ab Initio can attach error and reject files; Informatica uses one file for all.
Robustness: Ab Initio offers robustness by function comparison; Informatica is basic in terms of
robustness.
Feedback: Ab Initio provides performance metrics for each component executed; Informatica offers
a debug mode, but with slow implementation.
Delimiters while reading: Ab Initio supports multiple delimiters; Informatica supports only a
dedicated delimiter.

52. What are the benefits of data analyzing ?


Answer:
It makes sure of the following:

1. Explanations of developments related to the core tasks can be assured
2. Hypotheses can be tested with an integrated approach
3. Patterns can be detected in a reliable manner

53. What are the key elements of a data processing system ?


Answer:
These are Converter, Aggregator, Validator, Analyzer, Summarizer, and a sorter.

54. What are the facts that can compromise data integrity ?
Answer:
There are several errors that can cause this issue and can transform many other problems. These
are:

1. Bugs and malwares


2. Human error
3. Hardware error
4. Transfer errors which generally include data compression beyond a limit.
55. What does EDP stand for ?
Answer:
It means Electronic Data Processing.

56. Give one reason when you need to consider multiple data processing ?
Answer:
When the files obtained are not the complete outcomes that are required and need further
processing.

57. Can sorting and storing be done through single software or you need different for these
approaches ?
Answer:
Well, it actually depends on the type and nature of the data. Although it is possible to
accomplish both these tasks through the same software, many software packages have their own
specialization, and it would be good to adopt such an approach to get quality outcomes. There are
also some pre-defined sets of modules and operations that matter greatly; if the conditions
imposed by them are met, users can perform multiple tasks with the same software. The output file
is provided in various formats.

Q. Mention what is Abinitio?


“Ab initio” is a Latin phrase meaning “from the beginning.” Abinitio is a tool used to extract, transform and load data. It is also used for data analysis, data manipulation, batch processing, and graphical user interface based parallel processing.

Q. Explain what is the architecture of Abinitio?


Architecture of Abinitio includes

• GDE (Graphical Development Environment)
• Co-Operating System
• Enterprise Meta-Environment (EME)
• Conduct-IT

Q. Mention what is the role of Co-operating system in Abinitio?


The Abinitio Co-Operating System provides features like:

• Managing and running Abinitio graphs and controlling the ETL processes
• Providing Abinitio extensions to the operating system
• ETL process monitoring and debugging
• Metadata management and interaction with the EME
Q. Explain what does dependency analysis mean in Abinitio?
In Abinitio, dependency analysis is a process through which the EME examines a project entirely and traces how data is transferred and transformed, from component to component and field by field, within and between graphs.

Q. Explain how Abinitio EME is segregated?


Abinitio EME is logically divided into two segments:

• Data Integration portion
• User Interface (access to the metadata information)

Q. Mention how can you connect EME to Abinitio Server?


To connect with the Abinitio server, there are several ways:

• Set AB_AIR_ROOT
• Log in to the EME web interface: http://serverhost:[serverport]/abinitio
• Connect to the EME data-store through the GDE
• Use air commands
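A minimal shell sketch of the first and last options (the host name and repository path are hypothetical):

export AB_AIR_ROOT=//emehost/u01/eme/repo   # point air commands at the EME data-store
air object ls /Projects                     # verify the connection by listing repository objects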

Q. List out the file extensions used in Abinitio?


The file extensions used in Abinitio are

• .mp: Stores an Abinitio graph or graph component
• .mpc: Custom component or program
• .mdc: Dataset or custom dataset component
• .dml: Data Manipulation Language file or record type definition
• .xfr: Transform function file
• .dat: Data file (multifile or serial file)

Q. Mention what information a .dbc file provides to connect to the database?
The .dbc file provides the GDE with the following information to connect to the database:

• Name and version number of the database to which you want to connect
• Name of the computer on which the database instance or server runs, or on which the database remote access software is installed
• Name of the server, database instance, or provider to which you want to link
Q. Explain how you can run a graph infinitely in Abinitio?
To execute a graph infinitely, the graph's end script should call the .ksh file of the graph. Therefore, if the graph name is abc.mp, then the end script of the graph should call abc.ksh. This will run the graph indefinitely.
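A minimal sketch of such an end script, assuming the deployed script sits in the sandbox's run directory ($AI_RUN is an assumed sandbox parameter):

# End Script of abc.mp: re-invoke the deployed graph so it restarts after every run
$AI_RUN/abc.ksh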

Q. Mention what is the difference between a “Lookup file” and a “Lookup” in Abinitio?
A Lookup file defines one or more serial files (flat files); it is the physical file where the data for the lookup is stored. A Lookup, meanwhile, is a component of an abinitio graph where we can save data and retrieve it by using a key parameter.

Q. Mention what are the different types of parallelism used in Abinitio?


Different types of parallelism used in Abinitio include:

• Component parallelism: A graph with multiple processes executing simultaneously on separate data uses component parallelism.
• Data parallelism: A graph that works with data divided into segments and operates on each segment simultaneously uses data parallelism.
• Pipeline parallelism: A graph with multiple components executing simultaneously on the same data uses pipeline parallelism. Each component in the pipeline reads continuously from its upstream component, processes the data, and writes to its downstream component, so the components can operate in parallel.

Q. Explain what is Sort Component in Abinitio?


The Sort component in Abinitio re-orders the data. It has two parameters, “Key” and “Max-core”:

• Key: Determines the collation order
• Max-core: Controls how often the Sort component dumps data from memory to disk

Q. Mention what dedup-component and replicate component does?

• Dedup component: Removes duplicate records
• Replicate component: Combines the data records from the inputs into one flow and writes a copy of that flow to each of its output ports
Q. Mention what is a partition and what are the different types of partition
components in Abinitio?
In Abinitio, partitioning is the process of dividing data sets into multiple sets for further processing. The different types of partition component include:

• Partition by Round-robin: Distributes data evenly, in block-size chunks, across the output partitions
• Partition by Range: Divides data evenly among nodes, based on a set of partitioning ranges and a key
• Partition by Percentage: Distributes data so that the output is proportional to fractions of 100
• Partition by Load Balance: Distributes data dynamically, based on load
• Partition by Expression: Divides data according to a DML expression
• Partition by Key: Groups data by a key

Q. Explain what is SANDBOX?


A sandbox is a collection of graphs and related files that are saved in a single directory tree and behave as a group for the purposes of navigation, version control, and migration.

Q. Explain what is de-partition in Abinitio?


De-partitioning is done in order to read data from multiple flows or operations and re-join the data records from the different flows. Several de-partition components are available, including Gather, Merge, Interleave, and Concatenate.

Q. List out some of the air commands used in Abintio?


Air commands used in Abinitio include:

• air object ls <EME path for the object, e.g. /Projects/edf/..>: Lists the objects in a directory inside the project
• air object rm <EME path for the object>: Removes an object from the repository
• air object versions -verbose <EME path for the object>: Gives the version history of the object

Other air commands include air object cat, air object modify, air lock show user, etc.
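A short usage sketch of the commands above (the repository paths and graph name are hypothetical):

air object ls /Projects/edf/main/mp
air object versions -verbose /Projects/edf/main/mp/load_customers.mp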
Q. Mention what is Rollup Component?
The Rollup component enables users to group records on certain field values. It is a multi-stage transform function and consists of three stages: initialize, rollup, and finalize.

Q. Mention what is the syntax for m_dump in Abinitio?


The m_dump command in Abinitio is used to view the data in a multifile from the Unix prompt. Examples include:

• m_dump a.dml a.dat: Prints the data as it appears from the GDE when viewing the data as formatted text
• m_dump a.dml a.dat > b.dat: Redirects the output into b.dat, which acts as a serial file and can be referred to whenever required

Q. What is the relation between eme, gde and co-operating system?


EME stands for Enterprise Metadata Environment, GDE for Graphical Development Environment, and the Co-Operating System can be regarded as the Abinitio server. The relation between them is as follows: the Co-Operating System is the Abinitio server, and it is installed on a particular OS platform called the native OS. The EME is much like the repository in Informatica; it holds the metadata, transformations, dbconfig files, and source and target information. The GDE is the end-user environment where we develop graphs (a graph is like a mapping in Informatica): the designer uses the GDE to design graphs and saves them to the EME or to a sandbox. The GDE sits on the user side, whereas the EME sits on the server side.

Q. What is the use of aggregation when we have rollup? As we know, the rollup component
in abinitio is used to summarize groups of data records; then where will we use
aggregation?
Aggregate and Rollup can both summarize data, but Rollup is much more convenient to use, and understanding how a particular summarization happens is much more explicit with Rollup than with Aggregate. Rollup can also do some extra things, such as input and output filtering of records. Both perform the same action, but Rollup exposes its intermediate result in main memory, while Aggregate does not support intermediate results.

Q. What are kinds of layouts does ab initio supports?


Basically, there are serial and parallel layouts supported by AbInitio, and a graph can have both at the same time. The parallel layout depends on the degree of data parallelism: if the multifile system is 4-way parallel, then a component in the graph can run 4-way parallel, provided its layout is defined to match that degree of parallelism.
Q. How can you run a graph infinitely?
To run a graph infinitely, the end script in the graph should call the .ksh file of the graph. Thus, if the name of the graph is abc.mp, then in the end script of the graph there should be a call to abc.ksh. In this way the graph will run indefinitely.

Q. How do you add default rules in transformer?


Double-click the transform parameter on the Parameters tab of the component's Properties dialog; this opens the Transform Editor. In the Transform Editor, click the Edit menu and select Add Default Rules from the dropdown. It will show two options: 1) Match Names, 2) Wildcard.

Q. Do you know what a local lookup is?


If your lookup file is a multifile that is partitioned/sorted on a particular key, then the local lookup function can be used instead of the ordinary lookup function call; it looks only at the partition local to the component, as determined by the key. A lookup file consists of data records that can be held in main memory, which lets the transform function retrieve records much faster than retrieving them from disk and allows the transform component to process the data records of multiple files quickly.
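A hedged DML sketch of the difference (the lookup name, key field, and returned field are hypothetical):

/* ordinary lookup: searches the whole lookup file */
out.city :: lookup("CustomerLookup", in.cust_id).city;

/* local lookup: searches only this partition of a keyed multifile lookup */
out.city :: lookup_local("CustomerLookup", in.cust_id).city;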

Q. What is the difference between look-up file and look-up, with a relevant example?
Generally, a lookup file represents one or more serial files (flat files). The amount of data is small enough to be held in memory, which allows transform functions to retrieve records much more quickly than they could from disk.
A lookup is a component of an abinitio graph where we can store data and retrieve it by using a key parameter. A lookup file is the physical file where the data for the lookup is stored.

Q. How many components in your most complicated graph?


It depends on the type of components you use; there is no fixed number. In general, avoid using overly complicated transform functions in a single graph.

Q. Explain what is lookup?


A lookup is basically a specific dataset which is keyed. It can be used to map values according to the data present in a particular file (serial or multifile). The dataset can be static as well as dynamic (for example, when the lookup file is generated in a previous phase and used as a lookup file in the current phase). Sometimes hash joins can be replaced by a reformat plus a lookup, if one of the inputs to the join contains a small number of records with a slim record length. AbInitio has built-in functions to retrieve values from a lookup using its key.

Q. Have you worked with packages?


Multistage transform components use packages by default. However, a user can create their own set of functions in a transform file and include it in other transform files.
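For example, a sketch of a user-defined function kept in its own transform file and reused elsewhere (the file path and function are hypothetical):

/* common.xfr: a reusable helper function */
out :: clean_name(s) =
begin
  out :: string_upcase(string_lrtrim(s));  /* trim leading/trailing spaces, then upper-case */
end;

/* another transform file can pull it in with an include */
include "/path/to/xfr/common.xfr";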

Q. Have you used rollup component? Describe how?


If the user wants to group records on particular field values, then Rollup is the best way to do that. Rollup is a multi-stage transform function and contains the following mandatory functions:

• initialize
• rollup
• finalize

You also need to declare a temporary variable if you want to get counts for a particular group.
For each group, Rollup first calls the initialize function once, followed by a rollup call for each of the records in the group, and finally calls the finalize function once after the last rollup call.
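A minimal sketch of such an expanded rollup package, counting the records in each group (the key and field names are hypothetical):

type temporary_type =
record
  decimal(10) count;
end;

/* called once at the start of each group */
temp :: initialize(in) =
begin
  temp.count :: 0;
end;

/* called once per record in the group */
temp :: rollup(temp, in) =
begin
  temp.count :: temp.count + 1;
end;

/* called once after the last record of the group */
out :: finalize(temp, in) =
begin
  out.key :: in.key;
  out.count :: temp.count;
end;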

Q. How do you add default rules in transformer?


Add Default Rules opens the Add Default Rules dialog, where you select one of the following: Match Names, which generates a set of rules that copies input fields to output fields with the same name, or Use Wildcard (.*) Rule, which generates a single wildcard rule that copies input fields to output fields with the same name.
1) If it is not already displayed, display the Transform Editor grid.
2) Click the Business Rules tab if it is not already displayed.
3) Select Edit > Add Default Rules.
In the case of a Reformat, if the destination field names are the same as, or a subset of, the source field names, then nothing needs to be written in the reformat xfr, unless you want a real transform beyond reducing the set of fields or splitting the flow into a number of flows.

Q. What is the difference between partitioning with key and round robin?
Partition by Key (hash partition) is a partitioning technique used to partition data when the keys are diverse. If one key value is present in large volume there can be a large data skew, but this method is used more often for parallel data processing.
Round-robin partitioning is another technique, which uniformly distributes the data across the destination data partitions. The skew is zero when the number of records is divisible by the number of partitions. A real-life example is how a pack of 52 cards is distributed among 4 players in a round-robin manner.

Q. How do you improve the performance of a graph?


There are many ways the performance of the graph can be improved.
1) Use a limited number of components in a particular phase
2) Use optimum max-core values for the sort and join components
3) Minimize the number of sort components
4) Minimize sorted join component and if possible replace them by in-memory join/hash join
5) Use only required fields in the sort, reformat, join components
6) Use phasing/flow buffers in case of merge, sorted joins
7) If the two inputs are huge then use sorted join, otherwise use hash join with proper driving port
8) For large dataset don’t use broadcast as partitioner
9) Minimize the use of regular expression functions like re_index in the transfer functions
10) Avoid repartitioning of data unnecessarily
Try to run the graph in MFS (multifile) layout for as long as possible. For this, the input files should be partitioned, and if possible the output file should also be partitioned.

Q. How do you truncate a table?


From Abinitio, use the Run SQL component with the DDL statement “truncate table <table_name>”, or use the Truncate Table component in Ab Initio.
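For instance, the statement issued through the Run SQL component might look like this (the table name is hypothetical):

TRUNCATE TABLE stage_customers;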

Q. Have you ever encountered an error called “depth not equal”?


When two components are linked together if their layout does not match then this problem can occur
during the compilation of the graph. A solution to this problem would be to use a partitioning
component in between if there was change in layout.

Q. What is the function you would use to transfer a string into a decimal?
In this case no specific function is required if the size of the string and the decimal are the same; a decimal cast with the size in the transform function will suffice. For example, if the source field is defined as string(8) and the destination as decimal(8) (say the field name is field1):

out.field1 :: (decimal(8)) in.field1;

If the destination field size is smaller than the input, the string_substring function can be used. Say the destination field is decimal(5):

out.field1 :: (decimal(5)) string_lrtrim(string_substring(in.field1, 1, 5)); /* string_lrtrim trims leading and trailing spaces */

Q. What are primary keys and foreign keys?


In an RDBMS, the relationship between two tables is represented as a primary key and foreign key relationship. The primary-key table is the parent table and the foreign-key table is the child table. The criterion for the relationship is that the two tables should have a matching column.

Q. What is an outer join?


An outer join is used when one wants to select all the records from a port – whether it has satisfied
the join criteria or not.
