
Capturing Unmatched Records from a Join in Data Stage

The Join stage does not provide reject handling for unmatched records (for example, in an inner
join scenario). If unmatched rows must be captured or logged, an OUTER join operation must be
performed. In an OUTER join scenario, all rows on an outer link (e.g. Left Outer, Right Outer, or
both links in the case of Full Outer) are output regardless of whether the key values match.
During an outer join, when a match does not occur, the Join stage inserts NULL values into the
unmatched columns. Care must be taken to change the column properties to allow NULL values
before the Join. This is most easily done by inserting a Copy stage and mapping a column from
NON-NULLABLE to NULLABLE.
A Filter stage can be used to test for NULL values in unmatched columns.
In some cases, it is simpler to use a Column Generator to add an indicator column, with a
constant value, to each of the outer links and test that column for the constant after you have
performed the join. This is also handy with Lookups that have multiple reference links.
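
The indicator-column idea can be sketched outside DataStage. The Python below is only an illustration, assuming unique keys on each side; the function names and the left_ind / right_ind columns are hypothetical and stand in for what a Column Generator would add.

```python
def full_outer_join(left, right, key):
    """Full outer join of two lists of dicts, assuming unique keys per side."""
    # Stand-in for the Column Generator: tag every row with a constant indicator.
    left_rows = {row[key]: {**row, "left_ind": 1} for row in left}
    right_rows = {row[key]: {**row, "right_ind": 1} for row in right}
    joined = []
    for k in sorted(left_rows.keys() | right_rows.keys()):
        # An unmatched side leaves its indicator as None (the NULL after the join).
        merged = {"left_ind": None, "right_ind": None}
        merged.update(left_rows.get(k, {}))
        merged.update(right_rows.get(k, {}))
        joined.append(merged)
    return joined


def unmatched(joined):
    """Stand-in for the Filter stage: keep rows where either indicator is missing."""
    return [r for r in joined if r["left_ind"] is None or r["right_ind"] is None]


left = [{"id": 1, "name": "a"}, {"id": 2, "name": "b"}]
right = [{"id": 2, "city": "x"}, {"id": 3, "city": "y"}]
rows = full_outer_join(left, right, "id")
print([r["id"] for r in unmatched(rows)])  # [1, 3]
```

Testing the indicator rather than a business column avoids confusing a genuine NULL in the data with a NULL produced by the failed match.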

I have a scenario like this:

Deptno=10 ----> first record and last record
Deptno=20 ----> first record and last record
Deptno=30 ----> first record and last record
I want the first and last records from each department in a single target. How can this be
done in DataStage? Can anyone assist me?
Thanks in advance.

Question Submitted By :: Data Stage



Answer #1 (ashok)

Source ---> Sort stage ---> Copy stage
From the Copy stage, take two output links (link1, link2):
link1 ---> Remove Duplicates stage (retain the first record from each dept)
link2 ---> Remove Duplicates stage (retain the last record from each dept)
Using a Funnel stage, we can combine these two results.

Answer #2

Take a Sequential File stage and give its output link to a Copy stage. From the Copy stage,
give one output link to a Head stage and one output link to a Tail stage. In the Head and Tail
stages, set the number of rows per partition to 1. Send both outputs to a Funnel stage with
funnel type = Sequence, and set the link order to control which record comes first. Finally,
give the output link to a Data Set, and you will get the output you want.
Note that Head and Tail return the first and last rows per partition, not per department.
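
The logic of answer #1 (sort, then keep the first and last row of each key group) can be sketched in plain Python. This is an illustration only; the deptno and empno field names and the sample rows are assumptions, not part of the original question.

```python
from itertools import groupby
from operator import itemgetter


def first_and_last_per_dept(rows):
    """Return the first and last row of every deptno group, in input order."""
    # sorted() is stable, so rows keep their original order within a department.
    ordered = sorted(rows, key=itemgetter("deptno"))
    result = []
    for _, group in groupby(ordered, key=itemgetter("deptno")):
        members = list(group)
        result.append(members[0])          # "retain first" Remove Duplicates link
        if len(members) > 1:
            result.append(members[-1])     # "retain last" Remove Duplicates link
    return result


rows = [
    {"deptno": 10, "empno": 1}, {"deptno": 20, "empno": 2},
    {"deptno": 10, "empno": 3}, {"deptno": 10, "empno": 4},
    {"deptno": 20, "empno": 5}, {"deptno": 30, "empno": 6},
]
picked = first_and_last_per_dept(rows)
print([r["empno"] for r in picked])  # [1, 4, 2, 5, 6]
```

A department with a single row is emitted once here; the two-link Funnel design would emit it twice unless the duplicate is removed afterwards.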
1) How can I get the 2nd highest salary in DataStage? Can you send me the answer? Thank you.
2) I have a source with 2 columns: the 1st column contains 1, 2, 3 and the 2nd column contains
10, 10, 10. I have to get the 2nd column in the target as 20, 30, 40. How can I do this?


Answer #1 (farzana kalluri)

seq file ---> Sort stage (descending) ---> Surrogate Key stage ---> target
Sort the data in descending order so that it starts from the highest salary. After that, add
a Surrogate Key stage; it automatically generates sequence numbers, so that we can identify
the 2nd highest salary.

Answer #2 (subhash), for the 2nd question:

Source is:
Col1 Col2
1    10
2    10
3    10
In the Transformer, we put the logic below:
Link.Col1 * Link.Col2 + Link.Col2 ------> Col2
1*10+10 -------------> 20
2*10+10 -------------> 30
3*10+10 -------------> 40
Hope this is fine.

Answer #3

There are two ways to get this.
1. Sort the data in descending order and set the Create Key Change Column option to True.
2. This can be done in the Transformer stage using stage variables.
a. First sort the data in descending order (the key is the column you want to rank on, here
salary).
b. Define three stage variables in the Transformer: sv1, sv2 and sv3. Set the initial values
of sv2 and sv3 to 0. Then:
if (salary = sv3) then sv2 else (sv2 + 1) ------> sv1
sv1 -------> sv2
salary --------> sv3
c. On the output link, filter with a constraint on sv1: sv1 = 1 gives the rows with the
highest salary, sv1 = 2 the rows with the 2nd highest salary, and so on.
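
The stage-variable logic of answer #3 amounts to a dense rank over salaries sorted in descending order. The sketch below mirrors sv1, sv2 and sv3 in Python; the function names are hypothetical, and sv3 starts as None rather than 0 (an assumption made so that a salary of 0 is still ranked correctly).

```python
def dense_rank_desc(salaries):
    """Pair every salary with its dense rank (1 = highest)."""
    ranked = []
    sv2 = 0       # rank assigned to the previous row
    sv3 = None    # salary seen on the previous row
    for salary in sorted(salaries, reverse=True):
        sv1 = sv2 if salary == sv3 else sv2 + 1   # same salary keeps the same rank
        ranked.append((salary, sv1))
        sv2, sv3 = sv1, salary
    return ranked


def nth_highest(salaries, n):
    """Equivalent of the output-link constraint sv1 = n."""
    return [salary for salary, rank in dense_rank_desc(salaries) if rank == n]


print(nth_highest([3000, 5000, 4000, 5000, 4000], 2))  # [4000, 4000]
```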

I have data like:

sam
ram
sam
raj
ram

I want two targets:

trgt1
ram
sam

trgt2
raj

How can I do this in DataStage?


Answer #1

src ---> agg ---> t/r ---> 2 trgts
Take the source as a Sequential File, then use an Aggregator stage to calculate the count of
records based on the name column. Next, use a Transformer to apply constraints: rows with
cnt > 1 go to trgt1 and rows with cnt = 1 go to trgt2.
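
As a rough illustration of the count-then-route logic, here is a Python sketch using the sample names from the question. The function name is hypothetical; Counter plays the role of the Aggregator and the two list comprehensions play the role of the Transformer constraints.

```python
from collections import Counter


def split_by_count(names):
    """Route names seen more than once to trgt1 and names seen once to trgt2."""
    counts = Counter(names)                                   # Aggregator: count per name
    trgt1 = sorted(n for n, c in counts.items() if c > 1)     # constraint cnt > 1
    trgt2 = sorted(n for n, c in counts.items() if c == 1)    # constraint cnt = 1
    return trgt1, trgt2


trgt1, trgt2 = split_by_count(["sam", "ram", "sam", "raj", "ram"])
print(trgt1, trgt2)  # ['ram', 'sam'] ['raj']
```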

Input Data is:


Emp_Id, EmpInd
100, 0
100, 0
100, 0
101, 1
101, 1
102, 0
102, 0
102, 1
103, 1
103, 1
I want Output
100, 0
100, 0
100, 0
101, 1
101, 1
This means the indicator should be either all zeros or all ones per EmpId.
Implement this using both SQL and DataStage.


Answer #1 (subhash)

In DataStage:
SRC--->CPY---->JOIN----TFM---TGT
--------|---- /
--------|--- /
--------|-- /
--------|- /
--------AGG
In AGG, GROUP BY EmpId and calculate MIN and MAX of EmpInd for each EmpId.
JOIN the two links: one copy from CPY and the aggregated copy from AGG.
In TFM, put the constraint MIN = MAX to populate TGT; you will then get the required output.

Answer #2

1. SQL:
SELECT * FROM
( SELECT EmpId, COUNT(*) AS CNT1
FROM EMP GROUP BY
EmpId) E1,
( SELECT EmpId, COUNT(*) AS CNT2
FROM EMP GROUP BY
EmpId, EmpInd) E2
WHERE E1.EmpId = E2.EmpId AND
E1.CNT1 = E2.CNT2;
This returns the qualifying EmpIds; join the result back to EMP to list the individual rows.
2. DataStage:
SRC--->CPY---->JOIN----TFM---TGT
--------|---- /
--------|--- /
--------|-- /
--------|- /
--------AGG
In AGG, GROUP BY EmpId and calculate CNT (row count) and SUM (sum of EmpInd).
JOIN the two links: one copy from CPY and the aggregated copy from AGG.
In TFM, put the constraint SUM = 0 OR SUM = CNT (all zeros or all ones) to populate TGT;
you will then get the required output.
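
The CNT/SUM test can be checked with a small Python sketch. The sample rows below are hypothetical (a shortened version of the question's data) and the function name is an assumption; a group passes when its indicator sum is 0 (all zeros) or equals its row count (all ones).

```python
from collections import defaultdict


def rows_with_uniform_indicator(rows):
    """Keep rows whose emp_id has an indicator that is all 0s or all 1s."""
    stats = defaultdict(lambda: [0, 0])   # emp_id -> [CNT, SUM], as in the Aggregator
    for emp_id, ind in rows:
        stats[emp_id][0] += 1
        stats[emp_id][1] += ind
    # Join back to the detail rows and apply SUM = 0 OR SUM = CNT.
    return [(e, i) for e, i in rows if stats[e][1] in (0, stats[e][0])]


sample = [(100, 0), (100, 0), (101, 1), (102, 0), (102, 1)]
kept = rows_with_uniform_indicator(sample)
print(kept)  # [(100, 0), (100, 0), (101, 1)]
```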

Hi, my source is:
empno,deptno,salary
1,10,3.5
2,20,8
2,10,4.5
1,30,5
3,10,6
3,20,4
1,20,9
The target should be in the form below:

empno,max(salary),min(salary),deptno
1,9,3.5,20
2,8,4.5,20
3,6,4,10
Can anyone give the data flow in DataStage for the above scenario?

Thanks in advance.


Answer #1

source ---> copy ---> 2 aggregators ---> join ---> target
Aggregator 1: group by empno; compute max(sal), min(sal)
Aggregator 2: group by empno, deptno; compute max(sal)
By joining the outputs of both aggregators on empno and max(sal), we get the required output.
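
To see why two aggregations plus a join give the deptno of the maximum salary, here is a Python sketch run on the question's data. The names agg1, agg2 and max_min_with_dept are hypothetical; the join key is assumed to be empno together with max(sal).

```python
def max_min_with_dept(rows):
    """rows are (empno, deptno, salary) tuples."""
    agg1 = {}   # empno -> (max salary, min salary)
    agg2 = {}   # (empno, deptno) -> max salary within that department
    for empno, deptno, sal in rows:
        hi, lo = agg1.get(empno, (sal, sal))
        agg1[empno] = (max(hi, sal), min(lo, sal))
        agg2[(empno, deptno)] = max(agg2.get((empno, deptno), sal), sal)
    out = []
    # Join on empno and max(sal): only the department holding the maximum survives.
    for (empno, deptno), dept_max in sorted(agg2.items()):
        emp_max, emp_min = agg1[empno]
        if dept_max == emp_max:
            out.append((empno, emp_max, emp_min, deptno))
    return out


source = [(1, 10, 3.5), (2, 20, 8), (2, 10, 4.5), (1, 30, 5),
          (3, 10, 6), (3, 20, 4), (1, 20, 9)]
target = max_min_with_dept(source)
print(target)  # [(1, 9, 3.5, 20), (2, 8, 4.5, 20), (3, 6, 4, 10)]
```

If an employee reaches the same maximum salary in two departments, the join returns one row per department, so a tie-break rule would be needed.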

In a sequential file I have one record, and I want 100 records in the target. How can we do
that? Please explain which stages to use and what logic is needed.


Answer #1

1)
JOB1: SRC ----> COPY ----> TGT
Sequence:
Start Loop activity ----> JOB1 ----> End Loop activity
In the TGT stage, use 'Append' mode.
By looping 100 times, we get 100 records in the target.
2)
SRC ----> Transformer ----> TGT
We can achieve this by using a loop variable in the Transformer.
Loop While condition: "@ITERATION <= 100"
Without using the Funnel stage, how can data from different sources be populated into a
single target?


Answer #1

Hi Kiran,
We can populate the data from several sources into one target without using the Funnel stage
by using the Sequential File stage. Let me explain.
In the Sequential File stage there is a property called "File". First give the file name and
load the data. Then, in the same Sequential File stage, click the "File" property again in the
list of properties on the right; it will ask for another file name, so give the next file
name (do not load the data again). In the same way you can add as many files as you have.
Finally, run the job; the data from all the files will be appended automatically.
Thanks

I/P
---
ID Value
1  AB
2  ABC
3  ADE
4  A

O/P
---
ID Value
1  A
1  B
2  A
2  B
2  C
3  A
3  D
3  E
4  A


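Answer #1

First of all, we have to split the value into individual characters in a Transformer, using
substrings such as Value[1,1], Value[2,1] and Value[3,1], and then pivot the resulting
columns into rows:
c1 c2          c1 c2 c3 c4
1  AB          1  A  B
2  ABC  ---->  2  A  B  C   ----> pivot ----> o/p
3  ADE         3  A  D  E
4  A           4  A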
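
The split-then-pivot step above can be illustrated with a short Python sketch on the question's input. The function name is hypothetical; iterating over the string stands in for the per-character substrings, and emitting one tuple per character stands in for the Pivot stage.

```python
def split_chars(rows):
    """Turn (id, value) rows into one (id, character) row per character."""
    return [(row_id, ch) for row_id, value in rows for ch in value]


output = split_chars([(1, "AB"), (2, "ABC"), (3, "ADE"), (4, "A")])
print(output)
```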

records,
I want the target table to hold the rows whose column value starts with 'A' or 'B', and the
remaining rows to go to a reject output.
How can I achieve this in DataStage? Please help me.


Answer #1

The job design will be:

seq --- Tx ---- target.txt
         |_____ reject.txt
In the Transformer, use the constraint below for target.txt:
Left(city,1)='A' or Left(city,1)='B'
Tick the Otherwise option on the second link and send it to the reject file.

There are two files. The 1st file contains 5 records and the 2nd file contains 10 records.
In the target they want 50 records. How can this be achieved?


Answer #1 (bharath)

Use the query:
select * from tab1,tab2;
You get the Cartesian product of the rows of the two tables: if tab1 has m rows and tab2 has
n rows, then m x n rows are returned.

Answer #2

To both files we need to add one DUMMY column (with a constant value such as '1'; we can use
a Column Generator stage to generate this DUMMY column). Then we can JOIN the 2 files based
on this DUMMY column. Each row of File1 will then join with each row of File2, i.e.
5*10 = 50 records will come into the output.
I have 2 files. The 1st file contains only duplicate records and the 2nd file contains
unique records. Example:
File1:
1 subhash 10000
1 subhash 10000
2 raju    20000
2 raju    20000
3 chandra 30000
3 chandra 30000
File2:
1 subhash 10000
5 pawan   15000
7 reddy   25000
3 chandra 30000
Output file ---> capture all the duplicates in both files, with the count:
1 subhash 10000 3
1 subhash 10000 3
1 subhash 10000 3
2 raju    20000 2
2 raju    20000 2
3 chandra 30000 3
3 chandra 30000 3
3 chandra 30000 3


Answer #1

File1, File2 ===> Funnel ---> Copy ===> 1st link AGG, 2nd link JOIN ---> Filter ---> OutputFile
1. Pass the 2 files to a Funnel stage and then to a Copy stage.
2. From the Copy stage, send the 1st link to the AGG stage and the 2nd link to the JOIN stage.
3. In the AGG stage, group by the key columns (say ID, NAME), take the count, and JOIN back
on the key columns.
4. Filter on COUNT > 1 and send the output to OutputFile.
We get the desired output.
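
The four steps above can be traced in a short Python sketch using the question's two files. The function name is hypothetical, and grouping here is on the whole row, which is an assumption (the answer suggests ID and NAME as keys).

```python
from collections import Counter


def duplicates_with_count(file1, file2):
    """Funnel both files, count each row, and keep rows that occur more than once."""
    funnel = file1 + file2                      # Funnel stage
    counts = Counter(funnel)                    # Aggregator: count per key
    # Join the count back to every row and filter on COUNT > 1.
    return [row + (counts[row],) for row in sorted(funnel) if counts[row] > 1]


file1 = [(1, "subhash", 10000)] * 2 + [(2, "raju", 20000)] * 2 + [(3, "chandra", 30000)] * 2
file2 = [(1, "subhash", 10000), (5, "pawan", 15000), (7, "reddy", 25000), (3, "chandra", 30000)]
dupes = duplicates_with_count(file1, file2)
for row in dupes:
    print(row)
```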
I have a file that contains 2 records with the columns empname, company: (Ram, TCS) and
(Ram, IBM). I want empname, company1, company2 as Ram, TCS, IBM in the target. How?


Answer #1

The simple way is:
SRCFile ----> PIVOT stage ----> TGT
1. In the PIVOT stage, select the vertical pivot option.
2. Specify the 'Array Size' as 2.
3. Select the 'Group by' check box for 'empname'.
4. Select the 'Pivot' check box for 'company'.
Then you will get the desired output.
Hi all, I have a file and I need to fetch the records between the first and last records by
using the Transformer stage.
Example:

Source:
EMPNO EMPNAME
4567  shree
6999  Ram
3265  Venkat
2655  Abhi
3665  Vamsi
5852  Amit
3256  Sagar
3265  Vishnu
Target:
EMPNO EMPNAME
6999  Ram
3265  Venkat
2655  Abhi
3665  Vamsi
5852  Amit
3256  Sagar
I don't want the shree and Vishnu records. We can fetch them another way also, but how can I
write the function in the Transformer stage?


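Answer #1

In the Transformer stage's link constraints, write the constraint below:
@INROWNUM <> 1 And Not(LastRow())
Then you will get the desired output. LastRow() returns true only for the last input row, so
it is negated rather than compared with @INROWNUM. Run the Transformer in sequential mode,
because @INROWNUM is counted per partition.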
In a Sequential File, how can I split a column into two when the column contains a string
datatype?
For example, I have a string column with the value "subedar khaja". I want the output with
subedar in one column and khaja in a second column.
How? Could anybody solve it?


Answer #1

SEQUENTIAL FILE ----> TRANSFORMER ----> DATASET
In the Transformer stage we use the Field function (or the Left function). It is found in the
Transformer under Functions > String > Field:
Field(%string%, %delimiter%, %occurrence%)
Field('subedar khaja', ' ', 1) = column1
Field('subedar khaja', ' ', 2) = column2
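
For readers unfamiliar with Field, the following Python sketch mimics its basic behaviour: return the Nth delimited substring, counting from 1. The helper is hypothetical and only approximates the DataStage function; returning an empty string for a missing occurrence is an assumption.

```python
def field(string, delimiter, occurrence):
    """Return the occurrence-th (1-based) substring of string split on delimiter."""
    parts = string.split(delimiter)
    if 1 <= occurrence <= len(parts):
        return parts[occurrence - 1]
    return ""   # assumed behaviour when the occurrence does not exist


column1 = field("subedar khaja", " ", 1)
column2 = field("subedar khaja", " ", 2)
print(column1, column2)  # subedar khaja
```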
A flat file contains 200 records. I want to load the first 50 records the first time the job
runs, the second 50 records the second time it runs, and so on. How can you develop this job?


Answer #1

1st way:
1. Add a 'row number' column in the Sequential File stage, so that each record has a number
associated with it.
2. Add a job parameter that provides the record number from which we want to run the job. We
can pass this either from a Sequence Start Loop activity (list-type variable: 50, 100, 150,
200) or from a shell script.
3. In the Transformer, use a stage variable to process only 50 records starting from that
record number, by counting each record.
2nd way:
Design the job like this:
1. Add a 'row number' column in the Sequential File stage, so that each record has a number
associated with it.
2. Use a Filter stage and write the conditions like this:
a. row number column <= 50 (on the 1st link, to load the records into the target
file/database)
b. row number column > 50 (on the 2nd link, to write the records to a file with the same name
as the input file, in overwrite mode)
So, the first time the job runs, the first 50 records are loaded into the target and, at the
same time, the input file is overwritten with the remaining records, i.e. 51 to 200.
The 2nd time the job runs, the first 50 records of the file (i.e. 51-100) are loaded into the
target and the input file is overwritten with the remaining records, i.e. 101 to 200.
And so on: 50 records are loaded into the target on each run.
My input has a unique column-id with the values
10,20,30.....how can i get first record in one o/p
file,last record in another o/p file and rest of the
records in 3rd o/p file?


Answer #1 (Kumar)

As you have a single text file as the source, use the following approach to get the desired
output.
                     Head1 ------------> Target1
Seq. File ---> Copy  Tail2 ------------> Target2
                     Head3 ---> Tail --> Target3
Steps:
1. Read your source file using a Sequential File stage.
2. Pass the records to a Copy stage and take 3 output links.
3. Send the 1st link to a Head stage (Head1), the 2nd to a Tail stage (Tail2) and the 3rd to
a Head stage (Head3).
4. In Head1, specify 1 in the properties; it will pick up the 1st record and pass it to
Target1.
5. Similarly, to capture the last record in Target2, specify 1 in the Tail2 properties. It
will take the last record and pass it to Target2.
6. To load the remaining records, first capture the top records using Head3: if you have 10
records in the source, pick the top 9 records. Then use a Tail stage after the Head stage and
specify 8; it will pick all of those records except the 1st one. You can then load these
into Target3.
If you get confused, ask me.
Thanks
Kumar
I am running a job with 1000 records. If the job aborts after loading 400 records into the
target, I want to resume loading from record 401. How can we do this? This scenario is not
for a sequence job; it concerns only the job itself, e.g. Seq file --> Trans --> Dataset.


Answer #1

We can get the answer by using a Lookup stage. There are two tables: the source table with
1000 records and the target table with 400 records. Take the source table as the primary
input and the 400-record table as the reference input to the Lookup stage.
          reference table
                .
                .
                .
source ---> look-up ---> target

Answer #2 (sree)

With the help of the environment variable APT_CHECKPOINT_DIR, we can run the job for the
remaining records. It is available in DataStage Administrator: project-wide defaults for
general environment variables are set per project in the Projects tab under
Properties -> General tab -> Environment variables.
Here, we can enable this checkpoint variable. Then we can load the remaining records.
Posted: Fri Mar 12, 2004 3:06 am
DataStage Release: 7x
Job Type: Parallel
OS: Unix

I have defined a Join stage using a full outer join, and I have explicitly copied the key
from both inputs into the output data set.
Imagine that I pass the output to a Transformer stage and set some criteria to capture the
unmatched records from either side.
When I start looking for unmatched records in the output, I just can't get them. I have
tried the following methods:
1) IsNull(Link1.Key)
2) Len(Link1.Key) = 0
3) RawLength(Link1.Key) = 0
I have checked the output with Data Set Management; the key has nothing in it, and it is
not null.
Would anyone please help by suggesting:
1) the best way to look for unmatched records from two inputs;
2) how to set the criteria so that I can capture those records.
Thanks in advance.

santhu
Posted: Fri Mar 12, 2004 3:59 am

Hi,
First of all, when you use JOIN stage, you cannot capture any unmatched
data in the output link of the JOIN stage. So any condition under
Transformer will not help.
There are 3 ways of horizontally combining data, i.e. JOIN, LOOKUP and MERGE.
Possibilities of capturing unmatched data for each:
1) JOIN: You cannot capture unmatched data for any kind of join using the
JOIN stage i.e neither from the left nor the Right inputs.
2) LOOKUP: Lookup stage has only 1 Primary Source and can have N
lookups / secondary data/reference data. You can capture unmatched
primary data in the Reject set (1 only) if you specify "Reject" option in the
lookup stage settings. You cannot capture unmatched secondary / lookup
data
3) MERGE: This stage has 1 MASTER source and can have N update /
secondary sources. You can KEEP / DROP the master source data if not
matching, and you can capture all the N unmatching update / secondary
sources into respective N Reject files.
Hope this helps to solve your issue
Regards,
Santhosh S

Orchadmin Command : DataStage


Atul.Singh | Aug 17 2013

Orchadmin is a command-line utility provided by DataStage for examining and managing data sets.

The general callable format is : $orchadmin <command> [options] [descriptor file]

Before using orchadmin, make sure that either the working directory or $APT_ORCHHOME/etc
contains the file config.apt, or that the environment variable $APT_CONFIG_FILE is defined
for your session.

Orchadmin commands

The various commands available with orchadmin are

1. CHECK: $orchadmin check

Validates the configuration file contents, such as the accessibility of all nodes defined in
the configuration file and the scratch disk definitions. Throws an error when the config file
is not found or not defined properly.

2. COPY : $orchadmin copy <source.ds> <destination.ds>

Makes a complete copy of the source data set under the new destination descriptor file name.
Please note that:
a. You cannot use the UNIX cp command, as it just copies the descriptor file to a new name.
The data is not copied.
b. The new data set is laid out according to the config file currently in use, not according
to the old config file that was in use for the source.

3. DELETE : $orchadmin < delete | del | rm > [-f | -x] descriptorfiles

The UNIX rm utility cannot be used to delete data sets. The orchadmin delete or rm command
should be used to delete one or more persistent data sets.
-f : Forces the delete. If some nodes are not accessible, -f deletes the data set partitions
on the accessible nodes and leaves the partitions on inaccessible nodes as orphans.
-x : Uses the current config file while deleting, rather than the one stored in the data set.

4. DESCRIBE: $orchadmin describe [options] descriptorfile.ds

This is the single most important command.

Without any option, it lists the number of partitions, number of segments, valid segments,
and preserve partitioning flag details of the persistent data set.
-c : Prints the configuration file that is stored in the data set, if any.
-p : Lists the partition-level information.
-f : Lists the file-level information in each partition.
-e : Lists the segment-level information.
-s : Lists the metadata schema of the data set.
-v : Lists all segments, valid or otherwise.
-l : Long listing. Equivalent to -f -p -s -v -e.

5. DUMP: $orchadmin dump [options] descriptorfile.ds

The dump command is used to dump (extract) the records from the data set.
Without any options, the dump command lists all the records, from the first record in the
first partition to the last record in the last partition.
-delim <string> : Uses the given string as the field delimiter instead of a space.
-field <name> : Lists only the given field instead of all fields.
-name : Lists all the values preceded by the field name and a colon.
-n numrecs : Lists only the given number of records per partition.
-p period(N) : Lists every Nth record from each partition, starting from the first record.
-skip N : Skips the first N records from each partition.
-x : Uses the current system configuration file rather than the one stored in the data set.

6. TRUNCATE: $orchadmin truncate [options] descriptorfile.ds

Without options, deletes all the data (i.e. segments) from the data set.
-f : Forces the truncate. Truncates accessible segments and leaves the inaccessible ones.
-x : Uses the current system config file rather than the one stored in the data set.
-n N : Leaves the first N segments in each partition and truncates the remaining ones.

7. HELP: $orchadmin -help OR $orchadmin <command> -help

Help manual about the usage of orchadmin or orchadmin commands.

scenario
i/p file
col1
a,b,c
o/p
col1
a
b
c

**************
we can do this with field function in transformer stage
************
I worked on a similar scenario for one of my friends, so I want to explain that to you.
Input:100|aa,cc,bb,dd
200|aa
330|mm,nn
440|aa,cc,dd,ee,ff,gg
Output:440,cc
440,ee
440,gg
100,cc
100,dd
330,nn
200,aa
440,aa
440,dd
440,ff
100,aa
100,bb
330,mm
********************************************
I have developed a shell script, so before running the job we have to run the script.
Script is:
#!/bin/sh
# Truncate the two work files.
: > C:/temp/temp.txt
: > C:/temp/temp1.txt
# Append the comma count (NF-1) to every input line as a third '|' field.
for line in `cat C:/temp/pivot.txt`
do
VAR=`echo $line|awk -F"," '{printf"%s|%s\n",$0,NF-1}'`
echo $VAR >> C:/temp/temp.txt
done
# Largest comma count across all lines (numeric sort).
MAX=`cat C:/temp/temp.txt|awk -F"|" '{print $3}'|sort -rn|head -1`
# Pad every line with trailing commas so that all lines have the same number of fields.
for line in `cat C:/temp/temp.txt`
do
VER=`echo $line|awk -F"|" '{print $3}'`
MAX1=`expr $MAX - $VER`
line1=$line
while [ $MAX1 -gt 0 ]
do
line1=`echo $line1|awk -F"|" '{printf"%s|%s\n",$1,$2}'|sed 's/$/,/g'`
MAX1=`expr $MAX1 - 1`
done
if [ $MAX1 -eq 0 ]
then
line1=`echo $line1|awk -F"|" '{printf"%s|%s\n",$1,$2}'`
fi
echo $line1 >> C:/temp/temp1.txt
done
********************************************************************
Now read temp1.txt as a source in DataStage.
Job design: Seq file ----> Transformer ----> Pivot stage ----> Filter ----> Target Seq file.
Read temp1.txt in the Sequential File stage.
In the Transformer, parse the columns using the Field function.
Use the Pivot stage to pivot columns into rows.
Filter out the null records.
Pass the output to the target sequential file.
This is my approach; if anyone has a better approach, please share your idea.
*************
seq->transformer->pivot->target
In the Transformer, create three columns col1, col2, col3 using the substring operator:
colname[1,1] = col1
colname[3,1] = col2
colname[5,1] = col3
In the Pivot stage output, give the column derivation:
col = col1,col2,col3
My requirement is that for example
INPUT is given below
empid line_num Text
100 1 a
100 2 b
100 3 c
200 1 aa
200 2 bb
300 1 ccc
OUTPUT should be:emp text
100 abc
200 aabb
300 ccc
I have applied the same logic as given above, but I am not getting the output shown above.
Instead, the output I am getting is:
emp text
100 a
200 aa
Please advise me how to go about this.

bkumar103
Posted: Mon Sep 08, 2008 7:13 am

You can code it like this.

Suppose col1 is the key. Define three stage variables stgvar1, stgvar2 and stgvar3.
The derivations for the stage variables might look like this:
stgvar1:
if trim(DSLink2.col1) = trim(stgvar3) then 1 else 0
stgvar2:
if stgvar1 = 1 then trim(stgvar2):trim(DSLink2.col3) else trim(DSLink2.col3)
stgvar3:
trim(DSLink2.col1)
The output from the Transformer can be passed to a hashed file or an Aggregator stage to
remove the duplicates based on the key, keeping the last row for each key.
Thanks,
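
The stage-variable approach above can be traced with a small Python sketch on the empid / line_num / Text example. The names are hypothetical: prev_key plays the role of stgvar3 and running the role of stgvar2. The input is assumed to be sorted by key and line number, and the dictionary overwrite stands in for keeping the last row per key.

```python
def concat_text_per_key(rows):
    """rows are (empid, line_num, text) tuples, sorted by empid and line_num."""
    result = {}
    prev_key = None   # stgvar3: key of the previous row
    running = ""      # stgvar2: text accumulated for the current key
    for key, _line_num, text in rows:
        same_key = key == prev_key            # stgvar1
        running = running + text if same_key else text
        prev_key = key
        result[key] = running                 # the last row per key wins
    return result


rows = [(100, 1, "a"), (100, 2, "b"), (100, 3, "c"),
        (200, 1, "aa"), (200, 2, "bb"), (300, 1, "ccc")]
combined = concat_text_per_key(rows)
print(combined)  # {100: 'abc', 200: 'aabb', 300: 'ccc'}
```

Keeping the first row per key instead of the last would reproduce the wrong output reported in the question (100 a, 200 aa).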

In put data
col|col1
1|a
1|d
1|r
2|g
3|h
3|g
4|e
out put data
col|col1|count
1|a|1
1|d|2
1|r|3
2|g|1
3|h|1
3|g|2
4|e|1

I think this is your requirement.

The solution uses stage variables (define stagevar1 above stagevar, so that it is compared
with the previous row's key):
stagevar1 --- if col = stagevar then stagevar1 + 1 else 1
stagevar --- col
Output:
count ---- stagevar1
SQL Queries Interview Questions - Oracle


Analytical Functions Part 1
Analytic functions compute aggregate values based on a group of rows. They differ
from aggregate functions in that they return multiple rows for each group. Many
SQL developers avoid analytic functions because of their cryptic syntax or
uncertainty about how they work. Analytic functions save a lot of time in
writing queries and often perform better than equivalent queries written without them.
Before starting with the interview questions, we will see the difference between
aggregate functions and analytic functions with an example. I have used a SALES
table as the example to solve the interview questions. Please create the sales
table below in your Oracle database.

CREATE TABLE SALES
(
SALE_ID    INTEGER,
PRODUCT_ID INTEGER,
YEAR       INTEGER,
Quantity   INTEGER,
PRICE      INTEGER
);

INSERT INTO SALES VALUES ( 1, 100, 2008, 10, 5000);
INSERT INTO SALES VALUES ( 2, 100, 2009, 12, 5000);
INSERT INTO SALES VALUES ( 3, 100, 2010, 25, 5000);
INSERT INTO SALES VALUES ( 4, 100, 2011, 16, 5000);
INSERT INTO SALES VALUES ( 5, 100, 2012, 8, 5000);

INSERT INTO SALES VALUES ( 6, 200, 2010, 10, 9000);
INSERT INTO SALES VALUES ( 7, 200, 2011, 15, 9000);
INSERT INTO SALES VALUES ( 8, 200, 2012, 20, 9000);
INSERT INTO SALES VALUES ( 9, 200, 2008, 13, 9000);
INSERT INTO SALES VALUES ( 10, 200, 2009, 14, 9000);

INSERT INTO SALES VALUES ( 11, 300, 2010, 20, 7000);
INSERT INTO SALES VALUES ( 12, 300, 2011, 18, 7000);
INSERT INTO SALES VALUES ( 13, 300, 2012, 20, 7000);
INSERT INTO SALES VALUES ( 14, 300, 2008, 17, 7000);
INSERT INTO SALES VALUES ( 15, 300, 2009, 19, 7000);
COMMIT;


SELECT * FROM SALES;


SALE_ID PRODUCT_ID YEAR QUANTITY PRICE
--------------------------------------
1       100        2008 10       5000
2       100        2009 12       5000
3       100        2010 25       5000
4       100        2011 16       5000
5       100        2012 8        5000
6       200        2010 10       9000
7       200        2011 15       9000
8       200        2012 20       9000
9       200        2008 13       9000
10      200        2009 14       9000
11      300        2010 20       7000
12      300        2011 18       7000
13      300        2012 20       7000
14      300        2008 17       7000
15      300        2009 19       7000

Difference Between Aggregate and Analytic Functions:


Q. Write a query to find the number of products sold in each year?
The SQL query Using Aggregate functions is
SELECT Year,
COUNT(1) CNT
FROM SALES
GROUP BY YEAR;
YEAR CNT
--------
2009 3
2010 3
2011 3
2008 3
2012 3
The SQL query using analytic functions is

SELECT SALE_ID,
PRODUCT_ID,
Year,
QUANTITY,
PRICE,
COUNT(1) OVER (PARTITION BY YEAR) CNT
FROM SALES;
SALE_ID PRODUCT_ID YEAR QUANTITY PRICE CNT
------------------------------------------
9       200        2008 13       9000  3
1       100        2008 10       5000  3
14      300        2008 17       7000  3
15      300        2009 19       7000  3
2       100        2009 12       5000  3
10      200        2009 14       9000  3
11      300        2010 20       7000  3
6       200        2010 10       9000  3
3       100        2010 25       5000  3
12      300        2011 18       7000  3
4       100        2011 16       5000  3
7       200        2011 15       9000  3
13      300        2012 20       7000  3
5       100        2012 8        5000  3
8       200        2012 20       9000  3

From the outputs, you can observe that aggregate functions return only one row
per group, whereas analytic functions keep all the rows in the group. With
aggregate functions, the select clause can contain only the columns specified in the
GROUP BY clause and aggregate functions, whereas with analytic functions you can
specify all the columns in the table.
The PARTITION BY clause is similar to the GROUP BY clause; it specifies the window of
rows that the analytic function should operate on.
I hope you now have a basic idea of aggregate and analytic functions. Now let's
start solving the interview questions on Oracle analytic functions.
1. Write a SQL query using the analytic function to find the total sales(QUANTITY) of
each product?
Solution:
SUM analytic function can be used to find the total sales. The SQL query is
SELECT PRODUCT_ID,
QUANTITY,
SUM(QUANTITY) OVER( PARTITION BY PRODUCT_ID ) TOT_SALES
FROM SALES;
PRODUCT_ID QUANTITY TOT_SALES
-----------------------------
100        12       71
100        10       71
100        25       71
100        16       71
100        8        71
200        15       72
200        10       72
200        20       72
200        14       72
200        13       72
300        20       94
300        18       94
300        17       94
300        20       94
300        19       94

2. Write a SQL query to find the cumulative sum of sales (QUANTITY) of each
product? Here, first sort the QUANTITY in ascending order for each product and
then accumulate the QUANTITY.
Cumulative sum of QUANTITY for a product = QUANTITY of the current row + sum of
the QUANTITIES of all previous rows in that product.
Solution:
We have to use the option "ROWS UNBOUNDED PRECEDING" in the SUM analytic
function to get the cumulative sum. The SQL query to get the output is
SELECT PRODUCT_ID,
QUANTITY,
SUM(QUANTITY) OVER( PARTITION BY PRODUCT_ID
ORDER BY QUANTITY ASC
ROWS UNBOUNDED PRECEDING) CUM_SALES
FROM SALES;
PRODUCT_ID QUANTITY CUM_SALES
-----------------------------
100        8        8
100        10       18
100        12       30
100        16       46
100        25       71
200        10       10
200        13       23
200        14       37
200        15       52
200        20       72
300        17       17
300        18       35
300        19       54
300        20       74
300        20       94


The ORDER BY clause is used to sort the data. Here the ROWS UNBOUNDED
PRECEDING option specifies that the SUM analytic function should operate on the
current row and all the previous rows in the partition.
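
To make the window semantics concrete, here is a Python sketch that reproduces the cumulative sum for product 100 from the SALES data. It is an illustration only, not Oracle code; the function name is hypothetical, and the sort on (product_id, quantity) stands in for PARTITION BY PRODUCT_ID ORDER BY QUANTITY ASC.

```python
from itertools import groupby


def cumulative_sales(rows):
    """rows are (product_id, quantity); returns (product_id, quantity, cum_sales)."""
    out = []
    ordered = sorted(rows)   # partition key first, then quantity ascending
    for product_id, group in groupby(ordered, key=lambda r: r[0]):
        running = 0          # resets at the start of every partition
        for _, quantity in group:
            running += quantity      # current row + all preceding rows
            out.append((product_id, quantity, running))
    return out


sales = [(100, 10), (100, 12), (100, 25), (100, 16), (100, 8)]
cum = cumulative_sales(sales)
print(cum)  # [(100, 8, 8), (100, 10, 18), (100, 12, 30), (100, 16, 46), (100, 25, 71)]
```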

3. Write a SQL query to find the sum of sales of current row and previous 2 rows in a
product group? Sort the data on sales and then find the sum.
Solution:
The SQL query for the required output is
SELECT PRODUCT_ID,
QUANTITY,
SUM(QUANTITY) OVER(
PARTITION BY PRODUCT_ID
ORDER BY QUANTITY DESC
ROWS BETWEEN 2 PRECEDING AND CURRENT ROW) CALC_SALES
FROM SALES;
PRODUCT_ID QUANTITY CALC_SALES
------------------------------
100        25       25
100        16       41
100        12       53
100        10       38
100        8        30
200        20       20
200        15       35
200        14       49
200        13       42
200        10       37
300        20       20
300        20       40
300        19       59
300        18       57
300        17       54

The ROWS BETWEEN clause specifies the range of rows to consider for calculating
the SUM.
4. Write a SQL query to find the Median of sales of a product?
Solution:
The SQL query for calculating the median is
SELECT PRODUCT_ID,
QUANTITY,
PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY QUANTITY ASC)

29

FROM

SALES;

OVER (PARTITION BY PRODUCT_ID) MEDIAN

PRODUCT_ID QUANTITY MEDIAN
--------------------------
100        8        12
100        10       12
100        12       12
100        16       12
100        25       12
200        10       14
200        13       14
200        14       14
200        15       14
200        20       14
300        17       19
300        18       19
300        19       19
300        20       19
300        20       19
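As a cross-check of the MEDIAN column, the per-product median can be recomputed with Python's standard library. This is a sketch, with `partition_median` a hypothetical name. `statistics.median` averages the two middle values when the count is even, which is also what PERCENTILE_CONT(0.5) yields by interpolation.

```python
from statistics import median

def partition_median(rows):
    """Attach the median QUANTITY of each PRODUCT_ID to every one of its rows."""
    groups = {}
    for product_id, quantity in rows:
        groups.setdefault(product_id, []).append(quantity)
    medians = {pid: median(values) for pid, values in groups.items()}
    # One output row per input row, as with an analytic function.
    return [(pid, qty, medians[pid]) for pid, qty in rows]

# (PRODUCT_ID, QUANTITY) rows for products 100 and 200.
sales = [(100, 8), (100, 10), (100, 12), (100, 16), (100, 25),
         (200, 10), (200, 13), (200, 14), (200, 15), (200, 20)]

for row in partition_median(sales):
    print(row)
```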

5. Write a SQL query to find the minimum sales of a product without using the
group by clause.
Solution:
The SQL query is
SELECT PRODUCT_ID,
YEAR,
QUANTITY
FROM
(
SELECT PRODUCT_ID,
YEAR,
QUANTITY,
ROW_NUMBER() OVER(PARTITION BY PRODUCT_ID
ORDER BY QUANTITY ASC) MIN_SALE_RANK
FROM SALES
) WHERE MIN_SALE_RANK = 1;
PRODUCT_ID YEAR QUANTITY
------------------------
100        2012 8
200        2010 10
300        2008 17
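The "number the rows, keep number 1" idea can be sketched in Python as follows. The function name `min_sale_per_product` is hypothetical, and the sample rows are illustrative: the three minimum rows match the result above, while the other rows (including their years) are made up for the example.

```python
from itertools import groupby
from operator import itemgetter

# (PRODUCT_ID, YEAR, QUANTITY); non-minimum rows are invented for illustration.
sales = [(100, 2010, 25), (100, 2012, 8),
         (200, 2011, 15), (200, 2010, 10),
         (300, 2009, 18), (300, 2008, 17)]

def min_sale_per_product(rows):
    """Keep the row that ROW_NUMBER() would number 1 in each partition."""
    ordered = sorted(rows, key=itemgetter(0, 2))  # product, then quantity ascending
    result = []
    for _, group in groupby(ordered, key=itemgetter(0)):
        # enumerate(..., start=1) plays the role of ROW_NUMBER().
        for row_number, row in enumerate(group, start=1):
            if row_number == 1:  # WHERE MIN_SALE_RANK = 1
                result.append(row)
    return result

print(min_sale_per_product(sales))
# [(100, 2012, 8), (200, 2010, 10), (300, 2008, 17)]
```

If two rows tie on the minimum quantity, only one of them is kept, which mirrors ROW_NUMBER assigning distinct numbers to tied rows.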


Oracle analytic functions compute an aggregate value based on a group of rows, which opens up a
whole new way of looking at the data. This article explains how to use them to their full
potential.
Analytic functions differ from aggregate functions in the sense that they return multiple rows for
each group. The group of rows is called a window and is defined by the analytic clause. For each
row, a sliding window of rows is defined. The window determines the range of rows used to
perform the calculations for the current row.
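The difference in row counts can be made concrete with a short Python sketch. The data and the function names `aggregate_sum` and `analytic_sum` are hypothetical and only illustrate the contrast described above.

```python
from collections import defaultdict

rows = [("A", 10), ("A", 20), ("B", 5)]  # hypothetical (group, value) pairs

def aggregate_sum(rows):
    """GROUP BY style: one output row per group."""
    totals = defaultdict(int)
    for key, value in rows:
        totals[key] += value
    return sorted(totals.items())

def analytic_sum(rows):
    """SUM(...) OVER (PARTITION BY ...) style: one output row per input row."""
    totals = dict(aggregate_sum(rows))
    return [(key, value, totals[key]) for key, value in rows]

print(aggregate_sum(rows))  # [('A', 30), ('B', 5)]
print(analytic_sum(rows))   # [('A', 10, 30), ('A', 20, 30), ('B', 5, 5)]
```

The aggregate collapses three input rows into two, while the analytic version keeps all three and repeats the group total on each.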
Oracle provides many Analytic Functions such as
AVG, CORR, COVAR_POP, COVAR_SAMP, COUNT, CUME_DIST, DENSE_RANK, FIRST,
FIRST_VALUE, LAG, LAST, LAST_VALUE, LEAD, MAX, MIN, NTILE,
PERCENT_RANK, PERCENTILE_CONT, PERCENTILE_DISC, RANK,
RATIO_TO_REPORT, STDDEV, STDDEV_POP, STDDEV_SAMP, SUM, VAR_POP,
VAR_SAMP, VARIANCE.
The Syntax of analytic functions:
Analytic-Function(Column1,Column2,...)
OVER (
[Query-Partition-Clause]
[Order-By-Clause]
[Windowing-Clause]
)

Analytic functions take 0 to 3 arguments.


An Example:
SELECT ename, deptno, sal,
SUM(sal)
OVER (ORDER BY deptno, ename) AS Running_Total,
SUM(sal)
OVER ( PARTITION BY deptno
ORDER BY ename) AS Dept_Total,
ROW_NUMBER()
OVER (PARTITION BY deptno
ORDER BY ename) As Sequence_No
FROM emp
ORDER BY deptno, ename;


The PARTITION BY clause makes SUM(sal) be computed within each department, independently of
the other departments. The SUM(sal) is 'reset' whenever the department changes. The ORDER BY ENAME
clause sorts the data within each department by ENAME, so Dept_Total accumulates in that order.
1. Query-Partition-Clause
The PARTITION BY clause logically breaks a single result set into N groups, according
to the criteria set by the partition expressions. The analytic functions are applied to each
group independently; they are reset for each group.
2. Order-By-Clause
The ORDER BY clause specifies how the data is sorted within each group (partition).
For functions that depend on row order, this changes the output of the analytic function.
3. Windowing-Clause
The windowing clause gives us a way to define a sliding or anchored window of data, on
which the analytic function will operate, within a group. This clause can be used to have
the analytic function compute its value based on any arbitrary sliding or anchored
window within a group. With an ORDER BY present, the default window is an anchored
window that simply starts at the first row of a group and continues to the current row.
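The three clauses can be read as three steps: group, sort, then pick a frame for each row. The Python sketch below follows that reading; it is a simplified model, not how Oracle evaluates the query, and the `window` function, its parameters and the `staff` rows are all hypothetical.

```python
from itertools import groupby

def window(rows, partition_by, order_by, func, preceding=None):
    """Apply func to the frame of every row.

    preceding=None -> anchored window: partition start up to the current row.
    preceding=n    -> sliding window: n rows before the current row up to it.
    """
    out = []
    # Partition and order in one sort, then walk each partition separately.
    ordered = sorted(rows, key=lambda r: (partition_by(r), order_by(r)))
    for _, group in groupby(ordered, key=partition_by):
        part = list(group)
        for i, row in enumerate(part):
            start = 0 if preceding is None else max(0, i - preceding)
            out.append((row, func(part[start:i + 1])))
    return out

# Hypothetical (dept, name, salary) rows.
staff = [("d1", "b", 300), ("d1", "a", 100), ("d2", "a", 50), ("d1", "c", 200)]

# Anchored window: running maximum salary per department, in name order.
running_max = window(staff, lambda r: r[0], lambda r: r[1],
                     lambda frame: max(r[2] for r in frame))
# Sliding window: how many rows fall in a "1 preceding" frame.
frame_sizes = window(staff, lambda r: r[0], lambda r: r[1], len, preceding=1)

print([value for _, value in running_max])   # [100, 300, 300, 50]
print([value for _, value in frame_sizes])   # [1, 2, 2, 1]
```

Passing the frame function as an argument shows that the windowing logic is independent of whether the function is SUM, MAX, COUNT or something else.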
Let's look at an example with a sliding window within a group that computes the sum of the current
row's salary column plus the previous 2 rows in that group, i.e. a ROWS window clause:
SELECT deptno, ename, sal,
SUM(sal)
OVER ( PARTITION BY deptno
ORDER BY ename
ROWS 2 PRECEDING ) AS Sliding_Total
FROM emp
ORDER BY deptno, ename;


Now if we look at the Sliding Total value of SMITH, it is simply SMITH's salary plus the salaries
of the two preceding rows in the window [800 + 3000 + 2975 = 6775].
We can set up windows based on two criteria: RANGES of data values or ROWS offset from
the current row. When an analytic function has an ORDER BY but no explicit windowing clause,
a default window of RANGE UNBOUNDED PRECEDING is applied. That window covers all rows
in our partition that sort before the current row as specified by the ORDER BY clause, together
with the current row and any rows that tie with it on the ORDER BY values.
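ROWS and RANGE only give different results when the ORDER BY values contain ties. The following Python sketch illustrates this on made-up numbers; both function names are hypothetical.

```python
def rows_running_sum(values):
    """ROWS UNBOUNDED PRECEDING: the frame ends at the current physical row."""
    ordered = sorted(values)
    return [(v, sum(ordered[:i + 1])) for i, v in enumerate(ordered)]

def range_running_sum(values):
    """RANGE UNBOUNDED PRECEDING: the frame also includes every row that
    ties with the current row on the ORDER BY value."""
    ordered = sorted(values)
    return [(v, sum(x for x in ordered if x <= v)) for v in ordered]

values = [10, 20, 20, 30]  # hypothetical ORDER BY column with one tie

print(rows_running_sum(values))   # [(10, 10), (20, 30), (20, 50), (30, 80)]
print(range_running_sum(values))  # [(10, 10), (20, 50), (20, 50), (30, 80)]
```

With ROWS the two tied rows get different totals (30 and 50); with RANGE both see 50 because they are peers in the same frame.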

** Solving Top-N Queries **


Suppose we want to find out the top 3 salaried employees of each department:
SELECT deptno, ename, sal, ROW_NUMBER()
OVER (
PARTITION BY deptno ORDER BY sal DESC
) Rnk FROM emp;

This will give us the employee name and salary with ranks based on descending order of salary
within each department (the partition/group). Now, to get the top 3 highest paid employees for
each dept:
SELECT * FROM (
SELECT deptno, ename, sal, ROW_NUMBER()
OVER (
PARTITION BY deptno ORDER BY sal DESC
) Rnk FROM emp
) WHERE Rnk <= 3;

The WHERE clause keeps just the first three rows in each partition.


** Solving the problem with DENSE_RANK **


If we look carefully at the above output, we will observe that the salaries of SCOTT and FORD of
dept 20 are the same. So we are indeed missing the 3rd highest salaried employee of dept 20. Here
we will use the DENSE_RANK function to compute the rank of a row in an ordered group of rows.
The ranks are consecutive integers beginning with 1. The DENSE_RANK function does not
skip numbers and assigns the same number to rows with the same value.
The above query is now modified as:
SELECT * FROM (
SELECT deptno, ename, sal, DENSE_RANK()
OVER (
PARTITION BY deptno ORDER BY sal DESC
) Rnk FROM emp
)
WHERE Rnk <= 3;

and the output is as follows:

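Independently of the query output, the effect of switching from ROW_NUMBER to DENSE_RANK can be sketched in Python for a single department. This is an illustration only: `top_n` is a hypothetical helper, and the salary list is sample data with a tie at the top, not the result set of the query.

```python
def top_n(salaries, n, dense):
    """Return (salary, rank) pairs with rank <= n, highest salary first.

    dense=False numbers rows like ROW_NUMBER(): 1, 2, 3, ... regardless of ties.
    dense=True numbers rows like DENSE_RANK(): ties share a rank, no gaps.
    """
    ordered = sorted(salaries, reverse=True)  # ORDER BY sal DESC
    ranked = []
    rank = 0
    previous = None
    for position, salary in enumerate(ordered, start=1):
        if dense:
            if salary != previous:
                rank += 1  # a new distinct salary gets the next rank
            previous = salary
        else:
            rank = position
        ranked.append((salary, rank))
    return [(salary, r) for salary, r in ranked if r <= n]  # WHERE Rnk <= n

salaries = [800, 3000, 1100, 3000, 2975]  # sample data with a tie at the top

print(top_n(salaries, 3, dense=False))
# [(3000, 1), (3000, 2), (2975, 3)]
print(top_n(salaries, 3, dense=True))
# [(3000, 1), (3000, 1), (2975, 2), (1100, 3)]
```

With dense ranking the filter returns four rows, because the two tied salaries share rank 1 and the third distinct salary still qualifies.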