Ab Initio FAQ's Part1
1) What is a layout?
Ans: The layout determines whether a component runs in serial or parallel mode. If you specify a serial directory as the path, the component runs as a single stream; if you specify a multifile directory, the component runs in parallel mode. The path you specify also serves as the working directory of the graph, where all intermediate files are stored.
Layout can be specified as:
1) propagate from neighbors
2) URL
3) custom
4) host
Before you can run an Ab Initio graph, you must specify layouts to describe the following to the Co>Operating System: where the data files are located, and the number and location of the partitions in which the programs run.
2) What is skew?
Ans: Skew describes how unevenly data is distributed across partitions. You can tune performance by controlling skew, max-core and other settings; there are many ways to do this.
3) How to read only 10 records from an input file?
Ans: 1) There is a component called Sample in the Sort component folder. If you place it after the input file, you can specify how many records to pass through.
2) In the Input Table component, in the Parameters tab, you can specify how many records to read.
3) There is also a Leading Records component (in 2.11 anyway) that lets you specify the number of records to read from a serial or multifile.
4) Another way is the Read Raw component, available in 2.11 or higher, although you will then have to describe and process the record structure yourself, since it works with raw data.
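A minimal alternative sketch (not taken from the answers above; the component choice is an assumption): a Filter by Expression placed after the input file, with the select expression below, passes only the first 10 records it sees (per partition, if the input is a multifile):
/* select_expr of Filter by Expression */
next_in_sequence() <= 10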
4) How do you make 4-way data 8-way in a graph?
Ans: Repartition the data: feed the 4-way flow into a partition component whose layout is the 8-way multifile (a Gather, or an all-to-all flow, departitions the 4-way data on the way in).
If graph execution has to be stopped depending on certain conditions, use the force_error() function.
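A hedged sketch of how force_error() might appear inside a transform rule (the field name and message are illustrative):
out.amount :: if (in.amount >= 0) in.amount
              else force_error("Negative amount encountered");
/* whether the graph actually stops depends on the component's reject-threshold setting */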
10) What are the functions used for system date?
Ans:
a. today() :: Returns the internal representation of the current date on each call.
b. today1() :: Returns the internal representation of the current date on the first call only; subsequent calls return the same value.
(Note: DML represents dates internally as integer values specifying days relative to January 1, 1900.)
c. now() :: Returns the current local date and time.
d. now1() :: The first time a component calls now1, the function returns the value returned from the system function localtime. The second and subsequent times a component calls now1, it returns the same value it returned on the first call.
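A small hedged illustration of these functions inside a Reformat transform (the output field names are assumptions):
out.load_date :: today();   /* current date, re-evaluated on each call */
out.run_ts    :: now1();    /* date and time captured once and reused for the whole component run */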
11) How to convert a string into date format?
Ans: The string needs to be cast to the date type first. So if you have an input field containing the string 20031130 and your output field is a date (YYYY-MM-DD), use something like:
out.fieldname :: (date("YYYYMMDD")) in.fieldname;
The assignment then converts the value to the output field's date("YYYY-MM-DD") format.
Note: if the input field contains NULL data, the cast fails, so use the is_valid() / is_defined() functions to check the validity of the input data first.
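A hedged sketch of the complete rule with the validity check suggested in the note (field names and formats are assumptions; the input is assumed to hold text like 20031130 and the output field to be declared as date("YYYY-MM-DD")):
out.order_date :: if (is_defined(in.order_date_str) and is_valid((date("YYYYMMDD")) in.order_date_str))
                    (date("YYYYMMDD")) in.order_date_str
                  else (date("YYYYMMDD")) "19000101";   /* fallback value is illustrative */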
12) What is the relation between EME, GDE and the Co>Operating system?
Ans: EME stands for Enterprise Meta Environment, GDE for Graphical Development Environment, and the Co>Operating system is the Ab Initio server. The relation between them is as follows: the Co>Operating system is the Ab Initio server, installed on a particular OS platform called the native OS. The EME is comparable to a repository in Informatica; it holds the metadata, transformations, db config files, and source and target information. The GDE is the end-user environment where we develop graphs (the equivalent of mappings in Informatica); the designer uses the GDE to design graphs and saves them to the EME or to a sandbox. The sandbox is on the user side, whereas the EME is on the server side.
13) What is the use of Aggregate when we have Rollup? We know the Rollup component in Ab Initio is used to summarize groups of data records, so where would we use Aggregate?
Ans: Aggregate and Rollup can both summarize data, but Rollup is much more convenient to use and makes it much clearer how a particular summarization is performed. Rollup also offers extra functionality such as input and output filtering of records.
Aggregate and Rollup perform the same basic action, but Rollup exposes its intermediate result in main memory, whereas Aggregate does not support an intermediate result.
14) What kinds of layouts does Ab Initio support?
Ans: Basically there are serial and parallel layouts. A graph can have both at the same time. The parallel one depends on the degree of data parallelism: if the multifile system is 4-way parallel, then a component in the graph can run 4 ways parallel if its layout is defined to match that degree of parallelism.
15) How can you run a graph infinitely?
Ans: To run a graph infinitely, the end script of the graph should call the .ksh file of the graph itself. So if the name of the graph is abc.mp, the end script should contain a call to abc.ksh; this way the graph keeps re-running indefinitely.
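A hedged sketch of such an end-script line (the graph/script names and the use of $AI_RUN are assumptions):
# end script of abc.mp: re-launch the deployed script so the graph starts again
nohup ksh $AI_RUN/abc.ksh > /dev/null 2>&1 &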
16) Do you know what a local lookup is?
Ans: If your lookup file is a multifile and is partitioned/sorted on a particular key, then the lookup_local function can be used instead of the lookup function. It searches only the partition that is local to the calling component, based on the key.
A Lookup File holds its data records in main memory, which lets the transform function retrieve records much faster than reading them from disk, so the transform component can process records from multiple files quickly.
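A hedged sketch of the two function calls inside a Reformat transform ("Customers" is an illustrative Lookup File name keyed on cust_id; the field names are assumptions):
out.cust_name :: if (lookup_count("Customers", in.cust_id) > 0)
                   lookup("Customers", in.cust_id).cust_name
                 else "UNKNOWN";
/* if the lookup file is a multifile partitioned on the same key as the incoming flow,
   lookup_local("Customers", in.cust_id) searches only the local partition */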
17) What is the difference between a lookup file and a lookup, with a relevant example?
Ans: Generally a Lookup File represents one or more serial files (flat files) whose data is small enough to be held in memory. This allows transform functions to retrieve records much more quickly than they could from disk.
A lookup is the keyed access mechanism in an Ab Initio graph: we store data and retrieve it by a key parameter. A lookup file is the physical file where the data for the lookup is stored.
18) How many components were in your most complicated graph?
It depends on the type of components you use. Usually, avoid putting too many complicated transform functions in a single graph.
19) Explain what a lookup is.
A lookup is basically a specific dataset which is keyed. It can be used to map values based on the data present in a particular file (serial or multifile). The dataset can be static as well as dynamic (for example, when the lookup file is generated in a previous phase and used as a lookup file in the current phase). Sometimes hash joins can be replaced by a Reformat plus a lookup when one of the inputs to the join has a small number of records with a narrow record length.
Ab Initio has built-in functions to retrieve values from the lookup using the key.
20) What is a ramp limit?
The limit parameter contains an integer that represents a number of reject events. The ramp parameter contains a real number that represents a rate of reject events per record processed.
Number of bad records allowed = limit + (number of records processed * ramp).
Ramp is basically a percentage value (from 0 to 1). Together the two provide the threshold of bad records the component tolerates.
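For example (using the formula above): with limit = 10 and ramp = 0.01, after 2,000 records have been processed the component tolerates up to 10 + (2000 * 0.01) = 30 reject events before it aborts.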
21) Have you worked with packages?
A multistage transform component uses packages by default. However, a user can create his own set of functions in a transform file and include it in other transform functions.
22) Have you used the Rollup component? Describe how.
If you want to group records on particular field values, Rollup is the best way to do it. Rollup is a multistage transform component and, in expanded mode, it contains the following mandatory functions:
1. initialize
2. rollup
3. finalize
You also need to declare a temporary variable if you want to get counts for a particular group.
For each group, Rollup first calls the initialize function once, then calls the rollup function for each record in the group, and finally calls the finalize function once after the last rollup call.
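A hedged sketch of an expanded-mode rollup package that counts records per group (field names are assumptions, not from this FAQ):
type temporary_type =
record
  integer(8) count;
end;

temp :: initialize(in) =
begin
  temp.count :: 0;
end;

temp :: rollup(temp, in) =
begin
  temp.count :: temp.count + 1;
end;

out :: finalize(temp, in) =
begin
  out.cust_id   :: in.cust_id;   /* the grouping key */
  out.rec_count :: temp.count;   /* records seen in this group */
end;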
23) How do you add default rules in the transform editor?
Add Default Rules opens the Add Default Rules dialog. Select one of the following: Match Names, which generates a set of rules that copies input fields to output fields with the same name; or Use Wildcard (.*) Rule, which generates one rule that copies input fields to output fields with the same name.
1) If it is not already displayed, display the Transform Editor grid.
2) Click the Business Rules tab if it is not already displayed.
3) Select Edit > Add Default Rules.
In the case of a Reformat, if the destination field names are the same as, or a subset of, the source fields, then you do not need to write anything in the reformat xfr, unless you want to apply a real transform beyond reducing the set of fields or splitting the flow into a number of flows.
24) What is the difference between partitioning with key and round robin?
Partition by Key (hash partition) is a partitioning technique used when the keys are diverse. If a single key value occurs in very large volumes there can be large data skew, but this is the method used most often for parallel data processing that needs grouping by key.
Round-robin partition is another partitioning technique that distributes the data uniformly across the destination partitions. The skew is zero when the number of records is exactly divisible by the number of partitions. A real-life example is how a pack of 52 cards is dealt to 4 players in round-robin fashion.
25) How do you improve the performance of a graph?
There are many ways the performance of the graph can be improved.
1) Use a limited number of components in a particular phase
2) Use optimum value of max core values for sort and join components
3) Minimise the number of sort components
4) Minimise sorted join component and if possible replace them by in-memory join/hash join
5) Use only required fields in the sort, reformat, join components
6) Use phasing/flow buffers in case of merge, sorted joins
7) If the two inputs are huge then use sorted join, otherwise use hash join with proper driving
port
33) Explain the difference between the truncate and "delete" commands.
The difference between the TRUNCATE and DELETE statement is Truncate belongs to DDL
command whereas DELETE belongs to DML command. Rollback cannot be performed in case of
truncate statement whereas Rollback can be performed in Delete statement. "WHERE" clause
cannot be used in Truncate where as "WHERE" clause can be used in DELETE statement.
34) How can we create a job sequencer in Ab Initio, i.e. run a number of graphs one after another?
There is no job sequencer supported by Ab Initio up to GDE 1.13.3 and Co>Op 2.12.1, but we can sequence the jobs by writing wrapper scripts in UNIX, i.e. a Korn shell script that calls the graphs in sequence.
In Ab Initio itself it is not possible to create a job sequence, but scheduling of the jobs can be done with a scheduling tool such as Control-M. In that tool the graphs' corresponding scripts and wrapper scripts are placed in the required order of execution, and we can monitor the execution of the graphs. There is no sequencer concept in Ab Initio. Suppose you have graphs A, B and C, where A's output is input to B and B's output is input to C; then you write a wrapper script that calls the jobs in order, like this:
a.ksh
b.ksh
c.ksh
(The next_in_sequence() function is a DML function used inside transforms; it returns a sequence of integers.)
35) How to take the input data from an Excel sheet?
There is a Read Excel component that reads Excel files either from the host or from a local drive; the DML is generated by default.
Alternatively, save the sheet as a CSV-formatted, delimited file and read it through an input file component.
36) What function would you use to convert a string into a decimal?
Use the reinterpret_as() function to convert a string to a decimal, or a decimal to a string.
Syntax, to convert a decimal into a string:
reinterpret_as(ebcdic string(13), (ebcdic decimal(13)) in.cust_amount)
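A hedged illustration of both directions (field names are assumptions): a plain cast is usually enough for readable numeric text, while reinterpret_as reuses the underlying bytes:
out.amount_dec :: (decimal(13)) in.amount_str;                 /* string to decimal by cast */
out.amount_str :: reinterpret_as(string(13), in.amount_dec);   /* decimal bytes re-read as a string */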
37) How to run the graph without GDE?
In the run directory a graph can be deployed as a .ksh file. Now, this .ksh file can be run at the
command prompt as:
ksh <script_name> <parameters if any>
38) How do you work with parameterized graphs?
One of the main purposes of parameterized graphs is that, if we need to run the same graph a number of times for different files, we set up graph parameters such as $INPUT_FILE and $OUTPUT_FILE and supply their values under Edit > Parameters. These parameters are substituted at run time. We can define different types of parameters such as positional, keyword, local etc.
The idea is that instead of maintaining different versions of the same graph, we maintain one version that works for different files.
Have you worked with packages?
Packages are nothing but reusable blocks of objects such as transforms, user-defined functions and DMLs. These packages have to be included in the transform where you use them. For example, consider a user-defined function like:
/* string_trim.xfr */
out :: trim(input_string) =
begin
  let string(35) trimmed_string = string_lrtrim(input_string);
  out :: trimmed_string;
end;
Now, the above xfr can be included in the transform where you call this function, as:
include "~/xfr/string_trim.xfr";
This include must appear ABOVE your transform function.
For more details see the help topic on packages.
What is an outer join?
An outer join is when you want to see all the records of one input file regardless of whether there is a matching record in the other file.
What is driving port?
In a Join it is sometimes advantageous to set the Sorted-Input parameter to "Input need not be sorted". This helps when we are sure that one of the input ports has far fewer records than the other port, so that the data from the smaller port can be held in memory. In this case, we set the other (larger) port as the driving port.
Say port in0 has 1,000 records and in1 has 1 million records; in this case we set port in1 as the driving port, for which the value of the driving parameter would be 1. By default the driving port value is 0 (for in0).
Depending on the requirement it is sometimes more advisable to create a lookup instead, but that depends on the requirement and design.
What is the writing of a wrapper? Can anyone explain elaborately?
Writing a wrapper script helps you run graphs in the sequence you want.
Example: you need to run 3 graphs, with the condition that only after the first graph has run successfully do you take the feed it generated and use it in the next graph, and so on. So you write a Unix script that runs the ksh of the first graph, checks that it ran successfully, then runs the second ksh, and so on, as in the sketch below.
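A hedged sketch of such a wrapper (graph script names are illustrative):
#!/bin/ksh
# run the deployed graphs in order, stopping at the first failure
ksh a.ksh || { echo "graph a failed"; exit 1; }
ksh b.ksh || { echo "graph b failed"; exit 1; }
ksh c.ksh || { echo "graph c failed"; exit 1; }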
What is Conditional DML? Can anyone please explain with an example?
A record format in which some fields are included or excluded depending on a condition in the data is known as conditional DML.
Suppose we have data that includes a header, main data and a trailer, distinguished by a record-type id, as given below:
10 This data contains employee info.
20 emp_id, emp_name, salary
30 count
A sketch of the conditional DML for the above structure (following the style used here, with the detail and trailer branches completed to match the description) would be:
record
  decimal(",") id;
  if (id == 10)
  begin
    string("\n") info;
  end
  else if (id == 20)
  begin
    string(",") emp_id;
    string(",") name;
    decimal("\n") salary;
  end
  else if (id == 30)
  begin
    decimal("\n") count;
  end
end
(Example data for a related vector illustration: per-customer purchase amounts such as 1500, 1050 and 2640, with purchase dates such as 29.08.06, 31.08.06, 01.09.06 and so on; here no_purchase is the vector field representing the number of times a customer has made purchases.)
What does dependency analysis mean in Ab Initio?
Dependency analysis answers questions about data lineage: where the data comes from, which applications produce it, which applications depend on it, and so on.
For data parallelism we can use partition components, and for component parallelism we can use the Replicate component. Which component(s) can we use for pipeline parallelism?
Pipeline parallelism is when a connected sequence of components on the same branch of a graph executes concurrently.
Components like Reformat, where we distribute the input flow to multiple output flows (using output_index and some selection criteria) and process those output flows simultaneously, create pipeline parallelism.
However, components like Sort, where the entire input must be read before a single record is written to the output, cannot achieve pipeline parallelism.
Flow:
input file ------> reformat -----> rollup ------> filter by expression -----> output file
While such a graph runs you can see different record counts on each flow at the same moment (for example 50 records read, 25 rolled up, 10 filtered so far); the record counts you observe on the flows whenever you run a graph are the clearest illustration of pipeline parallelism.
What is .abinitiorc and what does it contain?
.abinitiorc is a configuration file for Ab Initio. A user-level .abinitiorc is found in the user's home directory; it generally contains the Ab Initio home path and login information (user id, encrypted password, login method) for the hosts the graph connects to at execution time.
The system-wide configuration file contains configuration variables such as AB_WORK_DIR, AB_DATA_DIR etc., and can be found in $AB_HOME/config.
What do you mean by .profile in Ab Initio, and what...?
.profile is a file that is executed automatically when a particular user logs in.
You can change your .profile to include any commands that you want to run whenever you log in; you can even put commands in your .profile that override settings made in /etc/profile (which is set up by the system administrator).
You can set the following in your .profile......
- Environment settings
- aliases
- path variables
- name and size of your history file
- primary and secondary command prompts.....and many more.
What is semi-join?
In Ab Initio there are 3 types of join:
1. inner join
2. outer join
3. semi join
In a semi join, the record-required parameter is set to True for one input and False for the other, so all records from the required input appear in the output whether or not they have a match.
";
target;
string("01") nm=NULL("");/*(maximum length is string(35))*/
Then we can have a mapping like:
Straight move.Trim the leading or trailing spaces.
What is driving port? When do you use it?
The driving port in a Join supplies the data that drives the join: every record from the driving port is compared against the data from the non-driving port(s).
We set the driving port to the larger dataset so that the non-driving data, which is smaller, can be kept in main memory, speeding up the operation.
What is $mpjret? Where is it used in Ab Initio?
$mpjret is the return value of the "mp run" shell command, i.e. the status of the Ab Initio graph execution, and it is typically tested in the deployed script (see the sketch below).
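A hedged sketch of how the end script of a deployed graph might test it (the message is illustrative):
# end script: $mpjret holds the status of the mp run command
if [ $mpjret -ne 0 ]; then
   echo "Graph failed with status $mpjret"
fi
exit $mpjret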
What is data cleaning? How is it done?
Put simply, it is purifying the data.
Data cleansing is the act of detecting and removing and/or correcting a database's dirty data (i.e., data that is incorrect, out-of-date, redundant, incomplete, or formatted incorrectly).
1. What is the difference between the GDE and the Co>Operating system?
The GDE (Graphical Development Environment) is the GUI used to develop graphs in a simple manner.
The Co>Operating system is, in effect, a distributed operating system that runs as the back-end server.
The current version of the GDE is 1.15 and of the Co>Operating system is 2.14.
2. Which process did you follow to develop a graph? (Commonly used components)
Reformat
Rollup
Join
Sort
Replicate
Partition by Expression and Partition by Key
Redefine Format
Multi Update
Lookup
Dataset components: Input, Intermediate and Output files
Ports commonly seen on components: reject, log, error
Specific parameters of Reformat:
Select
Output Index
5. What is the difference between the output_index and select parameters in Reformat?
Select and output_index are both used to filter the data, but with the select parameter you cannot get at the deselected records, whereas with the output_index parameter you can filter the data and also direct the deselected records to another output port.
6. What is the difference between the Reformat and Redefine Format components?
Reformat can change the record format by dropping, adding or modifying fields.
Redefine Format copies the records from input to output without changing the record values; it only reinterprets the record format.
7. Explain the Join component.
Join reads data from two or more inputs, combines the records with matching keys, and sends them to the output port.
Specific parameters:
Driving port: the driving port is the largest input; the remaining inputs are read directly into memory. (Available only when Sorted Input is set to "In memory: Input need not be sorted".)
Join type:
1. Inner join
2. Full outer join
3. Explicit join
Record-required parameters: available when the join type is set to Explicit. For a left outer join set record-required to True for input 0 and False for input 1; for a right outer join set it to False for input 0 and True for input 1.
Override key: sets alternative names for the key fields on a particular in port.
Max memory: the maximum number of bytes the component uses before it starts writing temporary files to disk (available when Sorted Input is set to "Input must be sorted"); the default is 8 MB.
Max-core: the maximum number of bytes the component uses before it starts writing temporary files to disk (available when Sorted Input is set to "In memory: Input need not be sorted"); the default is 64 MB.
Sorted Input: when set to "Input must be sorted" the component accepts only sorted input; when set to "In memory: Input need not be sorted" it accepts unsorted data.
Specific ports:
unusedn: we can retrieve the unmatched data from the unused ports.
9. Can we make an explicit join for more than two inputs?
Yes, we can join more than two inputs.
Example, for three inputs:
For a left outer join, set the record-required parameter to True for input 0 and False for input 1 and input 2.
For a right outer join, set the record-required parameter to False for input 0 and input 1, and True for input 2.
10. What is the difference between Merge and Join?
Both components combine data based on keys, but Join combines two (or more) input flows on matching key values, whereas Merge combines already-partitioned, sorted data back into a single sorted flow.
11. Explain the Sort component.
The Sort component sorts and merges data.
Parameters:
Key
Max-core (default is 100 MB)
12. How to determine the maximum memory usage of a component?
The maximum available value of max-core is 2^31 - 1 bytes.
13. Explain Partition by Key and Partition by Expression.
Partition by Key: distributes records to the output flow partitions according to their key values.
Partition by Expression: distributes records to the output flow partitions according to a DML expression.
14. What are the different types of partition components?
Partition by Key
Partition by Expression
Partition by Round-robin
Partition by Range
Broadcast
Departition components:
Merge
Interleave (combines in round-robin fashion)
Concatenate
Sort components:
Sort
Sort within Groups
Checkpointed Sort
23. What is a multifile and how can we create one from the command line?
An Ab Initio multifile is a large file partitioned into a control file and a set of data partitions, so that it can be processed in parallel.
We can create a multifile system on the command line using the m_mkfs command followed by the control URL and the partition URLs.
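A hedged example of the command (all paths are illustrative), creating a 4-way multifile system:
m_mkfs //host1/u/data/mfs_4way \
       //host1/vol01/data/mfs_4way_p0 \
       //host1/vol02/data/mfs_4way_p1 \
       //host1/vol03/data/mfs_4way_p2 \
       //host1/vol04/data/mfs_4way_p3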
24. What is the difference between phase and check point?
Phases are used to break up a graph into blocks for performance tuning.
Check point is used for recovery
Component parallelism
Pipeline parallelism
Data parallelism
Component parallelism:
Component parallelism occurs when program components execute simultaneously on different
branches of a graph.
Pipeline parallelism:
Pipeline parallelism occurs when a connected sequence of program components on the same
branch of a graph execute simultaneously.
Data parallelism:
Data parallelism occurs when you separate data into multiple divisions, allowing multiple
copies of program components to operate on the data in all the divisions simultaneously.
The types of flows are:
Straight flow
Fan-in flow
Fan-out flow
Straight flow: connects two components with the same depth of parallelism.
Fan-in flow: connects a component with a greater depth of parallelism to one with a lesser depth; in other words, it follows a many-to-one pattern.
Fan-out flow: connects a component with a lesser number of partitions to one with a greater number of partitions; in other words, it follows a one-to-many pattern.
Commonly used DML functions:
Lookup functions
Date functions
is_error (tests whether an error occurs while evaluating an expression)
decimal_lpad
decimal_lrpad
string_compare
string_substring
string_concat
string_index
string_length
string_lpad
string_lrpad
Key
Record format
Non-phased
Phased
On the GDE Debugging toolbar, click the Add Watcher to Flow button
Right-click the flow and choose Add Watcher from the shortcut menu.
If you set a checkpoint on a phase, a .rec (recovery) file is created automatically. If the graph fails, then when you rerun it the Co>Operating System automatically recovers the data up to the last checkpoint.
If you want to run the graph from the beginning instead, you need to perform a manual rollback from the command line.
The command is m_rollback.
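A hedged example (the recovery-file name is illustrative); the -d option, recommended later in this FAQ, also removes the job's recovery and temporary files:
m_rollback -d my_graph.rec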
The declaration syntax for a local variable is:
let type_specifier variable_name [not NULL] [ = expression ] ;
not NULL - Optional. Keywords indicating that the variable cannot take on the value of null. These must appear after the variable name and before the initial value.
NOTE: If you create a local variable without the not NULL keywords, and do not assign an initial value, the local variable initially takes on the value of null.
= expression - Optional initial value.
For example, the following local variable definitions define two variables, x and y. The value
for x depends on the value of the amount field of the variable in, and the value of y depends
on the value of x:
let int x = in.amount + 5;
let double y = 100.0 / x;
54. What is a global variable?
Within a package you can create a global variable and use it in all the transform functions present in the package, but you should declare the global variable outside the transform functions.
Declaration:
let type_specifier variable_name [not NULL] [ = expression ] ;
In this declaration:
not NULL - Optional. Keywords indicating that the variable cannot take on the value of null. These must appear after the variable name and before the initial value.
NOTE: If you create a global variable without the not NULL keywords, and do not assign an initial value, the global variable initially takes on the value of null.
= expression - Optional initial value.
Log files have either a .hlg or .nlg suffix. A log file ending in .hlg is on the control, or host,
machine of a graph. A log file ending in .nlg is on a processing machine of a graph.
The job_log_file can be an absolute or relative pathname. Paths have the following syntax:
o On the control machine AB_WORK_DIR/host/job_id/job_id.hlg
o On a processing machine AB_WORK_DIR/vnode/job_id-XXX/job_id.nlg, where
the XXX on a processing machine path is an internal ID assigned to each
machine by the Co>Operating System.
58.How can I generate DML for a database table from command line?
Using the m_db command line utility we can generate the dml.
Syntax is
m_db gendml dbc_file [options] -table tablename
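A hedged example of the command (the dbc file and table name are illustrative):
m_db gendml my_database.dbc -table customers > customers.dml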
59. Can we do check-in and check-out through the command line?
Yes, we can do check-in and check-out using air commands such as air object import and air object export.
60. What sort of issues did you solve in production support?
Ans) The Co>Operating System is core software that unites a network of computing resources (CPUs, storage disks, programs, datasets) into a production-quality data processing system.
The Co>Operating System is layered on top of the native operating systems of a collection of
computers. It provides a distributed model for process execution, file management, process
monitoring, checkpointing, and debugging.
The Graphical Development Environment (GDE) provides a graphical user interface into the
services of the Co>Operating System.
4) What are the differences between the various GDE connection methods?
Ans) There are a number of communication methods used to communicate between the GDE
and the Co>Operating System, including:
Ab Initio Server/REXEC:
Ab Initio Server/TELNET:
DCOM:
REXEC:
RSH:
TELNET:
SSH(/Ab Initio)
When using the GDE to connect to the Co>Operating system, the normal process for a
connection differs depending upon which communication method is selected. In broad
terms, two things tend to happen: files are transferred from the GDE to the target host (or
from the host to the GDE), and processes are started/executed on the host.
When using telnet, rexec and rsh, the basic steps are as follows.
A. The GDE transfers the execution script to the server via FTP.
B. The GDE connects to the server by means of the selected method.
C. The GDE executes that script on the server by means of the connection set up in
step B.
The process is different for connection methods that use the Ab Initio Server, however. These methods include Ab Initio Server/Telnet and Ab Initio Server/Rexec, as well as SSH and DCOM.
The use of the Ab Initio Control Server replaces the need for FTP and adds enhanced server-side services. When the Ab Initio Control Server is involved, the basic steps are as follows:
All file transfer occurs across the same Control Server connection.
Database table configuration files (for use with 2.1 Database Components)
.dat - data files
.dbc - database configuration files
.dml - record format files
.aih - Host Settings
.aip - Project Settings
9) What kind of flat file formats supports by Ab Initio Graphical Design Interface (GDE)?
Ans) The Ab Initio Graphical Design Interface (GDE) supports these flat file formats: All file
types use the .dat extension.
Serial Files
Multifiles
Ad-hoc Multifile
Serial Files
A serial file is a flat, non-parallel file also known as one-way parallel. You create serial files
using a Universal Resource Locator (URL) on the component's Description tab. The URL starts
with file: (for example, file://hostname/path/data.dat).
Multifiles:
A multifile is a parallel file consisting of individual files called partitions and often stored on
different disks or computers. A multifile has a control file that contains URLs pointing to one or
more data files. You can divide data across partition files using these methods: random or
round-robin partitioning, partitioning based on ranges or functions, and replication or
broadcast, in which each partition is an identical copy of the serial data. You create multifiles
using a URL on the component's Description tab.
Ad-hoc Multifile: An ad-hoc multifile is also a parallel file. Unlike a multifile, however, the
content of an ad-hoc multifile is not stored in multiple directories. In a custom layout, the
partitions are serial files. You create an ad-hoc multifile using partitions on the component's
Description tab.
10) What is dbc file contains?
Ans) File with a .dbc extension which provides the GDE with the information it needs to
connect to a database. A configuration file contains the following information:
The name and version number of the database to which you want to connect
The name of the computer on which the database instance or server to which you want
to connect runs, or on which the database remote access software is installed
The name of the database instance, server, or provider to which you want to connect
These six parameters are automatically created (and assigned their correct value) whenever
you create a sandbox.
12) What is the difference b/w sandbox parameters & graph parameters?
Ans) The difference between sandbox parameters and graph parameters is:
Graph parameters are visible only to the particular graph to which they belong
Sandbox parameters are visible to all the graphs stored in a particular sandbox
Transform functions are always associated with transform components; these are components
that have a transform parameter: Aggregate, Denormalize Sorted, Fuse, Join, Match Sorted,
MultiReformat, Normalize, Reformat, Rollup, and Scan components.
18) What is a prioritized rule?
Ans) You can control the order of evaluation of rules in a transform function by assigning priority numbers to the rules. The rules are attempted in order of priority, starting with the assignment with the lowest-numbered priority, proceeding to assignments with higher-numbered priorities, and finally to any assignment for which no priority has been given.
19) What are local variables?
Ans) A local variable is a named storage location in an expression or transform function. You
declare a local variable within the transform function in which you want to use it. The local
variable is reinitialized each time the transform function is called, and it persists for one single
evaluation of the transform function.
20) What Is a Package?
Ans) A package is a named collection of related DML objects. A package can hold types,
transform functions, and variables, as well as other packages. Packages provide a means of
locating in one place DML objects that are needed more than once in a given graph, or needed
by multiple developers. Packages allow developers to avoid redundant code; this makes
maintenance of DML objects more efficient.
Packages are very useful in these types of situations:
The record formats of multiple ports use common record formats and/or type specifiers
Multiple components use common transforms
Turn off Undo by choosing File > Autosave/Undo on the GDE menu bar and clearing the
selection of Undo/Redo Enabled.
Turn off Propagation by choosing Edit > Propagation on the GDE menu bar and clearing
the selection of Record Format and Layout.
Increase the Tracking Interval by choosing Run > Default Settings on the GDE menu bar,
clicking the Code Generation tab, and increasing the Tracking Interval to 60 seconds.
Ans) There are three types of parallelism employed by the Co>Operating System:
Component Parallelism
Pipeline Parallelism
Data Parallelism
Analysis Level choices:
None
Translation Only - Translates graphs from GDE format to datastore format, but does not do error checking and does not store results in the datastore. Tip: We recommend that at minimum you do translation only, since it is required for analysis, which you can run anytime.
Translation with Checking - Translates graphs from GDE to datastore format and checks for errors that will interfere with dependency analysis. See Checked-for Errors.
Full Dependency Analysis (Default) - Performs full dependency analysis on the graph and saves the results in the datastore. Tip: We recommend that you do not do analysis now, as it can greatly prolong check-in.
What to Analyze
The What to Analyze group of checkboxes allows you to specify which files will be subjected to the level of analysis you specified in Analysis Level. The choices include:
All Files
All Unanalyzed Files - All files in the project that have changed, or those that are dependent on or are required by files that have changed since the last time they were analyzed, regardless of whether or not the files were checked in by you.
Only My Checked In Files - Only the files checked in by you. This group can include files you checked in earlier which are still on the analysis queue and have not yet been analyzed.
Analysis Scope
The Analysis Scope group of checkboxes allows you to specify how far the specified level of analysis will be extended to files dependent on those being analyzed, both in the current project and in other projects. The three choices are:
Files in other projects common to (included in) the one you are checking, if they are dependent on the files being analyzed.
Only the dependent files that are in the same project as the file(s) being analyzed.
No Dependent Files - no dependent files are analyzed.
Standard Parameters
Switch Parameters
Dependent Parameters
Ans) The value for the max-core parameter determines the maximum amount of memory, in bytes, that the component can use. If the component is running in parallel, the value of max-core represents the maximum memory usage per partition, not the sum for all partitions.
If you set the max-core value too low, the component runs more slowly than expected. If you
set the max-core value too high, the component might use too many machine resources, slow
the process drastically, and cause hard-to-diagnose failures.
41) What is ordered flow?
Ans) The Ordered attribute is a port attribute. It determines whether the order in which you
attach flows to a port, from top to bottom, is significant to the definition and purpose of the
component. If a port is ordered, the order in which flows are attached determines the result of
the processing the component does: if you change the order in which you attach the flows, you
create a different result.
Note: GDE indicates the difference between a port that is ordered and one that is not by
drawing them differently. If you inspect the ordered port of Concatenate in the graph, you see
a line dividing the port between the two flows; that line is not present in the port of Gather,
which is not ordered.
42) What will be the record order in the flows?
Ans) Components maintain the ordering of the input data records unless their explicit purpose
is to reorder records. For most components, if record x appears before record y in an input
flow partition, and if record x and record y are both in the same output flow partition, then
record x appears before record y in that output flow partition.
For example, if you supply sorted input to a Partition component, it produces sorted output
partitions.
Exceptions are:
The components that explicitly reorder records, such as Sort, Sort within Groups, and
Partition by Key and Sort.
The components that have fan-in flows, such as the Departition components. They each
define their own record order.
Each stage is written as a DML transform function. The multistage transform components are:
Denormalize, Normalize, Rollup & Scan
45) Explain about compress components?
Ans) There are a number of components that compress and uncompress data.
Components:
46) Difference b/w Replicate & Broadcast?
Ans) Broadcast arbitrarily combines all the data records it receives into a single flow and writes
a copy of that flow to each of its output flow partitions.
Replicate arbitrarily combines all the data records it receives into a single flow and writes a
copy of that flow to each of its output flows.
Use Replicate to support component parallelism: for example, when you want to perform more than one operation on a flow of data records coming from an active component.
Use Broadcast to increase data parallelism when you have connected a single fan-out flow to
the out port or to increase component parallelism when you have connected multiple straight
flows to the out port.
47) Explain about FUSE?
Ans) Fuse applies a transform function to corresponding records of each input flow. The first
time the transform function executes, it uses the first record of each flow. The second time the
transform function executes, it uses the second record of each flow, and so on. Fuse sends the
result of the transform function to the out port.
The component works as follows. The component tries to read from each of its input flows.
Inner join sets the record-required parameters for all ports to True. Inner join is the
default. The GDE does not display the record-required parameters because they all
have the same value.
Outer join sets the record-required parameters for all ports to False. The GDE does
not display the record-required parameters because they all have the same value.
Explicit allows you to set the record-required parameter for each port individually.
49) What is the use of override key parameter & where it is used?
Ans) Override key parameter is used in the Join component. To specify the alternative name(s)
for the key field(s) for a particular in port.
50) What are the different options available in reject threshold?
Ans) There are 3 options available; they are:
Never abort
Abort on first reject
Use limit/ramp
Ramp is a decimal representing the Rate of reject events in the number of records
processed.
The component stops the execution of the graph when the number of reject events
exceeds the result of the following formula:
limit + (ramp * number_of_records_processed_so_far)
Instead of using a statement in SQL, you can now extract the to-be-joined data from the
database by calling a stored procedure, specified in the sql_select parameter. The syntax for
calling a stored procedure using Oracle or DB2 is as follows:
Either the name of a file containing a transform function, or a transform string. The
Reformat component calls the specified transform function for each input record. The
transform function uses the value of the input record to direct that input record to a
particular output port. The expected output of the transform function is the index of
an output port (zero-based). The Reformat component directs the input record to the
identified output port and executes the transform function, if any, associated with that
port.
When you specify a value for output-index, each input record goes to exactly one
transform/output port pair. For example, suppose there are 100 input records and two
output ports. Each output port receives between 0 and 100 records. According to the
transform function you specify for output-index, the split can be 50/50, 60/40, 0/100,
99/1, or any other combination that adds up to 100.
If you do not specify a value for output-index, Reformat sends every input record to
every transform/output port pair. For example, if Reformat has two output ports and
there are no rejects, 100 input records results in 100 output records on each port for a
total of 200 output records.
56) What is the difference between the Reformat and Redefine Format components?
Ans) The difference between Reformat and Redefine Format is that Reformat can actually
change the bytes in the data while Redefine Format simply changes the record format on the
data as it flows through, leaving the data unchanged.
The Reformat component can change the record format of data records by dropping fields, or
by using DML expressions to add fields, combine fields, or transform the data in the records.
The Redefine Format component copies data records from its input to its output without
changing the values in the data records. You use Redefine Format to change or rename fields in
a record format without changing the values in the records. In this way, it is similar to the DML
built-in function, reinterpret_as. Typically this component has different DML on its input and
output ports, and allows the unmodified data to be interpreted in a different form.
57) Explain Multi Reformat.
Ans) Multi Reformat changes the record format of data records flowing between one and 20 pairs of in and out ports, by dropping fields or by using DML expressions to add fields, combine fields, or transform the data in the records. A typical use for Multi Reformat is to place it immediately before a custom component that takes multiple inputs.
The component operates separately on the data flowing between each pair of its inN-outN ports. The count parameter specifies the total number of port pairs, and each inN-outN port pair has its own associated transformN to reformat the data flowing between those ports.
58) What is ABLOCAL() and how can I use it to resolve failures when unloading in parallel?
Ans) Some complex SQL statements contain grammar that is not recognized by the Ab Initio
parser when unloading in parallel. You can use the ABLOCAL() construct in this case to prevent
the Input Table component from parsing the SQL (it will get passed through to the database). It
also specifies which table to use for the parallel clause.
59) What is the difference between Update Table and Multi Update Table?
Ans) The main difference is that the commit number and commit table are mandatory parameters in Multi Update Table, whereas in Update Table they are optional.
Update Table modifies only a single table in the database, whereas Multi Update Table can modify more than one table, which is why it requires the commit table and commit number parameters.
API mode execution (the same in both components): the statements are applied to the incoming records one record at a time.
Denormalize consolidates groups of related data records into a single output record
with a vector field for each group, and optionally computes summary fields in the
output record for each group.
Both these components are Multi stage Transform components, The multi-stage transform
components require packages because, unlike other transform components, they are driven by
more than single transform functions. These components each take a package as a parameter
and, in order to process data, look for particular variables, functions, and types in that
package. For example, a multi-stage component might look for a type named temporary_type,
a transform function named finalize, or a variable named count_items.
61) How can I generate DML for a database table from the command line?
Ans) The Ab Initio command-line utility m_db, with the gendml argument, generates appropriate metadata for a database table or expression. The syntax for the utility is:
m_db gendml dbc_file [options] -table tablename
Concatenate appends multiple flow partitions of data records one after another.
Gather combines data records from multiple flow partitions arbitrarily.
Interleave combines blocks of data records from multiple flow partitions in round-robin
fashion.
Merge combines data records from multiple flow partitions that have been sorted
according to the same key specifier and maintains the sort order.
Broadcast arbitrarily combines all the data records it receives into a single flow and
writes a copy of that flow to each of its output flow partitions.
Partition by Expression distributes data records to its output flow partitions according
to a specified DML expression.
Partition by Key distributes data records to its output flow partitions according to key
values.
Partition by Range distributes data records to its output flow partitions according to
the ranges of key values specified for each partition.
Partition with Load Balance distributes data records to its output flow partitions,
writing more records to the flow partitions that consume records faster.
FTP From transfers files of data records from a computer that is not running the
Co>Operating System to a computer that is running the Co>Operating System.
FTP To transfers files of data records to a computer that is not running the
Co>Operating System from a computer that is running the Co>Operating System
Sort within Groups reads data records from all the flows connected to the in port until
it either reaches the end of a group or reaches the number of bytes specified in the
max-core parameter.
When Sort within Groups reaches the end of a group, it does the following:
a. Sorts the records in the group according to the minor-key parameter
b. Writes the results to the out port
c. Repeats this procedure with the next group
NOTE: When connecting a fan-in or all-to-all flow to the in port of a Sort, you do not need to
use a Gather because Sort can gather internally on its in port.
71) How you validate the records?
Ans) Validate Records uses the is_valid function to check each field of the input data records to
determine if the value in the field is:
Consistent with the data type specified for the field in the input record format
Meaningful in the context of the kind of information it represents
72) What is the difference between m_rollback and m_cleanup? When would you use them?
Ans) m_rollback rolls back a partially completed graph to its beginning state. m_cleanup
cleans up files left over from unsuccessfully executed graphs and manually recovered graphs.
The Co>Operating System automatically creates a recovery (.rec) file and other temporary files
and directories in the course of executing a graph. When a graph terminates abnormally, it
leaves the temporary files and directories on disk. At this point there are several alternatives
possible:
Roll back to the last checkpoint.
The Co>Operating System rolls back the graph automatically, if possible. You can roll back the
graph manually by explicitly using the m_rollback command without the -d option. After a
rollback, some temporary files and directories remain on disk. To remove them, follow one of
the other three alternatives.
Rerun the graph.
If the graph is not already rolled back, rerunning the graph first rolls back the graph to the last
checkpoint. The graph then starts re-executing. If the re-execution is successful, it removes all
temporary files and directories.
So, given this new feature, for old job files, you can use the m_cleanup utility to list the
temporary files and directories, and m_cleanup_rm to delete them. You can also use
m_cleanup_du to display the amount of space these files use. Because recovery files and
temporary files are automatically created in the course of a run, remember not to delete these
files for jobs that are still running.
73) What does the error message "straight flows may only connect ports having equal
depths" mean?
Ans) The "straight flows may only connect ports having equal depths" error message appears
when you connect two components running at different levels of parallelism (or depth) with a
straight flow (one that does not have an arrow symbol on it). For example, you get this error
message if you connect a Join running 10 ways parallel to a serial output file, or if you connect
a serial Join to a 4-way multifile.
74) What is AB_WORK_DIR and what do you need to know about it?
Ans) AB_WORK_DIR is a configuration variable whose value is a working space for graph execution. You can view the value of this by using m_env describe.
75) What does the error message "too many open files" mean, and how do you fix it?
Ans) The "too many open files" error messsage occurs most commonly because the value of the
max-core parameter of the Sort component is set too low. In these cases, increasing the value
of the max-core parameter solves the problem.
76) What does the error message "Failed to allocate <n> bytes" mean and how do you fix it?
Ans) The "failed to allocate" type of error message is generated when an Ab Initio process has
exceeded its limit for some type of memory allocation.
Reduce the value of max-core in order to reduce the amount of memory allocated to a
component before temporary files are used. When the amount of memory specified by
max-core is used up by a component, the component starts writing temporary files to
hold the data being processed.
Be aware that while reducing the value of max-core may solve the problem of running
out of swap space, it may have an adverse effect on the graph's performance and will
increase the number of temporary files.
Increase available swap space, for example, by waiting until other memory intensive
jobs have completed.
77) What do you need to do to configure to run my graph across two or more machines?
Ans) In order to execute a graph across multiple machines, you need to carry out the following
steps:
Make sure that all the machines involved have compatible Co>Operating Systems
installed.
Set up the configuration files (.abinitiorc files) so that the different Co>Operating
Systems can communicate with each other.
Set up the environment variables and make sure that they are propagated properly
from one machine to another, when appropriate.
Set up the graph so that it can run across the machines as desired.
78) What communication ports does the GDE use when communicating with the
Co>Operating System?
Ans) The communication ports used depend upon the communication protocol selected. In
short, the GDE uses:
SSH(/AI): 22
AI/REXEC: 512
AI/TELNET: 23 & **
The ** above refer to the dynamically determined port that the control server sets up for
the file transfer.
79) If you use the layout Database: default in your database component, which working
directory does the Co>Operating System use?
Ans) The $AB_WORK_DIR directory is the working directory for database layouts.
$AB_DATA_DIR provides disk storage for the temporary files.
80) What are vectors? Why would you use them?
Ans) Vectors are arrays of elements. An element can be a single field or an entire record. They
are often used to provide a logical grouping of information. Many programming languages use
the concept of an array. In broad terms, an array is a collection of elements that are logically
grouped for ease of access.
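A hedged sketch of a record format containing a vector field (names, formats and the fixed vector length are assumptions), echoing the earlier no_purchase example:
record
  decimal(",") cust_id;
  decimal(",") no_purchase;
  date("YYYYMMDD") purchase_dates[3];   /* a fixed-length vector of 3 dates */
  string("\n") newline;
end;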
81) How can you quickly test my DML expressions?
Ans) You can use the m_eval utility to quickly test the expressions that you intend to use in
your graphs.
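A couple of hedged examples of the utility (the expressions are illustrative):
m_eval '1 + 2'
m_eval 'string_lrtrim("  Ab Initio  ")'
m_eval 'today()'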
82) What is the layout for watcher files?
Ans) The debugger places watcher files in the layout of the component downstream of the
watcher.
83) How do you remove watcher files?
Ans) To delete all watcher datasets in the host directory (for all graphs), you can either use the
GDE menu option, Debugger > Clean-out Watcher Datasets or invoke the following command:
m_rm -f -rmdata GDE-WATCHER-xxx
84) How can I determine which version of the GDE and Co>Operating System I am using?
Ans) To determine your GDE version, on the GDE menu bar choose Help > About Ab Initio.
For the Co>Operating System, use either of the following commands:
m_env -version
m_env -v
85) Should you use a Reformat component with a lookup file or a Join component in graph?
Ans) First of all, there are situations in which you cannot use a Reformat with Lookup instead of
a Join. For example, you cannot do a Full Outer Join using a Reformat and Lookup. The answer
below assumes that in your particular case either Reformat with Lookup or Join can be used in
principle, and that the question is about performance benefits of one over the other. When the
lookup file (in case of lookup) or the nondriving input (in case of a Join) fits into the available
memory, the Join and the lookup offer very similar performance.
86) How can you increase the time-out value for starting an Ab Initio process?
Ans) You can increase time-out values with the Ab Initio environment variables
AB_STARTUP_TIMEOUT and AB_RTEL_TIMEOUT_SECONDS.
87) Give the file management commands.
Ans) To create a multifile system: m_mkfs [ options ] control_url partition_url [ partition_url ... ]
To delete a multidirectory:
To copy: m_cp
To move: m_mv
Disk free: m_df
Count records: m_wc
88) What are data-sized vectors? How do you work with them?
Ans) Data-sized vectors are vectors that have no set length of elements but, rather, are
variably sized based upon the number of elements in each data record. For example, if an
input dataset has three records, each with a vector, the first record's vector might have 5
elements, the second 1 element, and the third record, 7.
89) What is the difference b/w today (now) and today1 (now1)?
Ans) The today (now) function calls the operating system for the current date on each call.
In contrast, the function today1 (now1) calls the operating system for the current date
only on the first call in a job, returning the same value on subsequent calls. The
difference between the two functions is particularly noticeable on jobs that start
before and end after midnight.
1. Listing files using wildcards - If the input file names have a common pattern then you can
use a wild card for all the files. E.g. $AI_SERIAL/ad_hoc_input_*.dat. All the files that are
found at the runtime matching the wild card pattern will be taken for the Ad hoc multifile.
2. Listing files in a variable. You can create a runtime parameter for the graph and inside the
parameter you can list all the files separated by spaces.
3. Listing files using a command - E.g. $(ls $AI_SERIAL/ad_hoc_input_*.dat), which produces
the list of files to be used for the ad hoc multifile. This method gives maximum flexibility in
choosing the input files, since you can use complex commands also that involves owner of file
or date time stamp.
15) How can I tune a graph so it does not excessively consume my CPU?
ANSWER:
Options:
1. Reduce the DOP (degree of parallelism) for components.
Example:
1. Change from a 4-way parallel to a 2-way parallel.
2. Examine each transformation for inefficiencies.
Example:
1. If transformation uses many local variables, make these variables global.
2. If same function call is performed more than once; call it once and store its value in a global
variable.
3. When reading data, reduce the amount of data that needs to be carried forward to the next
component.
16) I'm having trouble finding information about the AB_JOB variable. Where and how can I
set this variable?
ANSWER:
You can change the value of the AB_JOB variable in the start script of a given graph. This will
enable you to run the same graph multiple times at the same time (thus parallel). However,
make sure you append some unique identifier such as timestamp or sequential number to the
end of each AB_JOB variable name you assign. You will also need to vary the file names of any
outputs to keep the graphs from stepping on each other's outputs. I have used this technique
to create a "utility" graph as a container for a start script that runs another graph multiple
times depending on the local variable input to the "utility" graph. Be careful you don't max out
the capacity of the server you are running on.
17) I have a job that will do the following: ftps files from remote server; reformat data in
those files and updates the database; deletes the temporary files.
How do we trap errors generated by Ab Initio when an ftp fails? If I have to re-run / re-start
a graph again, what are the points to be considered? Does *.rec file have anything to do
with it?
ANSWER:
Ab Initio has very good restartability and recovery features built into it. In your situation you can do the tasks you mentioned in one graph with phase breaks:
FTP in phase 1, your transformation in the next phase, and then the DB update in another phase. (This is just an example; it may not be the best way of doing it, as the best design depends on various other factors.)
If the graph fails during FTP then your graph fails in Phase 0, you can restart the graph, if your
graph fails in Phase 1 then AB_JOB.rec file exists and when you restart your graph you would
see a message saying recovery file exists, do you want to start your graph from last successful
check point or restart from beginning. Same thing if it fails in Phase 2.
Phases are expensive from Disk I/O perspective, so have to be careful in doing too much
phasing.
Coming back to error trapping: each component has reject, error, and log ports. The reject port
captures rejected records, the error port captures the corresponding errors, and the log port
captures the execution statistics of the component. You can control the reject status of each
component by setting its reject threshold to "Never abort" or "Abort on first reject", or by
setting "ramp/limit".
Recovery files keep track of crucial information for recovering the graph from a failed state,
such as which node each component is executing on. It is a bad idea to simply remove the *.rec
files; you always want to roll back the recovery files cleanly so that temporary files created
during graph execution do not hang around, occupy disk space and cause issues.
Always use m_rollback -d <recovery file> to roll back a failed job, as in the sketch below.
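For example, a minimal sketch, assuming the job name (AB_JOB) was my_graph:
# Roll back the failed run and remove its recovery and temporary files
m_rollback -d my_graph.rec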
18) What is parallelism in Ab Initio?
ANSWER:
1) Component parallelism:
A graph with multiple processes running simultaneously on separate data uses component
parallelism.
2) Data parallelism:
A graph that deals with data divided into segments and operates on each segment
simultaneously uses data parallelism. Nearly all commercial data processing tasks can use data
parallelism. To support this form of parallelism, Ab Initio software provides Partition
Components to segment data, and Departition Components to merge segmented data back
together.
3) Pipeline parallelism:
A graph with multiple components running simultaneously on the same data uses pipeline
parallelism.
Each component in the pipeline continuously reads from upstream components, processes data,
and writes to downstream components. Since a downstream component can process records
previously written by an upstream component, both components can operate in parallel.
NOTE: To limit the number of components running simultaneously, set phases in the graph.
20) What is a sandbox?
ANSWER:
A sandbox is a directory structure in which each directory level is assigned a variable name; it
is used to manage check-in and checkout of repository-based objects such as graphs.
fin -------> top-level directory ( $AI_PROJECT )
|
|---- dml -------> second-level directory ( $AI_DML )
|
|---- xfr -------> second-level directory ( $AI_XFR )
|
|---- run -------> second-level directory ( $AI_RUN )
You'll require a sandbox when you use the EME (the repository software) to maintain release
control. Within the EME, an identical structure will exist for the same project.
The structure shown above exists in the operating system (e.g. UNIX) for the project called
fin; the project name is usually the name of the top-level directory.
In EME, a similar structure will exist for the project: fin.
When you checkout or check-in a whole project or an object belonging to a project, the
information is exchanged between these two structures.
For instance, if you checkout a dml called fin.dml for the project called fin, you need a
sandbox with the same structure as the EME project called fin. Once you've created that, as
shown above, fin.dml or a copy of it will come out from EME and be placed in the dml directory
of your sandbox.
21) How can I read data which contains variable length records with different record
structures and no delimiters?
ANSWER:
a) Try using the Read Raw component; it should do exactly what you are looking for.
b) Use a DML format in which the length of the string is given by a leading 4-byte integer:
record
  string(integer(4)) my_field_name;
end
22) How do I create subgraphs in Ab Initio?
ANSWER:
First, highlight all of the components you would like to include in the subgraph, click on Edit,
then click on Subgraph, and finally click on Create.
23) Suppose you are changing fin.dml and you said "checkout". Exactly how do you do it? Also,
can you give an example of where you use sandbox parameters, how exactly you create them,
and whether you keep two copies of those sandbox parameters, as we keep for our graphs and
other files?
ANSWER:
Checkin and checkout from the EME:
Checkin: sandbox ---------------> EME
Checkout: sandbox <------------- EME
1. Ab Initio provides a command line interface, the air command, to perform checkin and
checkout (see the sketch below).
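A minimal sketch of the air commands, assuming an EME project path of /Projects/fin and a sandbox path of /home/user/sand/fin (both illustrative; the exact subcommands and options can vary by Co>Operating System version, so check the air help):
# Checkout: export the EME project into a sandbox (paths are assumptions)
air project export /Projects/fin -basedir /home/user/sand/fin

# Checkin: import the changed sandbox objects back into the EME project
air project import /Projects/fin -basedir /home/user/sand/fin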
out :: rollup(in) =
begin
  let datetime("YYYY-MM-DD HH24:MI:SS")("\001") rfmt_dt;
  rfmt_dt = int_to_date(in.reg_date, in.reg_time);
  out.datetime_output :: rfmt_dt;
  out.* :: in.*;
end;
However I got an error during run time.
The Error Message looked like:
While compiling finalize:
While compiling the statement:
rfmt_dt = int_to_date(in.reg_date, in.reg_time);
Error: While compiling transform int_to_date:
Output object "out.output_date_format" unknown.?
28) I have a small problem with a Reformat component. I could not figure out why this
Reformat component runs forever; I believe it is in an endless loop somehow.
The Reformat component has the following input and output DML:
record
begin
string(",") code, code2;
intger(2) count ;
end("\n")
Note : here variable "code" is never null nor blank.
sample data is
string_1,name,location,firstname,lastname,middlename,0
string_2,job,location,firstjob,lastjob,0
string_3,design,color,paint,architect,0
out::reformat(in) =
begin
let string(integer(2)) temp_code2 = in.code2;
let string(integer(2)) temp_code22 = " ";
let integer(2) i=0;
while (string_index(temp_code2, ",") !=0 || temp_code2 "")
begin
temp_code22 = string_concat(in.code,",", string_substring(temp_code2,
1,string_index(temp_code2,",")));
temp_code2 = string_substring(temp_code2, string_index(temp_code2, ","),
string_length(temp_code2));
i=i+1;
end
out.code :: in.code;
out.code2 :: string_lrtrim(temp_code22);
out.count:: i;
end;
my expected output is
string_1,string_1,name,string_1,location,string_1,firstname,string_1,lastname,string_1,middlen
ame,5
string_2,string_2,job,string_2,location,string_2,firstjob,string_2,lastjob,4
string_3,string_3,design,string_3,color,string_3,paint,string_3,architect,4
ANSWER:
As posted, the record format does not validate in my Ab Initio: "intger(2)" should be integer(2), i.e.
record
begin
string(",") code, code2;
integer(2) count ;
end("\n")
Beyond that, the while loop never terminates: temp_code2 is re-assigned starting at the position
of the comma rather than just after it, so the leading comma is never consumed and
string_index keeps returning the same position. Also, temp_code22 is overwritten on each pass
instead of being accumulated, so the expected output would not be produced either (see the
corrected sketch below).
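A minimal corrected sketch of the transform, assuming code2 really does carry the whole comma-separated remainder of the line (the field names come from the question; the helper variables are illustrative):
out :: reformat(in) =
begin
  let string(integer(4)) remaining = in.code2;
  let string(integer(4)) result = "";
  let integer(2) n = 0;
  let integer(4) pos = 0;

  while (string_length(remaining) > 0)
  begin
    pos = string_index(remaining, ",");
    if (pos == 0)
    begin
      /* last element: prefix it with code and stop */
      result = if (n == 0) string_concat(in.code, ",", remaining)
               else string_concat(result, ",", in.code, ",", remaining);
      remaining = "";
    end
    else
    begin
      /* prefix the element with code, then drop it INCLUDING its trailing comma */
      result = if (n == 0) string_concat(in.code, ",", string_substring(remaining, 1, pos - 1))
               else string_concat(result, ",", in.code, ",", string_substring(remaining, 1, pos - 1));
      remaining = string_substring(remaining, pos + 1, string_length(remaining));
    end
    n = n + 1;
  end

  out.code :: in.code;
  out.code2 :: result;
  out.count :: n;
end;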
29) In my graph I am creating a file with account data. For a given account there can be
multiple rows of data. I have to split this file into 4 (specifically) files which are nearly
equal in size. The trick is to keep each account confined to one file; in other words,
account data should not span across these files. How do I do it?
Also, if there are fewer than 4 distinct accounts, I should still be able to create empty
files, because I need at least 4 files.
FYI: The requirement to have 4 files is because I need to start 4 parallel processes for load
balancing the subsequent processes.
ANSWER:
a)
I could not get your requirement very clearly, as you want to split the file into 4 equal parts as
well as keep the same account numbers in the same file. Can you explain what you would do in
the case of 5 account numbers having 20 records each? As far as splitting is concerned, a very
crude solution would be as follows.
In the end script do the following (see the sketch after these steps):
1. Find the size of the file and store it in a variable (say v_size).
2. v_qtr_size=`expr $v_size / 4`
3. split -b $v_qtr_size <filename>
4. Rename the split files as per your requirement. Note that the split files have a specific
pattern in their names.
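A minimal end-script sketch of those steps (the file name accounts.dat and the output prefix are illustrative; note that split -b cuts on byte boundaries, so it can break records and accounts across files, which is why this is only a crude solution):
v_size=`wc -c < $AI_SERIAL/accounts.dat`                                  # step 1: file size in bytes
v_qtr_size=`expr $v_size / 4`                                             # step 2: a quarter of it
split -b $v_qtr_size $AI_SERIAL/accounts.dat $AI_SERIAL/accounts_part_   # step 3
# step 4: rename accounts_part_aa .. accounts_part_ad as required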
b)
Your requirement essentially depends on the skew of your data across accounts. If you want to
keep the same accounts in the same partition, then partition the data by key (account) with
the out port connected to a 4-way parallel layout. But this does not guarantee an equal load in
all partitions unless the data has little skew.
But I can suggest an alternative approach which, though cumbersome, might still give you a
result close to your requirement.
Replicate your original dataset into two flows, take one of them, and roll up on account_no to
find the record count per account_no. Now sort this result on record count so that the
account_no with the minimum count is at the top and the one with the maximum count at the
bottom. Then apply a partition by round-robin and separate out the four partitions (partitions
0, 1, 2 and 3).
Now take the first partition and join it with your main dataset (the one you replicated earlier)
on account_no, and write the matching records (out port) into the first file. Take the unused
records of your main flow from the previous join and join them with the second partition
(partition 1) on account_no, and write the matching records (out port) to the second file.
Similarly, again take the unused records of the previous join and join them with the third
partition (partition 2) on account_no. Write the matching records (out port) to the third file
and the unused records of the main flow to the fourth file.
This way you get four files, nearly equal in size, with no account spread across files.
30) I have a graph parameter state_cd whose value is set based on an If statement. I would
like to use this variable in SQL statements in the AI_SQL directory. I have 20 SQL statements
for 20 table codes, and I will use the corresponding SQL statement based on the table code
passed as a parameter to the graph.
E.g. SQLs in the AI_SQL directory:
1. Select a,b from abc where abc.state_cd in ${STATE_CD}
2. Select x,y from xyz where xyz.state_cd in ${STATE_CD}
${STATE_CD} is a graph parameter whose value is "(IL,CO,MI)".
The problem is that ${STATE_CD} is not getting interpreted when I echo the Select statement,
hence causing a problem.
What is the jobname.rec file, and what should I do with it when a job fails?
Details
In the course of running a job, the Co>Operating System creates a jobname.rec file in the
working directory on the run host.
NOTE: The script takes jobname from the value of the AB_JOB environment variable. If
you have not specified a value for AB_JOB, the GDE supplies the filename of the graph as the
default value for AB_JOB when it generates the script.
The jobname.rec file contains a set of pointers to the internal job-specific files written by the
launcher, some of which the Co>Operating System uses to recover a job after a failure. The
Co>Operating System also creates temporary files and directories in various locations. When a
job fails, it typically leaves the jobname.rec file, the temporary files and directories, and
many of the internal job-specific files on disk. (When a job succeeds, these files are
automatically removed, so you don't have to worry about them.)
If your job fails, determine the cause and fix the problem. Then rerun the job from its
recovery point. If the rerun succeeds, the jobname.rec file and all the temporary files and
directories are cleaned up. Alternatively, run m_rollback -d jobname.rec to clean up the files
left behind by the failed job without rerunning it.
What value should I set for the max-core parameter?
Short answer
The max-core parameter is found in the SORT, JOIN, and ROLLUP components, among others.
There is no single, optimal value for the max-core parameter, because a "good" value depends
on your particular graph and the environment where it runs.
Details
The SORT component works in memory, and the ROLLUP and JOIN components have the option
to do so. These components have a parameter called max-core, which determines the
maximum amount of memory they will consume per partition before they spill to disk. When
the value of max-core is exceeded in any of the in-memory components, all inputs are dropped
to disk. This can have a dramatic impact on performance; but this does not mean that it is
always better to increase the value of max-core.
The higher you set the value of max-core, the more memory the component can use. Using
more memory generally improves performance up to a point. Beyond this point, performance
will not improve and might even decrease. If the value of max-core is set too high, operating
system swapping can occur and the graph might fail if memory on the machine is exhausted.
When setting the value for max-core, you can use the suffixes k, m, and g (upper-case is also
supported) to indicate powers of 1024. For max-core, the suffix k (kilobytes) means precisely
1024 bytes, not 1000. Similarly, the suffix m (megabytes) means precisely 1,048,576 (1024^2), and
g (gigabytes) means precisely 1024^3. Note that the maximum value for max-core is 2g-1.
SORT component
For the SORT component, 100 MB is the default value for max-core. This default is used to
cover a wide variety of situations and might not be ideal for your particular circumstances.
Increasing the value of max-core will not increase performance unless the full dataset can be
held in memory, or the data volume is so large that a reduction in the number of temporary
files improves performance. You can estimate the number of temporary files by multiplying the
data volume being sorted by three and dividing by the value of max-core (because data is
written to disk in blocks that are one third the size of the max-core setting). This number
should be less than 1000. For example, suppose you are sorting 1 GB of data with the default
max-core setting of 100 MB and the process is running in serial. The number of temporary files
that will be created is:
3 * 1000 MB / 100 MB = 30 files
You should decrease the value of a SORT component's max-core if an in-memory ROLLUP or
JOIN component in the same phase would benefit from additional memory. The net
performance gain will be greater.
If you get a "Too many open files" error message, your SORT component's max-core might be
set too low. If this is the case, SORT can also fill AB_WORK_DIR (usually set to /var/abinitio at
installation), which will cause all graphs to fail with a message about semaphores. This
directory is where recovery information and inode information for named pipes are stored and
is typically mounted on a small filesystem.
Divide the available memory by twice the number of partitions to get AI_GRAPH_MAX_CORE
(max-core is measured per partition, and the factor of two gives a contingency safety factor). So:
AI_GRAPH_MAX_CORE = (total memory - memory used elsewhere) / (2 * number of partitions)
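For example (the machine sizes here are purely illustrative): on a host with 16 GB of memory
where other processes use about 4 GB, a graph running 4-way parallel would get
AI_GRAPH_MAX_CORE = (16 GB - 4 GB) / (2 * 4) = 1.5 GB per partition, which is still below the
2g-1 ceiling on max-core.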