Ab Initio Introduction
Ab Initio means "from the beginning". The Ab Initio software works on a client-server model.
Ab Initio code is called a graph and has the .mp extension. A graph built in the GDE is deployed as a corresponding .ksh script, and the Co-Operating System runs that .ksh script to do the required job.
How an Ab Initio Job Is Run
What happens when you push the "Run" button?
• Your graph is translated into a script that can be executed in the Shell Development Environment.
• This script and any metadata files stored on the GDE client machine are shipped (via FTP) to the server.
• The script is invoked (via REXEC or TELNET) on the server.
• The script creates and runs a job that may run across many hosts.
• Monitoring information is sent back to the GDE client.
Ab Initio Environment
The advantage of Ab Initio code is that it can run in both the serial and the multifile-system environment.
Serial environment: the normal UNIX file system.
Multifile system: a multifile system (mfs) is meant for parallelism. In an mfs, a particular file is physically stored across different partitions of the machine, or even different machines, but is pointed to by a single logical file stored in the Co-Operating System. This logical file is the control file, which holds the pointers to the physical locations.
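To make the control-file idea concrete, here is a minimal Python sketch. The file names and the one-path-per-line layout are illustrative assumptions for the example, not the real mfs on-disk format:

```python
# Hypothetical sketch: a control file lists the physical partition
# files that together make up one logical multifile.
from pathlib import Path

def read_multifile(control_file: str):
    """Yield records from every partition named in the control file."""
    for line in Path(control_file).read_text().splitlines():
        partition = line.strip()          # e.g. /data/part0/customers.dat
        if not partition:
            continue
        with open(partition) as part:
            for record in part:
                yield record.rstrip("\n")

# The logical file is just the control file; the data lives in the partitions.
# for rec in read_multifile("customers.mfctl"):
#     print(rec)
```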
About Ab Initio Graphs: An Ab Initio graph comprises a number of components that serve different purposes. Data is read or written by a component according to its DML (Ab Initio's record-format language; do not confuse it with the database term "data manipulation language"). The most commonly used components are described in the following sections.
Co-Operating System
The Co-Operating System is Ab Initio's runtime environment: the deployed .ksh scripts run on top of it, on one host or across many.
GDE
The GDE (Graphical Development Environment) is a graphical application for developers, used for designing and running Ab Initio graphs. It also provides:
- The ETL process in Ab Initio is represented by Ab Initio graphs. Graphs are formed from components (from the standard component library or custom-built), flows (data streams), and parameters.
- A user-friendly front end for designing Ab Initio ETL graphs.
- The ability to run and debug Ab Initio jobs and trace execution logs.
- Graph compilation: the GDE compilation process results in the generation of a UNIX shell script, which may be executed on a machine without the GDE installed.
AbInitio EME
The EME (Enterprise Meta Environment) is Ab Initio's repository, used to store and version graphs and the metadata associated with them.
Conduct It
Conduct It is Ab Initio's environment for building plans: larger processes that combine Ab Initio graphs with other programs and scripts.
Data Profiler
The Data Profiler is an analytical application that can report on data range, scope, distribution, variance, and quality. It runs in a graphical environment on top of the Co-Operating System.
Component Library
The Ab Initio Component Library is a set of reusable software modules for sorting, data transformation, and high-speed database loading and unloading. It is a flexible and extensible tool that adapts at runtime to the formats of the records it is given, and it allows new components to be created and incorporated from any program, permitting the integration and reuse of external legacy code and storage engines.
Questions Set 1
Informatica vs. Ab Initio
- About the tool: Ab Initio is code-based ETL; Informatica is engine-based ETL.
- Parallelism: Ab Initio supports three types of parallelism (component, data, and pipeline); Informatica supports only one.
- Scheduler: Ab Initio has no built-in scheduler (jobs are scheduled through scripts); Informatica has a scheduler available.
- Error handling: Ab Initio can attach error and reject files per component; Informatica uses one file for all.
- Robustness: Ab Initio is robust by function comparison; Informatica is basic in terms of robustness.
- Feedback: Ab Initio provides performance metrics for each component executed; Informatica has a debug mode, but it is slow.
- Delimiters while reading: Ab Initio supports multiple delimiters; Informatica supports only a dedicated delimiter.
Q. What exactly do you understand by the term data processing, and why can businesses trust this approach?
Processing is a procedure that converts data from a useless form into a useful one without a lot of effort. The details vary depending on factors such as the size of the data and its format. A sequence of operations is generally carried out to perform this task and, depending on the type of data, this sequence can be automatic or manual. Because most of the devices that perform this task today are PCs, the automatic approach is more popular than ever before. Users are free to obtain data in forms such as tables, vectors, images, graphs, and charts, which is a benefit business owners can readily enjoy.
Q. How is data processed, and what are the fundamentals of this approach?
Certain activities require the collection of data, and processing largely depends on that collection in many cases. Data needs to be stored and analyzed before it is actually processed. The task depends on some major factors; they are:
1. Collection of data
2. Presentation
3. Final outcomes
4. Analysis
5. Sorting
These are also regarded as the basic fundamentals that can be trusted to keep up the pace in this matter.
Q. Do you think effective communication is necessary in data processing? What is your strength in terms of it?
The biggest ability one can have in this domain is the ability to rely on the data or the information. Communication matters a lot in accomplishing several important tasks, such as the representation of information. There are many departments in an organization, and communication makes sure things are good and reliable for everyone.
Q. Suppose we assign you a new project. What would be your starting point, and what key steps would you follow?
The first thing that matters is defining the objective of the task and then engaging the team in it. This provides a solid direction for accomplishing the task, and it is especially important when working on a data set that is completely new. The next thing that needs attention is effective data modeling, which includes finding missing values and validating the data. The last step is to track the results.
Q. Suppose you find the term "validation" mentioned with a set of data. What does that represent?
It represents that the concerned data is clean and correct and can therefore be used reliably without worrying about anything. Data validation is widely regarded as one of the key steps in the processing system.
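As a simple illustration, a validation pass might check each record for missing or out-of-range values before it enters processing. The field names and ranges below are made up for the example:

```python
# Minimal validation sketch: collect the problems found in one record.
def validate(record: dict) -> list[str]:
    errors = []
    if not record.get("customer_id"):
        errors.append("missing customer_id")
    age = record.get("age")
    if age is None or not (0 <= age <= 130):
        errors.append("age out of range")
    return errors

rec = {"customer_id": "C001", "age": 200}
print(validate(rec))   # ['age out of range']
```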
Q. What do you mean by data sorting?
Data does not always arrive in a well-defined sequence; in fact, it is usually a random collection of objects. Sorting is nothing but arranging the data items into the desired sets or sequence.
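For example, in Python a random collection of items is arranged into a sequence by sorting on a chosen key (the field names here are illustrative):

```python
records = [{"id": 3, "name": "Carol"},
           {"id": 1, "name": "Alice"},
           {"id": 2, "name": "Bob"}]

# Arrange the random collection into a well-defined sequence by key.
ordered = sorted(records, key=lambda r: r["id"])
print([r["id"] for r in ordered])   # [1, 2, 3]
```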
Q. Name the technique you can use to combine multiple data sets?
It is known as aggregation.
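A small sketch of the idea, combining two illustrative data sets and summarizing them by key:

```python
from collections import defaultdict

# Two data sets of (region, amount) pairs to be combined.
sales_q1 = [("east", 100), ("west", 250)]
sales_q2 = [("east", 75), ("west", 30), ("north", 40)]

# Aggregation: merge the sets and summarize the amounts per region.
totals = defaultdict(int)
for region, amount in sales_q1 + sales_q2:
    totals[region] += amount

print(dict(totals))   # {'east': 175, 'west': 280, 'north': 40}
```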
Q. Name any two stages of the data processing cycle and compare them.
The first is collection and the second is preparation of data. Collection is the first stage and preparation the second in a cycle dealing with data processing. The first stage provides the baseline for the second, and the success and simplicity of the second depend on how accurately the first has been accomplished. Preparation is mainly the manipulation of important data: collection breaks data into sets, while preparation joins them together.
Q. What do you mean by overflow errors?
While processing data, bulky calculations are common, and their results do not always fit in the memory allocated for them. For example, if a value needing more than 8 bits is stored in an 8-bit field, an overflow error results.
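Python integers do not overflow on their own, so the sketch below simulates an 8-bit field to show the effect:

```python
# Simulate storing values in an 8-bit field (valid range 0..255).
def store_8bit(value: int) -> int:
    if value > 0xFF:
        print(f"overflow: {value} does not fit in 8 bits")
    return value & 0xFF     # the extra bits are lost (wraparound)

print(store_8bit(200))   # 200 - fits
print(store_8bit(300))   # 44  - overflow: 300 % 256 == 44
```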
Q. Name one method that is generally used by remote workstations when it comes to processing.
Distributed processing.
Q. What do you mean by a transaction file, and how is it different from a sort file?
A transaction file is generally used to hold input data while a transaction is under process, and the master files can be updated from it. Sorting, on the other hand, is done to assign a fixed location to the data in a file.
Q. What is the use of Aggregate when we have Rollup? As we know, the Rollup component in Ab Initio is used to summarize groups of data records; where, then, would we use Aggregate?
Aggregate and Rollup can both summarize data, but Rollup is much more convenient to use, and a particular summarization is much easier to follow in Rollup than in Aggregate. Rollup also provides additional functionality, such as input and output filtering of records. Although the two components perform the same action, Rollup can hold intermediate results in main memory, whereas Aggregate does not support intermediate results.
Q. What is the difference between a lookup file and a lookup? Give a relevant example.
Generally, a lookup file represents one or more serial files (flat files) whose data is small enough to be held in memory. This allows transform functions to retrieve records much more quickly than they could from disk.
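Conceptually, a lookup file behaves like an in-memory dictionary keyed by the lookup key. The sketch below imitates that; the "id,name" file layout and the file name are assumptions for the example:

```python
# Load a small flat file of "id,name" rows into memory once...
def load_lookup(path: str) -> dict:
    table = {}
    with open(path) as f:
        for line in f:
            key, value = line.rstrip("\n").split(",", 1)
            table[key] = value
    return table

# ...then every transform can resolve keys at memory speed,
# instead of re-reading the file from disk for each record.
# customers = load_lookup("customers.dat")
# print(customers.get("C042", "unknown"))
```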
Q. How many components were in your most complicated graph?
It depends on the type of components you use. Usually, avoid using very complicated transform functions in a graph.
Q. Can sorting and storing be done with a single piece of software, or do you need different software for these approaches?
It actually depends on the type and nature of the data. Although it is possible to accomplish both tasks with the same software, many packages have their own specialization, and it is good to adopt such an approach to get quality outcomes. There are also pre-defined sets of modules and operations that matter here: if the conditions they impose are met, users can perform multiple tasks with the same software. The output file can be provided in various formats.
Q. What are the different forms of output that can be obtained after processing data?
These are:
1. Tables
2. Plain Text files
3. Image files
4. Maps
5. Charts
6. Vectors
7. Raw files
Sometimes data is required in more than one format, so the software accomplishing this task must have the features to keep up the pace in this matter.
Q. Give one reason when you would need to consider multiple rounds of data processing?
When the files obtained are not yet the complete outcomes that are required and need further processing.
Q. What types of data processing are you familiar with?
The first is the manual data approach, in which the data is processed without depending on a machine, so it contains several errors. At present this technique is rarely followed, or only a limited amount of data is processed with this approach. The second type is mechanical data processing, in which mechanical devices play important roles; it is adopted when the data is a combination of different formats. The next approach is electronic data processing, which is regarded as the fastest and is widely adopted in the current scenario; it has top accuracy and reliability.
Q. Name the different types of processing, based on the steps involved, that you know about?
They are:
1. Real-Time processing
2. Multiprocessing
3. Time Sharing
4. Batch processing
5. Adequate Processing
Q. Which functions does the Rollup component call to summarize a group of records?
1. Initialize
2. Rollup
3. Finalize
You also need to declare a temporary variable if you want to get the count for a particular group. For each group, Rollup first calls the initialize function once, followed by a rollup function call for each of the records in the group, and finally calls the finalize function once at the end of the last rollup call.
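A rough Python sketch of that calling sequence, using a temporary counter variable per group as described above (the record layout is illustrative, and this only mimics Rollup's behavior):

```python
from itertools import groupby

def rollup_counts(records, key):
    """Mimic Rollup's initialize / rollup / finalize sequence per group."""
    out = []
    for group_key, group in groupby(sorted(records, key=key), key=key):
        count = 0                        # initialize: temporary variable, once per group
        for _record in group:
            count += 1                   # rollup: called once per record in the group
        out.append((group_key, count))   # finalize: emit one summary record per group
    return out

recs = [{"dept": "A"}, {"dept": "B"}, {"dept": "A"}]
print(rollup_counts(recs, key=lambda r: r["dept"]))   # [('A', 2), ('B', 1)]
```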
Q. What is the difference between partitioning with a key and round-robin partitioning?
Partition by Key (hash partitioning) is a technique used to partition data when the keys are diverse. If one key is present in large volume, there can be large data skew, but this method is used more often for parallel data processing.
Round-robin partitioning is another technique, which distributes the data uniformly across each of the destination data partitions. The skew is zero when the number of records is divisible by the number of partitions. A real-life example is how a pack of 52 cards is distributed among 4 players in round-robin fashion.
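The two techniques can be sketched in a few lines of Python (a toy model of the idea, not the Co-Operating System's actual partitioner):

```python
def partition_by_key(records, key, n):
    """Hash partitioning: the same key always lands in the same partition."""
    parts = [[] for _ in range(n)]
    for rec in records:
        parts[hash(key(rec)) % n].append(rec)
    return parts

def partition_round_robin(records, n):
    """Round robin: deal records out like cards, one partition at a time."""
    parts = [[] for _ in range(n)]
    for i, rec in enumerate(records):
        parts[i % n].append(rec)
    return parts

cards = list(range(52))
hands = partition_round_robin(cards, 4)
print([len(h) for h in hands])   # [13, 13, 13, 13] - zero skew
```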