Abinitio Intew
I'm having trouble finding information about the AB_JOB variable. Where and how can I set this variable?
1. Adapted from a response by Bob on Thursday, May 06, 2004
You can change the value of the AB_JOB variable in the start script of a given graph. This lets you run several instances of the same graph at the same time (that is, in parallel). Make sure you append a unique identifier, such as a timestamp or a sequence number, to the end of each AB_JOB value you assign. You will also need to vary the file names of any outputs to keep the runs from stepping on each other's outputs. I have used this technique to create a "utility" graph that serves as a container for a start script, which runs another graph multiple times depending on the local variable input to the "utility" graph. Be careful not to max out the capacity of the server you are running on.
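The naming scheme described above can be sketched as follows. This is a minimal illustration in Python rather than in the graph's own start script, and it only builds the environment for each parallel run. The helper names `unique_job_name` and `build_run_env`, the `OUTPUT_FILE` variable, the `load_customers` base name, and the `/tmp` directory are all hypothetical; how the graph itself is launched is deliberately left out.

```python
import itertools
import os
import time

# Sequential number shared by all runs started from this process.
_sequence = itertools.count(1)


def unique_job_name(base: str) -> str:
    """Append a timestamp and a sequence number so no two runs share an AB_JOB value."""
    stamp = time.strftime("%Y%m%d%H%M%S")
    return f"{base}_{stamp}_{next(_sequence):03d}"


def build_run_env(base: str, output_dir: str) -> dict:
    """Return an environment for one run: a unique AB_JOB plus a matching output file."""
    env = dict(os.environ)
    job = unique_job_name(base)
    env["AB_JOB"] = job
    # OUTPUT_FILE is a hypothetical variable the graph would read for its output path.
    env["OUTPUT_FILE"] = os.path.join(output_dir, f"{job}.dat")
    return env


# Prepare three runs of the same (hypothetical) graph.
runs = [build_run_env("load_customers", "/tmp") for _ in range(3)]
job_names = [env["AB_JOB"] for env in runs]
output_files = [env["OUTPUT_FILE"] for env in runs]
```

The sequence number keeps the names distinct even when several runs are prepared within the same second, which a timestamp alone would not guarantee.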
1. Adapted from a response by santoshachhra on 12/12/2005
1. Informatica and Ab Initio both support parallelism, but Informatica supports only one type while Ab Initio supports three. In Informatica the developer has to define partitions in the Server Manager to achieve parallelism, whereas Ab Initio takes care of parallelism itself. The three types of parallelism in Ab Initio are: 1. component parallelism, 2. data parallelism, 3. pipeline parallelism.
2. Ab Initio has no scheduler like Informatica's; you need to schedule through a script or run manually.
3. Ab Initio supports different types of text files, meaning you can read the same file with different structures, which is not possible in Informatica. Ab Initio is also more user friendly than Informatica, so there are a lot of differences between the two.
4. Both tools are fundamentally different. Which one to use depends on the work at hand and on the existing infrastructure and resources available.
5. Informatica is an engine-based ETL tool. The power of this tool is in its transformation engine, and the code that it generates after development cannot be seen or modified.
6. Ab Initio is a code-based ETL tool. It generates ksh or bat code, which can be modified to achieve any goals that cannot be met through the ETL tool itself.
7. Initial ramp-up time with Ab Initio is quick compared to Informatica. When it comes to standardization and tuning, both probably fall into the same bucket.
8. Ab Initio doesn't need a dedicated administrator; a UNIX or NT admin will suffice, whereas other ETL tools do involve administrative work.
9. Recent releases of Informatica have built-in Change Data Capture (CDC) capabilities (extracting only the changed data through the DB logs), whereas Ab Initio has to rely on the database to provide CDC capabilities; as of now it has no way to sniff the DB logs.
10. Another interesting difference: with Ab Initio you can read data with multiple delimiters in a given record, whereas Informatica forces all the fields to be delimited by one standard delimiter.
11. At the component level, each tool has its own way of implementing these transformation components.
12. When making a choice, many factors drive the decision, not just budget: existing infrastructure, resources, project timeline, metadata management, complexity of transforms, data volumes, integration with third-party tools, tool support, and so on.
* Error Handling - In Ab Initio you can attach error and reject files to each transformation and capture and analyze the messages and data separately. Informatica has one huge log, which is very inefficient when working on a large process with numerous points of failure.
* Robust transformation language - Informatica is very basic as far as transformations go. While I will not go into a function-by-function comparison, Ab Initio seemed much more robust.
* User-defined functions - I never developed any in Ab Initio, but I know that you can. Informatica requires that you code custom transformations in C++ (or VB if you are on a Windows platform). Ab Initio also allows for custom components, but I have never had to develop one.
* Instant feedback - On execution, Ab Initio tells you how many records have been processed, rejected, and so on, along with detailed performance metrics for each component. Informatica has a debug mode, but it is slow and difficult to adapt to.
* Consolidated Interface - Ab Initio
as a unit.
An ad hoc multifile is a multifile created 'on the fly' out of a set of serial files, without needing to define a multifile system to contain it. This enables you to represent the needed set of serial files with a single Input File component in the graph. Moreover, the set of files used by the component can be determined at runtime. This lets the user customize which set of files the graph uses as input without having to change the graph itself, even after it goes into production. Ad hoc multifiles can be used as output, intermediate, and lookup files as well as input files.
The simplest way to define an ad hoc multifile is to list the files explicitly, as follows:
1. Insert an Input File component in your graph.
2. Open the properties dialog. Select the Description tab.
3. Select Partitions under Data Location on the Description tab.
4. Click Edit to open the Define Multifile Partitions dialog box.
5. Click New and enter the first file name. Click New again and enter the second file name, and so on.
6. Click OK.
If you have added n files, the input file now acts something like a file in an n-way multifile system whose data partitions are the n files you listed. Components can run in the layout of the Input File component. However, there is no way to run commands such as m_ls or m_dump on the files, because they do not comprise a real multifile system.
Besides listing the input files explicitly, there are other ways to define an ad hoc multifile:
1. Listing files using wildcards - If the input file names have a common pattern, you can use a wildcard for all the files, e.g. $AI_SERIAL/ad_hoc_input_*.dat. All the files found at runtime that match the wildcard pattern are taken for the ad hoc multifile.
2. Listing files in a variable - You can create a runtime parameter for the graph and list all the files inside it, separated by spaces.
3. Listing files using a command - E.g. $(ls $AI_SERIAL/ad_hoc_input_*.dat), which produces the list of files to be used for the ad hoc multifile. This method gives maximum flexibility in choosing the input files, since you can also use more complex commands that involve the owner of the file or its date/time stamp.
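The wildcard and command methods above both come down to resolving a pattern into a concrete file list when the graph starts. The following Python sketch shows that resolution step only; it is not Ab Initio code. A temporary directory stands in for $AI_SERIAL, and the helper name `resolve_ad_hoc_partitions` and the sample file names are hypothetical.

```python
import glob
import os
import tempfile


def resolve_ad_hoc_partitions(pattern: str) -> list:
    """Expand a wildcard into a sorted list of files, one per ad hoc partition."""
    # glob.glob returns matches in arbitrary order, so sort for a stable layout.
    return sorted(glob.glob(pattern))


# Stand-in for $AI_SERIAL: a temporary directory holding three serial files
# that match the pattern and one file that does not.
ai_serial = tempfile.mkdtemp()
for name in ("ad_hoc_input_a.dat", "ad_hoc_input_b.dat",
             "ad_hoc_input_c.dat", "other.dat"):
    with open(os.path.join(ai_serial, name), "w") as handle:
        handle.write("sample record\n")

partitions = resolve_ad_hoc_partitions(os.path.join(ai_serial, "ad_hoc_input_*.dat"))
partition_names = [os.path.basename(path) for path in partitions]
print(len(partitions), "partitions:", partition_names)
```

Because the list is built at runtime, dropping a fourth matching file into the directory would yield a 4-way layout on the next run without any change to the graph.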
Summary: I don't understand the Assign Keys component. When would the contents of the *first* port and the *new* port be different? When would the output from these two ports be different? Full Article: Disclaimer: Contents are not reviewed for correctness and are not endorsed or recommended by ITtoolbox or any vendor. Popular Q&A contents include summarized information from ITtoolbox
Abinitio-L discussion unless otherwise noted.
1. Adapted from a response by mr on Saturday, October 22, 2005
Assign Keys can be used for generating surrogate keys. A surrogate key is generated based on the natural key. Here is how it works. Suppose you run a daily graph, so the same graph executes each day with different data. Surrogate keys are assigned to new records starting from the maximum value that has already been assigned. This maximum surrogate key value is taken from the records coming in on the key port.
Usage of the first port: if multiple records with the same natural key are sent to Assign Keys and all of them are new, only one record is sent to the first port and the remaining records are sent to the new port.
2. Adapted from a response by Remediator on Saturday, October 22, 2005
The real trick in using Assign Keys is to maintain a master key file for each natural/surrogate pair. For example, if you have five downstream target tables, you'll need five master/surrogate files. Each file is used as input to assign new surrogate keys or to mate existing natural keys with their original surrogates.
Keep in mind that the highest risk in manufacturing surrogate keys is losing track of the relationships and requiring a complete rebuild. The surrogates tend to be leveraged by other tables (ideally), so losing the cross-reference between natural and surrogate keys (due to a system or database crash, for example) can be devastating.
Assign Keys coupled with a master key file strategy can be a powerful asset in developing high-performance data models. Look at it like this: an integer key (used as an index) is in many cases over 100 times faster for both lookup and multidimensional queries than its natural (usually character-based) equivalent. Put it to the test: build a table of 100,000 rows with an integer and a string version of the same columnar data (call them intKey and strKey, for example). Use them as foreign keys in another table containing 10,000 rows. Now perform some joins, summaries, and order-by operations on the tables using one or the other key. The difference in performance is dramatic even for tables this small.
A trap exists in partitioning. Consider the following output DML for a Reformat:
record
  string(10) strkey;
  integer(4) intkey;
  decimal(10) deckey;
end
Now consider the following transform for that same Reformat:
out::reformat(in) =
begin
  let integer(4) nseq = next_in_sequence();
  out.intkey :: nseq;
  out.deckey :: nseq;
  out.strkey :: decimal_lpad((decimal(10))nseq, 10);
end;
Put a Generate Records before the Reformat, set its record count to 1000, and set its layout to a serial file. This will effectively output the following sequence:
Record 1: [record strkey "0000000001" intkey 1 deckey " 1"]
Record 2: [record strkey "0000000002" intkey 2 deckey " 2"]
Record 3: [record strkey "0000000003" intkey 3 deckey " 3"]
Now run this Reformat's output into a Replicate, then through three separate Partition by Key components that each feed a Trash component.
Make each of the Trash components use a 4-way multifile layout so the partitioners will hash the flows. Keep the Generate Records layout set to a serial file. Make sure each Partition by Key is propagating its layout from Generate Records, not from Trash. Set one Partition by Key to partition on strkey, one on intkey, and one on deckey. Run it.
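Returning to the key assignment described in the first two responses, the idea can be sketched in Python. This is a simplified model under stated assumptions, not the Assign Keys component itself: `master` stands in for the master key file read on the key port, `assign_keys` is a hypothetical helper, and the sample customer keys are invented. Existing natural keys keep their surrogate, new natural keys receive surrogates starting after the current maximum, and each new natural key is recorded once so the master file can be extended. The sketch does not try to model which output port each record leaves on.

```python
def assign_keys(records, master):
    """Attach a surrogate key to every record.

    records: list of natural keys in arrival order.
    master:  dict mapping natural key -> surrogate key (the master key file).
    Returns (keyed_records, new_entries), where new_entries holds one
    (natural, surrogate) pair for each natural key not previously in master.
    """
    # New surrogates start just after the highest value already assigned.
    next_key = max(master.values(), default=0) + 1
    assigned = dict(master)  # work on a copy; leave the caller's master untouched
    keyed_records = []
    new_entries = []
    for natural in records:
        if natural not in assigned:
            assigned[natural] = next_key
            new_entries.append((natural, next_key))
            next_key += 1
        keyed_records.append((natural, assigned[natural]))
    return keyed_records, new_entries


# Yesterday's master key file and today's incoming natural keys (invented data).
master = {"CUST-A": 1, "CUST-B": 2}
todays_records = ["CUST-B", "CUST-C", "CUST-C", "CUST-D"]
keyed, new_entries = assign_keys(todays_records, master)
```

Appending `new_entries` to the master key file after each run is what preserves the natural-to-surrogate cross-reference that the second response warns against losing.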
1. Adapted from a response by Srinivas on Sunday, June 05, 2005
Broadcast and Replicate are similar components, but generally Replicate is used to increase component parallelism, emitting multiple straight flows to separate pipelines. Broadcast is used to increase data parallelism by feeding records to fan-out or all-to-all flows.
2. Adapted from a response by rajaravisankar_adari on Monday, June 06, 2005
Replicate is an older component than Broadcast. You can use Broadcast as a join component, whereas you can't use Replicate as a join. By default, Replicate uses a straight flow and Broadcast uses a fan-out or all-to-all flow. Broadcast is used for data parallelism, whereas Replicate is used for component parallelism.
3. Adapted from a response by azaman on Monday, June 06, 2005
Replicate supports component parallelism:
Input File -------> Replicate --------> Format ----> Output File
                        |
                        |
                        |
                        +------------> Rollup ----> Output File
Input File2 is a serial file and it is being joined with a multifile, Input File1, without being partitioned. The Broadcast component writes data to all partitions of Input File1, creating an implicit fan-out flow.
4. Adapted from a response by Remediator on Monday, June 06, 2005
The short answer is that Replicate copies a flow while Broadcast multiplies it. Broadcast is a partitioner, whereas Replicate is a simple flow-copy mechanism. Replicate appears in over 90% of all AI graphs (across the board of all implementations worldwide), whereas Broadcast appears in less than 1% of all graphs. You won't see any difference between the two until you start using data parallelism; then it will go south rather quickly.
Here's an experiment: use a simple serial input file, followed by a Broadcast, then a 4-way multifile Output File component. If you run the graph with, say, 100 records from the input file, it will create 400 records in the output file: 100 records for each flow partition encountered. If you had used a Replicate, it would have read and written 100 records.
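The record counts in that experiment can be reproduced with a small Python model. This is an illustration of the arithmetic only, not of how either component is implemented; the function names `broadcast` and `replicate` are hypothetical stand-ins. The two helpers look alike in code, and the difference lies in what drives the number of copies: for Broadcast it is the partition count of the downstream layout, for Replicate it is the number of output flows you connect.

```python
def broadcast(records, partitions):
    """Send a full copy of every record to each partition of the downstream layout."""
    return [list(records) for _ in range(partitions)]


def replicate(records, out_flows):
    """Copy the input flow unchanged onto each connected output flow."""
    return [list(records) for _ in range(out_flows)]


records = list(range(100))  # 100 records read from the serial input file

# Broadcast into a 4-way multifile: every partition receives all 100 records.
broadcast_written = sum(len(part) for part in broadcast(records, partitions=4))

# Replicate with a single connected output flow: the flow is copied once.
replicate_written = sum(len(flow) for flow in replicate(records, out_flows=1))

print("broadcast wrote", broadcast_written, "records")   # 400
print("replicate wrote", replicate_written, "records")   # 100
```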
1. Adapted from a response by xiningding on 2/7/2006
The CDC function in Oracle 9.2 is synchronous, and the generation of change records is tied to the original transaction. When there are many changes in the source database, there will be performance problems. In Oracle 10g, by contrast, the CDC function is asynchronous and gets the change data from the redo log. I know a little about Oracle Streams; it can be used as a message queue as well. It can also transform data between different databases asynchronously by using the redo log.
2. Adapted from a response by Nat G on 2/7/2006 We have done a similar exercise to identify the best CDC component available.
We have evaluated Informatica PowerExchange, Oracle Streams, and Data Guard (replication to avoid the load on the production box, so that CDC can work on the standby database). One way or another, all of these CDC tools work closely with Oracle LogMiner when you have to capture changes from Oracle. So it is important that you take the DBA with you for the exercise: we had a situation where the DBAs objected to running LogMiner and supplemental logging on the production database once they came to know of the CDC architecture.
Oracle Streams: I believe it supports only synchronous data capture. You have to build the ETL programs manually using Oracle Streams. The only advantage is that it's free of cost.
Informatica PowerExchange/DataStage CDC (IBM): Not much of a difference between them. But physically evaluate the tool (install the laboratory-licensed version and check the CPU/memory utilization). I believe both of these tools call for a good investment, and you also need to buy their core ETL product for good integration. (Though they promise that their CDC component can work as a stand-alone solution, integration will be a big headache.)
We have finalized on Informatica PowerExchange, as our client was already using Informatica PowerCenter for most of their ETL requirements.
Another product worth considering could be OWB 10g Paris. I believe Oracle started shipping this release in Jan 06. This release of OWB seems to have a CDC option as well. It might be a cost-effective option, but the caution is that it is pretty new. You might be the first one to use the CDC option on the OWB Paris release.