Teradata Informatica Best Practices
Introduction

This document discusses configuration and supplies how-to examples for using Informatica PowerCenter 7.1.2 with the NCR Teradata RDBMS. It covers Teradata basics and describes some tweaks that experience has shown may be necessary to deal with common practices you may encounter at a Teradata account. The Teradata documentation (especially the MultiLoad, FastLoad, and TPump references) is highly recommended reading, as is the External Loader section of the Server Manager Guide for PowerCenter.

Additional Information: All Teradata documentation can be downloaded from the NCR Web site (http://www.info.ncr.com/Teradata/eTeradata-BrowseBy.cfm). In addition, the Teradata Forum provides a wealth of useful information (http://www.Teradataforum.com).

Teradata Basics

Teradata is a relational database management system from NCR. Its highly parallel architecture makes it very fast and very scalable, and it offers high performance for very large database tables. It is a major player in the retail space. Although Teradata can run on other platforms, it predominantly appears on NCR hardware (which runs NCR's version of UNIX).

Teradata Hardware

The NCR computers on which Teradata runs support both massively parallel processing (MPP) and symmetric multiprocessing (SMP). Each MPP node (or semi-autonomous processing unit) can support SMP. Teradata can be configured to communicate directly with a mainframe's input/output (I/O) channel, a configuration known as channel attached. Alternatively, it can be network attached, that is, configured to communicate via transmission control protocol/Internet protocol (TCP/IP) over a local area network (LAN). Because PowerCenter runs on UNIX, you will be dealing with a network-attached configuration most of the time. However, once in a while, clients will want to use their existing channel-attached configuration in the hope of better performance.
Do not assume that channel attached is always faster than network attached. Similar performance has been observed over a channel attachment and over a 100-MB LAN. In addition, channel attachment requires an additional sequential data move: data must be moved from the PowerCenter server to the mainframe before it is moved across the mainframe channel to Teradata.

Teradata Software

In the Teradata world, there are Teradata Director Program IDs (TDPIDs), databases, and users. The TDPID is simply the name used for connections from a Teradata client to a Teradata server (similar to an Oracle tnsnames.ora entry). Teradata treats databases and users as nearly synonymous. A user has a userid, a password, and space to store tables. A database is basically a user without a login and password (or, put the other way, a user is a database with a userid and password).

Teradata AMPs are access module processors. Think of them as Teradata's parallel database engines. Although they are strictly software (virtual processors, in NCR terminology), Teradata employees often use AMP and hardware node interchangeably because an AMP was previously a piece of hardware.

Client Configuration Basics for Teradata

The client-side configuration is wholly contained in the hosts file (/etc/hosts on UNIX or winnt\system32\drivers\etc\hosts on Windows). Informatica does not run on NCR UNIX, so you should not have to deal with the server side.

Teradata uses a naming convention in the hosts file. The name of the Teradata instance (that is, the TDPID) is indicated by the letters and numbers that precede the string cop1 in a hosts file entry. For example:

    127.0.0.1        localhost   demo1099cop1
    192.168.80.113   curly       pcop1
This tells Teradata that when a client tool references the instance demo1099, it should direct requests to localhost (IP address 127.0.0.1); when a client tool references the instance p, that instance is located on the server curly (IP address 192.168.80.113). There is no tie here to any database-server-specific information. In particular, the TDPID is not the equivalent of an Oracle instance ID. The TDPID is used strictly to define the name a client uses to connect to a server. You can call a server whatever you want; Teradata does not care. It simply takes the name you specify, looks in the hosts file to map <name>cop1 (or cop2, and so on) to an IP address, and then attempts to establish a connection with Teradata at that IP address.

Informatica Confidential. Do not duplicate.

Sometimes you'll see multiple entries in a hosts file with similar TDPIDs:

    127.0.0.1        localhost   demo1099cop1
    192.168.80.113   curly_1     pcop1
    192.168.80.114   curly_2     pcop2
    192.168.80.115   curly_3     pcop3
    192.168.80.116   curly_4     pcop4
This setup allows load balancing of clients among multiple Teradata nodes. Most Teradata systems have many nodes, and each node has its own IP address. Without the multiple hosts file entries, every client connects to the same node, and eventually that node does more than its fair share of client processing. With multiple hosts file entries, if the node specified with the cop1 suffix (that is, curly_1) takes too long to respond to a client request to connect to p, the client automatically attempts to connect to the node with the cop2 suffix (that is, curly_2).

Informatica/Teradata Touch Points

Informatica PowerCenter 7.1.2 accesses Teradata through various Teradata tools. Each is described here according to how it is configured within PowerCenter.

ODBC: Teradata provides 32-bit ODBC drivers for Windows and UNIX platforms. If possible, use the ODBC driver from Teradata's TTU7 release (or later) of its client software because this version supports array reads. Tests have shown that these newer drivers (3.02) can be 20 to 30 percent faster than the old drivers (3.01). The latest release, TTU8.0, uses ODBC v3.0421. Teradata's ODBC is on a performance par with Teradata's SQL CLI; in fact, ODBC is Teradata's recommended SQL interface for its partners.

Do not use ODBC to write to Teradata unless you're writing very small data sets (and even then, you should probably use TPump, described later, instead), because Teradata's ODBC is optimized for query access, not for writing data. ODBC is good for sourcing and lookups. PowerCenter Designer uses Teradata's ODBC to import source and target tables.

If you are having performance problems, you can use a Command task with a shell script to call BTEQ. The script can run the SQL and populate an intermediate work table, which PowerCenter can then source.
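The BTEQ approach mentioned above can be sketched as follows. This is a minimal illustration under stated assumptions, not PowerCenter functionality: the helper name build_bteq_script, the work table infatest.wrk_orders, and the source table infatest.orders are hypothetical, and the TDPID and credentials reuse the demo values that appear elsewhere in this document. The function only assembles the script text; a Command task's shell script would typically write it to a file and feed that file to bteq on standard input.

```python
def build_bteq_script(tdpid, user, password, statements, error_exit_code=8):
    """Assemble BTEQ script text that runs each SQL statement in order.

    After every statement the script tests ERRORCODE and quits with a
    nonzero return code, so the calling shell script can detect a failure.
    """
    lines = [f".LOGON {tdpid}/{user},{password};"]
    for statement in statements:
        sql = statement.strip().rstrip(";")
        lines.append(f"{sql};")
        lines.append(f".IF ERRORCODE <> 0 THEN .QUIT {error_exit_code};")
    lines.append(".LOGOFF;")
    lines.append(".QUIT 0;")
    return "\n".join(lines) + "\n"


# Hypothetical example: refill a work table that a PowerCenter session then sources.
script = build_bteq_script(
    "demo1099", "infatest", "infatest",
    [
        "DELETE FROM infatest.wrk_orders",
        "INSERT INTO infatest.wrk_orders SELECT * FROM infatest.orders",
    ],
)
print(script)
```

Checking ERRORCODE after each statement is a design choice that keeps a failed SQL step from being masked by a later successful one.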
ODBC Windows:
ODBC UNIX

When the PowerCenter server runs on UNIX, ODBC is required to read from Teradata (both sourcing and lookups). As with all UNIX ODBC drivers, the key to configuring the driver is adding the appropriate entries to the .odbc.ini file. There must be an entry under [ODBC Data Sources] that points to the Teradata ODBC driver shared library (tdata.sl on HP-UX; the standard shared library extension on other flavors of UNIX). The following example shows the required entries from an actual .odbc.ini file (note that the path to the driver may differ on each computer):

    [ODBC Data Sources]
    dBase=MERANT 3.60 dBase Driver
    Oracle8=MERANT 3.60 Oracle 8 Driver
    Text=MERANT 3.60 Text Driver
    Sybase11=MERANT 3.60 Sybase 11 Driver
    Informix=MERANT 3.60 Informix Driver
    DB2=MERANT 3.60 DB2 Driver
    MS_SQLServer7=MERANT SQLServer driver
    TeraTest=tdata.sl

    [TeraTest]
    Driver=/usr/odbc/drivers/tdata.sl
    Description=Teradata Test System
    DBCName=148.162.247.34

Similar to the client hosts file setup, you can specify multiple IP addresses for DBCName to balance the client load across multiple Teradata nodes. Consult the Teradata administrator for exact details, or copy the entries from the PC client's hosts file (see the section Client Configuration Basics for Teradata earlier in this document).

Important note: Make sure that the Merant ODBC path precedes the Teradata ODBC path in the PATH and SHLIB_PATH (or LD_LIBRARY_PATH, and so on) environment variables. This is necessary because both sets of ODBC software use some of the same filenames. PowerCenter should use the Merant files because this software has been certified.

Teradata External Loaders

PowerCenter 7.1.2 supports four Teradata external loaders: TPump, FastLoad, MultiLoad, and Teradata Warehouse Builder (TWB). The Teradata loader executables (tpump, mload, fastload, tbuild) must be accessible to the PowerCenter server, generally through the PATH environment variable. All of the Teradata loader connections require a value for the TDPID attribute; refer to the section Client Configuration Basics for Teradata earlier in this document to understand how to enter the value correctly.

All of these loaders require:

- A load file, which can be configured to be a stream/pipe and is autogenerated by PowerCenter
- A control file of commands that tell the loader what to do (also autogenerated by PowerCenter)
All of these loaders also produce a log file, which is the primary means of debugging the loader if something goes wrong. Because these are external loaders, the only thing PowerCenter receives back from the loader is an indication of whether it ran successfully.
By default, the input file, control file, and log file will be created in $PMTargetFileDir of the PowerCenter server executing the workflow.
You can use any of these loaders by configuring the target in the PowerCenter session to be a File Writer and then choosing the appropriate loader.
The autogenerated control file can be overridden. Click the Pencil icon next to the loader connection name.
Scroll to the bottom of the connection attribute list and click the value next to the Control File Content Override attribute. Then click the Down arrow.
Click the Generate button and change the control file as you wish. The repository stores the changed control file.
Most of the loaders also use some combination of internal work, error, and log tables. By default, these will be in the same database as the target table. All of these can now be overridden in the attributes of the connection.
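As an illustration of the default naming for MultiLoad, the sketch below derives the internal table names for one target. The prefixes (mldlog_, ET_, UV_, WT_) are taken from the PowerCenter-generated MultiLoad control file shown later in this document; the function name and the role labels used as dictionary keys are assumptions made for this example, and other loaders or overridden connection attributes may produce different names.

```python
def default_mload_tables(database, target_table):
    """Return the default MultiLoad internal table names for one target table.

    The role labels are an interpretation of the ET_/UV_/WT_ prefixes and are
    only used here as dictionary keys.
    """
    return {
        "log": f"{database}.mldlog_{target_table}",
        "error": f"{database}.ET_{target_table}",
        "uniqueness_violation": f"{database}.UV_{target_table}",
        "work": f"{database}.WT_{target_table}",
    }


names = default_mload_tables("infatest", "TD_TEST")
for role, name in names.items():
    print(f"{role:22s}{name}")
```

A listing like this can help when checking which tables a failed load may have left behind in the target database.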
To land the input flat file on disk, check the Is Staged attribute. If the Is Staged attribute is not set, the file is piped/streamed to the loader. If you select the nonstaged mode for a loader, also set the checkpoint property to 0, which effectively turns off checkpoint processing. Checkpoint processing is used for recovery/restart of FastLoad and MultiLoad sessions. However, if you are using a named pipe instead of a physical file as input, the recovery/restart mechanism of the loaders does not work. Besides hurting performance (checkpoint processing is not free, and we want to eliminate unnecessary overhead when possible), a nonzero checkpoint value sometimes causes seemingly random errors and session failures when used with named pipe input (as is the case with streaming mode).

Teradata Loader Requirements for PowerCenter Servers on UNIX

All Teradata load utilities require a non-null standard output and standard error to run properly. Standard output (stdout) and standard error (stderr) are UNIX conventions that determine the default location to which a program writes output and error information. When you start pmserver without explicitly defining stdout and stderr, both point to the current terminal session. If you log out of UNIX, then UNIX redirects stdout and stderr to /dev/null (that is, a placeholder that discards anything written to it). At this point, Teradata loader sessions fail because they do not permit stdout and stderr to be /dev/null. Therefore, you must start pmserver as follows (cd to the PowerCenter installation directory first):

    ./pmserver ./pmserver.cfg > ./pmserver.out 2>&1

This starts pmserver using the pmserver.cfg configuration file and points stdout and stderr to the file pmserver.out. In this way, stderr and stdout remain defined even after the terminal session logs out.

Important note: There are no spaces in the token 2>&1. It tells UNIX to point stderr to the same place stdout is pointing.

As an alternative to this method, you can specify the console output filename in the pmserver.cfg file. Information written to standard output and standard error then goes to the file specified as follows:

    ConsoleOutputFilename=<FILE_NAME>

With this entry in the pmserver.cfg file, you can start pmserver normally (that is, ./pmserver).

Partitioned Loading

With PowerCenter v7.x, if you set a round-robin partition point on the target definition and set each target instance to be loaded using the same loader connection instance, PowerCenter automatically writes all data to the first partition and starts only one instance of FastLoad or MultiLoad. You will know you are getting this behavior if you see the following entry in the session log:

    MAPPING> DBG_21684 Target [TD_INVENTORY] does not support multiple partitions. All data will be routed to the first partition.

If you do not see this message, chances are the session will fail with the following error:

    WRITER_1_*_1> WRT_8240 Error: The external loader [Teradata Mload Loader] does not support partitioned sessions.
    WRITER_1_*_1> Thu Jun 16 11:58:21 2005
    WRITER_1_*_1> WRT_8068 Writer initialization failed. Writer terminating.
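The two outcomes described above can be told apart by scanning a session log for the quoted message codes. The following is a minimal sketch, assuming the log is available as a list of text lines; the function name and the returned labels are hypothetical, and only the codes DBG_21684 and WRT_8240 come from this document.

```python
def classify_partitioned_load(log_lines):
    """Classify a session log by the partition-related message codes it contains."""
    text = "\n".join(log_lines)
    if "WRT_8240" in text:
        # The external loader refused the partitioned session outright.
        return "loader_rejected_partitions"
    if "DBG_21684" in text:
        # PowerCenter fell back to a single partition and a single loader instance.
        return "routed_to_first_partition"
    return "no_partition_message"


sample_log = [
    "MAPPING> DBG_21684 Target [TD_INVENTORY] does not support multiple "
    "partitions. All data will be routed to the first partition.",
]
print(classify_partitioned_load(sample_log))
```

The failure code is checked first so that a log containing both messages is reported as a failure.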
TPump

TPump is an external loader that supports inserts, updates, upserts, deletes, and data-driven updates. Multiple TPump instances can execute simultaneously against the same table because TPump doesn't use many resources or require table-level locks. It is often used to trickle-load a table. As stated earlier, TPump is a faster way to update a table than ODBC, but it is not as fast as the other loaders.
MultiLoad

This sophisticated bulk load utility is the primary method PowerCenter uses to load or update mass quantities of data in Teradata. Unlike bulk load utilities from other vendors, MultiLoad supports inserts, updates, upserts, deletes, and data-driven operations in PowerCenter. You can also use variables and embed conditional logic in MultiLoad scripts. It is very fast (millions of rows in a few minutes), but it can be resource intensive and takes a table lock.
Cleaning up after a failed MultiLoad

MultiLoad supports sophisticated error recovery. That is, it allows load jobs to be restarted without having to redo all of the prior work. However, for the types of problems normally encountered during a proof of concept (POC), such as loading null values into a column that does not support nulls or incorrectly formatted date columns, the error recovery mechanisms tend to get in the way. To learn about MultiLoad's sophisticated error recovery, read the Teradata MultiLoad manual. To learn how to work around the recovery mechanisms and restart a failed MultiLoad script from scratch, read this section.
MultiLoad puts the target table into the MultiLoad state. Upon successful completion, the target table is returned to the normal (non-MultiLoad) state. Therefore, when a MultiLoad fails for any reason, the table is left in the MultiLoad state, and you cannot simply rerun the same MultiLoad; MultiLoad will report an error. In addition, MultiLoad queries the target table's MultiLoad log table to see whether it contains any errors. If a MultiLoad log table exists for the target table, you also will not be able to rerun your MultiLoad job.

To recover from a failed MultiLoad, release the target table from the MultiLoad state and drop the MultiLoad log table. You can do this by using BTEQ or QueryMan to issue the following commands:

    drop table mldlog_<table name>;
    release mload <table name>;

Note: The drop table command assumes that you're recovering from a MultiLoad script generated by PowerCenter (PowerCenter always names the MultiLoad log table mldlog_<table name>). If you're working with a hand-coded MultiLoad script, the name of the MultiLoad log table could be anything.

Here is the actual text from a BTEQ session that cleans up a failed load to the table td_test owned by the user infatest:

    BTEQ -- Enter your DBC/SQL request or BTEQ command:
    drop table infatest.mldlog_td_test;

    drop table infatest.mldlog_td_test;

    *** Table has been dropped.
    *** Total elapsed time was 1 second.

    BTEQ -- Enter your DBC/SQL request or BTEQ command:
    release mload infatest.td_test;

    release mload infatest.td_test;

    *** Mload has been released.
    *** Total elapsed time was 1 second.

Using One Instance of MultiLoad to Load Multiple Tables

MultiLoad is a big consumer of resources on a Teradata system. Some systems have hard limits on the number of concurrent MultiLoad sessions allowed. By default, PowerCenter starts an instance of MultiLoad for every target file. Sometimes this is illegal (if the multiple instances target the same table). Other times, it is just expensive.
Therefore, a prospect may ask that PowerCenter use a single instance of MultiLoad to load multiple tables (or to load both inserts and updates into the same target table). To make this happen, we must heavily edit the generated MultiLoad script file. Note: This should not be an issue with TPump because TPump is not as resource intensive as MultiLoad (and multiple concurrent instances of TPump can target the same table).

Here's the workaround:

1) Use a dummy session (that is, set test rows to 1 and target a test database) to generate MultiLoad control files for each of the targets.
2) Merge the multiple control files (one per target table) into a single control file (one for all target tables).
3) Configure the session to call MultiLoad from a post-session script using the control file created in step 2. Integrated support cannot be used because each input file is processed sequentially, and this causes problems when combined with the integrated named pipes and streaming of PowerCenter.
Details on merging the control files:

1) There is a single log file for each instance of MultiLoad. Therefore, you do not have to change or add anything in the LOGFILE statement. However, you might want to change the name of the log table because it may be a log that spans multiple tables.
2) Copy the work and error table delete statements into the common control file.
3) Modify the BEGIN MLOAD statement to specify all the tables that the MultiLoad will be hitting.
4) Copy the Layout sections into the common control file and give each a unique name. Organize the file such that all the layout sections are grouped together.
5) Copy the DML sections into the common control file and give each a unique name. Organize the file such that all the DML sections are grouped together.
6) Copy the Import statements into the common control file and modify them to reflect the unique names created for the referenced layout and DML sections in steps 4 and 5. Organize the file such that all the import sections are grouped together.
7) Run chmod -w on the newly minted control file so PowerCenter doesn't overwrite it, or, better yet, give it a different name so PowerCenter cannot overwrite it.
8) Remember, a single instance of MultiLoad can target at most five tables. Therefore, don't combine more than five target files into a common file.
Here's an example of a control file merged from two default control files:

    .DATEFORM ANSIDATE;
    .LOGON demo1099/infatest,infatest;
    .LOGTABLE infatest.mldlog_TD_TEST;
    DROP TABLE infatest.UV_TD_TEST ;
    DROP TABLE infatest.WT_TD_TEST ;
    DROP TABLE infatest.ET_TD_TEST ;
    DROP TABLE infatest.UV_TD_CUSTOMERS ;
    DROP TABLE infatest.WT_TD_CUSTOMERS ;
    DROP TABLE infatest.ET_TD_CUSTOMERS ;
    .ROUTE MESSAGES WITH ECHO TO FILE c:\LOGS\TgtFiles\td_test.out.ldrlog ;
    .BEGIN IMPORT MLOAD TABLES infatest.TD_TEST, infatest.TD_CUSTOMERS
        ERRLIMIT 1 CHECKPOINT 10000 TENACITY 10000 SESSIONS 1 SLEEP 6 ;

    /* Begin Layout Section */
    .Layout InputFileLayout1;
    .Field CUST_KEY        1 CHAR( 12) NULLIF CUST_KEY = '*' ;
    .Field CUST_NAME      13 CHAR( 20) NULLIF CUST_NAME = '*' ;
    .Field CUST_DATE      33 CHAR( 10) NULLIF CUST_DATE = '*' ;
    .Field CUST_DATEmm    33 CHAR( 2) ;
    .Field CUST_DATEdd    36 CHAR( 2) ;
    .Field CUST_DATEyyyy  39 CHAR( 4) ;

    .Layout InputFileLayout2;
    .Field CUSTOMER_KEY    1 CHAR( 12) ;
    .Field CUSTOMER_ID    13 CHAR( 12) ;
    .Field COMPANY        25 CHAR( 50) NULLIF COMPANY = '*' ;
    .Field FIRST_NAME     75 CHAR( 30) NULLIF FIRST_NAME = '*' ;
    .Field LAST_NAME     105 CHAR( 30) NULLIF LAST_NAME = '*' ;
    .Field ADDRESS1      135 CHAR( 72) NULLIF ADDRESS1 = '*' ;
    .Field ADDRESS2      207 CHAR( 72) NULLIF ADDRESS2 = '*' ;
    .Field CITY          279 CHAR( 30) NULLIF CITY = '*' ;
    .Field STATE         309 CHAR( 2) NULLIF STATE = '*' ;
    .Field POSTAL_CODE   311 CHAR( 10) NULLIF POSTAL_CODE = '*' ;
    .Field PHONE         321 CHAR( 30) NULLIF PHONE = '*' ;
    .Field EMAIL         351 CHAR( 30) NULLIF EMAIL = '*' ;
    .Field REC_STATUS    381 CHAR( 1) NULLIF REC_STATUS = '*' ;
    .Filler EOL_PAD      382 CHAR( 2) ;
    /* End Layout Section */

    /* begin DML Section */
    .DML Label tagDML1;
    INSERT INTO infatest.TD_TEST (
        CUST_KEY ,
        CUST_NAME ,
        CUST_DATE
    ) VALUES (
        :CUST_KEY ,
        :CUST_NAME ,
        :CUST_DATEtd
    ) ;

    .DML Label tagDML2;
    INSERT INTO infatest.TD_CUSTOMERS (
        CUSTOMER_KEY ,
        CUSTOMER_ID ,
        COMPANY ,
        FIRST_NAME ,
        LAST_NAME ,
        ADDRESS1 ,
        ADDRESS2 ,
        CITY ,
        STATE ,
        POSTAL_CODE ,
        PHONE ,
        EMAIL ,
        REC_STATUS
    ) VALUES (
        :CUSTOMER_KEY ,
        :CUSTOMER_ID ,
        :COMPANY ,
        :FIRST_NAME ,
        :LAST_NAME ,
        :ADDRESS1 ,
        :ADDRESS2 ,
        :CITY ,
        :STATE ,
        :POSTAL_CODE ,
        :PHONE ,
        :EMAIL ,
        :REC_STATUS
    ) ;
    /* end DML Section */

    /* Begin Import Section */
    .Import Infile c:\LOGS\TgtFiles\td_test.out
        Layout InputFileLayout1 Format Unformat Apply tagDML1 ;
    .Import Infile c:\LOGS\TgtFiles\td_customers.out
        Layout InputFileLayout2 Format Unformat Apply tagDML2 ;
    /* End Import Section */

    .END MLOAD;
    .LOGOFF;
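Because the merged control file is edited by hand, a quick consistency check before running it can catch the most common slips from the merge steps: more than five target tables, layout or DML names that are not unique, and Import statements that reference a name that was never defined. The sketch below is one possible check, not a MultiLoad parser. The function names are hypothetical, and the regular expressions assume the statement shapes used in the example above (for instance, a table list that contains no other comma-separated options).

```python
import re
from collections import Counter

MAX_TABLES = 5  # per-instance MultiLoad limit stated in this document


def duplicates(names):
    """Return names that occur more than once, compared case-insensitively."""
    counts = Counter(name.upper() for name in names)
    return sorted(name for name, count in counts.items() if count > 1)


def check_merged_control_file(text):
    """Return a list of problems found in a merged MultiLoad control file."""
    problems = []

    begin = re.search(r"\.BEGIN\s+IMPORT\s+MLOAD\s+TABLES\s+([^;]*);", text, re.I)
    if begin is None:
        problems.append("no .BEGIN IMPORT MLOAD TABLES statement found")
        tables = []
    else:
        # Each comma-separated chunk starts with a table name; the last chunk
        # also carries options such as ERRLIMIT, which are ignored here.
        chunks = begin.group(1).split(",")
        tables = [chunk.split()[0] for chunk in chunks if chunk.split()]
    if len(tables) > MAX_TABLES:
        problems.append(f"{len(tables)} target tables exceed the limit of {MAX_TABLES}")

    layouts = re.findall(r"\.Layout\s+(\w+)\s*;", text, re.I)
    labels = re.findall(r"\.DML\s+Label\s+(\w+)\s*;", text, re.I)
    for name in duplicates(layouts):
        problems.append(f"layout name {name} is not unique")
    for name in duplicates(labels):
        problems.append(f"DML label {name} is not unique")

    imports = re.findall(
        r"\.Import\s+Infile\s+\S+\s+Layout\s+(\w+)\s+Format\s+\w+\s+Apply\s+(\w+)\s*;",
        text, re.I)
    known_layouts = {name.upper() for name in layouts}
    known_labels = {name.upper() for name in labels}
    for layout, label in imports:
        if layout.upper() not in known_layouts:
            problems.append(f"import references undefined layout {layout}")
        if label.upper() not in known_labels:
            problems.append(f"import references undefined DML label {label}")
    return problems


# Condensed version of the merged control file above (field lists omitted).
sample = """
.BEGIN IMPORT MLOAD TABLES infatest.TD_TEST, infatest.TD_CUSTOMERS
    ERRLIMIT 1 CHECKPOINT 10000 TENACITY 10000 SESSIONS 1 SLEEP 6 ;
.Layout InputFileLayout1;
.Layout InputFileLayout2;
.DML Label tagDML1;
.DML Label tagDML2;
.Import Infile td_test.out Layout InputFileLayout1 Format Unformat Apply tagDML1 ;
.Import Infile td_customers.out Layout InputFileLayout2 Format Unformat Apply tagDML2 ;
.END MLOAD;
"""
print(check_merged_control_file(sample))
```

An empty list means none of the checked problems were found; it does not guarantee that MultiLoad will accept the file.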
Multiple Workflows That MultiLoad to the Same Table

Because MultiLoad puts a lock on the table, all MultiLoad sessions that target the same table must handle wait events so that they don't try to access the table simultaneously. For the same reason, give any log files unique names.

FastLoad

As the name suggests, this utility is the fastest method to load data into Teradata. However, there is one major restriction: the target table must be empty.
Teradata Warehouse Builder (TWB)

Teradata Warehouse Builder (TWB) is a single utility that was intended to replace FastLoad, MultiLoad, TPump, and FastExport. It was to support a single scripting environment with different modes, where each mode roughly equates to one of the legacy utilities. It was also to support parallel loading (that is, multiple instances of a TWB client could run and load the same table at the same time, something the legacy loaders cannot do).

Although PowerCenter supports TWB, NCR/Teradata does not. TWB has never been formally released. According to NCR, its general release was delayed primarily because of issues with the mainframe version. If you find a prospect willing to use TWB, please do use it; its ability to support parallel load clients makes some tasks easier.
PowerCenter 7.1.2 PAM for Teradata

    Server         Platform       Version    Client Software & Version
    NCR Teradata   UNIX and NT    V2R4
    NCR Teradata   UNIX and NT    V2R5       TTU 7
    NCR Teradata   UNIX and NT    V2R5.1     TTU 7
    NCR Teradata   UNIX and NT    V2R6       TTU 8

Comment

PowerCenter uses the following components of the Teradata Tools and Utilities (TTU): Teradata ODBC driver, FastLoad, MultiLoad, and TPump. The version numbers of each of the Teradata Client components vary with release (e.g., TTU7 contains ODBC 3.02, FastLoad 7.05, MultiLoad 3.03, TPump 1.07). The TTU was previously called Teradata Utilities Foundation (TUF). Compatibility between a particular version of the Teradata RDBMS and the Teradata Client software is determined by Teradata, not Informatica.

The Teradata Client/Teradata RDBMS pairings listed here represent our understanding based on Teradata's documentation. Note that the minimum version number for the Teradata ODBC driver is 3.00.01.04. Teradata has made many fixes to the 3.02 ODBC driver. If you are using this driver, please contact NCR support for the latest maintenance release.
Copyright 2005 Informatica Corporation. Informatica and PowerCenter are registered trademarks of Informatica Corporation. Teradata is a registered trademark of NCR Corporation. All other company, product, or service names may be trademarks or registered trademarks of their respective owners.
J50665 (9/29/2005)