In 100 ReferenceDataGuide en
This product includes software licensed under the terms at http://www.tcl.tk/software/tcltk/license.html, http://www.bosrup.com/web/overlib/?License, http://
www.stlport.org/doc/ license.html, http://asm.ow2.org/license.html, http://www.cryptix.org/LICENSE.TXT, http://hsqldb.org/web/hsqlLicense.html, http://
httpunit.sourceforge.net/doc/ license.html, http://jung.sourceforge.net/license.txt , http://www.gzip.org/zlib/zlib_license.html, http://www.openldap.org/software/release/
license.html, http://www.libssh2.org, http://slf4j.org/license.html, http://www.sente.ch/software/OpenSourceLicense.html, http://fusesource.com/downloads/licenseagreements/fuse-message-broker-v-5-3- license-agreement; http://antlr.org/license.html; http://aopalliance.sourceforge.net/; http://www.bouncycastle.org/licence.html;
http://www.jgraph.com/jgraphdownload.html; http://www.jcraft.com/jsch/LICENSE.txt; http://jotm.objectweb.org/bsd_license.html; . http://www.w3.org/Consortium/Legal/
2002/copyright-software-20021231; http://www.slf4j.org/license.html; http://nanoxml.sourceforge.net/orig/copyright.html; http://www.json.org/license.html; http://
forge.ow2.org/projects/javaservice/, http://www.postgresql.org/about/licence.html, http://www.sqlite.org/copyright.html, http://www.tcl.tk/software/tcltk/license.html, http://
www.jaxen.org/faq.html, http://www.jdom.org/docs/faq.html, http://www.slf4j.org/license.html; http://www.iodbc.org/dataspace/iodbc/wiki/iODBC/License; http://
www.keplerproject.org/md5/license.html; http://www.toedter.com/en/jcalendar/license.html; http://www.edankert.com/bounce/index.html; http://www.net-snmp.org/about/
license.html; http://www.openmdx.org/#FAQ; http://www.php.net/license/3_01.txt; http://srp.stanford.edu/license.txt; http://www.schneier.com/blowfish.html; http://
www.jmock.org/license.html; http://xsom.java.net; http://benalman.com/about/license/; https://github.com/CreateJS/EaselJS/blob/master/src/easeljs/display/Bitmap.js;
http://www.h2database.com/html/license.html#summary; http://jsoncpp.sourceforge.net/LICENSE; http://jdbc.postgresql.org/license.html; http://
protobuf.googlecode.com/svn/trunk/src/google/protobuf/descriptor.proto; https://github.com/rantav/hector/blob/master/LICENSE; http://web.mit.edu/Kerberos/krb5current/doc/mitK5license.html; http://jibx.sourceforge.net/jibx-license.html; https://github.com/lyokato/libgeohash/blob/master/LICENSE; https://github.com/hjiang/jsonxx/
blob/master/LICENSE; https://code.google.com/p/lz4/; https://github.com/jedisct1/libsodium/blob/master/LICENSE; http://one-jar.sourceforge.net/index.php?
page=documents&file=license; https://github.com/EsotericSoftware/kryo/blob/master/license.txt; http://www.scala-lang.org/license.html; https://github.com/tinkerpop/
blueprints/blob/master/LICENSE.txt; http://gee.cs.oswego.edu/dl/classes/EDU/oswego/cs/dl/util/concurrent/intro.html; https://aws.amazon.com/asl/; https://github.com/
twbs/bootstrap/blob/master/LICENSE; https://sourceforge.net/p/xmlunit/code/HEAD/tree/trunk/LICENSE.txt; https://github.com/documentcloud/underscore-contrib/blob/
master/LICENSE, and https://github.com/apache/hbase/blob/master/LICENSE.txt.
This product includes software licensed under the Academic Free License (http://www.opensource.org/licenses/afl-3.0.php), the Common Development and Distribution
License (http://www.opensource.org/licenses/cddl1.php), the Common Public License (http://www.opensource.org/licenses/cpl1.0.php), the Sun Binary Code License
Agreement Supplemental License Terms, the BSD License (http://www.opensource.org/licenses/bsd-license.php), the new BSD License (http://opensource.org/licenses/BSD-3-Clause),
the MIT License (http://www.opensource.org/licenses/mit-license.php), the Artistic License (http://www.opensource.org/licenses/artisticlicense-1.0) and the Initial Developers Public License Version 1.0 (http://www.firebirdsql.org/en/initial-developer-s-public-license-version-1-0/).
This product includes software copyright 2003-2006 Joe Walnes, 2006-2007 XStream Committers. All rights reserved. Permissions and limitations regarding this
software are subject to terms available at http://xstream.codehaus.org/license.html. This product includes software developed by the Indiana University Extreme! Lab.
For further information please visit http://www.extreme.indiana.edu/.
This product includes software Copyright (c) 2013 Frank Balluffi and Markus Moeller. All rights reserved. Permissions and limitations regarding this software are subject
to terms of the MIT license.
See patents at https://www.informatica.com/legal/patents.html.
DISCLAIMER: Informatica LLC provides this documentation "as is" without warranty of any kind, either express or implied, including, but not limited to, the implied
warranties of noninfringement, merchantability, or use for a particular purpose. Informatica LLC does not warrant that this software or documentation is error free. The
information provided in this software or documentation may include technical inaccuracies or typographical errors. The information in this software and documentation is
subject to change at any time without notice.
NOTICES
This Informatica product (the "Software") includes certain drivers (the "DataDirect Drivers") from DataDirect Technologies, an operating company of Progress Software
Corporation ("DataDirect") which are subject to the following terms and conditions:
1. THE DATADIRECT DRIVERS ARE PROVIDED "AS IS" WITHOUT WARRANTY OF ANY KIND, EITHER EXPRESSED OR IMPLIED, INCLUDING BUT NOT
LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NON-INFRINGEMENT.
2. IN NO EVENT WILL DATADIRECT OR ITS THIRD PARTY SUPPLIERS BE LIABLE TO THE END-USER CUSTOMER FOR ANY DIRECT, INDIRECT,
INCIDENTAL, SPECIAL, CONSEQUENTIAL OR OTHER DAMAGES ARISING OUT OF THE USE OF THE ODBC DRIVERS, WHETHER OR NOT
INFORMED OF THE POSSIBILITIES OF DAMAGES IN ADVANCE. THESE LIMITATIONS APPLY TO ALL CAUSES OF ACTION, INCLUDING, WITHOUT
LIMITATION, BREACH OF CONTRACT, BREACH OF WARRANTY, NEGLIGENCE, STRICT LIABILITY, MISREPRESENTATION AND OTHER TORTS.
Part Number: IN-REF-DG-10000-0001
Table of Contents
Preface
Informatica Resources
Informatica My Support Portal
Informatica Documentation
Informatica Product Availability Matrixes
Informatica Web Site
Informatica How-To Library
Informatica Knowledge Base
Informatica Support YouTube Channel
Informatica Marketplace
Informatica Velocity
Informatica Global Customer Support
Managing Rows
Finding and Replacing Values
Exporting Reference Table Data
Enable and Disable Edits in an Unmanaged Reference Table
Refresh the Reference Table Values
Audit Trail Events
Viewing Audit Trail Events
Rules and Guidelines for Reference Tables
Index
Preface
The Informatica Reference Data Guide includes information about the reference data objects and files that
you can use in Informatica Developer and Informatica Analyst. It is written for data analysts, data stewards,
and others who use reference data to verify and enhance the accuracy and usability of organization data.
Informatica Resources
Informatica My Support Portal
As an Informatica customer, the first step in reaching out to Informatica is through the Informatica My Support
Portal at https://mysupport.informatica.com. The My Support Portal is the largest online data integration
collaboration platform with over 100,000 Informatica customers and partners worldwide.
As a member, you can:
- Search the Knowledge Base, find product documentation, access how-to documents, and watch support
  videos.
- Find your local Informatica User Group Network and collaborate with your peers.
Informatica Documentation
The Informatica Documentation team makes every effort to create accurate, usable documentation. If you
have questions, comments, or ideas about this documentation, contact the Informatica Documentation team
through email at infa_documentation@informatica.com. We will use your feedback to improve our
documentation. Let us know if we can contact you regarding your comments.
The Documentation team updates documentation as needed. To get the latest documentation for your
product, navigate to Product Documentation from https://mysupport.informatica.com.
Informatica Marketplace
The Informatica Marketplace is a forum where developers and partners can share solutions that augment,
extend, or enhance data integration implementations. By leveraging any of the hundreds of solutions
available on the Marketplace, you can improve your productivity and speed up time to implementation on
your projects. You can access Informatica Marketplace at http://www.informaticamarketplace.com.
Informatica Velocity
You can access Informatica Velocity at https://mysupport.informatica.com. Developed from the real-world
experience of hundreds of data management projects, Informatica Velocity represents the collective
knowledge of our consultants who have worked with organizations from around the world to plan, develop,
deploy, and maintain successful data management solutions. If you have questions, comments, or ideas
about Informatica Velocity, contact Informatica Professional Services at ips@informatica.com.
Informatica Global Customer Support
The telephone numbers for Informatica Global Customer Support are available from the Informatica web site
at http://www.informatica.com/us/services-and-training/support-services/global-support-centers/.
CHAPTER 1
Reference Tables
telephone area codes, postcode formats, first names, occupations, and acronyms. You can edit
Informatica reference tables.
Informatica content sets
Repository objects and data files that Informatica develops. You import content sets when you import
accelerator objects to the Model repository. A content set contains different types of reference data that
you can use to perform search operations with data quality transformations.
Address reference data files
Reference data files that contain data for the deliverable addresses in a country. The Address Validator
transformation reads the reference data. You cannot create or edit address reference data files.
Address reference data is current for a defined period and you must refresh your data regularly, for
example every quarter.
Identity population files
Reference data files that contain information on personal, household, and corporate identities. The
Match transformation and the Comparison transformation use population files to find potential identities
in input data. You cannot create or edit identity population files.
The data rows in the column contain the same type of information.
A source value is valid. The reference object might contain a list of the valid values, or the reference
object might contain a list of values that are not valid.
The following table lists common examples of project data columns that can contain reference data:
Information
Use an SKU column to create a reference table of valid SKU codes for an organization. Use
the reference table to find correct or incorrect SKU codes in a data set.
Employee codes
Customer account numbers
Run a profile on a customer account column to identify account number patterns. Use the
profile to create a token set of incorrect data patterns. Use the token set to find account
numbers that do not conform to the correct account number structure.
Customer names
When a customer name column contains first, middle, and last names, you can create a
probabilistic model that defines the expected structure of the strings in the column. Use the
probabilistic model to find data strings that do not belong in the column.
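The SKU example in the table above amounts to a membership check against a list of valid values. The following is a minimal Python sketch of that idea, not a description of how Informatica transformations work internally. The SKU values and the function name `split_by_reference` are hypothetical.

```python
# Hypothetical reference table of valid SKU codes (illustrative values only).
VALID_SKUS = {"SKU-1001", "SKU-1002", "SKU-2001"}


def split_by_reference(values, reference):
    """Return (valid, not_valid) lists, preserving the input order."""
    valid, not_valid = [], []
    for value in values:
        # A value is valid only if the reference set contains it exactly.
        (valid if value in reference else not_valid).append(value)
    return valid, not_valid


if __name__ == "__main__":
    data = ["SKU-1001", "SKU-9999", "SKU-2001"]
    ok, bad = split_by_reference(data, VALID_SKUS)
    print("valid:", ok)
    print("not valid:", bad)
```

The same function works when the reference object lists values that are not valid. In that case the first list holds the values to reject.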
Reference Tables
Create and update reference tables in the Analyst tool and the Developer tool.
Reference tables store metadata in the Model repository. Reference tables can store column data in the
reference data warehouse or in another database. When the reference data warehouse stores the column
data, the Informatica services identify the table as a managed reference table. When another database stores
the column data, the Informatica services identify the table as an unmanaged reference table.
The Content Management Service stores the reference data warehouse database connection. You can
specify an IBM DB2 database, a Microsoft SQL Server database, or an Oracle database as a reference data
warehouse.
When you import data to the reference data warehouse from another database, use a native connection or an
ODBC connection to import the data. When you specify an unmanaged database as the data source for a
reference table, use a native connection to connect to the database.
run the mapping with a product database table. When the mapping runs, the Labeler creates a column that
identifies the product records that do not contain valid SKU numbers.
To edit data in an unmanaged reference table, also verify that you configured the reference table object to
permit edits.
Note: If you edit the metadata for an unmanaged reference table in a database application, use the Analyst
tool to synchronize the Model repository with the table. You must synchronize the Model repository and the
table before you use the unmanaged reference table in the Developer tool.
Labeler transformation
Standardizer transformation
Note: Use the infacmd ms runMapping command to run a mapping at the command prompt.
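As a rough illustration of the note above, the following Python sketch assembles an infacmd ms runMapping command line without running it. The option names (-dn, -sn, -un, -pd, -a, -m) are assumptions based on common infacmd conventions and are not stated in this guide. Confirm them in the Command Reference for your version. All argument values are placeholders.

```python
import shlex


def build_run_mapping_command(domain, service, user, password, application, mapping):
    """Build the argument list for a runMapping call (option names are assumed)."""
    return [
        "infacmd", "ms", "runMapping",
        "-dn", domain,        # assumed option: domain name
        "-sn", service,       # assumed option: Data Integration Service name
        "-un", user,          # assumed option: user name
        "-pd", password,      # assumed option: password
        "-a", application,    # assumed option: application that contains the mapping
        "-m", mapping,        # assumed option: mapping name
    ]


if __name__ == "__main__":
    command = build_run_mapping_command(
        "MyDomain", "MyDIS", "analyst1", "secret", "SkuApp", "m_LabelSku"
    )
    # Print the command for review instead of executing it.
    print(shlex.join(command))
```

To run the command, you could pass the list to `subprocess.run` on a machine where infacmd is on the path.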
When the reference data objects are not under version control, the Model repository locks a reference data
object that you edit. Other users cannot edit a locked object that you work on. When you close the object, the
Model repository releases the lock and other users can edit the object.
Note: Version control applies to the metadata that the Model repository stores for an unmanaged reference
table object. Version control does not apply to the data in an unmanaged reference table. You cannot view or
restore the reference data from an earlier version of an unmanaged reference table.
CHAPTER 2
database connection name, and valid column name. The column properties include the column names,
precision values, and scale values.
You can view the properties in read-only mode. To update the properties, edit or check out the reference
table.
Description
Name
Description
Location
Valid Column
Created On
The creation date and time for the reference table.
Created By
The login name of the user who created the reference table.
Last Modified
The date and time of the most recent update to the reference table.
Last Modified By
The login name of the user who made the most recent update.
Connection Name
The connection name for the database that stores the reference data values.
Type
The reference table type. The reference table can be managed or unmanaged.
Description
Name
Datatype
The data type for the data in each column. You can select one of the following data types:
- bigint
- date/time
- decimal
- double
- integer
- string
You cannot select a double data type when you create an empty reference table or create a
reference table from a flat file.
Property
Description
Precision
The precision for each column. Precision is the maximum number of digits or the maximum number
of characters that the column can accommodate.
The precision values you configure depend on the data type.
Scale
The scale for each column. Scale is the maximum number of digits that a column can accommodate
to the right of the decimal point. Applies to decimal columns.
The scale values you configure depend on the data type.
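To make the precision and scale definitions concrete, the following Python sketch checks whether a value fits a column. It assumes the common relational convention that a decimal column allows at most precision minus scale digits to the left of the decimal point. This guide does not state that rule, so treat it as an assumption. The function names are hypothetical.

```python
from decimal import Decimal, InvalidOperation


def fits_decimal_column(value, precision, scale):
    """Check a decimal literal against a precision and scale (assumed SQL-style rule)."""
    try:
        number = Decimal(value)
    except InvalidOperation:
        return False
    if not number.is_finite():
        return False
    _, digits, exponent = number.as_tuple()
    fraction_digits = max(0, -exponent)
    integer_digits = max(0, len(digits) + exponent)
    if not any(digits):
        # A zero value needs no integer digits.
        integer_digits = 0
    return fraction_digits <= scale and integer_digits <= precision - scale


def fits_string_column(value, precision):
    """For string columns, precision is the maximum number of characters."""
    return len(value) <= precision


if __name__ == "__main__":
    print(fits_decimal_column("123.45", 5, 2))
    print(fits_decimal_column("1234.5", 5, 2))
    print(fits_string_column("ACME-00017", 10))
```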
Description
Nullable
Key
Identifies a key column. The Analyst tool can identify a key column if you import the reference data
from a table that specifies a key column.
2. Select the option to Use the reference table editor, and click Next.
3. Use the Add New Column option to add columns to the table.
4.
5. Optionally, add a column to include low-level descriptions as metadata in the reference table.
6.
7. Click Next.
8. Enter a name for the reference table, and select a location for the reference table object in the Model
repository.
9. Click Finish.
Browse a profile column and add a subset of the column data to a reference table.
Select a column in the profile and add the pattern values for that column to a reference table.
2.
3. Open the profile that contains the column to add to a reference table.
The profile overview lists the profile column names.
4.
5. In the detailed profile view, select the data values to add to the reference table. You can select values
one by one, or you can select all.
6.
The number 1 identifies the Add to Reference Table option in the image.
7.
Note: You can also select an option to add the data to a current reference table.
8. Click Next.
The column name appears by default as the reference table name. Optionally, update the name.
9.
10. Click Next.
11.
12. Click Next.
13.
14.
15. Click Finish.
2.
3. Open the profile that contains the value patterns to add to the reference table.
The profile overview lists the profile column names.
4. Select the column that defines the pattern data that you want to add to the reference table.
5.
6. In the detailed profile view, select the column patterns that you want to add.
7. Right-click the patterns that you selected, and select Add to Reference Table.
The following image shows the data patterns for a column in the detailed profile view:
The number 1 identifies the Add to Reference Table option in the image.
8.
9. Click Next.
The column name appears by default as the reference table name. Optionally, update the name.
10.
11. Click Next.
12.
13. Click Next.
14.
15.
16. Click Finish.
Description
Delimiters
Character used to separate columns of data. Use the Other field to enter a different delimiter.
Delimiters must be printable characters and must be different from the escape character and
the quote character if selected.
You cannot select non-printing multibyte characters as delimiters.
Text Qualifier
Properties
Description
Column Names
Imports column names from the first line. Select this option if column names appear in the first
row.
The wizard uses data in the first row in the preview for column names.
Default is not enabled.
Values
Option to start value import from a line. Indicates the row number in the preview at which the
wizard starts reading when it imports the file.
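The flat file properties above can be pictured with a short Python sketch that uses the standard csv module. It is an illustration only and does not reproduce the import wizard. The function name and parameters are hypothetical, and the rule that values start on line 2 when the first line holds column names is an assumption.

```python
import csv
import io


def read_flat_file(text, delimiter=",", text_qualifier='"',
                   names_in_first_line=False, start_line=1):
    """Return (column_names, value_rows) from delimited text."""
    if start_line < 1:
        raise ValueError("start_line is 1-based")
    if text_qualifier is not None and delimiter == text_qualifier:
        # The delimiter must be different from the quote character.
        raise ValueError("delimiter must differ from the text qualifier")
    if text_qualifier is None:
        reader = csv.reader(io.StringIO(text), delimiter=delimiter,
                            quoting=csv.QUOTE_NONE)
    else:
        reader = csv.reader(io.StringIO(text), delimiter=delimiter,
                            quotechar=text_qualifier)
    rows = list(reader)
    names = rows[0] if names_in_first_line and rows else None
    # Assumption: skip the name row when the first line holds column names.
    first = max(start_line, 2) if names_in_first_line else start_line
    return names, rows[first - 1:]


if __name__ == "__main__":
    sample = 'code;name\n"A;1";Alpha\nB2;Beta\n'
    names, values = read_flat_file(sample, delimiter=";", names_in_first_line=True)
    print(names)
    print(values)
```

In the sample, the text qualifier protects the semicolon inside the first value, so the value is read as one field.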
2.
3. Click Next.
4.
5. Select a code page that matches the data in the flat file.
6.
7. Click Next.
8.
9. To preview the properties that you configured, refresh the Preview pane.
10. Click Next.
11.
12. Optionally, add a column to include low-level descriptions as metadata in the reference table.
13.
14. Click Next.
15. Enter a name for the reference table, and select a location for the reference table object in the Model
repository.
16.
17. Click Finish.
2.
3.
4.
5.
6.
7. Optionally, add a column to include low-level descriptions as metadata in the reference table.
8.
9. Click Next.
10. Enter a name for the reference table, and select a location for the reference table object in the Model
repository.
11.
12. Click Finish.
2. Select the Reference Tables asset category, and select a reference table name.
The reference table opens in read-only mode.
3.
4. When you complete work on the reference table, click Finish. The Analyst tool saves your changes to
the reference table.
If you checked out the reference table from a versioned Model repository, check in the object. A
versioned Model repository does not update the reference table version until you check in the object.
Managing Columns
You can add columns to a reference table and update the column properties. You can also update the
editable status of an unmanaged reference table.
1. Click Open.
The asset library opens.
2. Select the Reference Tables asset category, and select a reference table name.
The reference table opens in read-only mode.
3.
4.
5. Add a column.
Managing Rows
You can add, edit, or delete rows in a reference table.
1. Click Open.
The asset library opens.
2. Select the Reference Tables asset category, and select a reference table name.
The reference table opens in read-only mode.
3.
4. Edit the data rows. You can edit the data rows in the following ways:
- To update a single data value, click the value and update the data.
  After you update the data, use the row-level options to accept or reject the data. You cannot enter an
  audit note when you enter data directly in the data row.
- To update the data values in a row, select Actions > Edit Row.
  In the Edit Row dialog box, enter a value in one or more columns. Optionally, enter an audit note.
  Click Apply to update the data in the columns that you selected.
- To update the values in multiple rows, select the rows to edit and select Actions > Edit Row.
  In the Edit Multiple Rows dialog box, enter a value in one or more columns. Optionally, enter an
  audit note.
  Click OK to update the data in the columns that you selected.
- To delete rows, select the rows to delete and click Actions > Delete.
  In the Delete Rows dialog box, optionally enter an audit note.
  Click OK to delete the rows.
Note: Use the Developer tool to edit row data in a large reference table. For example, if a reference table
contains more than 500 rows, edit the table in the Developer tool.
1. Click Open.
The asset library opens.
2. Select the Reference Tables asset category, and select a reference table name.
The reference table opens in read-only mode.
3.
4.
5.
6. Select the columns to search. By default, the operation searches all columns.
Use the following options to replace values one by one or to replace all values:
Use the Next and Previous options to find values one by one.
1. Click Open.
The asset library opens.
2. Select the Reference Tables asset category, and select a reference table name.
The reference table opens in read-only mode.
3.
Description
File Name
Name of the file to contain the data. The export operation creates the file.
File Format
Format of the file to contain the data. Select one of the following formats:
4.
Column name option. Select the option to indicate that the first row of the
file contains the column names.
Code Page
Code page of the reference data. The default code page is UTF-8.
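The export properties above map onto a few familiar file-writing choices. The following Python sketch writes rows to a comma-separated file with an optional first row of column names and a UTF-8 default code page. It illustrates the options only and does not call the Analyst tool. The function name, the sample data, and the choice of a comma-separated format are assumptions.

```python
import csv


def export_reference_table(path, column_names, rows,
                           include_column_names=True, code_page="utf-8"):
    """Write reference table rows to a comma-separated file at path."""
    # newline="" lets the csv module control the line endings.
    with open(path, "w", newline="", encoding=code_page) as handle:
        writer = csv.writer(handle)
        if include_column_names:
            # The first row of the file contains the column names.
            writer.writerow(column_names)
        writer.writerows(rows)


if __name__ == "__main__":
    export_reference_table(
        "country_codes.csv",
        ["code", "name"],
        [["IE", "Ireland"], ["FR", "France"]],
    )
```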
1. Click Open.
The asset library opens.
2. Select the Reference Tables asset category, and select a reference table name.
The reference table opens in read-only mode.
3.
4.
5.
Description
Date
Start and end dates for the actions to display. Use the calendar options to set the
dates.
Type
Type of audit trail event. You can view the following event types:
- Data. Events that relate to the data values in the reference table. Events include
operations to add a row, to delete a row, and to update a row.
- Metadata. Events that relate to the reference table metadata. Events include operations
to create the reference table, add or delete a column, and check in the reference table.
User
User who edited the reference table. The filter displays the full name and the login
name of the user.
Status
Status of the audit trail log events. The status corresponds to the action that you
performed in the reference table editor. For example, the status might indicate that a
user created the reference table or added a row.
The audit trail log events also include the audit trail comments and the column values that you inserted,
updated, or deleted.
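The filter options above can be read as a conjunction of conditions on each log event. The following Python sketch models that reading with a hypothetical `AuditEvent` record. The field names, status strings, and sample events are invented for illustration and do not reflect the Analyst tool's storage format.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class AuditEvent:
    when: date
    event_type: str      # "Data" or "Metadata"
    user: str
    status: str
    comment: str = ""


def filter_events(events, start=None, end=None, event_type=None,
                  user=None, status=None):
    """Keep the events that satisfy every filter option that is set."""
    selected = []
    for event in events:
        if start is not None and event.when < start:
            continue
        if end is not None and event.when > end:
            continue
        if event_type is not None and event.event_type != event_type:
            continue
        if user is not None and event.user != user:
            continue
        if status is not None and event.status != status:
            continue
        selected.append(event)
    return selected


if __name__ == "__main__":
    log = [
        AuditEvent(date(2024, 1, 5), "Metadata", "asmith", "Created table"),
        AuditEvent(date(2024, 1, 9), "Data", "asmith", "Added row", "new SKU"),
        AuditEvent(date(2024, 2, 1), "Data", "bjones", "Deleted row"),
    ]
    for event in filter_events(log, end=date(2024, 1, 31), event_type="Data"):
        print(event)
```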
1. Click Open.
The asset library opens.
2. Select the Reference Tables asset category, and select a reference table name.
The reference table opens in read-only mode.
3.
4.
5. Click Show.
The log events appear for the filter options that you specified.
When you import a reference table from an Oracle, IBM DB2, or Microsoft SQL Server database, the
Analyst tool cannot display the preview if the table, view, schema, synonym, or column names contain
mixed case or lowercase characters.
To preview data in tables that reside in case-sensitive databases, set the Support Mixed Case Identifiers
attribute on the database connection to true.
When you create a reference table from inferred column patterns in one format, the Analyst tool populates
the reference table with column patterns in a different format.
For example, when you create a reference table for the column pattern X(5), the Analyst tool displays the
following format for the column pattern in the reference table: XXXXX.
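The X(5) example above can be sketched as a small expansion routine. The following Python function expands each character-plus-count group into a repeated character. Only the X(5) to XXXXX case comes from this guide. Applying the same rule to other characters, such as 9, is an assumption, and the function name is hypothetical.

```python
import re

# One pattern character followed by a repeat count in parentheses, for example X(5).
_COUNTED_GROUP = re.compile(r"(.)\((\d+)\)")


def expand_pattern(pattern):
    """Expand counted groups such as X(5) into repeated characters."""
    return _COUNTED_GROUP.sub(lambda m: m.group(1) * int(m.group(2)), pattern)


if __name__ == "__main__":
    print(expand_pattern("X(5)"))
```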
When you import an Oracle database table, verify the length of any VARCHAR2 column in the table. The
Analyst tool cannot import an Oracle database table that contains a VARCHAR2 column with a length
greater than 1000.
To read a reference table, you need execute permissions on the connection to the database that stores
the table data values. For example, if the reference data warehouse stores the data values, you need
execute permissions on the connection to the reference data warehouse. You need execute permissions
to access the reference table in read or write mode. The database connection permissions apply to all
reference data in the database.
CHAPTER 3
Reference Tables
Content Sets
Address Validator. Reads address reference data to verify the accuracy of addresses.
Case Converter. Reads reference data tables to identify strings that must change case.
Classifier. Reads content set data to identify the type of information in a string.
Parser. Reads content set data to parse strings based on the information they contain.
The Data Quality Content Installer file set includes Informatica reference data objects that you can import.
2.
3.
Save any change that you made to the reference table or the content set.
2.
3.
4.
5.
6.
Reference Tables
You add a reference table to a transformation in the Developer tool. You configure the transformation to find
reference table values in input data and to write the corresponding valid values from the reference table as
output.
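The lookup behavior described above can be sketched in a few lines of Python. In this illustration, each reference row holds a valid value followed by alternative values, and an input value that matches any entry in a row is replaced with the valid value. The rows, the exact-match rule, and the function names are assumptions for illustration and do not describe a specific transformation.

```python
# Hypothetical reference rows: the first item is the valid value,
# and the remaining items are alternative forms of that value.
REFERENCE_ROWS = [
    ("Street", "St", "St.", "Str"),
    ("Avenue", "Ave", "Ave.", "Av"),
]


def build_lookup(rows):
    """Map every value in a row to the valid value for that row."""
    lookup = {}
    for valid, *alternatives in rows:
        lookup[valid] = valid
        for alternative in alternatives:
            lookup[alternative] = valid
    return lookup


def standardize(value, lookup):
    """Return the valid value for a match, or the input value unchanged."""
    return lookup.get(value, value)


if __name__ == "__main__":
    lookup = build_lookup(REFERENCE_ROWS)
    for token in ["Ave.", "Street", "Blvd"]:
        print(token, "->", standardize(token, lookup))
```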
To create a reference table in the Developer tool, use one of the following methods:
Description
Name
Description
Description
Valid
Name
Data Type
Precision
Scale
Description
Description of the contents of the column. You can optionally add a description when
you create the reference table.
Indicates that the reference table contains a column for descriptions of column data.
Default value
Default value for the fields in the column. You can optionally add a default value
when you create the reference table.
Connection Name
Name of the connection to the database that contains the reference table data
values.
1. Select File > New > Reference Table from the Developer tool menu.
2.
3.
4.
5. Add two or more columns to the table. Click the New option to create a column.
The following table describes the properties for each column:
Property
Default Value
Name
column
Data Type
string
Precision
10
Scale
Description
6. Select the column that contains the valid values. You can change the order of the columns that you
create.
7.
Default Value
Cleared
Audit note
Empty
Default value
Empty
Click Finish.
The reference table opens in the Developer tool workspace.
1. Select File > New > Reference Table from the Developer tool menu.
2. In the new table wizard, select Reference Table from a Flat File.
3. Browse to the file you want to use as the data source for the table.
4.
5.
6.
7.
8. If the flat file contains column names, select the option to import column names from the first line of the
file.
9.
Default Value
Text qualifier
No quotation marks
Line 1
Row Delimiter
\012 LF (\n)
Cleared
Escape character
Empty
Cleared
500
Click Next.
10.
11.
Default Value
Cleared
Audit note
Empty
Default value
Empty
500
Click Finish.
The reference table opens in the Developer tool workspace.
1. Select File > New > Reference Table from the Developer tool menu.
2. In the table creation wizard, select Reference Table from a Relational Source.
Click Next.
3.
4.
5.
6.
7.
To create a reference table that does not store data in the reference data warehouse, select
Unmanaged table.
To enable users to edit an unmanaged reference table, select the Editable option.
Click Next.
8.
9.
The following table describes optional properties that you can specify:
10.
Property
Default Value
Cleared
Description
Cleared
Default value
Empty
Audit note
Empty
500
Click Finish.
Content Sets
A content set is a Model repository object that stores data or metadata for other reference data objects. A
content set can include character sets, pattern sets, token sets, regular expressions, probabilistic models,
and classifier models. Use a content set to define and organize reference data objects that relate to a single
project, information type, or business purpose.
The Developer tool includes system-defined character sets and token sets that do not appear in the Model
repository. To view and use the system-defined objects, configure a strategy in the Labeler transformation,
Parser transformation, or Standardizer transformation.
Character Sets
A character set contains expressions that identify specific characters and character ranges. You can use
character sets in Labeler transformations that use character labeling mode.
Character ranges specify a sequential range of character codes. For example, the character range "[A-C]"
matches the uppercase characters "A," "B," and "C." This character range does not match the lowercase
characters "a," "b," or "c."
Use character sets to identify a specific character or range of characters as part of labeling operations. For
example, you can label all numerals in a column that contains telephone numbers. After labeling the
numbers, you can identify patterns with a Parser transformation and write problematic patterns to separate
output ports.
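Character labeling with ranges can be pictured as replacing each input character with the label of the first character set whose range contains it. The following Python sketch shows that idea with hypothetical character sets and labels. The placeholder for unmatched characters is an assumption and does not describe the Labeler transformation's actual output.

```python
# Hypothetical character sets: (label, list of (start, end) ranges).
CHARACTER_SETS = [
    ("9", [("0", "9")]),   # numerals
    ("A", [("A", "Z")]),   # uppercase letters
    ("a", [("a", "z")]),   # lowercase letters
]


def in_range(character, start, end):
    """True if the character code falls within the sequential range."""
    return start <= character <= end


def label_characters(text, character_sets, unmatched="?"):
    """Replace each character with the label of the first matching set."""
    labels = []
    for character in text:
        for label, ranges in character_sets:
            if any(in_range(character, start, end) for start, end in ranges):
                labels.append(label)
                break
        else:
            # Assumption: mark characters that no set matches.
            labels.append(unmatched)
    return "".join(labels)


if __name__ == "__main__":
    print(in_range("B", "A", "C"))   # uppercase B is inside [A-C]
    print(in_range("b", "A", "C"))   # lowercase b is not
    print(label_characters("555-0100", CHARACTER_SETS))
```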
Description
Label
Defines the label that a Labeler transformation applies to data that matches the character
set.
Standard Mode
Enables a simple editing view that includes fields for the start range and end range.
Start Range
End Range
Specifies the last character in a character range. For a range with a single character, leave
this field blank.
Advanced Mode
Enables an advanced editing view where you can manually enter character ranges using
range characters and delimiter characters.
Range Character
Temporarily changes the symbol that signifies a character range. The range character
reverts to the default character when you close the character set.
Delimiter Character
Temporarily changes the symbol that separates character ranges. The delimiter character
reverts to the default character when you close the character set.
Classifier Models
A classifier model analyzes input strings and determines the types of information that the strings are most
likely to contain. You use a classifier model in a Classifier transformation.
A classifier model contains reference data rows and label values. The rows represent the input data on the
port that you might connect to the Classifier transformation. The label values describe the types of
information that the data rows contain. When you configure a classifier model, you assign a label to each
reference data row in the model.
To link the reference data rows to the labels in a classifier model, you compile the model. The compilation
process generates a series of logical associations between the data rows and the label values. When you run
a mapping that reads the model, the Data Integration Service applies the model logic to the Classifier
transformation input data. The Data Integration Service returns the labels that most accurately describe the
information in each input data field.
You create a classifier model in the Developer tool. The Model repository stores the classifier model object.
The Developer tool writes the data rows, the labels, and the compilation data to a file in the Informatica
directory structure.
Pattern Sets
A pattern set contains expressions that identify data patterns in the output of a token labeling operation. You
can use pattern sets to analyze the Tokenized Data output port and write matching strings to one or more
output ports. Use pattern sets in Parser transformations that use pattern parsing mode.
For example, you can configure a Parser transformation to use pattern sets that identify names and initials.
This transformation uses the pattern sets to analyze the output of a Labeler transformation in token labeling
mode. You can configure the Parser transformation to write names and initials in the output to separate ports.
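The relationship between token labeling and pattern parsing can be sketched in a few lines of Python. This is an illustration under stated assumptions and not the product's implementation: the label names (WORD, INIT, OTHER), the port names, and the functions label_token and parse_name are all hypothetical.

```python
import re

def label_token(token: str) -> str:
    """Hypothetical token labeler: one label per token."""
    if re.fullmatch(r"[A-Za-z]\.?", token):
        return "INIT"   # a single letter, optionally followed by a period
    if token.isalpha():
        return "WORD"   # a purely alphabetic token
    return "OTHER"

# Hypothetical pattern set: each label sequence maps to a tuple of output ports.
NAME_PATTERNS = {
    ("WORD", "INIT", "WORD"): ("first_name", "initial", "last_name"),
    ("WORD", "WORD"): ("first_name", "last_name"),
}

def parse_name(text: str) -> dict:
    """Label the tokens, then write them to ports if the label sequence matches a pattern."""
    tokens = text.split()
    labels = tuple(label_token(token) for token in tokens)
    ports = NAME_PATTERNS.get(labels)
    if ports is None:
        # No pattern matched: keep the whole string in a catch-all output.
        return {"overflow": text}
    return dict(zip(ports, tokens))

parsed = parse_name("John Q. Public")
print(parsed)
```

The sketch separates the two stages on purpose. The labeling stage produces a sequence of labels, and the pattern stage matches only against that sequence, which mirrors how a pattern set reads the Tokenized Data output rather than the raw input.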
Description
Pattern
Defines the patterns that the pattern parser searches for. You can enter multiple patterns for one
pattern set. You can enter patterns constructed from a combination of wildcards, characters, and
strings.
Probabilistic Models
A probabilistic model analyzes input data values and determines the types of information that the values are
most likely to contain. Use a probabilistic model in a Labeler transformation and a Parser transformation.
A probabilistic model contains reference data values and label values. The reference data values represent
the data on an input port that you connect to the transformation. The label values describe the types of
information that the reference data values contain. You assign a label to each reference data value in the
model.
To link the reference data values to the labels in a probabilistic model, you compile the model. The
compilation process generates a series of logical associations between the data values and the labels. When
you run a mapping that reads the model, the Data Integration Service applies the model logic to the
transformation input data. The Data Integration Service returns the label that most accurately describes the
input data values.
You create a probabilistic model in the Developer tool. The Model repository stores the probabilistic model
object. The Developer tool writes the data values, the labels, and the compilation data to a file in the
Informatica directory structure.
Regular Expressions
In the context of content sets, a regular expression is an expression that you can use in parsing and labeling
operations. Use regular expressions to identify one or more strings in input data. You can use regular
expressions in Parser transformations that use token parsing mode. You can also use regular expressions in
Labeler transformations that use token labeling mode.
Parser transformations use regular expressions to match patterns in input data and parse all matching strings
to one or more outputs. For example, you can use a regular expression to identify all email addresses in input
data and parse each email address component to a different output.
Labeler transformations use regular expressions to match an input pattern and create a single label. Regular
expressions that have multiple outputs do not generate multiple labels.
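The difference between the two usages can be sketched with the Python re module. The email pattern below is a simplified, invented expression and not one that ships with the product, and the label names EMAIL and OTHER are hypothetical. Each capture group plays the role of one output.

```python
import re

# Simplified illustrative pattern with two capture groups:
# group 1 is the account name, group 2 is the domain name.
EMAIL = re.compile(r"([A-Za-z0-9._%+-]+)@([A-Za-z0-9.-]+\.[A-Za-z]{2,})")

text = "Contact jdoe@example.com or sales@example.org for details."

# Parser-style usage: every match is split into one value per capture group.
parsed = [match.groups() for match in EMAIL.finditer(text)]

# Labeler-style usage: a token either matches the whole expression or it does not,
# so each token receives a single label regardless of the number of groups.
labels = ["EMAIL" if EMAIL.fullmatch(token) else "OTHER" for token in text.split()]

print(parsed)
print(labels)
```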
Description
Number of Outputs
Defines the number of output ports that the regular expression writes.
Regular Expression
Test Expression
Contains data that you enter to test the regular expression. As you type data in this field,
the field highlights strings that match the regular expression.
Next Expression
Moves to the next string that matches the regular expression and changes the font of that
string to bold.
Previous Expression
Moves to the previous string that matches the regular expression and changes the font of
that string to bold.
Token Sets
A token set contains expressions that identify specific tokens. You can use token sets in Labeler
transformations that use token labeling mode. You can also use token sets in Parser transformations that use
token parsing mode.
Use token sets to identify specific tokens as part of labeling and parsing operations. For example, you can
use a token set to label all email addresses that use an "AccountName@DomainName" format. After
labeling the tokens, you can use the Parser transformation to write email addresses to output ports that you
specify.
Property
Description
Name
N/A
Description
N/A
N/A
Label
Regular Expression
Regular Expression
Regular Expression
Test Expression
Regular Expression
Next Expression
Regular Expression
Previous Expression
Regular Expression
Label
Character
Standard Mode
Character
Start Range
Character
End Range
Character
Advanced Mode
Character
Range Character
Character
Delimiter Character
Character
When you run a mapping that includes a model, the Data Integration Service applies the compiled model
logic to the transformation input data. The Data Integration Service does not read the data values or the
labels in the model when the mapping runs.
You can optionally remove the data values and the labels from a probabilistic model or a classifier model.
For example, you might decide to remove sensitive data or proprietary data from a model. You can
remove individual data values and labels in the Developer tool. You can remove all data values and labels
when you export a model from the Model repository.
Note: If you remove all data values and labels from a model, you cannot compile the model.
When you remove one or more data values or labels from a model, the compiled model logic no longer
represents the current data in the model file. To synchronize the model logic and the data values and
labels, compile the model again. Do not compile the model if you want to maintain the current model logic.
To protect the data in a classifier model or a probabilistic model, back up the model file in the Informatica
directory structure. Back up the file before you remove all the data values and labels from a model.
Find the model files on the Content Management Service host machine.
Probabilistic model files have the following default location and file name extension:
<Informatica_Installation_Directory>/tomcat/bin/ner/<filename>.ner
Classifier model files have the following default location and file name extension:
<Informatica_Installation_Directory>/tomcat/bin/classifier/<filename>.classifier
If you upgrade the Informatica installation, you might need to compile the probabilistic models and
classifier models before you use the models in a mapping. If a model does not contain any data, replace
the current file in the Informatica directory structure with the backup file that contains the data.
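The backup recommendation above can be scripted. The following is a minimal sketch that assumes the default locations and file name extensions listed above. The function name backup_model_files and both directory arguments are hypothetical, and you supply the actual installation directory.

```python
import shutil
from pathlib import Path

def backup_model_files(install_dir, backup_dir):
    """Copy probabilistic (.ner) and classifier (.classifier) model files to a backup directory.

    install_dir is the Informatica installation directory (supplied by the caller).
    Returns the list of backup file paths that were written.
    """
    install_dir, backup_dir = Path(install_dir), Path(backup_dir)
    # Default subdirectories and extensions, as listed in the text above.
    sources = {"ner": "*.ner", "classifier": "*.classifier"}
    copied = []
    for subdir, pattern in sources.items():
        source = install_dir / "tomcat" / "bin" / subdir
        if not source.is_dir():
            continue  # nothing to back up for this model type
        target = backup_dir / subdir
        target.mkdir(parents=True, exist_ok=True)
        for model_file in sorted(source.glob(pattern)):
            # copy2 preserves file timestamps along with the contents.
            shutil.copy2(model_file, target / model_file.name)
            copied.append(target / model_file.name)
    return copied
```

Run the function before you remove data values and labels from a model, so that the backup still contains the data that a later compilation needs.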
2.
3.
Related Topics:
In the Object Explorer view, select a project or folder to store the content set.
2.
3.
4.
Optionally, select Browse to change the Model repository location for the content set.
5.
Click Finish.
Open a content set in the editor and select the Content view.
2.
3.
Click Add.
4.
5.
6.
Click Finish.
Tip: You can copy reference data objects from one content set to another. Use the Copy To and Paste From
options to create a copy of an object in a content set. Use the CTRL key to select multiple content set
objects.
CHAPTER 4
Classifier Models
This chapter includes the following topics:
Classifier Scores
The input data contains text. Classifier models apply natural language processes to text data to identify
the types of information in the text. Natural language processes detect relevant words in the input string.
Natural language processes disregard words that are not relevant.
The input data strings contain multiple values. For example, you can create a data column that contains
the contents of an email message in each field.
The Classifier transformation reads string datatypes. The transformation imposes no limit on the length of the
input strings.
You compile classifier models in the Developer tool. When you compile a model, you create associations
between similar data values in the model. The Classifier transformation uses the compiled data to search for
information in the input data.
Classifier Scores
A Classifier transformation compares each row of input data with every row of reference data in a classifier
model. The transformation calculates a score for each comparison. The scores represent the degrees of
similarity between the input row and the reference data rows.
When you run a mapping that contains a Classifier transformation, the mapping returns the label that
identifies the reference data row with the highest score. The score range is 0 through 1. A high score
indicates a strong match between the input data and the model data.
Review the classifier scores to verify that the label output accurately describes each row of input data. You
can also review the scores to verify that the classifier model is appropriate to the input data. If the
transformation output contains a large percentage of low scores, the classifier model might be inappropriate.
To improve the comparisons, compile the model again. If the compiled model does not improve the scores,
replace the model in the transformation.
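The idea of scoring an input row against every reference row and returning the label with the highest score can be sketched as follows. The guide does not describe how the Classifier transformation calculates its scores, so the word-overlap (Jaccard) measure below is a stand-in chosen only because it also yields values from 0 through 1. The reference rows, the labels, and the function names are invented.

```python
def similarity(text_a: str, text_b: str) -> float:
    """Word-overlap (Jaccard) similarity in the range 0 through 1. Illustrative stand-in only."""
    words_a, words_b = set(text_a.lower().split()), set(text_b.lower().split())
    if not words_a or not words_b:
        return 0.0
    return len(words_a & words_b) / len(words_a | words_b)

def classify(text: str, reference_rows):
    """Score the input against every reference row and return (label, score) for the best row."""
    scored = [(label, similarity(text, reference)) for reference, label in reference_rows]
    return max(scored, key=lambda pair: pair[1])

# Invented reference data: (reference data row, label).
REFERENCE_ROWS = [
    ("please reset my account password", "Support"),
    ("invoice payment is overdue", "Billing"),
]

label, score = classify("my password reset failed", REFERENCE_ROWS)
print(label, score)
```

A low best score, as in an input that shares few words with any reference row, is the signal that the review step above looks for.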
1.
2.
Create a data object in the Model repository that reads the file or the database table.
3.
Create data objects in the Model repository for each language that a message uses.
4.
Create a classifier model that contains sample text for each language.
Note: You can use sample data from the email messages data as source data for the model.
5.
6.
Add the Classifier transformation and the data objects to the mapping.
Connect the Classifier transformation output ports to the target data objects.
When you run the mapping, the Classifier transformation analyzes the email messages and writes the email
text to the correct data target. You can share the data targets with the team members in each department.
Filter field
Filters the list of reference data rows based on the data value or the label that you specify.
2.
Add Row
Inserts a blank reference data row.
3.
Append data
Imports data from a data object in the Model repository.
4.
Delete
Deletes the reference data rows that you select. Use the check boxes to select the rows.
5.
Assign Label
Assigns a label to one or more reference data rows that you select. Use the check boxes to select the
rows.
6.
Edit Properties
Displays the classifier model properties.
7.
Manage Labels
Opens the Manage Labels dialog box. Use the dialog box to add or delete label values from the
classifier model.
8.
Compile
Compiles the classifier model.
9.
Total records
Indicates the number of reference data rows in the classifier model.
10.
Label field
Displays a label value that you can apply to the current reference data row.
11.
Find field
Finds a data value that you specify in the current reference data row.
A reference data field can be of any length. You can enter pages of text into each data field.
You cannot edit reference data values. However, you can delete a data row.
When you compile a classifier model, the compilation process disregards any number values in the
reference data.
Name column.
Contains the label values that the Classifier transformation can apply to the input data rows. You can
sort the labels by name.
2.
Type column.
Identifies the source of the label values. The classifier model identifies all labels as user-defined values.
3.
Usage column.
Indicates the number of reference data rows that use each label. You can sort the labels by the number
of rows.
4.
Add button.
Adds a label to the classifier model. Enter a label value in the Name column on the row.
Note: To update a label value, double-click the value and enter the value that you need.
5.
Delete button.
Deletes a label from the classifier model.
6.
Up arrow.
Moves the label up a single row in the dialog box.
7.
Down arrow.
Moves the label down a single row in the dialog box.
Identify the reference data values and the label values to add to the model.
You can use a fragment of the data that you want to classify. Create a data object in the Model
repository that reads the data fragment.
2.
Create a content set, and add a classifier model to the content set.
3.
4.
5.
6.
After you compile the classifier model, you can use the model in a Classifier transformation.
2.
3.
4.
5.
Browse the Model repository and select the data object that contains the data to import.
Review the columns on the data object, and select one or more columns to add to the model. You can
add reference data columns and a label column in the same operation.
To import a column of data as reference data, select the column name and click Data.
You can select multiple data columns. The Developer tool merges the contents of the columns that
you select to a single column.
To import a column of data as label values, select the column name and click Label.
When you import reference data and label values, the Developer tool assigns the label on each row to
the reference data string on the same row. You can preview the data before you select the columns. You
can change the label assignments after you create the model.
Click Next.
7.
8.
After you create the model, verify the label assignments and compile the model.
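The import behavior described in the steps above, where several data columns merge to a single column and each label pairs with the data on the same row, can be sketched as follows. The guide does not state how the merged values are separated, so the single space used here is an assumption. The function name, column names, and sample records are invented.

```python
def build_model_rows(records, data_columns, label_column, separator=" "):
    """Merge the selected data columns on each row and pair the result with that row's label.

    The separator between merged column values is an assumption; the guide does not specify it.
    """
    rows = []
    for record in records:
        merged = separator.join(str(record[column]) for column in data_columns)
        rows.append((merged, record[label_column]))
    return rows

# Invented sample records that stand in for a data object in the Model repository.
records = [
    {"name": "Sunnydream", "type": "Orange Juice", "category": "Beverage"},
    {"name": "Crispo", "type": "Potato Chips", "category": "Snack"},
]

rows = build_model_rows(records, data_columns=["name", "type"], label_column="category")
print(rows)
```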
2.
3.
4.
Browse the Model repository and select the data object that contains the data to import.
Do not select a social media data object.
Click Next.
5.
Review the columns on the data object, and select one or more columns to add to the model. You can
add reference data columns and a label column in the same operation.
To import a column of data as reference data, select the column name and click Data.
You can select multiple data columns. The Developer tool merges the contents of the columns that
you select to a single column.
To import a column of data as label values, select the column name and click Label.
When you import reference data and label values, the Developer tool assigns the label on each row to
the reference data string on the same row. You can preview the data before you select the columns. You
can change the label assignments after you create the model.
Click Next.
6.
7.
2.
3.
4.
2.
3.
4.
Click New.
The Developer tool adds a row at the bottom of the list of labels.
5.
Double-click the default value in the Name column, and enter a label name.
6.
Click OK.
After you create the label, you can assign the label to one or more rows of reference data. The Usage column
in the Manage Labels dialog box indicates the number of rows that use the label.
2.
3.
Select one or more reference data rows. Use the check box options to select the rows.
Note: You can use the filter option to show all of the rows that contain a data value that you specify. Use
the Select All check box option to select all of the rows that contain the value.
4.
5.
Optionally, compile the model to add the label names to the classifier model logic.
2.
3.
Open the Manage Labels dialog box. The dialog box lists the labels in the classifier model.
4.
2.
3.
Select one or more reference data rows. Use the check box options to select the rows.
4.
Click Delete.
The Developer tool removes the rows that you selected from the classifier model.
2.
3.
4.
Click Delete.
5.
6.
To update the compilation data, open the model in the Developer tool and click Compile.
Find the reference data rows that contain a value that you enter.
Find the reference data rows that use a label that you select.
You can also search for a data value within a row of reference data.
2.
3.
2.
3.
2.
3.
4.
5.
Use the Up arrow or the Down arrow to find additional instances of the value in the row.
2.
3.
4.
Click OK.
The Developer tool copies the classifier model to the selected content set.
2.
3.
4.
Click OK.
The Developer tool pastes the classifier model to the current content set.
CHAPTER 5
Probabilistic Models
This chapter includes the following topics:
Natural language processes can recognize similar data values and apply the same label to the values.
Natural language processes can compare a data value to the adjacent values in the string. Natural
language processes analyze the sequence of values to understand the usage of each string and to verify
the types of information that the strings represent.
Field 1
Field 2
Field 3
19132954
AIM SECURITIES
PETRIE TAYBRO
10110169
JASE TRAPANI
10111786
JAN SEEDORF
10112299
FELIX LEVENGER
HARVARD MAGAZINE
10112036
RICHARD TREMBLAY
BERGER ASSOCIATES
10111101
DAREEN HULSMAN
Row ID
Field 1
Field 2
Field 3
19131385
PATRICK MCKINNIE
LAKENYA PASKETT
15954710
When you run the mapping, the Labeler transformation compares the input data with the probabilistic model
reference data. The Labeler transformation selects a label for the data on each input port. The transformation
writes the labels to an output port. Each output row contains a set of labels that defines the data structure on
the corresponding input row.
The following table describes the labels that the Labeler transformation adds to the output port:
Row ID
Output Labels
Product Name    Product Type    Product Details    Product Size
Sunnydream      Orange Juice    Unsweetened        12 oz
Add Row
Inserts a blank data row.
2.
Append data
Imports data from a data object in the Model repository.
3.
Cut
Removes a data row from the probabilistic model and adds the data row to the clipboard.
4.
Copy
Copies a data row to the clipboard.
5.
Paste
Pastes a data row from the clipboard to the probabilistic model.
6.
Delete
Deletes a data row from the probabilistic model.
7.
Manage Labels
Opens the Manage Labels dialog box. Use the dialog box to add or delete label values from the
probabilistic model.
8.
Assign Label
Assigns a label to one or more reference data values that you select. You can use the option to assign a
label to all instances of a reference data value in the model.
9.
Edit Properties
Displays the probabilistic model properties.
10.
Compile
Compiles the probabilistic model.
11.
Find field
Finds rows in the model that contain the reference data value that you enter. Use the Up arrow and
Down arrow to move to the rows that contain the value.
12.
13.
14.
Label field
Displays a label value that you can apply to the reference data value that you select.
15.
Label menu
Displays a list of options that you can use to assign a label to one or more reference data values. To
open the menu, right-click a reference data value in the reference data editor.
Manage Labels
Opens the Manage Labels dialog box. Use the dialog box to add or delete label values from the
probabilistic model.
2.
Assign Label
Assigns a label to one or more reference data values that you select.
You can assign a label to a single data value, or you can assign a label to multiple values in a single
operation.
3.
Edit Properties
Displays the probabilistic model properties.
4.
Compile
Compiles the probabilistic model.
5.
6.
Assignment filter
Filters the list of reference data values that use the label that you select. The filter options show or hide
the reference data values based on the method that you used to assign the label to the data values.
When you apply a filter, the total number of labeled values in the Label view reflects the number of
values that satisfy the filter condition.
7.
8.
Input Column     Label Name
Product_Name     Product_Name
Quantity         Quantity
Location         Location
Barcode          Barcode
SKU              Stock_Keeping_Unit
Arrival_Date     Arrival_Date
Cost_Price       Cost_Price
Note: You can use the input column names, or you can use other names. The names do not need to match.
Overflow Label
When a transformation cannot apply a label to an input data value, the transformation treats the data value
as overflow data. The Labeler transformation applies an overflow label to any data value that it cannot
identify. The Parser transformation writes any data value that it cannot identify to an overflow port.
The following table shows how a Parser transformation might use an overflow port to parse address data
elements that a probabilistic model does not recognize:
Input Data                Street_Name port    Street_Descriptor port    Overflow port
Park Place                Park                Place                     No overflow data
Park Avenue               Park                Avenue                    No overflow data
Madison Avenue            Madison             Avenue                    No overflow data
Central Park              Central             Park                      No overflow data
Washington Square Park    Washington          Square                    Park
Madison Square Garden     Madison             Square                    Garden
The Parser transformation also writes values to an overflow port when the number of input values is greater
than the number of labels in the model. Before you use a probabilistic model in a transformation, review the
input data and verify that the model contains the correct number of label values.
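The overflow behavior for inputs that contain more values than the model has labels can be sketched as follows. This is a deliberately simplified illustration: it assigns values to ports by position, whereas the Parser transformation relies on the probabilistic model to decide which label fits each value. The function name parse_to_ports and the "Overflow" key are hypothetical.

```python
def parse_to_ports(value, ports, delimiter=" "):
    """Assign the delimited values to ports in order and send any surplus values to overflow.

    Positional assignment is a simplification; the real transformation uses the model's labels.
    """
    tokens = value.split(delimiter)
    result = {port: None for port in ports}
    result.update(zip(ports, tokens))          # fills at most len(ports) values
    surplus = tokens[len(ports):]              # values beyond the number of ports
    result["Overflow"] = delimiter.join(surplus) if surplus else None
    return result

PORTS = ["Street_Name", "Street_Descriptor"]

for address in ["Park Place", "Washington Square Park", "Madison Square Garden"]:
    print(address, "->", parse_to_ports(address, PORTS))
```

With two ports, the two-value addresses produce no overflow data, and the three-value addresses send the third value to the overflow output, which matches the rows in the table above.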
Related Topics:
Rules and Guidelines for Probabilistic Models and Classifier Models on page 41
Identify the reference data values and the label values to add to the model.
You can use a fragment of the data that you want to analyze. Create a data object in the Model
repository that reads the data fragment.
2.
Create a content set, and add a probabilistic model to the content set.
3.
4.
5.
6.
After you compile the probabilistic model, you can use the model in a transformation.
2.
3.
4.
5.
6.
Click Finish.
2.
3.
4.
5.
6.
Browse the Model repository and select the data object that contains the data to import.
Do not select a social media data object.
Click Next.
7.
Review the columns on the data object, and select one or more columns to add to the model. You can
add reference data columns and a label column in the same operation.
To import a column of data as reference data, select the column name and click Data.
You can select multiple data columns. The Developer tool merges the contents of the columns that
you select to a single column.
To import a column of data as label values, select the column name and click Label.
When you import reference data and label values, the Developer tool assigns the label on each row to
the reference data string on the same row. You can preview the data before you select the columns. You
can change the label assignments after you create the model.
Click Next.
8.
9.
Specify the delimiters for the data values that you import.
You can specify different delimiters for reference data values and label values. The default delimiter is a
character space.
10.
After you create the probabilistic model, verify the label assignments and compile the model.
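The role of the delimiters in the import steps above can be sketched as follows. The sketch assumes that the values and labels on a row pair up by position and that a count mismatch is an error. Neither assumption is confirmed by the guide. The function name split_row, the "|" delimiter, and the sample row are invented, and the default delimiter is a character space as stated above.

```python
def split_row(data, labels, data_delimiter=" ", label_delimiter=" "):
    """Split one imported row into (value, label) pairs.

    Positional pairing and the error on a count mismatch are assumptions for illustration.
    """
    values = data.split(data_delimiter)
    label_values = labels.split(label_delimiter)
    if len(values) != len(label_values):
        raise ValueError(
            f"{len(values)} data values but {len(label_values)} labels on the same row"
        )
    return list(zip(values, label_values))

# The data values contain spaces, so the row uses a different delimiter for the data
# than for the labels.
pairs = split_row(
    "Sunnydream|Orange Juice|Unsweetened|12 oz",
    "Product_Name Product_Type Product_Details Product_Size",
    data_delimiter="|",
)
print(pairs)
```

The example shows why separate delimiters are useful: a space cannot delimit values such as "Orange Juice" that contain a space themselves.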
2.
3.
4.
Browse the Model repository and select the data object that contains the data to import.
Do not select a social media data object.
Click Next.
5.
Review the columns on the data object, and select one or more columns to add to the model. You can
add reference data columns and a label column in the same operation.
To import a column of data as reference data, select the column name and click Data.
You can select multiple data columns. The Developer tool merges the contents of the columns that
you select to a single column.
To import a column of data as label values, select the column name and click Label.
When you import reference data and label values, the Developer tool assigns the label on each row to
the reference data string on the same row. You can preview the data before you select the columns. You
can change the label assignments after you create the model.
Click Next.
6.
7.
Specify the delimiters for the data values that you import.
You can specify different delimiters for reference data values and label values. The default delimiter is a
character space.
8.
2.
3.
4.
Select the row that you added, and enter one or more reference data values to the row.
5.
After you save the model, assign a label to each value in the row. Optionally, compile the model.
2.
3.
4.
5.
Edit the label name. Optionally, update the color for the label.
6.
7.
After you add the label, assign the label to at least one data value.
2.
3.
4.
Find a data value that does not have a label or that has an incorrect label. Data values that use a label
are color-coded.
5.
6.
Right-click a data value in the editor and select a label from the context menu.
The Developer tool assigns the label to the data value.
7.
After you save the probabilistic model, optionally compile the model.
2.
3.
4.
5.
Match case.
Specifies that the search operation is case sensitive. Do not use wildcard characters with the option.
Match full string.
Specifies that the search operation looks for a complete match between the characters in the
reference data value and the characters that you enter. Do not use wildcard characters with the option.
6.
Select a label to assign to the reference data values that match the search criteria.
You can also select the No Label option. Select the option to remove the label from the reference data
values that include the characters that you enter.
7.
Click Start.
The Developer tool assigns the label to all reference data values that match the search criteria that you
define.
Note: To view the reference data values that you labeled in a single operation, use the Assigned by
bulk filter in the Label view.
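The search options in the bulk assignment steps above can be sketched as follows. The row structure, the function names, and the "bulk" marker are hypothetical, and the sketch does not implement wildcard characters. Passing None as the label models the No Label option.

```python
def is_match(value, search, match_case=False, match_full_string=False):
    """Apply the Match case and Match full string options to one reference data value."""
    if not match_case:
        value, search = value.lower(), search.lower()
    return value == search if match_full_string else search in value

def bulk_assign(rows, search, label, match_case=False, match_full_string=False):
    """Assign the label to every matching row and return the number of rows changed."""
    changed = 0
    for row in rows:
        if is_match(row["value"], search, match_case, match_full_string):
            row["label"] = label
            # Record how the label was assigned so that a filter can use it later.
            row["assigned_by"] = "bulk" if label is not None else None
            changed += 1
    return changed

def filter_by_assignment(rows, label, assigned_by=None):
    """Return the values that use the label, optionally restricted by assignment method."""
    return [
        row["value"]
        for row in rows
        if row["label"] == label and (assigned_by is None or row["assigned_by"] == assigned_by)
    ]

# Invented reference data values without labels.
rows = [
    {"value": "Avenue", "label": None, "assigned_by": None},
    {"value": "avenue", "label": None, "assigned_by": None},
    {"value": "Park Avenue", "label": None, "assigned_by": None},
]

changed = bulk_assign(rows, "Avenue", "Street_Descriptor", match_case=True, match_full_string=True)
print(changed, filter_by_assignment(rows, "Street_Descriptor", assigned_by="bulk"))
```

With both options selected, only the exact value "Avenue" receives the label. With both options cleared, the same search also matches "avenue" and "Park Avenue".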
2.
3.
4.
Click Delete.
The Developer tool removes the rows that you selected from the probabilistic model.
2.
3.
4.
5.
Click Delete.
6.
7.
Note: A label is a structural element in a probabilistic model. If you add or remove a label after you add the
model to a transformation, you invalidate the operation that uses the model. To use the model that you
updated, delete and re-create the transformation operation.
To compile the model, open the model in the Developer tool and click Compile.
2.
3.
4.
5.
Use the Up arrow or Down arrow to move to other rows that contain the value.
2.
3.
4.
Apply a filter to the list of reference data values that use the label.
Select one of the following filters:
All. Displays the reference data values that use the label. All is the default option.
Assigned by user. Displays any reference data value that you selected individually when you
assigned the label.
Assigned by bulk. Displays the reference data values to which you assigned a label as part of a bulk
assignment operation.
The probabilistic model displays the reference data values that satisfy the filter condition.
2.
3.
2.
3.
4.
Click OK.
The Developer tool copies the probabilistic model to the selected content set.
2.
3.
4.
Click OK.
The Developer tool pastes the probabilistic model to the current content set.
2.
3.
4.
You can use the Ctrl + V keys to paste the rows to a text editor or to the Data view of another probabilistic
model.
Index
A
Analyst tool
find and replace reference data values 26
C
character sets 37
classifier models
in content sets 37
rules and guidelines 41
Content Management Service
reference table privileges 14
content sets
character sets 37
classifier models 37
pattern sets 38
probabilistic models 38
regular expressions 39
token sets 39
version control 14, 24, 31
creating a reference table from column patterns
reference tables 20
creating a reference table from profile column data
reference tables 19
creating a reference table manually
reference tables 18
E
exporting a reference table
reference tables 26
I
importing a reference table
reference tables 22
M
managed reference tables 13
managing columns
reference tables 25
managing rows
reference tables 25
R
reference tables
Analyst tool overview 16
Content Management Service 13
creating a reference table from column patterns 20
creating a reference table from profile columns 19
creating a reference table manually 18
Developer tool overview 32
exporting a reference table 26
finding and replacing values in the Analyst tool 26
importing a reference table 22
in pattern-based parsing 13
managed and unmanaged 13
managed reference tables 13
managing columns 25
managing rows 25
privileges 14
properties in the Analyst tool 17
properties in the Developer tool 33
reference data warehouse 13
refreshing in Analyst tool 27
unmanaged reference tables 13
version control 14, 24, 31
viewing audit trail tables 28
regular expressions 39
T
token sets 39
U
unmanaged reference tables
definition 13
synchronizing with the Model repository 14
V
version control
content sets 14, 31
reference tables 14
reference tables in the Analyst tool 24
reference tables in the Developer tool 31
viewing audit table events
reference tables 28
pattern sets 38
privileges
Content Management Service 14
probabilistic models
in content sets 38