Iq8 Modguide
Modifiers Guide
Notices
Published in the United States of America by Firstlogic, Inc., 100 Harborview Plaza,
La Crosse, Wisconsin 54601-4071.
Customer Care
Technical help is free for customers who are current on their ESP. Advisors are
available from 8 a.m. to 6 p.m. central time, Monday through Friday. When you call,
have at hand the users manual and the version number of the product you are using.
Call from a location where you can operate your software while speaking on the
phone. To save time, fax or e-mail your questions, and an advisor will call or e-mail
back with answers prepared. Or visit our Knowledge Base on the Customer Portal
web site, where you can find answers on your own, right away, at any time of the day
or night.
Our Customer Care group also manages our customer database and order processing.
Call them for order status, shipment tracking, reporting damaged shipments or flawed
media, changes in contact information, and so on.
Legal notices
Phone
Web site: http://www.firstlogic.com/customer
E-mail: customer@firstlogic.com
Product literature
Corporate receptionist
The Firstlogic Technical Publications group strives to bring you the most useful and
accurate publications possible. Please give us your opinion about our documentation
by filling out the brief survey at http://www.firstlogic.com/customer/surveys/
default.asp. We appreciate your feedback! Thank you!
© 2004 Firstlogic, Inc. All rights reserved. This publication and accompanying software are protected by U.S. copyright
law and international treaties. No part of this publication or accompanying software may be copied, transferred, or
distributed to any person without the express written permission of Firstlogic, Inc.
National ZIP+4 Directory © 2004 United States Postal Service. Firstlogic Directories © 2004 Firstlogic, Inc. All City,
ZCF, state ZIP+4, regional ZIP+4, and supporting directories are also protected under the Firstlogic copyright. Firstlogic,
Inc. is a nonexclusive interface distributor of the USPS and holds a nonexclusive license to publish and sell ZIP+4
databases on optical and magnetic media. Firstlogic publishes this document and offers the Firstlogic product to the public
under a nonexclusive license from the United States Postal Service. The price of the Firstlogic product is not established,
controlled, or approved by the U.S. Postal Service.
Firstlogic, Inc., or any authorized dealer distributing this product, makes no warranty, expressed or implied, with respect to
this computer software product or with respect to this manual or its contents, its quality, performance, merchantability, or
fitness for any particular purpose or use. It is solely the responsibility of the purchaser to determine its suitability for a
particular purpose or use. Firstlogic, Inc. will in no event be liable for direct, indirect, incidental, or consequential damages
resulting from any defect or omission in this software product, this manual, the program disks, or related items and
processes, including, but not limited to, any interruption of service, loss of business or anticipatory profit, even if
Firstlogic, Inc. has been advised of the possibility of such damages. This statement of limited liability is in lieu of all other
warranties or guarantees, expressed or implied, including warranties of merchantability or fitness for a particular purpose.
1L, 1L (ball design), ACE, ACSpeed, DataJet, DocuRight, eDataQuality, Entry Planner, Firstlogic, Firstlogic InfoSource,
FirstPrep, FirstSolutions, GeoCensus, idCentric, IQ Insight, iSummit, Label Studio, MailCoder, Match/Consolidate,
PostWare, Postalsoft, Postalsoft Address Dictionary, Postalsoft Business Edition by Firstlogic, Postalsoft DeskTop Mailer,
Postalsoft DeskTop PostalCoder, Postalsoft DeskTop Presort, Postalsoft Manifest Reporter, PrintForm, RapidKey, Total
Rewards, and TrueName are registered trademarks of Firstlogic, Inc. DataRight, IRVE, and TaxIQ are trademarks of
Firstlogic, Inc. CASS, DPV, eLOT, FASTforward, NCOAlink and ZIP are trademarks of the United States Postal Service.
All other trademarks are the property of their respective owners.
Contents
Preface .............................................................................................................5
Chapter 1:
Custom parsing dictionaries......................................................................... 7
Step 1: Query the dictionary.............................................................................9
Step 2: Create a parsing transaction file.........................................................12
Step 3: Put your entries in the transaction file ...............................................13
Step 4: Build your custom parsing dictionary................................................15
Step 5: Maintain and update your custom dictionary.....................................16
Sample transaction: Add a new word.............................................................17
Sample transaction: Add a title phrase...........................................................18
Sample transaction: Add a multiple-word firm name ....................................20
Sample transaction: Add a firm that looks like a personal name ...................22
Sample transaction: Modify information codes .............................................23
Sample transaction: Modify standards and standard-types ............................24
Sample transaction: Add an acronym for acronym conversion .....................25
Rules for working with match standards........................................................26
Chapter 2:
Custom capitalization dictionaries ............................................................ 29
Step 1: Create a capitalization transaction file ...............................................30
Step 2: Put your entries in the transaction file ...............................................31
Step 3: Build your custom capitalization dictionary ......................................32
Step 4: Update your custom dictionary ..........................................................33
Chapter 3:
User-defined pattern matching (UDPM)................................................... 35
Overview of UDPM .......................................................................................36
Working with the pattern file .........................................................................37
Introduction to regular expressions ................................................................39
Creating regular expressions ..........................................................................42
Example of defining a pattern ........................................................................43
Alternate expressions .....................................................................................45
Multiple rules .................................................................................................46
Example of a user-defined pattern file ...........................................................47
Chapter 4:
Modify the rule file...................................................................................... 51
What is the rule file? ......................................................................................52
How the rule file is organized ........................................................................53
Rule example..................................................................................................54
Definition section of a parsing rule ................................................................55
Action section of a parsing rule......................................................................58
Example of a parsing rule...............................................................................62
Chapter 5:
Check parsing results with QuickParse.................................................... 67
Get started with QuickParse .......................................................................... 68
Run QuickParse ............................................................................................. 70
Appendix A:
UMD configuration file, umd.cfg................................................................ 75
Appendix B:
UMD command line..................................................................................... 77
Appendix C:
Information codes and standard-type codes ............................................. 79
Index.............................................................................................................. 85
Preface
This guide explains the User-Modifiable Dictionary (UMD), which is a tool for
viewing and customizing dictionary files.
This guide explains how to use the command-line version of UMD to create
custom parsing dictionaries and custom capitalization dictionaries.
Another way you can modify the Data Cleanse transform's behavior to suit your
needs is to edit the pattern file (drludpm.dat). You edit the pattern file in order to
parse user-defined data patterns.
Additionally, you can edit the rules that are used to parse different types of name
and firm data. For more information, see "Modify the rule file" on page 51.
The last chapter of this guide explains how to check your results with QuickParse.
Use QuickParse to quickly see how data that you input would parse if input
through the Data Cleanse transform.
Related documents
Before using UMD, you should understand how your Firstlogic product uses the
dictionaries. For details, see your product documentation.
Conventions
Convention      Description
Bold            We use bold type for file names, paths, emphasis, and text that you should type exactly as shown. For example, Type iq8\bin.
Italics         We use italics for emphasis and text for which you should substitute your own data or values. For example, Type a name for your project, and the .xml extension (projectname.xml).
Menu commands   We indicate commands that you choose from menus in the following format: Menu Name > Command Name. For example, Choose File > New.
Changes
Chapter 1:
Custom parsing dictionaries
What is a parsing dictionary?
The parsing dictionary identifies and parses name, title, and firm data. The parser
looks up words in the parsing dictionary to retrieve information. The parser then
uses the dictionary information, as well as the rule file, to identify and parse
name, title, and firm data.
The parsing dictionary contains entries for words and phrases. Each entry tells
how the word or phrase might be used. For example, the dictionary indicates that
the word Engineering can be used in a firm name (such as Smith Engineering,
Inc.) or job title (such as VP of Engineering).
The dictionary also contains other information:
Type of information in dictionary   Description
Acronyms
Match standards
Gender
Address
Our base parsing dictionary contains thousands of name, title, and firm entries.
You might tailor the dictionary to better suit your data. For example:
You might customize the dictionary to correct specific parsing behavior. For
example, given the name Mary Jones, CRNA, the word CRNA is parsed as a
job title. In reality, CRNA is a postname (Certified Registered Nurse
Anesthetist). To correct this, you could add CRNA to the parsing dictionary as
a postname.
You might tailor the dictionary to better suit your data by adding regional or
ethnic names, special titles, or industry jargon. For example, if you process
data for the real estate industry, you might add postnames such as CRS
(Certified Residential Specialist) and ABR (Accredited Buyer
Representative).
If a specific title or firm name is parsed incorrectly, you can add an entry for
the entire phrase. For example, the parser previously identified Hewlett
Packard as a personal name, so we added Hewlett Packard to the dictionary
as a firm name.
Overview of creating a dictionary
[Figure: UMD Build combines the source dictionary with your transaction file to produce a custom dictionary.
Transaction file: a database containing your additions and changes.
Supporting files: files that enable UMD to read the transaction file.
Custom dictionary: a new dictionary containing entries from the source dictionary with your additions and changes.]
Qualifications
A note about examples
The sample queries and transactions in this chapter are for example only. By the
time you read this manual, the particular examples may have been added to our
base parsing dictionaries, so your query results may differ from what is shown.
To query a dictionary, run UMD Show. To run UMD Show, use the command line
(see "UMD Show" on page 77). UMD Show is interactive. You enter a query and
UMD Show responds, either with data or a message that your query was not
found in the dictionary.
To query a single word, type the word at the Enter> prompt. Do not include any
punctuation. If the word is in the dictionary, UMD Show displays the dictionary
entry:
C:\umd /s parsing.dct
Using a parsing Dictionary.
Enter a query, or press <Esc> to exit.
Enter> Beth
Usage: 99
Intl Code(s): USENGLISH
Info Code(s): NAME_STRONG_FN NAMEGEN5
Standard(s) for BETH:
- BETHANY NAME_MTC
- BETHEL NAME_MTC
- ELIZABETH NAME_MTC
If the word is not in the dictionary, UMD Show tells you the entry was not found:
C:\umd /s parsing.dct
Using a parsing Dictionary.
Enter a query, or press <Esc> to exit.
Enter> Michelangelo
Text not found in dictionary.
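If you capture UMD Show output to a text file, a short script can turn an entry into structured data for review. The following Python sketch is illustrative only and is not part of UMD. The function name parse_umd_show is hypothetical, and the sketch assumes that standard-type codes end in _MTC, _STD, or _ACR and that all codes are uppercase, as in the samples in this guide.

```python
import re

# Assumption: standard-type codes end in _MTC, _STD, or _ACR, as in the samples.
TYPE_CODE = re.compile(r"^[A-Z]+_(MTC|STD|ACR)$")
# Assumption: every code consists of uppercase letters, digits, and underscores.
CODE = re.compile(r"^[A-Z][A-Z0-9_]*$")

SAMPLE = """Enter> Beth
Usage: 99
Intl Code(s): USENGLISH
Info Code(s): NAME_STRONG_FN NAMEGEN5
Standard(s) for BETH:
- BETHANY
NAME_MTC
- BETHEL
NAME_MTC
- ELIZABETH
NAME_MTC
"""


def parse_umd_show(text):
    """Parse one captured UMD Show parsing-dictionary entry into a dict."""
    entry = {"usage": None, "intl": [], "info": [], "standards": {}}
    target = None  # list that wrapped (continuation) lines are appended to
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("Usage:"):
            entry["usage"] = int(line.split(":", 1)[1])
            target = None
        elif line.startswith("Intl Code(s):"):
            target = entry["intl"]
            target.extend(line.split(":", 1)[1].split())
        elif line.startswith("Info Code(s):"):
            target = entry["info"]
            target.extend(line.split(":", 1)[1].split())
        elif line.startswith("Standard(s) for"):
            target = None
        elif line.startswith("- "):
            # A standard, optionally followed by its type codes on the same line.
            tokens = line[2:].replace(",", " ").split()
            name = " ".join(t for t in tokens if not TYPE_CODE.match(t))
            target = entry["standards"].setdefault(name, [])
            target.extend(t for t in tokens if TYPE_CODE.match(t))
        elif target is not None:
            # Wrapped line: keep only tokens that look like codes.
            target.extend(
                t for t in line.replace(",", " ").split() if CODE.match(t)
            )
    return entry


if __name__ == "__main__":
    print(parse_umd_show(SAMPLE))
```

Prompts and messages such as "Text not found in dictionary." contain no tokens that look like codes, so the sketch ignores them.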
To look up a multiple-word title, you must query the lookup form of the title, that is,
the same form that the parser would look up (see the procedure below).
Note: If a line contains consecutive words that are marked as phrase
words, the parser retrieves the standard for each word, removes any
punctuation, and looks up the phrase.
Procedure
Example
C:\umd /s parsing.dct
Enter a query, or press <ESC> to exit.
Enter> Chief
Usage: 0
Intl Code(s): USENGLISH
Info Code(s): PRENAME TITLE TITLE_INIT TITLE_TERM PREGEN3 PHRASE_WRD
FIRMMISC
Standard(s) for CHIEF:
- CHIEF
FIRM_MTC, FIRM_STD, PRENAME_MTC, PRENAME_STD,
TITLE_MTC, TITLE_STD
Query a multiple-word firm name
If you want to query a firm name that is also a personal name, such as
Hewlett Packard or Johnson & Johnson, see "Query a firm name that looks
like a personal name" on page 11.
To look up a multiple-word firm name, you must query the lookup form of the
firm name, that is, the same form that the parser would look up:
Procedure
Example
General Motors
Gen. Motors
Gen Motors
Some firms are named after people (for example, Hewlett Packard or Johnson
and Johnson).
To look up this type of firm name, you must query the lookup form of the firm
name, that is, the same form of the name that the parser would look up:
Procedure
Example
4. Remove all firm-terminator words, such as Corporation, Inc, Ltd, and so on.
This is the lookup form of the firm name.
Johnson Johnson
If all of the words in a line are identified as both FIRMNAME and NAME
words, the parser removes noise words and punctuation, then looks to see
whether the name is listed as a firm name. If so, the line is parsed as a firm
name. If not, the line is parsed as a personal name.
Query the lookup form of the firm name:
C:\umd /s parsing.dct
Using a Parsing Dictionary.
Enter a query, or press <ESC> to exit.
Enter> Johnson Johnson
Usage: 1
Intl Code(s): USENGLISH
Info Code(s): FIRMNAME
Standard(s) for JOHNSON JOHNSON:
- JOHNSON JOHNSON
FIRM_MTC, FIRM_STD
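The lookup form for this kind of firm name can be approximated in a few lines of Python. This is a rough sketch, not the parser's actual logic. The word lists NOISE_WORDS and FIRM_TERMINATORS are assumptions chosen to reproduce the Johnson & Johnson example, and the function name firm_lookup_form is hypothetical.

```python
import string

# Assumed, illustrative word lists; the parser's real lists are not given here.
NOISE_WORDS = {"and"}
FIRM_TERMINATORS = {"corporation", "corp", "inc", "ltd"}

# Translation table that deletes every ASCII punctuation character.
_STRIP_PUNCT = str.maketrans("", "", string.punctuation)


def firm_lookup_form(raw_name):
    """Approximate the lookup form of a firm name that looks like a personal name."""
    words = raw_name.translate(_STRIP_PUNCT).split()
    kept = [
        w for w in words
        if w.lower() not in NOISE_WORDS and w.lower() not in FIRM_TERMINATORS
    ]
    return " ".join(kept)


if __name__ == "__main__":
    print(firm_lookup_form("Johnson & Johnson, Inc."))
```

Always confirm the result by querying it with UMD Show, because the parser's own noise-word and terminator handling decides what is actually looked up.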
Create a transaction database
The quickest, easiest way to create a transaction database file and its supporting
files is to use the output file feature of UMD Show. (See "UMD Show" on
page 77.)
1. Use UMD Show to query our base parsing dictionary, parsing.dct. Include
the /o option on the command line. Use the file name that you plan to use for
your custom dictionary, but with the extension .trn (for example,
my_parse.trn).
If you plan to use a database program or spreadsheet program to edit the
file, we recommend creating a dBASE3 or ASCII file. If you plan to use a
text editor or word processor to edit the file, we recommend creating a
delimited file. However, be aware that our UMD Views program doesn't
support updating of delimited files.
2. Query a word that is in the dictionary, such as Bob.
3. To save the query to your output file, press Enter. To exit, press Escape.
C:\umd /s parsing.dct /o my_parse.trn /d dBase3
Using a Parsing Dictionary
Enter a query, or press <ESC> to exit.
Enter> Bob
Usage: 99
Intl Code(s): USENGLISH
Info Code(s): NAME_STRONG_FN NAMEGEN1
Standard(s) for BOB:
- ROBERT
NAME_MTC
Enter a query, or press <Enter> to save, or press <ESC> to exit
Enter>
Previous query appended to C:\my_parse.trn.
Enter a query, or press <ESC> to exit.
Enter> <Esc>
When you create a transaction database as described above, UMD Show creates a
supporting file such as my_parse.def. For ASCII and delimited transaction files,
UMD Show also creates an additional supporting file such as my_parse.fmt or
my_parse.dmt. To open and read the transaction file, UMD requires these files.
If you move the transaction file to a new location, make sure you also move the
corresponding supporting files.
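One way to avoid separating a transaction file from its supporting files is to move them as a set. The Python sketch below is illustrative and not part of UMD. It assumes that the supporting files share the transaction file's base name and use the .def, .fmt, or .dmt extensions mentioned above. The function name move_transaction_set is hypothetical.

```python
import shutil
from pathlib import Path

# Extensions of the supporting files named in this guide.
SUPPORT_EXTS = (".def", ".fmt", ".dmt")


def move_transaction_set(trn_path, dest_dir):
    """Move a transaction file and any same-named supporting files.

    Returns the names of the files that were moved.
    """
    trn = Path(trn_path)
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    moved = []
    candidates = [trn] + [trn.with_suffix(ext) for ext in SUPPORT_EXTS]
    for candidate in candidates:
        if candidate.exists():
            shutil.move(str(candidate), str(dest / candidate.name))
            moved.append(candidate.name)
    return moved
```

For example, move_transaction_set("my_parse.trn", "archive") would move my_parse.trn together with my_parse.def and, if present, my_parse.fmt or my_parse.dmt.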
Field
Data to enter
Action
Choose one:
Code
Description
Primary
Type the word or phrase that you want to add or whose entry you want to
modify. Maximum of 54 characters; not case-sensitive. Do not include
any punctuation.
For phrases and multiple-word firm names, use the lookup form. To get
the lookup form, see "Query a title phrase" on page 9 and "Query a firm
name that looks like a personal name" on page 11.
Secondary
Intl
Type USENGLISH.
Info
Type all information codes that apply, if not already in the dictionary. Put
one space (no punctuation) between codes.
For a list of information codes, see "Information codes" on page 79.
Stdtype
Type all standard-type codes that apply, if not already in the dictionary.
Put one space (no punctuation) between codes.
For a list of standard-type codes, see "Standard-type codes" on page 79.
Required fields
For each action, you must provide certain information. In the table below, a check
mark (✓) means that you must provide information for that field.
Type of change
Action
Delete a standard
Primary
Secondary
Usage
Intl
Information
Stdtype
Note 1
Note 2
Note 3
Must be blank
Note 4
Must be blank
Note 5
Note 6
1) Required only if the necessary Info code is not already in the existing
dictionary entry. For example, if you add the Stdtype code TITLE_MTC, you
must specify the Info code TITLE unless one of those Info codes is already
specified in the existing dictionary entry.
2) Required if the Info field contains anything besides PHRASE_WRD.
3) Must include all of the Info codes listed in the existing dictionary entry.
4) Must include all of the Stdtype codes listed in the existing dictionary entry.
5) This field is ignored. UMD automatically deletes dependent standard types.
6) PREGENx or NAMEGENx only. The existing dictionary entry must contain a
corresponding gender code. For example, if the existing entry contains the gender
code NAMEGEN1, you may change it to any other NAMEGENx code.
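Before running UMD Build, you may find it useful to pre-check records yourself. The Python sketch below is a hypothetical helper, not UMD's own validation. It checks only the Primary rules described earlier (54 characters maximum, no punctuation) and applies note 2 to every record, which may be stricter than UMD itself. The record is assumed to be a dict keyed by the field names used in this chapter.

```python
import string

MAX_PRIMARY_LEN = 54  # limit stated for the Primary field


def check_record(record):
    """Return a list of problems found in one transaction record (a dict)."""
    problems = []
    primary = record.get("Primary", "")
    if not primary:
        problems.append("Primary is empty")
    if len(primary) > MAX_PRIMARY_LEN:
        problems.append("Primary exceeds 54 characters")
    if any(ch in string.punctuation for ch in primary):
        problems.append("Primary contains punctuation")

    info_codes = record.get("Info", "").split()
    stdtypes = record.get("Stdtype", "").split()
    # Note 2: Stdtype is required if Info contains anything besides PHRASE_WRD.
    # Assumption: applied here to every record, regardless of the action.
    if any(code != "PHRASE_WRD" for code in info_codes) and not stdtypes:
        problems.append("Stdtype is required for Info codes other than PHRASE_WRD")
    return problems


if __name__ == "__main__":
    sample = {
        "Action": "N", "Primary": "ABSFC", "Secondary": "ABSFC", "Usage": "0",
        "Intl": "USENGLISH", "Info": "HONPOST",
        "Stdtype": "HONPOST_STD HONPOST_MTC",
    }
    print(check_record(sample))
```

An empty list means that the sketch found nothing to flag; UMD Build's own validation and error log remain the authority.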
To build your dictionary, run UMD Build. The easiest way to convey your
instructions to UMD is through the UMD configuration file.
1. Open a copy of the configuration file umd.cfg.
2. Type entries for the UMD Build parameters. Specify our dictionary,
parsing.dct, as your Source Dictionary.
For descriptions of the configuration-file parameters, see "UMD
configuration file, umd.cfg" on page 75.
# UMD Show
Output File Name (path & file name) .... =
Output File Type (See NOTE) ............ =
#
# UMD Build
Dictionary Type (See NOTE) ............. = Parsing
Source Dictionary (path & dct) ......... = c:\pathname\parsing.dct
Transaction File Name (path & file name) = c:\pathname\my_parse.trn
Target Dictionary (path & dct) ......... = c:\pathname\my_parse.dct
Verify Input File Only (YES/NO) ........ = NO
Error Message Log File (path & name) ... = c:\pathname\my_parse.log
Work Directory (path) .................. = r:\temp
Before UMD builds your custom dictionary, it checks to make sure the entries in
your transaction file are valid. If a validation error or warning occurs, look at the
error log file. If an error occurred, fix your transaction file, then run UMD Build
again.
If the transaction file is free of errors, UMD builds your custom dictionary.
During the build process, UMD takes the source dictionary, makes the changes
and additions specified in your transaction file, and creates your custom
dictionary.
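To double-check a configuration file before a build, you could read its settings with a small script. The Python sketch below is illustrative only. It assumes that each parameter sits on one line in the form label, optional parenthetical hint, dot leader, equals sign, value, and that lines starting with # are comments. The function name parse_umd_cfg is hypothetical.

```python
import re

# label, optional "(hint)", optional dot leader, "=", then the value
LINE = re.compile(
    r"^(?P<name>[^(=.]+?)\s*(?:\([^)]*\))?\s*\.*\s*=\s*(?P<value>.*)$"
)

SAMPLE = r"""# UMD Show
Output File Name (path & file name) .... =
#
# UMD Build
Dictionary Type (See NOTE) ............. = Parsing
Source Dictionary (path & dct) ......... = c:\pathname\parsing.dct
Transaction File Name (path & file name) = c:\pathname\my_parse.trn
Target Dictionary (path & dct) ......... = c:\pathname\my_parse.dct
Verify Input File Only (YES/NO) ........ = NO
Work Directory (path) .................. = r:\temp
"""


def parse_umd_cfg(text):
    """Return a dict that maps each parameter label to its value."""
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        match = LINE.match(line)
        if match:
            settings[match.group("name").strip()] = match.group("value").strip()
    return settings


if __name__ == "__main__":
    for name, value in parse_umd_cfg(SAMPLE).items():
        print(f"{name}: {value!r}")
```

A quick check such as confirming that the Source Dictionary value ends in parsing.dct can catch a build that would otherwise start from the wrong source.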
If you want to update your custom dictionary, put your changes and additions in
your existing transaction file. Your custom dictionary will be much easier to
manage if you accumulate all your entries in one transaction file, rather than
scattering them among many files.
When you rebuild your parsing dictionary, always use our base parsing
dictionary, parsing.dct, as the source dictionary.
To add the word to the dictionary, you would add the following record to your
transaction file:
Field name   Transaction entry
Action       N
Primary      ABSFC
Secondary    ABSFC
Usage        0
Intl         USENGLISH
Info         HONPOST
Stdtype      HONPOST_STD HONPOST_MTC
Capitalization
As you add words to the parsing dictionary, make a note of any words that have
unusual mixed-case capitalization. To get the correct mixed-case capitalization,
you must also add these words to your custom capitalization dictionary.
For example, if you add ABSFC to the parsing dictionary, you should also add it
to your custom capitalization dictionary. Otherwise, the mixed-casing will be
Absfc rather than ABSFC.
1. Query the lookup form of the phrase (see "Query a title phrase" on page 9).
For example, to add the phrase Vice President of Marketing to the dictionary,
use the lookup form Vice Pres of Mktg.
2. If the phrase is not in the dictionary, create a new entry in your transaction
file. Use the lookup form as the primary and secondary:
Field name   Transaction entry
Action       N
Primary      Vice Pres of Mktg
Secondary    Vice Pres of Mktg
Usage        0
Intl         USENGLISH
Info         TITLE
Stdtype      TITLE_STD TITLE_MTC
3. Query each word in the original phrase (for example, Vice, President, of, and
Marketing). Make sure each word meets the following requirements:
For our example, the word President is in the dictionary but is not identified
as a phrase word, so we need to mark it as a PHRASE_WRD. We also need
to mark the word of as a TITLE word and a PHRASE_WRD.
Field name   Entry for President   Entry for of
Action       A                     A
Primary      President             of
Secondary    Pres.                 of
Usage        (blank)               (blank)
Intl         (blank)               (blank)
Info         PHRASE_WRD            PHRASE_WRD TITLE
Stdtype      (blank)               TITLE_MTC TITLE_STD
For best results, perform steps 3 and 4 for variant spellings and
abbreviations of each word. For our example, we would check to make
sure that Pres and Mktg are marked as phrase words. This enables the
parser to recognize variant raw forms of the phrase (such as Vice Pres. of
Marketing, Vice President of Mktg., and Vice Pres. of Mktg.) in addition to
the original phrase Vice President of Marketing.
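The way variant raw forms collapse to one lookup form can be sketched in Python. This is a simplified illustration, not the parser's implementation. The STANDARDS mapping is an assumption that stands in for the standards you would see by querying each word, and the function name title_lookup_form is hypothetical.

```python
import string

# Assumed standards for the example words; query the dictionary for real ones.
STANDARDS = {
    "PRESIDENT": "Pres.",
    "PRES": "Pres.",
    "MARKETING": "Mktg.",
    "MKTG": "Mktg.",
}

# Translation table that deletes every ASCII punctuation character.
_STRIP_PUNCT = str.maketrans("", "", string.punctuation)


def title_lookup_form(phrase):
    """Replace each word with its standard, then remove punctuation."""
    words = []
    for raw in phrase.split():
        key = raw.translate(_STRIP_PUNCT).upper()
        standard = STANDARDS.get(key, raw)  # words without a standard stay as typed
        words.append(standard.translate(_STRIP_PUNCT))
    return " ".join(words)


if __name__ == "__main__":
    for variant in (
        "Vice President of Marketing",
        "Vice Pres. of Marketing",
        "Vice President of Mktg.",
        "Vice Pres. of Mktg.",
    ):
        print(variant, "->", title_lookup_form(variant))
```

Under these assumptions, all four variants reduce to Vice Pres of Mktg, which is why a single dictionary entry for the lookup form can cover them.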
To add a multiple-word firm name to the dictionary, you must do two things:
Make sure at least one of the words is in the dictionary and has the
FIRMNAME information code.
Enter the lookup form of the firm name so that the parser will find it (see
"Step 1: Query the dictionary" on page 9). Otherwise, the entry will have no
effect on parsing results.
To add a multi-word entry to the dictionary:
1. If the firm name looks like a personal name (for example, Hewlett Packard,
Merrill Lynch, or Johnson & Johnson), see "Query a firm name that looks like a
personal name" on page 11.
2. In your transaction file, create a new entry for the lookup form of the firm
name (see "Query a multiple-word firm name" on page 10).
Field        Transaction entry
Action       N
Primary      Emery Worldwide
Secondary    Emery Worldwide
Usage        0
Intl         USENGLISH
Info         FIRMNAME
Stdtype      FIRM_STD FIRM_MTC
3. Make sure that both of the words in the firm name (for example, both Emery
and Worldwide) meet the following requirements:
Most often, you'll need to mark one of the words as a FIRMNAME word, as
shown here.
Field name   Transaction entry
Action       A
Primary      Emery
Secondary    Emery
Usage        (blank)
Intl         (blank)
Info         FIRMNAME
Stdtype      FIRM_MTC FIRM_STD
If a firm name looks like a personal name,1 you must do two things:
Make sure each word is in the dictionary and has both the NAME and
FIRMNAME information codes.
Create an entry for the lookup form of the firm name.
1. Query the lookup form of the firm name (see "Query a firm name that
looks like a personal name" on page 11).
2. If the firm name is not in the dictionary, create a new entry in your
transaction file. Use the lookup form as the primary and secondary.
Field name   Transaction entry
Action       N
Primary      Robert W. Baird
Secondary    Robert W. Baird
Usage        0
Intl         USENGLISH
Info         FIRMNAME
Stdtype      FIRM_MTC FIRM_STD
3. Query each word. Make sure it is in the dictionary and is identified as both a
NAME and a FIRMNAME. If not, add the word (or modify it) by putting an
entry in your transaction file. In our example, Robert W Baird, all three words
are in the dictionary, but none has the FIRMNAME information code.
For each word, we would put an entry in the transaction file to add the
FIRMNAME information code, as shown here for Robert.
Field name   Transaction entry
Action       A
Primary      Robert
Secondary    Robert
Usage        (blank)
Intl         (blank)
Info         FIRMNAME
Stdtype      FIRM_MTC FIRM_STD
1. To the parser, a line looks like a personal name if all of the words in the line are marked as NAME words. For
example, Check N Go looks like a personal name because the words Check, N, and Go are all NAME words.
In your custom dictionary, you could specify that PsyD is also an honorary
postname (Doctor of Psychology). To do this, modify the existing entry to add the
honorary-postname codes:
Field name   Transaction entry
Action       A
Primary      PsyD
Secondary    PsyD
Usage        (blank)
Intl         (blank)
Info         HONPOST
Stdtype      HONPOST_STD HONPOST_MTC
Notice that when you add a new information code, you must also specify at least
one standard for that type of information. In this case, we specified PsyD as the
standard and match standard for honorary postnames.
Field name   First entry   Second entry
Action       A             D
Primary      Engineer      Engineer
Secondary    Eng.          Engineer
Usage        (blank)       (blank)
Intl         (blank)       (blank)
Info         (blank)       (blank)
Stdtype      TITLE_STD     TITLE_STD
Before looking for an acronym, the parser removes all punctuation and noise
words and gets the first appropriate match standard for each word. You must use
the same phrase that the parser will actually look up; otherwise, the parser won't
find your entry and won't generate the acronym.
To add an acronym
Procedure
Example
4. Query each remaining word. Get the first appropriate match standard for each. For example, if you
are adding a firm match standard, get the first
FIRM_MTC.1
1. If the word is not in the dictionary, create a new entry for the word (see page 17). If the word is in the dictionary but
does not list an appropriate match standard, create an entry to add the appropriate information code and match-standard
type (see pages 23 and 24). For example, for the word Residential we would add the information code HONPOST and the
standard-type code HONPOST_MTC.
To add the phrase to the dictionary, put an entry in your transaction file. Use the
lookup form of the phrase as the primary, and use the acronym itself as the
secondary:
Field name   Transaction entry
Action       N
Primary      Cert Residential Specialist
Secondary    CRS
Usage        0
Intl         USENGLISH
Info         HONPOST
Stdtype      HONPOST_ACR
[Figure: the name Al and related names (Allen, Alfredo, Alex, Alphonso, Alonzo) linked to the match standards Albert, Alan, Alfred, Alexander, Alphonse, and Almon.]
For the name Al, the match standards are Albert, Alan, Alfred, Alexander,
Alphonse, and Almon.
For the name Alberto, the match standard is Albert. (Likewise, for Allen the
match standard is Alan; for Alfredo, Alfred; and so on.)
If two different names return the same match standard, you can use your
matching software to do multiway comparisons and find a match. For example,
since Alberto and Al both return Albert as a match standard, your matching
software could match Alberto Smith to Al Smith.
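The comparison described above amounts to checking whether two names share at least one match standard. The Python sketch below illustrates the idea and does not reflect how any particular matching product is implemented. The MATCH_STANDARDS table contains only the examples from this section, and treating an unlisted name as its own match standard is an assumption.

```python
# Match standards taken from the examples in this section.
MATCH_STANDARDS = {
    "AL": {"ALBERT", "ALAN", "ALFRED", "ALEXANDER", "ALPHONSE", "ALMON"},
    "ALBERTO": {"ALBERT"},
    "ALLEN": {"ALAN"},
    "ALFREDO": {"ALFRED"},
    "ALBERT": {"ALBERT"},
    "ALAN": {"ALAN"},
}


def standards_for(name):
    """Return the match standards for a name.

    Assumption: a name that is not listed acts as its own match standard.
    """
    key = name.upper()
    return MATCH_STANDARDS.get(key, {key})


def first_names_match(name_a, name_b):
    """Two names are candidates for a match if they share a match standard."""
    return bool(standards_for(name_a) & standards_for(name_b))


if __name__ == "__main__":
    print(first_names_match("Alberto", "Al"))     # share ALBERT
    print(first_names_match("Allen", "Alfredo"))  # no shared standard
```

Here Alberto and Al match through ALBERT, while Allen and Alfredo do not match because their standards (ALAN and ALFRED) differ.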
Here are partial dictionary entries for the name Al and its direct match standards.
Primary      Standard
AL           ALBERT, ALAN, ALFRED, ALEXANDER, ALPHONSE, ALMON
ALBERT       ALBERT
ALAN         ALAN
ALFRED       ALFRED
ALEXANDER    ALEXANDER
ALPHONSE     ALPHONSE
ALMON        ALMON
Notice that each match standard has its own entry, and that in that entry, the
standard is the same as the primary.
Work with match standards
To use a word as a match standard, it should have its own entry in the dictionary
(or have its own entry in the transaction file).2 In that entry, the word must be a
match standard of itself; in other words, the match standard must be the same as
the query word.
2. Technically, you could also use a word as a match standard if that word does not have an entry in the dictionary. For
example, you could use Michelangelo as a match standard because Michelangelo is not in the dictionary. In practice,
however, if you use a word as a match standard, you'll probably also want that word to have its own entry in the dictionary,
so we make that assumption in our guidelines.
For example, you could use the word Dr as a match standard because it is in the
dictionary and has itself, Dr, as a match standard:
Enter a query, or press <ESC> to exit.
Enter> Dr
Usage: 0
Intl Code(s): USENGLISH
Info Code(s): PRENAME_ALONE HONPOST PREGEN3 SUFFIX
Standard(s) for DR:
- DR. HONPOST_MTC, HONPOST_STD, PRENAME_MTC, PRENAME_STD
- DR ADDRESS_STD
Field name   Transaction entry
Action       A
Primary      Doc
Secondary    DR.
Usage        (blank)
Intl         (blank)
Info         PRENAME PREGEN3
Stdtype      PRENAME_STD PRENAME_MTC
Chapter 2:
Custom capitalization dictionaries
In a custom capitalization dictionary, you can specify the correct casing for a
word in different situations. For example, you can specify that when MCKAYE is
used as a last name, the casing should be McKaye.
Most users find that our capitalization dictionary, pwcap.dct, produces good
mixed-case results. However, if a word is not cased as you would like, you can
enter that word in a custom capitalization dictionary.
For example, if you want the word TECHTEL to be cased as TechTel, you could
add the word TechTel to your custom dictionary.
Most of our products allow you to use two capitalization dictionaries at once, so
we expect that most users will employ our base dictionary as is and build their
own, separate dictionary as an extension. When you use your dictionary, you can
give it priority over ours by specifying your dictionary as Dictionary #2.
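The effect of giving one capitalization dictionary priority over another can be pictured as an ordered lookup. The Python sketch below is a conceptual illustration only. The dictionary contents come from the examples in this chapter, but the data layout, the function name cased, and the fallback to simple first-letter capitalization are all assumptions rather than the product's actual behavior.

```python
# Illustrative stand-ins for the base dictionary and a custom dictionary.
# Each word maps an attribute (such as LASTNAME or EVERY) to a casing.
BASE = {"PHD": {"EVERY": "PhD"}}
CUSTOM = {
    "TECHTEL": {"EVERY": "TechTel"},
    "MCKAYE": {"LASTNAME": "McKaye"},
}


def cased(word, attribute, dictionaries):
    """Return the casing for a word used in a given role.

    Dictionaries are searched in order, highest priority first.
    Assumption: a word not found anywhere gets first-letter capitalization.
    """
    key = word.upper()
    for dictionary in dictionaries:
        forms = dictionary.get(key, {})
        if attribute in forms:
            return forms[attribute]
        if "EVERY" in forms:
            return forms["EVERY"]
    return word.capitalize()


if __name__ == "__main__":
    order = [CUSTOM, BASE]  # custom dictionary takes priority
    print(cased("TECHTEL", "FIRM", order))
    print(cased("MCKAYE", "LASTNAME", order))
    print(cased("PHD", "POSTNAME", order))
```

In this sketch the custom dictionary is listed first, which mirrors the idea of giving your dictionary priority over the base dictionary.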
Create transactions, build your dictionary
For each entry you want to place in your capitalization dictionary, you will create
a record, or transaction, in a database called a transaction file.
After you make all of your entries in the transaction file, you will run the UMD
Build process. UMD Build reads the entries from your transaction file and creates
your custom dictionary.
You can look up words in our dictionary, pwcap.dct, or your custom dictionary.
For example, if you want to see how we capitalize the word PHD, you can query
the dictionary:
c:\umd /s pwcap.dct
Using a Capital Dictionary.
Enter a query, or press <ESC> to exit.
Enter> PHD
PHD is capitalized as follows:
-PhD is used with EVERY occurrence.
For more details about querying a capitalization dictionary, see "Query your
dictionary" on page 33.
Create a transaction database
If you are working with an existing custom dictionary, use the existing
transaction file for that dictionary. Do not create more than one transaction
file for each custom dictionary.
The quickest, easiest way to create a transaction database and its supporting files
is to use the output file feature of UMD Show. (See "UMD Show" on page 77.)
1. Use UMD Show to query our base capitalization dictionary, pwcap.dct.
Include the /o option on the command line. Use the base file name that you
plan to use for your custom dictionary, but with the extension .trn (for
example, my_cap.trn).3
2. Query a word that is in the dictionary, such as PhD.
3. Press Enter to save the query to your output file. Press Esc to exit.
C:\umd /s pwcap.dct /o my_cap.trn /d dBase3
Using a Capital Dictionary
Enter a query, or press <ESC> to exit.
Enter> phd
PHD is capitalized as follows:
-PhD is used with EVERY occurrence
Enter a query, or press <ENTER> to save, or press <ESC> to exit
Enter>
Previous query appended to C:\my_cap.trn.
Enter a query, or press <ESC> to exit.
Enter> <Esc>
UMD Show will create an output database file (for example, my_cap.trn). You
can use this database as your transaction file. For instructions on adding your
entries to the transaction file, see Step 2: Put your entries in the transaction file
on page 31.
Keep supporting files
with transaction file
When you create a transaction database as described above, UMD Show creates a
supporting file such as my_cap.def. If the transaction file is ASCII or delimited
ASCII, UMD Show also creates an additional file such as my_cap.fmt or
my_cap.dmt.
To open and read the transaction file, UMD needs these files. If you move the
transaction file to a new location, make sure you also move the corresponding
supporting files.
3. If you plan to use a database program or spreadsheet program to edit the file, we recommend creating a dBASE3 or
ASCII file. If you plan to use a text editor or word processor to edit the file, we recommend creating a delimited file.
The table below describes what information to put in each field in your
transaction file.

Field       Data to enter
---------   ------------------------------------------------------------
Action      Choose one:
              N   Create a new entry.
              D   Delete the existing entry from the source dictionary.
                  (See footnote 1.)
Primary     Type the word in the preferred casing, 54 characters maximum.
            Type a single word (no spaces). Do not include any punctuation.
Attribute   Specify when this casing should be used. Include all that
            apply, separated by one space (no punctuation):
              PRENAME       Prenames
              FIRSTNAME     First names
              LASTNAME      Last names
              PRELASTNAME   Last-name prefixes
              POSTNAME      Postnames
              TITLE         Job titles
              FIRM          Firm data
              ADDRESS       Address lines (see footnote 2)
              CITY          City names
              STATE         State names
              EVERY         Every occurrence
            You may type the entire word or just the portion shown in bold
            (for example, FIRSTNAME or FIRS).
            If the Action field contains D, you may leave this field blank.
1. This command is used rarely, if ever. If you want to delete an entry from your custom dictionary, simply delete that
record from your transaction file, then rebuild the dictionary. If you don't like the casing for a word in our base dictionary,
pwcap.dct, you don't need to delete the entry from our dictionary. Instead, put the desired casing in your custom
dictionary. When you process data, specify your dictionary as Dictionary #2 so that your entry will override ours.
2. Do not specify the ADDRESS, CITY, or STATE attribute unless the product that uses the dictionary has address-parsing capability.
Sample entries

Primary     Attribute
---------   -----------
dos         PRELASTNAME
McCathie    EVERY
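The field rules in the table above can be checked before you run UMD Build. The following is a minimal sketch in Python, not part of UMD: the function name check_entry and its messages are hypothetical, and it accepts only the full attribute names because the abbreviated forms are not reproduced here.

```python
# Hypothetical pre-check for transaction records; not part of UMD.
VALID_ACTIONS = {"N", "D"}
VALID_ATTRIBUTES = {
    "PRENAME", "FIRSTNAME", "LASTNAME", "PRELASTNAME", "POSTNAME",
    "TITLE", "FIRM", "ADDRESS", "CITY", "STATE", "EVERY",
}
MAX_PRIMARY_LENGTH = 54


def check_entry(action, primary, attribute):
    """Return a list of problems found in one transaction record."""
    problems = []
    if action not in VALID_ACTIONS:
        problems.append("Action must be N or D")
    if not primary or len(primary) > MAX_PRIMARY_LENGTH:
        problems.append("Primary must be 1 to 54 characters")
    if not primary.isalnum():
        # Spaces and punctuation are not allowed in the Primary field.
        problems.append("Primary must be a single word without punctuation")
    attributes = attribute.split()
    if action != "D" and not attributes:
        problems.append("Attribute is required unless Action is D")
    for name in attributes:
        if name not in VALID_ATTRIBUTES:
            problems.append("Unknown attribute: " + name)
    return problems
```

An empty list means the record passes these checks. UMD Build may apply further checks of its own that this sketch does not model.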
To build your dictionary, run UMD Build. The easiest way to convey your
instructions to UMD is through the UMD configuration file.
1. Open a copy of the configuration file umd.cfg.
2. Type your instructions in the UMD Build parameters. For descriptions of the
parameters, see Appendix A.
# UMD Show
Output File Name (path & file name) .... =
Output File Type (See NOTE) ............ =
#
# UMD Build
Dictionary Type (see NOTE) ............. = Capital
Source Dictionary (path & dct) ......... =
Transaction File Name (path & file name) = c:\pathname\my_cap.trn
Target Dictionary (path & dct) ......... = c:\pathname\my_cap.dct
Verify Input File only (YES/NO) ........ = NO
Error Message Log File (path & name) ... = c:\pathname\my_cap.log
Work Directory (path) .................. = r:\temp
...
3. Save the configuration file. We recommend using the same base file name as
your dictionary, but with the extension .cfg (for example, my_cap.cfg).
4. Run UMD with the /cfg option. For example:
umd /cfg my_cap.cfg
During the build process, UMD reads the entries from your transaction file and
creates your custom dictionary.
Tips:
We recommend that you use the same base file name for the
transaction file and custom dictionary, and store both files in the
same location.
We recommend that you accumulate all your custom entries in one
transaction file and build your custom dictionary from the
transaction file only. If you do this, you will not need to specify a
source dictionary when you run UMD Build.
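The naming recommendation in the tips can be sketched as a small helper. This is an illustration only, written in Python: the function build_file_names is hypothetical, and the .log name simply follows the configuration example rather than a stated rule.

```python
from pathlib import PureWindowsPath


def build_file_names(folder, base_name):
    """Return the related file names for one custom dictionary.

    All files share one base name and one folder, as the tips recommend.
    """
    folder = PureWindowsPath(folder)
    return {
        "transaction": str(folder / (base_name + ".trn")),
        "dictionary": str(folder / (base_name + ".dct")),
        "config": str(folder / (base_name + ".cfg")),
        # Assumption: the log file follows the same convention.
        "log": str(folder / (base_name + ".log")),
    }
```

For example, build_file_names("c:\\pathname", "my_cap") yields the four file names used in the sample configuration file.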
To update an existing dictionary, add your new entries to your existing transaction
file. Your custom dictionary will be much easier to manage if you accumulate all
of your entries in one transaction file, rather than scattering them among many
files.
To add a word, create a record for that word in the transaction file. To delete a
word, delete the record for that word from the transaction file. For dBASE3 files,
UMD supports non-destructive delete marking.
After you add your new entries, run UMD Build as instructed on page 32. UMD
will rebuild your custom dictionary based on your updated transaction file.
You may wish to query your custom capitalization dictionary to see whether it
contains a particular word. To query a dictionary, run UMD Show (see UMD
Show on page 77). For example:
umd /s my_cap.dct
If you look up a word that is in the dictionary, UMD displays the preferred casing
and tells you when that casing is used:
c:\umd /s my_cap.dct
Using a Capital Dictionary.
Enter a query, or press <ESC> to exit.
Enter> TECHTEL
TECHTEL is capitalized as follows:
-TechTel is used with FIRM occurrences.
Chapter 3:
User-defined pattern matching (UDPM)
One way you can modify the Data Cleanse transform's behavior to suit your
needs is to edit the pattern file (default name is drludpm.dat). You edit the
pattern file in order to parse with your own data patterns.
Proceed with care. The pattern file controls how incoming data is
parsed, so changing this file changes how items are parsed. If you
aren't careful when you add user-defined patterns, you may receive
unexpected and unwanted results. Make backups of your files in case
you need to revert to previous parsing rules.
Treat this ability as you would treat any feature you add to an application. Before
you integrate any modification into your enterprise, you should put your
modification through a stringent cycle of research, quality assurance, testing,
regression testing, and so on. Don't find out at release time that you've changed
something for the worse. As always, we recommend that you maintain adequate
backups and test your results with the QuickParse utility. For more information
see Check parsing results with QuickParse on page 67.
Overview of UDPM
With the User-Defined Pattern Matching (UDPM) system, you can parse data that
the Data Cleanse transform currently doesn't parse. For example, records in your
database may contain a customer ID number that is unique to your company. With
UDPM, you can locate and parse this number.
The UDPM utility lets you define one or several data patterns that are specific
to the types of data you want to parse, such as part numbers, customer account
numbers, employee numbers, product numbers, or any other specific pattern of
data that you need to parse.
Parsing patterns
The Data Cleanse transform parses UDPM patterns that are either by themselves
on the input line or surrounded by noise text. For example, input text could be
"Here is the part number 123AB." or just "123AB". If you have defined a pattern
that fits this part number 123AB, then the Data Cleanse transform parses 123AB.
How does Data Cleanse do it? You edit Data Cleanse's UDPM Pattern File
(drludpm.dat).
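The idea of finding a pattern alone on a line or inside noise text can be tried outside the product with any regular expression engine. The sketch below uses Python's re module, which is not the engine the Data Cleanse transform uses. The part-number pattern (three digits followed by two capital letters) is an assumption made for this illustration.

```python
import re

# Hypothetical part-number pattern: three digits, then two capital letters.
PART_NUMBER = re.compile(r"\b[0-9]{3}[A-Z]{2}\b")


def find_part_number(line):
    """Return the first part number found in the line, or None."""
    match = PART_NUMBER.search(line)
    return match.group(0) if match else None
```

Both find_part_number("Here is the part number 123AB.") and find_part_number("123AB") return the same value, because search() looks for the pattern anywhere in the line.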
Pattern file
The user-defined patterns are stored in a pattern file. The pattern file is a plain
text file that you can edit in any text-editing program. This pattern file consists of
a definition section and a rule section. The definition section is where you can
define subcomponents using a syntax that uses Perl (PCRE) regular expressions.
You can then combine these subcomponents with other elements of valid regular
expression in the rule section.
Regular expressions are powerful and offer a flexible, widely used syntax.
Combined with the Data Cleanse transform's rule language, they let you create
definitions and rules to parse your own patterns of data. For more information
about creating regular expressions, see Introduction to regular expressions on
page 39.
Definition section
This section is used to define the patterns for the subcomponents of the data
that you want to parse.
Month = 0?[1-9]|1[0-2];
Separator = [/-];
Year = [0-9]{2,2};
!end_def
Rule Section
This section is used to set the rule for how to parse your user-defined
components, based on the subcomponent definitions you create.
ILINE1:UDPM1:Date=({Month}){Separator}({Year});
Ending the file: You must end the file with a hard return. If you don't
insert a carriage return/linefeed at the end of the file, the file won't be
read correctly.
Before you can begin creating user-defined patterns, it's important that you
understand what makes up these two sections of the file. See the following:
Definition section of a user-defined pattern on page 38
Rules section of a user-defined pattern on page 38
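To see how the two sections work together, the sketch below expands the macros from the sample definition section into the sample rule and tests the result. It is written in Python and only approximates the behavior. The expand function is hypothetical, and wrapping each macro in a non-capturing group is an assumption about how the substitution is best emulated, not a description of the product's internals.

```python
import re

# Definitions copied from the sample definition section.
DEFINITIONS = {
    "Month": "0?[1-9]|1[0-2]",
    "Separator": "[/-]",
    "Year": "[0-9]{2,2}",
}
# Pattern part of the sample rule (the text after the equals sign).
RULE_PATTERN = "({Month}){Separator}({Year})"


def expand(pattern, definitions):
    """Replace each {Name} reference with its definition."""
    def substitute(match):
        # Assumption: group the macro so that "|" inside it stays local.
        return "(?:" + definitions[match.group(1)] + ")"
    # Macro names start with a letter, so quantifiers like {2,2} are skipped.
    return re.sub(r"\{([A-Za-z_][A-Za-z0-9_]*)\}", substitute, pattern)


DATE = re.compile(expand(RULE_PATTERN, DEFINITIONS))
```

With these definitions, DATE.fullmatch("12/04") succeeds and captures the month and year in the two parenthesized groups, while "13/04" is rejected because 13 does not fit the Month definition.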
Definition section of a
user-defined pattern
In the definition section, you may define the subcomponents that will make up the
data pattern that you want to parse. The definition you create is a combination of
a simple language specific to the Data Cleanse transform and Perl (PCRE) regular
expressions. The diagram and table below explain how the definition section of a
user-defined pattern is set up.
Month = 0?[1-9]|1[0-2];
Separator = [/-];
Year = [0-9]{2,2};
!end_def

Element              Description
------------------   ----------------------------------------------------
Macro name           The name of the subcomponent, which appears before
                     the equals sign (for example, Month).
Regular expression   Following the equals sign after the subcomponent
                     name, you use a simple regular expression to define
                     what the subcomponent will equal.
End definition       The !end_def command indicates the end of the pattern
                     definition. This is a required element.
In the rule section, you must explain the rule or rules for how to parse the
subcomponents that you defined in the definition section. The diagram and table
below explain how the rule section of a user-defined pattern is set up.
ILINE1:UDPM1:Date=({Month}){Separator}({Year});

Element             Description
-----------------   -----------------------------------------------------
Input and           Each rule begins with the input field and output field,
output fields       separated by colons.
Rule name           The name of the rule, which follows the output field
                    and comes before the equals sign (for example, Date).
Macros              To add a macro to a rule, you must use the macro name
                    as designated in the definition section. The macro name
                    must be surrounded by curly brackets { }. You can use
                    any number of macros per rule.
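The layout of a rule line (input field, output field, rule name, then the pattern) can be sketched as a small parser. This Python function is hypothetical and handles only the single-line form shown in this guide.

```python
def parse_rule(line):
    """Split 'INPUT:OUTPUT:name=pattern;' into its four parts."""
    text = line.strip()
    if not text.endswith(";"):
        raise ValueError("a rule must end with a semicolon")
    # Split on the first "=" only; the pattern itself may contain colons.
    head, pattern = text[:-1].split("=", 1)
    input_field, output_field, rule_name = head.split(":")
    return {
        "input_field": input_field,
        "output_field": output_field,
        "rule_name": rule_name,
        "pattern": pattern,
    }
```

Applied to the sample rule, the parser returns ILINE1 as the input field, UDPM1 as the output field, Date as the rule name, and ({Month}){Separator}({Year}) as the pattern.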
Example
Description
Precise
[Hh]orse
Fuzzy
[A-Z][[:digit:]]{5}
PCRE (Perl)
Keep in mind that there are several varieties, or families, of regular expressions.
When you refer to additional documentation on functions and capabilities of
regular expressions, be sure you're researching PCRE (Perl) regular expressions.
([0-9]{4}[[:space:]]){4}
Input data:
Operators
Description
Example
Logical OR.
()
(0?[1-9])|(1[0-2])
{}
Using metacharacters
literally
Character Classes
You may also see or use a number of character class names inside brackets. These
class names must be surrounded by colons and brackets, as well, so the
expression will look like:
[[:alnum:]]
or
[[:digit:]]
Name
Description
alnum
digit
Digits.
punct
Punctuation characters.
alpha
Alphabetic characters.
graph
space
blank
lower
upper
cntrl
Control characters.
Non-blank (not control characters and the like, but includes spaces)
xdigit
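If you want to try an expression from this guide in a tool that does not understand class names such as [:digit:], you can translate them first. Python's re module is one such tool. The sketch below is a hypothetical helper that covers only a few classes and only ASCII characters; Python's \s is a close but not exact stand-in for the space class.

```python
import re

# Python equivalents for a few class names (ASCII only, approximate).
POSIX_CLASSES = {
    "alnum": "A-Za-z0-9",
    "alpha": "A-Za-z",
    "digit": "0-9",
    "lower": "a-z",
    "upper": "A-Z",
    "space": r"\s",
    "xdigit": "0-9A-Fa-f",
}


def posix_to_python(pattern):
    """Rewrite [:name:] inside a bracket list into an explicit range."""
    def substitute(match):
        return POSIX_CLASSES[match.group(1)]
    return re.sub(r"\[:([a-z]+):\]", substitute, pattern)
```

For example, posix_to_python("[A-Z][[:digit:]]{5}") returns "[A-Z][0-9]{5}", which Python can then compile and test.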
Brackets enclose lists, any of whose members can match a single input character.
For example:
This string:    Will match:
[aei]           only an a, e, or i character
[^aei]          any single character other than a, e, or i
Negation. If the first character of the list is ^ then the list only matches what is
not in the character set.
Range. A hyphen specifies a range of characters as defined by their collating
sequence. For example:
[a-d]           matches a, or b, or c, or d
[2-4]           matches 2, or 3, or 4
[A-Z]           matches any upper-case letter
Grouping characters
with parentheses (...)
This string:    Will match:
(aei)           the exact sequence aei
(mn|xy)         a string of either mn or xy
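The bracket, negation, range, and grouping examples above can be verified with a few lines of Python. The helper name matches is hypothetical. The behavior shown is standard for bracket lists and groups, but the code runs in Python's re module rather than in the Data Cleanse transform.

```python
import re


def matches(pattern, text):
    """Return True when the whole text fits the pattern."""
    return re.fullmatch(pattern, text) is not None


# (pattern, text, whether the text is expected to match)
EXAMPLES = [
    ("[aei]", "e", True),       # list: one of a, e, i
    ("[aei]", "o", False),
    ("[^aei]", "o", True),      # negation: anything except a, e, i
    ("[a-d]", "c", True),       # range
    ("[2-4]", "5", False),
    ("(aei)", "aei", True),     # group: the whole sequence
    ("(aei)", "a", False),
    ("(mn|xy)", "xy", True),    # group with alternatives
]
```

Each tuple pairs a pattern with a test string and the expected outcome, so you can extend the list with your own cases.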
Write several examples of the data you want to parse. For example:
AN3463599
AL4659800
AB342699
Step 2:
Identify any literal elements
Decide if there are any characters (either alphabetical or numeric) that must be
seen without variance, for the input to be valid. If so, put a pair of brackets around
each. For example, you may decide that the first letter must always be A, but
that all the others may vary to some extent.
AN3463599
AL4659800
AB342699
First letter: [A]
Step 3:
Define the ranges of the
variable elements
For each of the variable elements, decide the range of the optional data. Here,
we've decided that the second letter can be any of four possible letters: B, C, L, or
N. In addition, the last two digits must be either 99 or 00 (representing a year).
AN3463599
AL4659800
AB342699
First two letters: [A][BCLN]
Last two digits: (99|00)
Step 4:
Define acceptable variances
in quantity
Define any acceptable differences in the quantity of the elements. In this example,
it's acceptable for the input data to have either 4 or 5 digits after the first two
letters, and before the last two digits. Use quantity indicators to show that
acceptable range:
AN3463599
AL4659800
AB342699
[A][BCLN][0-9]{4,5}(99|00)
Step 5:
Expand the pattern for
acceptable deviances
Finally, think about how your data was input, or what its source was, in order to
predict what sort of deviances you might find in the format of the data that would
not precisely fit your pattern so far, but which would still reflect useful data (that is,
not exactly right, but close enough to be usefully parsed).
In this example, you may note that some of your record data starts with small
letters instead of capital letters, and that in some cases, the data was input with a
space between the two letters and the numbers, like these:
AN3463599
AL4659800
Ab 3452399
an 623400
[Aa][BCLNbcln][[:space:]]?[0-9]{4,5}(99|00)
Try out your expression with QuickParse or by running the job on a small portion
of your data. If necessary, adjust your expression to suit the results.
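Before you run QuickParse, you can also sanity-check the expression built in the steps above against your sample values. The sketch below does this in Python. It is an approximation: the [[:space:]] class is written as \s because Python's re module does not support that class name, and the names ACCOUNT and accepted are hypothetical.

```python
import re

# The expression from Step 5, with [[:space:]] written as \s for Python.
ACCOUNT = re.compile(r"[Aa][BCLNbcln]\s?[0-9]{4,5}(99|00)")

SAMPLES = [
    "AN3463599",   # expected to pass
    "AL4659800",   # expected to pass
    "Ab 3452399",  # lowercase second letter, space before the digits
    "an 623400",   # lowercase letters, four digits before the year
    "XN3463599",   # wrong first letter
    "AN3463598",   # does not end in 99 or 00
]


def accepted(samples):
    """Return the samples that fit the whole expression."""
    return [sample for sample in samples if ACCOUNT.fullmatch(sample)]
```

Calling accepted(SAMPLES) keeps the first four values and drops the last two, which is the behavior the five steps set out to achieve.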
Literal Characters
(alpha, numeric, &)
To specify a literal character, just include it in the expression. You can use any of
the extended ASCII characters, like A through Z, a through z, 0 through 9, and
the specialized characters, like @, #, $, and so on.
Note: Some of these characters are Regular Expression MetaCharacters,
and must be treated as described next.
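To illustrate why metacharacters need special treatment, the sketch below compares an unescaped period with an escaped one. It uses Python's re.escape helper, which exists only in Python. In a pattern file you would instead place a backslash before the metacharacter yourself, which is the standard PCRE way to use it literally. The function name literal_pattern is hypothetical.

```python
import re


def literal_pattern(text):
    """Return a pattern that matches the text exactly as typed."""
    return re.escape(text)


# Unescaped, the period matches any character, so "5x00" is accepted.
LOOSE = re.compile(r"5.00")
# Escaped, the dollar sign and period must appear literally.
EXACT = re.compile(literal_pattern("$5.00"))
```

Here LOOSE.fullmatch("5x00") succeeds, while EXACT.fullmatch("$5x00") does not.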
Alternate expressions
You can search for more than one variation of a pattern. For example, here's how
you could search for data that fits the two patterns of Wisconsin license plates:
Through 1999, these were 3 letters followed by 3 numbers (ABC123). Then,
starting in 2000, the state switched to a pattern of 3 numbers followed by 3 letters
(123ABC).
In the pattern
matching file:
1. Set up one line to match the early numbers (for example, ABC123).
2. Add a second line to match the late numbers.
3. End the definition section.
4. Make a rule that accepts either expression.
1) early=[A-Z]{3}[[:space:]]?[[:digit:]]{3};
2) late=[[:digit:]]{3}[[:space:]]?[A-Z]{3};
3) !end_def
4) IUDPM1:L_UDPM1:wis_plate=({early}|{late});
Implications for
output
Multiple rules
You can also search for more than just one pattern. For example, suppose you also
want to search for data that matches the pattern of a Wisconsin auto title number,
which is: 2 year digits, 3 digits (0-9), 1 letter (A-Z), 4 digits (0-9), a hyphen, and
finally 1 digit (0-9). An example of a Wisconsin auto title is 95172L1031-0.
In the pattern
matching file:
1. Keep the expressions for the license plates, and define another for titles.
2. Add another rule for the title expression.
early=[A-Z]{3}[[:space:]]?[[:digit:]]{3};
late=[[:digit:]]{3}[[:space:]]?[A-Z]{3};
title=[0-9]{5}[L][13][0][[:digit:]]{2}[-][0-9];
!end_def
IUDPM1:UDPM1:wis_plate=({early}|{late});
IUDPM1:UDPM1:wis_title=({title});
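The combination of alternate expressions and multiple rules can be sketched as a list of named patterns that are tried in turn. The Python code below is an illustration only. The classify function is hypothetical, and the title pattern here follows the prose description (2 + 3 digits, a letter, 4 digits, a hyphen, 1 digit) rather than the narrower title definition shown in the file above.

```python
import re

# Each entry pairs a rule name with a Python version of its pattern.
RULES = [
    # early (ABC123) or late (123ABC), with an optional space.
    ("wis_plate", re.compile(r"[A-Z]{3}\s?[0-9]{3}|[0-9]{3}\s?[A-Z]{3}")),
    # Assumption: written from the prose description of a title number.
    ("wis_title", re.compile(r"[0-9]{2}[0-9]{3}[A-Z][0-9]{4}-[0-9]")),
]


def classify(value):
    """Return the name of the first rule that fits the whole value."""
    for name, pattern in RULES:
        if pattern.fullmatch(value):
            return name
    return None
```

With this setup, "ABC123" and "123ABC" both classify as wis_plate, and "95172L1031-0" classifies as wis_title.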
Firm                    Contact          Payment Type_Number
---------------------   --------------   --------------------
Imaginary Industries    Joe Edwards      Ppo123-456
Fake, Inc.              Mary Peterson    c1234-5678-9123-4567
Before you can set up the definition section of your pattern file, you have to
decide what subcomponents you might look for in the data you want to parse. You
don't need to worry about the Firm field or Contact field, because the application
can already parse those types of data. You need only worry about the Payment
Type_Number field, which can be composed of three main subcomponents:
Payment type code
Purchase order number
Credit card number
Step 2:
Define the
subcomponents
Now that you've identified the subcomponents that you need, you can go about
defining them. The first subcomponent is the payment type abbreviation. This
code is always going to be either C for credit card or P for purchase order.
However, as shown in the example table, it may be entered as either lowercase or
uppercase.
To edit the pattern file, open it in any text editing program. We recommend you
make a backup copy of the drludpm.dat file, in case you want to revert to the
original file later.
Payment type
subcomponent
As with all definitions for user-defined patterns, you need to name it first, and
then use regular expressions to look for the pattern. The definition for this
subcomponent should look like this:
payment_type=[p|P]|[c|C];
With this definition, we are telling the application to look for a P or a C that can be
uppercase or lowercase.
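One detail of the payment type definition is worth testing. Inside a bracket list, the pipe character is an ordinary character and not an "or" operator, so the definition as written also accepts a literal | in the data. The Python sketch below shows the difference; the tighter form [pPcC] is offered as an alternative to consider, not as the product's documented recommendation.

```python
import re

# The definition as written in this guide.
AS_WRITTEN = re.compile(r"[p|P]|[c|C]")
# A tighter equivalent that lists only the four letters.
TIGHTER = re.compile(r"[pPcC]")


def accepts(pattern, text):
    """Return True when the whole text fits the pattern."""
    return pattern.fullmatch(text) is not None
```

Both patterns accept p, P, c, and C. Only the first also accepts the | character itself.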
Purchase order
subcomponent
Credit card
subcomponent
The last subcomponent is a credit card number, in the typical format of
xxxx-xxxx-xxxx-xxxx, where x is any digit from 0 through 9. You also want to allow
for a hyphen or space between the groups of digits, though this hyphen or space
may not be there. The definition would look like this:
credit_card=[0-9][0-9][0-9][0-9][ -]?[0-9][0-9][0-9][0-9][ -]?[0-9]
[0-9][0-9][0-9][ -]?[0-9][0-9][0-9][0-9];
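The credit card definition repeats the same four-digit group several times. A quantifier can express the same thing more compactly, as the Python sketch below suggests. The names LONG_FORM and SHORT_FORM are hypothetical, and the equivalence is shown here for Python's re module only, so test any rewritten definition with QuickParse before relying on it.

```python
import re

# The definition from this guide, written out in full.
LONG_FORM = re.compile(
    r"[0-9][0-9][0-9][0-9][ -]?" * 3 + r"[0-9][0-9][0-9][0-9]"
)
# The same idea using quantifiers: three groups of four digits with an
# optional separator, then a final group of four digits.
SHORT_FORM = re.compile(r"([0-9]{4}[ -]?){3}[0-9]{4}")

CARD_SAMPLES = [
    "1234-5678-9123-4567",
    "1234 5678 9123 4567",
    "1234567891234567",
]
```

Every value in CARD_SAMPLES fits both forms, and a value with only three groups of digits fits neither.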
Completed definition
section
After adding the !end_def command to designate the end of the subcomponent
definitions, the file would look something like this:
payment_type=[p|P]|[c|C];
po=[P|p][o|O][1|2][0-9][0-9]-[0-9][0-9][0-9]?;
credit_card=[0-9][0-9][0-9][0-9][ -]?[0-9][0-9][0-9][0-9][ -]?[0-9]
[0-9][0-9][0-9][ -]?[0-9][0-9][0-9][0-9];
!end_def;
Step 3:
Define the rule(s)
If you closed the file now, the application would do nothing differently. Though
you have explained some patterns, you need to create a rule to actually look for
those subcomponents, as well as to explain how those subcomponents should
appear before they are parsed.
First you indicate the fields the data is coming in on and going to. You separate
the names of these input and output fields with a colon.
Next you name the rule, and explain it in terms of the subcomponents that you've
defined. You want to tell the application to look for a payment_type
subcomponent, immediately followed by either a po subcomponent or credit_card
subcomponent. The rule only applies for the input field specified. The rule would
look like this:
ILINE1:UDPM1:account_info=({payment_type})({po}|{credit_card});
In this rule, ILINE1 is the input field and UDPM1 is the output field. The
parenthesized items after the equal sign are the subpatterns that make up the
pattern.
Remember that the order listed in the rule is very important. The Data Cleanse
transform will only parse a rule if it finds the subcomponents in the exact order
listed in your rule.
Step 4:
Save the modified
pattern file
Now that you have completed the sample file, it should look like this:
DRL UDPM Pattern File v1.0
credit_card=[0-9][0-9][0-9][0-9][ -]?[0-9][0-9][0-9][0-9][ -]?[0-9][0-9][0-9][0-9][ -]?[0-9][0-9][0-9][0-9];
po=[P|p][o|O][1|2][0-9][0-9]-[0-9][0-9][0-9]?;
payment_type=[p|P]|[c|C];
!end_def;
ILINE1:UDPM1:account_info=({payment_type})({po}|{credit_card});
Now that you've created a user-defined pattern, you can configure your
application to apply the rules to incoming records. You define your input fields in
the rule section of the pattern file.
Use the ILINE field when your user-defined data appears on a line mixed in with
other data. If your user-defined data is in its own discrete field, use one of the
IUDPM1 ... IUDPM4 fields instead. Specify the field in the rule section of your file.
Step 6:
Retrieve user-defined
data
Finally, you can configure your application to retrieve the parsed user-defined
data that you have defined in your pattern file.
You can request the entire user-defined field (as defined in your rule) with the
UDPM field. Or, you can request individual subcomponents in your rule with the
UDPM_SUB1-5 components.
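The relationship between the whole match and its subcomponents can be sketched with the sample account_info rule. The Python code below is an approximation under stated assumptions: the mapping of the first and second parenthesized groups to UDPM_SUB1 and UDPM_SUB2 is assumed for illustration, the payment type is written as [pPcC], and parse_account_info is a hypothetical function.

```python
import re

PAYMENT_TYPE = r"[pPcC]"
PO = r"[Pp][oO][12][0-9][0-9]-[0-9][0-9][0-9]?"
CREDIT_CARD = r"(?:[0-9]{4}[ -]?){3}[0-9]{4}"

# ({payment_type})({po}|{credit_card}) with the macros filled in.
ACCOUNT_INFO = re.compile(
    "(" + PAYMENT_TYPE + ")((?:" + PO + ")|(?:" + CREDIT_CARD + "))"
)


def parse_account_info(line):
    """Return the whole match and its two subcomponents, or None."""
    match = ACCOUNT_INFO.search(line)
    if match is None:
        return None
    return {
        "UDPM": match.group(0),       # the entire user-defined field
        "UDPM_SUB1": match.group(1),  # assumed: first group in the rule
        "UDPM_SUB2": match.group(2),  # assumed: second group in the rule
    }
```

For the sample value Ppo123-456, the sketch returns P as the first subcomponent and po123-456 as the second. Because the rule lists the payment type first, a value with the parts in the other order does not fit.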
Chapter 4:
Modify the rule file
One of the ways you can modify the Data Cleanse transform's behavior is by
creating or editing parsing rules in the rule file (drlrules.dat). The rule file
controls parsing name data, firm data, and addresses.
Proceed with care. The rule file controls how incoming data is parsed,
so changing this file changes how items are parsed. If you aren't careful
when you add or edit parsing rules, you may receive unexpected and
unwanted results. Make backups of your files in case you need to
revert to previous parsing rules.
Treat this ability as you would treat any feature you add to an application. Before
you integrate any modification into your enterprise, you should put your
modification through a stringent cycle of research, quality assurance, testing,
regression testing, and so on. Don't find out at release time that you've changed
something for the worse. As always, we recommend that you maintain adequate
backups and test your results with the QuickParse utility. For more information
see Check parsing results with QuickParse on page 67.
The Data Cleanse transform already provides hundreds of rules for many
different possible combinations of data. These rules will likely satisfy the parsing
needs of most users.
However, you may encounter data that isn't being parsed as you'd like it to be.
Or, maybe you would like to tweak a rule so that it returns a different confidence
score by adding a confidence booster. In situations like this, it is very handy to be
able to edit the rule file.
General guidelines
When modifying the rule file, you should keep the following points in mind:
Start small
You should create new rules that define very specific situations. Start
conservatively with very narrow parameters. Only after you master narrowly
defined rules should you proceed to create rules that cover broader situations.
Be careful
You should always double-check the syntax and take great care when you apply
operators. It's easy to enter an inappropriate operator, and it may not be as easy
to spot it later.
If an item is parsed incorrectly and you write a rule to correct it, the confidence
score still factors in. Because of this, your results might not be exactly as you had
anticipated.
Test results
You should always test your results very thoroughly. See Check parsing results
with QuickParse on page 67 for more information.
The file's header identifies the rule file. You must not alter or delete the header.
DRL Rule File v1.0;
# DO NOT EDIT, MODIFY OR REMOVE THE ABOVE LINE!!!!!
#
Explanatory information: Lines that start with a pound sign (#) are
commented out.
Group: The file consists of rules for several types of data. Groups of
rules include:
    Name rules
    Dual name rules
    Firm rules
    Address Line rules
    Last Line rules
    Optional rules, not enabled by default
Rule
Rule example
The following is an example of a rule that already exists in the rule file
(drlrules.dat).
The rules within the rule file can be divided into two main sections: the definition
section and the action section. These sections can then be further divided into
smaller components. Though each rule is unique, all rules follow the same
structure as explained in the next sections of this chapter.
When you work with rules, keep in mind the differences between the rule file and
the pattern file. Though the pattern file has only one definition section, the rule
file has a separate definition section for each rule defined in the file.
Definition section
Here you name the rule and define what type of data pattern to look for to
match this rule. Lines that start with a pound sign (#) are commented out.
#######################################################
#
# Prename with last name and prename with first, last name
#
# Mr Smith and Mrs Mary Jones
#
nfdual34 =
# Prename
PRENAME_ALONE +
# last name
[NAME_STRONG_FN | NAME_WEAK_FN |
LOOKUP_NOT_FOUND | NAME_WEAK_LN |
NAME_STRONG_LN | NAME_AMBIGUOUS |
PREFIRST | NO_VOWEL] & !INITIAL & !ALPHA_NUM & !NUMBER +
# connector
CONNECTOR +
# Prename
PRENAME_ALONE +
# first name
[NAME_STRONG_FN | NAME_WEAK_FN |
LOOKUP_NOT_FOUND | NAME_WEAK_LN |
NAME_STRONG_LN | NAME_AMBIGUOUS |
INITIAL | PREFIRST] +
# last name
[NAME_STRONG_FN | NAME_WEAK_FN |
LOOKUP_NOT_FOUND | NAME_WEAK_LN |
NAME_STRONG_LN | NAME_AMBIGUOUS |
PREFIRST | NO_VOWEL] & !INITIAL & !ALPHA_NUM & !NUMBER;
Action section
Here you specify the action that is performed when the rule is matched.
action = PERSON : D;
PERSON = 1 : PRENAME : 1;
PERSON = 1 : LAST_NAME : 2;
PERSON = 2 : PRENAME : 4;
PERSON = 2 : FIRST_NAME : 5;
PERSON = 2 : LAST_NAME : 6;
PERSON = 1 : NAME_CONNECTOR : 3;
end_action
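To make the relationship between the definition section and the action section more concrete, the sketch below models the nfdual34 rule in Python. It is a simplified illustration, not how the Data Cleanse transform is implemented. The token classifications in TOKENS are hypothetical (the real ones come from the parsing dictionary), and confidence scoring is ignored.

```python
# Hypothetical token classifications for "Mr Smith and Mrs Mary Jones".
TOKENS = [
    ("Mr", {"PRENAME_ALONE"}),
    ("Smith", {"NAME_STRONG_LN"}),
    ("and", {"CONNECTOR"}),
    ("Mrs", {"PRENAME_ALONE"}),
    ("Mary", {"NAME_STRONG_FN"}),
    ("Jones", {"NAME_STRONG_LN"}),
]

NAME_TYPES = {
    "NAME_STRONG_FN", "NAME_WEAK_FN", "LOOKUP_NOT_FOUND", "NAME_WEAK_LN",
    "NAME_STRONG_LN", "NAME_AMBIGUOUS", "PREFIRST",
}
# Each position is (types allowed with "|", types excluded with "& !").
PRENAME = ({"PRENAME_ALONE"}, set())
LAST = (NAME_TYPES | {"NO_VOWEL"}, {"INITIAL", "ALPHA_NUM", "NUMBER"})
CONNECTOR = ({"CONNECTOR"}, set())
FIRST = (NAME_TYPES | {"INITIAL"}, set())

# Positions joined by "+" in the rule definition, in order.
DEFINITION = [PRENAME, LAST, CONNECTOR, PRENAME, FIRST, LAST]

# (person number, subcomponent, position), as in the action item lines.
ACTIONS = [
    (1, "PRENAME", 1), (1, "LAST_NAME", 2), (2, "PRENAME", 4),
    (2, "FIRST_NAME", 5), (2, "LAST_NAME", 6), (1, "NAME_CONNECTOR", 3),
]


def apply_rule(tokens):
    """Return the parsed people if the tokens fit the rule, else None."""
    if len(tokens) != len(DEFINITION):
        return None
    for (_, types), (allowed, excluded) in zip(tokens, DEFINITION):
        if not (types & allowed) or (types & excluded):
            return None
    people = {}
    for person, subcomponent, position in ACTIONS:
        people.setdefault(person, {})[subcomponent] = tokens[position - 1][0]
    return people
```

For the sample tokens, apply_rule assigns Mr, Smith, and the connector to person 1, and Mrs, Mary, and Jones to person 2, mirroring the index numbers in the action item lines.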
Rule description
#######################################################
#
# Prename with last name and prename with first, last name
#
# Mr Smith and Mrs Mary Jones
#
Rule label
nfdual34 =
Rule definition
# Prename
PRENAME_ALONE +
# last name
[NAME_STRONG_FN | NAME_WEAK_FN |
LOOKUP_NOT_FOUND | NAME_WEAK_LN |
NAME_STRONG_LN | NAME_AMBIGUOUS |
PREFIRST | NO_VOWEL] & !INITIAL & !ALPHA_NUM & !NUMBER +
Component
Description
In this line, you can designate a description for the rule. This is
an optional line, and therefore it must begin with at least one
pound sign (#) so the application treats it as a comment. This
line is helpful to use so you know what the rule is intended to
parse.
Here you can enter an example of the data that you will parse
with the rule. We recommend that you use such a line, because
it is helpful in locating the rule you want to edit.
This is an optional line, and therefore it must begin with at least
one pound sign (#) so that it is treated as a comment.
Rule label
Rule definition
The rule definition lists the components that make up the parse.
This line and the components that make it up are described in
more detail in the following section.
If you think of the rule definition as an equation, it may help you understand it.
The rule label (before the equal sign) can be equated with the description (after
the equal sign).
#
nfdual34 =
# Prename
PRENAME_ALONE +
# last name
[NAME_STRONG_FN | NAME_WEAK_FN |
LOOKUP_NOT_FOUND | NAME_WEAK_LN |
NAME_STRONG_LN | NAME_AMBIGUOUS |
PREFIRST | NO_VOWEL] & !INITIAL & !ALPHA_NUM & !NUMBER +
Rule label
The rule label must be unique; no two rules can have the same label. The table
below explains how the rule label is created.
Character
Values
First
Second
(only if
first character is n)
additional
Description
Use f in the first character spot to indicate that this is a firm rule.
any
Use any combination of letters, numbers, or underscore characters for any additional characters in the rule label. The only stipulation is that each rule label is unique.
Note: Although the additional characters for a rule label are completely up
to you, you should label the rule so you can understand it and you can
separate it from others.
Rule definition
The rule definition is a combination of token types that the application looks for
when parsing data. This section always follows the equals sign (=).
In this section of the rule file you can use only certain dictionary types. For a
listing of dictionary types, see the Information codes in Appendix C.
In addition to dictionary types, you can use some token types that are not
dictionary components in the rule file.
Valid type          Description
ALPHA_NUM
PUNCTUATION
CONTAINS_PUNC
NO_VOWEL
LOOKUP_NOT_FOUND
LOOKUP_ANY
Order of types
The Data Cleanse transform looks for the token types you've listed in the precise
order you've listed them in. Multiple identifiers are connected by a plus sign (+).
Additionally, by adding an asterisk (*) after a token type, you signify that there
can be one or more of these types of tokens in the incoming data.
Within the rule definition, you can use any of the following operators.
Symbol   Also known as            Description
------   ----------------------   ----------------------------------------
#        Pound sign
=        Equal sign               Shows the relationship between items,
                                  such as equating the first part of a
                                  line with the second part.
&        Ampersand                Associates tokens.
[]       Brackets
|        Pipe (or)
!        Exclamation mark (not)
?        Question mark
*        Asterisk
:        Colon
;        Semicolon
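Component            Description
------------------   ----------------------------------------------------
Options line         Lists optional components, such as whether matching
                     should start at the end or beginning of data. This
                     line is optional.
Action line          In this line, you assign the output type of the
                     parsed item.
Action item lines    In these lines, you assign the output type for each
                     of these subcomponents.
end_action command   You enter end_action to signify the end of the action
                     section and, in effect, the end of the rule.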
The components that make up action lines and action item lines are discussed in
more detail in the following sections.
How to terminate lines
Except for the last line (end_action), you must terminate each line of the rule's
action section with a semicolon (;) after the last component or indicator.
Options line
The options line lists optional components, such as whether matching should start
at the end or beginning of data. The options line is optional to the rule file.
Components
The options line consists of two parts: the label that tells you what the line is for
(start options command) and the options themselves.
1
2
options = no_multiline : begin : end;
Available options
Component
Description
Start options
command
Option
The options line accepts only three values as options. An example from the
default rule file (State) shows all three of these options:
# State
STATE;
Options
Description
begin
end
no_multiline
Note: When begin and end options are used together in the rule file, the
data to be found must be by itself on a line; it can't be pulled out of the
middle.
Action line
The diagram and table below explain the components that make up the action
line.
1
2
3
4
action = PERSON : D conf: 40;
Component             Description
-------------------   ---------------------------------------------------
Output type           Enter the output type for the parsed component.
                      Valid output types are:
                        PERSON
                        FIRM
                        ADDRESS
                        LAST_LINE
Dual rule indicator   If your rule is for two or more people (Mr. and Mrs.
                      John Smith, for example), enter D after the output
                      type. The dual rule indicator is needed in the action
                      line only if the rule is a dual rule. If you need to
                      enter a dual name indicator, follow the output type
                      with a colon (:).
Confidence score
Each subcomponent used in the rule definition usually has a corresponding action
item line. The diagram and table below explain the components that make up the
action item lines.
1        2   3                4
PERSON = 1 : PRENAME        : 1;
PERSON = 1 : LAST_NAME      : 2;
PERSON = 2 : PRENAME        : 4;
PERSON = 2 : FIRST_NAME     : 5;
PERSON = 2 : LAST_NAME      : 6;
PERSON = 1 : NAME_CONNECTOR : 3;
#   Component          Description
-   ----------------   ---------------------------------------------------
1   Output type        For this component, enter the output type used in
                       the action line, followed by the equal sign (=).
2   Item index
    number
3   Output type        Enter the subcomponent of the output type for the
    subcomponent       information code that this line corresponds to. For
                       example, if this action item line corresponds to the
                       PRENAME information code in the rule definition,
                       then PRENAME would be the subcomponent you use here.
                       For a list of valid subcomponents for each output
                       type, see Valid output type subcomponents on page 61.
                       Follow the output type subcomponent with a colon (:).
4   Information        Use an index of L here to include the whole line
    code index         (data in the firm line that may not match the rule
    number             definition).
                       Follow the information code index number with a
                       semicolon (;) to terminate the action item line.
The following table lists the valid subcomponents for each output type.
Output type   Subcomponents
PERSON        PRENAME, FIRST_NAME, MID_NAME, LAST_NAME, OTH_POST, MAT_POST,
              TITLE, NAME_DESIG, PRELAST, NAME_SPEC, NAME_CONNECTOR
FIRM          FIRM, FIRM_LOC
ADDRESS       ADDRESS
LAST_LINE     LAST_LINE
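If you check rule files with a script before saving them, the table above can be expressed as a lookup. The following Python sketch is an illustration only; the names VALID_SUBCOMPONENTS and is_valid_subcomponent are hypothetical, and the values are copied from the table.

```python
# Valid subcomponents for each output type, copied from the table above.
VALID_SUBCOMPONENTS = {
    "PERSON": {
        "PRENAME", "FIRST_NAME", "MID_NAME", "LAST_NAME", "OTH_POST",
        "MAT_POST", "TITLE", "NAME_DESIG", "PRELAST", "NAME_SPEC",
        "NAME_CONNECTOR",
    },
    "FIRM": {"FIRM", "FIRM_LOC"},
    "ADDRESS": {"ADDRESS"},
    "LAST_LINE": {"LAST_LINE"},
}


def is_valid_subcomponent(output_type: str, subcomponent: str) -> bool:
    """Return True if the subcomponent is listed for the output type."""
    return subcomponent in VALID_SUBCOMPONENTS.get(output_type, set())
```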
Your incoming data often contains names with three last names in the Name field.
For example, one record contains Juan Carlos Fernandez Torres Perez. You want
to create a rule to parse this as one name with discrete components of first name,
middle name, last name, last name, last name.
Note: The Data Cleanse transform already has rules for scenarios with two
last names, but not for a third, so we will modify an existing rule to account
for this extra name component.
You already have an example of the data you want to parse: Juan Carlos
Fernandez Torres Perez. Before adding a rule, you should see how this example
currently parses. Using QuickParse you can find that it parses most of the entry
correctly, but it sends the final last name to the extra field. The rule it matches is
nfname15.
Now that you know what you want to parse and how it currently parses, you can
begin to edit the rule file for your scenario. Open drlrules.dat in any text editing
program. It is best to create a backup of this rule in order to reverse any changes,
if necessary.
The definition section includes a number of optional lines, and one required
line: the rule definition line. We recommend including comment lines before
your rule definition as an explanation of what the rule will parse. To add comment
lines to this rule, you could enter the following:
####################################################
#
#First name first rule for names with a first name,
#middle name, and 3 last names
#
#Examples: Juan Carlos Fernandez Torres Perez
Next, you must create the rule label. Name each rule label with a descriptive
name. For this example, we will build on the rule we are using as a base and name
it:
nfname15_extralast =
Because this is a name rule, you use n in the first character position. Again
because it is a name rule, you must include a second specific character in your
label to indicate the name order. Juan, the first name, is listed first, therefore you
use an f as the second character. From there the rest of the name is up to you.
Now you need to list the token types that make up the subcomponents of the main
component you hope to parse. In Step 1, you found that adding an extra last name
subcomponent would fix your problem, so you add that here, joining it to the
previous subcomponents with the plus (+) sign:
# last name
[ LOOKUP_NOT_FOUND |
NAME_WEAK_LN |
NAME_STRONG_LN |
NAME_AMBIGUOUS |
NO_VOWEL |
PREFIRST] & !INITIAL & !ALPHA_NUM & !NUMBER & !CONNECTOR & !PUNCTUATION +
# maturity post (Jr.)
MATURPOST? +
# honorary post (phd)
HONPOST*? +
# occupational title
TITLE_ALONE*?;
Notice that some subcomponents in this rule are optional. The name designator,
prename, maturity post name, honorary post name, and occupational title are all
followed by the ? operator indicating that they may or may not be present in the
input.
Step 3: Add the options line
We only want to apply this rule on a nameline, so we must add the following
options line:
Options = no_multiline;
Every action line begins with action = followed by the output type. Any
optional components for this line come next, separated by colons. The line
ends with a semicolon.
Because the rule we have based this rule on is also a one person name rule, the
action line remains the same:
action = PERSON;
To finish the action section, you need to add one action item line for the extra last
name subcomponent. Also, you need to update the index number for the
subcomponents below this inserted line:
PERSON = 1 : LAST_NAME : 7;
PERSON = 1 : MAT_POST : 8;
PERSON = 1 : OTH_POST : 9;
PERSON = 1 : TITLE : 10;
end_action
Each action item line begins with PERSON = because this is a name rule. Only
one person is parsed in this rule, so every subcomponent has 1 as the item
index number.
You must specify the output type subcomponent that each line refers to. For
a list of valid subcomponents, see "Valid output type subcomponents" on
page 61. Finally, remember that the token types in your rule definition correspond
directly to the example of data you want to parse. You must enter the index
number of the token type that the line applies to.
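The renumbering described above (insert one line, then shift the index of every line below it) can be sketched in Python. This is an illustration of the bookkeeping, not a tool supplied with the product. The function names are hypothetical, and the starting list is an assumption: it supposes the base rule has nine action item lines with two last names at indexes 5 and 6.

```python
def insert_action_item(items, position, subcomponent,
                       output_type="PERSON", item_index=1):
    """Insert a new action item at `position` and shift later indexes by one.

    `items` is a list of (output_type, item_index, subcomponent, info_index)
    tuples; `position` is the information code index for the new line.
    """
    shifted = []
    for out, idx, sub, info in items:
        if info >= position:
            info += 1  # every line at or below the insert point moves down
        shifted.append((out, idx, sub, info))
    shifted.append((output_type, item_index, subcomponent, position))
    shifted.sort(key=lambda item: item[3])
    return shifted


def format_action_item(item):
    """Render one tuple in the rule file's action item syntax."""
    out, idx, sub, info = item
    return f"{out} = {idx} : {sub} : {info};"


# Assumed shape of the base rule's action items (hypothetical).
base = [
    ("PERSON", 1, "NAME_DESIG", 1), ("PERSON", 1, "PRENAME", 2),
    ("PERSON", 1, "FIRST_NAME", 3), ("PERSON", 1, "MID_NAME", 4),
    ("PERSON", 1, "LAST_NAME", 5), ("PERSON", 1, "LAST_NAME", 6),
    ("PERSON", 1, "MAT_POST", 7), ("PERSON", 1, "OTH_POST", 8),
    ("PERSON", 1, "TITLE", 9),
]
updated = insert_action_item(base, 7, "LAST_NAME")
```

Under that assumption, the new LAST_NAME line takes index 7 and MAT_POST, OTH_POST, and TITLE move to 8, 9, and 10.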
Here are the final action item lines you should have:
PERSON = 1 : NAME_DESIG : 1;
PERSON = 1 : PRENAME : 2;
PERSON = 1 : FIRST_NAME : 3;
PERSON = 1 : MID_NAME : 4;
PERSON = 1 : LAST_NAME : 5;
PERSON = 1 : LAST_NAME : 6;
PERSON = 1 : LAST_NAME : 7;
PERSON = 1 : MAT_POST : 8;
PERSON = 1 : OTH_POST : 9;
PERSON = 1 : TITLE : 10;
To finish your rule, enter end_action after the action item lines. To enable your
changes, save drlrules.dat. The final rule is shown below.
####################################################
#First name first rule for names with a first name, middle name and 3 last names
#Examples: Juan Carlos Fernandez Torres Perez
nfname15_extralast =
# name designator (ATTN:)
NAMEDESIG? +
# prename (mr.)
PRENAME_ALONE? +
# first name
[NAME_STRONG_FN |
NAME_WEAK_FN ] +
# middle name
[INITIAL | NAME_STRONG_FN | NAME_WEAK_FN] +
# last name
[ LOOKUP_NOT_FOUND |
NAME_WEAK_LN |
NAME_STRONG_LN |
NAME_AMBIGUOUS |
NO_VOWEL |
PREFIRST] & !INITIAL & !ALPHA_NUM & !NUMBER & !CONNECTOR & !PUNCTUATION +
# last name
[ LOOKUP_NOT_FOUND |
NAME_WEAK_LN |
NAME_STRONG_LN |
NAME_AMBIGUOUS |
NO_VOWEL |
PREFIRST] & !INITIAL & !ALPHA_NUM & !NUMBER & !CONNECTOR & !PUNCTUATION +
# last name
[ LOOKUP_NOT_FOUND |
NAME_WEAK_LN |
NAME_STRONG_LN |
NAME_AMBIGUOUS |
NO_VOWEL |
PREFIRST] & !INITIAL & !ALPHA_NUM & !NUMBER & !CONNECTOR & !PUNCTUATION +
# maturity post (Jr.)
MATURPOST? +
# honorary post (phd)
HONPOST*? +
# occupational title
TITLE_ALONE*?;
action = PERSON;
PERSON = 1 : NAME_DESIG : 1;
PERSON = 1 : PRENAME : 2;
PERSON = 1 : FIRST_NAME : 3;
PERSON = 1 : MID_NAME : 4;
PERSON = 1 : LAST_NAME : 5;
PERSON = 1 : LAST_NAME : 6;
PERSON = 1 : LAST_NAME : 7;
PERSON = 1 : MAT_POST : 8;
PERSON = 1 : OTH_POST : 9;
PERSON = 1 : TITLE : 10;
end_action
Chapter 5:
Check parsing results with QuickParse
About QuickParse
QuickParse is a tool to help you check parsing results. QuickParse lets you
quickly see how data that you input would parse if input through your Data
Cleanse transform. With QuickParse you can manually type in questionable
records or use an input file and shuffle through the entries.
When you make any modifications to the parsing setup or to the user-modifiable
dictionary, you should use QuickParse to make sure that your changes produce
the results you want.
QuickParse's main window
When you start QuickParse, the program initially opens the window below. This
window is where you find out how the Data Cleanse transform would parse
records.
[Figure: QuickParse main window. Callout: Components for the selected parsed item are shown here.]
Setup needed. This window doesn't display any data until you set up
QuickParse. For information on setting up QuickParse, see "Get started
with QuickParse" on page 68. For more information on the above window,
see "Run QuickParse" on page 70.
1. From the QuickParse window's menu bar, select Setup > QuickParse. The
Setup window opens.
Here you specify the type of data to be used as input for QuickParse,
whether it's a database or data that you will input manually.
2. Specify the configuration file you want to use by typing in the path or by
browsing for it.
3. As necessary, specify the type of casing, text type, input setup, and greeting
you want to use.
4. Click OK.
Specify manual and/or database input
With QuickParse you can specify the type of data to be used as input by selecting
either Manual Input or Database Input at the Setup window.
Type of input
Description
Manual
Database
If you specify database input at QuickParse's Setup window, the Database Setup
window (below) opens. When you specify a valid input file (ASCII, Delimited,
or dBASE3), the Database fields box displays a list of the fields in the file.
1. Click on the fields in each box (Database fields and DataRight fields), and
click Map. The Mapped fields box then shows what fields the input is coming
in on.
Because fields can only be mapped once, the fields in the Database fields box
and DataRight fields box are removed when they're in use.
2. To delete or change a mapping, click Remove. This will put the appropriate
fields back into the lists to make them available to be mapped again.
3. When you have the fields set up the way you want, click OK. QuickParse
takes you back to the main window where you can begin going through the
records.
Run QuickParse
After you set up QuickParse (see "Get started with QuickParse" on page 68), the
application's main window opens.
QuickParse's main window
QuickParse's main window lets you view all the parsed information about any
record in an input file.
[Figure: QuickParse main window, with the following callouts]
Shows items the way they were parsed.
Shows all the components of the item selected above.
Configuration file. The name of the configuration file you're using is noted on the top title bar.
Input lines activated in .cfg file.
Input data.
Arrow buttons for navigation.
Parsed item line.
What type the item parsed as.
Confidence score.
Type of line the input came in on.
Rule this item hit if addr, name or firm.
How fields are mapped.
Data file. If you're using input from a file, the file name and number of records, along with which record is displayed, are listed on the bottom.
Navigate your input file
The arrow buttons (in the center on the left, beneath the input) let you navigate
through your input file. You can move forward to the next or ending record, and
backward to the previous or starting record. You can also go directly to a specific
record by typing its record number in the entry box and then pressing the Enter
key.
Enter records in database mode
If you're using an input file, you still have the ability to type in entries or make
modifications to an entry to see what that change would do. No changes will ever
be made to the original data file. Simply type in the change or addition and click
Parse Current. You can then continue with the file as you were before.
If you have a record that is not parsing correctly and want to take note of it, you
can save it to a log file. You simply have to set up the log file and then save
entries when you come across them.
Log files can also be extremely helpful to Firstlogic Customer Support when you
call in with questions.
2. Type or browse for the directory you want and type the file name.
3. Specify the type of log file you want created.
4. Click OK to complete the setup.
Each time you start a new session of QuickParse you have to specify a new log
file. You can't append to an existing log file.
To add to a log file
You can add an entry to your log file from QuickParse's main window.
1. When the entry you want recorded is active, click Log. QuickParse opens the
Log Item Information window.
2. Fill out the Correct item type entry, indicating how you wanted the item to
parse. (QuickParse automatically fills out the item text, input, database name,
current item type and input line fields. If the entry came from an input file,
the log file will also include the input record's number.)
3. In the Notes box, enter any additional comments pertaining to this entry.
Note: If you press the Enter key while entering comments in the Notes
box, you'll insert a return character, which shows up in your log file as
another line. You may want to have only one line per log file entry.
4. When the log contains the information you want, click OK.
Change your configuration file
If you want to change the options set in the configuration file you're using,
choose Edit > Config file. QuickParse opens a text box with the configuration
file you specified at setup.
Yes                      No
AssignPrenames = YES     #AssignPrenames = YES
#AssignPrenames = NO     AssignPrenames = NO
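The Yes/No pairs above switch an option by commenting out one line with a pound sign (#) and leaving the other active. The following Python sketch shows how such a pair resolves to a single value. It is an illustration only and not QuickParse's own reader; the function name active_setting is hypothetical, and it assumes that a line starting with # is ignored entirely.

```python
def active_setting(config_text, name):
    """Return the value of the uncommented `name = value` line, if any."""
    value = None
    for raw in config_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # blank lines and commented-out lines are skipped
        key, sep, val = line.partition("=")
        if sep and key.strip() == name:
            value = val.strip()
    return value


yes_block = "AssignPrenames = YES\n#AssignPrenames = NO"
no_block = "#AssignPrenames = YES\nAssignPrenames = NO"
```

With these two blocks, active_setting(yes_block, "AssignPrenames") gives "YES" and active_setting(no_block, "AssignPrenames") gives "NO".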
Instead of having to go through all the steps of setup every time you open
QuickParse, you have the option to save the session. Saving a session means
you are saving in your registry the name of the configuration file and other
options on the setup screen to be easily accessed.
If you are using database input, the name and the field mappings are also retained.
Remember, you can't append to a log file, so if you're using a log file you'll have
to specify a new one each time you open QuickParse.
To save a session:
1. At QuickParse's main window, choose File > Save Session. The Save
Session window opens.
To remove a session:
Appendix A:
UMD configuration file, umd.cfg
umd.cfg
Rather than type a long command line, you can use the UMD configuration file.
Make a copy of umd.cfg and save it under a different file name. Then edit and use
your copy.
# UMD Show
Output File Name (path & file name) .... =
Output File Type (See NOTE) ............ =
#
# UMD Build
Dictionary Type (see NOTE) ............. =
Source Dictionary (path & dct) ......... =
Transaction File Name (path & file name) =
Target Dictionary (path & dct) ......... =
Verify Input File only (YES/NO) ........ =
Error Message Log File (path & name) ... =
Work Directory (path) .................. =
#
# Dictionary Types:
#    parsing
#    generic
#    capital
#
# Output File Types:
#    delimited
#    ascii
#    dbase3
In the configuration file, do not edit anything to the left of the equal signs. To
insert comments, prefix them with a pound sign (#). Complete either the UMD
Show section or the UMD Build section, not both. (Exception: For UMD Show,
specify the dictionary at the Source Dictionary parameter.)
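To check a copy of the configuration file before running UMD, you could read it with a short script. The Python sketch below is an illustration only and is not how UMD itself reads the file. The function names and key lists are hypothetical, and raising an error when both sections are filled in is this sketch's choice, based on the rule stated above. Source Dictionary is left out of both lists because UMD Show also uses it.

```python
SHOW_KEYS = ("Output File Name", "Output File Type")
BUILD_KEYS = (
    "Dictionary Type", "Transaction File Name", "Target Dictionary",
    "Verify Input File only", "Error Message Log File", "Work Directory",
)


def read_umd_cfg(text):
    """Map each parameter label (left of the equal sign) to its value."""
    settings = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue  # skip blanks, comments, and lines without a value slot
        label, _, value = line.partition("=")
        # Drop the parenthesized hint and the dot leaders from the label.
        label = label.split("(")[0].rstrip(" .")
        settings[label] = value.strip()
    return settings


def detect_mode(settings):
    """Return "show", "build", or None; reject files that fill in both."""
    show = any(settings.get(key) for key in SHOW_KEYS)
    build = any(settings.get(key) for key in BUILD_KEYS)
    if show and build:
        raise ValueError("complete either UMD Show or UMD Build, not both")
    if show:
        return "show"
    if build:
        return "build"
    return None
```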
Command line
When you run UMD, include the configuration file as a parameter on the UMD
command line:
Platform
Command line
UNIX
Windows
These parameters are for UMD Show mode only. If you want to record query
results in an output file, enter the path and file name. Specify a file type of ASCII,
dBASE3, or Delimited (see footnote 4).
Dictionary Type
Enter the type of dictionary you want to modify. Possible dictionary types are
Parsing, Capital, and Generic (search-and-replace).
4. If you plan to use the output file as a transaction file, we recommend the following: If you plan to use a database
program or spreadsheet program to edit the file, create a dBASE3 or ASCII file. If you plan to use a text editor or word
processor to edit the file, create a delimited file.
Source Dictionary
If you are building a custom dictionary or table, enter the path and file name of
the source dictionary:
If you are creating a parsing dictionary, specify our parsing dictionary
parsing.dct as the source dictionary.
If you are creating a capitalization dictionary or search-and-replace table, you
should usually leave this blank.
If you are querying an existing dictionary (UMD Show), specify the path and file
name of the dictionary you want to query.
Transaction File Name
Type the path and file name of the transaction file containing your entries.
Target Dictionary
Type the path and file name of the custom dictionary you want to create. If the file
already exists, UMD overwrites the existing file (see footnote 5). If you do not
specify a target, UMD uses the source dictionary as the target.
Do not overwrite any of our base dictionaries. Instead, give your custom
dictionary a separate name. Each time you install a software update, we
overwrite our base dictionaries. If you use our file names for your
dictionaries, your custom dictionaries may be overwritten.
Verify Input File only
If you set this option to Yes, UMD checks all the entries in the transaction file but
does not actually produce the target dictionary. This is handy if you want to verify
during the day and run the build process during the night.
If you set this option to No, UMD checks the entries in the transaction file. If no
verification errors occur, UMD builds the target dictionary.
Error Message Log File
We recommend that you specify an error log file. UMD will write any error or
warning messages to the log file so you can review them later.
If you leave this parameter blank, UMD sends error and warning messages to the
screen (standard output). If any messages scroll off the screen, you will not be
able to retrieve them.
Work Directory
By default, UMD places its temporary work files in the current directory. If you
would like to use some other location, specify a path.
To estimate the space required for work files, use this formula:
Work space = 4 x (size of transaction file + size of source dictionary)
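As a worked example of the formula, here is a small Python sketch. The function name estimate_work_space and the file sizes are hypothetical; sizes are in bytes, but any unit works as long as both inputs use the same one.

```python
def estimate_work_space(transaction_size, source_dictionary_size):
    """Work space = 4 x (size of transaction file + size of source dictionary)."""
    return 4 * (transaction_size + source_dictionary_size)


# Hypothetical sizes: a 2 MB transaction file and a 10 MB source dictionary.
needed = estimate_work_space(2_000_000, 10_000_000)
```

With those sizes, the estimate is 4 x 12,000,000 = 48,000,000 bytes, or about 48 MB of free space in the work directory.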
5. Before overwriting an existing dictionary, UMD makes a backup copy of the existing file. For example, if the
dictionary is named custom.dct, UMD creates a backup file named custom.001. The next time, UMD creates a backup
named custom.002, and so on up to custom.999.
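The backup naming pattern in this footnote can be sketched in Python, for example to predict which backup file a build will create. This is an illustration only. The function name next_backup_name is hypothetical, and two points are assumptions: that UMD picks the lowest unused number, and that nothing further is created once custom.999 exists.

```python
from pathlib import PurePath


def next_backup_name(dictionary_name, existing_names):
    """Return the next backup name, such as custom.001 for custom.dct."""
    stem = PurePath(dictionary_name).stem
    used = set(existing_names)
    for number in range(1, 1000):
        candidate = f"{stem}.{number:03d}"
        if candidate not in used:
            return candidate
    # Assumption: behavior past .999 is not described, so stop here.
    raise RuntimeError("all backup names through .999 are in use")
```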
Appendix B:
UMD command line
You can query an existing dictionary or table by using the UMD Show command
line.
Platform
Command line
UNIX
Windows
Parameter / Description
s dct_file.dct
o out_file
  Path and file name of the output file. If you save a query, UMD
  writes it to this file. If the file already exists, UMD appends to the
  end of the file.
  Note: You can edit the output file and use it as a transaction file.
d db_type
  Database type for the output file. Choose one: dBASE3, ASCII
  (default), or Delimited (see footnote 1).
1. If you plan to use the output file as a transaction file, we recommend the following: If you plan to use a database
program or spreadsheet program to edit the file, create a dBASE3 or ASCII file. If you plan to use a text editor or word
processor to edit the file, create a delimited file.
UMD Build
If you prefer not to use the configuration file, you can place all the UMD Build
parameters on the command line.
Platform
Command line
UNIX
umd dct_type -i trans [-s source] [-t target] [-e err_log] [-p work] [-v]
Windows
umd dct_type /i trans [/s source] [/t target] [/e err_log] [/p work] [/v]
Parameter / Description
dct_type
i trans
  Path and file name of the transaction file containing your custom
  entries.
s source
  Path and file name of the source dictionary to use as a base for your
  custom dictionary.
t target
  Path and file name of the custom dictionary to create. If the file already
  exists, UMD will overwrite it (see footnote 1). If you do not specify a
  target, UMD uses the source dictionary as the target.
e err_log
  Log file for validation warnings and errors. We recommend that you
  include this parameter.
p work
  Path and directory to use for temporary storage of work files. To estimate
  space requirements, use this formula:
  Work space = 4 x (size of transaction file + size of source dictionary)
v
  Verify only. If you include this option, UMD checks all the entries in
  the transaction file but does not actually produce the target dictionary.
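If you start UMD Build from a script, the two command-line forms above differ only in the switch character (- on UNIX, / on Windows). The Python sketch below assembles the argument list from the documented parameters. The function name umd_build_args and the file names are hypothetical, and the value "parsing" for dct_type is an assumption based on the dictionary types listed in umd.cfg.

```python
def umd_build_args(dct_type, trans, source=None, target=None,
                   err_log=None, work=None, verify_only=False,
                   windows=False):
    """Build the UMD Build argument list from the documented parameters."""
    switch = "/" if windows else "-"
    args = ["umd", dct_type, f"{switch}i", trans]
    optional = (("s", source), ("t", target), ("e", err_log), ("p", work))
    for letter, value in optional:
        if value:
            args += [f"{switch}{letter}", value]
    if verify_only:
        args.append(f"{switch}v")
    return args


# Hypothetical file names; dct_type value assumed from umd.cfg.
unix_args = umd_build_args("parsing", "trans.txt", source="parsing.dct",
                           target="custom.dct", verify_only=True)
```

The resulting list can be passed to a process launcher such as subprocess.run. Keeping the arguments as a list avoids quoting problems with paths that contain spaces.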
1. Before overwriting an existing dictionary, UMD makes a backup copy of the existing file. For example, if the
dictionary is named custom.dct, UMD creates a backup file named custom.001. The next time, UMD creates a backup
named custom.002, and so on up to custom.999.
UMD Config
Rather than type the UMD Show or UMD Build command line, you can specify
file names and options in the UMD configuration file (see UMD configuration
file, umd.cfg on page 75).
To run UMD with the configuration file, use the following command:
Platform
Command line
UNIX
Windows
Appendix C:
Information codes and standard-type codes
Information codes
Standard-type codes
Information code
Description
DIRECTIONAL
Refers to the part of the address that gives directional information for delivery, such as N, S, N.E
FIRMDESIG
FIRMINIT
FIRMLOC
FIRMMISC
FIRMNAME
This code is used for firm names that may be parsed incorrectly. For example, Hewlett Packard could be incorrectly
parsed as a personal name, so Hewlett, Packard, and Hewlett
Packard are all listed as Firm Name words.
FIRMNAME_ALONE
FIRMTERM
HONPOST
INITIAL
MATURPOST
NAMEDESIG
NAMEGEN1-5
NAMESPEC
A word that may appear in a name line, such as Family, Resident, Occupant.
NUMBER
PHRASE_WRD
POST_OFFICE
PREGEN1-5
PREFIRST
A first-name prefix.
PRELAST
PRENAME
PRENAME_ALONE
PRIVATE_ADDR
REGION
TITLE
TITLE_INIT
TITLE_TERM
TITLE_ALONE
SEC_ADDR
STATE
SUFFIX
Information code
Description
MIL_ADDR
MIL_LAST
MIL_STATE
NAME_STRONG_FN
NAME_WEAK_FN
NAME_AMBIGUOUS
NAME_WEAK_LN
NAME_STRONG_LN
RR_HC_ADDR
CONNECTOR
ZIP
A ZIP Code
ZIP4
A ZIP+4 Code
Standard-type code
Description
ADDRESS_STD
ALL_TEXT_TYPES
FIRM_ACR
FIRM_MTC
FIRM_STD
FIRMLOC_ACR
FIRMLOC_MTC
FIRMLOC_STD
HONPOST_ACR
If the Primary is parsed as an honorary postname, use this Secondary as the acronym.
HONPOST_MTC
If the Primary is parsed as an honorary postname, use this Secondary as the match standard.
HONPOST_STD
If the Primary is parsed as an honorary postname, use this Secondary as the standardized form.
LAST_LINE_STD
MATURPOST_MTC
If the Primary is parsed as a maturity postname, use this Secondary as the match standard.
MATURPOST_STD
If the Primary is parsed as a maturity postname, use this Secondary as the standardized form.
NAME_MTC
NAMEDESIG_ACR
If the Primary is parsed as a name designator, use this Secondary as the acronym.
NAMEDESIG_MTC
If the Primary is parsed as a name designator, use this Secondary as the match standard.
NAMEDESIG_STD
If the Primary is parsed as a name designator, use this Secondary as the standardized form.
NAMESPEC_ACR
NAMESPEC_MTC
NAMESPEC_STD
PRELAST_MTC
If the Primary is parsed as a last-name prefix, use this Secondary as the match standard.
PRELAST_STD
If the Primary is parsed as a last-name prefix, use this Secondary as the standardized form.
PRENAME_ACR
PRENAME_MTC
PRENAME_STD
TITLE_ACR
TITLE_MTC
TITLE_STD
Index
acronyms
how parser generates, 25
adding
firm that looks like a personal name, 22
multiple-word firm name, 20
new word, 17
title phrase, 18
defining, 39
definition section, 37
dictionary
creating custom for parsing, 7
dictionary type
parsing, 7, 15
drludpm.dat, 5, 35
e-mail address
Firstlogic, 2
C
capitalization
in parsing dictionary, 17
capitalization dictionary
build customized, 32
create transaction file, 30
create your own, 30
definition, 29
querying, 29, 33
update customized, 33
capitalization transaction entries, 31
capitalization transaction file
creating, 30
codes
information, 79
standard-type, 79
command line
build UMD, 77
config UMD, 78
query UMD, 77
comments, 38
configuration file
change QuickParse's, 72
configure UMD (command line), 78
contact information
Firstlogic, 2
copyright statement, 2
creating a parsing transaction file, 12
custom capitalization dictionary
building, 32
updating, 33
custom dictionary
creating for parsing, 7
maintaining, 16
updating, 16
custom parsing dictionary
building, 15
customized parsing dictionary, 15
customizing DataRight, 5
F
firm
adding when it looks like a personal name, 22
adding when multiple word, 20
firm name that looks like a personal name
querying, 11
Firstlogic contact information, 2
G
generating acronyms via parser, 25
I
Information codes, 79
information codes
modifying, 23
input database
set up QuickParse's, 69
L
legal notices, 2
log file
maintain in QuickParse, 71
M
macro, 38
macro name, 38
match standards
how standards work, 26
rules, 26
spelling and punctuation, 27
working with, 26
modifying
information codes, 23
standards and standard-types, 24
multiple-word firm name
adding, 20
querying, 10
N
new word
adding, 17
O
operators
using in regular expressions, 40
P
parser
generating acronyms, 25
parsing dictionary
building custom, 15
creating custom, 7
definition, 7
parsing transaction file
creating, 12
pattern file, 36
personal name
really a firm name, 22
phone number
Firstlogic, 2
punctuation
in match standards, 27
Q
query UMD (command line), 77
QuickParse, 67
run, 70
set up, 68
set up input database, 69
R
regular expression, 38
rule file, 5
rule name, 38
rule section, 37
rules
defining, 48
S
session of QuickParse, 72
open, 73
remove, 73
save, 73
spelling
in match standards, 27
standards
modifying, 24
standard-type codes, 79
standard-types
modifying, 24
subcomponents, 39, 47
defining, 47
T
title phrase
adding, 18
querying, 9
trademarks, 2
transaction database
creating, 12, 30
transaction entries
parsing, 13
transaction file
creating for parsing, 12
placing entries in, 13
putting entries in, 31
transaction file for capitalization
creating, 30
transaction files
supporting files, 12
U
UMD, 5
UMD Build (command line), 32
UMD Build command line, 77
UMD Config command line, 78
UMD Show command line, 77
user-defined data
retrieve from DataRight, 49
submitting to DataRight, 49
user-defined pattern
definition section, 38
rules section, 38
user-defined pattern example, 37
user-defined pattern file
saving after modifying, 48
V
verification
when building custom dictionary, 15
W
web site
Firstlogic, 2
word
querying, 9