SSIS Succinctly
By
Rui Machado
This book is available for free download from www.syncfusion.com on completion of a registration form.
If you obtained this book from any other source, please register and download a free copy from
www.syncfusion.com.
This book is licensed for reading only if obtained from www.syncfusion.com.
This book is licensed strictly for personal or educational use.
Redistribution in any form is prohibited.
The authors and copyright holders provide absolutely no warranty for any information provided.
The authors and copyright holders shall not be liable for any claim, damages, or any other liability arising
from, out of, or in connection with the information in this book.
Please do not use this book if the listed terms are unacceptable.
Use shall constitute acceptance of the terms listed.
SYNCFUSION, SUCCINCTLY, DELIVER INNOVATION WITH EASE, ESSENTIAL, and .NET ESSENTIALS are the
registered trademarks of Syncfusion, Inc.
Table of Contents
The Story behind the Succinctly Series of Books .................................................................................. 8
About the Author ....................................................................................................................................... 10
Who is This Book For? ............................................................................................................................. 11
Introduction ............................................................................................................................................... 12
Chapter 1 Integration Services Architecture ......................................................................................... 13
Introduction ............................................................................................................................................ 13
Explaining the Components ................................................................................................................... 14
Runtime Engine ................................................................................................................................... 14
Integration Services Service ................................................................................................................ 14
SSIS Designer ..................................................................................................................................... 14
Log Provider ......................................................................................................................................... 14
Connection Manager ........................................................................................................................... 15
SSIS Wizard ......................................................................................................................................... 15
Packages ............................................................................................................................................. 15
Tasks ................................................................................................................................................... 15
Event Handlers .................................................................................................................................... 15
Containers ............................................................................................................................................ 15
Control Flow ......................................................................................................................................... 16
Data Flow Engine................................................................................................................................. 16
Project and Package Deployment Models ........................................................................................... 16
Developer Environment ......................................................................................................................... 17
Introduction .......................................................................................................................................... 17
SSIS Designer ..................................................................................................................................... 18
Chapter 2 Packages ................................................................................................................................. 20
Introduction ............................................................................................................................................ 20
Hello World Package: Import and Export Wizard ..................................................................... 20
Custom Packages .................................................................................................................................. 26
Introduction .......................................................................................................................................... 26
Adding a New Package to a Project .................................................................................................... 26
Executing the Packages ...................................................................................................................... 27
Package Explorer ................................................................................................................................... 29
Package Properties ................................................................................................................................ 30
Checkpoints ......................................................................................................................................... 31
Execution ............................................................................................................................................. 32
Forced Execution Value ....................................................................................................................... 32
Identification ......................................................................................................................................... 32
Misc ...................................................................................................................................................... 32
Security ................................................................................................................................................ 33
Transactions ........................................................................................................................................ 33
Version ................................................................................................................................................. 33
Chapter 3 Control Flow ............................................................................................................................ 34
Introduction ............................................................................................................................................ 34
Tasks and Containers ............................................................................................................................ 35
Introduction .......................................................................................................................................... 35
Favorite Tasks ..................................................................................................................................... 36
Common .............................................................................................................................................. 37
Containers ............................................................................................................................................ 39
Other Tasks ......................................................................................................................................... 40
Precedence Constraints......................................................................................................................... 43
Introduction .......................................................................................................................................... 43
Advanced Precedence Constraints ..................................................................................................... 45
Chapter 4 Data Flows ................................................................................................................. 74
Introduction .......................................................................................................................................... 74
Creating the Data Flow ........................................................................................................................ 75
Chapter 5 Variables, Expressions, and Parameters ............................................................................. 80
Introduction ............................................................................................................................................ 80
Understanding the Components ............................................................................................... 80
Creating New Variables ......................................................................................................................... 81
Creating New Expressions..................................................................................................................... 81
Creating New Parameters...................................................................................................................... 82
Handling Slowly Changing Dimensions ................................................................................................. 83
Introduction .......................................................................................................................................... 83
Demo: Using the Slowly Changing Dimension Component ................................................................ 83
Change Data Capture ............................................................................................................................ 92
Introduction .......................................................................................................................................... 92
Demo: Using CDC in an ETL Project ..................................................................................................... 94
Enable CDC on the Database ............................................................................................................. 95
Enable the SQL Agent in the Server of the CDC ................................................................................. 95
Enable CDC in the Tables We Want to Monitor for Changes ............................................................. 97
Process the Changed Data in SSIS and Use the SSIS CDC Component to Make It Easier ............ 101
Process Initial Load............................................................................................................................ 101
Create the CDC Data Flow ................................................................................................................ 104
Chapter 6 Deploying the Packages ...................................................................................................... 109
Introduction .......................................................................................................................................... 109
Project Deployment Model ................................................................................................................... 110
Package Deployment Model ................................................................................................................ 114
Whenever platforms or tools are shipping out of Microsoft, which seems to be about
every other week these days, we have to educate ourselves, quickly.
Free forever
Syncfusion will be working to produce books on several topics. The books will always be free.
Any updates we publish will also be free.
Code Examples
All of the examples in this book were created in Visual Studio. The code examples are formatted
in blocks as follows:
public int setPersonName(){
    return 0;
}
Notes
Throughout this book, there are notes that highlight particularly interesting information about a
language feature, including potential pitfalls you will want to avoid.
Introduction
Extract, transform, and load (ETL) is more than just a trendy or fancy concept; it's mandatory
knowledge for any developer. Companies need to integrate data more often than you think,
whether because of new content available for a product catalog, new data trade agreements
between parties, technological migrations, or the creation of data warehouses, which is one of
the most popular usage contexts. Because of this, the information systems market demands
professionals with ETL skills, whether using Microsoft products, Oracle tools, or other vendors'
tools. So it's important that you understand how quickly data science is evolving as well as how
important integration has become; it's now a stage in the process of delivering information to a
company's management layer.
SQL Server Integration Services (SSIS) is Microsoft's business intelligence suite and ETL tool.
If you are not considering it for your data integration projects, you are passing up one of the
most efficient, developer-friendly, and, most importantly, free tools (if you already have a paid
SQL Server license, that is). It's not only a tool that allows you to move data from one database
to another; it also allows you to clean and transform data in a drag-and-drop development
environment that is organized in data flows. By using it, you can maintain your project in an
easy and friendly environment.
The main goal for this product is to create integrated, clean, and consistent data to be used by
data warehouses (and, optionally, by OLAP-based systems such as SQL Server Analysis
Services) to create analysis cubes. However, developers have found it to be more than just an
ETL tool; it has been used to consolidate (when you need to bring data from several similar
databases into one), integrate (when you need to clean and move data), and move data (when
your goal is just moving data without any transformation).
In this book, you will learn how to use SSIS to develop data integration, migration, and
consolidation projects so that you, as a developer, can offer your company the ability to
populate data warehouses with an analytical purpose. To help you create and maintain the best
ETL projects, in this book I will guide you through the development and understanding of SSIS
packages, including the output of SSIS developments, their control flow, their execution flow (for
example, what will be executed and in what order), tasks, precedence constraints, and other
components that this platform provides.
Runtime Engine
The Runtime Engine is responsible for executing the packages that will work with the data
according to our needs. As you can see in the previous figure, it's the heart of SSIS as it not
only executes the packages, but also saves their layout and provides support for logging,
breakpoints, configuration, connections, and transactions. This engine is activated every time
we invoke a package execution, which can be done from the command line, PowerShell scripts,
the SSIS Designer, the SQL Agent tool, or even from a custom application.
SSIS Designer
The SSIS Designer is where you will be spending most of your time when developing your
packages. SSIS Designer is a graphical tool that you can use to create and maintain Integration
Services packages. SSIS Designer is available in SQL Server Data Tools (SSDT) as part of an
Integration Services project. You can use it to construct control flows and data flows, add event
handlers, and view package contents and execution progress.
Log Provider
Managed by the SSIS engine, the log provider's responsibility is to create all the logs needed to
monitor the package execution. Every time you run a package, the log provider will write logs
with information about the package execution including duration, errors, warnings, and other
related information.
Connection Manager
Managed by the SSIS engine, the connection manager's responsibility is to manage the
connections defined either at project level or package level. This is available as a tool inside the
SSIS Designer so that you can create the connections you need to extract and load data. It can
manage connections to FTP servers, databases, files, and Web services. The SSIS connection
manager allows you to easily connect to several types of systems, on both Microsoft and
non-Microsoft platforms.
SSIS Wizard
The SSIS wizard is the simplest method by which to create an Integration Services package that
copies data from a source to a destination. When using this wizard, you can only use one type
of task.
Packages
The packages are the heart of the engine; in other words, they are the main component of the
SSIS engine and they are also subsystems as they are composed of several other artifacts. So,
these packages will make the connections to the source and destination targets using the
connection managers. They contain the tasks we need to run in order to complete the entire
sequence of ETL steps. As you can see in the architecture diagram (Figure 1), these package
tasks can be executed singularly, in sequence, grouped in containers, or performed in parallel.
Packages are the output of our Integration Services project and have the .dtsx extension.
Tasks
Tasks are control flow elements that define units of work that are performed in a package
control flow. An SSIS package is made up of one or more tasks. If the package contains more
than one task, they are connected and sequenced in the control flow by precedence constraints.
SSIS provides several tasks out of the box that correspond to the majority of problems;
however, it also allows you to develop and use your own custom tasks.
Event Handlers
Managed by the SSIS Runtime Engine, the event handlers' responsibility is to execute tasks only
when a particular event is fired; for example, an OnError event is raised when an error
occurs. You can create custom event handlers for these events to extend package functionality
and make packages easier to manage at run time.
Containers
Containers are objects in SSIS that provide structure to packages and services to tasks. They
support repeating control flows in packages, and they group tasks and containers into
meaningful units of work. Containers can include other containers in addition to tasks. There are
four types of containers:
Control Flow
The last four components described (packages, containers, tasks, and event handlers) define
the executables that the SSIS Runtime Engine will run at run time. The last three (containers,
tasks, and event handlers) are also called control flow elements, as they define the control flow
of a package. A package will have only one control flow and may have many data flows, as we
will see later.
The control flow is the brain of a package; it orchestrates the order of execution for all its
components. The components are controlled inside it by precedence constraints.
Developer Environment
Introduction
The Integration Services developer environment is created on top of Visual Studio. In SQL
Server 2012, the environment comes under the name SQL Server Data Tools - Business
Intelligence for Visual Studio 2012. Developing packages will feel familiar to .NET developers
who use Visual Studio in their daily work; however, the look and feel of an Integration Services
package will be very different as it is a much more visual development environment.
To create an Integration Services project, open the SQL Server Data Tools. After you open it,
select File > New > Project and then, when the new project window appears, select Business
Intelligence and under it, Integration Services. When you select it, the following screen will be
displayed.
As their names suggest, the difference between them is that the first will create an empty project
in which you can start building your SSIS packages. The second one will allow you to import an
existing SSIS project and continue your development. For our examples, we will always be
using the first project type, Integration Services Project.
Now, select Integration Services Project and set its Name, Location, and Solution Name. As I
have already mentioned, this is very similar to any other Visual Studio-based project. Once you
click OK, the SSIS Designer will open and this is where all your developments are going to be
made. Welcome to the Integration Services world.
SSIS Designer
The SSIS Designer is where you create your packages, each with its control flow and data
flows. You can use SSIS Designer to drag and drop the components you want to use and
then configure them. Before creating our first package, let's take a look at the Designer.
As Microsoft advises, it's important to keep in mind that the SSIS Designer has no dependency
on the Integration Services service, the service that manages and monitors packages. The
service is not required to be running in order to create or modify packages in SSIS Designer.
However, if you stop the service while SSIS Designer is open, you can no longer open the
dialog boxes that SSIS Designer provides, and you may receive the error message "The RPC
server is unavailable."
To reset the SSIS Designer and continue working with the package, you must close the
designer, exit SQL Server Data Tools, and then reopen SQL Server Data Tools, the Integration
Services project, and the package.
Chapter 2 Packages
Introduction
Packages are the output of an Integration Services project; they are the executables you are
going to run in order to process an ETL task. A package is best defined as a collection of
connections, control flow elements, data flow elements, event handlers, variables, parameters,
and settings that you assemble either by using the graphical design tools that SQL Server
Integration Services provides or by building it programmatically.
These tasks will be pulled together using the SSIS Designer; it is a very easy process of
connecting them using arrows. However, internally, the Integration Services engine will use
precedence constraints to not only connect the tasks together but to also manage the order in
which they execute. The engine does this based on what happens in each task or based on
rules that are defined by the package developer.
This is a basic definition of a package. It is important to understand that package development
is not just about adding tasks. When you finish a package, you can add advanced features such
as logging and variables to extend the package functionality. When you finish the package
developments, they need to be configured by setting package-level properties that implement
security, enable restarting of packages from checkpoints, or incorporate transactions in package
workflows.
The package output file has the .dtsx file extension, and it holds an XML structure describing
all the operations that need to execute in order to meet the developer's needs. Similar to other
.NET projects, this file is generated by the development environment. It can then be saved and
deployed to a SQL Server instance so that you can schedule its execution. However, this is not
the only way to execute these packages. You can, for example, use PowerShell scripts to get
the same results, or even use the SSISDB stored procedures.
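For example, a package deployed to the SSIS catalog can be started with the SSISDB stored procedures. Here is a minimal sketch; the folder, project, and package names are hypothetical:

```sql
DECLARE @execution_id BIGINT;

-- Create an execution for a package deployed to the SSIS catalog.
-- ETLProjects, MyProject, and Package.dtsx are hypothetical names.
EXEC SSISDB.catalog.create_execution
    @folder_name  = N'ETLProjects',
    @project_name = N'MyProject',
    @package_name = N'Package.dtsx',
    @execution_id = @execution_id OUTPUT;

-- Start the execution (it runs asynchronously on the server).
EXEC SSISDB.catalog.start_execution @execution_id;
```

Note that this approach applies to the project deployment model, which we will cover in a later chapter.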
Note: You can open the Import and Export Wizard by using the Run window and typing
DtsWizard.
Although the wizard can be opened from outside the SSIS Designer, we will use it inside the
SSIS Designer to create this first Hello World package. To do so, right-click the SSIS Packages folder
in the Solution Explorer and choose the option SSIS Import and Export Wizard, as shown in
the following figure.
Once you click Next, you will be prompted with a similar screen to configure the target
database. You should make the same configurations you did for the source and click Next.
In our example, we will choose the second option: writing a simple query that retrieves just
two columns from a source table. Once again, this is not a best practice because, if the target
columns changed, we would need to open the package, change the query, save it, and redeploy
it. However, for this simple Hello World, it won't be a problem.
Once you complete this step, click Next and a new screen will be displayed. This screen allows
you to select the tables and views you want to copy to the target database. Because we have
created a custom SQL query, you will only have one option (which is a Query object as you
can see in the following figure).
Custom Packages
Introduction
Although the Import and Export Wizard is a very powerful tool for simple ETL jobs, in most
cases, when using Integration Services, you are going to need more complex structures with
execution logics that involve more than just a single data flow. You may need to execute SQL
queries to create support tables, send emails if something went wrong in the process, make
several data transformations, and even process multiple data sources to create a single insert.
By using the Import and Export Wizard, you are using just a small percentage of Integration
Services capabilities.
The following chapters will show you how SSIS can easily become your best friend in extract,
transform, and load operations using some out-of-the-box, reusable components. These
components will save you a lot of time and allow you to maintain and evolve your projects. It's
important to understand that most of the development logic provided by SSIS could be
reproduced using SQL queries and programming languages such as C# or PowerShell (with
some considerable effort). However, the beauty of SSIS is that, without developing a single line
of code, you can create fantastic data-oriented solutions without losing time thinking about how
to implement an algorithm. You just need to focus on the most important thing: creating a stable
solution that's easy to maintain and understand, using a very friendly design environment.
Package Explorer
The Package Explorer is a very useful tool for getting an overview of all your package's objects.
With its hierarchical view, you can navigate between all of your design panes with their tasks,
connections, containers, event handlers, variables, and transformations. It basically summarizes
your developments and allows you to easily navigate between the properties of all your objects
and change them.
To access the Package Explorer, open a package, click the Package Explorer tab, and expand
the objects you want to edit. Or just follow the structure as a single view.
Package Properties
Like any other object clicked in a Visual Studio-based solution, when you open a package
in the Integration Services design environment, that object gets the focus of the IDE, so the
Properties window reflects that particular object's properties; in this case, the package's.
These properties allow you to change basic attributes such as the package name and version,
but they also allow you to add important attributes such as passwords required to execute
the package.
In this section, I will guide you in the analysis of every property of packages in Integration
Services so that you can understand its important features. The first thing is to clarify where you
can find these properties. So, start by opening a package and then look at the Properties
window. The following figure shows the package properties.
Checkpoints
Execution
Forced Execution Value
Identification
Misc
Security
Transactions
Version
Before explaining each group of properties, there is an important topic I must discuss:
checkpoints. When activated, checkpoints force SSIS to maintain a record of the control flow
executables that have successfully run. In addition, SSIS records the current values of
user-defined variables. This context is stored in an XML file whose location is defined in the
package Properties window. If a package then fails during execution, SSIS will reference the
checkpoint file when you try to rerun the package. It first retrieves the variable values as they
existed prior to the failure and then, based on the last successful executable that ran, starts
running the package where it left off. That is, it continues executing from the point of failure,
assuming we've addressed the issue that caused the failure in the first place.
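As a sketch, enabling this behavior involves setting three package-level properties in the Properties window (the file path shown here is hypothetical):

```
CheckpointFileName = D:\SSIS\Checkpoints\MyPackage.xml
CheckpointUsage    = IfExists
SaveCheckpoints    = True
```

With CheckpointUsage set to IfExists, the package starts from the checkpoint file when one exists and from the beginning otherwise.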
My goal in this section is more than teaching you how to apply these properties and how they
can be used in your scenarios; my goal is to make you aware of their existence. This way, when
a particular problem appears, you will know its source. The following sections will help you
understand which properties exist and for what reason.
Checkpoints
Execution
Identification
Misc
LoggingMode: A value that specifies the logging behavior of the package. The values are
Disabled, Enabled, and UseParentSetting.
OfflineMode: Indicates whether the package is in offline mode.
SuppressConfigurationWarnings: Indicates whether the warnings generated by configurations
are suppressed.
UpdateObjects: Indicates whether the package is updated to use newer versions of the objects
it contains, if newer versions are available.
Security
Transactions
Packages use transactions to bind the database actions that tasks perform into atomic units. By
doing this, they maintain data integrity. In other words, they allow developers to group tasks
together to be executed as a single transaction.
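In practice, this behavior is driven by the TransactionOption property that packages, containers, and tasks expose. A brief sketch of its three values:

```
TransactionOption = Required      -- joins the parent transaction, or starts a new one if none exists
TransactionOption = Supported     -- joins the parent transaction if one exists (the default)
TransactionOption = NotSupported  -- never participates in a transaction
```

Setting Required on a container and Supported on the tasks inside it is a common way to run those tasks as one atomic unit.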
Version
And that's it. While some properties seem useful only when reading their definitions, others will
show their applicability only when your needs demand it. Integration Services is an aggregation
of many components and tools; some projects might require you to use all of them, while others
demand just a small set.
Now it's time to start digging into the main components of this platform and start developing our
packages. The first thing we need to learn about is the control flow. We will be learning all the
control flow tasks, how you can connect them, and even about how these tasks connect to
external systems.
Favorite Tasks
This is your personal, customizable group within the control flow tasks toolbox. By default, it
contains two of the most often used tasks. The first is the data flow task, which retrieves data
from a source, transforms it, and inserts it into destinations such as databases or files. The
second is the execute SQL task, which allows you to run DML (manipulation), DDL (definition),
and DCL (control) commands against a relational database.
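To illustrate the three command categories the execute SQL task can run, here is a small sketch; the table and login names are hypothetical:

```sql
-- DDL (definition): create a staging table for the load
CREATE TABLE dbo.StagingCustomers (
    CustomerId   INT NOT NULL,
    CustomerName NVARCHAR(100) NOT NULL
);

-- DML (manipulation): populate it from a source table
INSERT INTO dbo.StagingCustomers (CustomerId, CustomerName)
SELECT CustomerId, CustomerName FROM dbo.Customers;

-- DCL (control): grant read access to a reporting login
GRANT SELECT ON dbo.StagingCustomers TO ReportingUser;
```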
You can move your task group by right-clicking it and choosing one of the available move
options as shown in the following figure.
Common
The Common tasks group includes the most commonly used tasks in Integration Services
projects. The following table explains each task.
Table 2: Common Tasks

Analysis Services Processing Task: This task allows you to process SQL Server Analysis
Services objects such as cubes, dimensions, and mining models.
Expression Task: This task evaluates an expression at run time, typically to set the value of a
variable.
FTP Task: This task performs file operations against an FTP server, such as uploading and
downloading files.
Script Task: This task runs custom .NET code to perform work that the built-in tasks do not
cover.
XML Task: This task works with XML documents, allowing you to validate, transform, merge,
and query XML data.
Containers
Containers are not tasks; they are used to group tasks together logically into unique units of
work. In this way, you can treat several tasks as one in terms of flow logic. There are several
advantages to using containers to group related work unit tasks. One advantage is the ability
to define the scope of variables and event handlers at a container's level or, even better, to
iterate over a predefined enumeration until a condition is verified. There are four types of
containers.
You can find the containers in the SSIS Toolbox, inside the Containers group. The following
table explains each of the four containers.
Table 3: Containers

Task Host Container: This is not a usable container. It encapsulates a task every time you add one to the control flow. This means that every time you add a task to the control flow, you are creating a new container.

For Loop Container: This container repeats its control flow until an evaluation expression returns false.

Foreach Loop Container: This container repeats its control flow once for each member of an enumerator, such as the files in a folder or the rows in a table.

Sequence Container: This container groups tasks and other containers into a single unit of work that can share variables, transactions, and event handlers.
Other Tasks
The Other Tasks group includes infrequently used tasks in Integration Services projects.
However, they are as important as, or even more important than, the common ones. If you
master these tasks, you will be able to do pretty much anything against a SQL database without
writing a line of code. The following table explains each of the tasks.
Table 4: Other Tasks

Analysis Services Execute DDL Task: This task allows you to run data definition language statements against an Analysis Services instance to create, alter, or drop objects such as cubes and dimensions.

Maintenance Cleanup Task: This task allows you to remove leftover maintenance files, such as old database backup files and maintenance plan reports.

Rebuild Index Task: This task allows you to rebuild indexes to reorganize metadata and index pages. This improves performance of index scans and seeks. This task also optimizes the distribution of data and free space on the index pages, allowing faster future growth.

Transfer SQL Server Objects Task: This task allows you to transfer one or more types of objects between instances of SQL Server.
Precedence Constraints
Introduction
Precedence constraints are used by the Integration Services control flow component to define the order of execution of each task and container, manage the workflow of a package, and handle error conditions. There are three main types of precedence constraints, which can be identified by their colors:
Success (green): the next task runs only if the previous one succeeds.
Failure (red): the next task runs only if the previous one fails.
Completion (blue): the next task runs when the previous one completes, whether it succeeds or fails.
There is also an interesting concept called multiple constraints, which comes into play when a task has multiple inputs, each governed by a different precedence constraint. Grouping your constraints enables you to implement complex control flow in packages; however, you need to tell SSIS how to evaluate the combination of all precedence constraints.
To do so, you have two options: grouping them using the AND logical operator, in which all constraints must evaluate to true, or using the OR logical operator, in which only one of the precedence constraints must evaluate to true.
If you choose any evaluation operation other than Constraint, Integration Services will allow you
to open the Expression Builder in the Precedence Constraint Editor so that you can develop the
expression being evaluated. The following figure shows the Expression Builder and an example
expression to evaluate. In this example, I use a preset variable to check if there are records to
process.
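As an illustration, an expression of the kind you might build in this editor (the variable name here is an assumption for the example) evaluates to true only when a counting variable holds at least one record:

```
@[User::RecordCount] > 0
```

When this expression is combined with a Success constraint using the Expression and Constraint evaluation operation, the next task runs only if the previous one succeeded and there are records to process.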
Connection Managers
Introduction
Connection managers are components that allow you to connect to external platforms in order to retrieve, insert, or perform operations on their data. Integration Services allows you to specify two types of connection managers: one at a solution level and the other at a package level. If you define a connection at a package level, only the control flow and data flow of that specific package will be able to use it. On the other hand, if you define it at a solution level, then all control flows and data flows of all packages in that solution will be able to use it. The following figure shows the Add SSIS Connection Manager window.
ADO
ADO.NET
Cache
DQS Server
Excel
File
Flat File
FTP
HTTP
MSMQ (Message Queues)
MSOLAP100 (Analysis Services)
MULTIFILES (Multiple Files)
MULTIFLATFILES (Multiple Flat Files)
ODBC
OLE Database (SQL Server)
SMO
SMTP
SQL Mobile
WMI
The OLE Database (SQL Server) entry emphasized in bold is the one whose use I will explain. All the connection managers work similarly, as they are wizards created to help you configure your connections faster and more safely, thus avoiding errors.
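For reference, a connection created with this wizard against a local SQL Server instance typically resolves to a connection string along these lines (the server and database names here are placeholders, and the provider version depends on what is installed):

```
Provider=SQLNCLI11.1;Data Source=localhost;Initial Catalog=MyDatabase;Integrated Security=SSPI;
```

You rarely have to type this by hand; the wizard builds it from the values you select, but seeing the result helps when you later parameterize connections with expressions.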
Event Handlers
Introduction
Event handlers allow you to capture and handle the events raised while Integration Services executes an object. After defining the event you want to capture, SSIS allows you to develop flows to be executed when the event is detected. It's very important to be aware of the events you are able to capture in order to keep your package's execution from failing and your ETL process from stopping.
To better manage the event handlers' configuration inside a package, SSIS provides a specific tab inside the package designer. To access it, select the Event Handlers tab as shown in the following figure.
Troubleshooting
Troubleshooting is important in every technology. You do it while working with programming languages as well as when working with databases, and Integration Services isn't an exception. The first step in handling errors inside your packages' execution is to correctly use precedence constraints and event handlers.
However, these two mechanisms aren't always enough to identify and solve problems in your packages' execution. Sometimes you need to place breakpoints, or log errors to predefined databases, in order to evaluate the value of a variable, record values, and take the action needed to fix things if the recorded values aren't the expected ones.
Breakpoints
Let's start by analyzing breakpoints. SSIS brings its own breakpoints so that you can stop the package's execution at a particular step and evaluate the state of the variables and other components. They work like they do in any other programming language; if you have used Visual Studio for programming in any language, you are familiar with the breakpoints mechanism. The main difference is that in those cases you define them on code lines, while in SSIS you define them on tasks, containers, or packages. As you can see, they aren't available inside data flows.
To define breakpoints, you just need to right-click the component you want and select Edit Breakpoints. This will open the Set Breakpoints window where you will define the breakpoints.
Logging
Once again, as in any programming language or information system platform, logging is essential to understanding errors, execution times, and much more without having to debug the solution. In SSIS, the logging mechanism is very easy to use and understand. By simply selecting a few check boxes and configuring some components, you can start logging your ETL process executions in XML or in a SQL Server table. This is possible because SSIS has a built-in set of features that captures details about package execution.
You have several options for output objects when logging your information. You can generate text files or XML files, log to SQL Server databases, and even write to the Windows Event Log. To show you how simple it is to activate logging, in the next section I will create a demo that writes XML files with your package execution logs.
After saving and closing the windows, you can now test this log mechanism. To do so, run the package and open the logging file. If everything goes well, your file should look like the following figure.
The next step is to add the data flow task to the package control flow and connect the previous execute SQL task to it using precedence constraints. When adding the data flow task, you don't configure it in place; instead, when you double-click it, SSIS takes you to the data flow designer so that you can develop it according to your requirements. The final simple control flow is shown in the following figure.
designed using a drag-and-drop designer; you then double-click them to open their configuration options. Now let's take a look at all the available objects.
Favorites
The Favorites group contains two of the most important objects; they allow you to connect to
external data sources or destinations. You can add new objects to this group by right-clicking an
object and selecting Move to Favorites.
Table 5: Data flow transformations in the Favorites group

Destination Assistant: This object helps you create a destination component, together with its connection manager, for loading data into external destinations.

Source Assistant: This object helps you create a source component, together with its connection manager, for extracting data from external sources.
Common
The Common tasks group includes objects commonly used in data flows. Once again, you can
move objects to this group by right-clicking them and selecting Move to Common. The
following table explains each of the objects that you can find in it.
Table 6: Common data flow transforms

Aggregate: This object applies aggregate functions, such as SUM and AVG, to grouped rows, similar to a GROUP BY clause.

Conditional Split: This object routes each input row to one of several outputs based on expressions.

Data Conversion: This object converts the data type of input columns and copies the results to new output columns.

Derived Column: This object creates new column values by applying expressions to input columns.

Lookup: This object joins each input row with a matching row from a reference data set.

Merge: This object combines two sorted data sets into a single sorted output.

Merge Join: This object joins two sorted data sets using a FULL, LEFT, or INNER join.

Multicast: This object distributes every input row to every one of its outputs.

OLE DB Command: This object runs a SQL statement for each row in the data flow.

Row Count: This object counts the rows that pass through it and stores the total in a variable.

Script Component: This object hosts custom .NET code acting as a source, transformation, or destination.

Slowly Changing Dimension (SCD): This object starts a wizard that builds the flow for loading a slowly changing dimension.

Sort: This object sorts input rows in ascending or descending order.

Union All: This object combines the rows from multiple inputs into a single output.
Other Transforms
The Other Transforms group includes objects less frequently used in data flows. Although they are not used as often, they can be equally important to your ETL projects.
Table 7: Other Transforms

Audit: This object adds columns with package execution information, such as the package name and execution start time, to the data flow.

Cache Transform: This object writes data from the data flow to a cache for later use by a Lookup.

CDC Splitter: This object routes change rows coming from a CDC source to different outputs depending on whether they represent inserts, updates, or deletes.

Character Map: This object applies string operations, such as uppercase or lowercase conversion, to character columns.

Copy Column: This object adds copies of input columns to the output.

DQS Cleansing: This object corrects data by applying Data Quality Services rules from a knowledge base.

Export Column: This object writes data from columns in the data flow to files.

Fuzzy Grouping: This object allows you to identify potential duplicate rows, and helps standardize the data by selecting canonical replacements.

Fuzzy Lookup: This object matches input rows against a reference table using fuzzy rather than exact matching.

Import Column: This object reads data from files into columns in the data flow.

Percentage Sampling: This object creates a sample data set by selecting a percentage of the input rows.

Pivot: This object turns a normalized data set into a less normalized, pivoted version.

Row Sampling: This object creates a sample data set by selecting a fixed number of input rows.

Term Extraction: This object extracts terms (nouns and noun phrases) from text columns.

Term Lookup: This object counts how often terms from a reference table occur in text columns.

Unpivot: This object turns a pivoted data set into a more normalized version.
Other Sources
The Other Sources group includes other source objects used in data flows.
Table 8: Other Sources

ADO.NET Source: This object reads data using an ADO.NET connection manager.

CDC Source: This object reads change rows from SQL Server change data capture tables.

Excel Source: This object reads data from Excel worksheets or ranges.

ODBC Source: This object reads data using an ODBC connection manager.

OLE DB Source: This object reads data from an OLE DB provider such as SQL Server.

XML Source: This object reads data from XML files.
Other Destinations
The Other Destinations group includes other destination objects used in data flows.
Table 9: Other Destinations

ADO.NET Destination: This object allows you to load data into an ADO.NET-compliant database that uses a database table or view.

Data Mining Model Training: This object allows you to train data mining models by passing the data that the destination receives through the data mining models' algorithms.

Dimension Processing: This object loads and processes an Analysis Services dimension.

Excel Destination: This object loads data into Excel worksheets or ranges.

ODBC Destination: This object loads data using an ODBC connection manager.

OLE DB Destination: This object loads data into an OLE DB provider such as SQL Server.

Partition Processing: This object loads and processes an Analysis Services partition.
Data Viewers
Introduction
Data viewers are great debugging components because they allow you to see the data flowing from one data flow component to another. As I wrote before, paths connect data flow components by connecting the output of one component to the input of another. Data viewers operate on these paths, allowing you to intercept the communication and watch the input that will be used by the next component or, seen from the other side, the data set that the previous component produced. When you run the package, execution will stop at the data viewer and only continue when you tell it to. By using data viewers, you can evaluate whether or not the result of a data flow component meets your requirements and, if not, correct it.
An important but logical prerequisite of using data viewers is that you must have at least one data flow in your package control flow and, inside the data flow, at least two connected components.
Source
Conversion
Derived Column
Destination
After configuring the source object, click OK to close the editor and double-click the
Conversion object. This will open its configuration screen where you can create expressions to
transform the columns as needed. In this screen, select the columns from the source table or
view you want to convert. In the conversion grid, select the destination data type and size.
Note: This is just an introductory example; it doesn't have any error control or validations. It is meant to show you how to start developing packages with data flows in their control flows.
Updating properties of package elements at run time, such as the number of concurrent
executables that a Foreach Loop container allows.
Including an in-memory lookup table such as running an Execute SQL Task that loads a
variable with data values.
Loading variables with data values and then using them to specify a search condition in
a WHERE clause. For example, the script in a script task can update the value of a
variable that is used by a Transact-SQL statement in an Execute SQL Task.
Loading a variable with an integer and then using the value to control looping within a
package control flow, such as using a variable in the evaluation expression of a For
Loop container to control iteration.
Populating parameter values for Transact-SQL statements at run time, such as running
an Execute SQL Task and then using variables to dynamically set the parameters in a
Transact-SQL statement.
Building expressions that include variable values. For example, the Derived Column
transformation can populate a column with the result obtained by multiplying a variable
value by a column value.
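To make the parameter-mapping points concrete, this is a sketch of a parameterized statement you might place in an Execute SQL Task; each `?` placeholder is mapped to a package variable in the task's Parameter Mapping page (the table and variable names here are hypothetical):

```sql
-- Parameter 0 would be mapped to a variable such as User::LastLoadDate
SELECT OrderId, OrderDate, Amount
FROM dbo.Orders
WHERE OrderDate > ?;
```

At run time, SSIS substitutes the current value of the mapped variable before sending the statement to the database, so the same package can extract different date ranges on each execution.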
Although variables are one of the key concepts of this dynamic mechanism, several other components interact with variables to bring you the flexibility you need inside your package developments. Expressions, for instance, are used to evaluate a value in order to decide which path of execution the package should take. This value can be defined by a variable or by a combination of identifiers, literals, functions, and operators.
Another nice feature that SQL Server 2012 provides is parameters which, at first look, might seem like variables because they are also used in expressions to condition a certain logic based on their stored data (and, inside a package, these parameters work just like variables). However, there is a small detail that makes a huge difference: parameters are defined externally to the package and their values are supplied from outside, as opposed to variables, which are defined and used internally. Parameters can be defined in two scopes: package-level, in which they are available to a particular package, or project-level, in which all packages of a project can use them.
If you need a simple way to decide between a variable and a parameter, ask whether you want to set the value of something from outside the package or whether you want to create and store values only within a package. In the first case, use parameters; otherwise, use variables.
SCD1, in which an attribute value will be overwritten if the record has changed (used if
historical data is not needed).
SCD2, in which a new record will be created whenever an attribute value changes (used
to keep all historical changes).
SCD3, in which N columns are created in a dimension to keep N changes in it (only
records the last N changes).
Some authors refer to another type when no change is applied to an attribute (fixed
attribute); the SCD type is known as 0. I will refer to this fourth type as SCD0.
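To illustrate SCD2 outside the wizard, here is a minimal T-SQL sketch of the classic pattern: close the current version of the record, then insert the new version. All table, column, and variable names here are hypothetical, and the begin/end date columns follow the naming suggested later in this chapter:

```sql
-- Close the current version of the changed record
UPDATE dbo.DimCustomer
SET SCD_DT_END = GETDATE()
WHERE CustomerBK = @CustomerBK
  AND SCD_DT_END IS NULL;

-- Insert the new version with an open end date
INSERT INTO dbo.DimCustomer (CustomerBK, Address, SCD_DT_BEGIN, SCD_DT_END)
VALUES (@CustomerBK, @NewAddress, GETDATE(), NULL);
```

The SSIS SCD wizard generates an equivalent flow for you, but knowing the underlying pattern helps when you need to tune or replace it.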
For some time, the integration software available on the market didn't support any kind of wizard or automation for developing SCD ETL packages; this meant that if you wanted to handle an SCD inside your packages, you had to create the entire data flow logic to handle it.
However, SSIS gives you the SCD component inside data flows so that you can use a wizard to create the flow for handling SCDs. But there are two important things to note.
First, the SCD component doesn't support SCD3, so if you want to use that type, you need to develop the flow yourself. And because the SCD component presents several performance issues, it's better to always think twice before using it.
Second, if you want to use SCD2, you have to create two new columns in your dimension: the
beginning date of the record life (SCD_DT_BEGIN, for example) and the end date of the record
life. The latter represents the time in which a new record with different attribute values is
identified in your ETL process (SCD_DT_END, for example).
If you aren't familiar with SCD dimensions, another important thing to note is that the same dimension can contain SCD1 attributes, SCD2 attributes, and even SCD3 attributes at once. This is because when a change occurs in a specific attribute, you might not need to record its history (SCD2) but might simply want to update its value (SCD1). These requirements will vary from business to business.
dimension called DimReferee. Both are very simple dimensions with few attributes. The source
table is shown in the following figure.
As I have explained before, SCD3 is not supported by the SSIS component. In this example, I will use all of the other types. After setting a referee's name, I don't plan to do anything else with it because I assume that the name won't change, so I make it a fixed attribute (SCD0). However, when a referee's marital status changes, I want to overwrite the value with the new one, so I set it as a changing attribute (SCD1). When the referee's address changes, I want to keep the old value, so it needs to be a historical attribute (SCD2).
I prefer the last option as it is the closest to the SCD flow in the package, reflecting the most
recent date. However, for this example, I used the CreationDate. After selecting the variable,
click Next.
All columns with a change type are null: This means that SSIS treats a record as an inferred member when every column configured with an SCD change type holds a null value.
Use a Boolean column to indicate whether or not the current record is an inferred member: In this case, SSIS will use a column created by you to flag whether or not the current record is an inferred member.
Once you have added a conversion component to ensure data type consistency, your data flow
should look like the following figure.
the records that have changed since the previous night. In this way, your ETL process will be faster and much more efficient.
This mechanism was introduced in SQL Server 2008 and, since then, Microsoft has made many improvements to it. Proof of this is the set of new CDC components that SSIS 2012 includes. CDC is not the only option available; however, it probably is the most efficient one, and I will explain why.
Over time, database developers have created mechanisms to solve this problem. Some of those mechanisms are very smart; however, they bring particular problems with them. To understand these problems, let's look at the options available for capturing changed data in a specific time window. Here are the most used alternatives:
Audit Columns: This technique requires you to add datetime columns to each table you want to monitor for changes in the operational systems. By adding a start date and an end date to those tables, you can update them every time a record changes and then extract only the records whose end date is null or later than the last ETL round. This technique also requires triggers in the source tables, or changes to external applications, to keep the audit columns up to date with the latest changes.
o PROBLEM: In some systems, the size and number of tables, as well as DBAs with bad tempers, make this an impossible solution due to cost inefficiency.
Triggers: This technique requires you to add triggers to each of the source tables you want to monitor. Every time a change occurs, the trigger stores the changed record's business key in a log table. Your ETL process will then process only the records from the log table.
o PROBLEM: This will affect the performance of the operational systems, and you will also have to maintain the triggers on all the monitored tables. Again, you will need to change the operational system.
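As a sketch of the trigger alternative (all object names here are hypothetical), the idea is to write the business key of every affected row to a log table:

```sql
CREATE TRIGGER trg_Orders_Log
ON dbo.Orders
AFTER INSERT, UPDATE, DELETE
AS
BEGIN
    -- Record the keys of inserted or updated rows
    INSERT INTO dbo.Orders_ChangeLog (OrderId, LoggedAt)
    SELECT OrderId, GETDATE() FROM inserted;

    -- Record the keys of deleted rows
    INSERT INTO dbo.Orders_ChangeLog (OrderId, LoggedAt)
    SELECT OrderId, GETDATE() FROM deleted;
END;
```

Note that an update fires both branches (the row appears in both inserted and deleted), so a real implementation would deduplicate the log; the sketch only shows why this approach adds synchronous work to every write on the operational system.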
Now comes the CDC technique, which solves the previously mentioned problems. Although you still need to enable CDC on each table whose changes you want to track, it does not affect the schema of the tables, so the impact on the operational systems' database schemas is smaller than with audit columns or triggers.
Another interesting aspect of this technique is that it runs asynchronously. By this I mean that the job that reads the SQL Server log to detect changes runs when the system is idle, removing overhead from the process (compared to triggers and audit columns, which run synchronously).
The way the SQL Server CDC mechanism works together with Integration Services CDC
components is also a huge benefit because you are able to extract the changed records into a
package control flow and also know what kind of operation has occurred in that record. In this
way, you can make separate developments over an inserted, updated, or deleted record.
Last but not least comes the usability of the CDC API. It is very easy to activate this mechanism on a SQL Server table and then handle it inside an SSIS package. The steps will be shown in a brief demonstration so that you can start using CDC to optimize the extraction stage in your ETL projects.
Using the CDC API requires the following steps in order to start processing only changed records inside SSIS packages:
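The enabling statement itself falls on a page not reproduced here; for the sample database used in this chapter, it would presumably be the standard database-level enable call:

```sql
USE STG_FOOT_STAT; -- DATABASE_NAME
GO
-- Enable change data capture on the current database
EXEC sys.sp_cdc_enable_db;
GO
```

This procedure must be run by a member of the sysadmin fixed server role.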
After running the previous command, SQL Server will create a special schema, cdc, which will be used to store all the objects it needs to manage the mechanism. Among these objects you will find all your shadow tables. You are now able to check whether CDC is correctly enabled for that particular database. To do so, you can run the following SQL query, which retrieves the name and the value of the is_cdc_enabled attribute for your current database.
USE STG_FOOT_STAT; -- DATABASE_NAME
GO
--Check if CDC is enabled on the database
SELECT name, is_cdc_enabled
FROM sys.databases WHERE database_id = DB_ID();
Running the previous query will show the is_cdc_enabled attribute with one of two values:
0: Not enabled
1: Enabled
hourly, etc.). You have two options to activate the SQL Server Agent: use the SQL Server Configuration Manager or use the SQL Server Management Tools.
Let's start with the first option. Open the SQL Server Configuration Manager and, in the explorer on the left, select SQL Server Services. When you do this, you will see all your server's services in the right pane, along with an indication of whether or not they are running. Identify the SQL Server Agent of the server you want to activate, which in my case is the MSSQLSERVER instance. Right-click it and select Start if it isn't running yet. If everything worked, you should then see that the agent is running.
Figure 84: Starting SQL Agent using SQL Server Management Tools
As you can see in the stored procedure, I have used five parameters. However, there are some
more that will help you align the CDC with your requirements. You need to understand all of
them to execute it correctly. The definition for each one is as follows.
sys.sp_cdc_enable_table
[ @source_schema = ] 'source_schema',
[ @source_name = ] 'source_name' ,
[ @role_name = ] 'role_name'
[,[ @capture_instance = ] 'capture_instance' ]
[,[ @supports_net_changes = ] supports_net_changes ]
[,[ @index_name = ] 'index_name' ]
[,[ @captured_column_list = ] 'captured_column_list' ]
[,[ @filegroup_name = ] 'filegroup_name' ]
[,[ @allow_partition_switch = ] 'allow_partition_switch' ]
The following table gives you a short definition of the parameters you can use in the
sys.sp_cdc_enable_table stored procedure.
@source_schema: Name of the schema to which the source table belongs.
@source_name: Name of the source table to enable for change data capture.
@role_name: Name of the database role used to gate access to the change data; NULL means no gating role is used.
@capture_instance: Name of the capture instance used to name the change-tracking objects; by default it is derived from the schema and table names.
@supports_net_changes: Indicates whether support for querying net changes is enabled; this requires a primary key or a unique index on the source table.
@index_name: Name of a unique index used to uniquely identify rows in the source table.
@captured_column_list: List of the source table columns to capture; by default, all columns are captured.
@filegroup_name: Filegroup to be used for the change table created for the capture instance.
@allow_partition_switch: Indicates whether the ALTER TABLE ... SWITCH PARTITION command can be executed against the enabled table.
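Putting the parameters together, a typical call for the staging table used later in this chapter might look like the following; the schema and table names are assumptions based on the shadow table queried further on:

```sql
USE STG_FOOT_STAT;
GO
EXEC sys.sp_cdc_enable_table
    @source_schema = N'dbo',
    @source_name = N'STG_FOOT_STAT_TB',
    @role_name = NULL,                        -- no gating role
    @capture_instance = N'STG_FOOT_STAT_TB',
    @supports_net_changes = 1;                -- requires a primary key or unique index
GO
```

With @role_name set to NULL, any user who can read the database can query the change data; in production you would usually supply a role name instead.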
If everything in your execution is correctly defined, you will see the following message in the output window of SQL Server.
The following figure shows the result of executing the previous command, which returns 0 if the table is not being tracked and 1 if it is. In this case, the table is being tracked, as it was supposed to be after running the enable stored procedure.
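The table-level check referred to here can presumably be performed with a query such as the following, which returns 1 in is_tracked_by_cdc for tracked tables (the table name is an assumption):

```sql
USE STG_FOOT_STAT;
GO
SELECT name, is_tracked_by_cdc
FROM sys.tables
WHERE name = 'STG_FOOT_STAT_TB';
```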
With the source tables being tracked, you can now take advantage of the shadow tables directly from SQL Server, because it allows you to query them and see the changes that have occurred. The CDC mechanism creates and names these shadow tables following a standard pattern, source table name + _CT, as in myTable_CT. This means that by knowing the source table name, you also know the shadow table name. To query it, you can use a simple SQL SELECT statement, as in the following code example.
SELECT * FROM cdc.STG_FOOT_STAT_TB_CT
Because you haven't made any changes to the source table yet, the shadow table is empty as well. However, you can already see the special columns in this shadow table.
If you re-query the shadow table, you will see the changes tracked in it.
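As a sketch (the column names are assumptions), after a change to the source table the shadow table starts filling up:

```sql
-- Make a change in the tracked source table
UPDATE dbo.STG_FOOT_STAT_TB
SET SomeColumn = 'new value'   -- hypothetical column
WHERE Id = 1;

-- Re-query the shadow table; __$operation indicates the change type
-- (1 = delete, 2 = insert, 3 = value before update, 4 = value after update)
SELECT __$operation, *
FROM cdc.STG_FOOT_STAT_TB_CT;
```

Note that an update produces two rows in the shadow table: one with the values before the change and one with the values after it.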
Process the Changed Data in SSIS and Use the SSIS CDC Component to Make It Easier
Now that you understand CDC, let's go back to Integration Services and make use of it. As previously mentioned, it's one of the most important concepts to understand: without it, processing the ETL of large data volumes will be very inefficient and, consequently, slow. Of course, this only makes sense when you have periodic ETL processes. If you only run a single execution, such as a data migration, CDC isn't useful.
Figure 91: Configuring the initial load start CDC Control task
In this window, you will need to make several configurations:
State name: Name of the state table to be used for storing the CDC state. You can create a new one using the New button.

Mark initial load start: This option records the first load starting point. This operation is used when executing an initial load from an active database without a snapshot. It is invoked at the beginning of an initial-load package to record the current log sequence number (LSN) in the source database before the initial-load package starts reading the source tables. This requires a connection to the source database.

Mark initial load end: This option records the first load ending point. This operation is used when executing an initial load from an active database without a snapshot. It is invoked at the end of an initial-load package to record the current LSN in the source database after the initial-load package finishes reading the source tables. This LSN is determined by recording the time when this operation occurred and then querying the cdc.lsn_time_mapping table in the CDC database, looking for a change that occurred after that time.

Mark CDC start: This option records the beginning of the CDC range. This operation is used when the initial load is made from a snapshot database or from a quiescent database. It is invoked at any point within the initial-load package. The operation accepts a parameter which can be a snapshot LSN, the name of a snapshot database (from which the snapshot LSN will be derived automatically), or it can be left empty, in which case the current database LSN is used as the start LSN for the change processing package. An important note about this operation is that it can be used instead of the Mark initial load start and Mark initial load end operations.

Get processing range: This option retrieves the range of LSNs to process. This operation is used in a change processing package before invoking the data flow that uses the CDC Source. It establishes the range of LSNs that the CDC Source reads when invoked. The range is stored in an SSIS package variable that is used by the CDC Source during data flow processing.

Mark processed range: This option records the range of LSNs processed. This operation is used in a change processing package at the end of a CDC run (after the CDC data flow completes successfully) to record the last LSN that was fully processed. The next time Get processing range is executed, this position determines the start of the next processing range.
Save and close this task editor. Open the Mark initial load end task to tell the CDC mechanism that this initial load has ended. If you don't mark this end state and you try to process the next range of data, SSIS will give you an error. The configuration of this task only differs from the previous one in the operation type.
Figure 93: Configuring the Mark initial load end CDC Control task
To process incremental updates, you just need to develop new packages or reuse the current
ones and change their operation type as follows:
The next step is to configure this CDC source. This configuration involves setting the following
properties:
You can now implement the logic you need by using the correct paths and choosing the appropriate output when connecting the CDC splitter to other objects.
The first model, project deployment, is a new feature of SSIS 2012. It uses an SSIS catalog, created in a SQL Server instance, to store all the objects in a deployed project. When you deploy a project using this model, all the objects inside your project are aggregated into a single deployment file, an .ispac file, and deployed as one to the server.
In the second case, the package deployment model (which has been used since SSIS 2005), the packages aren't deployed as one; they are treated separately, so you can deploy one package at a time if you want, or use a deployment utility to deploy all of them.
Although one of the main concerns distinguishing these deployment models is the atomicity of the project objects, there are other differences between the project and package deployment models. To better understand them, take a look at the following table.
Table 11: Project deployment model versus the package deployment model

Unit of deployment
Project model: the project. Package model: the package.

Assignment of package properties
Project model: parameters are used to assign values to package properties. Package model: configurations are used to assign values to package properties.

File output
Project model: a single project deployment file (.ispac) containing all packages and parameters. Package model: individual package files (.dtsx).

Deployment
Project model: a project containing packages and parameters is deployed to the SSISDB catalog on an instance of SQL Server. Package model: packages and configurations are copied to the file system, or packages are saved to the msdb database.

Package validation
Project model: projects and packages can be validated on the server before execution. Package model: packages are validated just before execution.

Running and scheduling packages
Project model: packages are executed from the SSISDB catalog and can be scheduled with SQL Server Agent. Package model: packages are executed with the dtexec utility and can also be scheduled with SQL Server Agent.

Event handling
Project model: events produced during execution are captured automatically and saved to the catalog. Package model: events are not captured automatically; a log provider must be configured.

Environment-specific values
Project model: environment-specific parameter values are stored in environment variables. Package model: environment-specific configuration values are stored in configuration files.

CLR integration
Project model: CLR integration is required on the database engine. Package model: CLR integration is not required on the database engine.
Now that we understand the main differences between these two models, let's learn how we can use them so that our projects can go into a production environment.
which you want the project to be deployed, locate the Integration Services Catalogs folder in
the Object Explorer.
After setting all the required property values, click OK. You will now see a new catalog inside your Integration Services Catalogs folder. If you expand the folder, you'll see the new catalog folder inside.
The deployment wizard walks you through four screens:
Select Source
Select Destination
Review
Results
In the Select Source screen, you need to specify the project you want to deploy (in our case, the current project) or, alternatively, an SSIS catalog from another server. Select the project deployment file option, and SSIS will automatically set the path to the project that is currently open.
And that's it! The final screen allows you to validate the settings you have just made in the Select Source and Select Destination screens. If everything is as it should be, click Deploy to finish the deployment. After the deployment is over, go to Management Studio and expand the Integration Services Catalogs folder. You should now see your project deployed. If you want to know how to view and change your project parameters, revisit the Variables, Expressions, and Parameters chapter.
Once you finish this step, your project will be under the package deployment model. Now it's time to deploy our individual packages. To do this, open the package you want to deploy. Next, click an empty area in the control flow and then open the File menu. Select Save Copy of [package name] as shown in the following figure. It's important to be aware that you can deploy all of your packages at once by using a deployment manifest. However, like the project deployment model, this is an all-or-nothing method, which means you cannot select which packages you want to deploy; it will deploy all of them. If you want to learn more about this method, use this reference on the Microsoft TechNet website, which explains how to create a deployment utility (manifest).