Questions for Lookup transformation
Q1) How does matching happen inside the Lookup transformation?
The Lookup transformation performs an equi-join between the input data and the lookup (reference) dataset. By default an unmatched row is treated as an error, but the Lookup can be configured to redirect such rows to the "No Match Output".
If the lookup dataset contains multiple matches for a key, only the first match is returned. When the lookup dataset is cached, the transformation raises a warning (or an error) if it detects multiple matches while loading the cache.
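The first-match and no-match semantics can be illustrated with a small sketch (plain Python, not the SSIS API; the function and field names are illustrative only):

```python
# Illustrative sketch of Lookup matching: an equi-join that keeps only
# the FIRST reference match per input row and redirects unmatched rows
# to a separate "no match" output.

def lookup(input_rows, reference_rows, key):
    match_output, no_match_output = [], []
    for row in input_rows:
        # Equi-join: find the first reference row with an equal key.
        first = next((ref for ref in reference_rows if ref[key] == row[key]), None)
        if first is None:
            no_match_output.append(row)          # redirected, not an error here
        else:
            match_output.append({**row, **first})  # only the first match is used
    return match_output, no_match_output
```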
Q2) What are the inputs and outputs of a Lookup transformation?
Input: the dataset from the data source
Match Output: All matched rows
No Match Output: all unmatched rows, when the transformation is configured to redirect them here instead of to the error output
Error Output: rows that failed to compare, plus unmatched rows when they are not redirected to the No Match Output
Q3) What are the lookup cache modes available and how to use them?
Full Cache (default):
The database is queried once, during the pre-execute phase of the data flow, and the entire reference set is pulled into memory. This mode uses the most memory and adds start-up time to your data flow. The Lookup will not swap memory out to disk, so your data flow will fail if it runs out of memory.
When to use Full cache mode
1) When you’re accessing a large portion of your reference set
2) When you have a small reference\lookup table
3) When your database is remote or under heavy load, and you want to reduce the number of queries sent to the server
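As a sketch of what full-cache behaviour amounts to (ordinary Python, not the SSIS API; names are illustrative):

```python
# Full cache sketch: load the whole reference set into memory once,
# before any rows flow; afterwards no database calls are made at all.

def build_full_cache(reference_rows, key):
    cache = {}
    for ref in reference_rows:
        cache.setdefault(ref[key], ref)  # keep only the first match per key
    return cache

def lookup_full_cache(row, cache, key):
    # Hit or miss, the in-memory cache is the only source consulted.
    return cache.get(row[key])
```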
Partial Cache:
In this mode, the lookup cache starts off empty at the beginning of the data flow. When a new row comes in, the Lookup transform checks its cache for the matching values. If no match is found there, it queries the database; if a match is found in the database, the values are cached so they can be reused the next time a matching row comes in.
SSIS 2008 added a Miss Cache feature that lets you allocate a percentage of the cache to remembering rows that had no match in the database. This is useful in many situations, as it prevents the transform from querying the database repeatedly for values that don't exist.
When to use this cache mode
1) When you’re processing a small number of rows and it’s not worth the time to fully cache the dataset
2) When you have a large reference\lookup table
3) When your data flow is adding new rows to your reference\lookup table
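The partial-cache flow, including the miss cache, can be sketched like this (plain Python; `query_db` is a stand-in for the real database round-trip):

```python
# Partial cache sketch with a "miss cache": keys known to have no match
# are remembered too, so they never trigger another database query.

def make_partial_lookup(query_db):
    hits, misses = {}, set()

    def lookup(key):
        if key in hits:
            return hits[key]      # served from the hit cache
        if key in misses:
            return None           # known miss: no database round-trip
        result = query_db(key)    # cache starts empty; query on demand
        if result is None:
            misses.add(key)
        else:
            hits[key] = result
        return result

    return lookup
```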
No Cache:
As the name implies, in this mode the Lookup transform doesn't maintain a lookup cache (actually, not quite true: it keeps the last match around, as that memory has already been allocated). In most situations, this means you'll be hitting the database for every row.
When to use this cache mode
1) When you’re processing a small number of rows
2) When you have non-repeating lookup indexes
3) When your reference\lookup table is changing (inserts, updates, deletes)
4) When you have severe memory limitations
Q4) What is the use of the Character Map transformation in a Lookup scenario?
This transformation applies string operations to column data, such as changing characters from lower case to upper case, upper to lower, half width, full width, byte reversal, etc. Unlike T-SQL, the SSIS Lookup compares values case-sensitively. So before comparing two columns from the source and the reference, we can route both through a Character Map transformation and convert the data to a common format, either all lower case or all upper case.
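A minimal sketch of the idea (plain Python standing in for the Character Map "Uppercase" operation):

```python
# Uppercase both sides before the lookup, mimicking Character Map:
# the SSIS Lookup compares case-sensitively, so "abc" != "ABC" otherwise.

def character_map_upper(rows, column):
    return [{**r, column: r[column].upper()} for r in rows]

source = character_map_upper([{"code": "abc"}], "code")
reference = character_map_upper([{"code": "ABC"}], "code")
# After mapping, both values are "ABC" and the lookup will match.
```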
Questions for Merge, Merge Join, Union All transformation
Union All
INPUT: one or more streams, sorted or unsorted
OUTPUT: one unsorted stream
Merge
INPUT: one or two sorted streams
OUTPUT: one sorted stream
Merge Join
INPUT: two sorted streams
OUTPUT: one sorted stream
Q1) What is the Difference between MERGE and MERGE JOIN Transformation?
MERGE Transformation:
The data from two input paths is merged into one.
Works like UNION ALL.
Metadata for all columns needs to be the same.
Use when merging data from two data sources.
MERGE JOIN Transformation:
The data from two inputs is merged based on a common key.
Works like a JOIN (LEFT, RIGHT, or FULL).
Only the key columns' metadata needs to be the same.
Use when data from two tables with a foreign key relationship needs to be presented based on the common key.
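A rough sketch of Merge Join's result, here as an inner join on the key (the real transform streams two sorted inputs and also offers left and full outer joins; this hash-based version is just for illustration):

```python
# Merge Join sketch: combine rows from two inputs wherever the key
# columns are equal; only the key column's metadata must agree.

def merge_join(left, right, key):
    index = {}
    for r in right:
        index.setdefault(r[key], []).append(r)
    # Emit one combined row per matching left/right pair (inner join).
    return [{**l, **r} for l in left for r in index.get(l[key], [])]
```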
Q2) What is the Difference between MERGE and UNION ALL?
The Merge transformation can accept only two inputs, whereas Union All can take more than two.
The Merge transformation requires its inputs to be sorted, whereas Union All doesn't require sorted input.
The Merge transformation combines rows from each sorted dataset into the output based on values in their key columns, whereas Union All has no such condition.
Use the Union All transformation instead of the Merge transformation in the following situations:
1) The transformation inputs are not sorted.
2) The combined output does not need to be sorted.
3) The transformation has more than two inputs.
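The contrast can be sketched in a few lines (plain Python; `heapq.merge` plays the role of Merge's sorted interleave):

```python
import heapq

def union_all(*inputs):
    # Any number of inputs, no sort requirement: simple concatenation.
    return [row for stream in inputs for row in stream]

def merge_sorted(a, b, key):
    # Exactly two sorted inputs in, one sorted output out.
    return list(heapq.merge(a, b, key=lambda r: r[key]))
```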
Questions for other transformation
Q1) What is the difference between “Copy Column” and “Derived Column”?
Both transformations can add new columns.
Copy Column can create new columns only as copies of existing columns, whereas Derived Column can also create new columns without any help from existing columns (for example, from variables or literal expressions).
Derived Column can apply different transformations and data types to the new columns, whereas Copy Column cannot.
Derived Column supports an error output, whereas Copy Column does not.
Q2) What is the difference between “Data Conversion” and “Derived Column” transformations?
The Data Conversion transformation is used to convert the data type of a column. The same operation can be done in the Derived Column transformation using a type cast, but Derived Column can also add or create new columns by manipulating existing columns with expressions.
Choose Data Conversion when the requirement is only to change a data type. In other words, Data Conversion exists for developer convenience: it is a direct method, whereas in Derived Column you have to write an expression to change the data type of a column.
From 2008 onwards, the data type and length information in the Derived Column transformation is read-only: when a column is created (new or from an existing one), its data type is assigned from the expression's outcome. To change the data type itself, use the Data Conversion transformation.
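A toy contrast of the two (plain functions, not the SSIS components; names are illustrative):

```python
def data_conversion(row, column, to_type):
    # Data Conversion's job: only change the type of an existing column.
    return {**row, column: to_type(row[column])}

def derived_column(row, new_column, expression):
    # Derived Column's job: evaluate an expression, possibly creating
    # a brand-new column from existing ones.
    return {**row, new_column: expression(row)}
```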
Q3) What is difference between Multicast and Conditional Split?
1) Both transformations direct single input to multiple outputs.
2) The Multicast transformation directs every row from the input to every one of its outputs; each destination receives a full copy.
3) The Conditional Split transformation directs each input row to a single output: the first output whose expression evaluates to true, or the default output if none does.
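The routing difference can be sketched as follows (plain functions, not the SSIS API):

```python
# Multicast copies every row to every output; Conditional Split sends
# each row to the first output whose predicate is true, else to a default.

def multicast(rows, n_outputs):
    return [list(rows) for _ in range(n_outputs)]  # full copy per output

def conditional_split(rows, conditions):
    # conditions: ordered predicates; the extra last output is the default.
    outputs = [[] for _ in range(len(conditions) + 1)]
    for row in rows:
        for i, cond in enumerate(conditions):
            if cond(row):
                outputs[i].append(row)
                break
        else:
            outputs[-1].append(row)  # matched no condition
    return outputs
```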
Q4) Does using the Row Count transformation affect package performance?
The Row Count component is a synchronous component and it doesn't do anything particularly resource-intensive, so the performance degradation of your package should be negligible.
It is commonly used to capture the number of inserts, deletes and updates from each data flow; in the OnPostExecute event handler this information can then be written to a SQL Server table.
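Why the overhead is negligible can be seen from a sketch of what a synchronous row counter does (plain Python generator; the end-of-flow variable write is omitted):

```python
def row_count(rows, counter):
    # Synchronous pass-through: each row is yielded untouched while a
    # counter is incremented, so per-row overhead is a single addition.
    for row in rows:
        counter[0] += 1
        yield row
```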
Listing of all data flow transformation
DATACONVERSION: Converts a column's data type from one type to another (an explicit column conversion).
DATAMININGQUERY: Used to perform data mining query against analysis services and manage Predictions Graphs and Controls.
DERIVEDCOLUMN: Create a new (computed) column from given expressions.
EXPORTCOLUMN: Used to export an image (or other binary) column from the database to a file.
FUZZYGROUPING: Used for data cleansing by finding rows that are likely duplicates.
FUZZYLOOKUP: Used for Pattern Matching and Ranking based on fuzzy logic.
AGGREGATE: It applies aggregate functions to Record Sets to produce new output records from aggregated values.
AUDIT: Adds Package and Task level Metadata: such as Machine Name, Execution Instance, Package Name, Package ID, etc..
CHARACTERMAP: Performs SQL Server column level string operations such as changing data from lower case to upper case.
MULTICAST: Sends a copy of supplied Data Source onto multiple Destinations.
CONDITIONALSPLIT: Separates available input into separate output pipelines based on Boolean Expressions configured for each output.
COPYCOLUMN: Adds a copy of a column to the output; we can later transform the copy while keeping the original for auditing.
IMPORTCOLUMN: Reads data (such as images) from files into a column in the data flow.
LOOKUP: Performs the lookup (searching) of a given reference object set to a data source. It is used for exact matches only.
MERGE: Merges two sorted datasets into a single sorted dataset in one data flow.
MERGEJOIN: Merges two datasets into a single dataset using a join condition.
ROWCOUNT: Stores the resulting row count from the data flow / transformation into a variable.
ROWSAMPLING: Captures sample data by using a row count of the total rows in dataflow specified by rows or percentage.
UNIONALL: Merge multiple data sets into a single dataset.
PIVOT: Converts rows into columns, denormalizing a data source into a more compact form.
UNPIVOT: Converts columns into rows, normalizing the data structure, for example when building data warehouses.