A brief retrospective of my submission for Kaggle data science competition that predicts the gender and age group of a smartphone user based on their usage pattern.
Project DescriptionCredit Card Number Validator 08 (CCNV08) is a Custom SSIS Data Flow Transformation Component that determines whether the given input is a valid credit card number or not. CCNV08 takes String/Numeric input and performs a Synchronous, Non-Blocking transformation using Luhn Mod 10 algorithm and validates whether the input is a valid credit card number or not. The output is true if the input is a valid credit card number and false otherwise. CCNV08 is compatible with SQL Server 2008. An earlier version of this component which is compatible with SQL Server 2005 is available as CodePlex project CCNV05
For a detailed reference about CCNV08 check-out this blog post on Custom SSIS Data Flow Component – Credit Card Number Validator
Features of CCNV08
- Built for SQL Server 2008, source code written in C#
- Provides a similar UI as out-of-box SSIS component
- Accepts String/Numeric input that can be sourced from any out-of-box SSIS Data Source.
- No limit to the number of input columns that can be fed into the component.
- Non-Blocking, synchronous and high-speed validation using widely-used Luhn Mod10 algorithm
- Automatically creates an output collection for each corresponding input column.
- Provides an Error Collection to channel exception data
- Output columns (result) are in Boolean, which can be immediately used with conditional split component.
Deployment InstructionsFollow this blog post - Steps to Build and Deploy Custom SSIS Component to build the source code and deploy the CCNV05 component to BIDS. If you are downloading just the assembly, steps 1&2 can be skipped and start from Step 3.
Debugging InstructionsCustom SSIS Components consists of Design-time methods and Run-time methods. Debugging a Design-time method is different from a Run-time method. This blog post – Debugging Custom SSIS Components explains the techniques to debug both Design-time methods and Run-time methods.
Recent Blog PostsSome of my recent BI posts
A brief retrospective of my submission for Kaggle data science competition that forecasts inventory demand for Grupo Bimbo. Objective Grupo Bimbo is a bakery product manufacturing company that supplies bread and bakery products to its clients in Mexico on a weekly basis. Usually the sales agent calculates the supply for each product for each store. […]
While type 2 dimensions are great to track attribute history, in some instances type 2 SCD may not be appropriate. In this post, I would like to mention few common anti-patterns I have encountered around type 2 SCD.
Time Series is the historical representation of data points collected at periodic intervals of time. Statistical tools like R use forecasting models to analyse historical time series data to predict future values with reasonable accuracy. In this post I will be using R time series to forecast the exchange rate of Australian dollar using daily closing rate of the dollar collected over a period of two years.
Decision tree is a data mining model that graphically represents the parameters that are most likely to influence the outcome and the extent of influence. The output is similar to a tree/flowchart with nodes, branches and leaves. The nodes represent the parameters, the branches represent the classification question/decision and the leaves represent the outcome. Internally decision tree algorithm performs a recursive classification on the input dataset and assigns each record to a segment of the tree where it fits closest.
Box plot is an effective way to visualize the distribution of your data.It only takes a few lines of code in R to come up with a box plot.
CUBE operator in Pig computes all possible combination of the specified fields. In this post I will demonstrate the use of Cube operator to analyse energy rating of air conditioners in Hortonworks Data Platform (HDP).
This post demonstrates the use of GROUP operator to analyse credit card expenses.
Hive implements MapReduce using HiveQL. The built-in capabilities of HiveQL abstracts the implementation of mappers and reducers with a simple yet powerful SQL like query language. To demonstrate the inbuilt capabilities of HiveQL, I will be analysing hashtags from a twitter feed on Hortonworks Data Platform (HDP).
Power Query can discover and import data from websites. Often data warehouses rely on external data which is readily available in public websites for e.g. public holidays, school holidays, SIC codes, SWIFT codes, post codes etc. Power Query is perfectly suitable for such situations. Power Query can discover, fetch and transform data from a HTML table in a web page into a format that can be easily imported into data warehouses using SSIS package. It's like an ETL tool for the web page data source.
Storage cluster (HDFS) in Hadoop is also the Processing cluster (MapReduce). Azure provides two different options to store data:
Option 1: Use HDInsight cluster to store data as well as to process MapReduce requests. For e.g. a Hive database hosted in an HDInsight cluster which also executes HiveQL MapReduce queries. In this instance data is stored in the cluster’s HDFS.
Option 2: Use HDInsight cluster to only process MapReduce requests whereas data is stored in Azure blob storage. For e.g. the Hive data is stored in Azure storage while the HDInsight cluster executes HiveQL MapReduce queries. Here the metadata of Hive database is stored in the cluster whereas the actual data is stored in Azure storage. The HDInsight cluster is co-located in the same datacentre as the Azure storage and connected by high speed network.
There are several advantage of using Azure storage. Provisioning the HDInsight cluster on demand while retaining Azure storage is cost effective especially when there is spending limits on your Azure subscription.
Hive implements MapReduce using HiveQL. The built-in capabilities of HiveQL abstracts the implementation of mappers and reducers with a simple yet powerful SQL like query language. To demonstrate the inbuilt capabilities of HiveQL, I will be analysing hashtags from a twitter feed on Azure HDInsight platform.
This post is a tutorial to get started on Hive in HDInsight. The steps to be followed are given below. As a pre-requisite you would need a subscription to Microsoft Azure to try out these steps
1. Provision Azure Storage Account
2. Provision HDInsight Cluster
3. Create Hive Database and Tables
4. Prepare Data as Ctrl-A separated Text Files
5. Upload Text Files to Azure Storage
6. Load Data to Hive
7. Execute HiveQL DML Jobs
Partitions improve the query response of SSAS cube by narrowing the MDX query to a pre-defined subset of data. Usually cubes are developed with few basic partitions to begin with and new partitions are added over time. Partitions are created using XMLA commands. The rest of this post explains the steps to create partitions using metadata and SSIS package.
SSAS Partitions are primarily geared towards better query performance by narrowing the MDX query to a pre-defined subset of data. Each partition must be in processed state for the cube to be available for browsing. However in a delta ETL run, it is more than likely that only some partitions would be refreshed with data while the rest remain unchanged. For example in a hospital admissions cube partitioned by month, data movements are frequent for admissions made during the current/last month compared to earlier months. This means if the partitions updated by the ETL run can be identified, then we can tailor SSIS package to dynamically process only those partitions that require data refresh while leaving the remaining partitions untouched.
Canned reports/dashboards aka custom reports or out of box reports are inevitable part of any BI ecosystem. A well-designed DW/BI will satisfy most of the analytical requirements of business which includes but not restricted to canned reports. Canned reports are a given, not the end game to any BI strategy. Reporting tools available today are incredibly easy to use. A power user with a little bit of training can build reports and dashboards with reasonable ease while BI teams can focus on expanding and enriching the self-service BI Platform.
The execution plan of SSAS processing engine triggers a sequence of processing tasks based on the object's current processed state and its dependencies with other objects. This is what happens when the SSAS cube is processed from SSMS. It is possible to replicate this execution plan close enough in an SSIS package so that the cube is always processed and refreshed with latest data.
If your SSRS report is using SSAS MDX datasets, the member uniquename is particularly useful to pass parameter values to drill-through MDX reports or for sorting based on key values. If Fields!Attribute.UniqueName is returning nothing or blank or null value in your report , then most likely you are missing the DIMENSION PROPERTIES clause in your MDX dataset query
While SSAS allows you to build a cube straight of the fact and dimension tables, however there are benefits in creating the cube from database views that sits on top of the underlying fact and dimension tables. Here's why.
It's quite common to analyse a measure by categorizing it into different buckets or bands. For example to list the customers who are due for payment in the next week or fortnight, to decide which income group to target for a particular product campaign, which age group of students are most likely to buy a smartphone and so on. These bands are usually not supplied by source systems but derived from the underlying measure and they exist only for analytic purposes. While this banding could be achieved in reports or dashboards, there is significant value to model them as dimensions especially to promote self-service BI.
Not affiliated with Microsoft and CodePlex
To remove your project or any question, please contact us: [email protected]