Česky | English
Home > Solutions > IT Solutions > Data Cleansing

Data Cleansing

Quality results can only be produced by using quality data"

The use of poor quality data ensures that the resulting output will also be of poor quality. Let's take a simple example of a marketing offer which is being sent to company clients. When using poor quality data with customer names and addresses, the offer will reach a single address several times or will not reach it at all. Or, it will be sent to the same address several times to each member of the household separately. Besides direct losses due to extra postage and cost of processing the mailings, there are also hidden, potential losses resulting from the negative impression which may have been created.

Data Cleansing is a complex subject. The cleansing process encompasses not only a one-time data fix, but also provides improvements to company procedures which have lead to poor quality data in the first place. The main steps for data cleansing are:

  • Check of all data input
  • Integration of monitoring procedures to data processing cycle
  • Assignment of roles and responsibilities to people working with the data

The Data Cleansing topic is mainly heard and discussed along with Data Warehousing and Master Data Management issues.

Solution

data input analysis, control checks implementation

Prior to batch data cleansing, it is important to implement correct standards which continuously monitor data quality and cleanses the data while being fed to the company databases.

  • A correctly executed project begins with data input analysis and data model analysis. A result of such analysis is an identification and design of data object consolidation - unifying of formats, entity descriptions, semantics, proper relationship setup (data integrity), consolidation of look-up tables, etc.
  • Along with the work on a data model, business rules for data inputs have to be defined. It usually means restricting manual data input to predefined values, using character masks, look-up tables or mathematical, logical or formatting rules. At this stage, relational or hierarchical relationships of data entities get defined.
  • The next phase calls for implementing control checks as a standard part of everyday work. It is necessary to define metrics for entities along with criteria for measurement and to keep a running history of such daily checks. This provides for monitoring of data quality development over a longer time period. Records which do not meet quality criteria are sent for further processing - data cleansing.

The Data Cleansing itself can be done in a transient or batch mode. It consists of several phases:

  • Analysis and assessment of data using frequency analysis
  • Standardization of data (i.e. correction/replacement of data which do not conform to format and content of internal look-up tables and business rules.)
  • Use of registers of comparable data (public address records or business services providing such information) to further verify or enhance data content
  • Data integration (i.e. removal of duplicates.)

Technology

Database platforms: Oracle, MS SQL, Informix, DB2, Teradata, MySQL,PostgreSQL

ETL: Informatica, DataStage, AbInitio, OWB, CloverETL

DataCleansing/Profiling: Trillium ProfileStage, FirstLogic, dwSavvy