
Data Mining and Data Warehousing

Introduction

Data Mining

Data mining is the process of extracting hidden information and knowledge from huge volumes of data. Organizations collect data every day while performing operational-level tasks and activities, which results in the accumulation of a huge volume of data over time. Data analysts, data miners, knowledge workers, statisticians, and software engineers have developed methods to work on these data and produce information and knowledge that were previously unknown.

Data mining functions

  • Classification and Prediction
  • Cluster Analysis
  • Outlier Analysis
  • Association Analysis
  • Evolution Analysis

Separate Data repository required

Data mining can't be performed on an organization's operational database, because data stored in the operational database serve a different purpose than the data required for data mining. The operational database is updated quite frequently; it contains data needed for the organization's day-to-day tasks; it may contain inconsistencies, noise, redundancies, and incomplete records; and the format of operational data may not match what data mining algorithms expect.

Data mining needs clean, consistent, and complete data; if data mining is performed on incomplete or inconsistent data, the knowledge and information produced by the data mining functions may not be accurate or reliable.

Data Preprocessing

Operational data need to be cleaned, made free of errors and noise, and completed. Data preprocessing is the process of cleaning data and making them consistent and complete; it is performed to prepare data for mining.

Data Preprocessing Phases:

  • Data selection
  • Data integration
  • Data cleaning
  • Data transformation
  • Data reduction
The sample records below show the typical quality problems preprocessing must fix: inconsistent gender and marital-status codes, missing values, and a duplicate entry (Geeta).

Name   | Address   | Gender | DOB        | Marital Status
Geeta  | Baneshwor | FeMale | 1999/01/01 | UnMarried
Sital  | Kuleshwor | F      | 1992/01/01 | Married
Neha   | Baneshwor | FeMale | 1981/11/14 |
Rakesh | Kuleshwor | M      | 1979/01/01 | Married
Laxmi  | Thali     | Female | 1989/07/01 |
Geeta  | Baneshwor | F      | 1999/01/01 | UnMarried
Anil   |           | Male   |            | Married

Importance of data preprocessing

Data preprocessing is necessary to ensure the quality of the data and of the information produced from them. The following properties of the data are used to assess its quality:
  • Accuracy: Is data correct or not?
  • Completeness: Is full set of records available or not?
  • Consistency: Is data same in all places or not?
  • Timeliness: The notion of time matters; each record should have an associated date. Is the data up to date or not?
  • Reliability: Can the data be trusted or not?
  • Interpretability: Is data in understandable form?

Data Cleaning

The data cleaning process removes incorrect data and either removes incomplete records or completes them by filling in values for the missing fields.

Handling missing values:

The attribute mean, a global mean, or a group mean can be used to fill in missing values. For example, suppose we have a dataset of 1000 grade-ten students, and 20 of those records have no value for age. Here we can insert the mean age computed from the rest of the dataset, or take the global average age of students studying in grade ten. Alternatively, if removing those 20 records has no severe effect on the information produced, the incomplete records can simply be dropped.
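Below is a minimal sketch of mean imputation, assuming the records are loaded into a pandas DataFrame; the names, ages, and column labels are illustrative, not from the original dataset:

    import numpy as np
    import pandas as pd

    # Illustrative records; np.nan marks a missing age.
    students = pd.DataFrame({
        "name": ["Geeta", "Sital", "Neha", "Rakesh"],
        "age":  [16, np.nan, 15, np.nan],
    })

    # Fill missing ages with the attribute (column) mean.
    mean_age = students["age"].mean()             # NaN values are ignored
    students["age"] = students["age"].fillna(mean_age)

    # Alternative: drop incomplete records if the loss is acceptable.
    # students = students.dropna(subset=["age"])

The same pattern works with a group mean, e.g. students.groupby("grade")["age"].transform("mean"), when records from several grades are mixed together.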

Handling noisy data:

Noise means a random error in the data; noise makes data appear inconsistent. Suppose that, in our dataset of 1000 students, the recorded ages range from 13 to 28. In general, the age of grade-ten students is 15, 16, or 17, so the values at the extremes are suspect. This is a simple example of noise.
Noise can be removed using different smoothing techniques:
  • Binning: Sorted values are divided into bins, and each value is replaced by its bin's mean or median. In our example, we can use the mean or median age of the grade-ten students to smooth the noise. Alternatively, the minimum and maximum of each bin are taken as boundaries, and each value is replaced by the nearer boundary (smoothing by bin boundaries). Both variants are sketched after this list.
  • Regression: Data are fitted to a regression function, and values are smoothed toward the fitted function, which suppresses random deviations.
  • Clustering: Outliers are found using clustering techniques; values that fall outside every cluster are treated as noise.
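Here is a minimal sketch of both binning variants in plain Python; the age values and the bin size of 4 are illustrative assumptions:

    # Smoothing by bin means: sort the values, split them into
    # equal-size bins, and replace each value with its bin's mean.
    ages = sorted([13, 15, 15, 16, 16, 17, 22, 28])
    bin_size = 4

    by_means = []
    for i in range(0, len(ages), bin_size):
        bin_vals = ages[i:i + bin_size]
        mean = sum(bin_vals) / len(bin_vals)
        by_means.extend([mean] * len(bin_vals))
    print(by_means)    # [14.75, 14.75, 14.75, 14.75, 20.75, 20.75, 20.75, 20.75]

    # Smoothing by bin boundaries: replace each value with the nearer
    # of its bin's minimum and maximum.
    by_bounds = []
    for i in range(0, len(ages), bin_size):
        bin_vals = ages[i:i + bin_size]
        lo, hi = bin_vals[0], bin_vals[-1]
        by_bounds.extend([lo if v - lo <= hi - v else hi for v in bin_vals])
    print(by_bounds)   # [13, 16, 16, 16, 16, 16, 16, 28]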

Data Integration

For the purpose of data mining, the existing organizational data may not be sufficient; in that case, data from external sources have to be brought in and integrated with the existing dataset. Some problems may arise while integrating datasets:
  • Schema integration: The database schema of our dataset and that of the external dataset may differ. For example, our student records may have six attributes (id, name, gender, address, grade, dob), while the dataset brought from an external source has the six attributes (id, name, gender, district, class, age). A number of attributes match, but address and district (both strings, possibly of different lengths) may have different types, and age and date of birth (dob) are different data types, so attributes must be mapped and converted to a common schema.
  • Record identification problem: Student IDs in our dataset may follow a different convention than IDs in the external dataset. For example, we may have integer IDs starting from 1, while the external source uses integers starting from 10000, so the same entity can appear under different IDs. A sketch of both fixes follows this list.
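A minimal sketch of aligning the two schemas above with pandas; the sample rows, the reference year used to turn dob into age, and the source prefixes on the IDs are all illustrative assumptions:

    import pandas as pd

    # Our dataset: (id, name, gender, address, grade, dob)
    ours = pd.DataFrame({
        "id": [1, 2],
        "name": ["Geeta", "Rakesh"],
        "gender": ["F", "M"],
        "address": ["Baneshwor", "Kuleshwor"],
        "grade": [10, 10],
        "dob": ["1999/01/01", "1979/01/01"],
    })

    # External dataset: (id, name, gender, district, class, age)
    external = pd.DataFrame({
        "id": [10000],
        "name": ["Anil"],
        "gender": ["M"],
        "district": ["Thali"],
        "class": [10],
        "age": [17],
    })

    # Schema integration: rename district -> address, class -> grade,
    # and derive age from dob (assumed reference year: 2014).
    external = external.rename(columns={"district": "address", "class": "grade"})
    ours["age"] = 2014 - pd.to_datetime(ours["dob"]).dt.year
    ours = ours.drop(columns=["dob"])

    # Record identification: prefix IDs by source so they cannot collide.
    ours["id"] = "local-" + ours["id"].astype(str)
    external["id"] = "ext-" + external["id"].astype(str)

    combined = pd.concat([ours, external], ignore_index=True)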

Data Reduction

Data reduction is the process of reducing the size of the data so that the amount of storage required to store and process them is reduced. Several techniques can be used to reduce the size of data:
  • Dimensionality reduction: Unnecessary attributes are removed to reduce the dimension of the dataset; combining or merging existing attributes also reduces the number of attributes.
  • Numerosity reduction: An existing attribute can be replaced by a smaller representation. For example, if the grade value is represented by a string ("one", "two", etc.), we can use a short code such as "a" for one and "b" for two, so that the new representation requires less memory (a sketch follows this list).
  • Data compression: Data can be compressed using different compression techniques.
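A minimal sketch of numerosity reduction with pandas, replacing repeated string values with small integer codes plus a lookup table; the grade values are illustrative:

    import pandas as pd

    grades = pd.Series(["one", "two", "one", "three", "one"])

    # factorize() returns an integer code per value plus the table of
    # distinct labels; the codes carry the same information in less memory.
    codes, labels = pd.factorize(grades)
    print(codes)    # [0 1 0 2 0]
    print(labels)   # Index(['one', 'two', 'three'], dtype='object')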

Data Transformation

Data transformation is the method of converting data given in one format or representation into another.
  • Smoothing: Noise is removed from the data, for example with the binning techniques described under data cleaning.
  • Aggregation: Data are presented in summary form.
  • Discretization: Intervals are used to represent continuous data.
  • Normalization: Data are scaled from one range to another range (a sketch follows this list).
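A minimal sketch of min-max normalization and interval discretization in plain Python; the age values and the interval boundaries are illustrative assumptions:

    # Min-max normalization: rescale values from [min, max] to [0, 1].
    ages = [13, 15, 16, 17, 28]
    lo, hi = min(ages), max(ages)
    normalized = [(v - lo) / (hi - lo) for v in ages]
    print([round(v, 2) for v in normalized])   # [0.0, 0.13, 0.2, 0.27, 1.0]

    # Discretization: replace each continuous age with an interval label.
    bands = [("13-17", 17), ("18-28", 28)]
    labels = [next(name for name, top in bands if v <= top) for v in ages]
    print(labels)   # ['13-17', '13-17', '13-17', '13-17', '18-28']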