Data Mining And Data Warehousing
Introduction
Data Mining
Data mining is the process of extracting hidden information and knowledge from huge volumes of data.
Organizations collect data every day while performing operational-level tasks and activities, which results in
the accumulation of a huge volume of data over time. Data analysts, data miners, knowledge workers,
statisticians, and software engineers have developed methods to work on this data and produce information and
knowledge that were previously unknown.
Data mining functions
- Classification and Prediction
- Cluster Analysis
- Outlier Analysis
- Association Analysis
- Evolution Analysis
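To make the first of these functions concrete, here is a minimal sketch of classification and prediction using
scikit-learn; the attributes, values, and the buys_computer label are invented purely for illustration.

```python
# Hedged sketch of classification/prediction; the toy data is invented.
from sklearn.tree import DecisionTreeClassifier

# Training records: [age, income] -> buys_computer (1 = yes, 0 = no)
X = [[25, 40000], [35, 60000], [45, 80000], [20, 20000], [50, 90000]]
y = [0, 1, 1, 0, 1]

model = DecisionTreeClassifier().fit(X, y)

# Predict the class of a previously unseen record.
print(model.predict([[30, 55000]]))
```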
Separate Data repository required
Data mining cannot be performed directly on an organization's operational database, because data stored in the
operational database serves a different purpose than the data required for mining. The operational database is
updated quite frequently; it contains data needed for performing the organization's different tasks; it may contain
inconsistencies, noise, redundancies, and incomplete records; and the format of operational data may not match what
the data mining algorithms require.
Data mining needs clean, consistent, and complete data; if mining is performed on incomplete or inconsistent data,
the knowledge and information produced by the data mining functions may not be accurate or reliable.
Data Preprocessing
Operational data needs to be cleaned and made error-free, noise-free, and complete. Data preprocessing is the
process of cleaning data and making it consistent and complete; it is performed to prepare data for mining.
Data Preprocessing Phases:
- Data Selection
- Data integration
- Data cleaning
- Data transformation
- Data Reduction
The sample below shows raw data with the kinds of problems preprocessing must fix: inconsistent gender encodings
(FeMale, F, Female, M, Male), missing values (Anil's address and DOB, two missing marital statuses), and a
duplicate record (Geeta).

| Name   | Address   | Gender | DOB        | Marital Status |
|--------|-----------|--------|------------|----------------|
| Geeta  | Baneshwor | FeMale | 1999/01/01 | UnMarried      |
| Sital  | Kuleshwor | F      | 1992/01/01 | Married        |
| Neha   | Baneshwor | FeMale | 1981/11/14 |                |
| Rakesh | Kuleshwor | M      | 1979/01/01 | Married        |
| Laxmi  | Thali     | Female | 1989/07/01 |                |
| Geeta  | Baneshwor | F      | 1999/01/01 | UnMarried      |
| Anil   |           | Male   |            | Married        |
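A minimal sketch of how these problems can be detected programmatically, assuming pandas is available; the
DataFrame mirrors the sample table above.

```python
# Hedged sketch: detect missing values, duplicates, and inconsistent codes.
import pandas as pd

df = pd.DataFrame({
    "name": ["Geeta", "Sital", "Neha", "Rakesh", "Laxmi", "Geeta", "Anil"],
    "address": ["Baneshwor", "Kuleshwor", "Baneshwor", "Kuleshwor",
                "Thali", "Baneshwor", None],
    "gender": ["FeMale", "F", "FeMale", "M", "Female", "F", "Male"],
    "marital_status": ["UnMarried", "Married", None, "Married",
                       None, "UnMarried", "Married"],
})

print(df.isna().sum())        # completeness: missing values per column
print(df["gender"].unique())  # consistency: five spellings of two genders
print(df.duplicated(subset=["name", "address"]).sum())  # duplicate records
```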
Importance of data preprocessing
Data preprocessing is necessary to ensure the quality of the data and of the information produced from it.
The following properties of the data are used to assess its quality:
- Accuracy: Is the data correct?
- Completeness: Is the full set of records available?
- Consistency: Is the data the same in all places?
- Timeliness: Is the data up to date? A notion of time is necessary, so each record should have a date associated with it.
- Reliability: Can the data be trusted?
- Interpretability: Is the data in an understandable form?
Data Cleaning
The data cleaning process removes incorrect data and either removes incomplete records or makes them complete
by filling values into the missing fields.
Handling missing values:
The attribute mean, the global mean, or a group mean can be used to fill in missing values, as in the sketch below.
For example, suppose we have a dataset of 1000 grade-ten students, and among those 1000 records 20 have no value
for age. We can insert the mean age of the grade-ten students, or take the global average age of all students.
Alternatively, if removing those 20 records has no severe effect on the information produced, they can simply be
dropped.
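A minimal sketch of these imputation options with pandas; the column names and values are hypothetical.

```python
# Hedged sketch of mean, group-mean, and drop-based handling of missing ages.
import pandas as pd

df = pd.DataFrame({
    "grade": [10, 10, 10, 9, 9],
    "age":   [15.0, 16.0, None, 14.0, None],
})

# Global mean: fill every missing age with the mean over all records.
df["age_global"] = df["age"].fillna(df["age"].mean())

# Group mean: fill each missing age with the mean age of the same grade.
df["age_group"] = df["age"].fillna(df.groupby("grade")["age"].transform("mean"))

# Or drop the incomplete records if losing them has no severe effect.
df_complete = df.dropna(subset=["age"])
```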
Noise data:
Noise is a random error in the data; it makes the data appear inconsistent. Suppose that in our dataset of 1000
students the recorded ages range from 13 to 28, while in general the age of grade-ten students is 15, 16, or 17.
This inconsistency in the age values is a simple example of noise.
Noise can be handled using the following techniques (a binning sketch follows this list):
- Binning (smoothing): Values are grouped into bins, and the mean or median value of a bin replaces the values in
it. In our example, we could replace noisy ages with the mean or median age of grade-ten students. Alternatively,
the minimum and maximum values of each bin are taken as boundaries, and each value is replaced by the boundary it
is nearer to (smoothing by bin boundaries).
- Regression: Data can be smoothed by fitting it to a regression function; this also shows whether an attribute is
suitable for analysis.
- Clustering: Similar values are organized into clusters, and values that fall outside all clusters are treated as
outliers.
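A minimal sketch of smoothing by bin means and by bin boundaries with NumPy; the bin size and age values are
invented for illustration.

```python
# Hedged sketch: equal-depth binning, then smoothing by means and boundaries.
import numpy as np

ages = np.sort(np.array([13, 15, 15, 16, 16, 17, 17, 28]))
bins = ages.reshape(2, 4)  # two equal-depth bins of four sorted values each

# Smoothing by bin means: every value becomes its bin's mean.
by_means = np.repeat(bins.mean(axis=1), 4)

# Smoothing by bin boundaries: each value snaps to the nearer of the
# bin's minimum and maximum.
lo = bins.min(axis=1, keepdims=True)
hi = bins.max(axis=1, keepdims=True)
by_bounds = np.where(bins - lo <= hi - bins, lo, hi)

print(by_means)   # [14.75 14.75 14.75 14.75 19.5 19.5 19.5 19.5]
print(by_bounds)  # [[13 16 16 16] [16 16 16 28]]
```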
Data Integration
For the purpose of data mining, existing organizational data may not be sufficient; in that case, data from
external sources has to be brought in and integrated with the existing dataset. Several problems may arise while
integrating datasets (a sketch follows this list).
- Schema integration: The database schema of our dataset and that of the external dataset may differ. For
example, our student records may have six attributes (id, name, gender, address, grade, dob), while the
dataset brought from an external source has the attributes (id, name, gender, district, class, age).
The number of attributes in both datasets is the same, but the attributes themselves differ: address and district
are both strings but may differ in length, and age and date of birth (dob) have different data types.
- Record identification problem: The student IDs in our dataset may differ from the IDs in the external dataset.
For example, we may have assigned integer IDs starting from 1, while the external source assigns integers starting from 10000.
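A minimal sketch of aligning the two hypothetical schemas with pandas before combining them; the records, the
reference date used to derive age from dob, and the column names are all assumptions.

```python
# Hedged sketch: rename mismatched columns, derive age from dob, then combine.
import pandas as pd

local = pd.DataFrame({
    "id": [1, 2],
    "name": ["Geeta", "Sital"],
    "dob": pd.to_datetime(["1999-01-01", "1992-01-01"]),
    "grade": [10, 10],
})
external = pd.DataFrame({
    "id": [10001, 10002],
    "name": ["Neha", "Rakesh"],
    "age": [23, 26],
    "class": [10, 10],
})

# Schema integration: map "class" to "grade", turn "dob" into "age".
external = external.rename(columns={"class": "grade"})
local["age"] = (pd.Timestamp("2024-01-01") - local["dob"]).dt.days // 365
students = pd.concat([local.drop(columns="dob"), external], ignore_index=True)
print(students)
```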
Data Reduction
Data reduction is the process of reducing the size of a dataset so that the amount of storage required to store
and process the data is reduced. Several techniques can be used (a sketch follows this list).
- Dimension reduction: Unnecessary attributes are removed to reduce the dimensionality of the dataset.
Existing attributes can also be combined or merged to reduce the number of attributes.
- Numerosity reduction: An existing attribute can be stored in a smaller representation. For example, if the grade
value in our example is represented by an integer or a string (one, two, etc.), we can use a for one and b for
two so that the new representation requires less memory.
- Data compression: Data can be compressed using different compression techniques.
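A minimal sketch of the first two reduction steps with pandas; the columns and the integer coding of grades are
assumptions.

```python
# Hedged sketch: drop and merge attributes, then recode a string attribute.
import pandas as pd

df = pd.DataFrame({
    "first_name": ["Geeta", "Sital", "Neha"],
    "last_name":  ["KC", "Shrestha", "Rai"],
    "grade":      ["ten", "ten", "nine"],
    "remarks":    ["", "", ""],
})

# Dimension reduction: remove an unneeded attribute, merge two others.
df = df.drop(columns="remarks")
df["name"] = df["first_name"] + " " + df["last_name"]
df = df.drop(columns=["first_name", "last_name"])

# Numerosity reduction: store grade as a small integer code, not a string.
df["grade"] = df["grade"].astype("category").cat.codes
print(df)
```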
Data Transformation
Data transformation is the method of converting data from one format or representation into another.
- Smoothing: Noise is removed from the data (see the data cleaning techniques above).
- Aggregation: Data is presented in summary form.
- Discretization: Intervals are used to represent continuous data.
- Normalization: Data is scaled from one range into another, for example into [0, 1].
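A minimal sketch of discretization and min-max normalization with pandas; the sample marks are invented.

```python
# Hedged sketch: discretize continuous values and normalize them to [0, 1].
import pandas as pd

marks = pd.Series([35.0, 60.0, 75.0, 90.0])

# Normalization: min-max scaling into the [0, 1] range.
scaled = (marks - marks.min()) / (marks.max() - marks.min())

# Discretization: replace continuous marks with interval labels.
binned = pd.cut(marks, bins=[0, 40, 70, 100], labels=["low", "mid", "high"])

print(scaled.tolist())  # [0.0, 0.4545..., 0.7272..., 1.0]
print(binned.tolist())  # ['low', 'mid', 'high', 'high']
```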