Data Mining And Data Warehousing
Q_and_A_2025_new
Hyperplane
Definition:
A hyperplane is a flat affine subspace in an
n-dimensional space that separates the space into two halves.
In 2D, it's a line; in 3D, it's a plane; in higher dimensions, it's called a hyperplane.
Role in Linear Classifier:
In linear classifiers, the hyperplane is the decision boundary that separates different classes.
For example, in 2-class classification:
w · x + b = 0
where,
w is the weight vector,
x is the feature vector, and
b is the bias. Points on one side belong to class +1, and on the other side to class -1.
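The decision rule above can be sketched in a few lines of plain Python (the vectors and bias are illustrative values, not from any particular dataset):

```python
def dot(w, x):
    """Dot product of two equal-length vectors."""
    return sum(wi * xi for wi, xi in zip(w, x))

def classify(w, b, x):
    """Return +1 or -1 depending on which side of the hyperplane x falls."""
    return 1 if dot(w, x) + b >= 0 else -1

w = [1.0, -1.0]   # example weight vector
b = 0.0           # example bias
print(classify(w, b, [2.0, 1.0]))   # w·x + b = 1  -> +1
print(classify(w, b, [1.0, 3.0]))   # w·x + b = -2 -> -1
```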
Support Vectors
Definition:
Support vectors are the data points that are closest to the hyperplane.
These points lie on the margin and are critical in defining the position and orientation of the hyperplane. The margin is the distance between the hyperplane and the support vectors; a linear classifier tries to maximize it for better generalization.
Role in Linear Classifier:
They determine the optimal hyperplane.
Removing non-support vector points does not change the hyperplane.
The margin (distance between the hyperplane and nearest points) is maximized using support vectors.
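As a minimal sketch (not a full SVM trainer), the support vectors are the points with the smallest perpendicular distance to a given hyperplane; the hyperplane and points below are made up for illustration:

```python
import math

def distance_to_hyperplane(w, b, x):
    """Perpendicular distance from x to the hyperplane w·x + b = 0: |w·x + b| / ||w||."""
    return abs(sum(wi * xi for wi, xi in zip(w, x)) + b) / math.sqrt(sum(wi * wi for wi in w))

w, b = [1.0, 1.0], -2.0                         # example hyperplane: x1 + x2 = 2
points = [(0.0, 0.0), (1.0, 1.2), (3.0, 3.0)]   # example data points
closest = min(points, key=lambda p: distance_to_hyperplane(w, b, p))
print(closest)   # -> (1.0, 1.2), the margin-defining point
```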
Proximity-based approach for detecting outliers
Detecting outliers using a proximity-based approach relies on measuring distances or densities between data points.
- Idea: Outliers are points that are far away from most other points in the dataset.
- Proximity measures: Euclidean distance, Manhattan distance, or other similarity metrics.
Common Proximity-Based Methods
- Distance-Based Outlier Detection
Method:
A point P is an outlier if it has fewer than k other points within a threshold distance r of it.
Let d(P, Q) = distance between points P and Q.
Threshold rule: if fewer than k points lie within distance r of P, mark P as an outlier.
Steps:
- Choose distance metric (e.g., Euclidean).
- Set parameters: distance r and minimum neighbors k.
- For each point, count neighbors within r.
- If count < k, label point as an outlier.
- Example: In 2D space, if a point is isolated far from clusters, it's an outlier.
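The steps above can be sketched as follows (the data, r, and k are illustrative choices):

```python
import math

def euclidean(p, q):
    """Euclidean distance between two points."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def distance_based_outliers(points, r, k):
    """Label a point an outlier if fewer than k other points lie within distance r."""
    outliers = []
    for i, p in enumerate(points):
        neighbors = sum(1 for j, q in enumerate(points)
                        if j != i and euclidean(p, q) <= r)
        if neighbors < k:
            outliers.append(p)
    return outliers

data = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10)]
print(distance_based_outliers(data, r=2.0, k=2))   # -> [(10, 10)]
```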
k-Nearest Neighbor (k-NN) Approach
Method:
Measure the distance from a point to its k nearest neighbors. Points with large average distance to neighbors are outliers.
Steps:
- Compute distance to k nearest neighbors for each point.
- Compute the average distance or maximum distance to these neighbors.
- Rank points by distance; top points are likely outliers.
- Advantages: Works well for clusters of different densities.
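A small sketch of the k-NN scoring idea, ranking points by average distance to their k nearest neighbors (data and k are illustrative):

```python
import math

def euclidean(p, q):
    """Euclidean distance between two points."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def knn_outlier_scores(points, k):
    """Average distance to the k nearest neighbors; a higher score means more outlying."""
    scores = []
    for i, p in enumerate(points):
        dists = sorted(euclidean(p, q) for j, q in enumerate(points) if j != i)
        scores.append((sum(dists[:k]) / k, p))
    return sorted(scores, reverse=True)   # most outlying points first

data = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10)]
print(knn_outlier_scores(data, k=2)[0])   # highest score belongs to (10, 10)
```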
Proximity-based outlier detection = find points that are far from most other points.
Key steps:
- Choose distance/similarity measure.
- Decide neighborhood size or threshold.
- Compute distance/density for each point.
- Label points exceeding threshold as outliers.
Lazy Learners:
A lazy learner is a type of machine learning algorithm that does not build an explicit model during the training phase. Instead, it stores the training data and waits until a query (test instance) is given to perform computations to make a prediction.
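A minimal sketch of a lazy learner, a 1-nearest-neighbor classifier where "training" merely stores the data and all computation is deferred to prediction time (class and data are illustrative):

```python
import math

class OneNN:
    """A lazy learner: fit() just stores the data; the work happens at query time."""

    def fit(self, X, y):
        self.X, self.y = X, y   # no model is built here
        return self

    def predict(self, x):
        # Distances to all stored points are computed only when a query arrives.
        dists = [math.dist(x, xi) for xi in self.X]
        return self.y[dists.index(min(dists))]

clf = OneNN().fit([(0, 0), (5, 5)], ["a", "b"])
print(clf.predict((1, 1)))   # -> "a" (closest stored point is (0, 0))
```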
Techniques for improving classification algorithms
- Data Cleaning
- Feature Selection/Engineering
- Data Normalization/Scaling
- Data Augmentation
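As one concrete example of the normalization/scaling step, min-max scaling rescales a feature to the [0, 1] range (a minimal sketch; the values are illustrative):

```python
def min_max_scale(values):
    """Rescale a numeric feature to [0, 1] so no feature dominates by magnitude."""
    lo, hi = min(values), max(values)
    if hi == lo:                      # constant feature: avoid division by zero
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

print(min_max_scale([10, 20, 30]))   # -> [0.0, 0.5, 1.0]
```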
Overfitting and Underfitting
Overfitting occurs when a model learns the training data too well, including its noise and outliers, so it performs poorly on new, unseen data. A model may overfit because it is very complex (too many parameters), there is too little training data, or no regularization is applied.
Underfitting occurs when a model is too simple to capture the underlying pattern in the data, so it performs poorly on both training and test data. A model may underfit because it is too simple, the data lacks informative features, or training is inadequate.