Data Mining And Data Warehousing
Q_and_A_2025_new
Hyperplane
Definition:
A hyperplane is a flat affine subspace in an
n-dimensional space that separates the space into two halves.
In 2D, it's a line; in 3D, it's a plane; in higher dimensions, it's called a hyperplane.
Role in Linear Classifier:
In linear classifiers, the hyperplane is the decision boundary that separates different classes.
For example, in 2-class classification:
w · x + b = 0
where,
w is the weight vector,
x is the feature vector, and
b is the bias. Points on one side belong to class +1, and on the other side to class -1.
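The decision rule above can be sketched in a few lines of plain Python (the vectors and bias are illustrative values, not from any particular dataset):

```python
def dot(w, x):
    """Dot product of two equal-length vectors."""
    return sum(wi * xi for wi, xi in zip(w, x))

def classify(w, b, x):
    """Return +1 or -1 depending on which side of the hyperplane x falls."""
    return 1 if dot(w, x) + b >= 0 else -1

w = [1.0, -1.0]   # example weight vector
b = 0.0           # example bias
print(classify(w, b, [2.0, 1.0]))   # w·x + b = 1  -> +1
print(classify(w, b, [1.0, 3.0]))   # w·x + b = -2 -> -1
```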
Support Vectors
Definition:
Support vectors are the data points that are closest to the hyperplane.
These points lie on the margin and are critical in defining the position and orientation of the hyperplane. The margin is the distance between the hyperplane and the support vectors; a linear classifier tries to maximize it for better generalization.
Role in Linear Classifier:
They determine the optimal hyperplane.
Removing non-support vector points does not change the hyperplane.
The margin (distance between the hyperplane and nearest points) is maximized using support vectors.
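As a minimal sketch (not a full SVM trainer), the support vectors are the points with the smallest perpendicular distance to a given hyperplane; the hyperplane and points below are made up for illustration:

```python
import math

def distance_to_hyperplane(w, b, x):
    """Perpendicular distance from x to the hyperplane w·x + b = 0: |w·x + b| / ||w||."""
    return abs(sum(wi * xi for wi, xi in zip(w, x)) + b) / math.sqrt(sum(wi * wi for wi in w))

w, b = [1.0, 1.0], -2.0                         # example hyperplane: x1 + x2 = 2
points = [(0.0, 0.0), (1.0, 1.2), (3.0, 3.0)]   # example data points
closest = min(points, key=lambda p: distance_to_hyperplane(w, b, p))
print(closest)   # -> (1.0, 1.2), the margin-defining point
```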
Proximity-based approach for detecting outliers
Detecting outliers using a proximity-based approach relies on measuring distances or densities between data points.
- Idea: Outliers are points that are far away from most other points in the dataset.
- Proximity measures: Euclidean distance, Manhattan distance, or other similarity metrics.
Common Proximity-Based Methods
- Distance-Based Outlier Detection
Method:
A point P is an outlier if it has fewer than k other points within a threshold distance r of it.
Let d(P, Q) = distance between points P and Q.
Threshold rule: if fewer than k points lie within distance r of P, mark P as an outlier.
Steps:
- Choose distance metric (e.g., Euclidean).
- Set parameters: distance r and minimum neighbors k.
- For each point, count neighbors within r.
- If count < k, label point as an outlier.
- Example: In 2D space, if a point is isolated far from clusters, it's an outlier.
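The steps above can be sketched as follows (the data, r, and k are illustrative choices):

```python
import math

def euclidean(p, q):
    """Euclidean distance between two points."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def distance_based_outliers(points, r, k):
    """Label a point an outlier if fewer than k other points lie within distance r."""
    outliers = []
    for i, p in enumerate(points):
        neighbors = sum(1 for j, q in enumerate(points)
                        if j != i and euclidean(p, q) <= r)
        if neighbors < k:
            outliers.append(p)
    return outliers

data = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10)]
print(distance_based_outliers(data, r=2.0, k=2))   # -> [(10, 10)]
```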
k-Nearest Neighbor (k-NN) Approach
Method:
Measure the distance from a point to its k nearest neighbors. Points with large average distance to neighbors are outliers.
Steps:
- Compute distance to k nearest neighbors for each point.
- Compute the average distance or maximum distance to these neighbors.
- Rank points by distance; top points are likely outliers.
- Advantages: Works well for clusters of different densities.
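A small sketch of the k-NN scoring idea, ranking points by average distance to their k nearest neighbors (data and k are illustrative):

```python
import math

def euclidean(p, q):
    """Euclidean distance between two points."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def knn_outlier_scores(points, k):
    """Average distance to the k nearest neighbors; a higher score means more outlying."""
    scores = []
    for i, p in enumerate(points):
        dists = sorted(euclidean(p, q) for j, q in enumerate(points) if j != i)
        scores.append((sum(dists[:k]) / k, p))
    return sorted(scores, reverse=True)   # most outlying points first

data = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10)]
print(knn_outlier_scores(data, k=2)[0])   # highest score belongs to (10, 10)
```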
Proximity-based outlier detection = find points that are far from most other points.
Key steps:
- Choose distance/similarity measure.
- Decide neighborhood size or threshold.
- Compute distance/density for each point.
- Label points exceeding threshold as outliers.
Lazy Learners:
A lazy learner is a type of machine learning algorithm that does not build an explicit model during the training phase. Instead, it stores the training data and waits until a query (test instance) is given to perform computations to make a prediction.
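A minimal sketch of a lazy learner, a 1-nearest-neighbor classifier where "training" merely stores the data and all computation is deferred to prediction time (class and data are illustrative):

```python
import math

class OneNN:
    """A lazy learner: fit() just stores the data; the work happens at query time."""

    def fit(self, X, y):
        self.X, self.y = X, y   # no model is built here
        return self

    def predict(self, x):
        # Distances to all stored points are computed only when a query arrives.
        dists = [math.dist(x, xi) for xi in self.X]
        return self.y[dists.index(min(dists))]

clf = OneNN().fit([(0, 0), (5, 5)], ["a", "b"])
print(clf.predict((1, 1)))   # -> "a" (closest stored point is (0, 0))
```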
Techniques for improving classification algorithms
- Data Cleaning
- Feature Selection/Engineering
- Data Normalization/Scaling
- Data Augmentation
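As one concrete example of the normalization/scaling step, min-max scaling rescales a feature to the [0, 1] range (a minimal sketch; the values are illustrative):

```python
def min_max_scale(values):
    """Rescale a numeric feature to [0, 1] so no feature dominates by magnitude."""
    lo, hi = min(values), max(values)
    if hi == lo:                      # constant feature: avoid division by zero
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

print(min_max_scale([10, 20, 30]))   # -> [0.0, 0.5, 1.0]
```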
Overfitting and Underfitting
Overfitting occurs when a model learns the training data too well, including its noise and outliers, so it performs poorly on new, unseen data. A model may overfit because it is very complex (too many parameters), there is too little training data, or no regularization is applied.
Underfitting occurs when a model is too simple to capture the underlying pattern in the data, so it performs poorly on both training and test data. A model may underfit because it is too simple, the data lacks informative features, or training is inadequate.