Answer: One-to-Many (Dimension → Fact)
Answer: Unsupervised Learning
Answer: City Block Distance (L1 Distance)
Answer: Database Management System
Answer: Hardware Mining
Answer: Breadth-First Search Candidate Generation
Answer: True
Answer: True
Answer: Minimum
Answer: Database Server
Answer: (c) FP-Growth
Answer: (a) Only (i) and (ii) are true
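Support: fraction of transactions that contain the itemset. Support(A→B) = Count(A∪B)/Total Transactions, where Count(A∪B) is the number of transactions containing every item of both A and B.
Confidence: reliability of the rule, i.e. how often B appears in transactions that contain A. Confidence(A→B) = Support(A∪B)/Support(A).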
Frequent Itemset: itemset with support ≥ min_support threshold.
Association Rule: implication A → B indicating co-occurrence. Example: Bread → Butter.
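The support and confidence formulas above can be checked with a short sketch. The transaction list is hypothetical, invented only to illustrate the Bread → Butter rule, and the function names are assumptions, not part of the original answer.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= set(t))
    return hits / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Support(A ∪ B) / Support(A) for the rule A -> B."""
    both = support(transactions, set(antecedent) | set(consequent))
    return both / support(transactions, antecedent)


# Hypothetical market-basket data, invented for illustration only.
transactions = [
    {"Bread", "Butter"},
    {"Bread", "Butter", "Milk"},
    {"Bread", "Milk"},
    {"Milk"},
]

rule_support = support(transactions, {"Bread", "Butter"})          # 2 of 4 baskets
rule_confidence = confidence(transactions, {"Bread"}, {"Butter"})  # 2 of 3 Bread baskets
```

With min_support = 0.5, {Bread, Butter} would count as a frequent itemset in this toy data.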
Tree construction principle (FP-Growth):
Advantages: compact representation, no candidate generation, only two database scans.
Clustering: grouping similar objects, separating dissimilar ones.
| Method | Description | Example |
|---|---|---|
| Partitioning | Divides data into k clusters | K-Means |
| Hierarchical | Creates cluster hierarchy (agglomerative/divisive) | BIRCH |
| Density-Based | Forms clusters based on density | DBSCAN |
| Grid-Based | Partitions space into cells, fast | STING |
Time-Series DB: stores observations at regular time intervals (stock prices, sensors).
Examples: Stock prices → time series ; customer purchase sequence → sequential.
Algorithm:
Natural grouping: Cluster1: X1,X2,X3,X4 → medoid X4(4,7). Cluster2: X5,X6,X7,X8,X9,X10 → medoid X8(7,4).
Final clusters: C₁ = {X1,X2,X3,X4}, C₂ = {X5,X6,X7,X8,X9,X10}.
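The assignment step of K-Medoid can be sketched as below. Only the two medoids (4,7) and (7,4) come from the answer above; the point coordinates are hypothetical because the data table is not reproduced in this section, and Manhattan distance is assumed as the dissimilarity measure. The sketch therefore does not reproduce the final clusters listed above.

```python
def manhattan(p, q):
    """City-block distance between two points of equal dimension."""
    return sum(abs(a - b) for a, b in zip(p, q))


def assign_to_medoids(points, medoids):
    """Assign each point to its nearest medoid and return (clusters, total cost)."""
    clusters = [[] for _ in medoids]
    total_cost = 0
    for p in points:
        distances = [manhattan(p, m) for m in medoids]
        nearest = distances.index(min(distances))  # ties go to the first medoid
        clusters[nearest].append(p)
        total_cost += distances[nearest]
    return clusters, total_cost


# Hypothetical points, chosen for illustration only.
points = [(2, 6), (3, 8), (4, 7), (6, 2), (7, 4), (8, 5)]
medoids = [(4, 7), (7, 4)]  # medoids taken from the answer above

clusters, cost = assign_to_medoids(points, medoids)
```

A full PAM run would repeat this step for every candidate medoid swap and keep a swap only when it lowers the total cost.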
Manhattan: d(x,y)= Σ|xᵢ - yᵢ|.
✔ Non-negativity: absolute values ≥0.
✔ Identity: zero iff all coordinates equal.
✔ Symmetry: |xᵢ-yᵢ| = |yᵢ-xᵢ|.
✔ Triangle: |xᵢ - zᵢ| ≤ |xᵢ-yᵢ|+|yᵢ-zᵢ|, summing preserves inequality. Hence valid metric.
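As a sanity check (not a proof), the four axioms can be tested numerically on a small grid of integer points. The grid and the function names are assumptions made for this sketch.

```python
from itertools import product


def manhattan(x, y):
    """d(x, y) = sum of |x_i - y_i| over all coordinates."""
    return sum(abs(a - b) for a, b in zip(x, y))


def check_metric_axioms(points):
    """Return True if all four metric axioms hold on the given points."""
    for x, y in product(points, repeat=2):
        d = manhattan(x, y)
        if d < 0:                      # non-negativity
            return False
        if (d == 0) != (x == y):       # identity
            return False
        if d != manhattan(y, x):       # symmetry
            return False
    for x, y, z in product(points, repeat=3):
        if manhattan(x, z) > manhattan(x, y) + manhattan(y, z):  # triangle
            return False
    return True


# 25 points with coordinates in -2..2, chosen only for illustration.
grid = list(product(range(-2, 3), repeat=2))
axioms_hold = check_metric_axioms(grid)
```

The check passing on a finite grid only illustrates the argument; the coordinate-wise inequality above is what establishes the result in general.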
Data stream: continuous, rapid, infinite sequence of data (sensor networks, financial transactions). Characteristics: continuous, infinite, real-time processing.
Synopsis: compact summary of stream data. Structures: sampling, histograms, wavelets, sketches. Advantages: reduced memory, fast processing, approximate query answering.
Reservoir sampling: maintains a fixed-size uniform random sample of the stream seen so far. Progressive sampling: sample grows with the stream. Sliding window: keeps only the most recent data. Applications: real-time analytics, fraud detection.
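Reservoir sampling can be sketched with the standard single-pass scheme (often called Algorithm R). The function name, the `rng` parameter, and the example stream are assumptions made for illustration.

```python
import random


def reservoir_sample(stream, k, rng=None):
    """Keep a uniform random sample of size k from a stream of unknown length."""
    if rng is None:
        rng = random.Random()
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # Item i replaces a random slot with probability k / (i + 1).
            j = rng.randint(0, i)  # randint is inclusive on both ends
            if j < k:
                reservoir[j] = item
    return reservoir


# Hypothetical stream of 10,000 readings; the seed only makes the run repeatable.
sample = reservoir_sample(range(10_000), k=5, rng=random.Random(42))
```

Memory use stays at k items regardless of stream length, which is why the technique suits unbounded streams.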
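Procedure: Initialize K centers → compute Euclidean distance from each point to every center → assign each point to its nearest center → recompute each centroid as the mean of its cluster → repeat until the centroids stop changing. Formula: d = √[(x₁-x₂)²+(y₁-y₂)²]. Advantages: simple, fast, scalable.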
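The procedure above can be sketched in a few lines. The points and initial centers are hypothetical, and keeping the old center for an empty cluster is a choice made for this sketch, not something the answer specifies.

```python
import math


def kmeans(points, centers, max_iter=100):
    """Plain K-Means with Euclidean distance; returns (centers, clusters)."""
    centers = [tuple(c) for c in centers]
    clusters = [[] for _ in centers]
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest center.
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)),
                          key=lambda i: math.dist(p, centers[i]))
            clusters[nearest].append(p)
        # Update step: each center becomes the mean of its cluster.
        new_centers = []
        for old, members in zip(centers, clusters):
            if members:
                new_centers.append(
                    tuple(sum(c) / len(members) for c in zip(*members)))
            else:
                new_centers.append(old)  # assumption: empty cluster keeps its center
        if new_centers == centers:       # convergence: centers unchanged
            break
        centers = new_centers
    return centers, clusters


# Hypothetical data: two well-separated groups.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
centers, clusters = kmeans(points, centers=[(1, 1), (8, 8)])
```

The result depends on the initial centers, so in practice the algorithm is usually run several times with different starts.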
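CLARA (Clustering LARge Applications): draws random samples from the data, applies PAM to each sample, and keeps the medoid set with the lowest cost over the full dataset. Suitable for large data.
CLARANS (Clustering Large Applications based on RANdomized Search): randomized search over medoid sets; at each step it examines a random subset of neighboring solutions (one medoid swapped). Usually gives better quality than CLARA.
Difference: CLARA restricts the search to fixed samples (faster, quality depends on the sample) vs CLARANS, which searches the whole dataset with randomized neighbor moves (more accurate, more costly).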
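A minimal sketch of the CLARA idea follows. To keep it short, PAM is replaced by an exhaustive search over all k-subsets of each sample, which is only workable for tiny samples; Manhattan distance, the data, and all parameter values are assumptions.

```python
import random
from itertools import combinations


def manhattan(p, q):
    return sum(abs(a - b) for a, b in zip(p, q))


def total_cost(points, medoids):
    """Sum of distances from each point to its nearest medoid."""
    return sum(min(manhattan(p, m) for m in medoids) for p in points)


def best_medoids(sample, k):
    """Stand-in for PAM: exhaustive search over all k-subsets of the sample."""
    return min(combinations(sample, k), key=lambda ms: total_cost(sample, ms))


def clara(points, k, sample_size, n_samples=5, seed=0):
    """Cluster several random samples; keep the medoids that score best on all data."""
    rng = random.Random(seed)
    best, best_cost = None, float("inf")
    for _ in range(n_samples):
        sample = rng.sample(points, min(sample_size, len(points)))
        medoids = best_medoids(sample, k)
        cost = total_cost(points, medoids)  # evaluated on the FULL dataset
        if cost < best_cost:
            best, best_cost = medoids, cost
    return list(best), best_cost


# Hypothetical data: two 3x3 blocks of points far apart.
points = [(x, y) for x in range(3) for y in range(3)]
points += [(x + 20, y + 20) for x in range(3) for y in range(3)]

medoids, cost = clara(points, k=2, sample_size=8)
```

CLARANS would instead start from a random medoid set over the full data and repeatedly try random single-medoid swaps, accepting a swap when it lowers the cost.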
Min-Max: v' = (v-min)/(max-min), min=12, max=65 → 12→0, 65→1. Range [0,1].
Z-Score: z = (x-μ)/σ, μ=35.4, σ≈18.15. Example: 12→(12-35.4)/18.15 = -1.29 ; 65→(65-35.4)/18.15 = 1.63.
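The two normalization formulas can be verified with a short sketch using the values stated above (min=12, max=65, μ=35.4, σ≈18.15). The function names and the optional target range are assumptions.

```python
def min_max(v, v_min, v_max, new_min=0.0, new_max=1.0):
    """Min-max normalization of v into [new_min, new_max]."""
    return (v - v_min) / (v_max - v_min) * (new_max - new_min) + new_min


def z_score(x, mean, std):
    """Z-score normalization: distance from the mean in standard deviations."""
    return (x - mean) / std


# Values taken from the worked answer above.
low = min_max(12, 12, 65)
high = min_max(65, 12, 65)
z_low = round(z_score(12, 35.4, 18.15), 2)
z_high = round(z_score(65, 35.4, 18.15), 2)
```

Min-max output is bounded by the chosen range, while z-scores are unbounded and depend on the spread of the data.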
| Supervised | Unsupervised |
|---|---|
| Labeled data, predict output | Unlabeled data, discover patterns |
| Classification/Regression | Clustering/Association |
| Examples: Decision Tree, KNN | Examples: K-Means, DBSCAN |
KNN (K-Nearest Neighbor): supervised learning, classifies based on majority of K nearest neighbors.
Steps: choose K, compute distances, find K neighbors, majority voting.
Example: points (1,1)A, (2,2)A, (6,6)B, (7,7)B. New point (3,3): with K=3 the nearest neighbors are (2,2)A, (1,1)A and (6,6)B, so the majority vote predicts A. Advantages: simple, no training phase. Disadvantages: slow for large data, sensitive to noise.
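The KNN steps can be sketched on the four labeled points from the example. Euclidean distance and k=3 are assumptions, and the function name is hypothetical.

```python
import math
from collections import Counter


def knn_predict(train, query, k):
    """Classify `query` by majority vote among its k nearest training points."""
    # Sort training items by Euclidean distance to the query, keep the k closest.
    neighbors = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]


# Training points from the example above.
train = [((1, 1), "A"), ((2, 2), "A"), ((6, 6), "B"), ((7, 7), "B")]

prediction = knn_predict(train, (3, 3), k=3)
```

There is no training step: all the work happens at query time, which is why prediction is slow on large datasets.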
✔ Complete coverage of MAKAUT PE-DT602B (2022–23) – Data Warehousing & Data Mining.
✔ Includes K-Medoid, distance axioms, stream mining, CLARA/CLARANS, similarity search, normalization, KNN.
✔ All group answers written in university exam style – ready for 1, 5, and 15-mark questions.