📊 PE-DT602B: Data Warehousing & Data Mining

MAKAUT B.Tech (2022–23) – Detailed Solved Question Paper

✅ Group-A (1 Mark) | Group-B (5 Marks) | Group-C (Long Answers) | Exam-ready
📌 Group-A (Very Short Answer – 1 Mark Each)
1(i) A star schema has what type of relationship between dimension and fact table?

Answer: One-to-Many (Dimension → Fact)

1(ii) K-Means clustering is what type of learning?

Answer: Unsupervised Learning

1(iii) Manhattan distance is also called what?

Answer: City Block Distance (L1 Distance)

1(iv) What is the full form of DBMS?

Answer: Database Management System

1(v) Web mining does not include what?

Answer: Hardware Mining

1(vi) AGM approach is what type of candidate generation method?

Answer: Breadth-First Search Candidate Generation

1(vii) FP-tree does not need candidate generation. True/False?

Answer: True

1(viii) The clustering technique k-means is based on centroid. True/False?

Answer: True

1(ix) The best-fitted trend line is one for which sum of squared errors is minimum/maximum?

Answer: Minimum

1(x) A stream data query processing architecture does not include which server?

Answer: Database Server

1(xi) Which frequent pattern mining technique mines without candidate generation?

Answer: (c) FP-Growth

1(xii) Choose correct alternative.

Answer: (a) Only (i) and (ii) are true

✍️ Group-B (Short Answer – 5 Marks)
Q2. Define Support, Confidence, Frequent Itemset and Association Rule.

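Support: fraction of transactions in the database that contain the itemset. For a rule: Support(A→B) = Count(A∪B)/Total Transactions, where Count(A∪B) is the number of transactions containing every item of A and of B.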

Confidence: reliability of rule: Confidence(A→B) = Support(A∪B)/Support(A).

Frequent Itemset: itemset with support ≥ min_support threshold.

Association Rule: implication A → B indicating co-occurrence. Example: Bread → Butter.
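
The four definitions above can be checked on a small example. The sketch below is illustrative only: the five transactions and the 0.5 minimum-support threshold are made-up assumptions, and the helper names (`count`, `support`, `confidence`) are hypothetical, not taken from any library.

```python
# Hypothetical market-basket data, invented purely for illustration.
transactions = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "jam"},
    {"milk", "butter"},
    {"bread", "butter", "jam"},
]


def count(itemset, transactions):
    """Number of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(1 for t in transactions if itemset <= t)


def support(itemset, transactions):
    """Count(itemset) / total number of transactions."""
    return count(itemset, transactions) / len(transactions)


def confidence(antecedent, consequent, transactions):
    """Support(A ∪ B) / Support(A), computed from raw counts for the rule A -> B."""
    a = set(antecedent)
    both = a | set(consequent)
    return count(both, transactions) / count(a, transactions)


rule_support = support({"bread", "butter"}, transactions)          # 3 of 5 transactions
rule_confidence = confidence({"bread"}, {"butter"}, transactions)  # 3 of the 4 bread transactions
min_support = 0.5                                                  # assumed threshold
is_frequent = rule_support >= min_support
print(rule_support, rule_confidence, is_frequent)
```

Here {bread, butter} is a frequent itemset under the assumed threshold, and Bread → Butter holds in three of the four transactions that contain bread.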

Q3. Discuss briefly the tree construction principle.

Tree construction principle (FP-Growth):

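  • Scan DB once, count the support of each item, keep the frequent items
  • Scan DB again: remove infrequent items from each transaction, sort the rest by descending support
  • Insert each sorted transaction into the FP-tree; transactions sharing a prefix share nodes (counts are incremented)
  • Mine patterns from the tree without candidate generation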

Advantages: compact representation, avoids candidate generation, fewer scans.
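
A minimal sketch of the two-scan construction described above. It builds only the prefix tree; the header table and node links that full FP-Growth uses for mining are left out for brevity. The class name `FPNode`, the function `build_fp_tree`, the toy transactions, and the alphabetical tie-break are all assumptions made for this illustration.

```python
from collections import Counter


class FPNode:
    """One FP-tree node: an item, its count, and its children keyed by item."""

    def __init__(self, item=None):
        self.item = item
        self.count = 0
        self.children = {}


def build_fp_tree(transactions, min_count):
    # Scan 1: count each item once per transaction, keep the frequent ones.
    counts = Counter(item for t in transactions for item in set(t))
    frequent = {i: c for i, c in counts.items() if c >= min_count}

    root = FPNode()
    # Scan 2: insert each transaction's frequent items in descending-support order.
    for t in transactions:
        items = sorted(
            (i for i in set(t) if i in frequent),
            key=lambda i: (-frequent[i], i),  # assumed tie-break: alphabetical
        )
        node = root
        for item in items:
            # Shared prefixes reuse existing nodes and only bump the count.
            node = node.children.setdefault(item, FPNode(item))
            node.count += 1
    return root, frequent


# Hypothetical toy database with an assumed minimum support count of 2.
toy_db = [["a", "b", "c"], ["a", "b"], ["a", "d"], ["b", "c"], ["a", "b", "c", "e"]]
root, frequent = build_fp_tree(toy_db, min_count=2)
print(frequent)
```

In this toy run the items d and e are dropped as infrequent, and the path a → b → c is stored once with counts instead of once per transaction.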

Q4. What is Clustering? Describe partitioning, hierarchical, density-based and grid-based methods.

Clustering: grouping similar objects, separating dissimilar ones.

Method | Description | Example
Partitioning | Divides data into k clusters | K-Means
Hierarchical | Creates cluster hierarchy (agglomerative/divisive) | BIRCH
Density-Based | Forms clusters based on density | DBSCAN
Grid-Based | Partitions space into cells, fast | STING

Q5. What is a Time-Series Database? How is time-series data different from sequential data?

Time-Series DB: stores observations at regular time intervals (stock prices, sensors).

Time-Series Data | Sequential Data
Time dependent, numeric values | Order dependent, events/items
Fixed timestamps, forecasting | Sequence of actions, pattern discovery

Examples: Stock prices → time series ; customer purchase sequence → sequential.

Q6. Write K-Means Clustering Algorithm.

Algorithm:

  • Select K initial centroids
  • Compute distance & assign points to nearest centroid
  • Recompute centroids
  • Repeat until convergence
Initialize K Centroids → Assign Points → Recompute Centroids → Converged? → (No: repeat) → (Yes: stop)
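
The loop above can be sketched in plain Python as follows. This is one possible rendering under stated assumptions, not a reference implementation: the initial centroids are passed in by the caller, Euclidean distance is used, and convergence means "no centroid moved". The function name `kmeans` and the six sample points are hypothetical.

```python
import math


def kmeans(points, centroids, max_iter=100):
    """K-Means on coordinate tuples; `centroids` holds the K initial centers."""
    centroids = [tuple(c) for c in centroids]
    clusters = [[] for _ in centroids]
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda j: math.dist(p, centroids[j]))
            clusters[nearest].append(p)
        # Update step: each centroid becomes the mean of its cluster
        # (an empty cluster keeps its old centroid).
        new_centroids = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster)) if cluster else old
            for cluster, old in zip(clusters, centroids)
        ]
        if new_centroids == centroids:  # converged: nothing moved
            break
        centroids = new_centroids
    return centroids, clusters


# Hypothetical data: two well-separated groups, K = 2.
sample = [(1, 1), (2, 1), (1, 2), (8, 8), (9, 8), (8, 9)]
final_centroids, final_clusters = kmeans(sample, centroids=[(1, 1), (8, 8)])
print(final_centroids)
print(final_clusters)
```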
📘 Group-C (Long Answer – 15 Marks)
Q7(a) Apply K-Medoid Algorithm (Manhattan distance). Given points X1(2,6), X2(3,4), X3(3,8), X4(4,7), X5(6,2), X6(6,4), X7(7,3), X8(7,4), X9(8,5), X10(7,6).

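Assign each point to the nearer medoid by Manhattan distance. With medoids X4(4,7) and X8(7,4):
Cluster1: X1, X2, X3, X4 → distances to X4: 3, 4, 2, 0.
Cluster2: X5, X6, X7, X8, X9, X10 → distances to X8: 3, 1, 1, 0, 2, 2.
Total cost = (3+4+2) + (3+1+1+2+2) = 18. Note: X2 is equidistant (distance 4) from both medoids; it is placed in Cluster1 here.

Final clusters: C₁ = {X1,X2,X3,X4}, C₂ = {X5,X6,X7,X8,X9,X10}.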
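
The costs can be verified with a short script. Note that this is an exhaustive check over all pairs of candidate medoids, which is feasible only because there are ten points; it is not the swap-based PAM procedure usually expected in a written answer. The helper names (`manhattan`, `total_cost`, `assign`) are illustrative.

```python
from itertools import combinations

# The ten points from the question.
points = {
    "X1": (2, 6), "X2": (3, 4), "X3": (3, 8), "X4": (4, 7), "X5": (6, 2),
    "X6": (6, 4), "X7": (7, 3), "X8": (7, 4), "X9": (8, 5), "X10": (7, 6),
}


def manhattan(p, q):
    return sum(abs(a - b) for a, b in zip(p, q))


def total_cost(medoids):
    """Sum of each point's Manhattan distance to its nearest medoid."""
    return sum(min(manhattan(p, points[m]) for m in medoids) for p in points.values())


def assign(medoids):
    """Map each medoid to its cluster; a tie goes to the medoid listed first."""
    clusters = {m: [] for m in medoids}
    for name, p in points.items():
        nearest = min(medoids, key=lambda m: manhattan(p, points[m]))
        clusters[nearest].append(name)
    return clusters


cost_x4_x8 = total_cost(("X4", "X8"))
cost_x2_x8 = total_cost(("X2", "X8"))
best_cost = min(total_cost(pair) for pair in combinations(points, 2))
clusters = assign(("X4", "X8"))
print(cost_x4_x8, cost_x2_x8, best_cost)
print(clusters)
```

No pair of medoids gives a lower total cost than (X4, X8); the pairs (X1, X8) and (X3, X8) tie with it, so the choice of medoid for the first cluster is not unique.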

Q7(b) Four Axioms of Distance Metrics.
  • Non-negativity: d(x,y) ≥ 0
  • Identity: d(x,y)=0 ⇔ x=y
  • Symmetry: d(x,y)=d(y,x)
  • Triangle inequality: d(x,z) ≤ d(x,y)+d(y,z)
Q7(c) Show Manhattan Distance satisfies all axioms.

Manhattan: d(x,y)= Σ|xᵢ - yᵢ|.
✔ Non-negativity: absolute values ≥0.
✔ Identity: zero iff all coordinates equal.
✔ Symmetry: |xᵢ-yᵢ| = |yᵢ-xᵢ|.
✔ Triangle: |xᵢ - zᵢ| ≤ |xᵢ-yᵢ|+|yᵢ-zᵢ|, summing preserves inequality. Hence valid metric.
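
As a sanity check alongside the proof, the four axioms can be tested numerically on the ten points given in Q7(a). A finite check like this illustrates the axioms but does not replace the argument above; the function name `satisfies_metric_axioms` is an assumption made for this sketch.

```python
from itertools import product

# The ten points given in Q7(a); all are distinct.
pts = [(2, 6), (3, 4), (3, 8), (4, 7), (6, 2), (6, 4), (7, 3), (7, 4), (8, 5), (7, 6)]


def manhattan(x, y):
    return sum(abs(a - b) for a, b in zip(x, y))


def satisfies_metric_axioms(d, pts):
    """Check the four axioms for distance function `d` on every pair and triple in `pts`."""
    for x, y in product(pts, repeat=2):
        if d(x, y) < 0:                 # non-negativity
            return False
        if (d(x, y) == 0) != (x == y):  # identity
            return False
        if d(x, y) != d(y, x):          # symmetry
            return False
    # Triangle inequality over every ordered triple.
    return all(d(x, z) <= d(x, y) + d(y, z) for x, y, z in product(pts, repeat=3))


axioms_hold = satisfies_metric_axioms(manhattan, pts)
print(axioms_hold)
```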

Q8(a) What is Data Stream?

Data stream: continuous, rapid, infinite sequence of data (sensor networks, financial transactions). Characteristics: continuous, infinite, real-time processing.

Q8(b) Challenges of Stream Data Mining.
  • Huge volume
  • Limited memory
  • High arrival rate
  • Concept drift
  • Real-time processing
  • Single-pass requirement
Q8(c) Synopsis and Synopsis Data Structures.

Synopsis: compact summary of stream data. Structures: sampling, histograms, wavelets, sketches. Advantages: reduced memory, fast processing, approximate query answering.

Q8(d) Stream Data Models.

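Reservoir sampling: maintains a fixed-size uniform random sample of the stream seen so far. Progressive sampling: sample size grows as more of the stream arrives. Sliding window: keeps only the most recent data and discards older items. Applications: real-time analytics, fraud detection.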
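
Two of these techniques are short enough to sketch directly. The reservoir sampler below follows the classic single-pass scheme (often called Algorithm R); the sliding-window helper reports a running mean over the most recent items. The function names, the window width, and the choice of the mean as the window statistic are assumptions for illustration.

```python
import random
from collections import deque


def reservoir_sample(stream, k, rng=random):
    """Keep a uniform random sample of k items from a stream in one pass."""
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            reservoir.append(item)  # fill the reservoir first
        else:
            j = rng.randint(0, i)   # inclusive on both ends
            if j < k:
                reservoir[j] = item  # replace with probability k / (i + 1)
    return reservoir


def sliding_window_mean(stream, width):
    """Yield the mean of the most recent `width` items after each arrival."""
    window = deque(maxlen=width)  # old items fall out automatically
    for item in stream:
        window.append(item)
        yield sum(window) / len(window)


sample = reservoir_sample(range(1000), k=10, rng=random.Random(0))
means = list(sliding_window_mean([1, 2, 3, 4], width=2))
print(sample)
print(means)
```

Both functions use memory that depends only on `k` or `width`, not on the length of the stream, which is the property the single-pass and limited-memory constraints require.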

Q9(a) K-Means Clustering (Procedure).

Procedure: Initialize K centers → Euclidean distance → assign nearest → update centroids → repeat until convergence. Formula: d = √[(x₁-x₂)²+(y₁-y₂)²]. Advantages: simple, fast, scalable.

Q9(b) Describe CLARA and CLARANS.

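CLARA: Clustering Large Applications – draws random samples from the data, applies PAM to each sample, and keeps the best set of medoids. Suitable for large data.
CLARANS: Clustering Large Applications based upon RANdomized Search – starts from a set of medoids and repeatedly examines randomly chosen neighbors (one medoid swapped), moving whenever the cost improves. Usually better quality than CLARA.
Difference: CLARA searches only within its samples (faster, but quality depends on the sample) vs CLARANS searches the whole dataset with randomized swaps (more accurate, more computation).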

Q10(a) Applications of Similarity Search in Time-Series Analysis.
  • Pattern matching
  • Medical diagnosis (ECG)
  • Stock market analysis
  • Weather forecasting
  • Speech recognition
  • Anomaly detection
Q10(b) Why Normalization is Necessary?
  • Removes scale differences
  • Improves accuracy
  • Prevents attribute dominance
  • Required for distance-based algorithms
Q10(c) Min-Max Scaling and Z-Score Normalization. Given X={12,19,21,23,25,35,47,48,59,65}.

Min-Max: v' = (v-min)/(max-min), min=12, max=65 → 12→0, 65→1. Range [0,1].
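Z-Score: z = (x-μ)/σ, μ=35.4, σ≈17.36 (population standard deviation, dividing by n=10). Example: 12→(12-35.4)/17.36 = -1.35 ; 65→(65-35.4)/17.36 = 1.71. If the sample standard deviation (dividing by n-1) is used instead, s≈18.30, giving -1.28 and 1.62.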
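
Both normalizations can be computed for the whole data set with the standard library. The sketch below uses `statistics.pstdev` (divides by n) for the z-scores and also computes `statistics.stdev` (divides by n-1) for comparison; which one an examiner expects is an assumption to confirm against the course convention.

```python
import statistics

X = [12, 19, 21, 23, 25, 35, 47, 48, 59, 65]

# Min-max scaling to [0, 1].
lo, hi = min(X), max(X)
min_max = [(v - lo) / (hi - lo) for v in X]

# Z-score normalization.
mu = statistics.mean(X)
sigma_pop = statistics.pstdev(X)    # population standard deviation (divide by n)
sigma_sample = statistics.stdev(X)  # sample standard deviation (divide by n - 1)
z_scores = [(v - mu) / sigma_pop for v in X]

print([round(v, 3) for v in min_max])
print(round(mu, 2), round(sigma_pop, 2), round(sigma_sample, 2))
print([round(z, 2) for z in z_scores])
```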

Q11(a) Differentiate Supervised and Unsupervised Learning.
Supervised | Unsupervised
Labeled data, predict output | Unlabeled data, discover patterns
Classification/Regression | Clustering/Association
Examples: Decision Tree, KNN | Examples: K-Means, DBSCAN

Q11(b) Explain KNN Algorithm with Example.

KNN (K-Nearest Neighbor): supervised learning, classifies based on majority of K nearest neighbors.
Steps: choose K, compute distances, find K neighbors, majority voting.
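Example: points (1,1)A, (2,2)A, (6,6)B, (7,7)B. New point (3,3), taking K=3: the three nearest neighbors are (2,2)A at distance ≈1.41, (1,1)A at ≈2.83 and (6,6)B at ≈4.24 → majority vote 2A vs 1B → predicts A.
Advantages: simple, no training phase.
Disadvantages: slow for large data, sensitive to noise.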
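
The example can be reproduced with a few lines of Python. This sketch assumes K=3 and Euclidean distance, neither of which the question fixes; `knn_predict` is a hypothetical helper name.

```python
import math
from collections import Counter

# Training points and labels from the example.
training = [((1, 1), "A"), ((2, 2), "A"), ((6, 6), "B"), ((7, 7), "B")]


def knn_predict(query, training, k):
    """Majority vote among the k training points closest to `query` (Euclidean)."""
    neighbors = sorted(training, key=lambda item: math.dist(query, item[0]))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]


prediction = knn_predict((3, 3), training, k=3)  # assumed K = 3
print(prediction)
```

Every prediction sorts the full training set by distance, which is why KNN has no training phase but becomes slow as the data grows.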

📖 Exam-focused Summary

✔ Complete coverage of MAKAUT PE-DT602B (2022–23) – Data Warehousing & Data Mining.
✔ Includes K-Medoid, distance axioms, stream mining, CLARA/CLARANS, similarity search, normalization, KNN.
✔ All group answers written in university exam style – ready for 1, 5, and 15-mark questions.