Answer: Prediction is the process of estimating unknown or future values based on historical data using statistical or machine learning techniques.
Answer: Autocorrelation measures the relationship between current and past observations. It helps identify trends, seasonality, and suitable forecasting models.
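As a minimal sketch of the idea above, the sample autocorrelation at a given lag can be computed directly from its definition. The function name `autocorrelation` and the toy series are illustrative assumptions, not part of the syllabus answer.

```python
def autocorrelation(series, lag):
    """Sample autocorrelation of `series` at a positive lag."""
    n = len(series)
    mean = sum(series) / n
    # Denominator: total squared deviation from the mean.
    denom = sum((x - mean) ** 2 for x in series)
    # Numerator: co-variation between each value and the value `lag` steps earlier.
    num = sum((series[t] - mean) * (series[t - lag] - mean) for t in range(lag, n))
    return num / denom

# Hypothetical series that repeats every 4 observations.
data = [1, 2, 3, 4] * 5
acf_lag4 = autocorrelation(data, 4)  # strongly positive: lag matches the season
acf_lag2 = autocorrelation(data, 2)  # negative: half a season out of phase
```

A large positive value at lag 4 and a negative value at lag 2 is the kind of pattern that suggests a seasonal period of 4.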
Web Content Mining: Extracts information from web pages. Web Usage Mining: Analyzes user browsing behavior from web logs.
Answer: Data mining is the process of discovering useful patterns, relationships, and knowledge from large datasets.
Answer: Centroid-based (partitioning) methods such as K-Means group similar data objects around centroids, providing efficient and scalable clustering for large datasets.
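Grouping objects around centroids can be sketched with a one-dimensional version of Lloyd's (K-Means) iteration. This is an illustrative simplification: the starting centroids are supplied by hand, and `kmeans_1d` is a hypothetical helper name.

```python
def kmeans_1d(points, centroids, iterations=10):
    """Lloyd's algorithm on 1-D points, starting from the given centroids."""
    clusters = []
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster
        # (an empty cluster keeps its previous centroid).
        centroids = [
            sum(c) / len(c) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
    return centroids, clusters

points = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
centroids, clusters = kmeans_1d(points, centroids=[1.0, 12.0])
```

Each iteration touches every point once per centroid, which is why the method scales roughly linearly with dataset size.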
Answer: Decomposition separates a time series into trend, seasonal, cyclical, and irregular components for better analysis.
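A rough sketch of additive decomposition follows, assuming an odd seasonal period and folding any cyclical component into the trend. The function name and the synthetic data are assumptions made for illustration.

```python
def decompose_additive(series, period):
    """Split a series into trend, seasonal, and residual parts (additive model)."""
    n = len(series)
    half = period // 2
    # Trend: centered moving average (period assumed odd to keep this short).
    trend = [None] * n
    for t in range(half, n - half):
        window = series[t - half : t + half + 1]
        trend[t] = sum(window) / period
    # Seasonal: average detrended value at each position within the period.
    buckets = [[] for _ in range(period)]
    for t in range(n):
        if trend[t] is not None:
            buckets[t % period].append(series[t] - trend[t])
    seasonal_means = [sum(b) / len(b) for b in buckets]
    # Centre the seasonal effects so they sum to zero over one period.
    offset = sum(seasonal_means) / period
    seasonal = [seasonal_means[t % period] - offset for t in range(n)]
    # Residual (irregular part): what trend and seasonality do not explain.
    residual = [
        series[t] - trend[t] - seasonal[t] if trend[t] is not None else None
        for t in range(n)
    ]
    return trend, seasonal, residual

# Hypothetical data: linear trend plus a repeating pattern of period 3.
pattern = [2.0, -1.0, -1.0]
series = [float(t) + pattern[t % 3] for t in range(12)]
trend, seasonal, residual = decompose_additive(series, period=3)
```

On this synthetic series the trend recovers the underlying line, the seasonal part recovers the repeating pattern, and the residual is zero wherever the trend is defined.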
Answer: Sampling reduces data volume while preserving essential information, enabling efficient stream processing.
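One common way to sample a stream of unknown length is reservoir sampling (Algorithm R), sketched below. The answer above does not name a specific technique, so treat this as one possible illustration; `reservoir_sample` is a hypothetical helper.

```python
import random

def reservoir_sample(stream, k, rng=None):
    """Keep a uniform random sample of k items from a stream of unknown length."""
    rng = rng or random.Random()
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # Replace a random slot with probability k / (i + 1).
            j = rng.randint(0, i)  # randint is inclusive on both ends
            if j < k:
                reservoir[j] = item
    return reservoir

# Seeded generator so the run is repeatable.
sample = reservoir_sample(range(10_000), k=5, rng=random.Random(42))
```

The stream is read once and only `k` items are held in memory, which suits the single-scan constraint of stream processing.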
Answer: Modulation enables efficient transmission of signals over long distances by varying carrier wave properties.
Challenges: Large volume, noise & missing values, seasonal variations, high dimensionality, concept drift.
Solutions: Data preprocessing, smoothing techniques, time-series decomposition, dimensionality reduction, adaptive learning algorithms.
Data Stream: Continuous real-time data (sensor data, stock market).
| Data Streams | Static Data |
|---|---|
| Continuous | Fixed |
| Infinite size | Finite size |
| Real-time processing | Batch processing |
| Single scan | Multiple scans |
| Dynamic | Stable |
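The single-scan constraint in the table can be illustrated with a one-pass running mean and variance (Welford's method), which never stores the stream itself. The class name `RunningStats` and the readings are assumptions for illustration.

```python
class RunningStats:
    """One-pass mean and population variance with O(1) memory."""

    def __init__(self):
        self.count = 0
        self.mean = 0.0
        self.m2 = 0.0  # running sum of squared deviations from the mean

    def update(self, x):
        self.count += 1
        delta = x - self.mean
        self.mean += delta / self.count
        self.m2 += delta * (x - self.mean)

    @property
    def variance(self):
        return self.m2 / self.count if self.count else 0.0

stats = RunningStats()
# Hypothetical sensor readings arriving one at a time.
for reading in [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]:
    stats.update(reading)
```

Each reading is seen exactly once and then discarded, in contrast to batch statistics that need multiple scans of stored data.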
Web structure mining analyzes the hyperlink structure between pages and the HTML/DOM structure within pages. Significance: improves search engines, content extraction, page classification, link analysis, information retrieval.
Graph mining analyzes nodes & edges. Contributions: detects communities, frequent subgraphs, influential nodes, fraud/anomalies, recommendation systems. Example: social network analysis.
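As a small sketch of finding influential nodes in a social network, degree centrality counts how many of the other nodes each node is directly connected to. The names and edges below are made up for illustration, and degree centrality is only one of several possible influence measures.

```python
from collections import defaultdict

def degree_centrality(edges):
    """Fraction of other nodes each node is directly connected to."""
    neighbours = defaultdict(set)
    for u, v in edges:
        # Undirected graph: record the edge in both directions.
        neighbours[u].add(v)
        neighbours[v].add(u)
    n = len(neighbours)
    return {node: len(adj) / (n - 1) for node, adj in neighbours.items()}

# Hypothetical friendship network.
friendships = [
    ("Asha", "Bina"),
    ("Asha", "Chitra"),
    ("Asha", "Dev"),
    ("Bina", "Chitra"),
]
scores = degree_centrality(friendships)
most_influential = max(scores, key=scores.get)
```

Here the node connected to everyone else receives the highest score.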
Significance: trend detection, forecasting, anomaly detection, customer behavior analysis, event sequence discovery.
Advancements: Cloud Data Warehousing, Distributed File Systems (HDFS), Hadoop Ecosystem, Spark, Data Lake, Parallel Query Processing. Impact: Faster analytics, improved scalability, reduced storage cost, real-time processing, fault tolerance. Examples: Hadoop, Spark, Snowflake, BigQuery.
Class imbalance: one class is heavily under-represented relative to the others. Ensemble techniques: Bagging (e.g., Random Forest), Boosting (e.g., AdaBoost, XGBoost). Benefits: improve minority class detection, reduce overfitting, enhance accuracy. Applications: fraud detection, medical diagnosis, spam filtering.
Graph representation: nodes = entities (devices, users, accounts), edges = connections. Contributions: detect unusual nodes, abnormal communication, fraud networks, cyber-attacks, suspicious communities. Applications: intrusion detection, financial fraud, social network monitoring.
Techniques: Association rule mining, clustering, classification, prediction, recommendation systems. Applications: market basket analysis, customer segmentation, demand forecasting, personalized promotions, inventory management → increased sales & retention.
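For market basket analysis, the two basic association-rule measures (support and confidence) can be sketched as follows. The baskets are invented for illustration, and the helper names are assumptions.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(1 for t in transactions if itemset <= set(t)) / len(transactions)

def confidence(transactions, antecedent, consequent):
    """Estimated P(consequent | antecedent) for the rule antecedent -> consequent."""
    both = set(antecedent) | set(consequent)
    return support(transactions, both) / support(transactions, antecedent)

# Hypothetical shopping baskets.
baskets = [
    {"milk", "bread", "butter"},
    {"milk", "bread"},
    {"bread", "jam"},
    {"milk", "butter"},
]
sup = support(baskets, {"milk", "bread"})        # milk and bread bought together
conf = confidence(baskets, {"milk"}, {"bread"})  # rule: milk -> bread
```

Rules whose support and confidence exceed chosen thresholds are the candidates a retailer would use for promotions or shelf placement.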
Scalable methods handle growing dataset size while maintaining performance. Importance: Big data handling, faster execution, memory efficiency, cost-effective. Examples: K-Means, Apriori, FP-Growth, Hadoop MapReduce, Spark MLlib.
Definition: correlation analysis measures the strength and direction of relationships between variables (positive, negative, or zero). Applications: feature selection, market analysis, medical research, financial prediction. Advantages: reduces redundancy, improves model accuracy.
Challenges: huge DB size, high dimensionality, rare itemsets, computational complexity, memory limits, dynamic databases, noise, scalability. Solutions: FP-Growth, parallel mining, data compression.
Sequence mining: discovers frequent ordered patterns from sequential data. Steps: collect sequences → identify frequent sequences → generate patterns. Example: the customer purchases Milk → Bread → Butter and Milk → Bread → Jam yield the frequent sequence Milk → Bread. Applications: web clickstream, bioinformatics, retail analytics.
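The Milk → Bread example can be reproduced with a simplified counter that only considers contiguous patterns of a fixed length. Full sequence miners such as GSP or PrefixSpan also allow gaps between items, so this is a deliberately reduced sketch with a hypothetical function name.

```python
from collections import Counter

def frequent_contiguous_patterns(sequences, min_support, length):
    """Contiguous patterns of `length` items found in at least `min_support` sequences."""
    counts = Counter()
    for seq in sequences:
        # A set ensures each pattern is counted once per sequence.
        seen = {tuple(seq[i : i + length]) for i in range(len(seq) - length + 1)}
        counts.update(seen)
    return {p: c for p, c in counts.items() if c >= min_support}

purchases = [
    ["Milk", "Bread", "Butter"],
    ["Milk", "Bread", "Jam"],
]
patterns = frequent_contiguous_patterns(purchases, min_support=2, length=2)
```

With a minimum support of 2, only the pair Milk → Bread survives, since the other pairs each occur in a single sequence.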
Spectral analysis uses Fourier Transform to identify hidden periodic cycles. Process: transform data → identify dominant frequencies → detect periodicity. Advantages: finds hidden cycles, improves forecasting, detects seasonality. Applications: signal processing, weather, stock analysis.
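The transform → dominant frequency → periodicity process can be sketched with a naive discrete Fourier transform. This O(n²) version is for illustration only (real work would use an FFT library), and `dominant_period` and the synthetic signal are assumptions.

```python
import cmath
import math

def dominant_period(series):
    """Period (in samples) of the strongest non-zero frequency, or None if flat."""
    n = len(series)
    mean = sum(series) / n
    centered = [x - mean for x in series]  # remove the zero-frequency component
    best_k, best_power = None, 0.0
    for k in range(1, n // 2 + 1):
        # DFT coefficient at frequency index k.
        coeff = sum(
            centered[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n)
        )
        power = abs(coeff) ** 2
        if power > best_power:
            best_k, best_power = k, power
    return n / best_k if best_k else None

# Hypothetical signal with a hidden cycle every 8 samples.
signal = [math.sin(2 * math.pi * t / 8) for t in range(64)]
period = dominant_period(signal)
```

The frequency index with the largest power corresponds to the hidden cycle length, here 8 samples.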
Methods: trend analysis, moving average, ARIMA, exponential smoothing, ML models. Applications: stock/weather/sales forecasting, demand prediction. Benefits: better planning, risk reduction, improved decision-making.
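Two of the listed methods, moving average and simple exponential smoothing, are short enough to sketch directly. The sales figures and the smoothing factor `alpha=0.5` are arbitrary values chosen for illustration.

```python
def moving_average(series, window):
    """Trailing moving average; output starts once a full window is available."""
    return [
        sum(series[i - window + 1 : i + 1]) / window
        for i in range(window - 1, len(series))
    ]

def exponential_smoothing(series, alpha):
    """Simple exponential smoothing: recent values get weight alpha."""
    smoothed = [series[0]]
    for x in series[1:]:
        smoothed.append(alpha * x + (1 - alpha) * smoothed[-1])
    return smoothed

# Hypothetical monthly sales.
sales = [10, 12, 14, 16, 18]
ma = moving_average(sales, window=3)
es = exponential_smoothing(sales, alpha=0.5)
```

A larger `alpha` tracks recent changes more closely, while a smaller one gives a smoother but slower-reacting series.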
Applications: Market Basket Analysis, Customer Segmentation, Demand Forecasting, Customer Retention, Recommendation Systems, Inventory Optimization. Benefits: higher sales, better satisfaction, reduced operational cost, improved marketing efficiency.
Data redundancy: multiple attributes contain similar information. Correlation analysis: calculate correlation coefficient → identify highly correlated attributes → remove redundant attributes. Benefits: reduced storage, faster processing, improved model performance. Example: Height(cm) and Height(m) → keep one.
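The three steps above can be sketched with a Pearson correlation coefficient and a greedy filter. The 0.95 threshold, the column names, and the sample values are assumptions for illustration.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def drop_redundant(columns, threshold=0.95):
    """Keep a column only if it is not highly correlated with one already kept."""
    kept = []
    for name in columns:
        if all(abs(pearson(columns[name], columns[k])) < threshold for k in kept):
            kept.append(name)
    return kept

# Hypothetical dataset: height is stored twice in different units.
data = {
    "height_cm": [150.0, 160.0, 170.0, 180.0],
    "height_m": [1.5, 1.6, 1.7, 1.8],
    "weight_kg": [70.0, 52.0, 81.0, 60.0],
}
kept = drop_redundant(data)
```

Because the two height columns are perfectly correlated, only the first one encountered is retained, matching the Height(cm) / Height(m) example.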
✔ Complete coverage of MAKAUT PE-DT602B: Data Warehousing & Data Mining – all groups.
✔ Answers suitable for 1, 5, and 15-mark questions as per university pattern.
✔ Key topics: time series mining, data streams, web mining, graph mining, ensemble methods, correlation, sequence mining, retail applications.