Data Warehousing MCQ
TH
Semester: 8 SEM
Branch: Computer Technology (C.T.) - OLD SYLLABUS
UNIT 5
1. This clustering algorithm terminates when mean values computed for the current iteration of the
algorithm are identical to the computed mean values for the previous iteration Select one:
a. K-Means clustering
b. conceptual clustering
c. expectation maximization
d. agglomerative clustering
2. Find the odd one out. Select one:
a. DBSCAN
b. K means
c. PAM
d. K medoid
3. Which statement is true about the K-Means algorithm? Select one:
a. The output attribute must be categorical.
b. All attribute values must be categorical.
c. All attributes must be numeric
d. Attribute values may be either categorical or numeric
4. Which of the following is cluster analysis? Select one:
a. Simple segmentation
b. Grouping similar objects
c. Labeled classification
d. Query results grouping
5. A good clustering method will produce high quality clusters with Select one:
a. high inter class similarity
b. low intra class similarity
c. high intra class similarity
d. no inter class similarity
6. Which statement about outliers is true? Select one:
a. Outliers should be part of the training dataset but should not be present in the test data.
b. Outliers should be identified and removed from a dataset.
c. The nature of the problem determines how outliers are used
d. Outliers should be part of the test dataset but should not be present in the training data.
7. What does K refer to in the K-Means algorithm which is a non-hierarchical clustering approach? Select
one:
a. Complexity
b. Fixed value
c. No of iterations
d. number of clusters
8. Which of the following mentioned clustering methods divides the data into k groups such that each
group must contain at least one object.
a. Partitioning methods
b. Hierarchical methods
c. Density-based methods
d. Grid-based methods
9. Which of the following mentioned clustering methods creates a hierarchical decomposition of the
given set of data objects
a. Partitioning methods
b. Hierarchical methods
c. Density-based methods
d. Grid-based methods
10. Which of the following mentioned clustering methods continue to grow a given cluster as long as the
density (number of objects or data points) in the “neighborhood” exceeds some threshold.
a. Partitioning methods
b. Hierarchical methods
c. Density-based methods
d. Grid-based methods
11. Which of the following mentioned clustering methods quantize the object space into a finite number of
cells that form a grid structure.
a. Partitioning methods
b. Hierarchical methods
c. Density-based methods
d. Grid-based methods
12. The k-means method is not guaranteed to converge to the global optimum and often
terminates at a local optimum.
a. True
b. False
13. The Partitioning Around Medoids (PAM) algorithm is a popular realization of ________________
clustering.
a. K Means
b. K-medoids
c. Extended- PAM
d. BIRCH
14. BIRCH stands for
a. Balanced Iterative Reducing and Clustering using Hierarchies
b. Balanced Iterative Reducing and Classification using Hierarchies
c. Balanced Iterative Regression and Clustering using Hierarchies
d. Balanced Iterative Regression and Classification using Hierarchies
15. DBSCAN stands for
a. Density-Based Classification Based on Connected Regions with High Density
b. Density-Based Classification Based on Connected Regions with Low Density
c. Density-Based Clustering Based on Connected Regions with High Density
d. Density-Based Clustering Based on Connected Regions with Low Density
16. A cluster is a collection of data objects that are______
a. similar to one another within the same cluster and similar to the objects in other clusters
b. dissimilar to one another within the same cluster and are dissimilar to the objects in other clusters
c. similar to one another within the same cluster and are dissimilar to the objects in other clusters
d. dissimilar to one another within the same cluster and are similar to the objects in other clusters
17. This method can be classified as being either agglomerative (bottom-up) or divisive (top-down), based on
how the hierarchical decomposition is formed
a. Partitioning methods
b. Hierarchical methods
c. Density-based methods
d. Grid-based methods
18. __________ are the simplest form of outlier and the easiest to detect.
a. contextual outliers
b. collective outliers
c. conceptual outliers
d. global outliers
19. _________ methods consult the neighborhood of an object, defined by a given radius. An object is an
outlier if its neighborhood does not have enough other points.
a. Clustering-based outlier detection
b. Classification-based outlier detection
c. Proximity-based outlier detection
d. Distance-based outlier detection
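The neighborhood test described in question 19 can be sketched as follows. This is a minimal illustration on one-dimensional data, not a reference implementation; the function name `radius_outliers` and the parameters `radius` and `min_neighbors` are assumptions introduced here.

```python
def radius_outliers(points, radius, min_neighbors):
    """Flag points whose radius-neighborhood holds fewer than min_neighbors other points."""
    outliers = []
    for i, p in enumerate(points):
        # Count the other points that lie within `radius` of p.
        neighbors = sum(
            1 for j, q in enumerate(points)
            if i != j and abs(p - q) <= radius
        )
        if neighbors < min_neighbors:
            outliers.append(p)
    return outliers


values = [10.0, 10.5, 11.0, 11.5, 25.0]
result = radius_outliers(values, radius=1.0, min_neighbors=2)
print(result)
```

Only the value 25.0 has an empty neighborhood here, so it is the only point flagged.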
20. _______ methods assume that the normal data objects belong to large and dense clusters, whereas
outliers belong to small or sparse clusters, or do not belong to any clusters.
a. Clustering-based outlier detection
b. Classification-based outlier detection
c. Proximity-based outlier detection
d. Distance-based outlier detection
Question Bank : BECT406T: Data Warehousing & Mining (MCQ)
Unit 1
1) __________ is a subject-oriented, integrated, time-variant, nonvolatile collection of data in support of
management decisions.
A. Data Mining.
B. Data Warehousing.
C. Web Mining.
D. Text Mining
8) State whether the following statements about the three-tier data warehouse architecture are True or False.
i) OLAP server is the middle tier of data warehouse architecture.
ii) The bottom tier of data warehouse architecture does not include a metadata repository.
A) i-True, ii-False
B) i-False, ii-True
C) i-True, ii-True
D) i-False, ii-False
11. The … of the data warehouse architecture contains query and reporting tools, analysis tools, and data mining
tools.
A) bottom tier
B) middle tier
C) top tier
D) both B and C
12. Which of the following are the examples of gateways of the bottom tier of data warehouse architecture.
i) ODBC (Open Database Connectivity)
ii) OLE DB (Object Linking and Embedding, Database)
iii) JDBC (Java Database Connectivity)
A) i and ii only
B) ii and iii only
C) i and iii only
D) All i, ii and iii
13. Back-end tools and utilities are used to feed data into the … from operational databases or other external sources.
A) bottom tier
B) middle tier
C) top tier
D) both A and B
14. From the architecture point of view, there are… data warehouse models.
A) two
B) three
C) four
D) five
15. A … contains a subset of corporate-wide data that is of value to a specific group of users.
A) primary warehouse
B) virtual warehouse
C) enterprise warehouse
D) data mart
17. State whether the following statements about the enterprise warehouse are True or False.
i) Enterprise warehouse contains details as well as summarized data.
ii) It provides corporate-wide data integration.
A) i-True, ii-False
B) i-False, ii-True
C) i-True, ii-True
D) i-False, ii-False
18. State whether the following statements about the OLTP system are True.
i) Clerk, database administrators, and database professionals are the users of the OLTP system.
ii) It is used on long-term informational requirements.
iii) It has a short and simple transaction.
A) i and ii only
B) ii and iii only
C) i and iii only
D) All i, ii and iii
19. State whether the following statements about the OLAP system are True or False.
i) Knowledge workers such as managers, executive analysts are the users of the OLAP system.
ii) This system is used in day-to-day operations.
iii) The database size of the OLAP system will be 100GB to TB.
A) i-True, ii-False, iii-True
B) i-False, ii-True, iii-True
C) i-True, ii-True, iii-False
D) i-False, ii-False, iii-True
20. Multidimensional model of a data warehouse can exist in the form of the following schema.
i) Star Schema
ii) Snowflake Schema
iii) Fact Constellation Schema
A) i and ii only
B) ii and iii only
C) i and iii only
D) All i, ii and iii
21. In the … the dimension tables are displayed in a radial pattern around the central fact table.
A) snowflake schema
B) star schema
C) fact schema
D) fact constellation schema
22. The dimension tables of the … model can be kept in the normalized form to reduce the redundancies.
A) snowflake schema
B) star schema
C) fact schema
D) fact constellation schema
23. State whether the following statements about the fact constellation schema are True or False.
i) The fact constellation schema is also called galaxy schema.
ii) The fact constellation schema allows dimension tables to be shared between fact tables.
iii) This kind of schema can be viewed as a collection of snowflakes.
A) i-True, ii-False, iii-True
B) i-False, ii-True, iii-True
C) i-True, ii-True, iii-False
D) i-False, ii-False, iii-True
24. Which of the following are the different OLAP operations performed in the multidimensional data model.
i) Roll-up
ii) Roll-down
iii) Drill-down
iv) Slice
A) i, ii, and iii only
B) ii, iii, and iv only
C) i, iii, and iv only
D) All i, ii, iii, and iv
25. When … operation is performed, one or more dimensions from the data cube are removed.
A) roll-up
B) roll-down
C) drill-down
D) drill-up
26. The … operation selects one particular dimension from a given cube and provides a new subcube.
A) drill
B) dice
C) pivot
D) slice
27. The … operation rotates the data axes in view in order to provide an alternative presentation of data.
A) drill
B) dice
C) pivot
D) slice
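The OLAP operations named in questions 24 to 27 can be illustrated on a tiny cube. This is a hedged sketch: the cube contents, the dimension order (city, quarter, item), and the helper names are all made up for illustration.

```python
from collections import defaultdict

# Hypothetical sales cube keyed by (city, quarter, item).
cube = {
    ("Nagpur", "Q1", "phone"): 10,
    ("Nagpur", "Q2", "phone"): 20,
    ("Pune", "Q1", "phone"): 5,
    ("Pune", "Q1", "laptop"): 7,
}


def roll_up_remove_item(cube):
    """Roll-up by dimension reduction: drop the item dimension and sum the measure."""
    out = defaultdict(int)
    for (city, quarter, _item), sales in cube.items():
        out[(city, quarter)] += sales
    return dict(out)


def slice_quarter(cube, quarter):
    """Slice: fix one value of the quarter dimension, giving a 2-D subcube."""
    return {(c, i): s for (c, q, i), s in cube.items() if q == quarter}


def pivot(table):
    """Pivot: swap the two axes of a 2-D view without changing any values."""
    return {(b, a): v for (a, b), v in table.items()}


rolled = roll_up_remove_item(cube)
q1 = slice_quarter(cube, "Q1")
rotated = pivot(q1)
print(rolled, q1, rotated, sep="\n")
```

Storing the cube as a dictionary keyed by tuples keeps each operation to a few lines; a real OLAP server would use precomputed aggregates rather than scanning every cell.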
28. Which of the following are the different types of OLAP servers.
i) Relational OLAP
ii) Multidimensional OLAP
iii) Hybrid OLAP
iv) Specialized SQL Servers
A) i, ii, and iii only
B) ii, iii, and iv only
C) i, iii, and iv only
D) All i, ii, iii, and iv
30. Data that can be modeled as dimension attributes and measure attributes are called _______ data.
a) Multidimensional
b) Singledimensional
c) Measured
d) Dimensional
UNIT II
1. ...................... is an essential process where intelligent methods are applied to extract data patterns.
A) Data warehousing
B) Data mining
C) Text mining
D) Data selection
5. ............................. is a comparison of the general features of the target class data objects against the
general features of objects from one or multiple contrasting classes.
A) Data Characterization
B) Data Classification
C) Data discrimination
D) Data selection
7. ............................. is the process of finding a model that describes and distinguishes data classes or
concepts.
A) Data Characterization
B) Data Classification
C) Data discrimination
D) Data selection
14. Which of the following can be considered as the classification or mapping of a set or class with some
pre-defined group or classes?
a. Data set
b. Data Characterization
c. Data Sub Structure
d. Data Discrimination
15. Which one of the following can be defined as the data object which does not comply with the general
behavior (or the model of available data)?
a. Evaluation Analysis
b. Outlier Analysis
c. Classification
d. Prediction
16. Which one of the following statements is not correct about data cleaning?
a. It refers to the process of data cleaning
b. It refers to the transformation of wrong data into correct data
c. It refers to correcting inconsistent data
d. All of the above
19. Which of the following is also used as the first step in the knowledge discovery process?
a. Data selection
b. Data cleaning
c. Data transformation
d. Data integration
20. Which of the following refers to the step of the knowledge discovery process in which several data
sources are combined?
a. Data selection
b. Data cleaning
c. Data transformation
d. Data integration
23.The issues of “Scalability and efficiency of the data mining algorithms” come under:
a. User Interaction and Mining Methodology Issues
b. Diverse Data Types Issues
c. Performance Issues
d. None of the above
26.__________ means the description and trends or model regularities for those objects whose behavior
would change eventually over time.
a. Evolution Analysis
b. Outlier Analysis
c. Classification
d. Prediction
28. The issue of “Handling complex and relational types of data” comes under:
a. User Interaction and Mining Methodology Issues
b. Diverse Data Types Issues
c. Performance Issues
d. None of the above
29.Multiple numbers of data sources get combined in which step of the Knowledge Discovery?
a. Data Transformation
b. Data Selection
c. Data Integration
d. Data Cleaning
2) BIRCH is a ________
A) agglomerative clustering algorithm.
B) hierarchical algorithm.
C) hierarchical-agglomerative algorithm.
D) divisive.
4) In ________ algorithm each cluster is represented by the center of gravity of the cluster.
A) k-medoid.
B) k-means.
C) STIRR
D) ROCK.
5) In ___________ each cluster is represented by one of the objects of the cluster located near the center.
A) k-medoid.
B) k-means.
C) STIRR.
D) ROCK.
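Questions 4 and 5 contrast two ways of representing a cluster: by its center of gravity, or by an actual member object near the center. A small sketch of the difference follows; the sample cluster and the choice of Manhattan distance are assumptions made for illustration.

```python
def centroid(points):
    """Center of gravity: the coordinate-wise mean, which need not be a real object."""
    n = len(points)
    return (sum(x for x, _ in points) / n, sum(y for _, y in points) / n)


def medoid(points):
    """The member object with the smallest total distance to all other members."""
    def total_distance(p):
        # Manhattan distance is an arbitrary choice for this sketch.
        return sum(abs(p[0] - q[0]) + abs(p[1] - q[1]) for q in points)
    return min(points, key=total_distance)


cluster = [(1, 1), (2, 1), (1, 2), (2, 2), (9, 9)]
print(centroid(cluster))
print(medoid(cluster))
```

The far-away point (9, 9) drags the mean to (3.0, 3.0), which is not a member of the cluster, while the medoid stays on the real object (2, 2). This is why the object-based representative is less sensitive to outliers.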
13) What is a dendrogram?
A) A hierarchical structure
B) A diagram structure
C) A graph structure
D) None
15) __________ methods consider clusters to be dense regions that have some similarity and differ from the
lower-density regions of the space
A) Density-Based
B) Hierarchical Based
C) Grid-based
D) None of these
18) A _________ is a decision support tool that uses a tree-like graph or model of decisions and their
possible consequences, including chance event outcomes, resource costs, and utility.
a) Decision tree
b) Graphs
c) Trees
d) Neural Networks
23) Which of the following refers to the problem of finding abstracted patterns (or structures) in the unlabeled
data?
a. Supervised learning
b. Unsupervised learning
c. Hybrid learning
d. Reinforcement learning
24)Which one of the following statements about the K-means clustering is incorrect?
a. The goal of the k-means clustering is to partition (n) observation into (k) clusters
b. K-means clustering can be defined as the method of quantization
c. The nearest neighbor is the same as the K-means
d. All of the above
25) Which one of the clustering techniques needs the merging approach?
a. Partitioned
b. Naïve Bayes
c. Hierarchical
d. Both A and C
26) How do you choose the right node while constructing a decision tree?
(A) An attribute having high entropy
(B) An attribute having high entropy and information gain
(C) An attribute having the lowest information gain.
(D) An attribute having the highest information gain
2) The paths from the root node to the nodes labelled 'a' are called __________.
A) transformed prefix path.
B) suffix subpath.
C) transformed suffix path.
D) prefix subpath
3) The transformed prefix paths of a node 'a' form a truncated database of patterns that co-occur with
'a', which is called _______.
A) suffix path.
B) FP-tree.
C) conditional pattern base.
D) prefix path
5). Which of the following are interestingness measures for association rules?
a. recall
b. lift
c. accuracy
d. compactness
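For reference, the standard association-rule measures (support, confidence, and lift) can be computed as below. The transactions are invented for illustration, and the helper names are assumptions of this sketch.

```python
# Hypothetical market-basket transactions.
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]


def support(itemset):
    """Fraction of transactions that contain every item in itemset."""
    return sum(1 for t in transactions if itemset <= t) / len(transactions)


def confidence(x, y):
    """Estimate of P(Y | X) for the rule X => Y."""
    return support(x | y) / support(x)


def lift(x, y):
    """Confidence of X => Y divided by the baseline support of Y."""
    return confidence(x, y) / support(y)


value = lift({"bread"}, {"milk"})
print(round(value, 3))
```

A lift below 1 (as for bread => milk in this toy data) suggests the two items occur together less often than independence would predict; a lift above 1 suggests a positive association.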
12). The basic idea of the apriori algorithm is to generate________ item sets of a particular size & scans the
database.
A. candidate.
B. primary.
C. secondary.
D. Superkey.
13). If an item set ‘XYZ’ is a frequent item set, then all subsets of that frequent item set are
a. Undefined
b. Not frequent
c. Frequent
d. Can not say
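The relationship between a frequent itemset and its subsets in question 13 can be checked by brute force on a small database. The transactions and the minimum support count below are assumptions chosen for illustration.

```python
from itertools import combinations

# Hypothetical transactions; the itemset {X, Y, Z} appears in two of them.
transactions = [
    {"X", "Y", "Z"},
    {"X", "Y", "Z", "W"},
    {"X", "Y"},
    {"Z"},
]
MIN_SUPPORT_COUNT = 2  # assumed threshold


def support_count(itemset):
    """Number of transactions containing every item of itemset."""
    return sum(1 for t in transactions if set(itemset) <= t)


def nonempty_subsets(itemset):
    """Yield every non-empty subset of itemset as a tuple."""
    items = sorted(itemset)
    for size in range(1, len(items) + 1):
        yield from combinations(items, size)


xyz_is_frequent = support_count({"X", "Y", "Z"}) >= MIN_SUPPORT_COUNT
all_subsets_frequent = all(
    support_count(s) >= MIN_SUPPORT_COUNT for s in nonempty_subsets({"X", "Y", "Z"})
)
print(xyz_is_frequent, all_subsets_frequent)
```

Every transaction that contains XYZ also contains each of its subsets, so a subset's support count can never be lower than that of XYZ.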
14) A frequent pattern tree is a tree structure consisting of ________
A) an item-prefix-tree
B) a frequent-item-header table.
C) a frequent-item-node.
D) both A &B
15) Frequency of occurrence of an itemset is called as _____
(a) Support
(b) Confidence
(c) Support Count
(d) Rules
16) An itemset whose support is greater than or equal to a minimum support threshold is ______
(a) Itemset
(b) Frequent Itemset
(c) Infrequent items
(d) Threshold values
(a) It mines all frequent patterns through pruning rules with lesser support
(b) It mines all frequent patterns through pruning rules with higher support
(c) It mines all frequent patterns by constructing a FP tree
(d) It mines all frequent patterns by constructing an itemsets
18) What techniques can be used to improve the efficiency of apriori algorithm?
(a) Apriori
(b) FP growth
(c) Decision trees
(d) Eclat
(a) Apriori
(b)FP Growth
(c) Naive Bayes
(d)Decision Trees
27) For the question given below consider the data Transactions :
(a) <I1>, <I2>, <I4>, <I5>, <I6>, <I1, I4>, <I2, I4>, <I2, I5>, <I4, I5>, <I4, I6>, <I2, I4, I5>
(b) <I2>, <I4>, <I5>, <I2, I4>, <I2, I5>, <I4, I5>, <I2, I4, I5>
(c) <I11>, <I4>, <I5>, <I6>, <I1, I4>, <I5, I4>, <I11, I5>, <I4, I6>, <I2, I4, I5>
(d) <I1>, <I4>, <I5>, <I6>
(a) Concurrent
(b) Consistent
(c) Constant
(d) Compete
2) _______ mining is concerned with discovering the model underlying the link structures of the web.
A) Data structure.
B) Web structure
C) Text structure
D) Image structure.
5) In web mining, _________ is used to know the order in which URLs tend to be accessed.
A) clustering.
B) associations.
C) sequential analysis.
D) classification.
6) In web mining, _________ is used to know which URLs tend to be requested together.
A.clustering.
B.associations.
C.sequential analysis.
D.classification.
7) __________ describes the discovery of useful information from the web contents.
A)Web content mining.
B) Web structure mining.
C) Web usage mining.
D) All of the above.
8)_______ is concerned with discovering the model underlying the link structures of the web
A) Web content mining.
B) Web structure mining
C) Web usage mining.
D) All of the above
9)A link is said to be _________ link if it is between pages with different domain names.
A) intrinsic.
B) transverse.
C) direct.
D) contrast.
10) A link is said to be _______ link if it is between pages with the same domain name.
A) intrinsic.
B) transverse.
C) direct.
D) contrast.
11) Hierarchical, Partitioning, Grid-based and density based methods are the methods of
Clustering
Classification
Association
Outlier Detection
12. Web structure mining is the process of discovering ____ information from the web
Semi structured
Unstructured
Structured
Data Mining
Text Mining
Both a and b
None of these
14. Select non predictive data mining technique from below options
Summarization
Classification
Regression
Association rule
Clustering
Regression
Classification
16. PageRank is a metric for ________documents based on their quality
- ranking hypertext
- ranking document structure
- ranking web content
- None of these
17. Select non descriptive data mining technique from options below
Options
- Clustering
- Summarization
- Sequence Discovery
- Classification
18. Select non predictive data mining technique from below options
Options
- Summarization
- Classification
- Regression
- Time Series Analysis
19. In data mining, Data objects that do not comply with general behavior or model of the data are called as
Options
- Clusters
- Centroids
- Outliers
- None of these
20. Web usage mining refers to the discovery of user access patterns from Web usage logs
True
False
21. BIRCH stands for
Balanced Interactive Regression and Clustering using Hierarchies
Data cleaning
Data Reduction
Regression
Data Loading
23. In data mining, Data objects that do not comply with general behavior or model of the data are called as
Clusters
Centroids
Outliers
None of these
24. Web Server Data includes ________
IP address,
page reference
access time
Web pages
Web hyperlinks
Web data
Web contents
UNIT VI Big data Analytics
1. Hadoop is a framework that works with a variety of related tools. Common cohorts include ____________
a) MapReduce, Hive and HBase
b) MapReduce, MySQL and Google Apps
c) MapReduce, Hummer and Iguana
d) MapReduce, Heron and Trumpet
3. __________ can best be described as a programming model used to develop Hadoop-based applications
that can process massive amounts of data.
a) MapReduce
b) Mahout
c) Oozie
d) All of the mentioned
6.Above the file systems comes the ________ engine, which consists of one Job Tracker, to which client
applications submit MapReduce jobs.
A. MapReduce
B. Google
C. Functional Programming
D. Facebook
7. ________ is a platform for constructing data flows for extract, transform, and load (ETL) processing and
analysis of large datasets.
A. Pig Latin
B. Oozie
C. Pig
D. Hive
8. According to analysts, for what can traditional IT systems provide a foundation when they’re integrated
with big data technologies like Hadoop?
a) Big data management and data mining
b) Data warehousing and business intelligence
c) Management of Hadoop clusters
d) Collecting and storing unstructured data
A. Tera
B. Giga
C. Peta
D. Meta
12. Unprocessed data or processed data are observations or measurements that can be expressed as
text, numbers, or other types of media.
A. True
B. False
13. In computers, a ____ is a symbolic representation of facts or concepts from which information
may be obtained with a reasonable degree of confidence.
A. Data
B. Knowledge
C. Program
D. Algorithm
17. Virtualization separates resources and services from the underlying physical delivery environment.
A. True
B. False
A. Virtualization layer
B. Storage layer
C. Abstract layer
D. None of the mentioned above
A. SQL
B. DBMS
C. NoSQL
D. RDBMS
A. Python
B. C++
C. R
D. Java
23. Big data deals with high-volume, high-velocity and high-variety information assets,
A. True
B. False
24. _____ hypervisor runs directly on the underlying host system. It is also known as "Native Hypervisor"
or "Bare metal hypervisor".
A. TYPE-1 Hypervisor
B. TYPE- 2 Hypervisor
C. Both A and B
D. None of the mentioned above
A. TYPE-1 Hypervisor
B. TYPE- 2 Hypervisor
C. Both A and B
D. None of the mentioned above
26. In the layered architecture of Big Data Stack, Interfaces and feeds,
27. _____ is the supporting physical infrastructure that is fundamental to the operation and scalability of
big data architecture.
28. The physical infrastructure of a big data is based on a distributed computing model.
A. True
B. False
29. Security infrastructure refers the data about your constituents needs to be protected to ____.
A. Processing of data
B. User friendly representation
C. Both A and B
D. None of the mentioned above
32. The significance of metadata is to provide information about a dataset’s characteristics and
structure.
A. True
B. False
A. True
B. False
A. Cost Reduction
B. Time Reductions
C. Smarter Business Decisions
D. All of the mentioned above
35. Amongst which of the following is/are not Big Data Technologies?
A. Apache Hadoop
B. Apache Spark
C. Apache Kafka
D. Apache Pytarch
1. Information can be converted into knowledge about ___ patterns and future trends.
Ans: Historical
4. ___ and ___ are the key to emerging Business Intelligence technologies.
Ans: Data warehouse and data mining
6. Online Analytical Processing (OLAP) is a technology that is used to create ___ software.
Ans: Decision support
9. ___ Optimization techniques are based on the concepts of genetic combination, mutation, and natural selection.
Ans: Genetic algorithms
11. A data warehouse refers to a database that is maintained separately from an organization’s operational databases.
(True/False)
Ans: True
12. A data warehouse is usually constructed by integrating multiple heterogeneous sources. (True/False)
Ans: True
13. ___ system is customer-oriented and is used for transaction and query processing by clerks, clients, and
information technology professionals.
Ans: OLTP
15. In ___ schema some dimension tables are normalized, thereby further splitting the data into additional tables.
Ans: Snowflake
16. The ___ data model is commonly used in the design of relational databases.
Ans: Entity-relationship
17. Data warehouses and OLAP tools are based on ___ data model.
Ans: Multidimensional
18. The ___ exposes the information being captured, stored, and managed by operational systems.
Ans: Data source view
19. ___ are the intermediate servers that stand in between a relational back-end server and client front-end
tools.
Ans: Relational OLAP (ROLAP) servers
21. The ___ software gives the user the opportunity to look at the data from a variety of different dimensions.
Ans: Multidimensional Analysis
23. Based on the overall requirements of business intelligence, the ___ layer is required to extract, cleanse and
transform data into load files for the information warehouse.
Ans: Data integration
27. ___ routines attempt to fill in missing values, smooth out noise while identifying outliers, and correct
inconsistencies in the data.
Ans: Data cleaning
28. ___ is used to refer to systems and technologies that provide the business with the means for decision-makers
to extract personalized meaningful information about their business and industry.
Ans: Business Intelligence
29. In ___ each value in a bin is replaced by the mean value of the bin.
Ans: Smoothing by bin means
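Smoothing by bin means from question 29 can be sketched as follows, assuming equal-frequency bins over the sorted values. The price list and the bin size of 3 are illustrative choices.

```python
def smooth_by_bin_means(values, bin_size):
    """Sort the values, split them into bins of bin_size, and replace each value by its bin mean."""
    ordered = sorted(values)
    smoothed = []
    for start in range(0, len(ordered), bin_size):
        bin_values = ordered[start:start + bin_size]
        mean = sum(bin_values) / len(bin_values)
        smoothed.extend([mean] * len(bin_values))
    return smoothed


prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]
result = smooth_by_bin_means(prices, bin_size=3)
print(result)
```

The three bins [4, 8, 15], [21, 21, 24], and [25, 28, 34] have means 9, 22, and 29, so every value is replaced by one of those three numbers.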
30. ___ regression involves finding the “best” line to fit two variables so that one variable can be used to predict
the other.
Ans: Linear
31. ___ works to remove the noise from the data that includes techniques like binning, clustering, and regression.
Ans: Smoothing
33. The ___ technique uses encoding mechanisms to reduce the data set size.
Ans: Data compression
35. ___ hierarchies can be used to reduce the data by collecting and replacing low-level concepts by higher-level
concepts.
Ans: Concept
36. The ___ rule can be used to segment numeric data into relatively uniform, “natural” intervals.
Ans: 3-4-5
38. Data Base Management System (DBMS) supports query languages. (True/False)
Ans: True
39. The ___ item sets find all sets of items (items sets) whose support is greater than the user-specified minimum
support, σ.
Ans: Frequent set
40. A frequent set is a ___ if it is a frequent set and no superset of this is a frequent set.
Ans: Maximal frequent set
41. ___ techniques are used to detect relationships or associations between specific values of categorical variables
in large data sets.
Ans: Association rule mining
43. Using a decision tree, only categorical variables would be modelled. (True/False).
Ans: False
46. For a given transaction database T, a ___ is an expression of the form X => Y, where X and Y are subsets of
A and X => Y holds with confidence τ, if τ% of the transactions in T that support X also support Y.
Ans: Association rule
47. The ___ rule describes associations between quantitative items or attributes.
Ans: Quantitative association
48. The ___ step eliminates the extensions of (k-1)-itemsets, which are not found to be frequent, from being
considered for counting support.
Ans: Pruning
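The pruning step in question 48 can be sketched as a filter over candidate k-itemsets. This is a simplified illustration that assumes the frequent (k-1)-itemsets are already known; the itemsets and the function name `prune` are made up for the example.

```python
from itertools import combinations


def prune(candidates, frequent_prev):
    """Keep only the candidate k-itemsets whose (k-1)-subsets are all frequent."""
    kept = []
    for candidate in candidates:
        k = len(candidate)
        if all(frozenset(s) in frequent_prev for s in combinations(candidate, k - 1)):
            kept.append(candidate)
    return kept


# Assumed frequent 2-itemsets from a previous pass.
frequent_2 = {frozenset(p) for p in [("A", "B"), ("A", "C"), ("B", "C"), ("B", "D")]}
candidates_3 = [frozenset("ABC"), frozenset("ABD")]
survivors = prune(candidates_3, frequent_2)
print(survivors)
```

The candidate {A, B, D} is dropped without counting its support, because its subset {A, D} is not frequent.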
49. In the first phase of the Partition algorithm, the algorithm logically divides the database into a number of ___.
Ans: non-overlapping partitions.
51. ___ algorithm works like a train running over the data, with stops at intervals M between transactions. When
the train reaches the end of the transaction file it completes one pass.
Ans: DIC Algorithm
54. Data mining systems should provide capabilities to mine association rules at multiple levels of abstraction and
traverse easily among different abstraction spaces (True/False).
Ans: True
55. Which of the following are alternative search strategies for mining multiple-level associations with reduced
support?
a) Level-by-level independent
b) Level-cross filtering by a single item
c) Level-cross filtering by k-itemset
d) All the above
Ans: d) All the above
57. Association rules that involve two or more dimension or predicates can be referred to as ___.
Ans: Multidimensional association rules.
58. An algorithm that performs a series of “walks” through itemset space is called a ___.
Ans: Random walk algorithm.
61. The process of grouping a set of physical or abstract objects into classes of similar objects is called ___.
Ans: Clustering
66. Weight and height of an individual fall into ___ kind of variables.
Ans: Continuous
67. In the K-means algorithm for partitioning, each cluster is represented by the ___ of objects in the cluster.
Ans: Means
68. K-means clustering requires prior knowledge of the number of clusters as its input. (True/False).
Ans: True
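Questions 67 and 68 can be tied together with a small k-means sketch: k is fixed up front by the number of initial means, each cluster is represented by its mean, and the loop stops when the means of the current iteration equal those of the previous one. The one-dimensional data and the starting means are assumptions made for illustration.

```python
def k_means(points, means):
    """1-D k-means: repeat assignment and mean update until the means stop changing."""
    while True:
        # Assignment step: put each point in the cluster of its nearest mean.
        clusters = [[] for _ in means]
        for p in points:
            nearest = min(range(len(means)), key=lambda i: abs(p - means[i]))
            clusters[nearest].append(p)
        # Update step: recompute each mean (an empty cluster keeps its old mean).
        new_means = [sum(c) / len(c) if c else m for c, m in zip(clusters, means)]
        if new_means == means:
            return means, clusters
        means = new_means


points = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
final_means, clusters = k_means(points, [1.0, 2.0])  # k = 2 initial means
print(final_means, clusters)
```

With these starting means the loop settles on 2.0 and 11.0. A different initialization can end in a different result, since k-means may stop at a local optimum.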
70. ___ software provides a set of partitioned clustering algorithms that treat the clustering problem as an
optimization process.
Ans: CLUTO
72. ___ can be viewed as the construction and use of a model to assess the class of an unlabeled sample, or to
assess the value or value ranges of an attribute that a given sample is likely to have.
Ans: Prediction
73. ___ of data removes or reduces noise (by applying smoothing techniques) and the treatment of missing values.
Ans: Pre-processing
74. ___ method refers to the ability to construct the model efficiently given a large amount of data.
Ans: Scalability
76. The basic algorithm for decision tree induction is a ___ algorithm.
Ans: greedy
77. The ___ measure is used to select the test attribute at each node in the tree.
Ans: information gain
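Information gain, as used in questions 76 and 77 to pick the test attribute, can be computed as below. The four training rows and the attribute names are invented for illustration.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def information_gain(rows, attribute, target):
    """Entropy of the target minus the weighted entropy after splitting on attribute."""
    base = entropy([r[target] for r in rows])
    remainder = 0.0
    for value in {r[attribute] for r in rows}:
        subset = [r[target] for r in rows if r[attribute] == value]
        remainder += len(subset) / len(rows) * entropy(subset)
    return base - remainder


# Hypothetical training data.
rows = [
    {"outlook": "sunny", "windy": "no", "play": "no"},
    {"outlook": "sunny", "windy": "yes", "play": "no"},
    {"outlook": "rainy", "windy": "no", "play": "yes"},
    {"outlook": "rainy", "windy": "yes", "play": "yes"},
]
best = max(["outlook", "windy"], key=lambda a: information_gain(rows, a, "play"))
print(best)
```

Splitting on outlook separates the classes perfectly (gain of 1 bit), while windy tells us nothing (gain of 0), so a greedy tree builder would test outlook first.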
79. A ___ is a simple text file that is automatically generated every time someone accesses a website.
Ans: Log File
81. ___ is used to examine the structure of a particular website and collate and analyze related data.
Ans: Structural mining
82. Which of the following techniques are concerned about user navigation accessing?
a. Web structural mining
b. Web usage mining
c. Web content mining
d. Web data definition mining
Ans: b. Web usage mining
84. ___ Web mining involves the development of Sophisticated Artificial Intelligence systems.
Ans: an agent-based approach
85. The ___ approaches to Web mining have generally focused on techniques for integrating and organizing the
heterogeneous and semi-structured data on the Web into more structured and high-level collections of resources.
Ans: database
86. Association rules involving multimedia objects can be mined in ___ and ___ databases.
Ans: Image and video
87. In ___ approach, the signature of an image includes color histograms based on the color composition of an
image regardless of its scale or orientation.
Ans: Color histogram-based signature
88. Which of the following are the measures of the text retrieval documents?
a. Precision
b. Recall
c. F-score
d. a,b,c
Ans: d. a,b,c
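The three retrieval measures in question 88 can be computed from the set of retrieved documents and the set of relevant documents. The document IDs below are made up for illustration, and the F-score shown is the balanced F1 variant.

```python
def precision_recall_f1(retrieved, relevant):
    """Return (precision, recall, F1) for a retrieved set against a relevant set."""
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


retrieved = {"d1", "d2", "d3", "d4"}
relevant = {"d1", "d2", "d5"}
precision, recall, f1 = precision_recall_f1(retrieved, relevant)
print(precision, round(recall, 3), round(f1, 3))
```

Two of the four retrieved documents are relevant (precision 0.5), and two of the three relevant documents were retrieved (recall about 0.667); F1 is their harmonic mean.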
90. Which of the following is the first step in text retrieval systems?
a. Stemming
b. Term words finding
c. Tokenization
d. Replacing the null data with keywords
Ans: c. Tokenization
93. Insurance and direct mail are two industries that rely on ___ to make profitable business decisions.
Ans: data analysis
94. To aid decision-making, analysts construct ___ models using warehouse data to predict the outcomes of a
variety of decision alternatives.
Ans: predictive
95. A ___ profile is a model that predicts the future purchasing behaviour of an individual customer, given historical
transaction data for both the individual and for the larger population of all of a particular company’s customers.
Ans: predictive
96. Data mining can be used to help predict future patient behaviour and to improve treatment programs (True/False).
Ans: True
98. Data mining in the telecommunication industry helps to understand the business involved, identify
telecommunication patterns (True/False).
Ans: True
100. ___ is proving to be a critical link between theory, simulation, and experiment.
Ans: data-intensive computing
101. IDS are based on ___ that are developed by the manual encoding of expert knowledge.
Ans: Handcrafted signatures
103. To improve accuracy, data mining programs are used to analyze audit data and extract features that can
distinguish normal activities from intrusions. (True/False)
Ans: True
104. Data mining-based IDSs (especially anomaly detection systems) have higher false-positive rates than traditional
handcrafted signature-based methods. (True/False)
Ans: True
105. ___ is a new class of intrusion detection algorithms that do not rely on labelled data.
Ans: Unsupervised anomaly detection
106. ___ algorithm uses the frequency distribution of each feature’s values to proportionally generate a sufficient
amount of anomalies.
Ans: Distribution Based Artificial Anomaly
107. OLAP typically includes the following kinds of analyses: simple, comparison, trend, ___ and ___.
Ans: Variance and ranking
108. Patient Rule Induction Method (PRIM) and Weighted Item Sets (WIS) are types of ___ technique.
Ans: Association rule
109. ___ tools cannot discover high average regions or find new patterns in data.
Ans: OLAP
110. ___ method is useful for finding patterns or associations between attributes.
Ans: WIS
KDK COLLEGE OF ENGINEERING, NAGPUR
8TH SEM
D.W.M.
MCQ QUESTION BANK
1. An itemset whose support is greater than or equal to a minimum support threshold is.......................
Option A. itemset
Option B. Frequent itemset
Option C. Threshold values
Option D. None of these
Answer: B
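Note: the definition in question 1 can be checked by hand on a small example. The following is a minimal brute-force sketch, not an efficient miner such as Apriori or FP-growth. The function name frequent_itemsets and the toy transactions are assumptions made for illustration only.

```python
from itertools import combinations

def frequent_itemsets(transactions, min_support):
    """Return {itemset: support} for every itemset whose support
    (fraction of transactions containing it) is >= min_support."""
    n = len(transactions)
    items = sorted(set().union(*transactions))
    result = {}
    # Brute force: test every possible combination of items.
    for size in range(1, len(items) + 1):
        for candidate in combinations(items, size):
            count = sum(1 for t in transactions if set(candidate) <= t)
            support = count / n
            if support >= min_support:
                result[frozenset(candidate)] = support
    return result

# Toy transactions, invented for illustration.
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]
frequent = frequent_itemsets(transactions, min_support=0.5)
for itemset, support in frequent.items():
    print(sorted(itemset), support)
```

With a minimum support of 0.5, {milk, butter} is not frequent because it appears in only one of the four transactions.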
2. The process that analyzes customer buying habits by finding associations between the different items that
customers place in their “shopping baskets” is called
Option A. frequent Item set mining
Option B. Market Basket Analysis
Option C. FP growth
Option D. Predictive analysis
Answer : B
5. Frequent itemset mining leads to the discovery of associations and correlations among items in large
transactional or relational data sets.
Option A. True
Option B. False
Answer : A
10. A partitioning method available for improving the efficiency of the algorithm requires just _____ database
scans to mine the frequent itemset.
Option A. one
Option B. two
Option C. three
Option D. none of the above
Answer : B
11. Apriori candidate generate-and-test method significantly reduces the size of candidate sets, leading to good
performance gain. However, it can suffer from some nontrivial costs also.
Option A. True
Option B. False
Answer : A
13. All of the following are steps followed to deploy a Big Data solution except
Option A. Data Ingestion
Option B. Data Processing
Option C. Data dissemination
Option D. Data Storage
Answer : C
15. __________ is an interdisciplinary field that draws on information retrieval, data mining, machine learning,
statistics, and computational linguistics.
Option A. Data mining
Option B. Web mining
Option C. Text mining
Option D. None of these
Answer : C
16. The techniques that can be used to improve the efficiency of the Apriori algorithm is/are
Option A. hash based techniques
Option B. transaction reduction
Option C. Partitioning
Option D. All of these
Answer: D
17. Web mining is the application of _____________ to discover patterns, structures, and knowledge from
the Web.
Option A. data mining classification
Option B. data mining application
Option C. data mining features
Option D. data mining techniques
Answer : D
19. An .................. system is market-oriented and is used for data analysis by knowledge workers, including
managers, executives, and analysts.
Answer: A
21. Frequent pattern mining can be classified in various ways, based on the following criteria :
Option A. Based on the completeness of patterns to be mined
Option B. Based on the levels of abstraction involved in the rule set
Option C. Based on the number of data dimensions involved in the rule
Option D. All of these
Answer : D
22. ________ method(s) transforms the problem of finding long frequent patterns to searching for shorter ones
recursively and then concatenating the suffix.
Option A. The FP-growth
Option B. Apriori
Option C. Vertical data format
Option D. All of these
Answer : A
23. Web content mining, web structure mining, and web usage mining are the main areas of _________.
Option A. Text mining
Option B. web mining
Option C. Both a and b
Option D. None of these
Answer : B
24. The form of data that has an associated time interval during which it is valid is known as
Option A. Temporal data
Option B. Snapshot data
Option C. Point in time data
Option D. None of these
Answer : A
25. The main purpose for structure mining is to extract previously unknown relationships between
Option A. Web pages
Option B. Web hyperlinks
Option C. Web data
Option D. Web contents
Answer : A
29. Which of the following is an interestingness measure for association rules?
Option A. recall
Option B. lift
Option C. accuracy
Option D. compactness
Answer : B
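Note: lift compares how often two itemsets occur together with how often they would co-occur if they were independent: lift(A, B) = P(A and B) / (P(A) * P(B)). A value above 1 suggests positive correlation, a value below 1 suggests negative correlation. The sketch below assumes toy transactions invented for illustration; the helper name lift is hypothetical.

```python
def lift(transactions, a, b):
    """lift(A, B) = P(A and B) / (P(A) * P(B)), estimated from counts."""
    n = len(transactions)
    count_a = sum(1 for t in transactions if a <= t)
    count_b = sum(1 for t in transactions if b <= t)
    count_ab = sum(1 for t in transactions if (a | b) <= t)
    return (count_ab / n) / ((count_a / n) * (count_b / n))

# Toy data: 10 transactions, invented for illustration.
transactions = (
    [{"game", "video"}] * 4   # both items bought together
    + [{"game"}] * 2          # game only
    + [{"video"}] * 1         # video only
    + [{"other"}] * 3         # neither
)
value = lift(transactions, {"game"}, {"video"})
print(round(value, 3))
```

Here P(game) = 0.6, P(video) = 0.5 and P(both) = 0.4, so the lift is 0.4 / 0.3, which is greater than 1.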
30. Web mining can be organized into _______ main areas.
Option A. One
Option B. Two
Option C. Three
Option D. Four
Answer : C
31. The simple text files that are automatically generated every time someone accesses a website are
Option A. Multimedia files
Option B. Text files
Option C. Log Files
Option D. None of these
Answer : C
32. _________ discovers implicit and useful knowledge from large data sets using data and/or knowledge
visualization techniques.
Option A. Text Data mining
Option B. Web mining
Option C. Visual data mining
Option D. Spatial data mining
Answer : C
34. __________ are data that relate to both space and time.
Option A. Spatial data
Option B. Spatiotemporal data
Option C. Temporal data
Option D. None of these
Answer : B
35. Web structure mining is the process of discovering ____ information from the web.
Option A. Semi structured
Option B. Unstructured
Option C. Structured
Option D. None of these
Answer : C
36. The examination of large amounts of data to see what patterns or other useful information can be found is
known as
Option A. Data examination
Option B. Information analysis
Option C. Big data analytics
Option D. Data analysis
Answer : C
37. The new source of big data that will trigger a big data revolution in the years to come is
Option A. Business transactions
Option B. Social Media
Option C. Transactional data and sensor data
Option D. RDBMS
Answer : C
38. Which is a general-purpose computing model and runtime system for distributed data analytics?
Option A. MapReduce
Option B. Drill
Option C. Oozie
Option D. None of these
Answer : A
48. computer => antivirus software [support = 2%; confidence = 60%]. In the above association rule, a support of 2%
means that
Option A. 2% of all the transactions under analysis show that computer and antivirus software are purchased
together.
Option B. 2% of all the transactions under analysis show that computer or antivirus software may be purchased
together.
Option C. 2% of the customers who purchased a computer also bought the software
Option D. 2% of the customers purchased a computer
Answer : A
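Note: the two measures in question 48 can be reproduced on a small data set. Support is the fraction of all transactions that contain both sides of the rule; confidence is the fraction of transactions containing the left side that also contain the right side. The sketch below assumes 150 toy transactions, invented so that the numbers match the rule in the question; the function names are hypothetical.

```python
def rule_support(transactions, antecedent, consequent):
    """Fraction of all transactions containing antecedent and consequent."""
    both = antecedent | consequent
    return sum(1 for t in transactions if both <= t) / len(transactions)

def rule_confidence(transactions, antecedent, consequent):
    """Among transactions containing the antecedent, the fraction
    that also contain the consequent."""
    both = antecedent | consequent
    count_antecedent = sum(1 for t in transactions if antecedent <= t)
    count_both = sum(1 for t in transactions if both <= t)
    return count_both / count_antecedent

# Toy data: 150 transactions, invented for illustration.
transactions = (
    [{"computer", "antivirus software"}] * 3
    + [{"computer"}] * 2
    + [{"printer"}] * 145
)
sup = rule_support(transactions, {"computer"}, {"antivirus software"})
conf = rule_confidence(transactions, {"computer"}, {"antivirus software"})
print(f"support = {sup:.0%}, confidence = {conf:.0%}")
```

Three of the 150 transactions contain both items (support 2%), and three of the five computer purchases include the software (confidence 60%).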
49. ____________ integrates data mining and data visualization to discover implicit and useful knowledge from
large data sets.
Option A. Audio data mining
Option B. Video data mining
Option C. Text data mining
Option D. Visual data mining
Answer : D
52. ____________ is the process of extracting useful information (e.g., user click streams) from server logs.
Option A. Audio mining
Option B. Data mining
Option C. Web usage mining
Option D. Text mining
Answer : C
53. The data mining algorithm used by Google Search to rank web pages in their search engine results, is
Option A. K-means Algorithm
Option B. PageRank Algorithm
Option C. Naive Bayes Algorithm
Option D. Adaboost Algorithm
Answer : B
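Note: the idea behind PageRank can be sketched with a simple power iteration over a link graph. This is a textbook-style simplification, not Google's production algorithm. The damping factor of 0.85, the iteration count, and the four-page graph are assumptions made for illustration.

```python
def pagerank(links, damping=0.85, iterations=50):
    """links maps each page to the list of pages it links to.
    Every linked page must also appear as a key."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page receives a small base share (random jump).
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page, outgoing in links.items():
            if outgoing:
                # A page splits its rank evenly among its out-links.
                share = damping * rank[page] / len(outgoing)
                for target in outgoing:
                    new_rank[target] += share
            else:
                # A page without out-links spreads its rank over all pages.
                for target in pages:
                    new_rank[target] += damping * rank[page] / n
        rank = new_rank
    return rank

# Hypothetical four-page web graph.
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
ranks = pagerank(links)
best = max(ranks, key=ranks.get)
print(best, {p: round(r, 3) for p, r in ranks.items()})
```

Page C ranks highest because three pages link to it, while page D ranks lowest because nothing links to it.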
59. Both the Apriori and FP-growth methods mine frequent patterns from a set of transactions in TID-itemset
format (that is, {TID : itemset}), where TID is a transaction-id and itemset is the set of items bought in
transaction TID.
Option A. True
Option B. False
Answer : A
63. Rules that satisfy both a minimum support threshold (min_sup) and a minimum confidence threshold (min_conf)
are called ______ rules
Option A. strong
Option B. weak
Option C. primary
Option D. none of the above
Answer : A
64. ___________ uses audio signals to indicate the patterns of data or the features of data mining results.
Option A. Audio data mining
Option B. Visual data mining
Option C. Web mining
Option D. Text Data mining
Answer : A
65. A user session is a ___ record spanning the entire Web.
Option A. Log
Option B. Clickstream
Option C. Web log
Option D. None of these
Answer : B
67. Which of the following fields come under the umbrella of big data?
Option A. Black Box Data
Option B. Power Grid Data
Option C. Search Engine Data
Option D. All of the above
Answer : D
69. In ______ , the number of transactions scanned in future iterations are reduced.
Option A. Transaction reduction
Option B. Partitioning
Option C. Sampling
Option D. None of these
Answer : A
73. ……………… is a subject oriented, integrated, time-variant, non volatile collection of data in support of
management decisions.
Answer: B
74. Adding candidate itemsets at different points during a scan is known as _________
Option A. Dynamic itemset counting
Option B. Partitioning
Option C. Dynamic itemset partitioning
Option D. none of the above
Answer : A
75. __________ include text categorization, text clustering, concept /entity extraction, production of granular
taxonomies, sentiment analysis, document summarization, and entity-relation modeling.
Option A. Data mining tasks
Option B. Text mining tasks
Option C. Web mining tasks
Option D. Video mining tasks
Answer : B
77. ..................... is an essential process where intelligent methods are applied to extract data
patterns.
Answer: B
78. Sequential pattern mining searches for frequent subsequences in a sequence data set, where a sequence
records an ordering of events.
Option A. True
Option B. False
Answer : A
79. A graph g is closed if there exists no proper ___________ g’ that carries
the same support count as g.
Option A. sub graph
Option B. no graph
Option C. super graph
Option D. none of these
Answer : C
80. 90% of the world's total data has been created just within the past two years. This statement is true or false?
Option A. True
Option B. False
Answer : A
81. According to analysts, for what can traditional IT systems provide a foundation when they’re integrated with
big data technologies like Hadoop?
Option A. Big data management and data mining
Option B. Data warehousing and business intelligence
Option C. Management of Hadoop clusters
Option D. Collecting and storing unstructured data
Answer : A
86. Concerning the Forms of Big Data, which one of these is odd?
Option A. Processed
Option B. Semi-structured
Option C. Structured
Option D. Unstructured
Answer : A
95. ________ often also uses WordNet, Semantic Web, Wikipedia, and other information sources to enhance
the understanding and mining of text data.
Option A. Text mining
Option B. Data mining
Option C. Web mining
Option D. None of these
Answer : A
97. Which task takes the output from a map as an input and combines those data tuples into a smaller set of tuples?
Option A. Map
Option B. Reduce
Option C. Node
Option D. None of these
Answer : B
99. ____ is the ratio of the measure of an item when compared with that of its parent, its child, or its sibling in
frequent pattern analysis.
Option A. gradient
Option B. association
Option C. support
Option D. None of these
Answer : A
100. _______________ takes a set of data and converts it into another set of data, where individual elements are
broken down into tuples
Option A. Map
Option B. Reduce
Option C. Node
Option D. none
Answer : A
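Note: the Map and Reduce tasks described in questions 97 and 100 can be imitated in a single process. The sketch below is a word count in plain Python, not Hadoop code. The function names map_phase, shuffle and reduce_phase and the two input lines are assumptions made for illustration.

```python
from collections import defaultdict

def map_phase(line):
    """Map: break one input record into (key, value) tuples."""
    return [(word.lower(), 1) for word in line.split()]

def shuffle(pairs):
    """Group all values that share a key (done by the framework in Hadoop)."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(key, values):
    """Reduce: combine the tuples for one key into a smaller result."""
    return key, sum(values)

# Toy input, invented for illustration.
lines = ["big data big insight", "data mining"]
mapped = [pair for line in lines for pair in map_phase(line)]
counts = dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())
print(counts)
```

The map step emits six tuples, and the reduce step combines them into four (word, count) pairs.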
102. ___________ analyzes web content such as text, multimedia data, and structured data (within web pages
or linked across web pages).
Option A. Web content mining
Option B. web mining
Option C. Web usage mining
Option D. Web structure mining
Answer : A
103. The feature of big data that refers to the quality of the stored data is ______
Option A. Variety
Option B. Volume
Option C. Variability
Option D. Veracity
Answer : D
104. Data can also be presented in item-TID_set format (that is, {item : TID_set}), where item is an item name,
and TID_set is the set of transaction identifiers containing the item. This format is known as __________
Option A. vertical data format.
Option B. horizontal data format
Option C. Parallel data format
Option D. none of the above
Answer : A
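Note: the difference between the horizontal ({TID : itemset}) and vertical ({item : TID_set}) formats is easy to see in code. In the vertical format, the support count of an itemset is the size of the intersection of its items' TID sets. The sketch below assumes three toy transactions; the function name to_vertical is hypothetical.

```python
from collections import defaultdict

def to_vertical(horizontal):
    """Convert {TID: itemset} into {item: TID_set}."""
    vertical = defaultdict(set)
    for tid, items in horizontal.items():
        for item in items:
            vertical[item].add(tid)
    return dict(vertical)

# Toy horizontal data, invented for illustration.
horizontal = {
    "T1": {"a", "b"},
    "T2": {"b", "c"},
    "T3": {"a", "b", "c"},
}
vertical = to_vertical(horizontal)

# Support count of {a, b} = number of TIDs shared by both items.
support_ab = len(vertical["a"] & vertical["b"])
print(vertical, support_ab)
```

No further database scan is needed to count {a, b}: intersecting the two TID sets gives T1 and T3.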
109. ____________ is the process of using graph and network mining theory and methods to analyze the nodes
and connection structures on the Web.
Option A. web mining
Option B. Web structure mining
Option C. Web usage mining
Option D. Text mining
Answer : B
119. A telecommunication company wants to segment its customers into distinct groups in order to send
appropriate subscription offers. This is an example of
A. Supervised learning
B. Data extraction
C. Serration
D. Unsupervised learning
Answer: D
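Note: customer segmentation as in question 119 is commonly done with K-Means, which needs no labels. The following is a minimal one-dimensional sketch, assuming toy monthly-spend values and hand-picked starting centroids, all invented for illustration. It stops when the means no longer change between iterations.

```python
def kmeans_1d(values, centroids, max_iter=100):
    """Plain K-Means on single numbers; K is len(centroids)."""
    clusters = [[] for _ in centroids]
    for _ in range(max_iter):
        # Assignment step: put each value in the cluster of its nearest mean.
        clusters = [[] for _ in centroids]
        for v in values:
            nearest = min(range(len(centroids)),
                          key=lambda i: abs(v - centroids[i]))
            clusters[nearest].append(v)
        # Update step: recompute each mean (keep the old one if a cluster is empty).
        new_centroids = [sum(c) / len(c) if c else centroids[i]
                         for i, c in enumerate(clusters)]
        if new_centroids == centroids:
            break  # means unchanged since the previous iteration
        centroids = new_centroids
    return centroids, clusters

# Toy monthly spend per customer, invented for illustration.
spend = [10, 12, 11, 50, 52, 51]
centroids, clusters = kmeans_1d(spend, centroids=[10, 50])
print(centroids, clusters)
```

The customers fall into a low-spend group and a high-spend group without any labelled examples.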
121. You are given data about seismic activity in Japan, and you want to predict the magnitude of the next
earthquake. This is an example of
A. Supervised learning
B. Unsupervised learning
C. Serration
D. Dimensionality reduction
Answer: A
122. Assume you want to perform supervised learning to predict the number of newborns from the size of the stork
population (http://www.brixtonhealth.com/storksBabies.pdf). This is an example of
A. Classification
B. Regression
C. Clustering
Answer: B
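Note: question 122 is regression because the target (number of newborns) is a continuous number rather than a class label. The sketch below fits a straight line by ordinary least squares. The data points are toy values invented for illustration and are not taken from the cited paper; the function name fit_line is hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Toy values: stork pairs (x) and births (y), invented for illustration.
storks = [1, 2, 3, 4]
births = [3, 5, 7, 9]
slope, intercept = fit_line(storks, births)
predicted = slope * 5 + intercept  # prediction for an unseen x value
print(slope, intercept, predicted)
```

A good fit only shows correlation in the data; it does not show that storks cause births.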