CS490D Spring 2004 Final Solutions, May 3, 2004

Prof. Chris Clifton

Time will be tight. If you spend more than the recommended time on any question, go on to the next
one. If you can’t answer it in the recommended time, you are either going into too much detail or the
question is material you don’t know well. You can skip one or two parts and still demonstrate what I believe
to be an A-level understanding of the material.
The answers I’ve given are one possible answer - not necessarily the only one, or even the best. The exam is
out of 35 points. My rough feeling (subject to change) is that I’d expect an A student to get at least 31, a B
student to get at least 25. I’m not concerned about anyone not demonstrating at least a C level knowledge of
the material, so I won’t even think about the C/D cutoff.
Note: It is okay to use abbreviations in your answers, as long as the abbreviations are unambiguous and
reasonably obvious.
During the exam, you will make use of the following table, a hypothetical customer relationship
management (CRM) database for a company that makes computer parts.
Name             Publicly  Market Cap.  Employees  Sales      Customer  ...  Year  Units    Profit  ...
                 Held?     ($Mil)                  Channel    since                Sold
Dell             Y         75000        37200      Direct     1997           2002  120000   $1.2M
Dell             Y         88000        39100      Direct     1997           2003  109000   1.1M
Gateway          Y         2300         11000      Direct     1995           2001  60000    .9M
Gateway          Y         1400         11500      Direct     1995           2002  70000    1.0M
Gateway          Y         1700         9600       Retailers  1995           2003  65000    1.9M
Compaq           Y         10000        50000      Retailers  1993           2002  30000    1.4M
Hewlett-Packard  Y         35000        95000      Retailers  1994           2002  80000    2.2M
Hewlett-Packard  Y         60000        141000     Retailers  1994           2003  100000   2.5M
MA Micro         N         ?            80         Direct     1995           2003  400      3500
. . .

This contains information on the companies we supply our products to, both general information about
them (such as their market capitalization - the total value of their outstanding stock, the way they sell their
products); and information about our sales to them and the profit we get from those sales. Our company’s
marketing department wants to use this information to better target their campaigns, thus increasing the
sales of our company’s products and the profit earned. The table you are shown is not complete - you
can assume there are a lot more attributes both describing the company, and describing our sales to the
company (represented by . . . in the table.) The information shown above will be sufficient for you to answer
the questions on the exam.

1 Choosing the right data mining techniques for the job (8 minutes, 4 points)
The people in marketing would like a better understanding of their different customers. They want to know
what distinguishes customers – what are the key attributes that make a customer unique? The idea isn’t to
group similar customers, but to identify the attributes that set customers apart.
What data mining technique would you use to answer this question? Include a sentence or two justifying
your answer. Also give a couple of sentences describing how you would relate the raw data mining results
back to the question asked by marketing.
Principal Component Analysis would be an appropriate technique, after dropping such “uninteresting”
attributes as company name. The attributes that contribute most to the most important component vectors
would be the information to provide back to marketing.
An alternative would be to use a decision tree, regression tree, or other such “transparent” classifier to predict
the company names. I’d first bin continuous values into a small number of bins, build the decision tree, then see
what the top few nodes are. Also of interest would be nodes that seem to have a relatively even split between
the number of entities at that point in the tree. Note that there may be multiple entries with a single company
name, which helps to make this approach interesting.
Scoring: one point for a method, 1-2 points for a reasonable description of why it is appropriate, 1-2
points for how you would interpret the results.
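As a rough illustration of the PCA answer, the sketch below standardizes four numeric attributes from the eight complete rows of the table and estimates the first principal component. It rests on several assumptions that are not part of the exam: MA Micro is dropped because its market cap is unknown, profit figures such as 1.2M are read as 1,200,000, and power iteration is used as a simple stand-in for a full eigendecomposition. All function and variable names are hypothetical.

```python
import math

# Numeric attributes from the eight complete rows of the exam table.
# Assumption: MA Micro is left out because its market cap is unknown.
ATTRS = ["market_cap", "employees", "units_sold", "profit"]
ROWS = [
    [75000, 37200, 120000, 1.2e6],
    [88000, 39100, 109000, 1.1e6],
    [2300, 11000, 60000, 0.9e6],
    [1400, 11500, 70000, 1.0e6],
    [1700, 9600, 65000, 1.9e6],
    [10000, 50000, 30000, 1.4e6],
    [35000, 95000, 80000, 2.2e6],
    [60000, 141000, 100000, 2.5e6],
]


def standardize(rows):
    """Rescale every column to zero mean and unit (population) variance."""
    cols = list(zip(*rows))
    means = [sum(c) / len(c) for c in cols]
    stds = [math.sqrt(sum((x - m) ** 2 for x in c) / len(c))
            for c, m in zip(cols, means)]
    return [[(x - m) / s for x, m, s in zip(r, means, stds)] for r in rows]


def correlation_matrix(z):
    """Correlation matrix of already standardized rows."""
    n, d = len(z), len(z[0])
    return [[sum(r[i] * r[j] for r in z) / n for j in range(d)]
            for i in range(d)]


def first_component(matrix, iterations=500):
    """Estimate the leading eigenvector and eigenvalue by power iteration."""
    d = len(matrix)
    v = [1.0 / math.sqrt(d)] * d
    for _ in range(iterations):
        w = [sum(matrix[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    eigenvalue = sum(v[i] * sum(matrix[i][j] * v[j] for j in range(d))
                     for i in range(d))
    return v, eigenvalue


corr = correlation_matrix(standardize(ROWS))
loadings, variance = first_component(corr)

# Attributes with the largest absolute loading contribute most to the component.
ranked = sorted(zip(ATTRS, loadings), key=lambda p: abs(p[1]), reverse=True)
for name, loading in ranked:
    print(f"{name:12s} {loading:+.3f}")
print(f"share of variance explained: {variance / len(ATTRS):.2f}")
```

The ranked loadings are the kind of result that would be translated back for marketing: the attributes at the top of the list account for most of the variation among customers.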

2 Types of clustering (9 minutes, 6 points)

Give an advantage / strong point of each of the following types of clustering. (Your answers can be general
- they don’t need to be specific to the CRM database above.)
• K-means clustering
Gives a “prototype” description of the cluster, the entity that would be constructed by taking the cluster
mean.
• Hierarchical clustering
The number of clusters can be based on various parameters, such as an intra-cluster distance threshold or a
desired cluster count.
Gives a measure of which clusters are close to each other, which are far apart.
• Density-based clustering
Handles odd-shaped clusters: Items can be distant from each other and still be placed in the same cluster,
if appropriate.
Scoring: 1 point each for showing evidence that you know what the method is, 1 for a good advantage it
has over other methods.

3 Clustering for CRM

Marketing is going to run three independent advertising campaigns, each addressing a different segment of
current or potential customers. They want you to cluster the companies to help to build these advertising
campaigns.

3.1 Choice of Technique (8 minutes, 3 points)

What clustering technique/algorithm would you use if your goal was to describe key characteristics of each
cluster? Give a brief reason why.
I’d use k-medoid clustering, with k = 3. This approach handles continuous and discrete attributes (provided
I define a distance function), and it would be easy for marketing to understand when I said “here is a typical
company for this cluster”.
Scoring: One for a valid clustering technique, 1-2 for good reasons why.
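A minimal sketch of the k-medoid idea on the nine rows shown in the table follows. The distance function is an assumption made for illustration (range-normalized differences on three numeric attributes plus a 0/1 mismatch on sales channel), and because the sample is tiny the medoids are found by exhaustive search rather than by an iterative algorithm such as PAM. All names are hypothetical.

```python
from itertools import combinations

# (name, year, employees, units_sold, profit, sales_channel) from the exam table.
COMPANIES = [
    ("Dell", 2002, 37200, 120000, 1200000, "Direct"),
    ("Dell", 2003, 39100, 109000, 1100000, "Direct"),
    ("Gateway", 2001, 11000, 60000, 900000, "Direct"),
    ("Gateway", 2002, 11500, 70000, 1000000, "Direct"),
    ("Gateway", 2003, 9600, 65000, 1900000, "Retailers"),
    ("Compaq", 2002, 50000, 30000, 1400000, "Retailers"),
    ("Hewlett-Packard", 2002, 95000, 80000, 2200000, "Retailers"),
    ("Hewlett-Packard", 2003, 141000, 100000, 2500000, "Retailers"),
    ("MA Micro", 2003, 80, 400, 3500, "Direct"),
]
NUMERIC = (2, 3, 4)  # tuple positions of employees, units_sold, profit
CHANNEL = 5          # tuple position of the categorical sales channel


def spans(rows):
    """Range of each numeric attribute, used to put them on a common scale."""
    return {i: max(r[i] for r in rows) - min(r[i] for r in rows) for i in NUMERIC}


def distance(a, b, span):
    """Assumed mixed distance: scaled numeric gaps plus channel mismatch."""
    total = sum(abs(a[i] - b[i]) / span[i] for i in NUMERIC)
    total += 0.0 if a[CHANNEL] == b[CHANNEL] else 1.0
    return total / (len(NUMERIC) + 1)


def k_medoids(rows, k):
    """Try every set of k rows as medoids and keep the cheapest assignment."""
    span = spans(rows)

    def cost(medoids):
        return sum(min(distance(r, m, span) for m in medoids) for r in rows)

    best = min(combinations(rows, k), key=cost)
    clusters = {m: [] for m in best}
    for r in rows:
        nearest = min(best, key=lambda m: distance(r, m, span))
        clusters[nearest].append(r)
    return clusters


clusters = k_medoids(COMPANIES, 3)
for medoid, members in clusters.items():
    print("typical company:", medoid[0], medoid[1])
    print("  members:", [(m[0], m[1]) for m in members])
```

Each medoid is an actual row of the table, which is what makes the "here is a typical company" explanation possible. Exhaustive search only works for very small samples; a real data set would need an iterative method.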

3.2 K-Means (15 minutes, 3 points)

Assume you were to cluster the companies using k-means, with k = 3. For just the data you are provided,
describe ONE cluster, i.e., list the companies that are in the cluster and calculate the means for that cluster.
One cluster consists of MA Micro. Since it is a cluster of a single company (quite distant from any others),
the mean would be that entity itself. It isn’t meaningful to talk about “k-means” of categorical attributes, but
for the numeric attributes the mean would be 80 employees, customer since 1995, year 2003, 400 units sold, $3500 profit.
Scoring: One for giving a reasonable set of companies for a cluster, one for showing evidence you know
how to calculate the mean, one for getting a correct mean.

3.3 Is this the right question? (15 minutes, 2 points)
The goal is to maximize revenue. By using clustering of existing customers to develop the advertising
campaign, the marketing department may miss something important. Describe something they might miss
given the data you have and the data mining technique you suggested, and what else you would need so they
wouldn’t miss it.
This approach will only help us define campaigns for customers similar to those we already have. It wouldn’t
help to identify new market segments. I’d want to include information on potential customers that we don’t
currently sell to (e.g., our competitors’ customers.)
Scoring: One for something reasonable that it wouldn’t provide, one for what you’d do about it.

4 Data Preparation
In 2002, Compaq and Hewlett-Packard merged (i.e., Hewlett-Packard bought Compaq.) This distorts the
data – you’ll notice that sales to Hewlett-Packard jumped significantly in 2003, not surprisingly to roughly
the combined level of the two companies in 2001. You could handle this in various ways: Combine the
historical records of the two companies, try to split them out in more recent data, or just ignore the change.

4.1 Ignoring the change (8 minutes, 2 points)

Give a data mining problem/scenario/technique where it would be appropriate to just take the data as is –
where the merger wouldn’t make a difference.
Building a model to predict profit on a new customer from public information. Since this is independent of
history, treating Compaq as an independent company would be fine.
Scoring: One for a reasonable example, one for explanation.

4.2 Merge the history (8 minutes, 2 points)

Give a data mining problem/scenario/technique where it would be appropriate to combine the companies’
data prior to the merger, e.g., add the sales of Compaq to Hewlett-Packard and eliminate the old records for
Compaq.
Predicting next year’s profit for existing customers. Without combining past HP/Compaq data, there would
be no background on the combined company to use for effective prediction.
Scoring: One for a reasonable example, one for explanation.
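One way to combine the histories, sketched under assumptions: only the sales-side fields (units sold and profit) are summed per year, using the three Compaq and Hewlett-Packard rows shown in the table. Company-level attributes such as employees or market cap are left out, since whether they should be added is a separate modelling decision. The names `merge_history` and `ACQUIRED_BY` are hypothetical.

```python
from collections import defaultdict

# (name, year, units_sold, profit) for the rows affected by the merger.
SALES = [
    ("Compaq", 2002, 30000, 1400000),
    ("Hewlett-Packard", 2002, 80000, 2200000),
    ("Hewlett-Packard", 2003, 100000, 2500000),
]

# Maps an acquired company to the company whose history absorbs it.
ACQUIRED_BY = {"Compaq": "Hewlett-Packard"}


def merge_history(records, acquired_by):
    """Sum units and profit per (surviving company, year)."""
    totals = defaultdict(lambda: [0, 0])
    for name, year, units, profit in records:
        key = (acquired_by.get(name, name), year)
        totals[key][0] += units
        totals[key][1] += profit
    return {key: tuple(value) for key, value in sorted(totals.items())}


merged = merge_history(SALES, ACQUIRED_BY)
for (name, year), (units, profit) in merged.items():
    print(name, year, units, profit)
```

After the merge, the 2002 record for Hewlett-Packard carries 110000 units and $3.6M profit, and no separate Compaq records remain.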

5 Data Mining Process / Cost

The company has said “we have the data in a data warehouse, all you need to do is install a data mining
tool and run it. Shouldn’t take you more than a couple of days.”

5.1 True or False? (0.1 minute, 1 point)

False

5.2 Reasoning (8 minutes, 4 points)

How would you justify your answer to 5.1? Remember, you are out to convince the company - not me.
Give a description of the resources you would need and the reasoning you would use to estimate the time/cost
required. Also suggest what resources or arguments you might use to convince the company you are correct.
You don’t need to do an estimate, just say briefly how you would go about it.

There are many tasks that need to be done beyond gathering the data in a data warehouse. These are a
significant part of the effort, generally much greater than installing and running a tool. For documented evidence,
see the CRoss-Industry Standard Process for Data Mining (CRISP-DM). Some of the tasks that need to be done
are:
Business Understanding: What are the questions to be answered? Access to domain experts will be needed.
Data Analysis / preparation: The data may be in the warehouse, but is it clean? What about missing values?
How about discretizing / smoothing the data so the results are meaningful? At the very least, I’d want statistics
on the correctness and completeness of the data so I could estimate the time required to address these issues.
Result analysis: Relating data mining results back to the original business questions requires considerable
effort, and collaboration between data mining and business domain experts.
Scoring: one point for each task, one for describing what you would need to either estimate or accomplish
it. One point for giving some pointer to “expert sources” to back you up.
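The completeness statistics mentioned under data preparation could start from something as simple as the sketch below, which reports the fraction of known values per attribute for the nine rows in the table. The unknown market cap for MA Micro (shown as ? in the table) is represented as None. The column names and the `completeness` function are assumptions made for illustration.

```python
COLUMNS = ["name", "market_cap_mil", "employees", "units_sold"]

# Values from the exam table; None marks the unknown market cap of MA Micro.
RAW = [
    ("Dell", 75000, 37200, 120000),
    ("Dell", 88000, 39100, 109000),
    ("Gateway", 2300, 11000, 60000),
    ("Gateway", 1400, 11500, 70000),
    ("Gateway", 1700, 9600, 65000),
    ("Compaq", 10000, 50000, 30000),
    ("Hewlett-Packard", 35000, 95000, 80000),
    ("Hewlett-Packard", 60000, 141000, 100000),
    ("MA Micro", None, 80, 400),
]
ROWS = [dict(zip(COLUMNS, values)) for values in RAW]


def completeness(rows):
    """Fraction of rows with a known value, per column."""
    return {
        column: sum(row[column] is not None for row in rows) / len(rows)
        for column in rows[0]
    }


report = completeness(ROWS)
for column, fraction in report.items():
    print(f"{column:15s} {fraction:.0%} complete")
```

A report like this gives a first, rough basis for estimating how much cleaning effort the warehouse data will need.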

6 Time Series / Sequential Associations (8 minutes, 4 points)

Demonstrate your knowledge of time series mining and sequential association mining by giving an example,
based on the CRM database, of something that would qualify as:
• Time series mining
Repeat buying patterns. For example, which companies can we rely on to keep us going in an economic
downturn?
• Sequential associations
Prediction of company-specific changes or trends, e.g., indicators that we may lose a customer.
Scoring: One each for demonstrating a knowledge of the difference between the two, one for a reasonable
example.

7 Text Mining
Text mining commonly uses the vector space, or “bag of words” model. To represent a set of documents in
a traditional “flat” format, each document is treated as a row. The words are the columns (attributes.) For
a given document, the value of an attribute is a weight, such as the number of times that word occurs in the
document.
Naïve Bayes has proven effective for text classification. However, Naïve Bayes has limitations.
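The “bag of words” layout described above can be sketched as follows: one row per document, one column per word, and a raw occurrence count as the weight. The three documents are made up for illustration, and the tokenizer (lower-casing and keeping runs of the letters a to z) is a simplifying assumption.

```python
import re
from collections import Counter

# Hypothetical documents, invented for this illustration.
DOCS = [
    "Naive Bayes is a simple classifier",
    "Bayes rule is named after Thomas Bayes",
    "A naive approach to clustering",
]


def tokenize(text):
    """Lower-case the text and keep runs of the letters a to z as words."""
    return re.findall(r"[a-z]+", text.lower())


def bag_of_words(docs):
    """Return the sorted vocabulary and one count vector per document."""
    counts = [Counter(tokenize(doc)) for doc in docs]
    vocabulary = sorted(set().union(*counts))
    matrix = [[count[word] for word in vocabulary] for count in counts]
    return vocabulary, matrix


vocabulary, matrix = bag_of_words(DOCS)
print(vocabulary)
for row in matrix:
    print(row)
```

Word order is discarded entirely in this representation, so only the counts are available to a classifier.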

7.1 Limitation Example (8 minutes, 2 points)

Give an example where Naïve Bayes would not be effective, but some other method would. Hint: Naïve
Bayes has a general limitation, or assumption about characteristics of the data, that applies – you can
describe this limitation as opposed to a specific example.
Naïve Bayes doesn’t capture correlation between items. For example, the words “Naïve” and “Bayes” appearing
together in a document are strongly indicative that the document is about data mining. However, either one
alone is likely to be about a lot of things other than data mining.
Scoring: One for demonstrating some understanding of Naïve Bayes, one for a clear discussion of what
it fails to capture.

7.2 Alternatives (8 minutes, 2 points)

Describe briefly how another classification method would overcome the limitation you described in Question
7.1.

Decision trees capture such correlation. For example, we could have “Naïve” as one node of a decision tree.
The “yes” branch could go on to “Bayes”. The yes from that would have “data mining” as the class. A no branch
from either node would not lead to data mining.
Scoring: One point for naming a method that handles your objection to Naïve Bayes, one for a solid
discussion of how it handles the problem.
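The contrast between the answers to 7.1 and 7.2 can be made concrete with a small constructed data set. The document counts below are invented for illustration: each word on its own is common in documents on other topics, while the pair is rare but always indicates data mining. The Naïve Bayes model is a minimal Bernoulli version with Laplace smoothing, and the tree is written by hand to follow the two nodes described above rather than learned from the data. All names are hypothetical.

```python
from collections import Counter

# Hypothetical training set: (has "naive", has "bayes") -> topic.
TRAINING = (
    [((1, 1), "data mining")] * 2
    + [((1, 0), "other")] * 10
    + [((0, 1), "other")] * 10
    + [((0, 0), "other")] * 2
)


def train_naive_bayes(examples):
    """Estimate class priors and Laplace-smoothed P(word present | class)."""
    class_counts = Counter(label for _, label in examples)
    present = {label: [0, 0] for label in class_counts}
    for features, label in examples:
        for i, value in enumerate(features):
            present[label][i] += value
    model = {}
    for label, n in class_counts.items():
        prior = n / len(examples)
        probs = [(present[label][i] + 1) / (n + 2) for i in range(2)]
        model[label] = (prior, probs)
    return model


def naive_bayes_predict(model, features):
    """Pick the class maximizing prior times per-word likelihoods."""
    def score(label):
        prior, probs = model[label]
        result = prior
        for p, value in zip(probs, features):
            result *= p if value else 1 - p
        return result
    return max(model, key=score)


def tree_predict(features):
    """Two-node tree: test "naive" first, then "bayes" on the yes branch."""
    has_naive, has_bayes = features
    if has_naive and has_bayes:
        return "data mining"
    return "other"


model = train_naive_bayes(TRAINING)
print("Naive Bayes on both words:", naive_bayes_predict(model, (1, 1)))
print("Decision tree on both words:", tree_predict((1, 1)))
```

With these counts, Naïve Bayes scores a document containing both words at roughly 0.047 for data mining and 0.193 for other, so it picks the wrong class. It treats each word as weak, independent evidence and cannot represent that the combination is decisive. The tree tests the words jointly and classifies every training example correctly.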
