1 Choosing the Right Data Mining Techniques for the Job (8 Minutes, 4 Points)
Time will be tight. If you spend more than the recommended time on any question, go on to the next
one. If you can’t answer it in the recommended time, you are either going into too much detail or the
question is material you don’t know well. You can skip one or two parts and still demonstrate what I believe
to be an A-level understanding of the material.
The answers I’ve given are one possible answer - not necessarily the only one, or even the best. The exam is
out of 35 points. My rough feeling (subject to change) is that I’d expect an A student to get at least 31, a B
student to get at least 25. I’m not concerned about anyone not demonstrating at least a C level knowledge of
the material, so I won’t even think about the C/D cutoff.
Note: It is okay to use abbreviations in your answers, as long as the abbreviations are unambiguous and
reasonably obvious.
During the exam, you will make use of the following table, a hypothetical customer relationship
management (CRM) database for a company that makes computer parts.
Name             Publicly  Market Cap.  Employees  Sales      Customer  ...  Year  Units    Profit  ...
                 Held?     ($Mil)                  Channel    since                Sold
Dell             Y         75000        37200      Direct     1997           2002  120000   $1.2M
Dell             Y         88000        39100      Direct     1997           2003  109000   1.1M
Gateway          Y         2300         11000      Direct     1995           2001  60000    .9M
Gateway          Y         1400         11500      Direct     1995           2002  70000    1.0M
Gateway          Y         1700         9600       Retailers  1995           2003  65000    1.9M
Compaq           Y         10000        50000      Retailers  1993           2002  30000    1.4M
Hewlett-Packard  Y         35000        95000      Retailers  1994           2002  80000    2.2M
Hewlett-Packard  Y         60000        141000     Retailers  1994           2003  100000   2.5M
MA Micro         N         ?            80         Direct     1995           2003  400      3500
. . .
This contains information on the companies we supply our products to, both general information about
them (such as their market capitalization - the total value of their outstanding stock, the way they sell their
products); and information about our sales to them and the profit we get from those sales. Our company’s
marketing department wants to use this information to better target their campaigns, thus increasing the
sales of our company’s products and the profit earned. The table you are shown is not complete - you
can assume there are a lot more attributes both describing the company, and describing our sales to the
company (represented by . . . in the table.) The information shown above will be sufficient for you to answer
the questions on the exam.
1 Choosing the right data mining techniques for the job (8 minutes, 4 points)
The people in marketing would like a better understanding of their different customers. They want to know
what distinguishes customers – what are the key attributes that make a customer unique? The idea isn’t to
group similar customers, but to identify the attributes that set customers apart.
What data mining technique would you use to answer this question? Include a sentence or two justifying
your answer. Also give a couple of sentences describing how you would relate the raw data mining results
back to the question asked by marketing.
Principal Component Analysis would be an appropriate technique, after dropping such “uninteresting” attributes
as company name. The attributes that contribute most to the first few principal components (the directions of
greatest variance in the data) would be the information to provide back to marketing.
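As a rough illustration of the PCA answer, here is a minimal sketch in plain Python. It assumes only three numeric attributes (market cap, employees, units sold) taken from the rows shown, leaves out MA Micro because its market cap is missing, and finds just the first principal component by power iteration. The attribute names and the choice of power iteration are my assumptions for the sketch, not part of the question.

```python
import math

# Assumed numeric view of the table rows shown: market cap ($Mil), employees, units sold.
# Company name is dropped; MA Micro is omitted because its market cap is "?".
ATTRS = ["market_cap", "employees", "units_sold"]
ROWS = [
    [75000, 37200, 120000],   # Dell 2002
    [88000, 39100, 109000],   # Dell 2003
    [2300, 11000, 60000],     # Gateway 2001
    [1400, 11500, 70000],     # Gateway 2002
    [1700, 9600, 65000],      # Gateway 2003
    [10000, 50000, 30000],    # Compaq 2002
    [35000, 95000, 80000],    # Hewlett-Packard 2002
    [60000, 141000, 100000],  # Hewlett-Packard 2003
]

def standardize(rows):
    """Scale each column to zero mean and unit (sample) variance."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    stds = [math.sqrt(sum((r[j] - means[j]) ** 2 for r in rows) / (n - 1))
            for j in range(d)]
    return [[(r[j] - means[j]) / stds[j] for j in range(d)] for r in rows]

def covariance(z):
    """Sample covariance matrix of already-centered rows."""
    n, d = len(z), len(z[0])
    return [[sum(r[i] * r[j] for r in z) / (n - 1) for j in range(d)]
            for i in range(d)]

def first_component(cov, iters=500):
    """Power iteration toward the eigenvector with the largest eigenvalue."""
    d = len(cov)
    v = [1.0 / math.sqrt(d)] * d
    for _ in range(iters):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    # Rayleigh quotient: the variance captured along v.
    eigenvalue = sum(v[i] * sum(cov[i][j] * v[j] for j in range(d))
                     for i in range(d))
    return v, eigenvalue

cov = covariance(standardize(ROWS))
pc1, var1 = first_component(cov)
explained = var1 / sum(cov[i][i] for i in range(len(cov)))
# Attributes ordered by the size of their loading on the first component.
ranked = sorted(zip(ATTRS, pc1), key=lambda p: abs(p[1]), reverse=True)
```

The `ranked` list is what I would take back to marketing: the attributes with the largest loadings are the ones along which customers differ most. With so few rows this is only illustrative.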
An alternative would be to use a decision tree, regression tree, or other such “transparent” classifier to predict
the company names. I’d first bin continuous values into a small number of bins, build the decision tree, then see
what the top few nodes are. Also of interest would be nodes that seem to have a relatively even split between
the number of entities at that point in the tree. Note that there may be multiple entries with a single company
name, which helps to make this approach interesting.
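The decision tree alternative can be sketched the same way. The code below bins two continuous attributes and computes the information gain of each attribute for predicting the company name, which is the criterion an ID3-style tree would use to pick its top node. The cut points and the attribute names are hypothetical, chosen only to make the example concrete.

```python
import math
from collections import Counter, defaultdict

# (name, market cap $Mil, employees, sales channel) from the rows shown in the table.
RECORDS = [
    ("Dell", 75000, 37200, "Direct"),
    ("Dell", 88000, 39100, "Direct"),
    ("Gateway", 2300, 11000, "Direct"),
    ("Gateway", 1400, 11500, "Direct"),
    ("Gateway", 1700, 9600, "Retailers"),
    ("Compaq", 10000, 50000, "Retailers"),
    ("Hewlett-Packard", 35000, 95000, "Retailers"),
    ("Hewlett-Packard", 60000, 141000, "Retailers"),
]

def bin_value(x, cuts):
    """Map a continuous value to a bin index given ascending cut points."""
    return sum(x >= c for c in cuts)

def entropy(labels):
    """Shannon entropy (bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def info_gain(rows, attr):
    """Entropy reduction in the company name from splitting on one attribute."""
    groups = defaultdict(list)
    for r in rows:
        groups[r[attr]].append(r["name"])
    remainder = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return entropy([r["name"] for r in rows]) - remainder

rows = [
    {"name": name,
     "cap_bin": bin_value(cap, [5000, 50000]),    # hypothetical cut points
     "emp_bin": bin_value(emp, [20000, 60000]),   # hypothetical cut points
     "channel": channel}
    for name, cap, emp, channel in RECORDS
]
gains = {a: info_gain(rows, a) for a in ("cap_bin", "emp_bin", "channel")}
root = max(gains, key=gains.get)  # attribute a greedy tree would test first
```

The attribute chosen as `root`, and the next few by gain, are the candidates for "what sets customers apart". A real tree would repeat the same computation recursively within each branch.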
Scoring: one point for a method, 1-2 points for a reasonable description of why it is appropriate, 1-2
points for how you would interpret the results.
3.3 Is this the right question? (15 minutes, 2 points)
The goal is to maximize revenue. By using clustering of existing customers to develop the advertising
campaign, the marketing department may miss something important. Describe something they might miss
given the data you have and the data mining technique you suggested, and what else you would need so they
wouldn’t miss it.
This approach will only help us define campaigns for customers similar to those we already have. It wouldn’t
help to identify new market segments. I’d want to include information on potential customers that we don’t
currently sell to (e.g., our competitors’ customers.)
Scoring: One for something reasonable that it wouldn’t provide, one for what you’d do about it.
4 Data Preparation
In 2002, Compaq and Hewlett-Packard merged (i.e., Hewlett-Packard bought Compaq.) This distorts the
data – you’ll notice that sales to Hewlett-Packard jumped significantly in 2003, not surprisingly to roughly
the combined level of the two companies in 2001. You could handle this in various ways: Combine the
historical records of the two companies, try to split them out in more recent data, or just ignore the change.
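The first option, combining the historical records, might look like the following minimal sketch. It assumes units and profit can simply be summed per year and uses a hypothetical alias map; other attributes (employees, sales channel) would each need their own merge rule.

```python
from collections import defaultdict

# (name, year, units sold, profit in $M) for the two merged companies, from the table.
SALES = [
    ("Compaq", 2002, 30000, 1.4),
    ("Hewlett-Packard", 2002, 80000, 2.2),
    ("Hewlett-Packard", 2003, 100000, 2.5),
]

# Hypothetical alias map: each pre-merger name points at the surviving company.
MERGED_INTO = {"Compaq": "Hewlett-Packard"}

def combine_history(sales, aliases):
    """Sum units and profit per (canonical company name, year)."""
    totals = defaultdict(lambda: [0, 0.0])
    for name, year, units, profit in sales:
        key = (aliases.get(name, name), year)
        totals[key][0] += units
        totals[key][1] += profit
    return {k: (v[0], round(v[1], 2)) for k, v in totals.items()}

combined = combine_history(SALES, MERGED_INTO)
```

After combining, the 2002 and 2003 rows for Hewlett-Packard are directly comparable, so the merger no longer shows up as a jump in sales.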
There are many tasks that need to be done beyond gathering the data in a data warehouse. These are a
significant part of the effort, generally much greater than installing and running a tool. For documented evidence,
see the CRoss-Industry Standard Process for Data Mining (CRISP-DM). Some of the tasks that need to be done
are:
Business Understanding: What are the questions to be answered? Access to domain experts will be needed.
Data Analysis / preparation: The data may be in the warehouse, but is it clean? What about missing values?
How about discretizing / smoothing the data so the results are meaningful? At the very least, I’d want statistics
on the correctness and completeness of the data so I could estimate the time required to address these issues.
Result analysis: Relating data mining results back to the original business questions requires considerable
effort, and collaboration between data mining and business domain experts.
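For the data analysis / preparation task, a minimal sketch of the kind of statistics I would want: per-attribute completeness, plus a simple equal-width discretization. The rows are a subset of the table, with None standing in for the "?" shown; the function names and the choice of three bins are assumptions for illustration.

```python
# None stands in for the "?" shown in the table (MA Micro has no market cap).
ROWS = [
    {"name": "Dell", "market_cap": 75000, "employees": 37200},
    {"name": "Gateway", "market_cap": 2300, "employees": 11000},
    {"name": "Compaq", "market_cap": 10000, "employees": 50000},
    {"name": "Hewlett-Packard", "market_cap": 35000, "employees": 95000},
    {"name": "MA Micro", "market_cap": None, "employees": 80},
]

def completeness(rows):
    """Fraction of non-missing values for each attribute."""
    return {a: sum(r[a] is not None for r in rows) / len(rows)
            for a in rows[0].keys()}

def equal_width_bins(values, k):
    """Assign each value to one of k equal-width bins; missing values stay None."""
    present = [v for v in values if v is not None]
    lo, hi = min(present), max(present)
    width = (hi - lo) / k or 1.0  # avoid a zero width when all values are equal
    return [None if v is None else min(int((v - lo) / width), k - 1)
            for v in values]

stats = completeness(ROWS)
emp_bins = equal_width_bins([r["employees"] for r in ROWS], 3)
```

Completeness figures like these give a first estimate of how much cleaning effort the warehouse data will need before any mining tool is run.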
Scoring: one point for each task, one for describing what you would need to either estimate or accomplish
it. One point for giving some pointer to “expert sources” to back you up.
7 Text Mining
Text mining commonly uses the vector space, or “bag of words” model. To represent a set of documents in
a traditional “flat” format, each document is treated as a row. The words are the columns (attributes.) For
a given document, the value of an attribute is a weight, such as the number of times that word occurs in the
document.
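A minimal sketch of this representation, using raw word counts as the weights. The two example documents and the whitespace tokenizer are made up for illustration.

```python
from collections import Counter

def bag_of_words(docs):
    """Return (vocabulary, matrix); matrix[i][j] counts vocabulary[j] in docs[i]."""
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({w for toks in tokenized for w in toks})
    matrix = []
    for toks in tokenized:
        counts = Counter(toks)
        matrix.append([counts[w] for w in vocab])  # Counter gives 0 for absent words
    return vocab, matrix

# Made-up documents; each becomes one row, each distinct word one column.
docs = ["naive bayes for text mining", "decision trees for data mining data"]
vocab, matrix = bag_of_words(docs)
```

Word order is discarded: only the per-document counts survive, which is why this is called a "bag" of words.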
Naïve Bayes has proven effective for text classification. However, Naïve Bayes has limitations.
Decision trees capture such correlation. For example, we could have “Naïve” as one node of a decision tree.
The “yes” branch could go on to “Bayes”. The yes from that would have “data mining” as the class. A no branch
from either node would not lead to data mining.
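Written out as code, the two-node tree in this example is just a pair of nested tests. This is a sketch only: the “other” label for the no branches is my placeholder, and a real tree would be learned from data rather than written by hand.

```python
def classify(words):
    """Two-node tree: test for "naive", then for "bayes" on the yes branch."""
    if "naive" in words:          # root node
        if "bayes" in words:      # node on the "yes" branch
            return "data mining"
    return "other"                # any "no" branch (placeholder label)

label = classify({"naive", "bayes", "classifier"})
```

The class depends on both words appearing together. Neither word alone leads to “data mining”.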
Scoring: One point for naming a method that handles your objection to Naïve Bayes, one for a solid
discussion of how it handles the problem.