1.
Define data
mining. Why are there many different names and definitions for data mining?
Data mining is the process
through which previously unknown patterns in data are discovered. Another definition
would be “a process that uses statistical, mathematical, artificial
intelligence, and machine learning techniques to extract and identify useful
information and subsequent knowledge from large databases.” This includes most
types of automated data analysis. A third definition: Data mining is the
process of finding mathematical patterns from (usually) large sets of data;
these can be rules, affinities, correlations, trends, or prediction models.
Data mining has many definitions
because the term has been stretched beyond its original limits by some software
vendors, who label most forms of data analysis as data mining in order to
increase sales by riding its popularity.
2.
What recent factors
have increased the popularity of data mining?
Following are
some of the most pronounced reasons:
• More intense
competition at the global scale driven by customers’ ever-changing needs and
wants in an increasingly saturated marketplace.
• General
recognition of the untapped value hidden in large data sources.
• Consolidation
and integration of database records, which enables a single view of customers,
vendors, transactions, etc.
• Consolidation
of databases and other data repositories into a single location in the form of
a data warehouse.
• The
exponential increase in data processing and storage capabilities.
• Significant
reduction in the cost of hardware and software for data storage and processing.
• Movement
toward the de-massification (conversion of information resources into
nonphysical form) of business practices.
3.
Is data
mining a new discipline? Explain.
Although the term data
mining is relatively new, the ideas behind it are not. Many of the
techniques used in data mining have their roots in traditional statistical
analysis and artificial intelligence work done since the early
1980s. New or increased use of data
mining applications makes it seem like data mining is a new discipline.
In general, data mining seeks to
identify four major types of patterns: Associations, Predictions, Clusters and
Sequential relationships. These types of
patterns have been manually extracted from data by humans for centuries, but
the increasing volume of data in modern times has created a need for more
automatic approaches. As datasets have grown in size and complexity, direct
manual data analysis has increasingly been augmented with indirect, automatic
data processing tools that use sophisticated methodologies, methods, and
algorithms. The manifestation of such evolution of automated and semiautomated
means of processing large datasets is now commonly referred to as data mining.
4.
What are
some major data mining methods and algorithms?
Generally speaking, data mining
tasks can be classified into three main categories: prediction, association,
and clustering. Based on the way in which the patterns are extracted from the
historical data, the learning algorithms of data mining methods can be
classified as either supervised or unsupervised. With supervised learning
algorithms, the training data includes both the descriptive attributes (i.e.,
independent variables or decision variables) as well as the class attribute
(i.e., output variable or result variable). In contrast, with unsupervised
learning the training data includes only the descriptive attributes. A simple taxonomy organizes the data mining tasks
along with the learning methods and popular algorithms for each task.
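The supervised/unsupervised distinction above comes down to whether the training data carries a class attribute. A minimal sketch, with attribute names and values made up for illustration:

```python
# Supervised: each training record carries descriptive attributes
# (independent variables) AND a class attribute (output variable).
supervised_training = [
    ({"age": 25, "income": 30000}, "no_churn"),
    ({"age": 52, "income": 80000}, "churn"),
]

# Unsupervised: only the descriptive attributes are available.
unsupervised_training = [
    {"age": 25, "income": 30000},
    {"age": 52, "income": 80000},
]

# A supervised learner can check its output against the class attribute;
# an unsupervised learner can only look for structure among the attributes.
labels = [label for _, label in supervised_training]
```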
5.
What are the
key differences between the major data mining methods?
Prediction is the act of telling
about the future. It differs from simple guessing by taking into account the
experiences, opinions, and other relevant information in conducting the task of
foretelling. A term that is commonly associated with prediction is forecasting.
Even though many believe that these two terms are synonymous, there is a subtle
but critical difference between the two. Whereas prediction is largely
experience and opinion based, forecasting is data and model based. That is, in
order of increasing reliability, one might list the relevant terms as guessing,
predicting, and forecasting, respectively.
In data mining terminology, prediction and forecasting
are used synonymously, and the term prediction
is used as the common representation of the act.
Classification:
analyzing the historical behavior of groups of entities with similar
characteristics, to predict the future behavior of a new entity from its
similarity to those groups
Clustering:
finding groups of entities with similar characteristics
Association:
establishing relationships among items that occur together
Sequence
discovery: finding time-based associations
Visualization:
presenting results obtained through one or more of the other methods
Regression: a
statistical estimation technique based on fitting a curve defined by a
mathematical equation of known type but unknown parameters to existing data
Forecasting: estimating
a future data value based on past data values.
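The regression definition above (a curve of known type with unknown parameters, fitted to existing data) can be sketched with an ordinary least-squares line fit; the data values are made up so the fit comes out exact:

```python
# Fit y = a + b*x: the line is the "known type," a and b the unknown parameters.
def fit_line(xs, ys):
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    b = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / \
        sum((x - mean_x) ** 2 for x in xs)
    a = mean_y - b * mean_x
    return a, b

# Past data values (illustrative; they lie exactly on y = 1 + 2x).
xs = [1, 2, 3, 4]
ys = [3, 5, 7, 9]
a, b = fit_line(xs, ys)

# Forecasting then estimates a future data value from the fitted model.
forecast = a + b * 5
```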
6.
What are the
major application areas for data mining?
Applications are
listed near the beginning of this section: CRM, banking, retailing and
logistics, manufacturing and production, brokerage, insurance, computer
hardware and software, government, travel, healthcare, medicine, entertainment,
homeland security, and sports.
7. Identify at least five specific
applications of data mining and list five common characteristics of
these
applications.
This question
expands on the prior question by asking for common characteristics. Several
such applications are: CRM, banking, retailing and
logistics, manufacturing and production, brokerage, insurance, computer
hardware and software, government, travel, healthcare, medicine, entertainment,
homeland security, and sports. Common characteristics of these applications
include large amounts of historical data and a need for predictions and
forecasting to support planning and decision making.
8.
What do you
think is the most prominent application area for data mining? Why?
The answers will
differ depending on which of the applications (most likely banking, retailing
and logistics, manufacturing and production, government, healthcare, medicine,
or homeland security) students think is most in need of greater certainty. Their reasons for selection should relate to
the application area’s need for greater certainty and its ability to pay for
investments in data mining.
9. What are
the major data mining processes?
Similar to other information systems
initiatives, a data mining project must follow a systematic project management
process to be successful. Several
data mining processes have been proposed: CRISP-DM, SEMMA, and KDD.
10.
Why do you
think the early phases (understanding of the business and understanding of the
data) take the longest in data mining projects?
The early steps
are the most unstructured phases because they involve learning. Those phases
(learning/understanding) cannot be automated. Extra time and effort are needed upfront
because any mistake in understanding the business or data will most likely
result in a failed BI project.
11.
List and
briefly define the phases in the CRISP-DM process.
CRISP-DM provides a systematic and orderly
way to conduct data mining projects. The process has six phases. First,
an understanding of the business issues to be addressed and an understanding of
the data are developed concurrently (business understanding and data
understanding). Next, the data are prepared for modeling (data preparation),
models are built (modeling), model results are assessed (evaluation), and the
models are put into regular use (deployment).
12.
What are the
main data preprocessing steps? Briefly describe each step and provide relevant
examples.
Data
preprocessing is essential to any successful data mining study. Good data leads
to good information; good information leads to good decisions. Data
preprocessing includes four main steps:
data consolidation: access, collect, select and filter data
data cleaning: handle missing data, reduce noise, fix
errors
data transformation: normalize the data, aggregate data,
construct new attributes
data reduction: reduce number of attributes and records;
balance skewed data
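The cleaning and transformation steps above can be sketched on a tiny record set; the attribute names and values are hypothetical:

```python
records = [
    {"age": 25, "income": 30000},
    {"age": None, "income": 50000},   # missing value to clean
    {"age": 40, "income": 70000},
]

# Data cleaning: impute the missing age with the mean of the known values.
known = [r["age"] for r in records if r["age"] is not None]
mean_age = sum(known) / len(known)
for r in records:
    if r["age"] is None:
        r["age"] = mean_age

# Data transformation: min-max normalize income into the [0, 1] range.
lo = min(r["income"] for r in records)
hi = max(r["income"] for r in records)
for r in records:
    r["income_norm"] = (r["income"] - lo) / (hi - lo)
```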
13.
How does
CRISP-DM differ from SEMMA?
The main
difference between CRISP-DM and SEMMA is that CRISP-DM takes a more
comprehensive approach—including understanding of the business and the relevant
data—to data mining projects, whereas SEMMA implicitly assumes that the data
mining project’s goals and objectives along with the appropriate data sources
have been identified and understood.
14.
Identify at
least three of the main data mining methods.
Classification
learns patterns from past data (a set of information—traits, variables,
features—on characteristics of the previously labeled items, objects, or
events) in order to place new instances (with unknown labels) into their
respective groups or classes. The objective of classification is to analyze the
historical data stored in a database and automatically generate a model that
can predict future behavior.
Cluster analysis
is an exploratory data analysis tool for solving classification problems. The
objective is to sort cases (e.g., people, things, events) into groups, or
clusters, so that the degree of association is strong among members of the same
cluster and weak among members of different clusters.
Association rule
mining is a popular data mining method that is commonly used as an example to
explain what data mining is and what it can do to a technologically less savvy
audience. Association rule mining aims to find interesting relationships (affinities)
between variables (items) in large databases.
15.
Give
examples of situations in which classification would be an appropriate data
mining technique. Give examples of situations in which regression would be an
appropriate data mining technique.
The answers will
differ, but should be based on the following issues. Classification is for
prediction that can be based on historical data and relationships, such as
predicting the weather, product demand, or a student’s success in a university.
If what is being predicted is a class label (e.g., “sunny,” “rainy,” or
“cloudy”) the prediction problem is called a classification, whereas if it is a
numeric value (e.g., temperature such as 68°F), the prediction problem is
called a regression.
16.
List and
briefly define at least two classification techniques.
• Decision tree analysis. Decision tree analysis (a
machine-learning technique) is arguably the most popular classification
technique in the data mining arena.
• Statistical analysis. Statistical classification
techniques include logistic regression and discriminant analysis, both of which
make the assumptions that the relationships between the input and output
variables are linear in nature, the data is normally distributed, and the
variables are not correlated and are independent of each other.
• Case-based reasoning. This approach uses historical
cases to recognize commonalities in order to assign a new case into the most
probable category.
• Bayesian classifiers. This approach uses probability
theory to build classification models based on the past occurrences that are
capable of placing a new instance into a most probable class (or category).
• Genetic algorithms. The use of the analogy of natural
evolution to build directed search-based mechanisms to classify data samples.
• Rough sets. This method takes into account the partial
membership of class labels to predefined categories in building models
(collection of rules) for classification problems.
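The Bayesian-classifier idea above can be sketched with a minimal naive Bayes scorer. The weather-style training data is hypothetical, and the sketch omits the smoothing a real implementation would apply to zero counts:

```python
from collections import Counter

# Hypothetical training data: ((outlook, windy) -> play decision).
train = [
    (("sunny", "no"), "play"),
    (("sunny", "yes"), "no_play"),
    (("rainy", "no"), "play"),
    (("rainy", "yes"), "no_play"),
]

def classify(instance):
    # Score each class by prior P(class) times the per-attribute
    # likelihoods P(attribute value | class); return the most probable class.
    classes = Counter(label for _, label in train)
    best, best_score = None, -1.0
    for cls, count in classes.items():
        score = count / len(train)                 # prior
        for i, value in enumerate(instance):
            matches = sum(1 for feats, label in train
                          if label == cls and feats[i] == value)
            score *= matches / count               # likelihood
        if score > best_score:
            best, best_score = cls, score
    return best

result = classify(("sunny", "no"))
```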
17.
What are
some of the criteria for comparing and selecting the best classification
technique?
· The
amount and availability of historical data.
· The
types of data (categorical, interval, ratio, etc.).
· What
is being predicted: a class label or a numeric value.
· The
purpose or objective.
18.
Briefly
describe the general algorithm used in decision trees.
A general algorithm for building a decision tree is as
follows:
1. Create a root node and assign all of the training data
to it.
2. Select the best splitting attribute.
3. Add a branch to the root node for each value of the
split. Split the data into mutually exclusive (nonoverlapping) subsets along
the lines of the specific split and move to the branches.
4. Repeat steps 2 and 3 for each leaf node
until a stopping criterion is reached (e.g., the node is dominated by a single
class label).
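Step 2 of the algorithm (selecting the best splitting attribute) is commonly done by minimizing the weighted impurity of the resulting subsets. A minimal sketch using the Gini index, on a made-up dataset:

```python
def gini(labels):
    # Gini impurity: 1 minus the sum of squared class proportions.
    n = len(labels)
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def best_split(rows, labels, attributes):
    # Pick the attribute whose split yields the lowest weighted Gini impurity.
    def weighted_gini(attr):
        total = 0.0
        for value in set(row[attr] for row in rows):
            subset = [lab for row, lab in zip(rows, labels) if row[attr] == value]
            total += len(subset) / len(rows) * gini(subset)
        return total
    return min(attributes, key=weighted_gini)

# Hypothetical records: "windy" separates the classes perfectly,
# so its weighted Gini is 0 and it becomes the splitting attribute.
rows = [{"outlook": "sunny", "windy": "no"},
        {"outlook": "sunny", "windy": "yes"},
        {"outlook": "rainy", "windy": "no"},
        {"outlook": "rainy", "windy": "yes"}]
labels = ["play", "no_play", "play", "no_play"]
best = best_split(rows, labels, ["outlook", "windy"])
```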
19.
Define Gini
index. What does it measure?
The Gini index and information
gain (entropy) are two popular ways to determine branching choices in a
decision tree.
The Gini index measures the
purity of a sample. If everything in a sample belongs to one class, the Gini
index value is zero.
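A minimal sketch of the Gini index as described, computed as one minus the sum of squared class proportions:

```python
def gini(labels):
    # Pure sample (one class) -> 0; maximally mixed two-class sample -> 0.5.
    n = len(labels)
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

pure = gini(["a", "a", "a", "a"])   # everything in one class
mixed = gini(["a", "a", "b", "b"])  # 50/50 split between two classes
```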
20.
Give
examples of situations in which cluster analysis would be an appropriate data
mining technique.
Cluster algorithms are used when
the data records do not have predefined class identifiers (i.e., it is not
known to what class a particular record belongs).
21.
What is the
major difference between cluster analysis and classification?
Classification methods learn
from previous examples containing inputs and the resulting class labels, and
once properly trained they are able to classify future cases. Clustering
partitions pattern records into natural segments or clusters.
22.
What are
some of the methods for cluster analysis?
The most commonly used
clustering algorithms are k-means and self-organizing maps.
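A minimal one-dimensional k-means sketch with k = 2; the points and starting centroids are made up, and the sketch skips the empty-cluster handling a real implementation would need:

```python
def kmeans(points, centroids, iterations=10):
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[], []]
        for p in points:
            idx = min((0, 1), key=lambda i: abs(p - centroids[i]))
            clusters[idx].append(p)
        # Update step: move each centroid to the mean of its cluster.
        centroids = [sum(c) / len(c) for c in clusters]
    return centroids, clusters

# Two obvious natural segments around 1.0 and 9.0 (illustrative data).
points = [1.0, 1.2, 0.8, 9.0, 9.2, 8.8]
centroids, clusters = kmeans(points, [0.0, 5.0])
```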
23.
Give
examples of situations in which association would be an appropriate data mining
technique.
Association rule mining is
appropriate to use when the objective is to discover two or more items (or
events or concepts) that go together. Students’ answers will differ.
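The "items that go together" idea can be sketched by computing the support and confidence of a candidate rule over a few made-up market-basket transactions:

```python
transactions = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"milk"},
]

def support(items):
    # Fraction of transactions containing all of the given items.
    return sum(1 for t in transactions if items <= t) / len(transactions)

def confidence(antecedent, consequent):
    # Of the transactions with the antecedent, the fraction that
    # also contain the consequent.
    return support(antecedent | consequent) / support(antecedent)

# Candidate rule {bread} -> {milk}.
s = support({"bread", "milk"})
c = confidence({"bread"}, {"milk"})
```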
24. What are the most popular commercial data mining
tools?
Examples of these vendors include SPSS (PASW Modeler), SAS
(Enterprise Miner), StatSoft (Statistica Data Miner), Salford (CART, MARS,
TreeNet, RandomForest), Angoss (KnowledgeSTUDIO, KnowledgeSeeker), and
Megaputer (PolyAnalyst). Most of the more popular tools are developed by the
largest statistical software companies (SPSS, SAS, and StatSoft).
25. Why do you think the most popular tools are developed
by statistics companies?
Data mining techniques involve the use of statistical
analysis and modeling, so data mining tools are a natural extension of these
companies’ business offerings.
26. What are the most popular free data mining tools?
Probably the most popular free and open source data mining
tool is Weka. Others include RapidMiner and Microsoft’s SQL Server.
27. What are the main differences between commercial and
free data mining software tools?
The main difference between
commercial tools, such as Enterprise Miner, PASW, and Statistica, and free
tools, such as Weka and RapidMiner, is computational efficiency. The same data
mining task involving a rather large dataset may take considerably longer to
complete with the free software, and in some cases it may not even be feasible
(i.e., it may crash due to inefficient use of computer memory).
28. What would be your top five selection criteria for a
data mining tool? Explain.
The answers will differ.
Criteria they are likely to mention include cost, user-interface, ease-of-use,
computational efficiency, hardware compatibility, type of business problem,
vendor support, and vendor reputation.
29. What are some of the most common myths about data
mining?
Data mining
is not yet viable for business applications.
Data
mining requires a separate, dedicated database.
Only those
with advanced degrees can do data mining.
Data
mining is only for large firms that have lots of customer data.
30.
What do you
think are the reasons for these myths about data mining?
The answers will differ. Some answers might relate to
fear of analytics, fear of the unknown, or fear of looking dumb.
31.
What are the
most common data mining mistakes? How can they be minimized and/or eliminated?
1. Selecting
the wrong problem for data mining.
2. Ignoring
what your sponsor thinks data mining is and what it really can and cannot do.
3. Leaving
insufficient time for data preparation. It takes more effort than one often
expects.
4. Looking
only at aggregated results and not at individual records.
5. Being
sloppy about keeping track of the mining procedure and results.
6. Ignoring
suspicious findings and quickly moving on.
7. Running
mining algorithms repeatedly and blindly. It is important to think hard enough
about the next stage of data analysis. Data mining is a very hands-on activity.
8. Believing
everything you are told about data.
9. Believing
everything you are told about your own data mining analysis.
10. Measuring
your results differently from the way your sponsor measures them.
Ways to minimize these risks are basically the reverse of
these items.
32. Define data mining. Why are there many names and
definitions for data mining?
Data mining is the process
through which previously unknown patterns in data are discovered. Another
definition would be “a process that uses statistical, mathematical, artificial
intelligence, and machine learning techniques to extract and identify useful information
and subsequent knowledge from large databases.” This includes most types of
automated data analysis. A third definition: Data mining is the process of
finding mathematical patterns from (usually) large sets of data; these can be
rules, affinities, correlations, trends, or prediction models.
Data mining has many definitions
because the term has been stretched beyond its original limits by some software
vendors, who label most forms of data analysis as data mining in order to
increase sales by riding its popularity.
33. What are the main reasons for the recent popularity
of data mining?
Following are
some of the most pronounced reasons:
• More intense
competition at the global scale driven by customers’ ever-changing needs and
wants in an increasingly saturated marketplace.
• General
recognition of the untapped value hidden in large data sources.
• Consolidation
and integration of database records, which enables a single view of customers,
vendors, transactions, etc.
• Consolidation
of databases and other data repositories into a single location in the form of
a data warehouse.
• The
exponential increase in data processing and storage capabilities.
• Significant
reduction in the cost of hardware and software for data storage and processing.
• Movement
toward the de-massification (conversion of information resources into
nonphysical form) of business practices.
34. Discuss what an organization should consider before
making a decision to purchase data mining software.
Technically speaking, data mining is a process that uses
statistical, mathematical, and artificial intelligence techniques to extract
and identify useful information and subsequent knowledge (or patterns) from
large sets of data. Before making a decision to purchase data mining software,
organizations should consider the standard criteria for investing in
any major software: cost/benefit analysis, people with the expertise to use the
software and perform the analyses, availability of historical data, and a
business need for the data mining software.
35. Discuss the main data mining methods. What are the
fundamental differences among them?
Prediction is the act of telling
about the future. It differs from simple guessing by taking into account the
experiences, opinions, and other relevant information in conducting the task of
foretelling. A term that is commonly associated with prediction is forecasting.
Even though many believe that these two terms are synonymous, there is a subtle
but critical difference between the two. Whereas prediction is largely
experience and opinion based, forecasting is data and model based. That is, in
order of increasing reliability, one might list the relevant terms as guessing,
predicting, and forecasting, respectively.
In data mining terminology, prediction and forecasting
are used synonymously, and the term prediction
is used as the common representation of the act.
Classification:
analyzing the historical behavior of groups of entities with similar
characteristics, to predict the future behavior of a new entity from its
similarity to those groups
Clustering:
finding groups of entities with similar characteristics
Association:
establishing relationships among items that occur together
Sequence
discovery: finding time-based associations
Visualization:
presenting results obtained through one or more of the other methods
Regression: a
statistical estimation technique based on fitting a curve defined by a
mathematical equation of known type but unknown parameters to existing data
Forecasting: estimating
a future data value based on past data values.
36. What are the main data mining application areas?
Discuss the commonalities of these areas that make them a prospect for data
mining studies.
Applications are
listed near the beginning of this section: CRM, banking, retailing and
logistics, manufacturing and production, brokerage, insurance, computer
hardware and software, government, travel, healthcare, medicine, entertainment,
homeland security, and sports.
The
commonalities are the need for predictions and forecasting for planning
purposes and to support decision making.
37. Why do we need a standardized data mining process?
What are the most commonly used data mining processes?
In order to systematically carry
out data mining projects, a general process is usually followed. Similar to other information systems
initiatives, a data mining project must follow a systematic project management
process to be successful. Several
data mining processes have been proposed: CRISP-DM, SEMMA, and KDD.
38. Discuss the differences between the two most commonly
used data mining processes.
The main difference between
CRISP-DM and SEMMA is that CRISP-DM takes a more comprehensive
approach—including understanding of the business and the relevant data—to data
mining projects, whereas SEMMA implicitly assumes that the data mining project’s
goals and objectives along with the appropriate data sources have been
identified and understood.
39. Are data mining processes a mere sequential set of
activities?
Even though these steps are sequential in nature, there is
usually a great deal of backtracking. Because data mining is driven by
experience and experimentation, depending on the problem situation and the
knowledge/experience of the analyst, the whole process can be very iterative
(i.e., one should expect to go back and forth through the steps quite a few
times) and time consuming. Because latter steps are built on the outcome of the
former ones, one should pay extra attention to the earlier steps in order not
to put the whole study on an incorrect path from the onset.
40. Why do we need data preprocessing? What are the main
tasks and relevant techniques used in data preprocessing?
Data preprocessing is essential to any successful data
mining study. Good data leads to good information; good information leads to
good decisions. Data preprocessing includes four main steps:
data consolidation: access, collect, select and filter data
data cleaning: handle missing data, reduce noise, fix
errors
data transformation: normalize the data, aggregate data,
construct new attributes
data reduction: reduce number of attributes and records;
balance skewed data
41. Discuss the reasoning behind the assessment of
classification models.
The model-building step also
encompasses the assessment and comparative analysis of the various models
built. Because there is not a universally known best method or algorithm for a
data mining task, one should use a variety of viable model types along with a
well-defined experimentation and assessment strategy to identify the “best”
method for a given purpose.
42. What is the main difference between classification
and clustering? Explain using concrete examples.
Classification learns patterns
from past data (a set of information—traits, variables, features—on
characteristics of the previously labeled items, objects, or events) in order
to place new instances (with unknown labels) into their respective groups or
classes. The objective of classification is to analyze the historical data
stored in a database and automatically generate a model that can predict future
behavior. Classifying customers as
likely buyers or nonbuyers is an example.
Cluster analysis is an
exploratory data analysis tool for solving classification problems. The
objective is to sort cases (e.g., people, things, events) into groups, or
clusters, so that the degree of association is strong among members of the same
cluster and weak among members of different clusters. Customers can be grouped according to
demographics.
43. What are the most common myths and mistakes about
data mining?
Data
mining provides instant, crystal-ball predictions.
Data
mining is not yet viable for business applications.
Data
mining requires a separate, dedicated database.
Only those
with advanced degrees can do data mining.
Data
mining is only for large firms that have lots of customer data.
