The acquire stage involves selecting and gathering the raw data. The organize stage involves cleaning, pre-processing and storing the data. The analyse stage involves applying algorithms and methodologies to model the data, while the decide stage involves extracting the knowledge, making the decision and visualizing the outcome (Oracle, 2013). The data analytics and KDD processes are extensions of these phases.
Data Analytics can be roughly categorized into three main areas: Descriptive Analytics, Prescriptive Analytics and Predictive Analytics.
Descriptive Analytics
Descriptive analytics uses the analysis of historical data to model or classify data into groups or clusters by identifying not just single relationships in the data but many different relationships within and between data. It can be used, for example, to categorize or cluster customers into groups such as high spenders and low spenders, or by product preference, depending on the issue being analyzed. A target group can then be determined for new products to be introduced (Rose Business Technologies).
Descriptive Analytics tends to provide answers to analytic questions such as what happened and what is happening, using the provided data as input. This aspect of analytics generally strives to identify business opportunities and likely related problems (Delen and Demirkan, 2013).
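As a concrete illustration of this idea, the short sketch below clusters a handful of customers into spending groups with k-means. The scikit-learn library is used and all figures are invented for illustration, not taken from any source cited here.

```python
# A minimal sketch of descriptive analytics: clustering customers into
# spending groups with k-means. The spending figures are hypothetical
# example values, not data from the text.
import numpy as np
from sklearn.cluster import KMeans

# one row per customer: [monthly spend, number of purchases]
customers = np.array([
    [120, 4], [135, 5], [150, 6],     # low spenders
    [480, 15], [510, 17], [530, 16],  # high spenders
])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(customers)
print(kmeans.labels_)           # cluster assignment per customer
print(kmeans.cluster_centers_)  # typical profile of each group
```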
Prescriptive Analytics
Prescriptive Analytics applies a combination of mathematical algorithms to the data to find the best possible alternative actions or decisions that will improve the given analytic challenge (Delen and Demirkan, 2013). Prescriptive Analytics predicts future outcomes and also provides alternative measures or options; for example, it shows options for how to better exploit an advantage or how to mitigate a risk, together with the effects of each decision. Prescriptive analytics can model outcomes using both internal and external factors at the same time to provide decision options and the impact of the various options (Rose Business Technologies).
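The sketch below illustrates the prescriptive idea in its simplest form: score a few alternative actions under a simple model and recommend the one with the best expected outcome. The actions, probabilities and payoffs are hypothetical illustration values, not a prescription from the cited sources.

```python
# A minimal sketch of the prescriptive idea: score alternative actions under
# a simple model and pick the one with the best expected outcome. All values
# below are hypothetical.
actions = {
    # action: (probability of success, payoff if success, cost)
    "launch_campaign":  (0.60, 100_000, 30_000),
    "discount_pricing": (0.80,  60_000, 20_000),
    "do_nothing":       (1.00,       0,      0),
}

def expected_value(prob, payoff, cost):
    """Expected net benefit of an action."""
    return prob * payoff - cost

scores = {name: expected_value(*params) for name, params in actions.items()}
best = max(scores, key=scores.get)

print(scores)                        # impact of each option
print("recommended action:", best)   # the prescriptive recommendation
```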
Predictive Analytics
Predictive Analytics uses mathematical methods on the data to find the hidden patterns, tendencies and relationships that can be inferred from the data. This aspect of analytics provides answers to analytic questions such as what will happen and why it will happen. It is the outcome of this aspect of analytics that is used to make future predictions (Delen and Demirkan, 2013).
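A minimal sketch of this idea is shown below: a model is fitted on historical records and then used to answer "what will happen" for new, unseen cases. The scikit-learn library is assumed, and the features, labels and customers are hypothetical.

```python
# A minimal sketch of predictive analytics: fit a model on historical data
# and use it to predict future outcomes. The feature values and labels are
# hypothetical (e.g. whether a customer will buy again).
from sklearn.linear_model import LogisticRegression

# historical data: [age, monthly spend], label: bought again (1) or not (0)
X_hist = [[25, 120], [34, 480], [29, 150], [45, 530], [23, 135], [41, 510]]
y_hist = [0, 1, 0, 1, 0, 1]

model = LogisticRegression().fit(X_hist, y_hist)

# new, unseen customers: what will happen?
X_new = [[30, 140], [38, 500]]
print(model.predict(X_new))         # predicted outcome
print(model.predict_proba(X_new))   # probability of each outcome
```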
Predictive Analytics draws on several other areas of specialty such as machine learning, data mining, game theory, modelling and statistical methods. In a nutshell, it uses both historical and current data to analyze the given situation and to predict the future. The most commonly used areas of predictive analytics are decision analysis and optimization, transaction profiling and predictive modelling. In decision analysis and optimization, predictive analytics brings out the patterns in the given data, for example customer data, and these patterns can then be used to predict customer behavior for present or future situations (Rose Business Technologies). Statistics and data mining are two of the most important methodologies used in the KDD process for finding patterns (Fayyad et al., 1996).
Statistics
Statistics is the branch of science that deals with collecting and analyzing numerical data and then using the results of the analysis to make decisions, resolve issues or create new products and processes (Montgomery and Runger, 2003). There are two areas of statistics: descriptive statistics and inferential statistics.
Descriptive Statistics
Descriptive statistics is the part of statistics that, as the name implies, describes data and presents a summary of the data, showing the central tendency of the sample data or the distribution of the sample data. The central tendency helps to show whether or not the data follows a normal distribution. The main measures of central tendency are the mean, the median and the mode of the sample data. The mean is the average, that is, the sum of the values of the variable divided by the number of items. The median is the middle value of all the items when they are arranged in ascending or descending order, while the mode is the item with the highest count, that is, the most frequently occurring item. Which of these measures is used depends on the nature of the data. If the data set includes outliers, values that are clearly different from the rest of the data set, then the median is used because it is less sensitive to outliers. The mean is very sensitive to outliers, and the mode is mostly used to represent categorical data, for example on a bar chart, but it does not produce a unique result in the way the mean or the median does (Montgomery and Runger, 2003).
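The following short sketch, using Python's standard statistics module, computes the three measures on a small hypothetical sample of spending values and shows why the median is preferred when an outlier is present.

```python
# A minimal sketch of descriptive statistics: mean, median and mode of a
# small sample. The spending figures are hypothetical example values.
from statistics import mean, median, mode

spend = [120, 135, 150, 150, 160, 175, 990]  # 990 is an outlier

print("mean:  ", mean(spend))    # pulled upwards by the outlier
print("median:", median(spend))  # less sensitive to the outlier
print("mode:  ", mode(spend))    # most frequently occurring value
```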
Inferential Statistics
Inferential statistics, on the other hand, is the area of statistics concerned with taking a sample from a large number of objects (a population of customers or products, for example) and using the analysis of this sample to infer outcomes or make predictions that can be applied to the entire population (Marshall and Jonker, 2011; Montgomery and Runger, 2003).
It involves using an appropriate sampling method to obtain a sample that is representative of the entire population in question. Using statistical techniques it is possible to calculate the sample size needed to carry out an inferential analysis. Inferential statistics involves different statistical tests, such as hypothesis tests and the t-test, using conventions like a confidence level of 95% or a significance level of 0.05. These values, though arbitrary, have been set by statisticians and used as a convention based on best practice over many years (Marshall and Jonker, 2011). They are predefined in the various data mining tools and can be adjusted as needed during any data mining project.
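As an illustration, the sketch below runs a two-sample t-test at the conventional 0.05 significance level using SciPy; the two customer groups and their spending values are hypothetical.

```python
# A minimal sketch of an inferential test: a two-sample t-test at the
# conventional 0.05 significance level. The two samples are hypothetical
# spending figures for two customer groups.
from scipy import stats

group_a = [120, 135, 150, 160, 175, 180, 165]
group_b = [140, 155, 170, 185, 190, 200, 195]

t_stat, p_value = stats.ttest_ind(group_a, group_b)

alpha = 0.05  # conventional significance level
if p_value < alpha:
    print(f"p = {p_value:.3f}: reject the null hypothesis of equal means")
else:
    print(f"p = {p_value:.3f}: no evidence against the null hypothesis")
```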
However, it is worth stating at this point that stand-alone projects in statistics do not have a general standard process model, such as the CRISP-DM standard, that can act as a reference when carrying out statistical projects.
Data Mining
When talking about data analytics, the first discipline that comes to mind is data mining. Data mining, as defined by Daciuk and Jou (2011), is the process whereby statistical and mathematical methods as well as pattern recognition methodologies are used to analyse large sets of data with the aim of finding patterns, tendencies and mutual relationships in the data (Daciuk and Jou, 2011; Fayyad et al., 1996). In other words, data mining is a complex process that involves a significant amount of computing resources in observing data and extracting hidden but useful and meaningful patterns from it using different methodologies and statistical algorithms. The results of the analysis can then be used to make decisions for current situations or to predict or forecast future situations for the given analytic issue (Delen and Demirkan, 2013).
Although many of these processes can be carried out with sophisticated automated data mining tools, a human with domain knowledge is still required to utilize the tools and make accurate decisions and predictions. The basic stages of the data mining process, which are available in most data mining tools, are: exploring the data, fitting models to the data, comparing the models to determine which is best for the given analytic issue, and presenting the report.
Exploring the data
This involves studying the raw data and finding out, for example, whether there are missing values for some variables, whether the provided values are misleading or not representative enough, and whether values need to be replaced or modified before the data analysis is carried out. This is generally regarded as the cleaning, data understanding and data preparation part of the process. It will also influence the type of model to be used on the data. For example, if the modelling has to be done on the provided data with the missing values still present, then a decision tree model may be the most suitable model because decision trees can best handle missing values in variables.
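A small sketch of this exploration step is given below using pandas; the customer records, the missing values and the cleaning decision are all hypothetical and are included only to illustrate the kind of checks described above.

```python
# A minimal sketch of the data exploration step with pandas. The customer
# records below are hypothetical values invented for illustration.
import pandas as pd
import numpy as np

df = pd.DataFrame({
    "customer_id": [1, 2, 3, 4, 5],
    "spend":       [120.0, np.nan, 150.0, 990.0, np.nan],  # missing and outlier values
    "segment":     ["low", "low", "low", "high", None],
})

print(df.dtypes)                # variable types
print(df.isna().sum())          # missing values per variable
print(df["spend"].describe())   # spot values that look misleading (e.g. 990.0)

# one possible cleaning decision: replace missing spend with the median,
# because the median is less sensitive to outliers than the mean
df["spend"] = df["spend"].fillna(df["spend"].median())
```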
Fitting models to the data
Based on the goal of the analysis and the function that the model should serve, this is the stage where the specific model, containing the parameters to be determined from the data, is fitted to the data. It can be a classification model if there are predefined classes (also known as supervised classification), used to predict categorical class variables. It can be a clustering model if there are no predefined classes or classifications, which is also known as unsupervised classification. It can also be a regression model, which predicts a variable by mapping it to a real value. At this stage it can also be decided what additional algorithms or rules to use on the data; for example, association rules can be used to find the items in a dataset that frequently occur together (Fayyad et al., 1996).
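The sketch below illustrates these options side by side with scikit-learn: a supervised classifier for predefined classes, an unsupervised clustering model, and a regression model for a real-valued target. The data is synthetic and generated only for illustration.

```python
# A minimal sketch of the model-fitting step: supervised classification,
# unsupervised clustering and regression on synthetic data.
from sklearn.datasets import make_classification, make_regression
from sklearn.tree import DecisionTreeClassifier
from sklearn.cluster import KMeans
from sklearn.linear_model import LinearRegression

X_cls, y_cls = make_classification(n_samples=200, n_features=5, random_state=0)
X_reg, y_reg = make_regression(n_samples=200, n_features=5, random_state=0)

classifier = DecisionTreeClassifier().fit(X_cls, y_cls)                  # supervised classification
clusterer = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X_cls)   # unsupervised classification
regressor = LinearRegression().fit(X_reg, y_reg)                         # regression on a real value

print(classifier.predict(X_cls[:3]))   # predicted class labels
print(clusterer.labels_[:3])           # discovered cluster labels
print(regressor.predict(X_reg[:3]))    # predicted real values
```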
Comparing the models and representing the outcomes
Depending on the data mining tool used, the models can be compared to see which one provides the best answer to the analytic problem. This can be done by comparing the error of each model and then choosing a representation model that best suits the situation and can best display the outcome in a human-readable form. Models for representation include the decision tree model, linear models such as the regression model, and non-linear models such as the neural network model. Other models include the nearest neighbour and the Bayesian network model (Fayyad et al., 1996). The process model used during the entire project can be a major determinant of the output type, quality and deployment method.
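A minimal sketch of such a comparison is shown below: two candidate models are fitted and their error on held-out data is compared, and the one with the lower error would be kept. scikit-learn is assumed and the data is synthetic.

```python
# A minimal sketch of model comparison: fit two candidate models and compare
# their error on held-out data.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

X, y = make_classification(n_samples=400, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

candidates = {
    "decision_tree":     DecisionTreeClassifier(random_state=0),
    "nearest_neighbour": KNeighborsClassifier(n_neighbors=5),
}

for name, model in candidates.items():
    model.fit(X_train, y_train)
    error = 1 - accuracy_score(y_test, model.predict(X_test))
    print(f"{name}: error = {error:.3f}")   # lower error -> preferred model
```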
Process models
All data mining projects basically follow a sequence of processes, mostly predefined. Some analysts use their own process, some use their company's specific process, and others choose from a variety of available processes such as SEMMA, the KDD process and CRISP-DM. The most commonly used standard is the Cross Industry Standard Process for Data Mining (CRISP-DM), according to Piatetsky-Shapiro (2007), based on a survey published on his website in 2007, as shown in the figure below taken from that website.
Cross Industry Standard Process for Data Mining (CRISP-DM) – the origin
CRISP-DM is a reference model for data mining projects first proposed in 1996 by a group of companies who were using data mining technology. The first company was the then Daimler-Benz, an automobile giant in Germany, later known as DaimlerChrysler AG. The second was SPSS Incorporated, a computer software company in the United States of America (USA), which was acquired by IBM in 2009 and is now called IBM SPSS. The third was NCR Corporation, a computer hardware and electronics company based in the USA. The fourth was Teradata Corporation, a computer company also based in the USA. The fifth was the OHRA insurance company based in the Netherlands. These companies came together to define a model that would be a standard for all data mining related processes. They coined the acronym CRISP-DM, obtained European Commission funding for the project and finally produced the end result, which is the documentation of the standard procedure that serves as a reference model for data mining processes (SPSS; Chapman et al., 2000).
What is CRISP-DM
CRISP-DM is a reference model that proposes a sequence of processes to be followed in a typical data mining project. As a reference model, not all of the proposed processes may be needed, depending on the nature of a particular project, but CRISP-DM is the recognized process model standardized for use in data mining projects (Marbán et al., 2009).
As shown in figure 3 below, it comprises six processes that cover everything involved in the data mining project lifecycle, starting from the business understanding stage and proceeding sequentially to the final deployment stage. It is meant to serve as a guide for professionals in the data mining profession when carrying out any data mining project.
CRISP-DM processes
1. Business understanding
The business understanding phase of CRISP-DM is the stage where the analyst tries to understand the objectives of the project and what the business goals are. This includes gathering as much information as possible regarding the proposed project, looking at the current solutions if any exist, and describing the problem area and how it could be resolved using data mining. The analyst assesses the situation to identify the data available for the analysis, taking into consideration the time and financial factors, tries to figure out any risks that might be involved, makes contingency plans and takes note of the resources available for the project. During this phase a data mining plan is created, stating the data mining problem to be resolved, such as clustering, classification or prediction, and providing actual numbers in the initial sketch where possible. At the end of this phase the data mining plan should show all the phases of the project from start to finish, the estimated time for each phase, and the available resources and risks for each phase (IBM SPSS Modeler CRISP-DM).
2. Data understanding
The data understanding phase involves collecting the required data from the available sources and studying and exploring the data. Using the available methods, describe and summarize the data so as to have a clear overview of it and be able to figure out details about its quality. The data summary should show the attributes of the variables, the variable types and the values, displayed using tables and/or graphs and charts. Do a quality check to discover the variables with missing values or wrong inputs, and conclude the phase with a data quality report (IBM SPSS Modeler CRISP-DM).
3. Data preparation
Based on the reports from the previous phases, the data can be prepared for the intended type of analysis by cleaning and formatting it where necessary. This phase involves selecting the required data from the explored data, adding additional data if needed, replacing missing values, adding new attributes and splitting the data into test, validation and training data sets as needed.
A point to note here is that although data mining tools provide the option of partitioning the data into three different sections (test, validation and training), many analysts consider the test partition a waste of resources and prefer to partition the data equally into validation and training sets so as to have enough data for the analysis. The consequence of this choice appears later in the process, when the models are compared to find the one that best resolves the problem. If more data is assigned to the training partition the prediction will be better and more stable, but the model assessment will be less stable at the assessment stage. On the other hand, if less data is assigned to the training partition the prediction will be less stable but the model assessment more stable. This is an area that will require some future research in data mining tools.
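For illustration, the sketch below shows one way the three-way partitioning could be done with scikit-learn; the 60/20/20 split is an arbitrary example choice, not a recommendation from the cited sources.

```python
# A minimal sketch of the data preparation partitioning described above:
# splitting a data set into training, validation and test partitions.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=10, random_state=0)

# first split off the test partition, then split the rest into training and validation
X_rest, X_test, y_rest, y_test = train_test_split(X, y, test_size=0.20, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X_rest, y_rest, test_size=0.25, random_state=0)

print(len(X_train), len(X_val), len(X_test))   # 600 200 200
```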
4. Modelling
The modelling phase involves fitting models to the data by using the required model and setting the parameters as needed. The modelling stage can be repeated several times, testing and trying out different modelling techniques with different parameters. The tests can be repeated by tuning the parameters slightly as the case may be, monitoring the results and drawing some initial conclusions on the models. Assess the results of the various tests, compare the models using the available techniques and find the best result that meets the requirements of the analytic problem.
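As an illustration of trying the same technique with different parameter settings, the sketch below runs a small cross-validated grid search over decision tree depths using scikit-learn; the data and the parameter grid are hypothetical.

```python
# A minimal sketch of the modelling phase: try the same model technique with
# different parameter settings and compare the results.
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=0)

search = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_grid={"max_depth": [2, 4, 8, None]},  # parameters to tune
    cv=5,                                       # cross-validated assessment
)
search.fit(X, y)

print(search.best_params_)   # parameter setting that performed best
print(search.best_score_)    # cross-validated score of the best model
```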
5. Evaluation
The evaluation phase involves assessing the modelling results to ensure that they align with the predefined business and data mining goals. Document the findings and state the conclusions, along with any assumptions or new issues raised by the findings. Make sure the findings are presented in a human-understandable form. Summarize the activities and decisions for all the phases and, if necessary, go back and review the phases, making adjustments where needed.
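The sketch below illustrates one way such an assessment could be presented in a human-understandable form: a confusion matrix and a classification report for the chosen model on held-back data, using scikit-learn and synthetic data.

```python
# A minimal sketch of the evaluation phase: check the chosen model against
# held-back data and print the result in a readable form.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import classification_report, confusion_matrix

X, y = make_classification(n_samples=600, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = DecisionTreeClassifier(max_depth=4, random_state=0).fit(X_train, y_train)
y_pred = model.predict(X_test)

print(confusion_matrix(y_test, y_pred))        # where the model goes wrong
print(classification_report(y_test, y_pred))   # summary for the evaluation report
```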
6. Deployment
This is the last phase of the project, where the final results of the entire process are deployed into the production environment in the form of an improvement, an informed decision or a prediction, depending on the original goal of the analysis. If necessary, create a deployment plan for this stage, note all the requirements, and state any further actions that need to be taken or monitored and any future requirements. Write a final report summarizing the findings of the project and any recommendations, taking into consideration the intended audience and structuring the report to meet the level of the audience. Present the project report if required.
Conclusion
This paper touched on data analytics in general as an area of data science that is multi-disciplinary, involving several other methodologies from statistics, machine learning, visualization, artificial intelligence and data mining. Data analytics can be subdivided into three different types: descriptive, prescriptive and predictive analytics. Statistics is one of the methodologies in data analytics. It comprises descriptive statistics, which describes data by providing a summary in the form of the central tendency measures (mean, median and mode), and inferential statistics, which uses different statistical tests to resolve problems and predict future outcomes. There is no cross-industry standard for standalone projects in statistics comparable to CRISP-DM. Data mining is a part of data analytics that uses statistical and mathematical methods to explore, analyse and model data, using the findings to make predictions and decisions. Data mining projects follow the standard process proposed in the reference model CRISP-DM, which details the different phases of a data mining project, how it should be done and what the outcome of each phase should strive to be.
Further research issues
The data partition step of the data mining process involves some ambiguities. Further research is needed in this area to resolve the problem of data partitioning in the data preparation stage: how can all three partitions be used in the analysis without compromising the model assessment, or how can the three-partition system be reduced to two partitions, eliminating the test partition?
Just like CRISP-DM, there could be an industry standard for standalone statistics projects too: for example a CRISP for Statistical Inference (CRISP-SI) or something similar that can serve as a reference when doing statistical prediction projects.
CRISP-DM is a well detailed process model, and most people refer to it as a methodology although it does not fulfil the criteria of a methodology in that it does not specify how things should be done. When a different process model such as SEMMA is used in the SAS Enterprise Miner tool, the process is easy to follow because the tool is developed to support it. Research is needed on how to make CRISP-DM customizable in such a way that all data mining tools can provide features that support its processes.
Since there are different disciplines involved in the data analytics and knowledge discovery process, CRISP could be extended to other areas too, just as it was for data mining, so that there is a CRISP for every methodology involved in the data analysis and knowledge discovery process.
References
Hilborn, D. L. (2013). Healthcare Informational Data Analytics (December 3, 2013). Available at SSRN: http://ssrn.com/abstract=2362781 [Accessed 23rd December 2013].
Fayyad, U., Piatetsky-Shapiro, G. and Smyth, P. (1996). The KDD process for extracting useful knowledge from volumes of data. Communications of the ACM, 39(11), 27-34. DOI: 10.1145/240455.240464, http://doi.acm.org/10.1145/240455.240464
Oracle (2013). MySQL and Hadoop – Big Data Integration. URL: http://www.mysql.com/why-mysql/white-papers/mysql_wp_enterprise_ready.php
Marbán, Ó., Mariscal, G. and Segovia, J. (2009). A Data Mining & Knowledge Discovery Process Model. In: Ponce, J. and Karahoca, A. (eds.), Data Mining and Knowledge Discovery in Real Life Applications, I-Tech, Vienna, Austria, pp. 438-453. ISBN 978-3-902613-53-0.
Nadali, A., Kakhky, E. N. and Nosratabadi, H. E. (2011). Evaluating the success level of data mining projects based on CRISP-DM methodology by a Fuzzy expert system. In: Electronics Computer Technology (ICECT), 2011 3rd International Conference on, Vol. 6. IEEE.
Daciuk, T. and Jou, S. (2011). An introduction to data mining and predictive analytics. In: Litoiu, M., Stroulia, E. and MacKay, S. (eds.), Proceedings of the 2011 Conference of the Center for Advanced Studies on Collaborative Research (CASCON '11), IBM Corp., Riverton, NJ, USA, pp. 323-324.
Delen, D. and Demirkan, H. (2013). Data, information and analytics as services. Decision Support Systems, 55(1), 359-363. DOI: 10.1016/j.dss.2012.05.044, http://dx.doi.org/10.1016/j.dss.2012.05.044
Marshall, G. and Jonker, L. (2011). An introduction to inferential statistics: A review and practical guide. Radiography, 17(1), e1-e6. ISSN 1078-8174. URL: http://dx.doi.org/10.1016/j.radi.2009.12.006
Montgomery, D. C. and Runger, G. C. (2003). Applied Statistics and Probability for Engineers. 3rd ed. John Wiley & Sons, Inc., United States of America.