Sunday, 21 June 2015

Steps to set up Mahout and Maven in Eclipse

Installing/setting up Mahout and building a recommender system
Part 1: Installing Mahout
Step 1:
Install m2eclipse (the m2e Maven integration plugin) from within Eclipse using its update-site URL:
Start Eclipse > Help > Install New Software > Add > enter m2eclipse as the name and paste the m2e update-site URL into the URL field > Next > Install (this will take a while) > Next > Next > accept the agreement > Next > restart Eclipse to complete the installation
Part 2: Building a User Recommender System
Step 1: create a project in eclipse
Creating the project: File > New > Other > Maven Project > Next > select an archetype, e.g. quickstart > enter the names, e.g.:
group id: com.recommender
artifact id: RecommenderApp
package: com.recommender1.RecommenderApp1
Click on Finish

The project will be created in eclipse and the source file will be located under:
project-name/src/main/java

Issues you might encounter:
1.       The build path error/warning:
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.
To eliminate the build path warning in Eclipse: right-click on the project > Properties > Java Build Path > Libraries > Add Library > JRE System Library > Alternate JRE > then select the old JRE System Library [1.5] entry > Remove
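Note: the log4j warnings themselves come from a missing log4j configuration rather than the build path. Per the FAQ link in the warning, they can also be silenced by adding a minimal log4j.properties under src/main/resources, along these lines (the INFO level and console pattern here are just example choices):
# minimal log4j.properties (example)
log4j.rootLogger=INFO, console
log4j.appender.console=org.apache.log4j.ConsoleAppender
log4j.appender.console.layout=org.apache.log4j.PatternLayout
log4j.appender.console.layout.ConversionPattern=%d{ISO8601} %-5p %c - %m%n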
Edit pom.xml

Go to the official release site: https://mahout.apache.org/general/downloads.html

and copy the snippet listed under "For Maven users please include the following snippet in your pom".
In Eclipse expand the project and find the file pom.xml > double-click on it and it will show on the right side of the workspace (underneath) >

Click on the pom.xml tab and it will open the following:
BEFORE:
-----------------------------------------------------------------------------------------------------------------------
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
  xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
  <modelVersion>4.0.0</modelVersion>

  <groupId>com.cold-start-rsystem</groupId>
  <artifactId>cold-start</artifactId>
  <version>0.0.1-SNAPSHOT</version>
  <packaging>jar</packaging>

  <name>cold-start</name>
  <url>http://maven.apache.org</url>

  <properties>
    <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
  </properties>

  <dependencies>
    <dependency>
      <groupId>junit</groupId>
      <artifactId>junit</artifactId>
      <version>3.8.1</version>
      <scope>test</scope>
    </dependency>
  </dependencies>
</project>

--------------------------------------------------------------------------------
Add the dependency below to the file:
<dependency>
    <groupId>org.apache.mahout</groupId>
    <artifactId>mahout-core</artifactId>
    <version>${mahout.version}</version>
</dependency>
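Note: the snippet from the Mahout site references a ${mahout.version} Maven property. If you keep the property rather than hard-coding the version (as is done further below), define it in the <properties> section of the pom, e.g.:
<properties>
    <mahout.version>0.9</mahout.version>
</properties>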

So once again: in Eclipse > click on the pom.xml tab at the bottom of the page > paste the dependency inside the <dependencies> element, just above the junit entry, as shown below:
AFTER:
---------------------------------------------------------------------------------------------------------------
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
       xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
       <modelVersion>4.0.0</modelVersion>

       <groupId>com.cold-start-rsystem</groupId>
       <artifactId>cold-start</artifactId>
       <version>0.0.1-SNAPSHOT</version>
       <packaging>jar</packaging>

       <name>cold-start</name>
       <url>http://maven.apache.org</url>

       <properties>
              <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
       </properties>

       <dependencies>
              <dependency>
                     <groupId>org.apache.mahout</groupId>
                     <artifactId>mahout-core</artifactId>
                     <version>0.9</version>
              </dependency>
              <dependency>
                     <groupId>junit</groupId>
                     <artifactId>junit</artifactId>
                     <version>3.8.1</version>
                     <scope>test</scope>
              </dependency>
       </dependencies>
</project>
 ---------------------------------------------------------------------------------------------------------------
Edit the version to correspond to your Mahout version, as shown above for version 0.9:
1.  e.g. enter 0.9 like: <version>0.9</version>
2.  Select all (ctrl + A)
3.  Space it all nicely (ctrl + i)
4.  Save it (ctrl + s)
5.       Close it. This is the end of the quickstart outlined here: https://mahout.apache.org/users/basics/quickstart.html

Next step: Get the dataset and add the dataset
1.       Get the dataset, e.g. from MovieLens or from the link below: https://mahout.apache.org/users/recommender/userbased-5-minutes.html
2.       Copy the data
3.       Create a folder in the project: right-click on the project > New > Folder > name it e.g. data
4.       Create a file in the newly created folder: right-click on the folder > New > File > name it e.g. dataset.csv > drag and drop the file to the work area of Eclipse (it has to be an empty work area) > paste the data into it > save it: Ctrl+S > close it
5.       Now go to the project (RecommenderApp) > src/main/java > com/App.java and add the recommender code; a minimal sketch is shown below
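A minimal user-based recommender sketch along the lines of the Mahout "recommender in 5 minutes" page linked above. It assumes dataset.csv uses the userID,itemID,rating format that Mahout's FileDataModel expects (one rating per line, e.g. 1,10,1.0); the file path data/dataset.csv, the user ID 2 and the neighbourhood threshold 0.1 are example choices, not requirements:
---------------------------------------------------------------------------------------------------------------
// package declaration omitted; match it to your project's package
import java.io.File;
import java.util.List;

import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.impl.neighborhood.ThresholdUserNeighborhood;
import org.apache.mahout.cf.taste.impl.recommender.GenericUserBasedRecommender;
import org.apache.mahout.cf.taste.impl.similarity.PearsonCorrelationSimilarity;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.neighborhood.UserNeighborhood;
import org.apache.mahout.cf.taste.recommender.RecommendedItem;
import org.apache.mahout.cf.taste.recommender.UserBasedRecommender;
import org.apache.mahout.cf.taste.similarity.UserSimilarity;

public class App {
    public static void main(String[] args) throws Exception {
        // Load the ratings file created in the data folder above
        DataModel model = new FileDataModel(new File("data/dataset.csv"));
        // Pearson correlation between users' rating vectors
        UserSimilarity similarity = new PearsonCorrelationSimilarity(model);
        // Neighbourhood of users whose similarity exceeds the 0.1 threshold
        UserNeighborhood neighborhood = new ThresholdUserNeighborhood(0.1, similarity, model);
        UserBasedRecommender recommender = new GenericUserBasedRecommender(model, neighborhood, similarity);
        // Top 3 recommendations for user 2
        List<RecommendedItem> recommendations = recommender.recommend(2, 3);
        for (RecommendedItem recommendation : recommendations) {
            System.out.println(recommendation);
        }
    }
}
---------------------------------------------------------------------------------------------------------------
Run it as a Java application; it should print up to three RecommendedItem lines for user 2.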
Issues: "import cannot be resolved" errors, e.g.:
The import org.apache cannot be resolved
Resolution: download and import the Mahout libraries
1.       Download your version from: http://www.whoishostingthis.com/mirrors/apache/mahout/
2.       Unzip it to a directory
3.       From Eclipse import the libraries: right-click on the project > Build Path > Configure Build Path > Libraries > Add External JARs > browse to the directory > select:
4.       mahout-core-0.9.jar > Open
5.       mahout-core-0.9-job.jar > Open
6.       mahout-integration-0.9.jar > Open
7.       mahout-math-0.9.jar (if needed) > Open

Issues and problems encountered:
SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
SLF4J: Defaulting to no-operation (NOP) logger implementation
SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details.
Resolution is:
1.       Create a folder called lib
2.       Download the file slf4j-nop-1.7.5 from (make sure it is a trusted site): http://mvnrepository.com/artifact/org.slf4j/slf4j-nop/1.7.5
3.       Drag the file to the lib folder created in the project
4.       Right-click on the file slf4j-nop-1.7.5.jar > Build Path > Add to Build Path > refresh the project > run the project and the issue should be resolved
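Alternatively, since this is already a Maven project, the same slf4j-nop binding can be pulled in through the pom instead of dragging the jar in manually; a dependency along these lines should work:
<dependency>
    <groupId>org.slf4j</groupId>
    <artifactId>slf4j-nop</artifactId>
    <version>1.7.5</version>
</dependency>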

Wednesday, 25 December 2013

Data Analytics and Predictive Analytics, Inferential Statistics and CRISP-DM process model

Abstract

Data Analytics as a science that deals with data has evolved immensely over the years. It has grown to become vital in the processing and analysis of data for businesses and organisations. Its applications span the private and public sectors as well as governmental, institutional and commercial areas. Using the findings of the data analysis, informed decisions can be made and future predictions attempted. This paper looks at Data Analytics from a Data Mining perspective, taking a closer look at the processes involved in data mining analysis, in line with the recognised standard for data mining projects. There are various process models used by data analysts for carrying out data mining projects, and some companies have developed their own process to match their data mining tools. An example is the process called Sample, Explore, Modify, Model and Assess (SEMMA), developed by the SAS Institute to match their data mining tool, SAS Enterprise Miner. Although there are different process models for data analysis and Knowledge Discovery in Databases (KDD), the process reviewed in this paper is the generally recognised standard process for data mining projects, the Cross Industry Standard Process for Data Mining (CRISP-DM). Even though the CRISP-DM process is not all-encompassing, does not really fulfil the criteria for a methodology, and not all the processes proposed in the model are needed in every project, it contains very useful guidelines that can see any data mining project to success. This paper does not analyse the various individual disciplines involved in the data analytics process, such as machine learning, visualization and artificial intelligence.

Keywords

Data analytics, predictive analytics, data analysis, data mining, knowledge discovery in databases, CRISP-DM, KDD, CRISP, process model for data mining

Introduction

For a layman it might be difficult to tell the difference between data analytics and data mining. While data analytics encompasses various disciplines, from statistics, visualization, neurocomputing, machine learning and artificial intelligence to databases, data mining is just one of the many different methodologies that make up data analytics. Data mining on its own also employs methodologies from areas like databases, mathematics, statistics and pattern recognition in its analysis, following a sequence of processes. It is not mandatory to follow the predefined processes in CRISP-DM; other process models can also be used to accomplish the data mining task. The data analytics profession is an area that is still evolving at a very fast pace, just like data mining, which is a part of it. More and more professionals are getting involved in it, both within and outside the data science profession. In order to make it easy for the different parties involved and to foster better understanding and communication among data analysts, the CRISP-DM reference model was developed and agreed on as the standard for all data mining projects.

This paper takes a general look at data analytics and a more detailed look at data mining using CRISP-DM. The first section looks at data analytics and the different types of data analytics: descriptive, prescriptive and predictive.

It then looks at data mining and the stages in data mining, before looking at data mining process models. This is followed by the CRISP-DM process reference model and finally the conclusion.

 

Data Analytics as a science that deals with data has evolved immensely over the years. It has evolved to become vital in the processing and analysis of data for businesses and organisations, and has provided huge advantages for both in being able to use the findings of the analysis not just to make informed decisions but also to predict future situations based on the data.

A simple web definition of Data Analytics is that it is a branch of science that deals with gathering huge data sets, sometimes called Big Data; exploring and analysing the data; transforming the data; and modelling the data so that hidden patterns, associations or correlations can be discovered in it. The outcome or knowledge discovered from the data analysis can then be used to make informed decisions and to predict future outcomes and expectations. It is a multi-disciplinary process that incorporates methodologies from various expert domains like Visualization, Artificial Intelligence, Machine Learning, Databases, High Performance Computing, Statistics and Data Mining (Hilborn, 2013; Wikipedia, 2013).

Fayyad et al. (1996) in their definition referred to this multi-step process simply as Knowledge Discovery in Databases (KDD) and stated that Data Mining and Statistics are just two particular steps in the entire process. Data Mining and Statistics use specific algorithms to extract patterns from the data, while the other steps in the KDD process (data gathering, data exploration, data analysis, data transformation and data modelling) help to infer knowledge from the data (Fayyad et al., 1996).

The data lifecycle for both Data Analytics and KDD follows more or less the same process, as shown in the figure below.

[Figure 1: The data lifecycle – Acquire > Organize > Analyze > Decide (Oracle, 2013)]

The acquire stage involves selecting and gathering the raw data. The organize stage involves cleaning the data, doing the necessary pre-processing and storing the data. The analyse stage involves applying algorithms and methodologies to model the data, while the decide stage involves extracting the knowledge, making the decision and visualizing the outcome (Oracle, 2013). The data analytics and KDD processes are extensions of these phases.

Data Analytics can be categorized roughly into 3 main areas: Descriptive Analytics, Prescriptive Analytics and Predictive Analytics.

Descriptive analytics

Descriptive analytics uses the analysis of historical data to model or classify data into groups or clusters, identifying not just single relationships in the data but many different relationships within and between data. It can be used, for example, to categorize or cluster customers into groups such as high spenders and low spenders, or by product preferences, based on the issue being analyzed. For example, a target group can be determined for new products to be introduced (Rose Business Technologies).

Descriptive Analytics tends to provide solutions to analytic questions like what happened and why did it happen, using the provided data as input. This aspect of Analytics generally strives to identify business opportunities and likely related problems (Dursun Delen and Haluk Demirkan, 2013).

Prescriptive Analytics

Prescriptive Analytics uses a combination of mathematical algorithms on the data to find the best possible alternative actions or decisions that will best address the given analytic challenge (Dursun Delen and Haluk Demirkan, 2013).

Prescriptive Analytics predicts future outcomes and also provides alternative measures or options; for example, it shows options on how to better utilize advantages or how to mitigate risk, showing the effects of both decisions. Prescriptive analytics can model outcomes using both internal and external factors at the same time to provide decision options and the impact of the various options (Rose Business Technologies).

Predictive Analytics

Predictive Analytics uses mathematical methods on the data to find the hidden patterns, tendencies and relationships that can be inferred from the data. This aspect of Analytics provides solutions to analytic questions like what will happen and why it will happen. It is the outcome of this aspect of analytics that is used to make future predictions (Dursun Delen and Haluk Demirkan, 2013).

Predictive Analytics draws on many other areas of specialty, like Machine Learning, Data Mining, Game Theory, Modelling and Statistical Methods. In a nutshell it uses both historical and current data to analyze given situations and to predict the future. The most commonly used areas of predictive analytics are Decision Analysis and Optimization, Transaction Profiling and Predictive Modelling. In Decision Analysis and Optimization, Predictive Analytics brings out the patterns in the given data, for example customer data, and these patterns can then be used to predict customer behavior for present or future situations (Rose Business Technologies). Statistics and Data Mining are two of the most important methodologies used in the KDD process for pattern finding (Fayyad et al., 1996).

Statistics

Statistics is the branch of science that deals with collecting and analyzing numerical data and then using the results of the analysis to make decisions, resolve issues or create new products and processes (Montgomery and Runger, 2003). There are two areas of statistics: Descriptive Statistics and Inferential Statistics.

Descriptive Statistics

Descriptive Statistics is the part of statistics that, as the name implies, describes data and presents a summary of it, showing the central tendency of the sample data or the distribution of the sample data. The central tendency will show whether or not it is a normal distribution. The main measures of central tendency are the Mean, the Median and the Mode of the sample data. The Mean represents the average, which is the sum of the values of the variable divided by the number of items. The Median is the middle value of all the items when they are arranged in descending or ascending order, while the Mode is the item that has the most counts, that is, the most frequently occurring item. Which of these measures is used depends on the nature of the data. If the data set includes outliers, numerical values that are clearly different from the rest of the data set, then the Median is used because it is less sensitive to outliers. The Mean is very sensitive to outliers, and the Mode is mostly used to represent categorical data, for example on a bar chart, but it does not produce unique results like the Mean or the Median (Montgomery and Runger, 2003).
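As a quick illustration of the three measures and of the mean's sensitivity to outliers, here is a small Java sketch; the data values are invented for illustration:
---------------------------------------------------------------------------------------------------------------
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

public class CentralTendency {
    public static void main(String[] args) {
        double[] data = {2, 3, 3, 5, 7, 10, 48}; // 48 is an outlier

        // Mean: sum of the values divided by the number of items
        double mean = Arrays.stream(data).average().orElse(Double.NaN);

        // Median: middle value of the sorted data
        double[] sorted = data.clone();
        Arrays.sort(sorted);
        double median = sorted.length % 2 == 1
                ? sorted[sorted.length / 2]
                : (sorted[sorted.length / 2 - 1] + sorted[sorted.length / 2]) / 2.0;

        // Mode: the most frequently occurring item
        Map<Double, Integer> counts = new HashMap<>();
        for (double v : data) counts.merge(v, 1, Integer::sum);
        double mode = counts.entrySet().stream()
                .max(Map.Entry.comparingByValue()).get().getKey();

        System.out.printf("mean=%.2f median=%.2f mode=%.2f%n", mean, median, mode);
    }
}
---------------------------------------------------------------------------------------------------------------
Here the single outlier, 48, pulls the mean up to about 11.1 while the median stays at 5, which is why the median is preferred when outliers are present.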

Inferential Statistics

Inferential Statistics, on the other hand, is the area of statistics concerned with taking a sample from a large number of objects (a population of people, customers or products) and using the analysis of this sample to infer outcomes or make predictions that can be applied to the entire set of objects (Marshall and Jonker, 2011; Montgomery and Runger, 2003).

It involves using an appropriate sampling method to obtain a sample that will be representative of the entire set of objects in question. Using statistical techniques it is possible to calculate the sample size required to carry out an inferential analysis. Inferential statistics involves different statistical tests, like the hypothesis test and the t-test, using conventions such as a Confidence Level of 95% or a Significance Level of 0.05. These values, though arbitrary, are set by statisticians and used as a convention based on best practice over many years (Marshall and Jonker, 2011). These values are predefined in the various data mining tools and can also be adjusted as needed during any data mining project.
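To make the convention concrete, here is a small sketch that computes a one-sample t statistic by hand; the sample values and the hypothesised mean are invented for illustration, and in practice a statistics library or data mining tool would also supply the p-value:
---------------------------------------------------------------------------------------------------------------
public class OneSampleT {
    public static void main(String[] args) {
        double[] sample = {5.1, 4.9, 5.3, 5.0, 5.2, 4.8, 5.4, 5.1};
        double mu0 = 5.0; // hypothesised population mean (the null hypothesis)

        int n = sample.length;
        double mean = 0;
        for (double v : sample) mean += v;
        mean /= n;

        double ss = 0; // sum of squared deviations from the sample mean
        for (double v : sample) ss += (v - mean) * (v - mean);
        double sd = Math.sqrt(ss / (n - 1)); // sample standard deviation

        // t statistic: how many standard errors the sample mean lies from mu0
        double t = (mean - mu0) / (sd / Math.sqrt(n));
        System.out.printf("t = %.3f with %d degrees of freedom%n", t, n - 1);
        // Compare |t| with the critical value from a t table for n-1 degrees of
        // freedom at the conventional 0.05 significance level; if |t| is larger,
        // the null hypothesis is rejected.
    }
}
---------------------------------------------------------------------------------------------------------------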

However, it is worth stating at this point that stand-alone projects in statistics do not have any generally accepted standard process model, like the CRISP-DM standard process, that can act as a reference when doing statistical projects.

Data Mining

When talking about Data Analytics the first discipline that comes to mind is Data Mining. Data mining, as defined by Daciuk and Jou (2011), is the process whereby statistical and mathematical methods as well as pattern recognition methodologies are used to analyse large sets of data with the aim of finding patterns, tendencies and mutual relationships in the data (Tim Daciuk and Stephan Jou, 2011; Fayyad et al., 1996). In other words, Data Mining is a complex process that involves a significant amount of computing resources in observing data and extracting hidden but useful and meaningful patterns from it using different methodologies and statistical algorithms. The result of the analysis can then be used to make decisions for current situations, or to predict or forecast future situations for the given analytical issue (Dursun Delen and Haluk Demirkan, 2013).

Although much of the process can be done using sophisticated automated Data Mining tools, it requires a human with the domain knowledge to utilize these tools and make accurate decisions and predictions. The basic stages of the Data Mining process, which are available in most Data Mining tools, are:

exploring the data, fitting models to the data, comparing the models to know which model is best for the given analytic issue, and presenting the report.

Exploring the data

This involves studying the raw data and finding out, for example, whether there are missing values for some variables, whether the provided values are misleading or not representative enough, and whether values need to be replaced or modified before doing the data analysis. This is generally regarded as the cleaning part, the data understanding and data preparation part of the process. It will also determine the type of model to be used on the data. For example, if the modelling has to be done with the provided data including the missing values, then a Decision Tree model will be the suitable model, because decision trees can best handle missing values in variables.

Fitting models to the data

Based on the goal of the analysis and the function that the model should serve, this is the stage where the specific model, containing the parameters to be determined from the data, is fitted to the data. It can be a classification model if there are predefined classes (also known as supervised classification), to predict categorical class variables. It can be a clustering model if there are no predefined classes (also known as unsupervised classification). It can also be a Regression model, which predicts a variable by mapping it to a real value. At this stage it can also be decided what additional algorithms or rules to use on the data; for example, Association rules can be used to find items in the data that frequently occur together (Fayyad et al., 1996).

Comparing the models and representing the outcomes

Depending on the Data Mining tool used, the models can be compared to see which one best answers the analytic problem. This can be done by comparing the error outcome of each model and then choosing a representation model that best suits the situation and can best display the outcome in a human-readable form. Models for representation include the Decision Tree model, linear models like the Regression model, and non-linear models like the Neural Network model. Other models include the Nearest Neighbour and the Bayesian Network models (Fayyad et al., 1996). The process model used during the entire project can be a major determinant of the output type, quality and deployment method.

Process models

All data mining projects basically follow a sequence of processes, mostly predefined. Some analysts use their own process, some use their specific company's process and others choose from a variety of available processes like SEMMA, the KDD process and CRISP-DM. The most commonly used standard is the Cross Industry Standard Process for Data Mining (CRISP-DM), according to Piatetsky-Shapiro's 2007 survey, as shown in the figure below taken from his website.

[Figure 2: Poll results on data mining methodologies in use (Piatetsky-Shapiro, KDnuggets, 2007)]
Cross Industry Standard Process for Data Mining (CRISP-DM) – the origin

CRISP-DM is a reference model for Data Mining projects first proposed in 1996 by a group of companies who were using Data Mining technology. The first company was the then Daimler-Benz, an automobile giant, later known as DaimlerChrysler AG, in Germany. The second company was SPSS Incorporated, a computer software company in the United States of America (USA); it was acquired by IBM in 2009 and is now called IBM SPSS. The third company was the NCR Corporation, a computer hardware and electronics company based in the USA. The fourth company was Teradata Corporation, a computer company also based in the USA. The fifth company was the OHRA Insurance Company, based in the Netherlands. These companies came up with the idea of finding a model that would be a standard for all Data Mining related processes. They invented the acronym CRISP-DM, obtained European Commission funding for the project and finally produced the end result, which is the documentation of the standard procedure that serves as a reference model for Data Mining processes (SPSS; Chapman et al., 2000).

 

What is CRISP-DM

CRISP-DM is a reference model that proposes a sequence of processes to be followed in a typical Data Mining project. As a reference model, not all the proposed processes might be needed, depending on the nature of a particular project, but CRISP-DM is the recognized process model standardized for use in all data mining projects (Marbán et al., 2009).

As shown in figure 3 below, it comprises 6 phases covering all the processes involved in the Data Mining project lifecycle, starting from the Business Understanding stage and continuing sequentially to the final Deployment stage. It is meant to serve as a guide for professionals in the Data Mining profession when doing any data mining project.

[Figure 3: The six phases of the CRISP-DM process model]
CRISP-DM processes

1.        Business understanding

The Business Understanding phase of CRISP-DM is the stage where the analyst tries to understand the objectives of the project and what the business goals are. This includes gathering as much information as possible regarding the proposed project, looking at the current solutions, if any, and describing the problem area and how it can be resolved using data mining. Assess the situation to know the available data that can be used for the analysis, taking into consideration the time and financial factors. Try to figure out any risk that might be involved, make contingency plans and take note of the resources available for the project. During this phase state the data mining problem to be resolved, such as clustering, classification or prediction, providing actual numbers in the initial sketch where possible. At the end of this phase create a data mining plan that shows all the phases of the project from start to finish, the estimated time for each phase, and the available resources and risks for each phase (IBM SPSS Modeler CRISP-DM).

2.        Data understanding

The Data Understanding phase involves collecting the required data from available sources and studying and exploring the data. Using available methods, describe and summarize the data so as to have a clear overview of it and be able to figure out some details about its quality. The data summary should show the attributes of the variables, the variable types and the values, displayed using tables and/or graphs and charts. Do a quality check to discover the variables with missing values or wrong inputs, and conclude the phase with a data quality report (IBM SPSS Modeler CRISP-DM).

3.        Data preparation

Based on the reports from the previous phases, the data can be prepared for the intended analysis type by cleaning and formatting it if necessary. This phase involves selecting the required data from the explored data, adding additional data if needed, replacing the missing values, adding new attributes and splitting the data into test, validation and training data sets as needed.

A point to note here is that although the data mining tools provide the option of partitioning the data into 3 different sections (test, validation and training), many analysts think the test partition is a waste of resources and prefer to partition the data into just validation and training so as to have enough data for the analysis. The consequence of this choice, though, comes later in the process, when comparing the models to find the one that best resolves the problem. If more data is assigned to the training partition it will result in a better and more stable prediction, but the assessment will be less stable in the model assessment stage. On the other hand, if less data is assigned to the training partition it will result in a less stable prediction but a more stable model assessment. This is an area that will require some future research in data mining tools. A sketch of the partitioning is shown below.
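A minimal sketch of the 3-way partitioning described above; the 50/30/20 split ratios, the fixed shuffle seed and the string records are assumptions for illustration, and real data mining tools let these ratios be tuned:
---------------------------------------------------------------------------------------------------------------
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

public class PartitionData {
    public static void main(String[] args) {
        // Stand-in records; in practice these would be the prepared data rows
        List<String> records = new ArrayList<>();
        for (int i = 0; i < 100; i++) records.add("record-" + i);

        Collections.shuffle(records, new Random(42)); // fixed seed for repeatability

        int trainEnd = (int) (records.size() * 0.5);            // 50% training
        int validEnd = trainEnd + (int) (records.size() * 0.3); // next 30% validation

        List<String> training = records.subList(0, trainEnd);
        List<String> validation = records.subList(trainEnd, validEnd);
        List<String> test = records.subList(validEnd, records.size()); // remaining 20%

        System.out.printf("training=%d validation=%d test=%d%n",
                training.size(), validation.size(), test.size());
    }
}
---------------------------------------------------------------------------------------------------------------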

4.        Modelling

The Modelling phase involves fitting models to the data by using the required model and setting the parameters as needed. The modelling stage can be done several times, testing and trying out different modelling techniques with different parameters. The tests can be repeated, tuning the parameters slightly as the case may be; monitor the results and draw some initial conclusions about the models. Assess the results of the various tests, compare the models using available techniques and find the result that best meets the requirements of the analytic problem.

5.        Evaluation

The Evaluation phase involves assessing the modelling results to ensure that they align with the predefined business and data mining goals. Document the findings and state the conclusions and any assumptions or new issues raised by the findings. Make sure the findings are presented in human-understandable form. Summarize the activities and the decisions for all the phases and, if necessary, go back and review the phases, making adjustments where needed.

6.        Deployment

This is the last phase of the project, where the final results of the entire process are deployed into the production environment by way of an improvement, an informed decision or a prediction, depending on the original goal of the analysis. If necessary create a deployment plan for this stage, note all the requirements and state any further actions that need to be taken or monitored, or any future requirements. Write a final report summarizing the findings of the project and any recommendations, taking into consideration the intended audience and structuring the report to meet the level of the audience. Present the project report if required.

Conclusion

This paper touched on data analytics in general as an area of data science that is multi-disciplinary, involving several methodologies from statistics, machine learning, visualization, artificial intelligence and data mining. Data analytics can be subdivided into 3 different types: descriptive, prescriptive and predictive analytics. Statistics is one of the methodologies in data analytics. It comprises Descriptive Statistics, which describes data by providing a summary in the form of the central tendency measures (mean, median and mode), and Inferential Statistics, which uses different statistical tests to resolve problems and predict future outcomes. There is no cross-industry standard for standalone projects in statistics like the CRISP-DM. Data mining is a part of data analytics that uses statistical and mathematical methods to explore, analyse and model data, using the findings to make predictions and decisions. Data mining projects follow the standard process proposed in the reference model CRISP-DM, which details the different phases of a data mining project, how each should be done and what the outcome of each phase should strive to be.

Further research issues

The data partition phase of the data mining process involves some ambiguities. Further research is needed in this area to resolve the problem of data partitioning in the data preparation stage: how can all 3 partitions be used in the analysis without compromising the model assessment, or how can the 3-partition system be reduced to 2 partitions, eliminating the test partition?

Just like CRISP-DM, there could be an industry standard for standalone statistics projects too: a CRISP for Statistical Inference, aka CRISP-SI, for example, or something similar that can serve as a reference when doing statistical prediction projects.

CRISP-DM is a well-detailed process model and most people refer to it as a methodology, although it does not fulfil the criteria of a methodology in that it does not specify how things should be done. When using a different process model like SEMMA in the SAS Enterprise Miner tool, the process is easy to follow because the tool is developed to support the process. Research is needed on how to make CRISP-DM customizable in such a way that all data mining tools can provide features that support its processes.

Since there are different disciplines in the data analytics and knowledge discovery process, CRISP could be extended to other areas too, just like data mining, so that there is a CRISP for every methodology involved in the data analysis and knowledge discovery process.

 

References

Hilborn, Don Leo, Healthcare Informational Data Analytics (December 3, 2013). Available at SSRN: http://ssrn.com/abstract=2362781, [Accessed 23rd December 2013].

Wikipedia, 2013, [Last updated: 21 December 2013], Analytics. Available at: http://en.wikipedia.org/wiki/Analytics, [Accessed 23rd December  2013].

Usama Fayyad, Gregory Piatetsky-Shapiro, and Padhraic Smyth. 1996. The KDD process for extracting useful knowledge from volumes of data. Commun. ACM 39, 11 (November 1996), 27-34. DOI=10.1145/240455.240464 http://doi.acm.org/10.1145/240455.240464

Oracle, 2013. MySQL and Hadoop – Big Data Integration (source of The Data Lifecycle figure). URL: http://www.mysql.com/why-mysql/white-papers/mysql_wp_enterprise_ready.php

Óscar Marbán, Gonzalo Mariscal and Javier Segovia (2009), A Data Mining & Knowledge Discovery Process Model, In Data Mining and Knowledge Discovery in Real Life Applications. Book edited by: Julio Ponce and Adem Karahoca, ISBN 978-3-902613-53-0, pp. 438-453, February 2009, I-Tech, Vienna, Austria.

Gregory Piatetsky-Shapiro, 2007. KDnuggets Polls: Data Mining Methodology (Aug 2007). [Accessed: 24th December 2013]. URL: http://www.kdnuggets.com/polls/2007/data_mining_methodology.htm

Nadali, A., E. N. Kakhky, and H. E. Nosratabadi. "Evaluating the success level of data mining projects based on CRISP-DM methodology by a Fuzzy expert system." Electronics Computer Technology (ICECT), 2011 3rd International Conference on. Vol. 6. IEEE, 2011.

Tim Daciuk and Stephan Jou, 2011, An introduction to data mining and predictive analytics, In Proceedings of the 2011 Conference of the Center for Advanced Studies on Collaborative Research (CASCON '11), Marin Litoiu, Eleni Stroulia, and Stephen MacKay (Eds.). IBM Corp., Riverton, NJ, USA, 323-324.

Dursun Delen and Haluk Demirkan, 2013. Data, information and analytics as services, [3. Analytics-as-a-service], Decision Support Systems 55, 1 (April 2013), 359-363. DOI=10.1016/j.dss.2012.05.044 http://dx.doi.org/10.1016/j.dss.2012.05.044

Rose Business Technologies. 2012. Rose Business Technologies. [ONLINE] Available at: http://www.rosebt.com/1/post/2012/08/predictive-descriptive-prescriptive-analytics.html. [Accessed 22 December 13].

Gill Marshall, Leon Jonker, 2011. An introduction to inferential statistics: A review and practical guide. Radiography, Volume 17, Issue 1, February 2011, Pages e1-e6, ISSN 1078-8174. URL: http://dx.doi.org/10.1016/j.radi.2009.12.006

Montgomery, D.C. and Runger, G.C., 2003. Applied Statistics and Probability for Engineers. 3rd ed. United States of America: John Wiley & Sons, Inc.