The Data Analysis Process
• Problem definition
Defining the scientific or business problem, and documenting that definition as a deliverable, is very important because it keeps the entire analysis focused on getting results. A comprehensive or exhaustive study of the system is often complex, and you do not always have enough information to start with. So the definition of the problem, and especially its planning, sets the guidelines to follow for the whole project.
• Data extraction
When you want to get data, the Web is a good place to start. But much of the data on the Web can be difficult to capture: not all of it is available in a file or database, and it is often embedded, more or less implicitly, in the content of HTML pages in many different formats. For this purpose, a methodology called web scraping has been developed, which collects data by recognizing specific occurrences of HTML tags within web pages. Software designed for this purpose searches the pages and extracts the desired data whenever an occurrence is found. Once the search is complete, you have a list of data ready to be analyzed.
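As a minimal sketch of the tag-recognition idea, the following uses only Python's standard `html.parser` module. The HTML fragment, the `city` and `temp` class names, and the `CellCollector` helper are all hypothetical, made up for illustration; a real scraper would first download the page and would need to handle far messier markup.

```python
from html.parser import HTMLParser

# Hypothetical page fragment; a real scraper would download this HTML first.
SAMPLE_HTML = """
<html><body>
  <table>
    <tr><td class="city">Rome</td><td class="temp">21</td></tr>
    <tr><td class="city">Oslo</td><td class="temp">9</td></tr>
  </table>
</body></html>
"""


class CellCollector(HTMLParser):
    """Collect the text of every <td> whose class equals target_class."""

    def __init__(self, target_class):
        super().__init__()
        self.target_class = target_class
        self.inside_target = False
        self.values = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the opening tag.
        if tag == "td" and dict(attrs).get("class") == self.target_class:
            self.inside_target = True

    def handle_endtag(self, tag):
        if tag == "td":
            self.inside_target = False

    def handle_data(self, data):
        if self.inside_target and data.strip():
            self.values.append(data.strip())


def scrape(html, target_class):
    """Return the text of all matching cells, in document order."""
    parser = CellCollector(target_class)
    parser.feed(html)
    return parser.values


cities = scrape(SAMPLE_HTML, "city")
temps = [int(t) for t in scrape(SAMPLE_HTML, "temp")]
print(list(zip(cities, temps)))
```

The result is the "list of data" the paragraph describes: one value per recognized tag occurrence, ready for the preparation step.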
• Data preparation
Data preparation is concerned with obtaining, cleaning, normalizing, and transforming data into an optimized data set, that is, a prepared format, normally tabular, suitable for the methods of analysis chosen during the design phase.
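To make cleaning and normalizing concrete, here is a small sketch in plain Python. The records, the field names, and the choice of min-max scaling are assumptions for illustration only; which cleaning rules and which normalization are appropriate depends on the analysis planned.

```python
# Hypothetical raw records, as they might come out of an extraction step.
raw_rows = [
    {"city": " Rome ", "temp": "21"},
    {"city": "Oslo", "temp": ""},       # missing value
    {"city": "cairo", "temp": "35"},
    {"city": "Lima", "temp": "n/a"},    # unparseable value
    {"city": "Paris", "temp": "14"},
]


def clean(rows):
    """Tidy the city names and drop rows whose temperature is not a number."""
    cleaned = []
    for row in rows:
        try:
            temp = float(row["temp"])
        except ValueError:
            continue  # skip rows with missing or malformed temperatures
        cleaned.append({"city": row["city"].strip().title(), "temp": temp})
    return cleaned


def min_max_normalize(rows, key):
    """Add a '<key>_norm' field rescaled to the range [0, 1]."""
    values = [row[key] for row in rows]
    low, high = min(values), max(values)
    span = (high - low) or 1.0  # avoid dividing by zero when all values are equal
    return [{**row, key + "_norm": (row[key] - low) / span} for row in rows]


# The result is a small tabular data set: one dict per row, same keys in each.
table = min_max_normalize(clean(raw_rows), "temp")
for row in table:
    print(row)
```

Dropping incomplete rows is only one possible policy; filling in missing values is another, and the right choice depends on the problem definition.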
• Data exploration
Exploring the data essentially means examining it in a graphical or statistical presentation in order to find patterns, connections, and relationships. Data visualization is the best tool to highlight possible patterns.
Generally, data analysis requires summarizing statements about the data to be studied. Summarization is a process by which data are reduced to a form that can be interpreted without sacrificing important information.
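One common form of summarization is a handful of descriptive statistics. The sketch below uses Python's standard `statistics` module on a made-up list of temperatures; the choice of which statistics to report is an assumption, not something the text prescribes.

```python
import statistics

# Hypothetical measurements to be summarized.
temps = [21.0, 35.0, 14.0, 18.0, 27.0]

# Reduce the raw values to a few numbers that are easy to interpret.
summary = {
    "count": len(temps),
    "min": min(temps),
    "max": max(temps),
    "mean": statistics.mean(temps),
    "median": statistics.median(temps),
    "stdev": statistics.stdev(temps),  # sample standard deviation
}

for name, value in summary.items():
    print(f"{name}: {value}")
```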
Clustering is a method of data analysis that is used to find groups united by common attributes (grouping).
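As an illustration of grouping by a common attribute, here is a deliberately tiny k-means sketch for one-dimensional data. K-means is only one of many clustering methods and is not named in the text; the data, the starting centroids, and the `k_means_1d` helper are hypothetical.

```python
def k_means_1d(values, centroids, iterations=10):
    """Very small k-means for one-dimensional data with fixed starting centroids."""
    centroids = list(centroids)
    groups = [[] for _ in centroids]
    for _ in range(iterations):
        # Assignment step: put each value in the group of its nearest centroid.
        groups = [[] for _ in centroids]
        for v in values:
            nearest = min(range(len(centroids)), key=lambda i: abs(v - centroids[i]))
            groups[nearest].append(v)
        # Update step: move each centroid to the mean of its group
        # (a centroid with an empty group stays where it is).
        centroids = [sum(g) / len(g) if g else c for g, c in zip(groups, centroids)]
    return centroids, groups


values = [1.0, 1.5, 2.0, 9.0, 10.0, 11.0]
centroids, groups = k_means_1d(values, centroids=[0.0, 5.0])
print(centroids)
print(groups)
```

With this data the low values and the high values end up in separate groups, each represented by its mean.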
Another important step of the analysis focuses on identifying relationships, trends, and anomalies in the data. To find this kind of information, you often need to use these tools as well as perform another round of data analysis, this time on the data visualization itself.
Other methods of data mining, such as decision trees and association rules, automatically extract important facts or rules from data. These approaches can be used in parallel with the data visualization to find information about the relationships between the data.
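To give a feel for what an association rule is, the sketch below scores the rule "bread implies milk" on a few made-up shopping baskets. It uses the standard support and confidence measures, which the text does not name, and it only evaluates one given rule; it is not a full rule-mining algorithm.

```python
# Hypothetical shopping baskets, one set of items per transaction.
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)


def confidence(antecedent, consequent, transactions):
    """Share of transactions with the antecedent that also contain the consequent."""
    both = support(antecedent | consequent, transactions)
    return both / support(antecedent, transactions)


rule_support = support({"bread", "milk"}, transactions)
rule_confidence = confidence({"bread"}, {"milk"}, transactions)
print(f"support={rule_support:.2f} confidence={rule_confidence:.2f}")
```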
• Predictive modeling
Classification models: used when the result produced by the model is categorical.
Regression models: used when the result produced by the model is numeric.
Clustering models: used when the result produced by the model is descriptive.
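The difference between a numeric and a categorical result can be sketched with two toy models in plain Python. The data, the least-squares line, and the one-nearest-neighbour classifier are illustrative assumptions; they stand in for whatever regression or classification method a real project would choose.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance
    return slope, mean_y - slope * mean_x


def nearest_neighbor(train, x):
    """1-nearest-neighbour classifier: label of the closest training point."""
    return min(train, key=lambda pair: abs(pair[0] - x))[1]


# Regression: the prediction is a number.
hours = [1.0, 2.0, 3.0, 4.0]
scores = [52.0, 54.0, 56.0, 58.0]
slope, intercept = fit_line(hours, scores)
predicted_score = slope * 5.0 + intercept

# Classification: the prediction is a category.
labelled = [(1.0, "fail"), (2.0, "fail"), (3.0, "pass"), (4.0, "pass")]
predicted_label = nearest_neighbor(labelled, 3.6)

print(predicted_score, predicted_label)
```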
• Model validation/test
Generally, the data are called the training set when you use them to build the model, and the validation set when you use them to validate it.
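A simple way to obtain the two sets is to shuffle the available rows and cut them at a chosen fraction. The sketch below is one possible approach; the 25% validation share, the fixed seed, and the `train_validation_split` helper are assumptions made for the example.

```python
import random


def train_validation_split(rows, validation_fraction=0.25, seed=0):
    """Shuffle a copy of rows and cut it into (training, validation) lists."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)  # seeded so the split is reproducible
    cut = int(len(rows) * (1 - validation_fraction))
    return rows[:cut], rows[cut:]


data = list(range(20))  # stand-in for 20 prepared observations
training_set, validation_set = train_validation_split(data)
print(len(training_set), len(validation_set))
```

The model is then built from `training_set` only, and its predictions are compared with the known values in `validation_set`.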
• Deployment of the solution
The deployment basically consists of putting into practice the results obtained from the data analysis.
Open Data
DataHub (http://datahub.io/dataset)
World Health Organization (http://www.who.int/research/en/)
Data.gov (http://data.gov)
European Union Open Data Portal (http://open-data.europa.eu/en/data/)
Amazon Web Service public datasets (http://aws.amazon.com/datasets)
Facebook Graph (http://developers.facebook.com/docs/graph-api)
Healthdata.gov (http://www.healthdata.gov)
Google Trends (http://www.google.com/trends/explore)
Google Finance (https://www.google.com/finance)
Google Books Ngrams (http://storage.googleapis.com/books/ngrams/books/datasetsv2.html)
Machine Learning Repository (http://archive.ics.uci.edu/ml/)