Published on January 12, 2011

Author: Philipp K. Janert

Publisher: O'Reilly
ISBN: 978-0-596-80235-6

Analyzing data has become more sophisticated and involved as data sources multiply rapidly and organizations face growing demands to make sense of the data they collect. Data Analysis with Open Source Tools aims to highlight the major techniques for analyzing data using tools that are essentially free.

In the first part, emphasis is given to graphical methods for analyzing the data’s properties using jitter plots, histograms, kernel density estimates, and cumulative distribution functions. The best parts of this section are the author’s advice on problems commonly faced and on issues that introduce bias when analyzing data graphically. NumPy modules are used to illustrate the open source implementation.
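To give a flavor of these graphical methods, here is a minimal sketch of a Gaussian kernel density estimate and an empirical cumulative distribution function in NumPy. It is not code from the book: the function names, the sample data, and the bandwidth of 0.5 are assumptions chosen for illustration.

```python
import numpy as np

def gaussian_kde(samples, grid, bandwidth):
    """Evaluate a Gaussian kernel density estimate at the points in `grid`."""
    samples = np.asarray(samples, dtype=float)
    grid = np.asarray(grid, dtype=float)
    # One row per grid point, one column per sample.
    z = (grid[:, None] - samples[None, :]) / bandwidth
    kernels = np.exp(-0.5 * z**2) / np.sqrt(2.0 * np.pi)
    return kernels.sum(axis=1) / (samples.size * bandwidth)

def empirical_cdf(samples):
    """Return the sorted values and the fraction of samples <= each value."""
    x = np.sort(np.asarray(samples, dtype=float))
    y = np.arange(1, x.size + 1) / x.size
    return x, y

# Made-up sample data (an assumption, not taken from the book).
rng = np.random.default_rng(0)
data = rng.normal(loc=5.0, scale=2.0, size=500)

# Extend the grid well past the data so the density tails are captured.
grid = np.linspace(data.min() - 3.0, data.max() + 3.0, 400)
density = gaussian_kde(data, grid, bandwidth=0.5)
xs, cdf = empirical_cdf(data)

# Sanity check: the estimated density should integrate to roughly 1.
area = float(np.sum(density) * (grid[1] - grid[0]))
```

The bandwidth plays the same role as the bin width in a histogram: too small and the curve is noisy, too large and real features are smoothed away, which is one of the sources of bias the review mentions.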

With respect to multiple variables, there is a similar exposition of estimating relationships using plots. Again, most readers should benefit from the discussion of issues in smoothing, the advice on communicating information and results, and the principles for using different types of multi-variable charts.

In handling time series data, the book covers basic trend, seasonality, and noise, along with the autocorrelation function, followed by practical advice on detrending and implementation issues. The SciPy signal package is demonstrated in the context of filtering and convolutions.
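As a rough illustration of detrending followed by an autocorrelation function, here is a short NumPy sketch. It is not the book's code: the synthetic series, the straight-line detrending, and the function name `autocorrelation` are assumptions made for this example.

```python
import numpy as np

def autocorrelation(series, max_lag):
    """Sample autocorrelation for lags 0..max_lag, normalized so lag 0 is 1."""
    x = np.asarray(series, dtype=float)
    x = x - x.mean()  # remove the mean before correlating
    denom = np.dot(x, x)
    return np.array([np.dot(x[:x.size - k], x[k:]) / denom
                     for k in range(max_lag + 1)])

# Made-up series: linear trend + period-12 seasonality + noise.
rng = np.random.default_rng(1)
t = np.arange(240)
series = (0.05 * t
          + 2.0 * np.sin(2.0 * np.pi * t / 12.0)
          + rng.normal(0.0, 0.5, t.size))

# Detrend by subtracting a least-squares straight line.
slope, intercept = np.polyfit(t, series, 1)
detrended = series - (slope * t + intercept)

acf = autocorrelation(detrended, max_lag=24)
```

With the trend removed, the seasonal component should show up as a strong positive value near lag 12 and a strong negative value near lag 6. Without detrending, the slowly decaying correlation caused by the trend tends to mask that pattern.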

Although there are numerous techniques for analyzing data, much of data analysis is also an art form. Here, Philipp Janert does his best to demystify the sequence of practical steps an analyst takes, using a step-by-step example of analyzing and fitting a basic model with an open source tool (gnuplot).

The section on modeling data covers guesstimation principles. Along with several examples of guesstimation, there is also a discussion of perturbation and error propagation in guesstimation.

While discussing scaling arguments, the author emphasizes the key principle that effective modeling depends on “what to leave out.” There is a well-rounded treatment including sensitivity analysis, dimensional analysis, and the use of principles such as symmetry, conservation, and extreme values.

In the section on probability modeling, apart from covering concepts such as the binomial and normal distributions, power laws, and non-normal statistics, there are pointers to literature and papers as well as a backgrounder on classical and Bayesian statistics.

In the section on computation, there is an emphasis on forecasting methods. Several simulation methods, including Monte Carlo and discrete event simulation using SimPy, are highlighted with some background on queueing theory. The use of the PyCluster and CCluster libraries is highlighted for clustering and unsupervised learning techniques. Finally, an example in R is used to find principal components.
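For readers unfamiliar with how simulation and queueing theory connect, the following sketch estimates the mean waiting time in a single-server queue by Monte Carlo and compares it with the textbook result. It uses only the Python standard library rather than SimPy, and it is not taken from the book: the arrival and service rates, the function name, and the customer count are assumptions for illustration.

```python
import random

def simulate_mean_wait(arrival_rate, service_rate, n_customers, seed=0):
    """Estimate the mean time spent waiting in line in a single-server queue
    with exponential interarrival and service times (an M/M/1 queue)."""
    rng = random.Random(seed)
    wait = 0.0   # waiting time of the current customer; the first one never waits
    total = 0.0
    for _ in range(n_customers):
        total += wait
        service = rng.expovariate(service_rate)  # this customer's service time
        gap = rng.expovariate(arrival_rate)      # time until the next arrival
        # Lindley recursion: the next customer waits for whatever work remains.
        wait = max(0.0, wait + service - gap)
    return total / n_customers

# Hypothetical rates: one arrival every 2 time units, one service per time unit.
arrival_rate, service_rate = 0.5, 1.0
estimate = simulate_mean_wait(arrival_rate, service_rate, n_customers=200_000)

# Textbook M/M/1 mean wait in queue: lambda / (mu * (mu - lambda)).
theory = arrival_rate / (service_rate * (service_rate - arrival_rate))
```

A discrete event framework such as SimPy becomes worthwhile once the system has several servers, priorities, or other interacting resources, where a one-line recursion like this no longer applies.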

Predictive analytics, covering prescriptive models, classification, and clustering, is described starting from the terminology, basic algorithms, and process, and extending to common errors and problems.

The final part deals with applications of data in a business environment, such as business intelligence and dashboards. The book goes beyond explaining basic concepts of data warehousing to cover reasons for failures, implementation dos and don'ts, good practices and recommendations for corporate dashboards, and issues in data quality and data consistency.

There are several useful appendices, which deal with selecting software tools and a catalog of scientific software, results from calculus, and working with data in practice.

Data Analysis with Open Source Tools is a straightforward, well-explained, and practical book that does not just give a laundry list of techniques but also offers useful pointers on how to use them intelligently. Several practical workshops demonstrate implementation using the open source tools. The book fills a strong need for a practical data analysis book that appeals to a broad range of contexts and uses open source tools, and it is strongly recommended.
