Books on Big Data

Our current area of primary interest is a field known as “Big Data”. Essentially, quantitative analysts are learning how to mine the wealth of data generated by modern data systems. At Hensky, we have years of experience working with the administrative data produced by large bureaucratic organizations. This is a fruitful area, but we are also expanding into others.

It is difficult to get a balanced view of this subject, as it is very much the flavour of the day. However, an excellent overview is available in Mayer-Schönberger and Cukier’s Big Data, a bestseller that gives a balanced view of the field. The fact that it is for sale in airport bookstores says something about the importance of both the book and the subject.

However, when the time comes to actually do the work, experienced analysts will need to retool. The MIT online course is a fabulous place to start, although it takes more of a computer scientist’s view of the issue. We strongly recommend that the statistical aspects of the issue be studied as well, because things really have changed. When many of us went to school and studied statistics, it was all about how to perform inference on estimates based on small samples. In many cases, the professors would admit that they did not know the statistical properties of some of the estimators in small samples. With Big Data, that has all changed. To help make the shift from the world of t-statistics to Big Data, Gareth James and his co-authors have produced An Introduction to Statistical Learning. In their world, estimators are assessed by how well models predict the future, not by how tightly they replicate the past. The book works its examples in R; if you need more of an introduction to R, Robert Kabacoff’s R in Action is a great place to start.
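To make the idea of judging a model by how well it predicts the future rather than how tightly it replicates the past a little more concrete, here is a minimal Python sketch. It is not taken from any of the books mentioned. Everything in it is an assumption made for illustration: the simulated data, the function names, and the choice of a nearest-neighbour “memorizer” as the model that replicates the past perfectly.

```python
import random


def make_data(n, rng):
    """Simulate n points from y = 2x + 1 plus Gaussian noise (a made-up process)."""
    xs = [rng.uniform(0, 10) for _ in range(n)]
    ys = [2 * x + 1 + rng.gauss(0, 1) for x in xs]
    return xs, ys


def fit_line(xs, ys):
    """Ordinary least squares with one predictor; returns a prediction function."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = my - slope * mx
    return lambda x: intercept + slope * x


def fit_memorizer(xs, ys):
    """1-nearest-neighbour: predicts the y of the closest x seen in the past."""
    pairs = list(zip(xs, ys))
    return lambda x: min(pairs, key=lambda p: abs(p[0] - x))[1]


def mse(predict, xs, ys):
    """Mean squared prediction error on the given data."""
    return sum((predict(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def compare(seed=0, n=200):
    """Fit both models on 'past' data and score them on past and 'future' data."""
    rng = random.Random(seed)
    past = make_data(n, rng)
    future = make_data(n, rng)  # fresh draws from the same process
    results = {}
    for name, fit in (("line", fit_line), ("memorizer", fit_memorizer)):
        model = fit(*past)
        results[name] = (mse(model, *past), mse(model, *future))
    return results


for name, (past_err, future_err) in compare().items():
    print(f"{name:10s} past MSE = {past_err:.2f}   future MSE = {future_err:.2f}")
```

In this sketch the memorizer reproduces the past data exactly, so its error on the past is zero, yet it would typically be expected to do worse than the simple line on the fresh data. That gap is the reason for scoring models on data they have not seen.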

At Hensky, we are very much on the Big Data bandwagon, as we are sure that this will be the next big thing. However, any bandwagon comes with a few challenges. We anticipate that coping with the lower quality of the data will be the major stumbling block, and we are even involved in writing papers for the academic market in this area. For us, the classic introduction to the field was written by a consultant in the area, Laura Sebastian-Coleman. Her book Measuring Data Quality for Ongoing Improvement: A Data Quality Assessment Framework is the foundation of a lot of our current work.