Having worn multiple hats dealing with data on a near-daily basis throughout my professional career – data modeling, schema/representation development, redesign for indexing and query performance, information extraction, feature extraction for machine learning, etc. – I have been following the evolution of data modeling and analytics from plain "data" to "big data" over the past two decades. Curious about the recent crescendo, I just finished reading two books addressed to laypeople, business practitioners/CXOs, investors, TED attendees, and folks looking to ride the next buzz (now that Twitter's IPO is done!), and, just for sanity, an old paper which I had read quite a while ago.
These books are:
a) Nate Silver's The Signal and the Noise: Why So Many Predictions Fail – but Some Don't. (For those interested, a quick review from an eminent statistician)
b) Big Data: A Revolution That Will Transform How We Live, Work, and Think by Viktor Mayer-Schönberger and Kenneth Cukier
and the paper (still relevant after a decade) by Prof. Leo Breiman.
These books raised a number of related thoughts that I would like to share and follow up on in greater detail in future posts. Before getting to those, here is a summary of the above material, framing some of the issues in the "Big Data" movement in terms of my own paradigm:
Big data is "old wine" in a new bottle. So what has changed? The belief is that because we are producing and gathering data faster, we can drive the cycle of empiricism (aka the scientific method) faster. The idea is to shrink the time needed to a) formulate a theory, b) collect data, c) develop a model or models, d) predict/infer based on a model and test it, and finally e) revise the beliefs in a), leading to the next cascade of activities.

Why do we want to do this? Mayer-Schönberger et al. posit that we want answers and benefits now; that since there is more data, we should do better (but at what?); and that we do not really need answers to the question "Why?" but only to "What is?" (in which case, what does it mean to predict something when I can simply look it up in a table?). To build such a table (along multiple dimensions), I need to digitize, quantify, and represent the world – aka "datafication". Given data about different aspects of the world, we can combine, extend, reuse, and repurpose it to support theories other than the ones the data was originally collected for. However, it is user beware: when this is done, the models or inferences have not been tested under all possible usage contexts. Furthermore, there is a range of new risks – personal privacy violations (in the context of consumer data) and others (organizational intelligence, cyber-espionage) – that grow with how many touch points there are to the data, in how many contexts it is utilized, and who controls the data and its uses.

Another key viewpoint raised is that we do not need "general theories" that apply to a large number of data points; a collection of "small theories", each addressing a small, contextually defined sample, is good enough. With the availability of computing technology, we can work with a large number of such small-sample theories to solve "small problems" that are large in number (hence Big Data!). When you really think about it, in the best (or is it worst?) case, each real-world individual is a small sample – the grail of "personalization".
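To make the "look it up in a table" point concrete, here is a minimal sketch contrasting the two answers. Everything in it is a made-up assumption for illustration – the observations, the features, and the linear model – not anything taken from either book:

```python
import numpy as np
from collections import defaultdict

# Hypothetical "datafied" observations: (user_age, hours_on_site) -> purchases.
# The numbers are invented purely for illustration.
observations = [
    (22, 1.0, 0), (35, 3.5, 2), (41, 5.0, 3),
    (29, 2.0, 1), (50, 6.5, 4), (35, 3.5, 2),
]

# "What is": a lookup table over the exact combinations we have already seen.
lookup = defaultdict(list)
for age, hours, purchases in observations:
    lookup[(age, hours)].append(purchases)

def answer_what(age, hours):
    """Answer only for combinations already in the table; no theory required."""
    seen = lookup.get((age, hours))
    return float(np.mean(seen)) if seen else None  # silent on anything unseen

# A model: a least-squares fit that ventures beyond the table,
# at the price of committing to a (possibly wrong) functional form.
X = np.array([[1.0, age, hours] for age, hours, _ in observations])
y = np.array([p for _, _, p in observations])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)

def answer_model(age, hours):
    return float(np.array([1.0, age, hours]) @ coef)

print(answer_what(35, 3.5))   # seen before -> the table answers directly
print(answer_what(33, 4.0))   # unseen -> the table is silent
print(answer_model(33, 4.0))  # the model still offers a prediction
```

The table answers "what is" only where the world has been datafied; the model generalizes past the table, but only under assumptions that may not hold in a new usage context.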
Silver in his book focuses on the process of using data – promoting a Bayesian worldview of updating beliefs to distinguish the "signal" from the "noise". However, his discussions are framed in a frequentist perspective. To perform updates on beliefs, or even to discriminate signal from noise a priori, one needs a model – the baseline. He discusses the limitations of models in different domains; see the table below. The rows indicate whether the primary domain models are deterministic (Newton's laws, the rules of chess) or inherently probabilistic. The columns indicate the completeness of our current knowledge – do we have all the models for the phenomena in those domains?
| | Complete Domain Knowledge | Incomplete Domain Knowledge | Comments |
|---|---|---|---|
| Deterministic | Analog: Weather; Digital: Chess | Analog: Earthquakes; Digital: World War II events, terrorism events | The term "Digital" is used to mean "quantized"; "Analog" refers to the physical world |
| Probabilistic/Statistical | Digital: Finance, Baseball, Poker; Analog: Politics | Digital: Economy, Basketball; Analog: Epidemics, Global Warming | |
Improving the models or making a prediction requires analysis of data in the context of the model. The book highlights the notion of "prediction" – saying something about something in the future. However, it sheds little light on "inference" – saying something new about something at present, thus completing one's incomplete knowledge. The sources of uncertainty also differ across these models: a deterministic game like chess has uncertainty introduced by player behavior and the size of the search tree, whereas earthquake prediction is uncertain because we just do not know enough (our model is incomplete). The book makes no statements on the role of "Big Data" per se in terms of tools (all the analysis in the book can be done with spreadsheets or in R). Furthermore, the book highlights the different types of analyst "bias" that may be introduced into inferences and predictions.
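As a concrete illustration of the Bayesian updating Silver advocates, here is a toy two-hypothesis sketch. The prior, the likelihoods, and the stream of observations are all assumptions invented for this example, not numbers from the book:

```python
# A minimal sketch of Bayesian belief updating, assuming a toy setup:
# H1 = "a real signal is present", H0 = "it is just noise".
# All probabilities below are made-up illustrative values.

prior_signal = 0.05          # initial belief that a signal is present
p_alert_given_signal = 0.9   # chance of seeing an "alert" if there is a signal
p_alert_given_noise = 0.2    # chance of a false alert from noise alone

def update(prior, observed_alert):
    """One Bayes-rule update of P(signal) after observing an alert (or its absence)."""
    if observed_alert:
        like_signal, like_noise = p_alert_given_signal, p_alert_given_noise
    else:
        like_signal, like_noise = 1 - p_alert_given_signal, 1 - p_alert_given_noise
    evidence = like_signal * prior + like_noise * (1 - prior)
    return like_signal * prior / evidence

belief = prior_signal
for alert in [True, True, False, True]:   # a made-up stream of observations
    belief = update(belief, alert)
    print(f"observed alert={alert}, P(signal) -> {belief:.3f}")
```

The likelihoods here are exactly the "baseline" referred to above: without a model of how signal and noise each generate observations, there is nothing to update.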
In contrast to the books, the paper makes a case for a change in the approach to professional model building by statisticians: instead of picking the model and then fitting the data, let the data pick the model for you. In practice, one would search through a space of potential models (via a generate-and-test approach) and finally pick the appropriate one based on some criterion. Considering the premise of this paper, one can see a potential use case for big data in the model-building lifecycle, reconciling the different discussions in the books.
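A hedged sketch of what such a generate-and-test loop might look like in practice, assuming scikit-learn and a synthetic dataset; the candidate models and the cross-validated error criterion are placeholders of my own choosing, not anything prescribed by Breiman's paper:

```python
# "Let the data pick the model": generate candidate models, test each with
# cross-validation, and keep the one that scores best on held-out data.
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import cross_val_score

# Synthetic data standing in for whatever the problem actually provides.
X, y = make_regression(n_samples=500, n_features=10, noise=10.0, random_state=0)

candidates = {
    "linear": LinearRegression(),
    "ridge": Ridge(alpha=1.0),
    "random_forest": RandomForestRegressor(n_estimators=100, random_state=0),
}

scores = {}
for name, model in candidates.items():
    # 5-fold cross-validated mean squared error (negated by sklearn's convention)
    mse = -cross_val_score(model, X, y, cv=5,
                           scoring="neg_mean_squared_error").mean()
    scores[name] = mse

for name, mse in sorted(scores.items(), key=lambda kv: kv[1]):
    print(f"{name}: CV MSE = {mse:.1f}")
print("data-picked model:", min(scores, key=scores.get))
```

This is a caricature of the paper's argument, of course; the point is only that the choice is made by a criterion applied to held-out data rather than by committing to a model form up front.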
Will have more on this topic in future posts.
Addendum: See the latest article on this topic in CACM.