Monday, September 9, 2013

Which came first, the data or the econometrics?

The New York Times ran an article on Big Data and Economics yesterday. Big Data has become an unavoidable buzzword in the mainstream media (see the BBC documentary on Big Data), although most applications covered have been outside the traditional realm of economics and econometrics – crime prediction and prevention, disease control, or discerning the mood of a country or region.

Applying Big Data to economic problems is clearly going to require new econometric approaches, especially with respect to model building. The broadly taught method of manual model selection is unlikely to be feasible with billions of records and hundreds or thousands of variables per observation. Even common practices that fall under the heading of data cleaning won’t be possible on many of these datasets.

Financial Econometrics and Big Data

Financial economists have been using Big Data for far longer than the expression has existed. Even the CRSP database, a tiny database by modern standards, was pushing the envelope when it first became available. More recently, the TAQ database – used by financial economists to understand microstructure and measure volatility and correlation – continues to push the limits of computing. As of some point in 2012, there were more than 1,000,000,000,000 (one trillion) quote records in the TAQ database, and about 10% as many trades. TAQ contains only the completed transactions, and using the raw message flow results in two orders of magnitude more data (see LOBSTER).

Whether TAQ is Big Data in the modern parlance is not completely clear. Big Data is typically used to refer to unstructured (or at best weakly structured) data, such as the information contained in Facebook. TAQ data and exchange message feeds are highly structured (although they typically contain many errors), and so they can be organized and analyzed without invoking a reference to a stuffed elephant.

Say’s Law

Say’s Law, at least in one incarnation, states:

Supply creates its own demand.

Say’s Law is especially true for econometrics and statistics – developing statistical techniques that can’t be applied to economically interesting data is usually a poor choice for an econometrician. On the other hand, the availability of data allows economists (and econometricians) to develop techniques that can lead to new insights. Recent examples of this include the research into realized variance, model-free implied volatility and their combination to provide new insights into risks which are actually compensated.
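For readers who have not met realized variance, here is a minimal sketch of the basic estimator: the sum of squared intraday log returns over a day. The price series below is made up for illustration, and the function name `realized_variance` is my own. Real applications on TAQ-style data would also need cleaning and some care about sampling frequency and microstructure noise, none of which is shown here.

```python
import math


def realized_variance(prices):
    """Sum of squared log returns between consecutive intraday prices."""
    if len(prices) < 2:
        raise ValueError("need at least two prices to form a return")
    # Log return between each pair of consecutive observations.
    log_returns = [math.log(p1 / p0) for p0, p1 in zip(prices, prices[1:])]
    return sum(r * r for r in log_returns)


# Hypothetical intraday prices (illustrative only, not real data).
prices = [100.0, 100.2, 99.9, 100.1, 100.4]
rv = realized_variance(prices)
print(f"realized variance: {rv:.8f}")
```

The estimator needs nothing beyond a sequence of prices, which is exactly why the availability of high-frequency data made this line of research possible.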

It is not obvious that all data vendors understand that making financial data available to academics, typically on a delayed basis, is a fantastic way to get free press without undermining commercial viability. Moreover, new insights that come from analysis of the data are likely to increase the commercial value of the database.

1 comment:

  1. What's up with Oxford's realized library? I haven't checked it in a while. Still going strong? Any other similar data available on the web?