Thursday, March 6, 2014

Time for WRDS 2.0?

In the beginning...

Managing financial data was very painful. Using CRSP required either using a clunky program to extract data or compiling some FORTRAN when more control was needed. Using TAQ meant spending a day rotating CDs through a reader (and also either using a clunky GUI or writing your own code to read a binary format). Wharton Research Data Services (WRDS) dramatically simplified the process of accessing financial data, whether simply extracting a large set of return data or accessing quarterly report information. WRDS has grown considerably in scope and now covers a wide range of proprietary databases while also offering a warehouse for free-to-use datasets.

The good, the bad and the SAS

The WRDS infrastructure seems to be built mostly on SAS, one of the grand-daddies of statistical software.  SAS was one of the first statistical packages I used as an undergraduate (along with Shazam, which I didn't realize still exists, and which possibly has the best domain name).  Back in those dark days it took 30 minutes to run a cross-sectional regression with 800,000 observations and a dozen or so variables on the shared Sun server.  Of course, the 800,000 observations had first been read off of a 10.5 inch tape.  But this environment was revolutionary since it could run the regression at all, and so was very valuable.

A short 20 years after I ran my first regressions, and I have no use for SAS – well, I would have no use for SAS were it not the only practical method to make non-trivial queries on WRDS.  I am sympathetic to the idea that SAS provides a simple abstraction for a wide range of data, from the (now) tiny monthly CRSP dataset to the large TAQ dataset.  However, this is a decidedly dated approach, especially for larger datasets.  I know a wide range of practitioners who work with high-frequency data, and I am not aware of any who use SAS as an important component of their financial modeling toolkit.  Commercial products such as kdb exploit the structure of the data to be faster and to require less storage for the same dataset.  A former computer science colleague recently introduced me to an alternative: HDF, a widely used, open data storage format that can also achieve fantastic compression while providing direct access from MATLAB (or R, Python, C, Java, C#, inter alia).  It has been so successful at managing TAQ-type data that the entire TAQ database (1993-2013) can be stored on a $200 desktop hard drive.
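As a minimal sketch of what this looks like in practice, the following Python snippet stores tick-style data in HDF5 with compression using h5py. The dataset layout, field names and sizes here are hypothetical, not the actual TAQ schema.

    import numpy as np
    import h5py

    # A minimal sketch: a compound dataset of trades with (hypothetical)
    # epoch-millisecond timestamps, prices and sizes.
    n = 1000000
    trades = np.zeros(n, dtype=[("time", "i8"), ("price", "f8"), ("size", "i4")])

    with h5py.File("taq_sample.h5", "w") as f:
        # gzip with the shuffle filter typically compresses tick data well
        f.create_dataset("trades", data=trades, compression="gzip",
                         compression_opts=9, shuffle=True, chunks=True)

    with h5py.File("taq_sample.h5", "r") as f:
        prices = f["trades"]["price"]  # read one field directly, no full parse

The same file is readable, unchanged, from MATLAB, R, C, Java and the other languages with HDF5 bindings, which is precisely the portability SAS data files lack.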

The deep issue is that the SAS data file format is not readily usable in many software packages. This creates an unnecessary cycle of using SAS to export data (possibly aggregated at some level) to a portable format, and then re-importing it into the native format of a more appropriate statistical package designed for rapid iteration between model and results.
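The re-import half of this cycle might look like the following Python sketch; the file and column names are hypothetical, and the data are assumed to have been exported from SAS (e.g. with PROC EXPORT) to delimited text first.

    import pandas as pd

    # Hypothetical file and column names; assumes the data were first
    # exported from SAS to a delimited text file.
    crsp = pd.read_csv("msf_extract.csv", parse_dates=["DATE"])

    # Store in a binary format the analysis package reads natively, so the
    # slow text parse happens only once (writing HDF5 requires PyTables).
    crsp.to_hdf("msf_extract.h5", key="msf", mode="w",
                complevel=9, complib="zlib")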

WRDS Cloud

In 2012 WRDS introduced the Cloud, which provided a much needed speed boost to the now aging main server. The Cloud operates as a batch processor where jobs – mostly SAS programs – are submitted and run in an orderly, balanced fashion. This is far superior, both in terms of fairness and in terms of long-run potential for growth, since it follows the scale-out model: as the use of WRDS increases, or as new, large datasets are introduced, new capacity can be brought on-line to meet demand. The limitation of the Cloud is that it is mostly still running SAS jobs, just on faster hardware, and so the deep issues about access remain. The WRDS Cloud does also support R for a small minority of datasets which have been exported to text. Text is not a good format either: conversion from text to binary is slow, text files are verbose (although this can be mitigated using compression), and, if not done carefully, the conversion may not perfectly preserve fidelity, as illustrated below.
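A few lines of Python make the fidelity point concrete: a double written to text with a typical number of significant digits does not survive the round trip.

    x = 0.1 + 0.2                    # 0.30000000000000004 as a double
    as_text = "%.15g" % x            # a typical default text format
    assert float(as_text) != x       # 15 significant digits do not round-trip
    assert float("%.17g" % x) == x   # 17 digits are needed for exactness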

Expectations of a modern data provider

What changes would I really like to see in WRDS? A brief list:

  • Use of more open data formats that support a wide range of tools, especially those which are free (e.g. R or Python, but also Octave, Julia, Java or C++ if needed) or free for academic use (e.g. Ox).
  • The ability to submit queries directly using a Web API. This is how the Oxford-Man Realized DB operates with Thomson Reuters' Tick History – a C# program manages submission requests, checks completion queues and downloads data, all using a Web Service (see the sketch after this list).
  • The ability to execute small, short-running queries directly from popular software packages. MATLAB, for example, has an add-on that allows Bloomberg data to be pulled directly into MATLAB with essentially no effort and, crucially, no importing code.
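As a rough illustration of the second item, the submit/poll/download pattern might look like the following Python sketch. The endpoint, parameters and JSON fields are all invented for illustration and do not correspond to any real WRDS or Thomson Reuters API.

    import time
    import requests

    # Hypothetical Web API; nothing here is a real service.
    BASE = "https://example.com/api"

    # Submit a query describing the dataset, variables and date range
    job = requests.post(BASE + "/queries",
                        json={"dataset": "crsp.msf",
                              "vars": ["permno", "ret"],
                              "start": "2010-01-01",
                              "end": "2013-12-31"}).json()

    # Poll the completion queue until the job has finished
    while requests.get(BASE + "/queries/" + job["id"]).json()["state"] != "done":
        time.sleep(5)

    # Download the result directly for use in the analysis environment
    result = requests.get(BASE + "/queries/" + job["id"] + "/result")
    with open("msf_extract.csv", "wb") as out:
        out.write(result.content)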
