A new report from IBM Cloud (2017) found that 90 per cent of the data in today’s world has been created in the last two years, with 2.5 quintillion bytes produced every day. With this growth likely to accelerate as new technologies emerge, data production in 2020 will be around 50 times greater than it was in 2009. Never has there been more data volume and variety at our fingertips.
The volume of data needed and ingested by quantitative research firms isn’t as large as that of technology powerhouses such as Google and Amazon. Yet data variety, ranging from social media to geographical and weather data sets, is key for quantitative research firms to gain insights. At G-Research, we use rigorous scientific methodology, robust statistical analysis and pattern recognition to extract deep insights from a varied data ecosystem. Julien Lavigne Du Cadet, Data Processing Development Manager at G-Research, pinpoints some of the challenges that come with working with such an extensive variety of data sets.
Most technology companies build their models on recent data, which for them is vital: a recommender system, for instance, might only look at the last year of browsing history because it must optimise for recent behaviour. Quant research is the opposite. G-Research holds, uses and analyses more than 20 years of relevant data every day. Our researchers build models that must accommodate different market conditions and not be overfitted to recent data. This creates complexity, because data sets tend to evolve significantly over time.
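One common way to guard against overfitting to recent data is walk-forward validation: a model is repeatedly trained on an expanding window of history and tested on the period that follows, so it must prove itself across many different market regimes. The sketch below is purely illustrative and is not G-Research’s actual methodology; the fold count and window scheme are assumptions.

```python
def walk_forward_splits(dates, n_folds):
    """Split a chronologically sorted sequence into expanding train
    windows, each paired with the next out-of-sample block."""
    fold_size = len(dates) // (n_folds + 1)
    splits = []
    for k in range(1, n_folds + 1):
        train = dates[: k * fold_size]             # everything seen so far
        test = dates[k * fold_size : (k + 1) * fold_size]  # next unseen block
        splits.append((train, test))
    return splits

# Twenty "years" of annual observations, evaluated over five folds
years = list(range(2000, 2020))
for train, test in walk_forward_splits(years, 5):
    print(f"train {train[0]}-{train[-1]}  test {test[0]}-{test[-1]}")
```

A model that performs well on every fold, including those spanning very different market conditions, is less likely to be a creature of the most recent regime.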
Quant research also needs to take latency into account. When models are deployed in production, they may have only a fraction of a second to analyse a new piece of information, whereas technology companies generally have tens or hundreds of milliseconds to respond to requests. It is an important constraint for quant research, but also an exciting challenge.
G-Research receives data from many sources and providers; a single data set can contain hundreds of different measures and dimensions. This raises challenges, including schema management (is this field a number, a date or a string?), versioning (when does the schema change?), validation and data cleaning.
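To make the schema-management and validation problem concrete, here is a minimal sketch of checking incoming records against a declared schema. The field names and types are invented for illustration; real pipelines would use far richer schema tooling.

```python
from datetime import date

# Hypothetical schema for one incoming data set: field name -> expected type
SCHEMA_V1 = {"symbol": str, "close": float, "as_of": date}

def validate(record, schema):
    """Return a list of problems; an empty list means the record is clean."""
    problems = []
    for field, expected in schema.items():
        if field not in record:
            problems.append(f"missing field: {field}")
        elif not isinstance(record[field], expected):
            problems.append(
                f"{field}: expected {expected.__name__}, "
                f"got {type(record[field]).__name__}"
            )
    return problems

good = {"symbol": "ABC", "close": 101.5, "as_of": date(2017, 6, 1)}
bad = {"symbol": "ABC", "close": "101.5"}  # wrong type, and the date is missing
assert validate(good, SCHEMA_V1) == []
assert len(validate(bad, SCHEMA_V1)) == 2
```

Versioning then becomes a matter of keeping `SCHEMA_V1`, `SCHEMA_V2` and so on, together with the date ranges over which each applied.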
Modelling the data we use is also difficult. What is the right amount of normalisation? Do you want quant researchers to know about a schema change, or should it be abstracted from them? How do you optimally handle data that is very different in nature, for example factors, graphs or images? We must also find innovative technologies and tools to store, compress and index this data. The result is a lot of work building resilient pipelines to process different data types efficiently.
Another challenge is ‘symbology’. Each data provider has different identifiers for the same entity, for example a stock or a country. The mapping is rarely trivial, especially historically, where the data is at best inaccurate. If a stock changes identifier mid-month, for instance, some data providers may still send data with the old identifier for weeks. You often have to do some fuzzy matching.
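As a toy illustration of that fuzzy matching, the sketch below scores a provider’s security name against a master list using Python’s standard-library `difflib.SequenceMatcher` and only accepts matches above a similarity threshold. The names and the 0.8 threshold are assumptions; production symbology mapping is considerably more involved.

```python
from difflib import SequenceMatcher

def best_match(name, candidates, threshold=0.8):
    """Map a provider's security name onto a master list by string
    similarity, returning None when nothing clears the threshold."""
    scored = [(SequenceMatcher(None, name.lower(), c.lower()).ratio(), c)
              for c in candidates]
    score, match = max(scored)   # highest-similarity candidate
    return match if score >= threshold else None

master = ["Acme Holdings PLC", "Beta Industries Inc", "Gamma Corp"]
print(best_match("ACME HOLDINGS", master))   # similar enough to match
print(best_match("Unknown Name", master))    # no confident match -> None
```

The threshold matters: set it too low and entities are conflated; too high and genuine renames or truncated identifiers fall through for manual review.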
One of the hardest tasks is organising and optimising large-scale data storage. We need to keep many years of historical data for trend prediction and other complex analytics. This poses challenges for the reliability of storage systems, for data set discovery (how do you describe all the data you have so that it is readily available to your consumers?) and for ongoing management.
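The discovery problem is essentially one of metadata: every data set needs a description and searchable tags so consumers can find what exists. Here is a deliberately simple in-memory sketch of that idea; the data set names and tags are invented for illustration.

```python
# A toy catalogue: each data set is registered with tags and a description
# so that consumers can discover what is available.
catalogue = {}

def register(name, tags, description):
    catalogue[name] = {"tags": set(tags), "description": description}

def discover(*wanted):
    """Find all data sets carrying every requested tag."""
    return sorted(name for name, meta in catalogue.items()
                  if set(wanted) <= meta["tags"])

register("eod_prices", ["equities", "daily"], "End-of-day closing prices")
register("weather_obs", ["weather", "hourly"], "Station weather readings")
register("corp_actions", ["equities", "events"], "Splits and dividends")

print(discover("equities"))  # ['corp_actions', 'eod_prices']
```

A real catalogue would also record schema versions, date coverage and lineage, but the principle is the same: if consumers cannot find a data set, it may as well not exist.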
Processing greater volumes and variety of data is critical for future success. Ultimately, G-Research’s goal is to know more and to predict more than the competition. In very simple terms, if you are buying a house, you may look at prices in the area and make an offer based on that. But what if you’ve missed the detail that a waste plant will be erected next door in two years’ time? Would you still make the offer if you knew? For G-Research, the more we know, the more we can combine insights from different sources to forecast markets and make predictions. The amount of data available to the world is exploding, so the question really is: ‘how do we utilise all this data in the best way?’