One of the most renowned publishers in the EU and the USA approached us with the intention of building a new value-added service for its subscribers: predicting the outcome of bill votes in the US Congress. The service would offer an innovative, impartial and accurate approach to political analytics, as opposed to the classic “expert” approach used by other players in the publishing market.
The journey began with a deep analysis of the problem and of the data available to solve it. The main challenge was that, to achieve high accuracy, the machine learning model needed the attributes of each bill and the voting habits of senators, gathered from multiple heterogeneous sources. An added complexity was that both “bill” and “voter” are complex entities: a specific voter's decision on a specific bill depends on multiple attributes and their combinations, which called for advanced prediction approaches.
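To illustrate what “attributes and their combinations” can mean in practice, here is a minimal sketch of turning one (bill, voter) pair into a feature row that includes an interaction term. All field and feature names (sponsor_party, committee_topics, past_yea_rate and so on) are hypothetical and chosen for illustration only; they are not the attributes used in the actual project.

```python
from typing import Any, Dict


def pair_features(bill: Dict[str, Any], voter: Dict[str, Any]) -> Dict[str, float]:
    """Build one feature row for a (bill, voter) pair.

    Every field name here is an illustrative assumption.
    """
    # Attribute that depends on both entities: does the voter share the sponsor's party?
    same_party = float(bill["sponsor_party"] == voter["party"])
    # Does the bill's topic fall within the topics of the voter's committees?
    topic_interest = float(bill["topic"] in voter["committee_topics"])
    return {
        "same_party": same_party,
        "topic_interest": topic_interest,
        # Interaction term: captures the combination of the two attributes above.
        "same_party_x_topic": same_party * topic_interest,
        # Voter-only attribute: historical share of "yea" votes.
        "voter_yea_rate": float(voter["past_yea_rate"]),
    }


example_bill = {"sponsor_party": "D", "topic": "health"}
example_voter = {"party": "D", "committee_topics": {"health", "finance"}, "past_yea_rate": 0.7}
example_row = pair_features(example_bill, example_voter)
```

A row like this would be produced for every voter on every bill, so the model sees features of the pair rather than of each entity in isolation.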
The initial development team consisted of two data scientists and one DevOps engineer. First, we built an API-like connector for fetching and processing the data. A lot of work went into cleaning up the existing data, since the success of the model hinged on proper data cleansing and enrichment. Moreover, given the potential growth in the number of external sources, the data processing module needed an easily extensible architecture so that it could scale up rapidly once a new source came on board.
The next step was to gain a full understanding of the data, so we ran a set of models to obtain descriptive statistics and a thorough picture of the data we had. With that completed, we performed PCA (Principal Component Analysis) to understand the weights and how the key features affect the outcome. After all of the above, we devised a model-testing plan: starting from 10 “competitors”, we narrowed the list down to four key models and combined them into an ensemble.
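The shortlist-then-ensemble step can be sketched as follows. The text does not say which models were compared or how they were combined, so this example assumes simple majority voting and uses toy rule-based “models” on made-up data purely to show the mechanics; none of the names or numbers come from the project. For brevity it ranks and evaluates on the same toy set, whereas a real comparison would use held-out data or cross-validation.

```python
from typing import Callable, Dict, List, Sequence, Tuple

Features = Dict[str, float]
Model = Callable[[Features], int]  # returns 1 (yea) or 0 (nay)
Dataset = Sequence[Tuple[Features, int]]


def accuracy(model: Model, data: Dataset) -> float:
    """Share of examples where the model's prediction equals the label."""
    return sum(1 for x, y in data if model(x) == y) / len(data)


def shortlist(candidates: Dict[str, Model], data: Dataset, keep: int) -> List[str]:
    """Rank candidates by accuracy (best first) and keep the top `keep` names."""
    ranked = sorted(candidates, key=lambda name: accuracy(candidates[name], data), reverse=True)
    return ranked[:keep]


def majority_vote(models: Sequence[Model]) -> Model:
    """Combine models by majority vote; ties defer to the first (best-ranked) model."""
    def ensemble(x: Features) -> int:
        yeas = sum(m(x) for m in models)
        if 2 * yeas == len(models):
            return models[0](x)
        return int(2 * yeas > len(models))
    return ensemble


# Toy validation data and toy candidates (illustrative only).
toy_data: Dataset = [
    ({"party_match": 1, "cosponsors": 30}, 1),
    ({"party_match": 1, "cosponsors": 5}, 1),
    ({"party_match": 0, "cosponsors": 40}, 1),
    ({"party_match": 0, "cosponsors": 2}, 0),
    ({"party_match": 0, "cosponsors": 10}, 0),
    ({"party_match": 1, "cosponsors": 1}, 0),
]
candidates: Dict[str, Model] = {
    "party_rule": lambda x: int(x["party_match"] == 1),
    "cosponsor_rule": lambda x: int(x["cosponsors"] >= 20),
    "always_yea": lambda x: 1,
    "always_nay": lambda x: 0,
}

top_names = shortlist(candidates, toy_data, keep=3)
ensemble = majority_vote([candidates[name] for name in top_names])
ensemble_accuracy = accuracy(ensemble, toy_data)
```

In a real pipeline the candidates would be trained classifiers rather than hand-written rules, but the selection and voting logic would have the same shape.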
Long story short: the model achieved an accuracy of 84%, proving the business case. This successful proof of concept kicked off a series of new initiatives and value-added services that build on the data already existing in the organization and use the power of data science and machine learning to gain unique insights from it.