Arηs Cerebro Real-Time Engine
For more than 10 years, Arηs has been delivering classic BI solutions. In 2013 Arηs introduced BI 3.0, which marked a paradigm shift in the following areas: Mobile BI, Cloud Computing, Social Media Analytics, Self-service BI and Big Data. Our goal is to improve society.
The aim of this project is to demonstrate our capacity to build a Lambda Architecture, which combines the best of two worlds: batch processing with Hadoop and a real-time layer built on Apache Storm. Together these are used to handle massive amounts of data and to produce new insights for our customers. The project was initiated to analyze, via different social media, the sentiment of European jobseekers and the impact on their regional mobility of the various conferences and activities at the European Job Days. Given the large potential discovered during this initial phase, the project was extended worldwide to diverse actionable and searchable content.
According to a recent Gartner survey, 64% of IT companies are already investing in big data or plan to do so within the next 12 to 24 months.
Doing Big Data can mean several things, and the technologies or platforms to use depend on the type of Big Data challenge you are trying to tackle. Basically, Big Data can be broken up into 3 dimensions, each tackling a different problem, in order to create value out of the mass of information available:
- Volume -> Scaling of data
- Variety -> Different forms of data
- Velocity -> Analysis of real-time data
To demonstrate our capacity in the velocity dimension, we decided to build a tool for real-time analysis of data. Classic Big Data examples can be found all over the web, so the intention was not to build yet another map-reduce application counting hashtags in tweets.
The goal of the project is to visualize how social media feels about a particular topic, which is achieved via sentiment analysis. Twitter communicated that in mid-2014, on average, around 20,000 public tweets per second were posted, making it a good reference for the input data stream.
The main characteristics of the project are:
- Read tweets from Twitter;
- Filter tweets;
- Calculate sentiment of tweets;
- Store raw tweets & sentiment;
- Visualize sentiment on web;
- Make the system scalable;
- Make the system fault-tolerant.
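The text does not say how the sentiment of a tweet is calculated. As a purely illustrative sketch, assuming a simple word-list approach (the class name `LexiconSentiment` and both word lists are hypothetical and not taken from the project), the "calculate sentiment" step could look like this:

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical word-list scorer; the real project may use a different method.
class LexiconSentiment {
    // Tiny illustrative word lists; a usable lexicon would be far larger.
    private static final Set<String> POSITIVE =
            new HashSet<>(Arrays.asList("good", "great", "love", "happy"));
    private static final Set<String> NEGATIVE =
            new HashSet<>(Arrays.asList("bad", "awful", "hate", "sad"));

    // Returns (number of positive words) minus (number of negative words):
    // above zero reads as positive, below zero as negative, zero as neutral.
    static int score(String tweet) {
        int score = 0;
        // Lowercase, then split on anything that is not a letter a-z.
        for (String word : tweet.toLowerCase(Locale.ROOT).split("[^a-z]+")) {
            if (POSITIVE.contains(word)) {
                score++;
            } else if (NEGATIVE.contains(word)) {
                score--;
            }
        }
        return score;
    }
}
```

A scorer like this is stateless, so it can be run in many parallel instances, which fits the scalability requirement in the list above.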
Evaluation of the traditional approach
Instead of jumping right into the hyped Big Data technologies, we first considered how to solve this with proven technologies and our existing experience. A classic solution would be based on queues and workers. Our strong background in Java would make it possible to implement this using JEE, with a set of Message Driven Beans wired together with JMS. However, this solution has one major drawback: it does not scale well horizontally, which was one of the objectives.
Use case: A system with queues and workers is configured to evenly balance the workload over different workers. No problems so far.
Now imagine scaling out to increase throughput. Deploying a new worker creates an additional (logical) queue. This means reconfiguring and redeploying the first set of workers so that the system can rebalance the workload across the new setup. These cascading dependencies make scaling this type of system a tedious task.
Another drawback of a system with queues is speed. The queues are a way of making the system fault-tolerant. This is achieved by taking a message off the queue only once the worker has acknowledged that it successfully processed the message, so when a worker fails the message is not lost. The use of queues, however, requires that every time you push a message, the system persists that message to disk, serializing it on the way in and deserializing it when a worker wants to consume it, which slows the system down.
Additionally, this type of solution requires a lot of code for routing and message serialization, which takes effort away from coding the actual business logic.
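The acknowledgement mechanism described above can be sketched in a few lines. This is a minimal in-memory illustration, not JMS code: the class `AckQueue` and its methods are hypothetical names, and the disk persistence that makes real brokers slow is deliberately left out.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.function.Predicate;

// Hypothetical in-memory stand-in for a broker queue. A real broker would
// also persist every message to disk, which is the cost discussed above.
class AckQueue {
    private final Deque<String> pending = new ArrayDeque<>();

    void push(String message) {
        pending.addLast(message);
    }

    // Hands the head message to the worker and removes it only when the
    // worker acknowledges success by returning true. Returns false when the
    // queue is empty or the worker failed; a failed message stays queued.
    boolean deliver(Predicate<String> worker) {
        String head = pending.peekFirst();
        if (head == null) {
            return false;
        }
        boolean acked;
        try {
            acked = worker.test(head);
        } catch (RuntimeException e) {
            acked = false; // a crashing worker never acknowledges
        }
        if (acked) {
            pending.removeFirst();
        }
        return acked;
    }

    int size() {
        return pending.size();
    }
}
```

If a worker throws while handling a message, `deliver` returns false and the message remains at the head of the queue, so a later call can hand it to a healthy worker.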
Sentiment Analysis with Storm and Kafka
These drawbacks make this a good opportunity to adopt more recent technologies such as Storm and Kafka and to build a horizontally scalable, fault-tolerant, real-time processing engine with guaranteed data processing.
Twitter Streaming API
To get input data into our system, we decided to source it from Twitter. Twitter offers a Firehose API, a streaming service that lets you listen to all public tweets in (near) real-time. Because only a handful of Certified Product Partners have access to the Firehose API, our approach was to use Twitter's Streaming API instead. The Streaming API also lets you listen to public tweets, but it comes with a limitation: Twitter returns only a small percentage of all tweets matching your search query in real-time.
Kafka instead of Queues
Kafka is a persistent, distributed, replicated pub/sub messaging system originally developed at LinkedIn and designed to overcome some of the problems caused by the traditional queues-and-workers pattern. Kafka has the following three design principles:
- Very simple API for both producers and consumers;
- Low overhead in network transfer as well as on-disk storage;
- A scaled out architecture from the beginning.
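To make the contrast with the acknowledge-and-remove queue concrete, here is a minimal sketch of the pub/sub idea: messages are appended to a log and every consumer group tracks its own read position. This is not the Kafka client API; `TopicLog`, `publish`, `poll` and `seek` are hypothetical names, and partitioning, replication and disk storage are left out.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical in-memory model of a single topic as an append-only log.
class TopicLog {
    private final List<String> log = new ArrayList<>();
    // Next offset to read, tracked separately per consumer group.
    private final Map<String, Integer> offsets = new HashMap<>();

    // Appends a message and returns its offset in the log.
    int publish(String message) {
        log.add(message);
        return log.size() - 1;
    }

    // Returns everything the group has not read yet and advances its offset.
    // Messages are never removed, so other groups still see them.
    List<String> poll(String group) {
        int from = offsets.getOrDefault(group, 0);
        List<String> batch = new ArrayList<>(log.subList(from, log.size()));
        offsets.put(group, log.size());
        return batch;
    }

    // Moves a group's read position, for example to replay after a failure.
    void seek(String group, int offset) {
        if (offset < 0 || offset > log.size()) {
            throw new IllegalArgumentException("offset out of range: " + offset);
        }
        offsets.put(group, offset);
    }
}
```

Because consuming only moves an offset, a new consumer group can be added without touching existing ones, and a consumer that failed can rewind and replay.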
Storm instead of workers and intermediate message brokers
Apache Storm is a free and open source distributed real-time computation system. Storm will replace the workers and intermediate brokers from our original setup. It has three types of components:
- A spout is a source of streams in a computation, in our case the Kafka Spout;
- A bolt processes any number of input streams and produces any number of new output streams. The business logic of the application can be found in the bolts;
- A topology is a graph of spouts and bolts, where each edge of the graph represents a bolt subscribing to the output of some other spout or bolt.
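The relationship between the three component types can be illustrated with a toy model. The interfaces below are hypothetical and are not Storm's API; the sketch also simplifies a topology to a linear chain that drains a finite source, whereas a Storm topology is a graph that processes an unbounded stream.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Hypothetical source of tuples; returns null once the source is exhausted.
interface MiniSpout {
    String nextTuple();
}

// Hypothetical processing step; may emit zero, one or many output tuples.
interface MiniBolt {
    void execute(String tuple, Consumer<String> emit);
}

// Toy topology: one spout followed by a linear chain of bolts.
class MiniTopology {
    private final MiniSpout spout;
    private final List<MiniBolt> bolts = new ArrayList<>();

    MiniTopology(MiniSpout spout) {
        this.spout = spout;
    }

    // Subscribes the bolt to the output of the previous component.
    MiniTopology then(MiniBolt bolt) {
        bolts.add(bolt);
        return this;
    }

    // Drains the spout, passes the tuples through every bolt in order and
    // returns what the last bolt emitted.
    List<String> run() {
        List<String> current = new ArrayList<>();
        for (String t = spout.nextTuple(); t != null; t = spout.nextTuple()) {
            current.add(t);
        }
        for (MiniBolt bolt : bolts) {
            List<String> next = new ArrayList<>();
            for (String tuple : current) {
                bolt.execute(tuple, next::add);
            }
            current = next;
        }
        return current;
    }
}
```

In this model a filtering bolt simply does not call `emit` for tuples it rejects, and a transforming bolt emits a modified tuple.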
Let's go over the different steps for setting up the application:
- Use the Twitter Streaming API to get all tweets matching a set of keywords and push them to a Kafka topic.
- Now that the messages are arriving in the Kafka topic, create a spout that subscribes to the topic and fetches the related tweets for processing. This is where Storm comes into the picture.
- Use Storm to do the real-time processing of the messages.
- Set up the topology with the necessary groupings.
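The keyword matching in the first step can be sketched as a small filter. This is an assumption about what "matching" means (case-insensitive, whole words, hashtags and mentions counted as words); the class `KeywordFilter` is a hypothetical name and does not reproduce the Streaming API's own matching rules.

```java
import java.util.Collection;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical filter: keeps a tweet when any of its words is a keyword.
class KeywordFilter {
    private final Set<String> keywords = new HashSet<>();

    KeywordFilter(Collection<String> keywords) {
        for (String keyword : keywords) {
            this.keywords.add(keyword.toLowerCase(Locale.ROOT));
        }
    }

    // Splits the lowercased tweet on anything that is not a-z or 0-9, so
    // "#Jobs" and "@eures" are compared as "jobs" and "eures".
    boolean matches(String tweet) {
        for (String token : tweet.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (keywords.contains(token)) {
                return true;
            }
        }
        return false;
    }
}
```

Only tweets for which `matches` returns true would be pushed to the Kafka topic in this sketch.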
Before going into the topology itself, let's first address the bolts. For the Twitter sentiment analysis, the different actions to be performed were grouped into a set of bolts. A topology wires the bolts and the spout together: in it you define how spouts and bolts are logically connected and you give hints about the parallelism.
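Groupings and parallelism interact: when a bolt runs as several parallel tasks, a grouping decides which task receives each tuple. In Storm, a fields grouping sends tuples with the same field value to the same task. The sketch below only illustrates that idea with a hash modulo; `FieldsGrouping.taskFor` is a hypothetical helper and not Storm's implementation.

```java
// Hypothetical illustration of routing by key across parallel tasks.
class FieldsGrouping {
    // Picks the task index for a key. The same key always maps to the same
    // task, so per-key state (for example a running sentiment total per
    // topic) can live in exactly one task.
    static int taskFor(String key, int parallelism) {
        if (parallelism <= 0) {
            throw new IllegalArgumentException("parallelism must be positive");
        }
        // floorMod keeps the result in [0, parallelism) for negative hashes.
        return Math.floorMod(key.hashCode(), parallelism);
    }
}
```

For example, `FieldsGrouping.taskFor("luxembourg", 4)` returns the same index on every call, while different keys spread over the four tasks.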
In today's exciting world of big data, Arηs has been able to prove its capability to work successfully with new stacks of information and tools, providing new knowledge and insights to the business and thus improving its capabilities for decision making.