Orange "Data for Development" (D4D) is an open data challenge encouraging research teams around the world to use four datasets of anonymous call patterns from Orange's Ivory Coast subsidiary to help address societal development questions in novel ways. The datasets are based on anonymized Call Detail Records extracted from Orange's customer base, covering December 2011 to April 2012.
As we explained in a previous post, over the last few months we have been working on a project for the Orange D4D Challenge. Our main task has been analyzing and visualizing the provided mobile communication datasets (collected in Ivory Coast from December 2011 to April 2012), looking for relevant and original findings for the society of this West African country, and presenting our deductions in a clear, friendly way that helps governments and NGOs make more accurate decisions.
Paradigma Labs team
Our idea is to use the geolocation data from the antennas that process mobile phone calls in order to know which sub-prefectures the customers have been moving around. The main goal of our project is to develop spatio-temporal models that detect patterns for the different sub-prefectures, incorporating other factors related to the region and/or time: wealth, development, infrastructure, investment, grants…
By means of GIS technology, we will be able to apply our models to the gathered data and analyze their correlations across the surface of Côte d'Ivoire, working with geographical layers: land cover, road maps, railway lines, water sources… The conclusions of our study can then be properly visualized, allowing a better explanation of the facts.
In the near future, other measures could be included: for instance, the locations of hospitals and police stations and their call rates. This would let us understand how they are actually used and help improve their service to citizens, identifying dangerous areas, overcrowded hospitals, and so on.
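The first step of such an analysis is aggregating antenna-level call records up to the sub-prefecture level. A minimal sketch in Python (the record layout, antenna identifiers and sub-prefecture names below are illustrative; the real D4D datasets use their own identifiers):

```python
from collections import Counter

# Hypothetical CDR rows as (antenna_id, timestamp) pairs, plus a
# mapping from each antenna to the sub-prefecture it is located in.
calls = [
    ("ant_001", "2011-12-05T10:00"),
    ("ant_002", "2011-12-05T10:02"),
    ("ant_001", "2011-12-05T10:05"),
]
antenna_to_subpref = {"ant_001": "Abidjan", "ant_002": "Bouake"}

def calls_per_subprefecture(calls, antenna_to_subpref):
    """Count calls per sub-prefecture from antenna-level records."""
    counts = Counter()
    for antenna_id, _timestamp in calls:
        subpref = antenna_to_subpref.get(antenna_id)
        if subpref is not None:
            counts[subpref] += 1
    return counts

print(calls_per_subprefecture(calls, antenna_to_subpref))
```

The same grouping, keyed additionally by time window, yields the spatio-temporal activity series that the models described above would consume.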
At this moment a lot of companies offer end-point services (data providers, semantic analysis, …) that we can integrate with our applications. However, when designing our own service, it can be tough to find the ideal parameters to configure it and the best software to make it scalable and highly available.
Continuous-Time Markov Chains (CTMC; Yin, G. et al., 1998) provide an ideal framework for estimating these most important parameters, and by means of simulation we can find them. A special case of CTMC belonging to Queuing Theory (Breuer, L. et al., 2005) is the M/M/c/K model, which lets us model our service as a queuing system, where:
c: the number of parallel servers (processes)
K: the maximum number of clients waiting in the queue
E.g., the following CTMC represents a simple M/M/3/4 queuing system (Download .dot):
As seen in the picture above, grey nodes mean that n−3 clients are waiting in the queue, and the last state is the red node (#7), at which point incoming clients are rejected from the system.
M/M/c/K model simulation
+ MODEL PARAMETERS
Stability: True (rho = 0.4444)
Average number of clients in the system (L) = 1.4562
Average queue length (Lq) = 0.1268
Average time in the system (W) = 0.0365
Average waiting time in the queue (Wq) = 0.0032
+ PROBABILITY DISTRIBUTION
P_0 = 0.2550368777
P_1 = 0.340049170234300
P_2 = 0.226699446822867
P_3 = 0.100755309699052
P_4 = 0.044780137644023
P_5 = 0.019902283397344
P_6 = 0.008845459287708
P_7 = 0.003931315238981
[Total Probability: 1.0]
Elapsed time: 0.00025105
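The figures above follow from the standard birth-death balance equations of an M/M/c queue with finite capacity. A minimal sketch that reproduces them (the rates λ = 40 and μ = 30 are assumptions chosen to match ρ = λ/(cμ) = 0.4444; only the ratio matters for the probabilities):

```python
from math import factorial

def mmck(lam, mu, c, K):
    """Steady-state probabilities and performance measures for an
    M/M/c queue with K waiting slots (system capacity N = c + K)."""
    N = c + K
    a = lam / mu  # offered load
    # Unnormalised stationary weights from the balance equations
    weights = [a**n / factorial(n) if n <= c
               else a**n / (factorial(c) * c**(n - c))
               for n in range(N + 1)]
    total = sum(weights)
    probs = [w / total for w in weights]
    L = sum(n * p for n, p in enumerate(probs))                   # in system
    Lq = sum((n - c) * p for n, p in enumerate(probs) if n > c)   # in queue
    lam_eff = lam * (1 - probs[N])   # arrivals actually admitted
    W, Wq = L / lam_eff, Lq / lam_eff  # Little's law
    return probs, L, Lq, W, Wq

probs, L, Lq, W, Wq = mmck(lam=40.0, mu=30.0, c=3, K=4)
print(round(probs[0], 6), round(Lq, 4), round(Wq, 4))
```

Note that W and Wq are computed against the effective arrival rate λ(1 − P_N), since clients arriving at the full state #7 are rejected.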
Once we have calculated the best-fit values for our system, it is time to present our service, which is based on a Wikipedia Semantic Graph. The picture below shows the main structure, creating relations between articles and categories:
So, in the first instance, our service performs lookup queries in order to identify Entities in a text. We can see the result of a query to our service:
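A minimal sketch of such a lookup (the surface-form index and its contents here are hypothetical toy data; the real service is backed by the full Wikipedia article/category graph):

```python
# Hypothetical surface-form index built from Wikipedia article titles;
# the real service queries the semantic graph of articles and categories.
INDEX = {
    "ivory coast": {"article": "Ivory Coast",
                    "categories": ["Countries in Africa"]},
    "orange": {"article": "Orange S.A.",
               "categories": ["Telecommunications companies"]},
}

def lookup_entities(text):
    """Naive scan: report every indexed surface form that occurs
    in the lower-cased input text."""
    lowered = text.lower()
    return [info["article"]
            for surface, info in INDEX.items()
            if surface in lowered]

print(lookup_entities("Orange operates in Ivory Coast"))
```

A production version would use a trie or finite-state matcher instead of a substring scan, but the input/output shape is the same.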
Up to this point, we have calculated several parameters for our system: incoming rate lambda (λ), service rate mu (μ), c (parallel servers) and K (queue length). To ensure the system holds these constraints, we should implement a two-layer throttle system.
IPTABLES filter: several clients will try to access our system, but only a portion of them will succeed.
LOGIC filter: a software-based filter that performs throttling by means of user tokens. It applies temporal restrictions to handle each user's incoming rate.
Therefore, the following software helps us implement these restrictions:
Iptables filter: using Iptables (debian-administration.org) we can restrict incoming connections, mitigating denial-of-service (DoS) attacks.
Logic filter: using a time-control and token-manager script we can deal with this problem.
Several parallel servers and a queue system: we set up Gunicorn to run several Tornado servers, implementing the queue restrictions.
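The logic filter can be sketched as a per-user-token bucket that enforces a maximum request rate (the rate and burst capacity below are illustrative values, not the ones used in production):

```python
import time

class TokenBucket:
    """Simple per-user throttle: each user token gets `capacity`
    request credits, refilled at `rate` credits per second."""
    def __init__(self, rate, capacity):
        self.rate = rate
        self.capacity = capacity
        self.state = {}  # user token -> (remaining credits, last timestamp)

    def allow(self, user_token, now=None):
        """Return True if this request is admitted, False if throttled."""
        now = time.time() if now is None else now
        credits, last = self.state.get(user_token, (self.capacity, now))
        # Refill credits for the time elapsed since the last request
        credits = min(self.capacity, credits + (now - last) * self.rate)
        if credits >= 1.0:
            self.state[user_token] = (credits - 1.0, now)
            return True
        self.state[user_token] = (credits, now)
        return False

bucket = TokenBucket(rate=2.0, capacity=2)  # 2 requests/s, burst of 2
```

Each handler would call `allow()` with the caller's token before queuing the request, returning an HTTP 429-style rejection when it answers False.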
A sample Tornado server scaffold for our service could be:
# -*- coding: utf-8 -*-
from tornado.web import Application, RequestHandler
from tornado.ioloop import IOLoop

# Main handler class
class NerService(RequestHandler):
    def get(self):
        self.write({"entities": []})  # entity lookup goes here

# Run application
app = Application([(r"/", NerService)])

# To test a single server file
if __name__ == "__main__":
    app.listen(8888)
    IOLoop.instance().start()
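To run several Tornado workers behind Gunicorn, as described above, a command along these lines could be used (the module and application names are hypothetical; `-w 3` matches the c = 3 parallel servers of the model):

```shell
# Gunicorn with the tornado worker class, 3 parallel workers
gunicorn -k tornado -w 3 ner_service:app
```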
Finally, after applying this configuration we simulated several incoming rates (also testing various numbers of clients), obtaining the service performance statistics shown in the picture below:
Using Wikipedia categories and articles, we are able to detect a huge range of Entities.
Wikipedia is continuously updated, so we have an up-to-date NER (Named Entity Recognition) source.
We can use Gunicorn to run and manage several service instances.
We have implemented a throttle system to restrict the maximum number of requests per second. A way to restrict the overall incoming rate by means of iptables is also provided.
It proved necessary to simulate different invocation rates of our service using Queuing Theory formulae to find the best-fit parameters such as λ, μ, ρ, L, Lq, W, Wq.
Openinfluence is an open metric developed at Paradigma Labs that tries to define the relevance of each user on Twitter. It is open because you can see the formula and contribute to improving it. You can see the formula in the picture below:
As you can see, the formula has two main components, "Popularity" and "Influence". Popularity is related to the static properties of your social network; it is a kind of "potential influence", the beforehand capability of getting your tweets spread. Influence is related to the propagation and repercussion of each of your tweets, the effective reach of your messages.
The correlation between Popularity and Influence (dataset) shows that most people have more or less the same Popularity and Influence. Because of the structure of the formula, some users have an Influence of 0 and a Popularity of n > 0, yet their relevance is not null.
Share your point of view with us! We are looking to improve it!
With this plugin for Gephi, Paradigma Labs wants to provide the community with a useful tool to analyze Twitter information. We have encapsulated all the complexity behind a simple button. The retweet is one of the main actions for information propagation, and now you can make your own analysis in real time by means of Gephi and the Retweet Monitor plugin.
Its internal mechanisms are fairly simple. The software connects to the Twitter stream and then applies (if desired) a content filter. All the information gathered is displayed by Gephi, and you can then apply the standard algorithms and layouts to create a representative visualization.
15th October 2011 was a world-level milestone: millions of people around the globe occupied the streets to protest against the global financial crisis, influenced in great measure by the power of social networks, essentially Twitter. The protest movement, tagged as #15o and #15oct, was heavily based upon #15m (Spain) and #ows ("Occupy Wall Street"), social movements built around the notion that 99% of the people are NOT responsible for the 'financial games' played by a minor 1% that gets rich by sucking wealth from the remaining 99% (#weare99).
We present the evolution over time of the related Twitter activity around 15th October 2011. Taking a dataset of 1.2 million tweets (ranging from 13th October to 18th October), we worked to offer some global (geolocated) visualizations, local visualizations (centered around New York, San Francisco, Barcelona and Madrid) and, lastly, a visualization of how the associated hashtags evolved in that time frame. read more…
It is well known that Twitter's most powerful use is as a tool for real-time journalism. Trying to understand its social connections and its outstanding capacity to propagate information, we have developed a mathematical model to identify the evolution of a single tweet.
The way a tweet spreads through the network is closely related to Twitter's retweet functionality, but retweet information is fairly incomplete due to the fight for earning credit/users by claiming to be the original source/author. We have taken this behavior into consideration, and our approach uses text-similarity measures as a complement to retweet information. In addition, #hashtags and URLs are included in the process since they play an important role in Twitter's information propagation. read more…
Paradigma Labs is glad to announce its contribution to the Apache Mahout project!
In our developments we realized that NoSQL databases were not natively supported by the machine learning library, so, trying to fill this gap, we decided to create a DataModel with MongoDB support. After the favorable results obtained, we decided to share it with Mahout's community (and yes, the code has been accepted!).
This DataModel will not be released until Mahout version 0.6, but you can already access the code in Mahout's SVN.
As a bonus, we have also developed a recommender system based on this DataModel, all wrapped in a REST service. Source code, installation/configuration instructions and more can be found on GitHub. Hope you enjoy it!
Related reading: Mahout in Action Related viewing: Social Recommendations
It's been a while since Google introduced their MapReduce framework for distributed computing on large data sets on clusters of computers. Paradigma Labs was thinking of trying Apache Hadoop to run this kind of task, so it was the proper choice for some web scraping we had on our hands. I was the designated developer for the task, and my love for Ruby on Rails led me to give Hadoop Streaming a try so I could avoid Java when writing the scripts. Many will agree on the virtues of Ruby, especially considering gems like Mechanize for web scraping and Rails gems like ActiveResource for accessing REST web services and ActiveRecord for ORM with a database like MySQL.