For years I have been using NetworkX and had not found a good substitute that combines the fast execution of C++ with the rapid development of Python. SNAP is a powerful tool built with exactly that goal. For researchers in network science, it provides a general-purpose, high-performance system for the analysis and manipulation of large networks.

Instructions for installing SNAP-python on Linux, Mac OS X and Windows are given on the SNAP website.

When you are working with a library that is still under development, it is often better to use virtualenv. A virtual environment, put simply, is an isolated working copy of Python that lets you work on a specific project without worrying about affecting other projects. For more details, take a look at the virtualenv documentation.
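To make the "isolated working copy" idea concrete, here is a minimal sketch that reports whether the running interpreter belongs to a virtual environment. The helper name `in_virtual_environment` is my own, not part of virtualenv. It relies on classic virtualenv setting `sys.real_prefix`, and on newer tools making `sys.base_prefix` differ from `sys.prefix`.

```python
import sys


def in_virtual_environment():
    """Return True if the running interpreter belongs to a virtual environment."""
    # Classic virtualenv sets sys.real_prefix. The venv module and recent
    # virtualenv releases instead make sys.base_prefix differ from sys.prefix.
    if hasattr(sys, "real_prefix"):
        return True
    return getattr(sys, "base_prefix", sys.prefix) != sys.prefix


print("interpreter prefix: %s" % sys.prefix)
print("inside a virtual environment: %s" % in_virtual_environment())
```

Running this before and after `source bin/activate` should show the prefix switching from the system Python to the virtualenv directory.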

I ran into problems using SNAP inside a virtualenv on Mac OS X 10.9.3. After reading a few manuals on shared libraries, I was able to resolve them.

Here is the problem I faced:

>>> import snap
Fatal Python error: PyThreadState_Get: no current thread
Abort trap: 6

So I decided to write a step-by-step post to help others install SNAP successfully with virtualenv.

To create a virtualenv directory:

virtualenv --python=$(which python) --distribute --no-site-packages snap/

Change your working directory to snap, activate the virtualenv, and then install numpy and graphviz.

cd snap/
source bin/activate
pip install numpy
pip install graphviz

Install gnuplot with Homebrew, then download and install the Python extension for gnuplot inside the snap virtualenv. The last step is downloading and installing the SNAP.py library: download the SNAP package from the website and place it in the working directory (snap/).

tar zxvf snap-1.0-2.2-macosx10.7.5-x64-py2.7.tar.gz
cd snap-1.0-2.2-macosx10.7.5-x64-py2.7
sudo python setup.py install

Run the following line in python or ipython.

import snap

If you get any errors related to PyThreadState_Get, do the following.

install_name_tool -change bin/python /System/Library/Frameworks/Python.framework/Versions/2.7//lib/libpython2.7.dylib snap-1.0-2.2-macosx10.7.5-x64-py2.7/_snap.so

install_name_tool is used in Mac OS X to change dynamic shared library install names. Now try to run the snap example files. Voila! It should work.
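If you are unsure which install name is recorded inside `_snap.so`, `otool -L` lists them, and that is where the "old" argument for `install_name_tool -change` comes from. Below is a rough sketch that wraps the call from Python. The helper names `parse_otool_output` and `python_links` are my own, and it assumes `otool` is on your PATH (it ships with the Xcode command line tools).

```python
import subprocess


def parse_otool_output(text):
    """Extract install names from the text printed by `otool -L`."""
    names = []
    for line in text.splitlines():
        # Dependency lines are indented; the first line names the file itself.
        if not line.startswith(("\t", " ")):
            continue
        # Each entry looks like: "<install name> (compatibility version ...)"
        name = line.strip().split(" (compatibility", 1)[0]
        if name:
            names.append(name)
    return names


def python_links(so_path):
    """Return the install names recorded in so_path that mention Python."""
    output = subprocess.check_output(["otool", "-L", so_path])
    return [name for name in parse_otool_output(output.decode("utf-8"))
            if "python" in name.lower()]
```

Calling `python_links("snap-1.0-2.2-macosx10.7.5-x64-py2.7/_snap.so")` before and after the fix should show the Python entry changing to the framework's libpython2.7.dylib.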

Data collection and storage are key to many projects. There is a ton of data out there, and you only want the part that matters for your problem. We are working on a Twitter virulence project to predict how viral a tweet will go, in other words, how many retweets it is going to get. We looked at the SNAP dataset, which has 467 million tweets collected from 20 million people over a 7-month period. The problem was that this dataset is huge and most of it is junk, with some spam messages too. Finding the viral tweets and separating them out for the project would have been a herculean task.

So we decided to collect fresh data in real time. Collecting data in real time is not an easy job, as Twitter imposes a lot of restrictions. The Twitter Streaming API came to our rescue. So how does it work?

Messages indicating that Tweets and other events have occurred are pushed to a streaming client (in our case a Python snippet), without any of the overhead associated with polling a REST endpoint. For more details, refer to the Twitter Streaming API documentation.
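To give a feel for the client side, here is a minimal sketch of consuming such a stream. It is not our collection script. It assumes the connection delivers one JSON message per line with blank keep-alive lines in between, and the names `iter_messages` and `is_tweet` and the sample lines are made up for illustration. Opening the authenticated connection is left out.

```python
import json


def iter_messages(lines):
    """Yield one decoded message per non-empty line of a streaming response."""
    for raw in lines:
        line = raw.strip()
        if not line:
            # Blank lines are keep-alives that hold the connection open.
            continue
        try:
            yield json.loads(line)
        except ValueError:
            # Skip anything that is not valid JSON, e.g. a truncated message.
            continue


def is_tweet(message):
    """Tweets carry 'id_str' and 'text'; other events such as deletes do not."""
    return isinstance(message, dict) and "id_str" in message and "text" in message


# Hypothetical sample of what a connection might deliver.
sample = [
    '{"id_str": "1", "text": "hello", "retweet_count": 0}',
    '',
    '{"delete": {"status": {"id_str": "1"}}}',
]
tweets = [m for m in iter_messages(sample) if is_tweet(m)]
print(len(tweets))
```

Because `iter_messages` accepts any iterable of lines, the same loop can be fed from a live HTTP response or from a saved file while testing.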

This kind of data collection needs a stable internet connection. Of course, BSNL or any other such service cannot provide that. :P So we ran the code on AWS servers. The AWS free usage tier comes with just 8 GB of space, so a DB-style solution for storing the tweets was not really an option.

The jugaad solution: use directories and files to store the tweets. This is basic stuff that needs only file operations and occupies less space than a DB. The trade-off is that the tweets are not indexed.
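A minimal sketch of this kind of file-based storage is below, assuming the layout described in the dataset section of this post (one folder per tracked user, one file per tweet, one line per record). The function names `tweet_path` and `append_line` are illustrative, not taken from our actual script.

```python
import os


def tweet_path(root, user_id, tweet_id):
    """Build <root>/<user_id>/<tweet_id> for one tracked tweet."""
    return os.path.join(root, str(user_id), str(tweet_id))


def append_line(root, user_id, tweet_id, line):
    """Append one line to a tweet's file, creating the user folder if needed."""
    folder = os.path.join(root, str(user_id))
    if not os.path.isdir(folder):
        os.makedirs(folder)
    with open(tweet_path(root, user_id, tweet_id), "a") as handle:
        # Normalize to exactly one trailing newline per record.
        handle.write(line.rstrip("\n") + "\n")
```

Appending keeps each write cheap: the original tweet goes in as the first line, and every retweet that arrives later adds one more line to the same file.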

With the infrastructure ready, the next question was how to capture viral tweets. After an extensive search on Twitter, we made a list of people with a good retweet count (average: 500 to 1000) and tracked them using the streaming API. A glimpse of the list reveals these names: Justin, Barack Obama, Taylor Swift, Rihanna, Ronaldo, Britney Spears, Shakira, Ellen DeGeneres, Bruno Mars, Selena Gomez, Justin Timberlake etc.

Here is the link to the tweets dataset, collected over a period of 1 month.

Stats about the current dataset:

  • Number of tweets collected: 8227
  • Tweets with more than 1000 retweets: 1341

About the dataset

The dataset is 1.2 GB and was collected over nearly a month.

Each folder is named after the person-id. Each folder contains a user.info file, which typically has this format:

LilTunechi ( Name )
IM YOUNG MONEY http://trukfit.com ( Bio )
Mars ( Location as specified by twitter-user )
Alaska ( Time-zone )
2010-02-22 05:29:44 ( Profile created time in UTC )
-28800 ( UTC-offset )
true ( Verified )

The other files are tweet files, named after their tweet-id. The first line is the original tweet, in this format:

Tweet Message ||| length of tweet message ||| tweet-message created_at time ||| hashtags ||| followers_count at the time the status was posted ||| hasLink

E.g.:

Rich Gang album right slime!! & it’s f#Kin awesome!!! ||| 66 ||| 2013-07-23 21:40:30 ||| [] ||| 12582328 ||| False
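If you want to load these first lines programmatically, a sketch along the following lines should work. The function name `parse_tweet_line` and the dictionary keys are my own choices. It also assumes the hashtags field is always written as a Python list literal such as `[]`, which is what the example suggests but is not guaranteed.

```python
import ast
from datetime import datetime

SEPARATOR = " ||| "


def parse_tweet_line(line):
    """Split the first line of a tweet file into named fields."""
    # Split from the right: the last five fields have a fixed shape, so a
    # message that itself contains "|||" stays in one piece.
    message, length, created_at, hashtags, followers, has_link = \
        line.rstrip("\n").rsplit(SEPARATOR, 5)
    return {
        "message": message,
        "length": int(length),
        "created_at": datetime.strptime(created_at, "%Y-%m-%d %H:%M:%S"),
        # Assumption: hashtags are stored as a Python list literal.
        "hashtags": ast.literal_eval(hashtags),
        "followers_count": int(followers),
        "has_link": has_link.strip() == "True",
    }
```

A line with fewer than six fields raises a ValueError, which makes it easy to spot files that do not follow the format.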

The lines that follow are retweet details, in this format:

user id_str ||| retweet-time ||| retweeter’s location ||| retweeter-verified ||| screen-name of retweeter ||| bio

E.g.:

1511641856 ||| 2013-07-23 21:40:37 ||| Alexandria, PA ||| False ||| LynetteCarter10 ||| #Arithmetic is being able to count up to twenty without taking off your #shoes. (Mickey Mouse)
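Retweet lines can be read in a similar way. This is a sketch only: the names `parse_retweet_line`, `seconds_after` and the keys in `FIELDS` are mine, and it assumes exactly six fields separated by " ||| ", as in the example above.

```python
from datetime import datetime

SEPARATOR = " ||| "
FIELDS = ("user_id", "retweeted_at", "location", "verified", "screen_name", "bio")


def parse_retweet_line(line):
    """Split one retweet line into a dict keyed by FIELDS."""
    # maxsplit=5 keeps everything after the fifth separator in the bio,
    # which is free text and could contain "|||" itself.
    parts = line.rstrip("\n").split(SEPARATOR, 5)
    if len(parts) != len(FIELDS):
        raise ValueError("expected %d fields, got %d" % (len(FIELDS), len(parts)))
    record = dict(zip(FIELDS, parts))
    record["retweeted_at"] = datetime.strptime(
        record["retweeted_at"], "%Y-%m-%d %H:%M:%S")
    record["verified"] = record["verified"] == "True"
    return record


def seconds_after(original_time, record):
    """Delay between the original tweet and this retweet, in seconds."""
    return (record["retweeted_at"] - original_time).total_seconds()
```

For the two example lines in this post, the original tweet was created at 21:40:30 and the retweet at 21:40:37, so `seconds_after` gives a delay of 7 seconds. These delays are the kind of signal that is useful when studying how fast a tweet spreads.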

PS: This is the inaugural blog post of this website. :)

Copyright © 2015 - Vijay Mahantesh SM - Powered by Octopress