Mini-Tutorial: Counting Tweets with Python

Counting, Voting, Polling, Abusing Statistics...

Sometimes the volume of tweets says more than the tweets themselves. You may be interested in how frequently tweets containing a particular word appear over time, or you may want to use hashtags as a proxy vote. In either case, you have the option of counting old tweets or new tweets as they are posted. Twitter retains only about a week's worth of tweets; anything older you must either store yourself or pay a service for. For real-time tweets, Twitter delivers a statistical sampling of about 1% of the total stream. However, 1% may be more than enough to capture the tweets containing the terms you are following.

Twitter's APIs

Twitter supports two types of APIs, distinguished by the type of connection used. The Streaming API uses a continuous connection that stays open until either side breaks it. With the REST API, the connection always closes after a result is returned. Each API has a variety of URL endpoints for requesting timelines, lists, followers, etc.

There are a number of popular open source Python packages that simplify Twitter requests. We will use TwitterAPI, a package that supports all API endpoints and works with both Python 2 and Python 3. We need just two endpoints: statuses/filter to get new tweets, and search/tweets to get old tweets. Before making any requests, however, Twitter requires developers to create an application and generate credentials on dev.twitter.com. Once that is done, copy and paste your credentials from the website into your code.

            from TwitterAPI import TwitterAPI

            # Substitute your API and ACCESS credentials.
            api = TwitterAPI(API_KEY, API_SECRET, ACCESS_TOKEN, ACCESS_TOKEN_SECRET)

Counting New Tweets

The statuses/filter endpoint uses a continuous connection to stream new tweets. It keeps returning tweets until the client or Twitter breaks the connection, or until a socket error occurs. If we are careful, we can receive tweets indefinitely. This is how TwitterAPI gets a continuous stream of tweets:

            r = api.request('statuses/filter', {'track':'giraffe'})
            for item in r:
              print(item['text'])

The above code downloads tweets containing the word "giraffe." The loop exits if there is a socket error or the connection closes. For example, per Twitter's instructions, the client will disconnect if there is a lull in data lasting 90 seconds. Twitter may also send messages other than tweets, so we need to code for that too. Here is an improved version that also keeps count:

            count = 0   # tweets we received
            skip = 0    # tweets Twitter skipped because we exceeded the stream limit
            r = api.request('statuses/filter', {'track':'giraffe'})
            for item in r:
              if 'text' in item:
                count += 1
              elif 'limit' in item:
                # 'track' holds the cumulative number of skipped tweets
                skip = item['limit'].get('track')
                print('*** SKIPPED %d TWEETS' % skip)
              elif 'disconnect' in item:
                print('[disconnect] %s' % item['disconnect'].get('reason'))
                break
              print(count + skip)

Each item is a Python dictionary constructed from the JSON string that represents either an individual tweet or a special control message. A tweet always contains the key 'text'. If that key is not present, we look for certain control messages. The key 'limit' indicates that the rate of tweets containing "giraffe" has exceeded Twitter's Sample stream limit, about 1% of all tweets. Adding the limit's 'track' value to the number of tweets we did receive gives us an accurate total. The other key we look for is 'disconnect', which indicates Twitter wants the client to disconnect.
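
For reference, the two control messages handled above are dictionaries shaped roughly like this (the values shown are made up for illustration):

            # A 'limit' message: the cumulative number of matching tweets withheld.
            {'limit': {'track': 1234}}

            # A 'disconnect' message: Twitter is closing the stream.
            {'disconnect': {'code': 7, 'reason': 'admin logout'}}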

So far we have shown how to use Twitter's free Sample stream to count occurrences of a single word, without paying for or having to consume the entire Firehose stream. However, to count more than one word -- to conduct a poll, for instance -- this method works only if the total number of filtered tweets does not exceed the Sample stream limit. For example, to poll for the hashtags "#yes", "#no" and "#maybe" you would change the track parameter to "#yes,#no,#maybe" (comma-separated). This downloads tweets containing any of the filter words. Since the number of missed tweets returned with limit does not distinguish between the filter words, you should use it only to record the total number of uncountable tweets. See count-new-words.py for the final version that counts multiple words; a rough sketch follows below.
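
As a sketch of the multi-word idea (the actual count-new-words.py may differ), we can tally each hashtag separately by checking which filter words appear in each tweet's text:

            from TwitterAPI import TwitterAPI

            api = TwitterAPI(API_KEY, API_SECRET, ACCESS_TOKEN, ACCESS_TOKEN_SECRET)

            WORDS = ['#yes', '#no', '#maybe']
            counts = dict((word, 0) for word in WORDS)
            uncounted = 0

            r = api.request('statuses/filter', {'track':','.join(WORDS)})
            for item in r:
              if 'text' in item:
                text = item['text'].lower()
                for word in WORDS:
                  if word in text:
                    counts[word] += 1
              elif 'limit' in item:
                # Total skipped tweets; not broken down by filter word.
                uncounted = item['limit'].get('track')
              elif 'disconnect' in item:
                break
              print('%s, uncounted: %s' % (counts, uncounted))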

Counting Old Tweets

To get tweets that were posted previously we use search/tweets, a REST API endpoint that closes the connection after returning a maximum of 100 recent tweets. With successive requests we can get older and older tweets. The requests must be spaced out because Twitter permits no more than 180 requests every 15 minutes, or one request every 5 seconds. Exceeding the rate limit results in being shut out for 15 minutes. Using TwitterAPI, our first attempt looks very similar to getting new tweets.

            r = api.request('search/tweets', {'q':'giraffe'})
            for item in r:
              print(item['text'])

Besides the endpoint and the parameter q, the code has not changed. Running it returns the default 15 tweets. To get the maximum number of tweets per request, we simply supply the count parameter, for example:
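
            # Ask for the maximum of 100 tweets in one request.
            r = api.request('search/tweets', {'q':'giraffe', 'count':100})

But downloading successive pages of tweets while honoring the rate limit requires a bit more code. For that, TwitterAPI has a helper class.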

            from TwitterAPI import TwitterRestPager

TwitterRestPager works with any Twitter REST API endpoint that supports paging. It has one method that returns an iterator, which gets successive pages of results and strings them together as if the pages were one continuous stream.

            r = TwitterRestPager(api, 'search/tweets', {'q':'giraffe', 'count':100})
            for item in r.get_iterator():
              print(item['text'])

Under the hood, TwitterRestPager spaces out successive requests to stay under the rate limit. You can increase the default 5-second wait interval by supplying an optional wait parameter to get_iterator(). The above code should get all tweets containing "giraffe" from the past week or so, provided you don't encounter an error. Here is a more robust version that also keeps count:

            count = 0
            r = TwitterRestPager(api, 'search/tweets', {'q':'giraffe', 'count':100})
            for item in r.get_iterator(wait=6):
              if 'text' in item:
                count += 1
              # item.get() avoids a KeyError when a message lacks a 'code' field
              elif 'message' in item and item.get('code') == 88:
                print('SUSPEND, RATE LIMIT EXCEEDED: %s' % item['message'])
                break
              print(count)

We have conservatively increased the wait interval to 6 seconds. Although we are careful not to exceed the rate limit, we handle this specific error as a precaution, for instance, in case another program is simultaneously authenticated with the same credentials. You can look up all the possible error codes that Twitter returns. Error 88 means we have exceeded the rate limit and should suspend requests until the 15-minute window resets. The complete example for counting multiple words tweeted in the past week is in count-old-words.py; a rough sketch of the approach follows.
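
As a sketch of the multi-word idea (the actual count-old-words.py may differ), the Search API lets you join several terms with OR in the q parameter and tally each one separately:

            from TwitterAPI import TwitterAPI, TwitterRestPager

            api = TwitterAPI(API_KEY, API_SECRET, ACCESS_TOKEN, ACCESS_TOKEN_SECRET)

            WORDS = ['#yes', '#no', '#maybe']
            counts = dict((word, 0) for word in WORDS)

            # 'OR' matches tweets containing any of the words.
            r = TwitterRestPager(api, 'search/tweets',
                                 {'q':' OR '.join(WORDS), 'count':100})
            for item in r.get_iterator(wait=6):
              if 'text' in item:
                text = item['text'].lower()
                for word in WORDS:
                  if word in text:
                    counts[word] += 1
              elif 'message' in item and item.get('code') == 88:
                print('SUSPEND, RATE LIMIT EXCEEDED: %s' % item['message'])
                break
            print(counts)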
