Using Tweepy to Extract Content From Twitter

This article will show you how to quickly start fetching tweets from any public Twitter handle or hashtag, or get a list of followers and friends (following) for any public Twitter handle.

Hey Twitter, I’d like to get access to your API

In order to get access to the Twitter API, you will need a developer account. After you apply, it might take a little while to get a response from Twitter, but you should be able to easily generate your keys once your request is accepted.

You should never share your API keys with anybody, or upload them onto GitHub. Simply keep them in a .csv or .txt file on your local device. To follow the tutorial below, you will want to have your keys in the following order and format:

consumer_key, consumer_secret, access_token_key, access_token_secret

Now that we have our developer keys stored in a .csv file, we can open any IDE and import the following modules:

Tweepy: a Python library that enables developers to interact with the official Twitter API and retrieve any publicly available data from Twitter (tweets, retweets, favourites, likes, hashtags, etc.).
Pandas: the tweets we harvest from the Twitter API will be stored in a dataframe.

Let’s get started

import pandas as pd 
import tweepy

We first need to open our .csv file and pass its values into a dictionary. Please note that we have to remove the quotation marks from our string values, as shown on line 4. The names chosen for the keys within the dictionary are entirely arbitrary, so feel free to rename them as you please.

def getKeys(file_name):
    with open(file_name, "r") as keys_csv:
        for key in keys_csv.readlines():
            keys = [k.replace('"', "").strip() for k in key.split(",")]  # drop quotes, spaces and the trailing newline
            twitter_keys = {
                "consumer_key":        keys[0],
                "consumer_secret":     keys[1],
                "access_token_key":    keys[2],
                "access_token_secret": keys[3]
                }
    return twitter_keys
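
As a quick sanity check of the parsing step, the sketch below writes a throwaway key file and reads it back. It is only a variant of getKeys() built on Python's csv module: the name read_keys, the FIELDS list and the dummy values are placeholders introduced here for illustration, not part of the original code.

```python
import csv
import os
import tempfile

FIELDS = ["consumer_key", "consumer_secret", "access_token_key", "access_token_secret"]

def read_keys(file_name):
    """Hypothetical variant of getKeys(): parse the first row of a key file."""
    with open(file_name, newline="") as keys_csv:
        # skipinitialspace ignores the blank after each comma; csv also strips the quotes
        row = next(csv.reader(keys_csv, skipinitialspace=True))
    return dict(zip(FIELDS, (value.strip() for value in row)))

# Write a throwaway file with dummy values, then read it back
with tempfile.NamedTemporaryFile("w", suffix=".csv", delete=False) as tmp:
    tmp.write('"aaa", "bbb", "ccc", "ddd"\n')

demo_keys = read_keys(tmp.name)
os.remove(tmp.name)  # clean up the temporary file
print(demo_keys)
```

Never try this with your real keys in a file that lives inside a Git repository.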

Our next step is to create a function that will use the values stored in the aforementioned dictionary. To do so, we pass our first set of two keys as arguments to OAuthHandler(), and save the result in a variable named auth. We then pass the remaining two keys as arguments to the set_access_token() method of the auth object that we just created. Finally, we pass auth to tweepy.API() and return the resulting api object.

def getAccess(twitter_keys):
    auth = tweepy.OAuthHandler(
        twitter_keys["consumer_key"],
        twitter_keys["consumer_secret"]
        )
    auth.set_access_token(
        twitter_keys["access_token_key"],
        twitter_keys["access_token_secret"]
        )
    api = tweepy.API(
        auth,
        wait_on_rate_limit=True
        )
    return api

At a high level, this is more or less how Tweepy interacts with the Twitter API:

[Image: diagram of how Tweepy interacts with the Twitter API]

Wait, I can’t fetch as many tweets as I want to?

You have probably noticed that the final step has an optional argument named wait_on_rate_limit, but what does that mean? Well, let’s see what the Tweepy documentation has to say about that.

wait_on_rate_limit: Whether or not to automatically wait for rate limits to replenish

In other words, there’s a catch. We are unfortunately limited in the amount of content that we can get from the Twitter API (this is not related to Tweepy). You can find more here.
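
To make the idea more concrete, here is a rough sketch of what "waiting for a rate limit to replenish" boils down to. Tweepy takes care of this internally when wait_on_rate_limit=True, so the helper seconds_to_wait and the numbers below are illustrative assumptions, not Tweepy's actual code. The Twitter API reports the number of remaining calls and the reset time of the current window in its response headers (x-rate-limit-remaining and x-rate-limit-reset), which is the kind of information this logic relies on.

```python
import time

def seconds_to_wait(remaining, reset_epoch, now=None):
    """Return how long to sleep before the next request (hypothetical helper).

    remaining:   calls left in the current rate-limit window
    reset_epoch: Unix timestamp at which the window resets
    """
    if now is None:
        now = time.time()
    if remaining > 0:
        return 0  # calls left, no need to wait
    # Window exhausted: wait until it resets (never a negative duration)
    return max(0, reset_epoch - now)

# Illustrative values: the window resets 90 seconds after "now"
print(seconds_to_wait(remaining=0, reset_epoch=1_000_090, now=1_000_000))
print(seconds_to_wait(remaining=5, reset_epoch=1_000_090, now=1_000_000))
```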

Getting our first tweets

The following part is absolutely optional, but creating something that resembles a struct will help keep the code clean and easy to debug.

def getStruct():
  data = {
      "created": [],
      "author": [],
      "favorites": [],
      "retweets": [],
      "tweet": [],
      "replying_to": [],
      "quoted": [],
      "place": [],
      "favorited": [],
      "retweeted": [],
      "geo": []
      }
  return data

Most of the time, some of the keys in the above dictionary will contain no values. This is particularly true for the "geo" and "quoted" keys.
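
If you want to find out which keys came back empty, a small helper along these lines can do the job. The function empty_columns and the sample values are made up for illustration; the sample is merely shaped like the dictionary returned by getStruct().

```python
def empty_columns(data):
    """Return the keys whose lists hold nothing but None (hypothetical helper).

    A key with an empty list is reported as empty too.
    """
    return [key for key, values in data.items() if all(v is None for v in values)]

# Made-up sample shaped like the dictionary returned by getStruct()
sample = {
    "tweet": ["first tweet", "second tweet"],
    "replying_to": [None, "some_user"],
    "place": [None, None],
    "geo": [None, None],
}
print(empty_columns(sample))
```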

Our next step is to wrap the three functions we just created into a fourth and final function, which will also contain a smaller nested function named getUser().

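We will need to define the following four arguments when calling the function:

choice: this parameter takes a single letter, either u or q. Entering u means that we are fetching tweets for a specific user handle, while entering q allows us to query Twitter for a particular set of strings, or a hashtag.

user: the Twitter handle to fetch tweets from when choice is u, or None otherwise.

query: the search terms or hashtag to look for when choice is q, or None otherwise.

volume: the number of tweets to fetch.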

Example 1: getTweets("u","ID_AA_Carmack",None,20) to get the latest 20 tweets from John Carmack.

Example 2: getTweets("q",None,"matplotlib", 15) to get the latest 15 tweets about Matplotlib.

The nested getUser() function returns the string “Unknown” whenever the author of a tweet cannot be retrieved.

def getTweets(choice, user=None, query=None, volume=20):

    def getUser(tweet):
        # The author is embedded in each tweet object, so no extra API call is needed
        try:
            return tweet.user.screen_name
        except AttributeError:
            return "Unknown"

    keys = getKeys("tweepy.csv")
    api = getAccess(keys)
    data = getStruct()
    if choice == "u":
        cursor = tweepy.Cursor(
            api.user_timeline,
            id=user,  # Tweepy 4 replaced this parameter with screen_name=user
            tweet_mode="extended"
            ).items(volume)
    elif choice == "q":
        cursor = tweepy.Cursor(
            api.search,  # renamed to api.search_tweets in Tweepy 4
            q=query,
            tweet_mode="extended"
            ).items(volume)
    else:
        # Stop here: without a cursor there is nothing to build a dataframe from
        raise ValueError('Wrong input: choice must be "u" or "q"')
    # Both cursors yield the same kind of tweet object, so one loop is enough
    for c in cursor:
        data["created"].append(c.created_at)
        data["author"].append(getUser(c))
        data["favorites"].append(c.favorite_count)
        data["retweets"].append(c.retweet_count)
        data["tweet"].append(c.full_text)
        data["replying_to"].append(c.in_reply_to_screen_name)
        data["quoted"].append(c.is_quote_status)
        data["place"].append(c.place)
        data["favorited"].append(c.favorited)
        data["retweeted"].append(c.retweeted)
        data["geo"].append(c.geo)
    df = pd.DataFrame(data)
    df["time"] = pd.to_datetime(df["created"]).dt.time
    df["created"] = pd.to_datetime(df["created"]).dt.to_period("D")
    return df

What the long block of code above does is actually pretty simple. We create a tweepy.Cursor() object, which handles all the pagination work and forwards the parameters for us.
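
For readers who wonder what "handling the pagination" means, the toy sketch below mimics the idea with a generator. It is not how Tweepy is implemented: fetch_page stands in for a single API call, and both function names and the fake tweet IDs are assumptions made up for this example.

```python
from itertools import islice

def fetch_page(page_number, page_size=3, total=8):
    """Stand-in for one API call: return one page of fake tweet IDs."""
    start = page_number * page_size
    return list(range(start, min(start + page_size, total)))

def paginate(fetch):
    """Yield items one by one, requesting the next page only when needed."""
    page_number = 0
    while True:
        page = fetch(page_number)
        if not page:
            return  # an empty page means there is nothing left
        yield from page
        page_number += 1

# Comparable in spirit to Cursor(...).items(5): stop after five items
first_five = list(islice(paginate(fetch_page), 5))
print(first_five)
```

The caller only ever sees a flat stream of items, which is what makes looping over a cursor so convenient.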

Once we have called the getAccess() function, Tweepy’s Cursor() will perform different actions depending on the parameters we entered. If we entered u for User, the cursor will call api.user_timeline and will look up the handle passed through id=user. However, if we entered q for Query, it will call api.search and look for whichever search terms we passed through q=query.

The rest is pretty simple: we loop through the results fetched by Tweepy’s Cursor(), and append them as values to their corresponding keys within the dictionary that was created when calling the getStruct() function. The last two lines before the return statement add a time column to the Pandas dataframe and truncate the created column to the day.
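
The mapping step can be tried out without touching the API at all. In the sketch below, SimpleNamespace objects stand in for the tweets that a cursor would yield; the objects, their values and the reduced set of keys are made up for illustration, and only the attribute names are taken from the function above.

```python
from types import SimpleNamespace

# Made-up objects standing in for the tweets a cursor would yield
fake_tweets = [
    SimpleNamespace(full_text="Hello", favorite_count=3, retweet_count=1,
                    user=SimpleNamespace(screen_name="some_user")),
    SimpleNamespace(full_text="No author here", favorite_count=0, retweet_count=0),
]

data = {"author": [], "tweet": [], "favorites": [], "retweets": []}

for t in fake_tweets:
    # getattr with a default plays the role of the "Unknown" fallback
    author = getattr(getattr(t, "user", None), "screen_name", "Unknown")
    data["author"].append(author)
    data["tweet"].append(t.full_text)
    data["favorites"].append(t.favorite_count)
    data["retweets"].append(t.retweet_count)

print(data)
```

Every list grows by exactly one element per tweet, which is what allows the dictionary to be turned into a dataframe afterwards.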

Important: when playing around with the code above, I highly recommend setting the volume parameter to 3 or 4 tweets max. As explained earlier, we want to avoid reaching the limit of tweets we can pull.

Here’s what happens when we run the following code: getTweets("u","ID_AA_Carmack",None,20)

[Image: dataframe returned by getTweets("u","ID_AA_Carmack",None,20)]

Who’s following who

Last but not least, and as described in the opening lines of this article, we can also return the followers and friends of any public user by making some slight changes to our previous function:

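def getUserInfo(twitter_handle,volume):
    keys = getKeys("tweepy.csv")
    api = getAccess(keys)
    # Tweepy 4 renamed api.followers and api.friends to api.get_followers and api.get_friends
    followers = [f.screen_name for f in tweepy.Cursor(api.followers, screen_name=twitter_handle).items(volume)]
    following = [f.screen_name for f in tweepy.Cursor(api.friends, screen_name=twitter_handle).items(volume)]
    # pd.Series pads the shorter list with NaN so that both columns fit in one dataframe
    df = pd.DataFrame({"following": pd.Series(following), "followers": pd.Series(followers)})
    return df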
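
A detail worth pointing out in the function above is the use of pd.Series. An account rarely has as many followers as friends, so the two lists usually differ in length, and the short sketch below (with made-up handles) shows why wrapping them in a Series matters.

```python
import pandas as pd

followers = ["alice", "bob", "carol"]  # made-up handles
following = ["dave"]

# Plain lists of unequal length are rejected by the DataFrame constructor
try:
    pd.DataFrame({"following": following, "followers": followers})
except ValueError as err:
    print("Plain lists failed:", err)

# Wrapping each list in a Series aligns them on the index and pads with NaN
df = pd.DataFrame({"following": pd.Series(following), "followers": pd.Series(followers)})
print(df)
```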

Again, Tweepy provides built-in methods (api.followers and api.friends) that Cursor() can page through to retrieve what we need, and below are the first rows from the dataframe that is returned when passing the following parameters to our newly created function:

getUserInfo("TDataScience", 30)

[Image: first rows of the dataframe returned by getUserInfo("TDataScience", 30)]