GnipApi¶ ↑
Connect with different Gnip APIs and get data from streams. Currently Full Archive Search and PowerTrack APIs for Twitter data are implemented.
Documentation about Gnip APIs can be found here.
The Gnip API status page can be found here.
What is Gnip?¶ ↑
A Twitter division that offers access to Twitter data, both historical and real-time. Gnip is not restricted to Twitter, though; it offers a set of different data sources you can integrate, with Twitter being its main one.
Gnip APIs¶ ↑
Full Archive Search¶ ↑
It provides historical data with some aggregations and can fetch both activities and counts over a period of time. There are some limitations, so be sure to check the documentation.
The Search API can return a 503 Software Error, which is effectively a variant of a 500 error. It usually happens in specific situations with specific queries, but it is not repeatable 100% of the time. If you encounter this error, make your script wait a few seconds and retry. Alternatively, you can further break down the rules or time periods you're using. As far as I'm aware, it is most likely to happen when querying large amounts of data over wide periods.
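A minimal sketch of that wait-and-retry approach, assuming you simply want to re-issue the same call a few times before giving up. The `with_retries` helper is illustrative and not part of the gem:

```ruby
# Illustrative retry wrapper for flaky Search API calls.
# with_retries is a hypothetical helper, not part of gnip_api.
def with_retries(max_attempts: 3, wait: 5)
  attempts = 0
  begin
    attempts += 1
    yield
  rescue StandardError
    raise if attempts >= max_attempts # give up after the last attempt
    sleep wait                        # back off before retrying
    retry
  end
end

# Usage would look roughly like:
#   results = with_retries { GnipApi::Search.new.activities :rule => rule }
```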
PowerTrack¶ ↑
Provides ways to setup rules that act as filters/matchers and an HTTP stream endpoint that will send the results to the consumer.
The HTTP stream can suffer unexpected connection loss. Sometimes this is intended by Gnip, and sometimes it doesn't seem to be. Depending on what you do with the received data, you may be disconnected for being a slow consumer. Ideally, the reading loop should do nothing but read, with processing handled by a separate process or thread. GnipApi offers a few different methods to deal with this.
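One common pattern for keeping the reading loop fast is to push incoming messages onto a Queue and process them on a separate thread. This is a minimal sketch of the idea only; the message strings and the `upcase` step stand in for real activities and real processing, and with the gem the enqueue call would live inside the block you pass to the stream's consume method:

```ruby
queue = Queue.new
processed = []

# Worker thread: does the (potentially slow) processing off the reading loop.
worker = Thread.new do
  while (msg = queue.pop)   # a nil sentinel ends the loop
    processed << msg.upcase # stand-in for real processing work
  end
end

# The stream-reading side only enqueues, so it returns to reading immediately.
%w[message1 message2 message3].each { |m| queue.push(m) }

queue.push(nil) # signal the worker to finish
worker.join
```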
A word about rules¶ ↑
It can be tricky to define proper rules. Please read the documentation for each API to learn how rules work. The Search API and PowerTrack use a similar rule structure, but there are differences in what each can do.
In some cases a rule can match undesired information. This is because Gnip tokenizes the data and applies the rules to that parsed data. For example, URLs can be matched by accident, and it won't be clear exactly why. Gnip doesn't specify which fields of a source object it considers when matching, so be sure to target your rules carefully.
Installation¶ ↑
Add this line to your application's Gemfile:
gem 'gnip_api'
And then execute:
$ bundle
Or install it yourself as:
$ gem install gnip_api
Use the master branch to get more frequent updates on this gem.
Usage¶ ↑
Configure the gem¶ ↑
  GnipApi.configure do |config|
    config.user = 'someone'                 # Gnip account username
    config.password = 'something'           # Gnip password
    config.account = 'myGnipAccount'        # Your account's name
    config.logger = Logger.new('myLog.log') # You can also provide a custom logger
    config.source = 'twitter'               # General source, used if none is defined when querying
    config.label = 'mystream'               # General stream label, used if none is defined when querying
    config.request_timeout = 120            # Default timeout on all requests, defaults to 60
    config.debug = false                    # Defaults to false, enables/disables debug output in the log
    config.log_level = Logger::WARN         # Set it to Logger::DEBUG to inspect queries and data when troubleshooting
  end
Put the above code in an initializer if you're using Rails, or somewhere else if you aren't. After that you can interact with the Gnip APIs.
Note that you'll need a source and a label. Source is the data source within Gnip, such as Twitter, and label is the identifier of your stream.
Search API¶ ↑
Some notes¶ ↑
While using the Full Archive Search, or FAS as we call it, we faced some issues that you may encounter as well. The most notorious one is the 503 “You encountered a problem in our software” mentioned above. The client-side “solution”, or better put, workaround, is to iterate over the period from the client: instead of letting GNIP paginate the data, build smaller periods of time. For example, instead of requesting from 2016 to 2017, make 12 requests of 1 month each. We found that making the period smaller and smaller eventually makes it work. A higher-level process built on this gem splits any given period into smaller ones, iterates over the data, and re-runs missing periods, split further, to fill in missing data. The smallest period that seems to have a 100% chance of success is 1 hour. If you wonder who came up with this ugly solution, the answer is GNIP itself, after we talked to their support area about it. It doesn't seem proper to include this in the gem, since the error is not supposed to happen, but it may eventually be included as an alternative querying mode to mitigate the problem.
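The splitting idea can be sketched with a small helper that breaks a date range into fixed-size chunks; each chunk would then be issued as its own Search API request. The `split_period` helper below is hypothetical and not part of the gem:

```ruby
require 'date'

# Hypothetical helper: split [from, to) into consecutive chunks of at
# most step_days days each, returned as [start, end] pairs.
def split_period(from, to, step_days)
  periods = []
  cursor = from
  while cursor < to
    chunk_end = [cursor + step_days, to].min
    periods << [cursor, chunk_end]
    cursor = chunk_end
  end
  periods
end

# Each [start, end] pair would then become one request, e.g.
#   GnipApi::Search.new.counts :rule => rule, :from_date => start, :to_date => end
```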
Overview¶ ↑
The Search API allows you to get counts or activities in a time period, with a maximum period size of 30 days per request. PowerTrack rules are used as the query parameter, but be careful: PowerTrack operators may not be supported by the Search API or could behave differently. Read the Gnip docs to make sure. To access the Search API you first need a rule; you can use a PowerTrack Rule object for it:
rule = GnipApi::PowerTrack::Rule.new :value => 'keyword1 OR keyword2'
Then you can query the search endpoint to get counts or activities. For counts:
results = GnipApi::Search.new.counts :rule => rule
For activities:
results = GnipApi::Search.new.activities :rule => rule
Responses are parsed, so you can use the output like any other Ruby object. Activities are converted to Gnip::Activity objects and have everything else parsed just as if they came from the stream.
You can set different parameters:
results = GnipApi::Search.new.counts :rule => rule, :from_date => DateTime.parse('2016-01-01 00:00'), :to_date => DateTime.parse('2016-05-01 22:00'), :bucket => 'day'
For activities, there are a few extra considerations:
- A :max_results param indicates how many activities to return per response; valid values are 10 to 500, and the default is 100. This param does not work on counts.
- As you noticed, you pass a GnipApi::PowerTrack::Rule object to the search endpoint, and as you may also know, these objects have mostly 2 things: a value (the actual rule) and a tag. When querying activities on the Search API, you can optionally use a tag, which is returned on the activity along with the rule. The tag is taken from the rule object you pass; in other words, if you want a tag, add it to the GnipApi::PowerTrack::Rule object. It's not a valid param for the method.
- The :bucket option is only for counts.
When you query more than 30 days or more activities than :max_results, the results will include a :next token to iterate over the remaining pages. You can feed this token directly to a following request with the same parameters:
results = GnipApi::Search.new.counts :rule => rule, :from_date => DateTime.parse('2016-01-01 00:00'), :to_date => DateTime.parse('2016-05-01 22:00'), :bucket => 'day', :next_token => 'token_from_previous_request'
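The token-following loop can be sketched generically. Here `fetcher` is a stand-in for one Search API request; it must return a hash with a results array and a `:next` token that is nil on the last page. Neither `fetch_all` nor `fetcher` is part of the gem:

```ruby
# Generic next-token pagination sketch.
# fetcher.call(token) must return { :results => [...], :next => token_or_nil }.
def fetch_all(fetcher)
  all = []
  token = nil
  loop do
    page = fetcher.call(token)
    all.concat(page[:results])
    token = page[:next]
    break if token.nil? # no :next token means this was the last page
  end
  all
end

# With the gem, each iteration would roughly map to passing
# :next_token => token into GnipApi::Search#counts or #activities.
```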
PowerTrack¶ ↑
PowerTrack API has various functions. You can upload, delete and get rules and you can stream the activities. To create rules you need to create the rule objects:
  rules = []
  rules << GnipApi::PowerTrack::Rule.new :value => 'keyword1 OR keyword2', :tag => 'first_rule'
  rules << GnipApi::PowerTrack::Rule.new :value => 'keyword3 keyword4', :tag => 'second_rule'
Once you have your rule objects set, you can put them into an array and feed them to the PowerTrack Rules API:
GnipApi::PowerTrack::Rules.new.create rules
That will upload the rules to the stream. The endpoint doesn't return anything on success, but it validates rules before applying them, and any syntax problem will be raised as an error.
To get a list of rules defined in the stream:
GnipApi::PowerTrack::Rules.new.list
That will return an array of GnipApi::PowerTrack::Rule objects. In the same way as upload, the delete method removes one or more rules:
GnipApi::PowerTrack::Rules.new.delete rules
Same as upload, there is no response from Gnip when deleting. Important: there's no mapping between PowerTrack rules and the rules you create, and they do not generate any identifier. Gnip suggests generating a UID, including the tag, to create an identifier and keep the mapping yourself. When you delete a rule, the rule you send needs to be exactly the same one you uploaded; otherwise you would be trying to delete a non-existent rule or deleting a different rule, in both cases without any error from Gnip alerting you. Running a hash function over the JSON rule should do the trick.
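A minimal sketch of that hashing idea: derive a stable identifier by hashing the rule's JSON form, and store it alongside the rule so you can find the exact same rule again at delete time. The `rule_id` helper is illustrative; the gem provides no such method:

```ruby
require 'digest'
require 'json'

# Hypothetical helper: derive a stable identifier from a rule's JSON form.
# The same value/tag pair always hashes to the same id.
def rule_id(value, tag = nil)
  payload = { 'value' => value }
  payload['tag'] = tag if tag
  Digest::SHA256.hexdigest(JSON.generate(payload))
end
```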
Finally, you can stream the activities and do something with them:
  GnipApi::PowerTrack::Stream.new.consume do |messages|
    messages.select{|m| m.activity?}.each{|a| puts a.body}
    messages.select{|m| m.system_message?}.each{|s| puts s.message}
  end
Documentation¶ ↑
RDoc is integrated for this gem, and the generated documents are included in the repo to browse. You can execute:
$ rake rdoc
to regenerate them. Open doc/rdoc/index.html to inspect the bundled documentation.
WIP State¶ ↑
Various Gnip features aren't implemented yet because I lack access to them. I could implement them from documentation alone, but given my experience with Gnip, they might not work at all.
Contributing¶ ↑
- Fork it ( github.com/[my-github-username]/gnip_api/fork )
- Create your feature branch (git checkout -b my-new-feature)
- Commit your changes (git commit -am 'Add some feature')
- Push to the branch (git push origin my-new-feature)
- Create a new Pull Request
Feel free to ask/suggest ideas or features, or to report any bugs or issues.
This library was constructed with the help of Armando Andini who provided the basis to connect with the Gnip APIs.