README.md in PageRankr-3.1.2 vs README.md in PageRankr-3.2.0

- old
+ new

@@ -86,27 +86,27 @@
 ```
 If you don't specify a rank provider, then all of them are used.

 ``` ruby
-  PageRankr.ranks('www.google.com', :alexa_us, :alexa_global, :compete, :google)
-  #=> {:alexa_us=>1, :alexa_global=>1, :google=>10, :compete=>1}
+  PageRankr.ranks('www.google.com', :alexa_us, :alexa_global, :google)
+  #=> {:alexa_us=>1, :alexa_global=>1, :google=>10}

   # this also gives the same result
   PageRankr.ranks('www.google.com')
-  #=> {:alexa_us=>1, :alexa_global=>1, :google=>10, :compete=>1}
+  #=> {:alexa_us=>1, :alexa_global=>1, :google=>10}
 ```

 You can also use the alias `rank` instead of `ranks`.

-Valid rank trackers are: `:alexa_us, :alexa_global, :compete, :google`. To get this you can do:
+Valid rank trackers are: `:alexa_us, :alexa_global, :google`. To get this you can do:

 ``` ruby
-  PageRankr.rank_trackers #=> [:alexa_global, :alexa_us, :compete, :google]
+  PageRankr.rank_trackers #=> [:alexa_global, :alexa_us, :google]
 ```

-Alexa and Compete ranks are descending where 1 is the most popular. Google page ranks are in the range 0-10 where 10 is the most popular. If a site is unindexed then the rank will be nil.
+Alexa ranks are descending where 1 is the most popular. Google page ranks are in the range 0-10 where 10 is the most popular. If a site is unindexed then the rank will be nil.

 ## Use it a la carte!

 From versions >= 3, everything should be usable in a much more a la carte manner. If all you care about is google page rank (which I speculate is common) you can get that all by itself:

@@ -128,10 +128,27 @@
   # The body of the response
   tracker.body #=> "<html><head>..."
 ```

+## Rate limiting and proxies
+
+One of the annoying things about each of these services is that they really don't like you scraping data from them. To deal with this, they throttle traffic from a single machine. The simplest way around the throttling is to use proxy machines to make the requests.
+
+In PageRankr >= 3.2.0, using proxies is much simpler. The first thing you'll need is a proxy service. Two are provided [here](https://github.com/blatyo/page_rankr/tree/master/lib/page_rankr/proxy_services). A proxy service must define a `proxy` method that takes two arguments and returns a string like `user:password@192.168.1.1:50501`.
+
+Once you have a proxy service, you can tell PageRankr to use it. For example:
+
+``` ruby
+  PageRankr.proxy_service = PageRankr::ProxyServices::Random.new([
+    'user:password@192.168.1.1:50501',
+    'user:password@192.168.1.2:50501'
+  ])
+```
+
+Once PageRankr knows about your proxy service, every request will ask it for a proxy by calling the `proxy` method, passing the name of the tracker (e.g. `:ranks_google`) and the site that is being looked up. Hopefully, this information is sufficient for you to build a much smarter proxy service than the ones provided (pull requests welcome!). A sketch of one such custom service follows this diff.
+
 ## Fix it!

 If you ever find something is broken it should now be much easier to fix it with version >= 1.3.0. For example, if the xpath used to lookup a backlink is broken, just override the method for that class to provide the correct xpath.

 ``` ruby

@@ -196,23 +213,22 @@
   future version unintentionally.
 * Commit, do not mess with rakefile, version, or history.
   (if you want to have your own version, that is fine but bump
   version in a commit by itself I can ignore when I pull)
 * Send me a pull request. Bonus points for topic branches.

-## TODO Version 3-4
-* Use API's where possible
-* New Compete API
-* Some search engines throttle the amount of queries. It would be nice to know when this happens. Probably throw an exception.
+## TODO Version 4
+* Detect request throttling

 ## Contributors
 * [Dru Ibarra](https://github.com/Druwerd) - Use Google Search API instead of scraping.
 * [Iteration Labs, LLC](https://github.com/iterationlabs) - Compete rank tracker and domain indexes.
 * [Marc Seeger](http://www.marc-seeger.de) ([Acquia](http://www.acquia.com)) - Ignore invalid ranks that Alexa returns for incorrect sites.
 * [Rémy Coutable](https://github.com/rymai) - Update public_suffix_service gem.
 * [Jonathan Rudenberg](https://github.com/titanous) - Fix compete scraper.
 * [Chris Corbyn](https://github.com/d11wtq) - Fix google page rank url.
 * [Hans Haselberg](https://github.com/i0rek) - Update typhoeus gem.
 * [Priit Haamer](https://github.com/priithaamer) - Fix google backlinks lookup.
+* [Marty McKenna](https://github.com/martyMM) - Idea for proxy service.

 ## Shout Out

 Gotta give credit where credit's due! Original inspiration from:
\ No newline at end of file
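
The `proxy` contract described in the new "Rate limiting and proxies" section is small, so rolling your own service is straightforward. Below is a minimal sketch of one possibility; the class name and the per-tracker rotation strategy are illustrative, not part of the gem. The only details taken from the README above are that PageRankr calls `proxy(tracker, site)` and expects a `user:password@host:port` string back.

``` ruby
  # require 'page_rankr'  # assuming the gem is loaded

  # Hypothetical custom proxy service; only the proxy method's arguments
  # and return format come from the PageRankr documentation above.
  class PerTrackerProxyService
    def initialize(proxies)
      @proxies = proxies      # e.g. ['user:password@192.168.1.1:50501', ...]
      @counters = Hash.new(0) # independent rotation counter per tracker
    end

    # PageRankr passes the tracker name (e.g. :ranks_google) and the site
    # being looked up; return the next proxy string for that tracker.
    def proxy(tracker, site)
      index = @counters[tracker] % @proxies.length
      @counters[tracker] += 1
      @proxies[index]
    end
  end

  PageRankr.proxy_service = PerTrackerProxyService.new([
    'user:password@192.168.1.1:50501',
    'user:password@192.168.1.2:50501'
  ])
```

Because `proxy` receives both the tracker and the site, a smarter service could just as easily key its choice on the site, skip proxies that recently failed, or pull from an external pool.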