Monkeyshines is a tool for doing algorithmic scrapes. It's designed to handle large-scale scrapes that may exceed the capabilities of single-machine relational databases, so it plays nicely with Hadoop / Wukong, with distributed databases (MongoDB, Tokyo Cabinet, etc.), and with distributed job queues (e.g. "edamame/beanstalk":http://mrflip.github.com/edamame).

h2. Install

This is best run standalone -- not as a gem; it's still in heavy development. I recommend cloning

* http://github.com/mrflip/edamame
* http://github.com/mrflip/wuclan
* http://github.com/mrflip/wukong
* http://github.com/mrflip/monkeyshines (this repo)

into a common directory. Additionally, you'll need some of these gems:

* addressable (2.1.0)
* extlib (0.9.12)
* htmlentities (4.2.0)

And if you spell ruby with a 'j', you'll want

* jruby-openssl (0.5.2)
* json-jruby (1.1.7)

---------------------------------------------------------------------------

h2. Help!

Send Monkeyshines questions to the "Infinite Monkeywrench mailing list":http://groups.google.com/group/infochimps-code

---------------------------------------------------------------------------

h2. Request Queue

h3. Periodic requests

The request stream can be metered using read-through, scheduled (e.g. cron), or test-and-sleep strategies.

* Scheduled: an external scheduler (cron, say) triggers each pass over the queue.
* Test and sleep: a queue of resources is cyclically polled, sleeping whenever bored. (A sketch of this loop appears below, after the Scraper section.)

h2. Requests

* Base: simple fetch and store of a URI. (The URI specifies an immutable, unique resource.)
* Single mutable resource: want to check for updates over time.
* Timeline:
** Message stream, e.g. twitter search or user timeline. Want to do paginated requests back to the last-seen item.
** Feed: poll the resource and extract its contents, storing them by GUID. Want to poll frequently enough that a single-page request gives full coverage.

---------------------------------------------------------------------------

h2. Scraper

* HttpScraper --
** JSON
** HTML
*** \0 separates records, \t separates the initial fields;
*** map \ to \\, then map tab, carriage return and newline to \t, \r and \n respectively, so each record stays on one line (the control bytes in play are 0x09, 0x0A, 0x0D and 0x7F). See the sketch below.
* HeadScraper -- records the parameters from a HEAD request
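The flattening scheme above can be made concrete with a short sketch. This is only an illustration of the escaping order, not Monkeyshines' actual implementation; the flatten_field / flatten_record names are hypothetical, and stripping 0x7F outright is an assumption (the text lists the byte without saying what becomes of it).

<pre><code>
# Hypothetical sketch: backslash-escape first, then encode the separator
# characters, so each record occupies one \0-terminated line of
# tab-separated fields.
def flatten_field str
  str.
    gsub("\\"){ "\\\\" }.  # \       => \\ (must happen first)
    gsub("\t", '\t').      # tab     => the two characters '\t'
    gsub("\r", '\r').      # CR      => '\r'
    gsub("\n", '\n').      # newline => '\n'
    delete("\x7F")         # DEL: assumed stripped in this sketch
end

def flatten_record fields
  fields.map{|field| flatten_field(field.to_s) }.join("\t") << "\0"
end

# One record, one line: fields tab-separated, record NUL-terminated.
flatten_record ["12345", "a body\twith tab\nand newline"]
</code></pre>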
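The test-and-sleep metering described in the Request Queue section above amounts to a loop like the following. All names here -- queue, due?, fetch_and_store, reschedule! -- are illustrative stand-ins, not Monkeyshines' API.

<pre><code>
# Cycle the resource queue, fetching whatever is due and napping
# whenever a full pass produced no work.
loop do
  did_work = false
  queue.each do |resource|
    next unless resource.due?   # not yet ripe for a re-fetch
    fetch_and_store(resource)
    resource.reschedule!        # push its next-check time forward
    did_work = true
  end
  sleep 30 unless did_work      # bored: the whole pass found nothing to do
end
</code></pre>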
---------------------------------------------------------------------------

h2. Store

* Flat file (chunked)
* Key-value store
* Read-through cache

---------------------------------------------------------------------------

h2. Periodic

* Log only every N requests, or every t minutes, or whatever.
* Restart the session every hour.
* Close the file and start a new chunk every 4 hours or so. (This mitigates data loss if a file is corrupted, and makes for easy batch processing.)

---------------------------------------------------------------------------

h2. Pagination

h4. Session

* *Twitter Search*: Each request brings in up to 100 results in strict reverse-ID (pseudo-time) order. If the last item ID in a request is less than the previous scrape session's max_id, or if fewer than 100 results are returned, the scrape session is complete. We maintain two scrape_intervals: one spans from the earliest-seen search hit to the highest one from the previous scrape; the other ranges backwards from the highest item in _this_ scrape session (the first item in the first successful page request) to the lowest in this scrape session (the last item on the most recent successful page request).
** Set no upper limit on the first request.
** Request by page, holding the max_id fixed.
** Use the lowest ID from the previous request as the new max_id, or
** use the supplied 'next page' parameter.
* *Twitter Followers*: Each request brings in 100 followers in reverse order of when the relationship formed. A separate call to the user resource can tell you how many _total_ followers there are, and you can record how many there were at the end of the last scrape, but there's some slop (if 100 people in the middle of the list _unfollow_ and 100 more people at the front _follow_, the total will be the same). High-degree accounts may have as many as 2M followers (20,000 calls).
* *FriendFeed*: Up to four pages. Expiry is signalled by a result set of fewer than 100 results.
* Paginated: one resource, but one that requires one or more requests to retrieve in full.
** Paginated + limit (max_id/since_date): rather than requesting by increasing page number, request one page at a time with a limit parameter until the last item on the page overlaps the previous scrape. For example, say you are scraping search results, and that when you last made the request the max ID was 120_000; the current max_id is 155_000. Request the first page (no limit), then use the last result on each page as the new limit_id, until that last result is less than 120_000. (A sketch appears at the end of this section.)
** Paginated + stop_on_duplicate: request pages until the last item on the page matches an already-requested instance.
** Paginated + velocity_estimate: estimate the number of pages from the item-accrual rate and the time since the last scrape. For example, say a user acquires on average 4.1 followers/day and it has been 80 days since the last scrape. With 100 followers/request you will want to request ceil( 4.1 * 80 / 100 ) = 4 pages.

h4. Rescheduling

We want to time the next scrape so that it brings in a couple of pages, or one mostly-full page, of new items. This means tracking a rate (num_items / timespan) for each resource, clamped to min_reschedule / max_reschedule bounds. (A sketch appears below.)
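Here is one way the rescheduling rule could look, folding in the velocity estimate from the Pagination section. ITEMS_PER_PAGE and the clamp bounds are illustrative values, not Monkeyshines defaults.

<pre><code>
ITEMS_PER_PAGE = 100
MIN_RESCHEDULE =      60    # seconds: don't hammer fast-moving resources
MAX_RESCHEDULE = 24 * 3600  # seconds: revisit slow ones at least daily

# Choose a delay that should yield roughly one full page of new items,
# clamped to the min/max bounds.
def next_delay num_items, timespan
  rate  = num_items / timespan.to_f   # observed items per second
  delay = rate > 0 ? ITEMS_PER_PAGE / rate : MAX_RESCHEDULE
  [[delay, MIN_RESCHEDULE].max, MAX_RESCHEDULE].min
end

# The velocity estimate from the example above -- 4.1 followers/day,
# 80 days since the last scrape, 100 followers per request:
pages_needed = (4.1 * 80 / 100).ceil   # => 4
</code></pre>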
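Finally, a sketch of the paginated + limit (max_id) walk described above, using the numbers from that example. search_request and store are hypothetical stand-ins for the real fetch-and-store machinery.

<pre><code>
# Walk backwards from the newest item until we overlap the previous
# scrape session or run out of results.
PER_PAGE    = 100
prev_max_id = 120_000   # highest ID seen by the previous scrape session
max_id      = nil       # no upper limit on the first request

loop do
  results = search_request(:max_id => max_id)  # one page, newest first
  results.each{|result| store(result) }
  break if results.empty?
  break if results.last.id <= prev_max_id   # overlaps the previous session
  break if results.size < PER_PAGE          # short page: no older results
  max_id = results.last.id - 1              # slide the window backwards
end
</code></pre>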