README.rdoc in jsl-feedzirra-0.0.12.8 vs README.rdoc in jsl-feedzirra-0.0.12.9

- old
+ new

@@ -1,8 +1,8 @@
-== Feedzirra
+= Feedzirra
-=== Description
+== Description
Feedzirra is a feed library that is designed to get and update many feeds as quickly as possible. This includes using libcurl-multi through the taf2-curb[link:http://github.com/taf2/curb/tree/master] gem for faster http gets, and libxml through nokogiri[link:http://github.com/tenderlove/nokogiri/tree/master] and sax-machine[link:http://github.com/pauldix/sax-machine/tree/master] for faster parsing.
@@ -10,26 +10,24 @@
It allows for easy customization of feed parsing options through the definition of custom parsing classes, and allows you to take as little or as much control as you want in updating feeds. Feedzirra makes it easy to figure out which content in feeds is new by storing the previous retrieval of a feed in a key-value store. Feedzirra uses the "moneta" gem, which is a unified interface to key-value storage systems, in order to provide access to many different types of stores depending on your requirements.
-=== Installation
+== Installation
For now Feedzirra exists only on github. It also has a few gem requirements that are only on github. Before you start you need to have libcurl[link:http://curl.haxx.se/] and libxml[link:http://xmlsoft.org/] installed. If you're on Leopard you have both. Otherwise, you'll need to
-grab them. Once you've got those libraries, these are the gems that get used: nokogiri, pauldix-sax-machine, taf2-curb (note that this is a fork
-that lives on github and not the Ruby Forge version of curb), and pauldix-feedzirra. The feedzirra gemspec has all the dependencies so you should
-be able to get up and running with the standard github gem install routine:
+grab them.
Once you've got those libraries, you should be able to get up and running with the standard github gem install routine:
gem sources -a http://gems.github.com # if you haven't already
- gem install pauldix-feedzirra
+ gem install jsl-feedzirra
-=== Usage
+== Usage
-This experimental branch offers a new interface to feed fetching with persistent back-end stores. This allows you to
-easily run a script retrieving the feeds once per hour or once per day, and it will remember which feeds have been seen
-before and which are new. This features uses the Feedzirra::Reader interface.
+This experimental branch offers a new interface to feed fetching with persistent back-end stores. This allows you to easily run a script
+retrieving the feeds once per hour or once per day, and it will remember which feeds have been seen before and which are new. This feature
+uses the Feedzirra::Reader interface.
You can create a Feedzirra::Reader object after the Feedzirra library (with require 'feedzirra') is loaded as follows:
reader = Feedzirra::Reader.new('http://www.woostercollective.com/rss/index.xml')
feed = reader.fetch
@@ -49,88 +47,64 @@
the results of every fetch, so Feedzirra will maintain state between executions. Feedzirra currently supports filesystem, memcache and a Ruby Hash structure-based back end that doesn't attempt to persist any information.
Once you've retrieved a single feed, you can use the accessors below to query the results.
- # feed and entries accessors
- feed.title # => "Paul Dix Explains Nothing"
- feed.url # => "http://www.pauldix.net"
- feed.feed_url # => "http://feeds.feedburner.com/PaulDixExplainsNothing"
- feed.etag # => "GunxqnEP4NeYhrqq9TyVKTuDnh0"
- feed.last_modified # => Sat Jan 31 17:58:16 -0500 2009 # it's a Time object
+ # feed and entries accessors
+ feed.title # => "Paul Dix Explains Nothing"
+ feed.url # => "http://www.pauldix.net"
+ feed.feed_url # => "http://feeds.feedburner.com/PaulDixExplainsNothing"
+ feed.etag # => "GunxqnEP4NeYhrqq9TyVKTuDnh0"
+ feed.last_modified # => Sat Jan 31 17:58:16 -0500 2009 # it's a Time object
- entry = feed.entries.first
- entry.title # => "Ruby Http Client Library Performance"
- entry.url # => "http://www.pauldix.net/2009/01/ruby-http-client-library-performance.html"
- entry.author # => "Paul Dix"
- entry.summary # => "..."
- entry.content # => "..."
- entry.published # => Thu Jan 29 17:00:19 UTC 2009 # it's a Time object
- entry.categories # => ["...", "..."]
+ entry = feed.entries.first
+ entry.title # => "Ruby Http Client Library Performance"
+ entry.url # => "http://www.pauldix.net/2009/01/ruby-http-client-library-performance.html"
+ entry.author # => "Paul Dix"
+ entry.summary # => "..."
+ entry.content # => "..."
+ entry.published # => Thu Jan 29 17:00:19 UTC 2009 # it's a Time object
+ entry.categories # => ["...", "..."]
- # sanitizing an entry's content
- entry.title.sanitize # => returns the title with harmful stuff escaped
- entry.author.sanitize # => returns the author with harmful stuff escaped
- entry.content.sanitize # => returns the content with harmful stuff escaped
- entry.content.sanitize! # => returns content with harmful stuff escaped and replaces original (also exists for author and title)
- entry.sanitize! # => sanitizes the entry's title, author, and content in place (as in, it changes the value to clean versions)
- feed.sanitize_entries!
# => sanitizes all entries in place
+ # sanitizing an entry's content
+ entry.title.sanitize # => returns the title with harmful stuff escaped
+ entry.author.sanitize # => returns the author with harmful stuff escaped
+ entry.content.sanitize # => returns the content with harmful stuff escaped
+ entry.content.sanitize! # => returns content with harmful stuff escaped and replaces original (also exists for author and title)
+ entry.sanitize! # => sanitizes the entry's title, author, and content in place (as in, it changes the value to clean versions)
+ feed.sanitize_entries! # => sanitizes all entries in place
- # updating a single feed
- updated_feed = Feedzirra::Feed.update(feed)
+ # updating a single feed
+ updated_feed = Feedzirra::Feed.update(feed)
- # an updated feed has the following extra accessors
- updated_feed.updated? # returns true if any of the feed attributes have been modified. will return false if only new entries
- updated_feed.new_entries # a collection of the entry objects that are newer than the latest in the feed before update
+ # an updated feed has the following extra accessors
+ updated_feed.updated? # returns true if any of the feed attributes have been modified. will return false if only new entries
+ updated_feed.new_entries # a collection of the entry objects that are newer than the latest in the feed before update
- # fetching multiple feeds
- feed_urls = ["http://feeds.feedburner.com/PaulDixExplainsNothing", "http://feeds.feedburner.com/trottercashion"]
- feeds = Feedzirra::Reader.new(feed_urls).fetch
+ # fetching multiple feeds
+ feed_urls = ["http://feeds.feedburner.com/PaulDixExplainsNothing", "http://feeds.feedburner.com/trottercashion"]
+ feeds = Feedzirra::Reader.new(feed_urls).fetch
- # feeds is now a hash with the feed_urls as keys and the parsed feed objects as values.
If an error was thrown
- # there will be a Fixnum of the http response code instead of a feed object
+ # feeds is now a hash with the feed_urls as keys and the parsed feed objects as values. If an error was thrown
+ # there will be a Fixnum of the http response code instead of a feed object
- # updating multiple feeds. if you're using a persistent back-end, Feedzirra uses that to determine which entries are ones that you haven't seen before
- updated_feeds = Feedzirra::Reader.new(urls).fetch
+ # updating multiple feeds. if you're using a persistent back-end, Feedzirra uses that to determine which entries are ones that you haven't seen before
+ updated_feeds = Feedzirra::Reader.new(urls).fetch
- # defining custom behavior on failure or success. note that a return status of 304 (not updated) will call the on_success handler
- feed = Feedzirra::Reader.new("http://feeds.feedburner.com/PaulDixExplainsNothing",
- :on_success => lambda {|feed| puts feed.title },
- :on_failure => lambda {|url, response_code, response_header, response_body| puts response_body }).fetch
-
- # if a collection was passed into the initializer, the handlers will be called for each one
+ # defining custom behavior on failure or success. note that a return status of 304 (not updated) will call the on_success handler
+ feed = Feedzirra::Reader.new("http://feeds.feedburner.com/PaulDixExplainsNothing",
+ :on_success => lambda {|feed| puts feed.title },
+ :on_failure => lambda {|url, response_code, response_header, response_body| puts response_body }).fetch
+
+ # if a collection was passed into the initializer, the handlers will be called for each one
-=== Extending
+== Discussion
-Feedzirra is easily extended with custom parsing classes and persistent back-ends. You'll have to read the source to find out how, though, because we
-still haven't written the documentation. :(
-
-=== Benchmarks
-
-One of the goals of Feedzirra is speed.
This includes not only parsing, but fetching multiple feeds as quickly as possible. I ran a benchmark getting 20 feeds 10 times using Feedzirra, rFeedParser, and FeedNormalizer. For more details the {benchmark code can be found in the project in spec/benchmarks/feedzirra_benchmarks.rb}[http://github.com/pauldix/feedzirra/blob/7fb5634c5c16e9c6ec971767b462c6518cd55f5d/spec/benchmarks/feedzirra_benchmarks.rb]
-
- feedzirra 5.170000 1.290000 6.460000 ( 18.917796)
- rfeedparser 104.260000 12.220000 116.480000 (244.799063)
- feed-normalizer 66.250000 4.010000 70.260000 (191.589862)
-
-The result of that benchmark is a bit sketchy because of the network variability. Running 10 times against the same 20 feeds was meant to smooth some of that out. However, there is also a {benchmark comparing parsing speed in spec/benchmarks/parsing_benchmark.rb}[http://github.com/pauldix/feedzirra/blob/7fb5634c5c16e9c6ec971767b462c6518cd55f5d/spec/benchmarks/parsing_benchmark.rb] on an atom feed.
-
- feedzirra 0.500000 0.030000 0.530000 ( 0.658744)
- rfeedparser 8.400000 1.110000 9.510000 ( 11.839827)
- feed-normalizer 5.980000 0.160000 6.140000 ( 7.576140)
-
-There's also a {benchmark that shows the results of using Feedzirra to perform updates on feeds}[http://github.com/pauldix/feedzirra/blob/45d64319544c61a4c9eb9f7f825c73b9f9030cb3/spec/benchmarks/updating_benchmarks.rb] you've already pulled in. I tested against 179 feeds. The first is the initial pull and the second is an update 65 seconds later. I'm not sure how many of them support etag and last-modified, so performance may be better or worse depending on what feeds you're requesting.
-
- feedzirra fetch and parse 4.010000 0.710000 4.720000 ( 15.110101)
- feedzirra update 0.660000 0.280000 0.940000 ( 5.152709)
-
-=== Discussion
-
I'd like feedback on the api and any bugs encountered on feeds in the wild. I've set up a {google group here}[http://groups.google.com/group/feedzirra].
-==== Troubleshooting Installation
+== Troubleshooting Installation
*NOTE:* Some people have been reporting a few issues related to installation. First, the Ruby Forge version of curb is not what you want. It will not work. Nor will the curl-multi gem that lives on Ruby Forge. You have to get the taf2-curb[link:http://github.com/taf2/curb/tree/master] fork installed.
If you see this error when doing a require:
@@ -153,11 +127,11 @@
Another problem could be if you are running Mac Ports and you have libcurl installed through there. You need to uninstall it for curb to work! The version in Mac Ports is old and doesn't play nice with curb. If you're running Leopard, you can just uninstall and you should be golden. If you're on an older version of OS X, you'll then need to {download curl}[http://curl.haxx.se/download.html] and build from source. Then you'll have to install the taf2-curb gem again. You might have to perform the step above.
If you're still having issues, please let me know on the mailing list. Also, {Todd Fisher (taf2)}[link:http://github.com/taf2] is working on fixing the gem install. Please send him a full error report.
-=== TODO
+== TODO
This thing needs to hammer on many different feeds in the wild. I'm sure there will be bugs. I want to find them and crush them. I didn't bother using the test suite for feedparser. I wanted to start fresh.
Here are some more specific TODOs.
@@ -169,31 +143,8 @@
* I'm not keeping track of modified on entries. Should I add this?
* Clean up the fetching code inside feed.rb so it doesn't suck so hard.
* Make the feed_spec actually mock stuff out so it doesn't hit the net.
* Readdress how feeds determine if they can parse a document. Maybe I should use namespaces instead?
-=== LICENSE
+== LICENSE
-(The MIT License)
-
-Copyright (c) 2009:
-
-{Paul Dix}[http://pauldix.net]
-
-Permission is hereby granted, free of charge, to any person obtaining
-a copy of this software and associated documentation files (the
-'Software'), to deal in the Software without restriction, including
-without limitation the rights to use, copy, modify, merge, publish,
-distribute, sublicense, and/or sell copies of the Software, and to
-permit persons to whom the Software is furnished to do so, subject to
-the following conditions:
-
-The above copyright notice and this permission notice shall be
-included in all copies or substantial portions of the Software.
-
-THE SOFTWARE IS PROVIDED 'AS IS', WITHOUT WARRANTY OF ANY KIND,
-EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
-MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.
-IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY
-CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT,
-TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE
-SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
\ No newline at end of file
+This library is provided under the MIT License. See {the complete LICENSE}[link:files/LICENSE_rdoc.html] for details.
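
Note on the multi-feed usage shown in the diff: the README comments say that fetching several URLs returns a hash keyed by feed URL, whose values are parsed feed objects, or a Fixnum HTTP response code when the fetch failed. The following is a minimal sketch of how a caller might separate the two cases. It does not load Feedzirra: FakeFeed, split_results and the example.com URLs are hypothetical stand-ins, and it checks for Integer, which assumes a Ruby version where Fixnum has been folded into Integer (2.4 or later).

```ruby
# Hypothetical stand-in for a parsed feed object; this is NOT Feedzirra's class.
FakeFeed = Struct.new(:title)

# Splits a fetch-result hash into [parsed_feeds, failures].
# Assumption: failed fetches are represented by an Integer HTTP status code,
# as the README comments describe (there called a Fixnum).
def split_results(results)
  failed, parsed = results.partition { |_url, value| value.is_a?(Integer) }
  [parsed.to_h, failed.to_h]
end

# Example data shaped like the hash the README describes.
results = {
  "http://example.com/a.xml" => FakeFeed.new("Feed A"),
  "http://example.com/b.xml" => 404
}

parsed, failed = split_results(results)
parsed.each { |url, feed| puts "#{url}: #{feed.title}" }
failed.each { |url, code| puts "#{url}: HTTP #{code}" }
```

With the real library, the values in the first hash would be whatever feed objects the reader returns; only the Integer check for failures is taken from the README's description.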