# WP2TXT #

Wikipedia dump file to text converter

CAUTION: This software is at an experimental stage. Use with care!

### About ###

WP2TXT extracts plain text data from a Wikipedia dump file (encoded in XML and compressed with Bzip2), stripping all MediaWiki markup and other metadata. It is originally intended to be useful for researchers looking for an easy way to obtain open-source multilingual corpora, but it may be handy for other purposes as well.

### Features ###

* Convert Wikipedia dump files in various languages (I hope).
* Create output files of a specified size.
* Allow users to specify the text elements to be extracted/converted (page titles, section titles, lists, and tables).

Before version 0.4.0, WP2TXT came with a Mac/Windows GUI. It is now a pure command-line application. Sorry, GUI folks, but there seems to be more demand for an easy-to-hack CUI package than for a not-very-flexible GUI app.

### Installation ###

    $ gem install wp2txt

### Usage ###

Obtain a Wikipedia dump file (see the link below) with a file name such as:

    xxwiki-yyyymmdd-pages-articles.xml.bz2

where `xx` is a language code such as `en` (English) or `ja` (Japanese), and `yyyymmdd` is the date of creation (e.g. `20120601`).

Command line options are as follows (a complete example invocation is given at the end of this README):

    Usage: wp2txt [options]
    where [options] are:
      --input-file, -i:                Wikipedia dump file with .bz2 (compressed) or .txt (uncompressed) format
      --output-dir, -o:                Output directory (default: present working directory)
      --convert-off, -c:               Output XML (without converting to plain text)
      --list-off, -l:                  Exclude list items from output
      --heading-off, -d:               Exclude section titles from output
      --title-off, -t:                 Exclude page titles from output
      --table-off, --no-table-off, -a: Exclude tables from output (default: true)
      --template-off, --no-template-off, -e: Remove template notations from output (default: true)
      --redirect-off, -r:              Do not show redirect destinations
      --strip-marker, -s:              Remove symbols prefixed to list items, definitions, etc.
      --category-off, -g:              Do not show article category information
      --file-size, -f:                 Approximate size (in MB) of each output file (default: 10)
      --version, -v:                   Print version and exit
      --help, -h:                      Show this message

### Limitations ###

* Certain types of data, such as mathematical equations and computer source code, are not properly converted. Please remember that this software is originally intended for collecting “sentences” for linguistic studies.
* Extraction of normal text data can sometimes fail for various reasons (e.g. mismatched begin/end tags, language-specific formatting conventions, etc.).
* The conversion process can take far more time than you might expect. It could take several hours or more when dealing with a huge data set such as the English Wikipedia in a low-spec environment.
* Because of the nature of the task, WP2TXT demands considerable machine power and consumes a lot of memory/storage resources. The process could thus halt unexpectedly or, in the worst case, get stuck without terminating gracefully. Please understand this and use the software __at your own risk__.

### Useful Link ###

* [Wikipedia Database backup dumps](http://dumps.wikimedia.org/backup-index.html)

### Author ###

* Yoichiro Hasebe ()

### License ###

This software is distributed under the MIT License. Please see the LICENSE file.
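
### Example ###

A minimal example invocation, referenced from the Usage section above. It assumes an English dump file named as described there has been downloaded to the current directory; the file name and output directory are placeholders, and all options used are documented in the Usage section:

    # create a directory for the output files (placeholder name)
    $ mkdir output
    # convert the dump, writing plain text into ./output
    # in files of roughly 10 MB each (-f 10 is the default)
    $ wp2txt -i enwiki-20120601-pages-articles.xml.bz2 -o ./output -f 10

This reads the compressed dump, strips the MediaWiki markup, and writes the plain text results to `./output`.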