Sha256: 2bf2703c234373eca9971ce794196d48e7a3610efcc0f18ffd460e29f45efc86

Contents?: true

Size: 1.6 KB

Versions: 50

Compression:

Stored size: 1.6 KB

Contents

require 'set'

# Based on an unscientific sample of 63 documents I could find on my hard drive,
# all docx/pptx/xlsx files contain, at the minimum, the following files:
#
#   [Content_Types].xml
#   _rels/.rels
#   docProps/core.xml
#   docProps/app.xml
#
# Additionally, per file type, they contain the following:
#
#   word/document.xml
#   xl/workbook.xml
#   ppt/presentation.xml
#
# These are sufficient to say with certainty that a ZIP is in fact an Office document.
# Also, that unscientific sample revealed that I have come to dislike MS Office so much
# that I only have 63 documents on my entire workstation.
#
# We do not perform the actual _decoding_ of the Office documents here, because to read
# their contents we need to:
#
# * inflate the compressed part files (potential for deflate bombs)
# * parse the document XML (potential for XML parser exploitation)
#
# which are real threats and require adequate mitigation. For our purposes the
# token detection of specific filenames should be enough to say with certainty
# that a document _is_ an Office document, and not just a ZIP.
module FormatParser::ZIPParser::OfficeFormats
  OFFICE_MARKER_FILES = Set.new([
    '[Content_Types].xml',
    '_rels/.rels',
    'docProps/core.xml',
    'docProps/app.xml',
  ])

  def office_document?(filenames_set)
    OFFICE_MARKER_FILES.subset?(filenames_set)
  end

  def office_file_format_from_entry_set(filenames_set)
    if filenames_set.include?('word/document.xml')
      :docx
    elsif filenames_set.include?('xl/workbook.xml')
      :xlsx
    elsif filenames_set.include?('ppt/presentation.xml')
      :pptx
    else
      :unknown
    end
  end
end
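
The two methods above can be tried out in isolation with a small sketch. It assumes a stand-in module, `OfficeDetectionSketch` (a hypothetical name, not part of format_parser), that mirrors the marker-file check and the per-type lookup so the example runs without the gem loaded. The entry names in the sample sets are made up for illustration.

```ruby
require 'set'

# Hypothetical stand-in that mirrors the two methods above; it is not the
# gem's own module and exists only so this example is self-contained.
module OfficeDetectionSketch
  # Files assumed to be present in every docx/pptx/xlsx archive.
  MARKER_FILES = Set.new([
    '[Content_Types].xml',
    '_rels/.rels',
    'docProps/core.xml',
    'docProps/app.xml',
  ])

  # Main part of each document type, mapped to its format symbol.
  MAIN_PARTS = {
    'word/document.xml' => :docx,
    'xl/workbook.xml' => :xlsx,
    'ppt/presentation.xml' => :pptx,
  }.freeze

  module_function

  # True when every marker file appears among the archive's entry names.
  def office_document?(filenames_set)
    MARKER_FILES.subset?(filenames_set)
  end

  # Returns :docx, :xlsx, :pptx, or :unknown, based on entry names only.
  def office_file_format_from_entry_set(filenames_set)
    main_part = MAIN_PARTS.keys.find { |name| filenames_set.include?(name) }
    MAIN_PARTS.fetch(main_part, :unknown)
  end
end

# Made-up entry names, as they might appear in a .docx archive listing.
docx_entries = Set.new([
  '[Content_Types].xml',
  '_rels/.rels',
  'docProps/core.xml',
  'docProps/app.xml',
  'word/document.xml',
  'word/styles.xml',
])

# Made-up entry names for an ordinary ZIP archive.
plain_zip_entries = Set.new(['README.txt', 'src/main.rb'])

puts OfficeDetectionSketch.office_document?(docx_entries)
puts OfficeDetectionSketch.office_file_format_from_entry_set(docx_entries)
puts OfficeDetectionSketch.office_document?(plain_zip_entries)
puts OfficeDetectionSketch.office_file_format_from_entry_set(plain_zip_entries)
```

Only entry names are inspected, so nothing is inflated and no XML is parsed, which is the trade-off the comment above describes. Both arguments are passed as `Set` instances because `Set#subset?` expects a set.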

Version data entries

50 entries across 50 versions & 1 rubygems

Version Path
format_parser-0.18.0 lib/parsers/zip_parser/office_formats.rb
format_parser-0.17.0 lib/parsers/zip_parser/office_formats.rb
format_parser-0.16.1 lib/parsers/zip_parser/office_formats.rb
format_parser-0.16.0 lib/parsers/zip_parser/office_formats.rb
format_parser-0.15.1 lib/parsers/zip_parser/office_formats.rb
format_parser-0.15.0 lib/parsers/zip_parser/office_formats.rb
format_parser-0.14.1 lib/parsers/zip_parser/office_formats.rb
format_parser-0.14.0 lib/parsers/zip_parser/office_formats.rb
format_parser-0.13.6 lib/parsers/zip_parser/office_formats.rb
format_parser-0.13.5 lib/parsers/zip_parser/office_formats.rb
format_parser-0.13.4 lib/parsers/zip_parser/office_formats.rb
format_parser-0.13.3 lib/parsers/zip_parser/office_formats.rb
format_parser-0.13.2 lib/parsers/zip_parser/office_formats.rb
format_parser-0.13.1 lib/parsers/zip_parser/office_formats.rb
format_parser-0.13.0 lib/parsers/zip_parser/office_formats.rb
format_parser-0.12.4 lib/parsers/zip_parser/office_formats.rb
format_parser-0.12.2 lib/parsers/zip_parser/office_formats.rb
format_parser-0.12.1 lib/parsers/zip_parser/office_formats.rb
format_parser-0.12.0 lib/parsers/zip_parser/office_formats.rb
format_parser-0.11.0 lib/parsers/zip_parser/office_formats.rb