Sha256: c007d49b0e7f83f53a4483dece467d604ae7ac997319d17c61d50f09771ce522

Contents?: true

Size: 1.86 KB

Versions: 35

Compression:

Stored size: 1.86 KB

Contents

require 'set'

# Based on an unscientific sample of 63 documents I could find on my hard drive,
# all docx/pptx/xlsx files contain, at a minimum, the following files:
#
#   [Content_Types].xml
#   _rels/.rels
#   docProps/core.xml
#   docProps/app.xml
#
# Additionally, per file type, they contain the following:
#
#   word/document.xml
#   xl/workbook.xml
#   ppt/presentation.xml
#
# Taken together, these entries are sufficient to say with certainty that a ZIP is in
# fact an Office document. That unscientific sample also revealed that I have come to
# dislike MS Office so much that I only have 63 documents on my entire workstation.
#
# We do not perform the actual _decoding_ of the Office documents here, because to read
# their contents we need to:
#
# * inflate the compressed part files (potential for deflate bombs)
# * parse the document XML (potential for XML parser exploitation)
#
# which are real threats and require adequate mitigation. For our purposes the
# token detection of specific filenames should be enough to say with certainty
# that a document _is_ an Office document, and not just a ZIP.
module FormatParser::ZIPParser::OfficeFormats
  OFFICE_MARKER_FILES = Set.new([
    '[Content_Types].xml',
    '_rels/.rels',
    'docProps/core.xml',
    'docProps/app.xml',
  ])

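  # Returns true when every marker file is present in filenames_set,
  # which must be a Set of entry names taken from the ZIP.
  def office_document?(filenames_set)
    OFFICE_MARKER_FILES.subset?(filenames_set)
  end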

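  # Returns a [format, MIME type] pair based on which main document part is present,
  # falling back to [:unknown, 'application/zip'] when none of them is found.
  def office_file_format_and_mime_type_from_entry_set(filenames_set)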
    if filenames_set.include?('word/document.xml')
      [:docx, 'application/vnd.openxmlformats-officedocument.wordprocessingml.document']
    elsif filenames_set.include?('xl/workbook.xml')
      [:xlsx, 'application/vnd.openxmlformats-officedocument.spreadsheetml.sheet']
    elsif filenames_set.include?('ppt/presentation.xml')
      [:pptx, 'application/vnd.openxmlformats-officedocument.presentationml.presentation']
    else
      [:unknown, 'application/zip']
    end
  end
end
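
The filename-based detection described in the comment above can be tried in isolation. The following is a minimal sketch, not the gem's own API: `OfficeDetectionSketch`, `MAIN_PARTS`, `detect` and the sample entry lists are hypothetical names introduced here so the example runs without format_parser loaded. It assumes the entry names have already been read from the ZIP listing, so nothing is inflated or parsed.

```ruby
require 'set'

# Hypothetical stand-in that mirrors the logic of OfficeFormats so the
# example runs without the format_parser gem.
module OfficeDetectionSketch
  MARKER_FILES = Set.new([
    '[Content_Types].xml',
    '_rels/.rels',
    'docProps/core.xml',
    'docProps/app.xml',
  ])

  # Main document part for each file type, checked in the same order
  # as the original if/elsif chain.
  MAIN_PARTS = {
    'word/document.xml' => :docx,
    'xl/workbook.xml' => :xlsx,
    'ppt/presentation.xml' => :pptx,
  }

  module_function

  # Returns :docx, :xlsx, :pptx or :unknown for an Office document,
  # and nil when the marker files are not all present.
  def detect(filenames)
    names = filenames.to_set
    return nil unless MARKER_FILES.subset?(names)

    match = MAIN_PARTS.find { |path, _format| names.include?(path) }
    match ? match.last : :unknown
  end
end

# Illustrative entry lists (made up for this example).
DOCX_ENTRIES = [
  '[Content_Types].xml', '_rels/.rels', 'docProps/core.xml',
  'docProps/app.xml', 'word/document.xml', 'word/styles.xml'
]
PLAIN_ZIP_ENTRIES = ['README.txt', 'src/main.rb']

puts OfficeDetectionSketch.detect(DOCX_ENTRIES).inspect
puts OfficeDetectionSketch.detect(PLAIN_ZIP_ENTRIES).inspect
```

Only entry names are compared, which is the point of the approach: the check never touches the compressed data or the XML inside it.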

Version data entries

35 entries across 35 versions & 1 rubygems

Version Path
format_parser-2.10.0 lib/parsers/zip_parser/office_formats.rb
format_parser-2.9.0 lib/parsers/zip_parser/office_formats.rb
format_parser-2.8.0 lib/parsers/zip_parser/office_formats.rb
format_parser-2.7.2 lib/parsers/zip_parser/office_formats.rb
format_parser-2.7.1 lib/parsers/zip_parser/office_formats.rb
format_parser-2.7.0 lib/parsers/zip_parser/office_formats.rb
format_parser-2.6.0 lib/parsers/zip_parser/office_formats.rb
format_parser-2.5.0 lib/parsers/zip_parser/office_formats.rb
format_parser-2.4.5 lib/parsers/zip_parser/office_formats.rb
format_parser-2.4.4 lib/parsers/zip_parser/office_formats.rb
format_parser-2.4.3 lib/parsers/zip_parser/office_formats.rb
format_parser-2.3.0 lib/parsers/zip_parser/office_formats.rb
format_parser-2.2.1 lib/parsers/zip_parser/office_formats.rb
format_parser-2.2.0 lib/parsers/zip_parser/office_formats.rb
format_parser-2.1.0 lib/parsers/zip_parser/office_formats.rb
format_parser-2.0.0 lib/parsers/zip_parser/office_formats.rb
format_parser-2.0.0.pre.4 lib/parsers/zip_parser/office_formats.rb
format_parser-2.0.0.pre.3 lib/parsers/zip_parser/office_formats.rb
format_parser-2.0.0.pre.2 lib/parsers/zip_parser/office_formats.rb
format_parser-2.0.0.pre lib/parsers/zip_parser/office_formats.rb