WordPress.org

Make WordPress Core

Opened 5 years ago

Last modified 5 hours ago

#38231 assigned enhancement

Allow download_url to respect content-disposition header

Reported by: cklosows Owned by: johnjamesjacoby
Milestone: 5.9 Priority: normal
Severity: normal Version: 4.7
Component: HTTP API Keywords: has-patch has-unit-tests
Focuses: Cc:

Description

In #34938 it was proposed to use a filter in the wp_tempnam function which was rejected in favor of ignore the query string when downloading a file. This, however can still propose a problem when someone is the function to download a tokenized file where the path can be long.

In the case where the headers of the response provide a content-disposition header including a file name, it would be ideal to have the download_url function respect this file name and move the file name (which by default is the token in the path) to the file name requested.

Attachments (11)

38231.diff (810 bytes) - added by cklosows 5 years ago.
38231-tests.diff (1.2 KB) - added by psrpinto 5 months ago.
Unit tests
38231.1.diff (1.1 KB) - added by psrpinto 5 months ago.
Check return values and other improvements
38231-tests.1.diff (1.2 KB) - added by psrpinto 5 months ago.
Test that the file is actually created
38231.2.diff (982 bytes) - added by psrpinto 5 months ago.
Refactor according to @dd32's suggestions
38231-tests.2.diff (1.8 KB) - added by psrpinto 5 months ago.
Add a test for Content-Disposition=filename=foo.txt (i.e. without quotes)
38231.3.diff (3.3 KB) - added by johnjamesjacoby 13 days ago.
Updated regex to support asterisk in filename; lcase header key; add 2 more tests
38231.4.diff (3.3 KB) - added by johnjamesjacoby 13 days ago.
Rename data provider, and lowercase the "content-disposition" header keys
38231.5.diff (3.4 KB) - added by johnjamesjacoby 12 days ago.
More feedback from @costdev
38231.6.diff (3.4 KB) - added by johnjamesjacoby 12 days ago.
Alternate approach, using WP_REST_Attachments_Controller::get_filename_from_disposition()
38231.7.diff (7.5 KB) - added by johnjamesjacoby 5 hours ago.
Use sanitize_file_name() and validate_file() - props @costdev (comment imminent)

Download all attachments as: .zip

Change History (28)

@cklosows
5 years ago

#1 @cklosows
5 years ago

  • Keywords has-patch added

#2 @dd32
5 years ago

  • Component changed from Filesystem API to HTTP API

#3 @johnjamesjacoby
7 months ago

  • Keywords needs-testing added
  • Milestone changed from Awaiting Review to 5.8
  • Owner set to johnjamesjacoby
  • Status changed from new to assigned

Let's try to make this happen 5.8.

@psrpinto
5 months ago

Unit tests

#4 @psrpinto
5 months ago

  • Keywords has-unit-tests added

@psrpinto
5 months ago

Check return values and other improvements

@psrpinto
5 months ago

Test that the file is actually created

#5 follow-up: @psrpinto
5 months ago

  • Keywords needs-testing removed

38231.1.diff improves on @cklosows's original patch by:

  1. Explicitly checking the return value of preg_match()
  2. Returning an error when the call to wp_tempnam() fails
  3. Returning an error when the call to rename() fails

38231-tests.1.diff adds an extra assertion to make sure the file is actually created.

I've also now manually tested this and can confirm the Content-Disposition header is respected when calling download_url().

#6 @desrosj
5 months ago

  • Milestone changed from 5.8 to 5.9

Today is feature freeze for the 5.8 release. This one is really close! But unfortunately the time for including in 5.8 has passed.

I'm going to punt to 5.9 instead of Future Release since there is a recent momentum, and @psrpinto has done a great job iterating on the approaches here so far.

#7 in reply to: ↑ 5 @dd32
5 months ago

Replying to psrpinto:

38231.1.diff improves on @cklosows's original patch by:

  1. Explicitly checking the return value of preg_match()
  2. Returning an error when the call to wp_tempnam() fails
  3. Returning an error when the call to rename() fails

Hi @psrpinto! Thanks for the improved patch :)

In general, WordPress tries to use the most basic loose boolean checking it can, so for example (I'm being nit-picky here, since you're new to Core patches, this is intended to be constructive feedback)

$content_disposition = wp_remote_retrieve_header( $response, 'Content-Disposition' );
if ( ! empty( $content_disposition ) && 1 === preg_match( '/filename="([^ ]+)"/', $content_disposition, $matches ) ) {

can be simplified to simply:

$content_disposition = wp_remote_retrieve_header( $response, 'Content-Disposition' );
if ( $content_disposition && preg_match( '/filename="([^ ]+)"/', $content_disposition, $matches ) ) {

as a) We only want to know that $content_disposition is 'something' the preg_match will validate it further and b) any truthful value return is acceptable.

I personally think in the event that the Content-Disposition filename can't be used, it should be something like this, what do you think?

set tmpName = create random name tmp file
if ( disposition header is set ) {
   set tmpNameDisposition = create disposition-named tmp file
   if ( can create tmpNameDisposition AND can rename tmpName to tmpNameDisposition ) {
      set tmpName = tmpNameDisposition
      continue on with processing
   } else {
      continue on like no error was encountered, tmpName is still set to random
   }
}
/// .. continue on

So that if the header is invalid, can't be created on the filesystem, or something else odd the function can still succeed without erroring out. As this function is used within the WordPress update process, failing on some edgecase here could be disastrous.
In other words, treating this as a progressive enhancement if possible but retaining existing behaviour in the event that it can't be used.

Other edgecases worth considering here:

  • filename="file.zip" and filename=file.zip are both valid but the patch only handles the former
  • What if the filename in the URL and in the Content-Disposition header match? For example https://wordpress.org/wordpress-5.7.2.zip

@psrpinto
5 months ago

Refactor according to @dd32's suggestions

@psrpinto
5 months ago

Add a test for Content-Disposition=filename=foo.txt (i.e. without quotes)

#8 @psrpinto
5 months ago

Thanks for the feedback @dd32! I agree with pretty much all your points so I've now refactored the code to reflect your suggested changes.

Please do let me know if you have further suggestions, they will be very welcome and I will be happy to consider them.


What if the filename in the URL and in the Content-Disposition header match?

Filenames are generated through wp_tempnam() so there can be no collisions, since that method makes sure the filename is unique. So even if the filename in the URL and in the Content-Disposition header are the same, the temporary files that are created will have unique names.


Changes in 38231.2.diff:

  • No longer check that $content_disposition is not empty before the call to preg_match(). If it is empty, it won't match, so that check is unnecessary. (wp_remote_retrieve_header() returns either the header or an empty string).
  • Treat Content-Disposition as "progressive enhancement", falling back to original randomly-generated filename in case of error.
  • Support the case where the header contains no quotes: Content-Disposition: filename=foo.txt

Changes in 38231-tests.2.diff

  • Add test for the case where the header contains no quotes: Content-Disposition: filename=foo.txt
Last edited 5 months ago by psrpinto (previous) (diff)

#9 @dd32
5 months ago

38231-tests.2.diff Looks good to me @psrpinto! Thanks

The only thought I have is that the unlink could be moved to the else to avoid a file_exists() call, but that's all I got :) I'll leave that up to you or whomever commits this.

if ( tmpfname_disposition && rename ) {
   set ...
} else if ( tmpfname_disposition ) {
   unlink tmpfname_disposition
}

#10 @johnjamesjacoby
3 weeks ago

After testing the patch and doing some research, I believe this will need a bit more work.

(I've started that work, and will post an updated patch in the coming days.)

Citations:

In short:

  • The $header parameter of wp_remote_retrieve_header() needs to be lowercase – I'm opening a new core ticket imminently to give this more thought
  • The regex needs tweaking to account for relative paths and the alternate filename* directive:

    filename
    Is followed by a string containing the original name of the file transmitted. The filename is always optional and must not be used blindly by the application: path information should be stripped, and conversion to the server file system rules should be done. This parameter provides mostly indicative information. When used in combination with Content-Disposition: attachment, it is used as the default filename for an eventual "Save As" dialog presented to the user.

    filename*
    The parameters "filename" and "filename*" differ only in that "filename*" uses the encoding defined in RFC 5987. When both filename and filename* are present in a single header field value, filename* is preferred over filename when both are understood.

    Warning: The string following filename should always be put into quotes; but, for compatibility reasons, many browsers try to parse unquoted names that contain spaces.

Possible implications from the RFC, to confirm with additional unit tests:

   It is essential that recipients treat the specified filename as
   advisory only, and thus be very careful in extracting the desired
   information.  In particular:

   o  Recipients MUST NOT be able to write into any location other than
      one to which they are specifically entitled.  To illustrate the
      problem, consider the consequences of being able to overwrite
      well-known system locations (such as "/etc/passwd").  One strategy
      to achieve this is to never trust folder name information in the
      filename parameter, for instance by stripping all but the last
      path segment and only considering the actual filename (where 'path
      segments' are the components of the field value delimited by the
      path separator characters "\" and "/").

   o  Many platforms do not use Internet Media Types ([RFC2046]) to hold
      type information in the file system, but rely on filename
      extensions instead.  Trusting the server-provided file extension
      could introduce a privilege escalation when the saved file is
      later opened (consider ".exe").  Thus, recipients that make use of
      file extensions to determine the media type MUST ensure that a
      file extension is used that is safe, optimally matching the media
      type of the received payload.

   o  Recipients SHOULD strip or replace character sequences that are
      known to cause confusion both in user interfaces and in filenames,
      such as control characters and leading and trailing whitespace.

   o  Other aspects recipients need to be aware of are names that have a
      special meaning in the file system or in shell commands, such as
      "." and "..", "~", "|", and also device names.  Recipients SHOULD
      ignore or substitute names like these.

      Note: Many user agents do not properly handle the escape character
      "\" when using the quoted-string form.  Furthermore, some user
      agents erroneously try to perform unescaping of "percent" escapes
      (see Appendix C.2), and thus might misinterpret filenames
      containing the percent character followed by two hex digits.

@johnjamesjacoby
13 days ago

Updated regex to support asterisk in filename; lcase header key; add 2 more tests

#11 @johnjamesjacoby
13 days ago

38231.3.diff confirms (with unit tests) that:

  • wp_tempnam() prevents path traversal
  • single, double, and no quotes all work
  • both filename= and filename*= now supported

This ticket was mentioned in Slack in #core by jjj. View the logs.


13 days ago

@johnjamesjacoby
13 days ago

Rename data provider, and lowercase the "content-disposition" header keys

#13 @johnjamesjacoby
13 days ago

38231.4.diff evolves 3.diff with the following feedback from @costdev:

  • Prefer data_ over provider_ to match existing data pattern from other tests
  • Lower case content-disposition in our fake headers arrays, because wp_remote_retrieve_header() won't have the luxury of extending Requests_Utility_CaseInsensitiveDictionary in our filtered data

@johnjamesjacoby
12 days ago

More feedback from @costdev

#14 @johnjamesjacoby
12 days ago

38231.5.diff includes the following improvements:

  • Prefers assertMatchesRegularExpression over assertContains for future PHPUnit compatibility
  • Adds array keys to data provider for easier identification

#15 @ocean90
12 days ago

Are you aware of WP_REST_Attachments_Controller::get_filename_from_disposition() (Reference)?

@johnjamesjacoby
12 days ago

Alternate approach, using WP_REST_Attachments_Controller::get_filename_from_disposition()

#16 @johnjamesjacoby
12 days ago

Hey @ocean90 👋 thank you for mentioning that. I was unaware and it seems like a good idea to use it.

  • It is a public static method so intended to be used elsewhere
  • It is always included by WordPress core
  • It has an array of unit tests already to match RFC6266

38231.6.diff updates the approach to use it over the bespoke regular expression.

@johnjamesjacoby
5 hours ago

Use sanitize_file_name() and validate_file() - props @costdev (comment imminent)

#17 @johnjamesjacoby
5 hours ago

38231.7.diff includes the following iterations:

  • Abandons WP_REST_Attachments_Controller::get_filename_from_disposition() approach. @costdev and I collaborated privately for a few full days, and concluded that it does not fully sanitize a filename from an untrusted source in the many ways that download_url() can be used.
  • Looks for 'attachment; filename=' instead of filename=. Per the specification, the Content-Disposition header only triggers save-as dialogs when the attachment; "type" is sent with it. (The inline; type would be displayed as a web page instead of being downloaded anyways. The type is a required parameter.)
  • Uses sanitize_file_name() to ensure that the suggested filename value is as close to acceptable for the web server and WordPress rules as possible. The nice thing about this is that values with a period in them will automatically be checked against wp_get_mime_types() using wp_check_filetype(). The value in doing this here (early) is that the temporary file name can still be used if the suggestion happens to be illegal.
  • Uses validate_file() to avoid directory traversal. This adequately prevents the filename value from including relative file paths and writing to unintended locations in the file system.
  • Includes 6 separate tests with 9 total assertions. These tests confirm that valid file name values are correctly saved, and invalid file name values are correctly rejected.

Massive props again to @costdev for going the extra mile with me behind the scenes on this. 🙏

Barring anything we've missed, this patch is ready to be committed.

Note: See TracTickets for help on using tickets.