Drupal file migrations: The s3fs module

A recent project gave me the opportunity to familiarize myself with the Drupal 8 version of the S3 File System (s3fs) module (having used the D7 version briefly in the distant past). This module provides an s3:// stream wrapper for files stored in an S3 bucket, allowing them to be used as seamlessly as locally stored public and private files. First we'll present the migrations and some of the plugins implemented to support importing files stored on S3; further below we'll go into some of the challenges we faced along the way.

Our client was already storing video files in an S3 bucket, and it was decided that for the Drupal site we would also store image files there. The client handled bulk uploading of images to an "image" folder within the bucket, using the same (relative) paths as those stored for the images in the legacy database. Thus, for migration we did not need to physically copy files around (the bane of many a media migration!) - we "merely" needed to create the appropriate entities in Drupal pointing at the S3 location of the files.

The following examples are modified from the committed code - to obfuscate the client/project, and to simplify so we focus on the subject at hand.

Image migrations

Gallery images

In the legacy database all gallery images were stored in a table named asset_metadata, which is structured very much like Drupal's file_managed table, with the file paths in an asset_path column. The file migration looked like this:

id: acme_image
source:
  plugin: acme
process:
  filename:
    plugin: callback
    callable: basename
    source: asset_path
  uri:
    # Construct the S3 URI - see implementation below.
    plugin: acme_s3_uri
    source: asset_path
  # Source data created/last_modified fields are YYYY-MM-DD HH:MM:SS - convert
  # them to the classic UNIX timestamps Drupal loves. Oh, and they're optional,
  # so when empty leave them empty and let Drupal set them to the current time.
  created:
    -
      plugin: skip_on_empty
      source: created
      method: process
    -
      plugin: callback
      callable: strtotime
  changed:
    -
      plugin: skip_on_empty
      source: last_modified
      method: process
    -
      plugin: callback
      callable: strtotime
destination:
  plugin: entity:file

Because we also needed to construct the S3 URIs in places besides the acme_s3_uri process plugin, we implemented the construction in a trait, which cleans up some inconsistencies and prepends the image location:

trait AcmeMakeS3Uri {
  /**
   * Turn a legacy image path into an S3 URI.
   *
   * @param string $value
   *
   * @return string
   */
  protected function makeS3Uri($value) {
    // Some have leading tabs.
    $value = trim($value);
    // Path fields are inconsistent about leading slashes.
    $value = ltrim($value, '/');
    // Sometimes they contain doubled-up slashes.
    $value = str_replace('//', '/', $value);
    return 's3://image/' . $value;
  }
}

So, the process plugin in the image migration above uses the trait to construct the URI, and verifies that the file is actually in S3 - if not, we skip it. See the Challenges and Contributions section below for more on the s3fs_file table.

/**
 * Turn a legacy image path into an S3 URI.
 *
 * @MigrateProcessPlugin(
 *   id = "acme_s3_uri"
 * )
 */
class AcmeS3Uri extends ProcessPluginBase {
  use AcmeMakeS3Uri;
 
  /**
   * {@inheritdoc}
   */
  public function transform($value, MigrateExecutableInterface $migrate_executable, Row $row, $destination_property) {
    $uri = $this->makeS3Uri($value);
    // For now, skip any images not cached by s3fs.
    $s3_uri = \Drupal::database()->select('s3fs_file', 's3')
      ->fields('s3', ['uri'])
      ->condition('uri', $uri)
      ->execute()
      ->fetchField();
    if (!$s3_uri) {
      throw new MigrateSkipRowException("$uri missing from s3fs_file table");
    }
    return $uri;
  }
}

The above creates the file entities - next, we need to create the media entities that reference those file entities via an entity reference field (and carry some additional fields of their own). These media entities are then referenced from content entities.

id: acme_image_media
source:
  plugin: acme
process:
  # For the media "name" property - displayed at /admin/content/media - our
  # first choice is the image caption, followed by the "event_name" field in
  # our source table. If necessary, we fall back to the original image path.
  name:
    -
      # Produces an array containing only the non-empty values.
      plugin: callback
      callable: array_filter
      source:
        - caption
        - event_name
        - asset_path
    -
      # From the array, pass on the first value as a scalar.
      plugin: callback
      callable: current
    -
      # Some captions are longer than the name property length.
      plugin: substr
      length: 255
  # Entity reference to the image - convert the source ID to Drupal's file ID.
  field_media_image/target_id:
    plugin: migration_lookup
    migration: acme_image
    source: id
  # Use the name we computed above as the alt text.
  field_media_image/alt: '@name'
  # We need to explicitly set the image dimensions in the field's width/height
  # subfields (more on this below under Challenges and Contributions). Note that in
  # the process pipeline you can effectively create temporary fields which can be
  # used later in the pipeline - just be sure they won't conflict with
  # anything that might be used within the Drupal entity.
  _uri:
    plugin: acme_s3_uri
    source: asset_path
  _image_dimensions:
    plugin: acme_image_dimensions
    source: '@_uri'
  field_media_image/width: '@_image_dimensions/width'
  field_media_image/height: '@_image_dimensions/height'
  caption: caption
destination:
  plugin: entity:media
  default_bundle: image
migration_dependencies:
  required:
    - acme_image

Other images

The gallery images have their own metadata table - but there are many other images which are simply stored as paths in content tables (in some cases, there are multiple such path fields in a single table). One might be tempted to deal with these in process plugins in the content migrations - creating the file and media entities on the fly - but that would be, well, ugly. Instead we implemented a drush command, run before our migration tasks, to canonicalize and gather those paths into a single table, which then feeds the acme_image_consolidated and acme_image_media_consolidated migrations (simpler versions of acme_image and acme_image_media, since "path" is the only available source field - see the sketch after the command below).

function drush_acme_migrate_gather_images() {
  // Key is legacy table name, value is list of image path columns to migrate.
  $table_fields = [
    'person' => [
      'profile_picture_path',
      'left_standing_path',
      'right_standing_path',
    ],
    'event' => [
      'feature_image',
      'secondary_feature_image',
    ],
    'subevent' => [
      'generated_medium_thumbnail',
    ],
    'news_article' => [
      'thumbnail',
    ]
  ];
  $legacy_db = Database::getConnection('default', 'migrate');
  // Create the table if necessary.
  if (!$legacy_db->schema()->tableExists('consolidated_image_paths')) {
    $table = [
      'fields' => [
        'path' => [
          'type' => 'varchar',
          'length' => 191,  // Longest known path is 170.
          'not null' => TRUE,
        ],
      ],
      'primary key' => ['path'],
    ];
    $legacy_db->schema()->createTable('consolidated_image_paths', $table);
    drush_print('Created consolidated_image_paths table');
  }
  $max = 0;
  foreach ($table_fields as $table => $field_list) {
    drush_print("Gathering paths from $table");
    $count = 0;
    $query = $legacy_db->select($table, 't')
      ->fields('t', $field_list);
    foreach ($query->execute() as $row) {
      // Iterate the image path columns returned in the row.
      foreach ($row as $path) {
        if ($path) {
          $len = strlen($path);
          if ($len > $max) {
            $max = $len;
          }
          $path = str_replace('//', '/', $path);
          $count++;
          $legacy_db->merge('consolidated_image_paths')
            ->key('path', $path)
            ->execute();
        }
      }
    }
    // Note we will end up with far fewer rows in the table due to duplication.
    drush_print("$count paths added from $table");
  }
  drush_print("Maximum path length is $max");
}
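
For reference, the consolidated file migration then simply reads that table and reuses the same URI-construction plugin. A minimal sketch - the source plugin name here is illustrative, not the real one:

id: acme_image_consolidated
source:
  # Hypothetical source plugin reading the consolidated_image_paths table.
  plugin: acme_consolidated_image
process:
  filename:
    plugin: callback
    callable: basename
    source: path
  uri:
    plugin: acme_s3_uri
    source: path
destination:
  plugin: entity:file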

Video migrations

The legacy database contained a media table referencing videos tagged with three different types - internal, external, and embedded. "Internal" videos were those stored in S3, with a relative path in the internal_url column; "external" videos (most on client-specific domains, but with some YouTube domains as well) had a full URL in the external_url column; and "embedded" videos were, with very few exceptions, YouTube videos with the YouTube ID in the embedded_id column. It was decided that we would migrate the internal and YouTube videos, ignoring the rest of the external/embedded videos. Here we focus on the internal (S3-based) videos.

id: acme_video
source:
  plugin: acme_internal_video
  constants:
    s3_prefix: s3://
process:
  _trimmed_url:
    # Since the callback process plugin only permits a single source value to be
    # passed to the specified PHP function, we have a custom plugin which enables us
    # to pass a character list to be trimmed.
    plugin: acme_trim
    source: internal_url
    trim_type: left
    charlist: /
  uri:
    -
      plugin: concat
      source:
        - constants/s3_prefix
        - '@_trimmed_url'
    -
      # Make sure the referenced file actually exists in S3 (does a simple query on
      # the s3fs_file table, throwing MigrateSkipRowException if missing).
      plugin: acme_skip_missing_file
  fid:
    # This operates much like migrate_plus's entity_lookup, to return an existing
    # entity ID based on arbitrary properties. The purpose here is if the file URI
    # is already in file_managed, point the migrate map table to the existing file
    # entity - otherwise, a new file entity will be created.
    plugin: acme_load_by_properties
    entity_type: file
    properties: uri
    source: '@uri'
    default_value: NULL
  filename:
    plugin: callback
    callable: basename
    source: '@uri'
destination:
  plugin: entity:file
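
The acme_trim plugin itself is only a few lines - roughly along these lines (a sketch, not the exact committed code):

/**
 * Trims a string, with a configurable trim direction and character list.
 *
 * @MigrateProcessPlugin(
 *   id = "acme_trim"
 * )
 */
class AcmeTrim extends ProcessPluginBase {

  /**
   * {@inheritdoc}
   */
  public function transform($value, MigrateExecutableInterface $migrate_executable, Row $row, $destination_property) {
    // Fall back to PHP's default whitespace character list.
    $charlist = isset($this->configuration['charlist']) ? $this->configuration['charlist'] : " \t\n\r\0\x0B";
    $trim_type = isset($this->configuration['trim_type']) ? $this->configuration['trim_type'] : 'both';
    switch ($trim_type) {
      case 'left':
        return ltrim($value, $charlist);

      case 'right':
        return rtrim($value, $charlist);

      default:
        return trim($value, $charlist);
    }
  }

}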

The media entity migration is pretty straightforward:

id: acme_video_media
source:
  plugin: acme_internal_video
  constants:
    true: 1
process:
  status: published
  name: title
  caption: caption
  # The source column media_date is YYYY-MM-DD HH:MM:SS format - the Drupal field is
  # configured as date-only, so the source value must be truncated to YYYY-MM-DD.
  date:
    -
      plugin: skip_on_empty
      source: media_date
      method: process
    -
      plugin: substr
      length: 10
  field_media_video/0/target_id:
    -
      plugin: migration_lookup
      migration: acme_video
      source: id
      no_stub: true
    -
      # If we haven't migrated a file entity, skip this media entity.
      plugin: skip_on_empty
      method: row
  field_media_video/0/display: constants/true
  field_media_video/0/description: caption
destination:
  plugin: entity:media
  default_bundle: video
migration_dependencies:
  required:
    - acme_video

Did I mention that we needed to create a node for each video, linking to related content of other types? Here we go:

id: acme_video_node
source:
  plugin: acme_internal_video
  constants:
    text_format: formatted
    url_prefix: http://www.acme.com/media/
    s3_prefix: s3://image/
process:
  title: title
  status: published
  teaser/value: caption
  teaser/format: constants/text_format
  length:
    # Converts HH:MM:SS to integer seconds. Left as an exercise to the reader.
    plugin: acme_video_length
    source: duration
  video:
    plugin: migration_lookup
    migration: acme_video_media
    source: id
    no_stub: true
  # Field to preserve the original URL.
  old_url:
    plugin: concat
    source:
      - constants/url_prefix
      - url_name
  _trimmed_thumbnail:
    plugin: acme_trim
    trim_type: left
    charlist: '/'
    source: thumbnail
  teaser_image:
    -
      plugin: skip_on_empty
      source: '@_trimmed_thumbnail'
      method: process
    -
      # Form the URI as stored in file_managed.
      plugin: concat
      source:
        - constants/s3_prefix
        - '@_trimmed_thumbnail'
    -
      # Look up the fid.
      plugin: acme_load_by_properties
      entity_type: file
      properties: uri
    -
      # Find the media entity referencing that fid.
      plugin: acme_load_by_properties
      entity_type: media
      properties: field_media_image
  # Note that for each of these entity reference fields we skipped some content,
  # so we need to make sure stubs aren't created for the missing content. Also
  # note that the source fields here are populated in a PREPARE_ROW event.
  related_people:
    plugin: migration_lookup
    migration: acme_people
    source: related_people
    no_stub: true
  related_events:
    plugin: migration_lookup
    migration: acme_event
    source: related_events
    no_stub: true
  tag_keyword:
    plugin: migration_lookup
    migration: acme_keyword
    source: keyword_ids
    no_stub: true
destination:
  plugin: entity:node
  default_bundle: video
migration_dependencies:
  required:
    - acme_image_media
    - acme_video_media
    - acme_people
    - acme_event
    - acme_keyword
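
The acme_load_by_properties plugin used above (for both the file and media lookups) is essentially a thin wrapper around loadByProperties() - a simplified sketch, ignoring the handling of multiple properties:

/**
 * Return an existing entity ID matching a property value, if any.
 *
 * @MigrateProcessPlugin(
 *   id = "acme_load_by_properties"
 * )
 */
class AcmeLoadByProperties extends ProcessPluginBase {

  /**
   * {@inheritdoc}
   */
  public function transform($value, MigrateExecutableInterface $migrate_executable, Row $row, $destination_property) {
    // In the real plugin the storage handler would be injected rather than
    // obtained through the static \Drupal wrapper.
    $storage = \Drupal::entityTypeManager()
      ->getStorage($this->configuration['entity_type']);
    $entities = $storage->loadByProperties([
      $this->configuration['properties'] => $value,
    ]);
    if ($entity = reset($entities)) {
      return $entity->id();
    }
    return isset($this->configuration['default_value']) ? $this->configuration['default_value'] : NULL;
  }

}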

Auditing missing files

A useful thing to know (particularly with the client incrementally populating the S3 bucket with image files) is what files are referenced in the legacy tables but not actually in the bucket. Below is a drush command we threw together to answer that question - it will query each legacy image or video path field we're using, construct the s3:// version of the path, and look it up in the s3fs_file table to see if it exists in S3.

/**
 * Find files missing from S3.
 */
function drush_acme_migrate_missing_files() {
  $legacy_db = Database::getConnection('default', 'migrate');
  $drupal_db = Database::getConnection();
  $table_fields = [
    [
      'table_name' => 'asset_metadata',
      'url_column' => 'asset_path',
      'date_column' => 'created',
    ],
    [
      'table_name' => 'media',
      'url_column' => 'internal_url',
      'date_column' => 'media_date',
    ],
    [
      'table_name' => 'person',
      'url_column' => 'profile_picture_path',
      'date_column' => 'created',
    ],
    // … on to 9 more columns among three more tables...
  ];
 
  $header = 'uri,legacy_table,legacy_column,date';
  drush_print($header);
  foreach ($table_fields as $table_info) {
    $missing_count = 0;
    $total_count = 0;
    $table_name = $table_info['table_name'];
    $url_column = $table_info['url_column'];
    $date_column = $table_info['date_column'];
    $query = $legacy_db->select($table_name, 't')
      ->fields('t', [$url_column])
      ->isNotNull($url_column)
      ->condition($url_column, '', '<>');
    if ($table_name == 'media') {
      $query->condition('type', 'INTERNALVIDEO');
    }
    if ($table_name == 'person') {
      // This table functions much like Drupal's node table.
      $query->innerJoin('publishable_entity', 'pe', 't.id=pe.id');
      $query->fields('pe', [$date_column]);
    }
    else {
      $query->fields('t', [$date_column]);
    }
    $query->distinct();
    foreach ($query->execute() as $row) {
      $path = trim($row->$url_column);
      if ($path) {
        $total_count++;
        // Paths are inconsistent about leading slashes.
        $path = ltrim($path, '/');
        // Sometimes they have doubled-up slashes.
        $path = str_replace('//', '/', $path);
        if ($table_name == 'media') {
          $s3_path = 's3://' . $path;
        }
        else {
          $s3_path = 's3://image/' . $path;
        }
        $s3 = $drupal_db->select('s3fs_file', 's3')
          ->fields('s3', ['uri'])
          ->condition('uri', $s3_path)
          ->execute()
          ->fetchField();
        if (!$s3) {
          $output_row = "$s3_path,$table_name,$url_column,{$row->$date_column}";
          drush_print($output_row);
          $missing_count++;
        }
      }
    }
    drush_log("$missing_count of $total_count files missing in $table_name column $url_column", 'ok');
  }
}

Challenges and contributions

The s3fs module's primary use case is one where the configured S3 bucket is used only by the Drupal site, and populated directly by file uploads through Drupal - our project was an outlier both in having all files in the S3 bucket first, and in sheer volume. A critical piece of the implementation is the s3fs_file table, which caches metadata for all files in the bucket so Drupal rarely needs to access the bucket itself other than on file upload (since file URIs are converted to direct S3 URLs when rendering, web clients fetch files straight from S3, not through Drupal). In our case, the client had an existing S3 bucket which contained all the video files (and more) used by their legacy site, and to which they bulk uploaded the image files directly, so we did not need to copy files around during the migration. The module does have an s3fs-refresh-cache command to populate the s3fs_file table from the current bucket contents, but we did have to deal with some issues around the cache table.

Restriction on URI lengths

As soon as we started trying to use drush s3fs-refresh-cache, we ran into the existing issue Getting Exception 'PDOException'SQLSTATE[22001] When Running drush s3fs-refresh-cache - URIs in the bucket longer than the 255-character length of s3fs_file's uri column. The exception aborted the refresh entirely - and because the refresh operation builds a temporary version of the table from scratch, then swaps it for the "live" table, a single overlong URI prevented any file metadata from being refreshed. I submitted a patch implementing the simplest workaround - generating a message and ignoring overly-long URIs. Discussion continues around an alternate approach, but we used my patch in our project.

Lost primary key

So, once we got the cache refresh to work, we found serious performance problems. We had stumbled on an existing issue, "s3fs_file" table has no primary key. I tracked down the cause - because the uri column is 255 characters long, with InnoDB it cannot be indexed when using a multibyte collation such as utf8_general_ci. And Drupal core has a bug, DatabaseSchema_mysql::createTableSql() can't set table collation, preventing the setting of the utf8_bin collation directly in the table schema. The s3fs module works around that bug when creating the s3fs_file table at install time by altering the collation after table creation - but the cache refresh created a new cache table using only the schema definition and did not pick up the altered collation. Thus, only people like us who used cache refresh would lose the index, and those with more modest bucket sizes might never even notice. My patch to apply the collation (later refined by jansete) was committed to the s3fs module.
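
Conceptually, the workaround is just to apply to the refreshed table the same collation change the installer applies - something along these lines (simplified; the committed patch is more careful about where and when this runs):

// Give the uri column a binary collation so it can serve as the primary key.
Database::getConnection()->query(
  "ALTER TABLE {s3fs_file} CONVERT TO CHARACTER SET utf8 COLLATE utf8_bin"
);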

Scalability of cache refresh

As the client loaded more and more images into the bucket, drush s3fs-refresh-cache started running out of memory. Our bucket was quite large (1.7 million files at last count), and the refresh function gathered all file metadata in memory before writing it to the database. I submitted a patch to chunk the metadata to the db within the loop, which has been committed to the module.
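
The change itself is conceptually tiny - rather than building one giant array, flush it to the database every so many records. The names below are illustrative, not the module's actual internals:

$metadata_buffer = [];
foreach ($s3_listing as $s3_object) {
  // convert_s3_metadata() stands in for the module's conversion of an S3
  // listing entry into an s3fs_file row.
  $metadata_buffer[] = convert_s3_metadata($s3_object);
  // Write each chunk as it fills, keeping memory usage flat.
  // write_metadata_chunk() stands in for a bulk insert into s3fs_file.
  if (count($metadata_buffer) >= 10000) {
    write_metadata_chunk($metadata_buffer);
    $metadata_buffer = [];
  }
}
// Write any remainder.
if ($metadata_buffer) {
  write_metadata_chunk($metadata_buffer);
}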

Image dimensions

Once there were lots of images in S3 to migrate, the image media migrations were running excruciatingly slowly. I quickly guessed (and confirmed) that they were accessing the files directly from S3, then (less quickly) stepped through the debugger to find the reason: the image fields needed the image width and height, and since those values weren't available in the source database to map directly in the migration, Drupal fetched each image from S3 to determine the dimensions itself. This was, of course, necessary - but given that migrations were being repeatedly run for testing on various environments, there was no reason to do it repeatedly. Thus, we introduced an image dimension cache table to capture the width and height the first time we imported an image, so any subsequent imports of that image only needed to read the cached dimensions.
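
The cache table lives in the legacy (migrate) database alongside the consolidated paths table, and its schema is nothing fancy - roughly the following (the column sizes are our choices):

$schema = [
  'fields' => [
    'uri' => [
      'type' => 'varchar',
      'length' => 255,
      'not null' => TRUE,
    ],
    'width' => [
      'type' => 'int',
      'unsigned' => TRUE,
    ],
    'height' => [
      'type' => 'int',
      'unsigned' => TRUE,
    ],
  ],
  'primary key' => ['uri'],
];
Database::getConnection('default', 'migrate')
  ->schema()
  ->createTable('s3fs_image_cache', $schema);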

In the acme_image_media migration above, we use this process plugin which takes the image URI and returns an array with width and height keys populated with the cached values if present, and NULL if the dimensions are not yet cached:

/**
 * Fetch cached dimensions for an image path (purportedly) in S3.
 *
 * @MigrateProcessPlugin(
 *   id = "acme_image_dimensions"
 * )
 */
class AcmeImageDimensions extends ProcessPluginBase {

  /**
   * {@inheritdoc}
   */
  public function transform($value, MigrateExecutableInterface $migrate_executable, Row $row, $destination_property) {
    $dimensions = Database::getConnection('default', 'migrate')
      ->select('s3fs_image_cache', 's3')
      ->fields('s3', ['width', 'height'])
      ->condition('uri', $value)
      ->execute()
      ->fetchAssoc();
    if (empty($dimensions)) {
      return ['width' => NULL, 'height' => NULL];
    }
    return $dimensions;
  }
}

If the dimensions we provided were empty, Drupal core fetched the image from S3 when the media entity was saved, and the resulting width and height were stored in the image field table. We then caught the migration POST_ROW_SAVE event to cache those dimensions:

class AcmeMigrateSubscriber implements EventSubscriberInterface {
  public static function getSubscribedEvents() {
    $events[MigrateEvents::POST_ROW_SAVE] = 'import';
    return $events;
  }
 
  public function import(MigratePostRowSaveEvent $event) {
    $row = $event->getRow();
    // For image media, if width/height have been freshly obtained, cache them.
    if (strpos($event->getMigration()->id(), 'image_media') !== FALSE) {
      // Note that this "temporary variable" was populated in the migration as a
      // width/height array, using the acme_image_dimensions process plugin.
      $original_dimensions = $row->getDestinationProperty('_image_dimensions');
      // If the dimensions are populated, everything's fine and all of this is skipped.
      if (empty($original_dimensions['width'])) {
        // Find the media entity ID.
        $destination_id_values = $event->getDestinationIdValues();
        if (is_array($destination_id_values)) {
          $destination_id = reset($destination_id_values);
          // For performance, cheat and look directly at the table instead of doing
          // an entity query.
          $dimensions = Database::getConnection()
            ->select('media__field_media_image', 'msi')
            ->fields('msi', ['field_media_image_width', 'field_media_image_height'])
            ->condition('entity_id', $destination_id)
            ->execute()
            ->fetchAssoc();
          // If we have dimensions, cache them.
          if ($dimensions && !empty($dimensions['field_media_image_width'])) {
            $uri = $row->getDestinationProperty('_uri');
            Database::getConnection('default', 'migrate')
              ->merge('s3fs_image_cache')
              ->key('uri', $uri)
              ->fields([
                'width' => $dimensions['field_media_image_width'],
                'height' => $dimensions['field_media_image_height'],
              ])
              ->execute();
          }
        }
      }
    }
  }
}

Safely testing with the bucket

Another problem with the size of our bucket was that it was too large to economically make and maintain a separate copy for development and testing. So, we needed to use the single bucket - but of course, the videos in it were being used by the live site, so it was critical not to mess with them. We decided to use the live bucket with credentials allowing us to read and add files, but not delete them - this would permit us to test uploading files through the admin interface and, most importantly from a migration standpoint, to access the files, but not do any damage. The worst-case scenario would be the inability to clean out test files, and writing a cleanup tool after the fact to clear any extra files out would be simple enough. Between this, and the fact that images were in a separate folder in the bucket (and we weren't doing any uploads of videos, simply migrating references to them), the risk of using the live bucket was felt to be acceptable. At first, though, the client was having trouble finding credentials that worked as we needed, so as a short-term workaround I implemented a configuration option for the s3fs module to disable deletion in the stream wrapper.

Investigating the permissions issues with my own test bucket, trying to grant the bare minimum permissions needed for reading and writing objects, I arrived at a point where migration worked as desired and deletion was prevented - but uploading files to the bucket through Drupal silently failed. There was an existing issue in the s3fs queue, but it had not been diagnosed. I finally figured out the cause (Slack comment - "God, the layers of middleware I had to step through to find the precise point of death…") - by default, objects are private when uploaded to S3, and you need to explicitly set public-read in the ACL. Which the s3fs module does - but doing so requires the PutObjectAcl permission, which I had not granted (I've suggested the s3fs validator could detect and warn of this situation). Adding that permission enabled everything to work; once the client applied the necessary policies we were in business…

… for a while. The use of a single bucket became a problem once front-end developers began actively testing with image styles, and we were close enough to launch that we enabled deletion so image styles could be flushed when they changed. The derivatives for S3 images are themselves stored in S3 - and with people generating derivatives in different environments, the s3fs_file table in any given environment (in particular the "live" environment on Pantheon, where the eventual production site was taking shape) became out of sync with the actual contents of S3. In particular, if styles were generated in the live environment and then flushed in another environment, the live cache table would still contain entries for the derived styles (so the site would generate URLs to them) even though they no longer existed in S3 - thus, no derived images would render. To address this, we had each environment set the s3fs root_folder option so each would have its own sandbox - developers could then work on image styles at least with files they uploaded locally for testing, although their environments would not then see the "real" files in the bucket.
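
Since root_folder is plain s3fs configuration, the per-environment sandboxing amounts to a config override in each environment's settings.php - something like this (the folder name is made up, and this assumes the usual s3fs.settings config object):

// settings.php (or settings.local.php) for a given environment.
$config['s3fs.settings']['root_folder'] = 'sandbox-dev';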

We discussed more permanent alternatives and Sean Blommaert put forth some suggestions in the s3fs issue queue - ultimately (after site launch) we found there is an existing (if minimally maintained) module, s3fs_file_proxy_to_s3, which extends stage_file_proxy. I will most certainly work with this module on any future projects using s3fs.

The tl;dr - lessons learned

To summarize the things to keep in mind if planning on using s3fs in your Drupal project:

  1. Install the s3fs_file_proxy_to_s3 module first thing, and make sure all environments have it enabled and configured.
  2. Make sure the credentials you use for your S3 bucket have the PutObjectAcl permission - this is non-obvious but essential if you are to publicly serve files from S3.
  3. Watch your URI lengths - if the s3://… form of the URI is > 255 characters, it won't work (Drupal's file_managed table has a 255-character limit). When using image styles, the effective limit is significantly lower due to folders added to the path.
  4. With image fields which reference images stored in S3, if you don't have width and height to set on the field at entity creation time, you'll want to implement a caching solution similar to the above.

Acknowledgements

Apart from the image style issues, most of the direct development detailed above was mine, but as on any project thoughts were bounced off the team, project managers handled communication with the client, testers provided feedback, etc. Thanks to the whole team, particularly Sean Blommaert (image styles, post feedback), Kevin Thompson (client communications), and Karoly Negyesi (post feedback).
