Drupal file migrations: The s3fs module
A recent project gave me the opportunity to familiarize myself with the Drupal 8 version of the S3 File System (s3fs) module (having used the D7 version briefly in the distant past). This module provides an s3://
stream wrapper for files stored in an S3 bucket, allowing them to be used as seamlessly as locally stored public and private files. First we present the migrations and some of the plugins implemented to support import of files stored on S3 - below we will go into some of the challenges we faced.
Our client was already storing video files in an S3 bucket, and it was decided that for the Drupal site we would also store image files there. The client handled bulk uploading of images to an "image
" folder within the bucket, using the same (relative) paths as those stored for the images in the legacy database. Thus, for migration we did not need to physically copy files around (the bane of many a media migration!) - we "merely" needed to create the appropriate entities in Drupal pointing at the S3 location of the files.
The following examples are modified from the committed code - to obfuscate the client/project, and to simplify so we focus on the subject at hand.
Image migrations
Gallery images
In the legacy database all gallery images were stored in a table named asset_metadata
, which is structured very much like Drupal's file_managed
table, with the file paths in an asset_path
column. The file migration looked like this:
id: acme_image source: plugin: acme process: filename: plugin: callback callable: basename source: asset_path uri: # Construct the S3 URI - see implementation below. plugin: acme_s3_uri source: asset_path # Source data created/last_modified fields are YYYY-MM-DD HH:MM:SS - convert # them to the classic UNIX timestamps Drupal loves. Oh, and they're optional, # so when empty leave them empty and let Drupal set them to the current time. created: - plugin: skip_on_empty source: created method: process - plugin: callback callable: strtotime changed: - plugin: skip_on_empty source: last_modified method: process - plugin: callback callable: strtotime destination: plugin: entity:file
Because we also needed to construct the S3 uris in places besides the acme_s3_uri
process plugin, we implemented the construction in a trait which cleans up some inconsistencies and prepends the image location:
trait AcmeMakeS3Uri { /** * Turn a legacy image path into an S3 URI. * * @param string $value * * @return string */ protected function makeS3Uri($value) { // Some have leading tabs. $value = trim($value); // Path fields are inconsistent about leading slashes. $value = ltrim($value, '/'); // Sometimes they contain doubled-up slashes. $value = str_replace('//', '/', $value); return 's3://image/' . $value; } }
So, the process plugin in the image migration above uses the trait to construct the URI, and verifies that the file is actually in S3 - if not, we skip it. See the Challenges and Contributions section below for more on the s3fs_file
table.
/** * Turn a legacy image path into an S3 URI. * * @MigrateProcessPlugin( * id = "acme_s3_uri" * ) */ class AcmeS3Uri extends ProcessPluginBase { use AcmeMakeS3Uri; /** * {@inheritdoc} */ public function transform($value, MigrateExecutableInterface $migrate_executable, Row $row, $destination_property) { $uri = $this->makeS3Uri($value); // For now, skip any images not cached by s3fs. $s3_uri = \Drupal::database()->select('s3fs_file', 's3') ->fields('s3', ['uri']) ->condition('uri', $uri) ->execute() ->fetchField(); if (!$s3_uri) { throw new MigrateSkipRowException("$uri missing from s3fs_file table"); } return $uri; } }
The above creates the file entities - next, we need to create the media entities that reference the files above via entity reference fields (and add other fields). These media entities are then referenced from content entities.
id: acme_image_media source: plugin: acme process: # For the media "name" property - displayed at /admin/content/media - our # first choice is the image caption, followed by the "event_name" field in # our source table. If necessary, we fall back to the original image path. name: - # Produces an array containing only the non-empty values. plugin: callback callable: array_filter source: - caption - event_name - asset_path - # From the array, pass on the first value as a scalar. plugin: callback callable: current - # Some captions are longer than the name property length. plugin: substr length: 255 # Entity reference to the image - convert the source ID to Drupal's file ID. field_media_image/target_id: plugin: migration_lookup migration: acme_image source: id # Use the name we computed above as the alt text. field_media_image/alt: '@name' # We need to explicitly set the image dimensions in the field's width/height # subfields (more on this below under Challenges and Contributions). Note that in # the process pipeline you can effectively create temporary fields which can be # used later in the pipeline - just be sure they won't conflict with # anything that might be used within the Drupal entity. _uri: plugin: acme_s3_uri source: asset_path _image_dimensions: plugin: acme_image_dimensions source: '@_uri' field_media_image/width: '@_image_dimensions/width' field_media_image/height: '@_image_dimensions/height' caption: caption destination: plugin: entity:media default_bundle: image migration_dependencies: required: - acme_image
Other images
The gallery images have their own metadata table - but, there are many other images which are simply stored as paths in content tables (in some cases, there are multiple such path fields in a single table). One might be tempted to deal with these in process plugins in the content migrations - creating the file and media entities on the fly - but that would be, well, ugly. Instead we implemented a drush command, run before our migration tasks, to canonicalize and gather those paths into a single table, which then feeds the acme_image_consolidated
and acme_image_media_consolidated
migrations (which end up being simpler versions of acme_image
and acme_image_media
, since "path
" is the only available source field).
function drush_acme_migrate_gather_images() { // Key is legacy table name, value is list of image path columns to migrate. $table_fields = [ 'person' => [ 'profile_picture_path', 'left_standing_path', 'right_standing_path', ], 'event' => [ 'feature_image', 'secondary_feature_image', ], 'subevent' => [ 'generated_medium_thumbnail', ], 'news_article' => [ 'thumbnail', ] ]; $legacy_db = Database::getConnection('default', 'migrate'); // Create the table if necessary. if (!$legacy_db->schema()->tableExists('consolidated_image_paths')) { $table = [ 'fields' => [ 'path' => [ 'type' => 'varchar', 'length' => 191, // Longest known path is 170. 'not null' => TRUE, ] ], 'primary key' => ['path'], ]; $legacy_db->schema()->createTable('consolidated_image_paths', $table); drush_print('Created consolidated_image_paths table'); } $max = 0; foreach ($table_fields as $table => $field_list) { drush_print("Gathering paths from $table"); $count = 0; $query = $legacy_db->select($table, 't') ->fields('t', $field_list); foreach ($query->execute() as $row) { // Iterate the image path columns returned in the row. foreach ($row as $path) { if ($path) { $len = strlen($path); if ($len > $max) $max = $len; $path = str_replace('//', '/', $path); $count++; $legacy_db->merge('consolidated_image_paths') ->key('path', $path) ->execute(); } } } // Note we will end up with far fewer rows in the table due to duplication. drush_print("$count paths added from $table"); } drush_print("Maximum path length is $max"); }
Video migrations
The legacy database contained a media
table referencing videos tagged with three different types - internal, external, and embedded. "Internal" videos were those stored in S3 with a relative path in the internal_url
column; "external" videos (most on client-specific domains, but with some Youtube domains as well) had a full URL in the external_url
column; and "embedded" videos were with a very few exceptions Youtube videos with the Youtube ID in the embedded_id
column. It was decided that we would migrate the internal and Youtube videos, ignoring the rest of the external/embedded videos. Here we focus on the internal (S3-based) videos.
id: acme_video source: plugin: acme_internal_video constants: s3_prefix: s3:// process: _trimmed_url: # Since the callback process plugin only permits a single source value to be # passed to the specified PHP function, we have a custom plugin which enables us # to pass a character list to be trimmed. plugin: acme_trim source: internal_url trim_type: left charlist: / uri: - plugin: concat source: - constants/s3_prefix - '@_trimmed_url' - # Make sure the referenced file actually exists in S3 (does a simple query on # the s3fs_file table, throwing MigrateSkipRowException if missing). plugin: acme_skip_missing_file fid: # This operates much like migrate_plus's entity_lookup, to return an existing # entity ID based on arbitrary properties. The purpose here is if the file URI # is already in file_managed, point the migrate map table to the existing file # entity - otherwise, a new file entity will be created. plugin: acme_load_by_properties entity_type: file properties: uri source: '@uri' default_value: NULL filename: plugin: callback callable: basename source: '@uri' destination: plugin: entity:file
The media entity migration is pretty straightforward:
id: acme_video_media source: plugin: acme_internal_video constants: true: 1 process: status: published name: title caption: caption # The source column media_date is YYYY-MM-DD HH:DD:SS format - the Drupal field is # configured as date-only, so the source value must be truncated to YYYY-MM-DD. date: - plugin: skip_on_empty source: media_date method: process - plugin: substr length: 10 field_media_video/0/target_id: - plugin: migration_lookup migration: acme_video source: id no_stub: true - # If we haven't migrated a file entity, skip this media entity. plugin: skip_on_empty method: row field_media_video/0/display: constants/true field_media_video/0/description: caption destination: plugin: entity:media default_bundle: video migration_dependencies: required: - acme_video
Did I mention that we needed to create a node for each video, linking to related content of other types? Here we go:
id: acme_video_node source: plugin: acme_internal_video constants: text_format: formatted url_prefix: http://www.acme.com/media/ s3_prefix: s3://image/ process: title: title status: published teaser/value: caption teaser/format: constants/text_format length: # Converts HH:MM:SS to integer seconds. Left as an exercise to the reader. plugin: acme_video_length source: duration video: plugin: migration_lookup migration: acme_video_media source: id no_stub: true # Field to preserve the original URL. old_url: plugin: concat source: - constants/url_prefix - url_name _trimmed_thumbnail: plugin: acme_trim trim_type: left charlist: '/' source: thumbnail teaser_image: - plugin: skip_on_empty source: '@_trimmed_thumbnail' method: process - # Form the URI as stored in file_managed. plugin: concat source: - constants/s3_prefix - '@_trimmed_thumbnail' - # Look up the fid. plugin: acme_load_by_properties entity_type: file properties: uri - # Find the media entity referencing that fid. plugin: acme_load_by_properties entity_type: media properties: field_media_image # Note that for each of these entity reference fields, we skipped some content, # so need to make sure stubs aren't created for the missing content. Also note # that the source fields here are populated in a PREPARE_ROW event. related_people: plugin: migration_lookup migration: acme_people source: related_people no_stub: true related_events: plugin: migration_lookup migration: acme_event source: related_events no_stub: true tag_keyword: plugin: migration_lookup migration: acme_keyword source: keyword_ids no_stub: true destination: plugin: entity:node default_bundle: video migration_dependencies: required: - acme_image_media - acme_video_media - acme_people - acme_event - acme_keyword
Auditing missing files
A useful thing to know (particularly with the client incrementally populating the S3 bucket with image files) is what files are referenced in the legacy tables but not actually in the bucket. Below is a drush command we threw together to answer that question - it will query each legacy image or video path field we're using, construct the s3://
version of the path, and look it up in the s3fs_file
table to see if it exists in S3.
/** * Find files missing from S3. */ function drush_acme_migrate_missing_files() { $legacy_db = Database::getConnection('default', 'migrate'); $drupal_db = Database::getConnection(); $table_fields = [ [ 'table_name' => 'asset_metadata', 'url_column' => 'asset_path', 'date_column' => 'created', ], [ 'table_name' => 'media', 'url_column' => 'internal_url', 'date_column' => 'media_date', ], [ 'table_name' => 'person', 'url_column' => 'profile_picture_path', 'date_column' => 'created', ], // … on to 9 more columns among three more tables... ]; $header = 'uri,legacy_table,legacy_column,date'; drush_print($header); foreach ($table_fields as $table_info) { $missing_count = 0; $total_count = 0; $table_name = $table_info['table_name']; $url_column = $table_info['url_column']; $date_column = $table_info['date_column']; $query = $legacy_db->select($table_name, 't') ->fields('t', [$url_column]) ->isNotNull($url_column) ->condition($url_column, '', '<>'); if ($table_name == 'media') { $query->condition('type', 'INTERNALVIDEO'); } if ($table_name == 'people') { // This table functions much like Drupal's node table. $query->innerJoin('publishable_entity', 'pe', 't.id=pe.id'); $query->fields('pe', [$date_column]); } else { $query->fields('t', [$date_column]); } $query->distinct(); foreach ($query->execute() as $row) { $path = trim($row->$url_column); if ($path) { $total_count++; // Paths are inconsistent about leading slashes. $path = ltrim($path, '/'); // Sometimes they have doubled-up slashes. $path = str_replace('//', '/', $path); if ($table_name == 'media') { $s3_path = 's3://' . $path; } else { $s3_path = 's3://image/' . $path; } $s3 = $drupal_db->select('s3fs_file', 's3') ->fields('s3', ['uri']) ->condition('uri', $s3_path) ->execute() ->fetchField(); if (!$s3) { $output_row = "$s3_path,$table_name,$url_column,{$row->$date_column}"; drush_print($output_row); $missing_count++; } } } drush_log("$missing_count of $total_count files missing in $table_name column $url_column", 'ok'); } }
Challenges and contributions
The s3fs module's primary use case is where the configured S3 bucket is used only by the Drupal site, and populated directly by file uploads through Drupal - our project was an outlier in terms of having all files in the S3 bucket first, and in sheer volume. A critical piece of the implementation is the s3fs_file
table, which caches metadata for all files in the bucket so Drupal rarely needs to access the bucket itself other than on file upload (since file URIs are converted to direct S3 URLs when rendering, web clients go directly to S3 to fetch files, not through Drupal). In our case, the client had an existing S3 bucket which contained all the video files (and more) used by their legacy site, and to which they bulk uploaded image files directly so we did not need to do this during migration. The module does have an s3fs-refresh-cache
command to populate the s3fs_file
table from the current bucket contents, but we did have to deal with some issues around the cache table.
Restriction on URI lengths
As soon as we started trying to use drush s3fs-refresh-cache
, we ran into the existing issue Getting Exception 'PDOException'SQLSTATE[22001] When Running drush s3fs-refresh-cache - URIs in the bucket longer than the 255-character length of s3fs_file
's uri
column. The exception aborted the refresh entirely, and because the refresh operation generates a temporary version of the table from scratch, then swaps it for the "live" table, the exception prevented any file metadata from being refreshed if there was one overflowing URI. I submitted a patch implementing the simplest workaround - just generating a message and ignoring overly-long URIs. Discussion continues around an alternate approach, but we used my patch in our project.
Lost primary key
So, once we got the cache refresh to work, we found serious performance problems. We had stumbled on an existing issue, "s3fs_file" table has no primary key. I tracked down the cause - because the uri column is 255 characters long, with InnoDB it cannot be indexed when using a multibyte collation such as utf8_general_ci. And Drupal core has a bug, DatabaseSchema_mysql::createTableSql() can't set table collation, preventing the setting of the utf8_bin collation directly in the table schema. The s3fs module works around that bug when creating the s3fs_file table at install time by altering the collation after table creation - but the cache refresh created a new cache table using only the schema definition and did not pick up the altered collation. Thus, only people like us who used cache refresh would lose the index, and those with more modest bucket sizes might never even notice. My patch to apply the collation (later refined by jansete) was committed to the s3fs module.
Scalability of cache refresh
As the client loaded more and more images into the bucket, drush s3fs-refresh-cache
started running out of memory. Our bucket was quite large (1.7 million files at last count), and the refresh function gathered all file metadata in memory before writing it to the database. I submitted a patch to chunk the metadata to the db within the loop, which has been committed to the module.
Image dimensions
Once there were lots of images in S3 to migrate, the image media migrations were running excruciatingly slowly. I quickly guessed and confirmed that they were accessing the files directly from S3, and then (less quickly) stepped through the debugger to find the reason - the image fields needed the image width and height, and since this data wasn't available from the source database to be directly mapped in the migration, it went out and fetched the S3 image to get the dimensions itself. This was, of course, necessary - but given that migrations were being repeatedly run for testing on various environments, there was no reason to do it repeatedly. Thus, we introduced an image dimension cache table to capture the width and height the first time we imported an image, and any subsequent imports of that image only needed to get the cached dimensions.
In the acme_image_media
migration above, we use this process plugin which takes the image URI and returns an array with width and height keys populated with the cached values if present, and NULL if the dimensions are not yet cached:
/** * Fetch cached dimensions for an image path (purportedly) in S3. * * @MigrateProcessPlugin( * id = "acme_image_dimensions" * ) */ class AcmeImageDimensions extends ProcessPluginBase { public function transform($value, MigrateExecutableInterface $migrate_executable, Row $row, $destination_property) { $dimensions = Database::getConnection('default', 'migrate') ->select('s3fs_image_cache', 's3') ->fields('s3', ['width', 'height']) ->condition('uri', $value) ->execute() ->fetchAssoc(); if (empty($dimensions)) { return ['width' => NULL, 'height' => NULL]; } return $dimensions; } }
If the dimensions were empty, when the media entity was saved Drupal core fetched the image from S3 and the width and height were saved to the image field table. We then caught the migration POST_ROW_SAVE
event to cache the dimensions:
class AcmeMigrateSubscriber implements EventSubscriberInterface { public static function getSubscribedEvents() { $events[MigrateEvents::POST_ROW_SAVE] = 'import'; return $events; } public function import(MigratePostRowSaveEvent $event) { $row = $event->getRow(); // For image media, if width/height have been freshly obtained, cache them. if (strpos($event->getMigration()->id(), 'image_media') > 0) { // Note that this "temporary variable" was populated in the migration as a // width/height array, using the acme_image_dimensions process plugin. $original_dimensions = $row->getDestinationProperty('_image_dimensions'); // If the dimensions are populated, everything's find and all of this is skipped. if (empty($original_dimensions['width'])) { // Find the media entity ID. $destination_id_values = $event->getDestinationIdValues(); if (is_array($destination_id_values)) { $destination_id = reset($destination_id_values); // For performance, cheat and look directly at the table instead of doing // an entity query. $dimensions = Database::getConnection() ->select('media__field_media_image', 'msi') ->fields('msi', ['field_media_image_width', 'field_media_image_height']) ->condition('entity_id', $destination_id) ->execute() ->fetchAssoc(); // If we have dimensions, cache them. if ($dimensions && !empty($dimensions['field_media_image_width'])) { $uri = $row->getDestinationProperty('_uri'); Database::getConnection('default', 'migrate') ->merge('s3fs_image_cache') ->key('uri', $uri) ->fields([ 'width' => $dimensions['field_media_image_width'], 'height' => $dimensions['field_media_image_height'], ]) ->execute(); } } } } } }
Safely testing with the bucket
Another problem with the size of our bucket was that it was too large to economically make and maintain a separate copy to use for development and testing. So, we needed to use the single bucket - but of course, the videos in it were being used in the live site, so it was critical not to mess with them. We decided to use the live bucket with credentials allowing us to read and add files to the bucket, but not delete them - this would permit us to test uploading files through the admin interface, and most importantly from a migration standpoint access the files, but not do any damage. Worst-case scenario would be the inability to clean out test files, but writing a cleanup tool after the fact to clear any extra files out would be simple enough. Between this, and the fact that images were in a separate folder in the bucket (and we weren't doing any uploads of videos, simply migrating references to them), the risk of using the live bucket was felt to be acceptable. At first, though, the client was having trouble finding credentials that worked as we needed. As a short-term workaround, I implemented a configuration option for the s3fs module to disable deletion in the stream wrapper.
Investigating the permissions issues with my own test bucket, trying to add the bare minimum permissions needed for reading and writing objects, I arrived at a point where migration worked as desired, and deletion was prevented - but uploading files to the bucket through Drupal silently failed. There was an existing issue in the s3fs queue but it had not been diagnosed. I finally figured out the cause (Slack comment - "God, the layers of middleware I had to step through to find the precise point of death…") - by default, objects are private when uploaded to S3, and you need to explicitly set public-read
in the ACL. Which the s3fs module does - but, to do this requires the PutObjectAcl policy, which I had not set (I've suggested the s3fs validator could detect and warn of this situation). Adding that policy enabled everything to work; once the client applied the necessary policies we were in business…
… for a while. The use of a single bucket became a problem once front-end developers began actively testing with image styles, and we were close enough to launch to enable deletion so image styles could be flushed when changed. The derivatives for S3 images are themselves stored in S3 - and with people generating derivatives in different environments, the s3fs_file
table in any given environment (in particular the "live" environment on Pantheon, where the eventual production site was taking shape) became out of sync with the actual contents of S3. In particular, if styles were generated in the live environment then flushed in another environment, the live cache table would still contain entries for the derived styles (thus the site would generate URLs to them) even though they didn't actually exist in S3 - thus, no derived images would render. To address this, we had each environment set the s3fs root_folder
option so they would each have their own sandbox - developers could then work on image styles at least with files they uploaded locally for testing, although their environments would not then see the "real" files in the bucket.
We discussed more permanent alternatives and Sean Blommaert put forth some suggestions in the s3fs issue queue - ultimately (after site launch) we found there is an existing (if minimally maintained) module extending stage_file_proxy. I will most certainly work with this module on any future projects using s3fs.
The tl;dr - lessons learned
To summarize the things to keep in mind if planning on using s3fs in your Drupal project:
- Install the s3fs_file_proxy_to_s3 module first thing, and make sure all environments have it enabled and configured.
- Make sure the credentials you use for your S3 bucket have the PutObjectAcl permission - this is non-obvious but essential if you are to publicly serve files from S3.
- Watch your URI lengths - if the
s3://…
form of the URI is > 255 characters, it won't work (Drupal's file_managed table has a 255-character limit). When using image styles, the effective limit is significantly lower due to folders added to the path. - With image fields which reference images stored in S3, if you don't have width and height to set on the field at entity creation time, you'll want to implement a caching solution similar to the above.
Acknowledgements
Apart from the image style issues, most of the direct development detailed above was mine, but as on any project thoughts were bounced off the team, project managers handled communication with the client, testers provided feedback, etc. Thanks to the whole team, particularly Sean Blommaert (image styles, post feedback), Kevin Thompson (client communications), and Karoly Negyesi (post feedback).
Use the Twitter thread below to comment on this post: