Drupal 8 plugins for XML and JSON migrations

Last week I put some work into implementing wordpress_migrate for Drupal 8 (read more in the companion piece), so this seems a good point to talk about the source plugin work it is based on: support for XML and JSON sources in migrate_plus.

History and status of the XML and JSON plugins

Last year Mike Baynton produced a basic D8 version of wordpress_migrate, accompanied by an XML source plugin. Meanwhile, Karen Stevenson implemented a JSON source plugin for Drupal 8. The two source plugins had distinct APIs and differing configuration settings, but when you think about it they really only differ in how they parse the data. Unlike SQL sources, both are file-oriented (the data may be read via HTTP or from a local filesystem), both require a means to specify how to select an item ("row") from within the data, and both require a means to specify how to select fields from within an item. I felt there should be some way to share at least a common interface between the two, if not much of the implementation.

So, in migrate_plus I have implemented a Url source plugin (please weigh in with suggestions for a better name!) which (ideally) separates the retrieval of the data (using a fetcher plugin) from the parsing of the data (using a parser plugin). There are currently XML and JSON parser plugins (based on Mike Baynton's and Karen Stevenson's original work), along with an HTTP fetcher plugin. All of the former migrate_source_xml functionality is in migrate_plus now, so that module should be considered deprecated. Not everything from migrate_source_json is in migrate_plus yet. One example is the ability to specify HTTP headers for authentication, which in the new architecture should be part of the HTTP fetcher and thus available to both XML and JSON sources. Since no new work is going into migrate_source_json at this point, the best way forward for JSON migration support is to contribute to beefing up the migrate_plus version of this support.

Using the Url source plugin with the XML parser plugin

The migrate_example_advanced submodule of migrate_plus contains simple examples of both XML and JSON migrations from web services. Here, though, we'll look at a more complex real-world example: migration from a WordPress XML export.

The outermost element of a WordPress export is <rss> - within that is a <channel> element, which contains all the exported content - authors, tags and categories, and content items (posts, pages, and attachments). Here's an example of how tags are represented:

<rss>
  <channel>
    ...
    <wp:tag>
      <wp:term_id>6859470</wp:term_id>
      <wp:tag_slug>a-new-tag</wp:tag_slug>
      <wp:tag_name><![CDATA[A New Tag]]></wp:tag_name>
    </wp:tag>
    <wp:tag>
      <wp:term_id>18</wp:term_id>
      <wp:tag_slug>music</wp:tag_slug>
      <wp:tag_name><![CDATA[Music]]></wp:tag_name>
    </wp:tag>
    ...
  </channel>
</rss>

The source plugin configuration to retrieve this data looks like the following (with comments added for annotation). The configuration for a JSON source would be nearly identical.

source:
  # Specifies the migrate_plus url source plugin.
  plugin: url
  # Specifies the http fetcher plugin. Note that the XML parser does not actually use this,
  # see below.
  data_fetcher_plugin: http
  # Specifies the xml parser plugin.
  data_parser_plugin: xml
  # One or more URLs from which to fetch the source data (only one for a WordPress export).
  # Note that in the actual wordpress_migrate module, this is not built into the wordpress_tags.yml
  # file, but rather saved to the migration_group containing the full set of WP migrations
  # from which it is merged into the source configuration.
  urls: private://wordpress/nimportable.wordpress.2016-06-03.xml
  # For XML, item_selector is the xpath used to select our source items (tags in this case).
  # For JSON, this would be an integer depth at which importable items are found.
  item_selector: /rss/channel/wp:tag
  # For each source field, we specify a selector (xpath relative to the item retrieved above),
  # the field name which will be used to access the field in the process configuration,
  # and a label to document the meaning of the field in front-ends. For JSON, the selector
  # will be simply the key for the value within the selected item.
  fields:
    -
      name: term_id
      label: WordPress term ID
      selector: wp:term_id
    -
      name: tag_slug
      label: Analogous to a machine name
      selector: wp:tag_slug
    -
      name: tag_name
      label: 'Human name of term'
      selector: wp:tag_name
  # Under ids, we specify which of the source fields retrieved above (tag_slug in this case)
  # represent our unique identifier for the item, and the schema type for that field. Note
  # that we use tag_slug here instead of term_id because posts reference terms using their
  # slugs.
  ids:
    tag_slug:
      type: string

Once you've fully specified the source in your .yml file (no PHP needed!), you simply map the retrieved source fields normally:

process:
  # In wordpress_migrate, the vid mapping is generated dynamically by the configuration process.
  vid:
    plugin: default_value
    default_value: tags
  # tag_name was populated via the source plugin configuration above from wp:tag_name.
  name: tag_name

Above we pointed out that the XML parser plugin does not actually use the fetcher plugin. In an ideal world, we would always separate fetching from parsing. In the real world, however, we're making use of existing APIs which do not support that separation. In this case, our parser uses PHP's XMLReader class. Unlike other PHP XML APIs, XMLReader does not read and parse the entire XML source into memory, which makes it essential for dealing with potentially very large XML files (I've seen WordPress exports upwards of 200MB). The class processes the source incrementally and completely manages both fetching and parsing, so as consumers of that class we are unable to make that separation. There is an issue in the queue to add a separate XML parser that would use SimpleXML. That parser would be more flexible (providing the ability to use file-wide xpaths, rather than just item-specific ones) and would also permit separating out the fetcher.

Much more to do!

What we have in migrate_plus today is (almost) sufficient for WordPress imports, but there's still a ways to go. The way fetchers and parsers interact could use some thought; we need to move logically HTTP-specific stuff out of the general fetcher base class, etc. Your help would be much appreciated - particularly with JSON sources, since I don't have handy real-world test data for that case.