
Content extraction and mapping

Patterns for extracting content from legacy CMS platforms and mapping it to headless CMS content models.

Content extraction is the most labor-intensive phase of an enterprise migration. Legacy CMS platforms store content in proprietary formats, tangled with presentation logic, and spread across database tables, file systems, and plugin-specific storage. A disciplined extraction and mapping process ensures no content is lost and the new content model is cleaner than the original.

Extraction Patterns by Platform

WordPress

WordPress stores content in the wp_posts and wp_postmeta tables, with taxonomies in wp_terms and wp_term_relationships. Extract using:

  • WP REST API: The built-in REST API (wp-json/wp/v2/) exposes posts, pages, media, categories, tags, and custom post types. Use pagination to extract all records.
  • WPGraphQL: The WPGraphQL plugin provides a GraphQL endpoint that enables more efficient querying of complex content relationships. This is the recommended approach when the target frontend is Next.js, as it aligns with the data fetching patterns used by frameworks like Faust.js.
  • Direct database export: For sites with heavy custom fields (Advanced Custom Fields, Pods, Toolset), direct MySQL queries against wp_postmeta may be more reliable than API extraction.
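The REST API pagination mentioned above can be sketched as a generic loop. Here `fetch_page` is a stand-in for whatever HTTP call you use (e.g. GET `wp-json/wp/v2/posts?per_page=100&page=N`); the loop itself only assumes the endpoint returns a list per page and an empty list once pages are exhausted.

```python
from typing import Callable, Iterator


def paginate(fetch_page: Callable[[int], list[dict]],
             per_page: int = 100) -> Iterator[dict]:
    """Yield every record from a paged endpoint such as wp-json/wp/v2/posts.

    fetch_page(page) should return one page of parsed JSON results.
    Stopping on a short page avoids requesting one page past the end,
    which WordPress answers with an error rather than an empty list.
    """
    page = 1
    while True:
        batch = fetch_page(page)
        if not batch:
            return
        yield from batch
        if len(batch) < per_page:
            return  # short page means this was the last one
        page += 1
```

The same loop works for pages, media, and custom post types by swapping the endpoint behind `fetch_page`.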

Drupal

Drupal's entity-field architecture stores content across many database tables:

  • JSON:API module: Drupal's core JSON:API module exposes all content entities with full field data and relationships. Use sparse fieldsets and includes to minimize payload size.
  • next-drupal: The next-drupal package provides a bridge between Drupal and Next.js, including support for preview mode and incremental static regeneration.
  • Migrate API: Drupal's built-in Migrate API can export content to JSON or CSV for transformation outside the CMS.

Sitecore and Adobe Experience Manager

These enterprise platforms require specialized extraction:

  • Sitecore: Use the Sitecore Item Web API or GraphQL endpoint to extract content items. Pay special attention to layout definitions, rendering parameters, and personalization rules stored separately from content.
  • Adobe Experience Manager: Use the AEM Assets HTTP API and Content Fragment API to extract structured content. Experience Fragments and content policies require manual mapping.

Content Model Mapping

Mapping legacy content to a headless CMS content model is a design exercise, not a mechanical translation:

Principles

  • Separate content from presentation: Legacy CMS content often contains inline styles, layout divs, shortcodes, and platform-specific markup. Strip all presentation concerns during extraction and store only semantic content.
  • Normalize content types: Legacy sites accumulate redundant content types over time. Consolidate duplicates and create a clean, minimal set of content models in the target CMS.
  • Preserve relationships: Map taxonomy terms, cross-references, and content hierarchies to the target CMS's relationship model (references, linked entries, or embedded objects).
  • Plan for localization: If the site is multilingual, map locale-specific content variants to the target CMS's localization model during extraction, not after.
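Two of the principles above, stripping presentation and converting shortcodes to structured references, can be sketched with plain regular expressions. This is a minimal illustration, not a production sanitizer: real migrations usually reach for an HTML parser, and the shortcode grammar here covers only the common `[name attr="value"]` form.

```python
import re


def strip_presentation(html: str) -> str:
    """Remove inline style attributes and layout-only <div> wrappers,
    leaving the semantic markup behind."""
    html = re.sub(r'\sstyle="[^"]*"', "", html)  # inline styles
    html = re.sub(r"</?div[^>]*>", "", html)     # layout divs
    return html.strip()


def extract_shortcodes(text: str) -> list[dict]:
    """Find WordPress-style [name attr="value" ...] shortcodes so each
    occurrence can be mapped to a component reference in the target CMS."""
    found = []
    for m in re.finditer(r'\[(\w+)((?:\s+\w+="[^"]*")*)\s*\]', text):
        attrs = dict(re.findall(r'(\w+)="([^"]*)"', m.group(2)))
        found.append({"name": m.group(1), "attrs": attrs})
    return found
```

During extraction, each shortcode found this way becomes a row in the field mapping document with its own transformation rule.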

Field Mapping Document

Create a detailed field mapping spreadsheet for each content type:

  • Source field: The field name and location in the legacy CMS.
  • Source format: Data type and any transformation applied by the legacy CMS (e.g., shortcode expansion, image resizing).
  • Target field: The corresponding field in the headless CMS content model.
  • Target format: Expected data type, validation rules, and any required transformation.
  • Transformation logic: Specific rules for converting between source and target (e.g., converting WordPress shortcodes to structured component references, stripping inline styles, resolving relative URLs to absolute).
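The spreadsheet rows above translate directly into data a migration script can execute. The field names and transforms below are hypothetical, and the date handling is deliberately naive (it assumes the source timestamp is already UTC); the point is the shape: one row per field, each carrying its own transformation.

```python
from typing import Any, Callable

# Hypothetical mapping for a "post" content type; each entry mirrors
# one row of the field mapping spreadsheet.
FIELD_MAP: list[dict[str, Any]] = [
    {"source": "post_title",  "target": "title",
     "transform": str.strip},
    {"source": "post_date",   "target": "publishedAt",
     # naive: assumes the legacy timestamp is UTC
     "transform": lambda v: v.replace(" ", "T") + "Z"},
    {"source": "meta.author", "target": "authorName",
     "transform": str.strip},
]


def get_path(record: dict, path: str) -> Any:
    """Resolve a dotted source path like "meta.author"."""
    for key in path.split("."):
        record = record[key]
    return record


def apply_mapping(record: dict, field_map: list[dict]) -> dict:
    """Transform one legacy record into the target content model shape."""
    return {row["target"]: row["transform"](get_path(record, row["source"]))
            for row in field_map}
```

Keeping the mapping as data rather than hard-coded logic means the spreadsheet and the script can be reviewed side by side.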

Media Asset Migration

Media files require special handling:

  • Re-upload to target DAM or CDN: Do not simply copy URLs. Upload all media to the new platform's asset management system (Contentful Assets, Sanity Image Pipeline, or a dedicated media platform like Cloudinary or Imgix).
  • Preserve metadata: Migrate alt text, captions, titles, and focal point data. Missing alt text creates accessibility violations.
  • Optimize on upload: Use the target platform's image transformation pipeline to generate responsive sizes and modern formats (WebP, AVIF) during upload rather than serving legacy JPEGs.
  • Update content references: After uploading, update all content entries to reference the new asset IDs or URLs.
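The reference-update step can be sketched as a simple rewrite over a URL map recorded during upload. The URLs below are made up; the one real subtlety shown is ordering: replace longer URLs first so a URL that is a prefix of another is never clobbered.

```python
def rewrite_asset_refs(body: str, url_map: dict[str, str]) -> str:
    """Replace legacy media URLs in a content body with the new asset
    URLs (or IDs) recorded during upload, longest-first."""
    for old in sorted(url_map, key=len, reverse=True):
        body = body.replace(old, url_map[old])
    return body
```

Run this over every rich-text and body field after the asset upload pass completes, then re-run link verification.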

Validation and Reconciliation

After extraction, validate completeness:

  • Row counts: Compare source and target record counts for every content type.
  • Spot checks: Manually review a random sample (minimum 5%) of migrated entries against the source.
  • Link integrity: Verify that all internal cross-references resolve to valid entries in the target CMS.
  • Media verification: Confirm all images and files are accessible and render correctly.
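The row-count and link-integrity checks above can be automated with two small helpers. The entry shape here (a dict of entry IDs, each listing its outbound references under "refs") is an assumption; adapt it to however your target CMS export represents relationships.

```python
def reconcile_counts(source: dict[str, int],
                     target: dict[str, int]) -> list[str]:
    """Compare per-content-type record counts and report mismatches."""
    problems = []
    for ctype in sorted(set(source) | set(target)):
        s, t = source.get(ctype, 0), target.get(ctype, 0)
        if s != t:
            problems.append(f"{ctype}: source={s} target={t}")
    return problems


def broken_links(entries: dict[str, dict]) -> list[tuple[str, str]]:
    """Return (entry_id, ref) pairs whose internal reference does not
    resolve to an existing entry in the target CMS."""
    return [(eid, ref)
            for eid, entry in entries.items()
            for ref in entry.get("refs", [])
            if ref not in entries]
```

An empty result from both checks is the gate for sign-off; any mismatch sends that content type back to the extraction step.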
