SDTM Mapping: Specification Workbook vs. Code-First - A Practitioner's View

In most statistical programming groups, the SDTM mapping debate is not really a debate; it is an assumption. Teams that came up through CRO environments typically use specification workbooks. Teams at smaller biotechs with programming-heavy cultures often default to code-first approaches. Both groups are confident that their approach is correct, and neither has usually compared the two rigorously. This article attempts that comparison.

The short answer is that both approaches have specific failure modes that the other avoids, and the choice between them should depend on the size and complexity of the program rather than on programmer preference or organizational inertia.

What Each Approach Actually Is

For clarity, a brief definition of each approach as it is typically practiced:

Specification workbook approach: The SDTM mapping is first documented in a structured Excel (or similar) workbook that defines, for each target SDTM variable, the source eCRF field, any applicable transformation logic, and any applicable CDISC controlled terminology. The workbook is reviewed and approved before programming begins. Code is then written to execute the documented transformations. Define.xml can be generated from the workbook metadata.

Code-first approach: The programmer works directly from the eCRF specification and SDTM Implementation Guide (SDTMIG), writing transformation code without a separate mapping document. A reviewer checks the code and its output against the SDTMIG. Documentation may be generated retroactively or as code comments.

Where the Specification Workbook Breaks Down

The specification workbook approach has three specific failure points in Phase III programs:

Workbook drift after protocol amendments. In a Phase III program with multiple protocol amendments, the eCRF is updated to match each amendment, and the specification workbook is supposed to be updated in parallel. In practice, CDM teams prioritize updating the edit checks and the eCRF design; the SDTM specification workbook is updated as time permits. By the time statistical programming begins working with the final locked data, workbook-to-eCRF divergence has accumulated across two or three amendments. Programmers then spend time resolving discrepancies between the workbook, the eCRF, and the actual data. The code-first approach largely avoids this problem because the code is written directly against the current eCRF.
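Drift of this kind can be detected mechanically if both the workbook and the eCRF can be exported as lists of field names. The following is a minimal sketch under that assumption; the function name and the field identifiers are hypothetical and do not come from any particular EDC system or mapping tool.

```python
def find_spec_drift(workbook_fields, ecrf_fields):
    """Compare source fields named in the workbook with fields in the current eCRF."""
    workbook = set(workbook_fields)
    ecrf = set(ecrf_fields)
    return {
        # Workbook rows that point at fields the eCRF no longer contains
        "missing_from_ecrf": sorted(workbook - ecrf),
        # eCRF fields (for example, added by an amendment) with no workbook row
        "unmapped_in_workbook": sorted(ecrf - workbook),
    }


# Hypothetical field names, for illustration only
workbook_fields = ["AE.AETERM", "AE.AESTDAT", "AE.AESEV"]
ecrf_fields = ["AE.AETERM", "AE.AESTDAT", "AE.AETOXGR", "AE.AEONGO"]

drift = find_spec_drift(workbook_fields, ecrf_fields)
for category, fields in drift.items():
    print(category, fields)
```

Running a check like this after each amendment would surface divergence while it is still one amendment deep rather than three.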

Workbook does not capture conditional logic well. The tabular structure of a specification workbook handles straightforward source-to-target mappings well. It handles conditional derivations poorly. A derivation like "AESTDTC equals the adverse event start date from the AE form except when the CRF date is imputed, in which case use the imputation rule documented in the imputation log with a --DTC imputation flag" requires multiple rows, footnotes, and lookup logic that is difficult to specify unambiguously in a flat table. Programmers routinely interpret ambiguous workbook specifications differently, generating inter-programmer inconsistencies that are caught in QC but add review cycles.
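For comparison, the quoted derivation reads as a short explicit branch when written as code. This is only a sketch of the shape of the logic: the function name, the `is_imputed` argument, the imputation log lookup, and the output flag are all hypothetical, and the actual rule would come from the study's own imputation documentation.

```python
from typing import Optional


def derive_aestdtc(crf_start_date: str, is_imputed: bool,
                   imputed_value: Optional[str]) -> dict:
    """Return AESTDTC plus a flag recording which branch produced it.

    Assumption: the caller has already looked up the imputation log and
    passes the logged value (or None if the log has no entry).
    """
    if not is_imputed:
        # Default branch: take the start date from the AE form as collected
        return {"AESTDTC": crf_start_date, "imputed": False}
    if imputed_value is None:
        # Fail loudly rather than guess when the log is incomplete
        raise ValueError("date is marked imputed but the imputation log has no entry")
    return {"AESTDTC": imputed_value, "imputed": True}


print(derive_aestdtc("2023-04-12", False, None))
print(derive_aestdtc("2023-04", True, "2023-04-01"))
```

Every branch, including the case where the log is incomplete, has exactly one place in the code, which is what the flat table struggles to express.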

Workbook review takes longer than the time saved on coding. In programs with 30+ SDTM domains, the specification workbook review cycle typically runs 3–5 weeks with two rounds of comments. The argument for the workbook approach is that this review catches design problems early. The counterfactual is that experienced programmers using a code-first approach with a paired reviewer catch the same issues in code review, with a total elapsed time of 1–2 weeks for comparable domain complexity.

Where Code-First Breaks Down

The code-first approach has equally specific failure modes:

Traceability for regulatory submissions. The FDA Study Data Technical Conformance Guide expects the submission package to document the derivation logic for each SDTM variable, not just to implement it in code. A specification workbook provides this documentation as a natural output of the process. A code-first team must produce equivalent documentation retroactively, which is time-consuming and often results in incomplete or inconsistent documentation because programmers document what they think they did rather than carefully reviewing what the code actually does.
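One way a code-first team might reduce the retroactive documentation burden is to keep the derivation description next to the function that implements it and export the descriptions as a table. The sketch below assumes a small in-house registry; the `derivation` decorator, the `DERIVATIONS` list, and the example rules are hypothetical, not part of any existing library.

```python
DERIVATIONS = []  # one metadata entry per registered derivation function


def derivation(domain, variable, source, rule):
    """Decorator that records traceability metadata for a derivation function."""
    def register(func):
        DERIVATIONS.append({
            "domain": domain,
            "variable": variable,
            "source": source,
            "rule": rule,
            "function": func.__name__,
        })
        return func  # the function itself is left unchanged
    return register


@derivation("AE", "AETERM", source="AE.AETERM", rule="Copy verbatim term, trimmed")
def derive_aeterm(raw):
    return raw.strip()


@derivation("AE", "AESEV", source="AE.AESEV", rule="Upper-case the CRF severity")
def derive_aesev(raw):
    return raw.strip().upper()


def traceability_rows():
    """Render the registry as one line per variable, ready to paste into a spec."""
    return [
        f'{d["domain"]}.{d["variable"]} <- {d["source"]}: {d["rule"]} [{d["function"]}]'
        for d in DERIVATIONS
    ]


for row in traceability_rows():
    print(row)
```

This does not guarantee that the rule text matches the code, but it places the two a few lines apart, so a reviewer sees both in the same diff.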

Handoff problems in multi-CRO or multi-sponsor programs. When statistical programming is distributed across multiple organizations — a common pattern in global Phase III programs — code without specification documentation creates serious handoff problems. The receiving organization cannot audit derivation logic without reading code. For complex derivations involving multiple source tables, protocol-specific rules, and imputation strategies, reading the code is substantially harder than reading a well-written specification.

Inconsistency across programmers on the same study. On a team of three or more programmers working in parallel on different SDTM domains, code-first approaches require a more rigorous central review step to ensure consistency in how shared concepts (subject identification, study day calculation, baseline flag derivation) are implemented. Without a specification layer that defines shared derivations, each programmer tends to implement them slightly differently, and the inconsistencies surface during Pinnacle 21 conformance checks.
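Study day is a typical example of a shared concept that drifts between programmers. A minimal sketch of a single shared implementation follows, using the SDTMIG convention that there is no study day 0: dates on or after the reference date count from 1, and earlier dates count backward from -1. The function name is hypothetical, and the sketch assumes complete dates that have already been parsed.

```python
from datetime import date


def study_day(event_date: date, reference_date: date) -> int:
    """Compute a --DY value relative to the reference date (no day 0)."""
    delta = (event_date - reference_date).days
    # On or after the reference date: add 1 so the reference date is day 1.
    # Before the reference date: use the raw difference, so the prior day is -1.
    return delta + 1 if delta >= 0 else delta


reference = date(2024, 1, 10)
print(study_day(date(2024, 1, 10), reference))  # reference date itself
print(study_day(date(2024, 1, 9), reference))   # day before reference
print(study_day(date(2024, 1, 12), reference))  # two days after reference
```

If every domain program imports this one function, the off-by-one disagreement cannot occur between domains, whichever mapping approach is in use.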

The Hybrid Approach That Works for Phase III

The practical resolution for Phase III programs is a hybrid: a lightweight specification layer for shared derivations and domain-level design decisions, combined with code-first implementation for the actual transformations.

Specifically, document the following in a specification format:

USUBJID construction logic
Study day calculation reference date and edge case rules
Population flag definitions and their derivation from disposition events
Imputation rules for partial dates and missing data
Any non-standard variables (NSVs) requiring define.xml entries
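The shared layer listed above can be small enough to live in a single structure that every domain program imports. The sketch below shows the idea for USUBJID construction only. The `SharedSpec` class, its field names, and the STUDYID-SITEID-SUBJID pattern are assumptions for illustration; the actual construction rule is a study-level decision.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SharedSpec:
    """Study-level decisions that every domain program must apply identically."""
    usubjid_parts: tuple = ("STUDYID", "SITEID", "SUBJID")  # assumed pattern
    usubjid_separator: str = "-"
    study_day_reference: str = "RFSTDTC"  # variable used as the study day anchor


SPEC = SharedSpec()


def build_usubjid(record: dict, spec: SharedSpec = SPEC) -> str:
    """Join the configured components in order to form USUBJID."""
    return spec.usubjid_separator.join(str(record[part]) for part in spec.usubjid_parts)


# Hypothetical subject record
subject = {"STUDYID": "ABC123", "SITEID": "001", "SUBJID": "0042"}
print(build_usubjid(subject))
```

Because the dataclass is frozen, a domain program cannot quietly override a shared decision at run time; changing the rule means changing the specification itself.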

Leave the following as code-documented:

Individual domain variable transformations from eCRF source
SUPPQUAL variable derivations
Controlled terminology lookups (document the CT version, not each term)
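For controlled terminology, documenting the version rather than each term might look like the following sketch: the version is recorded once, and the lookup refuses to pass through any collected value it cannot map. The `to_ct` helper and the mapping dictionary are hypothetical, and the version string is deliberately left as a placeholder.

```python
CT_VERSION = "<CT package version>"  # placeholder: record the package actually used

# Collected CRF text mapped to submission values for the severity codelist
AESEV_MAP = {"mild": "MILD", "moderate": "MODERATE", "severe": "SEVERE"}


def to_ct(raw: str, mapping: dict, codelist: str) -> str:
    """Map a collected value to its controlled term, failing on anything unmapped."""
    key = raw.strip().lower()
    if key not in mapping:
        # Surface the problem during programming rather than at conformance checking
        raise KeyError(f"{raw!r} has no {codelist} term in CT version {CT_VERSION}")
    return mapping[key]


print(to_ct(" Moderate ", AESEV_MAP, "AESEV"))
```

The mapping itself stays in code, where it is reviewed with the transformation; only the version needs an entry in the specification layer.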

This hybrid approach takes roughly 4–6 days for the shared specification layer on a typical Phase III domain set, versus 15–25 days for a full workbook. The code review and QC process handles the remaining traceability requirement for individual variable derivations.

Tool Implications

The specification approach is better supported by SDTM mapping tools like Pinnacle 21 Enterprise, which can ingest mapping specifications and track them against the production datasets. The code-first approach works better in programming environments where the codebase is maintained in version control (Git) and code review is the primary QC mechanism.

MLPipeKit's SDTM mapping module supports both workflows. Define domain mapping specifications once in the mapping editor — at whatever level of granularity your team prefers — and the platform generates the transformation code and tracks specification-to-output consistency across study milestones. For teams transitioning from specification-heavy workflows, this reduces the workbook burden while maintaining the traceability required for regulatory submission.

As described in our article on the Pinnacle 21 errors that trigger FDA technical rejections, the conformance issues most likely to cause submission problems are not individual variable derivations — they are structural decisions about define.xml generation, CT version management, and cross-dataset consistency. Those are exactly the shared derivation components that a lightweight specification layer should document, regardless of which approach governs individual domain coding.

Conclusion

Specification workbooks are not wrong. Code-first approaches are not shortcuts. Each has a context where it performs well and a context where it generates more work than it saves. Phase III programs benefit from the traceability and consistency that a specification layer provides for shared derivations. They do not benefit from full workbook specification of every variable in every domain under the time pressure of a submission timeline. The hybrid approach respects both constraints.

Explore how MLPipeKit handles SDTM mapping for Phase II–III programs. View the platform or request a demo.
