About | MLPipeKit

Our Story

Why We Built MLPipeKit

Sarah Lindgren spent six years as a biostatistician at a mid-size CRO running Phase II and III oncology trials. The work that frustrated her most wasn't the statistics — it was the month she lost every database lock cycle waiting for clean data. EDC exports arrived late. Query listings were generated by hand in Excel. SDTM mapping specs lived in shared drives that three different people maintained independently.

She left in 2022 to build the tool she had been describing to colleagues for years: a pipeline that treated clinical trial data as structured data to be validated and transformed programmatically, not as documents to be reviewed by hand. MLPipeKit started as an internal toolset for two Boston-area sponsors and became a product when three more requested access within three months of hearing about it.

Today, MLPipeKit supports Phase I through Phase III data operations for seven pharmaceutical and biotech organizations. The team is five people. The product works because everyone on it came from clinical data management or biostatistics, not from generic enterprise software.

What We Believe

How we make decisions

Data provenance is non-negotiable

Every transformation MLPipeKit performs is traceable to a rule, a version, and a timestamp. We do not build features that obscure how a value changed. If you cannot explain a data point to an FDA reviewer, your software should not have produced it.
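To make the idea concrete, here is a minimal Python sketch of what a rule-, version-, and timestamp-level audit entry could look like. It is an illustration only, not MLPipeKit's actual schema or API: the `ProvenanceRecord` class, the `apply_rule` helper, and the `SEX-UPPER` rule ID are all hypothetical names chosen for this example.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class ProvenanceRecord:
    """One audit entry: what changed, under which rule and version, and when.

    Hypothetical schema for illustration; not MLPipeKit's real data model.
    """
    variable: str
    old_value: str
    new_value: str
    rule_id: str
    rule_version: str
    applied_at: datetime


def apply_rule(variable, value, rule_id, rule_version, transform, log):
    """Apply `transform` to `value` and append a provenance record to `log`."""
    new_value = transform(value)
    log.append(ProvenanceRecord(
        variable=variable,
        old_value=value,
        new_value=new_value,
        rule_id=rule_id,
        rule_version=rule_version,
        applied_at=datetime.now(timezone.utc),  # timezone-aware UTC timestamp
    ))
    return new_value


# Example: map a raw EDC value to a one-letter code (assumed rule).
audit_log = []
sex = apply_rule("SEX", "male", "SEX-UPPER", "1.0.0",
                 lambda v: v[:1].upper(), audit_log)
```

Because each record is immutable and stores both the old and the new value, any single data point can be walked back to the rule and rule version that produced it.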

Tools should fit the domain, not the reverse

We have seen too many clinical teams forced to adapt their workflows to generic ETL platforms that were not built for regulated data environments. MLPipeKit uses CDISC terminology, SDTM domain conventions, and ICH E6 compliance framing because that is the language clinical data management actually uses.

Validation is a process, not a checkbox

Regulatory validation of computer systems is often treated as a one-time event. We ship a living GAMP 5 validation package that updates with each platform release. Our customers do not need to re-validate from scratch when we add a feature — they receive impact assessments and pre-populated change control documentation.

The data represents patients

Clinical trial data errors are not just process problems — they delay regulatory decisions that affect patients waiting for approved therapies. We take data quality seriously because that is what the work ultimately means.

Results

What teams accomplish with MLPipeKit

18 days earlier database lock

Mid-size biotech — Phase III oncology

A 220-patient Phase III trial across 14 sites was running a 6-week data cleaning cycle. Manual query generation took 3 data managers two weeks per data cut. After integrating MLPipeKit, query listings were generated in under 4 hours. The final database lock arrived 18 days ahead of the original timeline.

0 TRC findings on first NDA submission

Specialty pharma company — NDA submission package

A sponsor preparing its first NDA submission used MLPipeKit to run Pinnacle 21 conformance checks throughout the SDTM build process rather than at the end. Twenty-three conformance issues that would have triggered a rejection under the FDA's Technical Rejection Criteria were caught and corrected during mapping, not after submission.

5 days to first clean data cut

CRO — Phase II respiratory program

A CRO managing a 3-arm Phase II study needed to deliver interim analysis data to a DSMB on a compressed timeline. Using MLPipeKit's incremental validation mode, they delivered a clean interim analysis data cut within 5 days of the DSMB request, compared with the 3-week estimate under their prior workflow.

By the Numbers

Where we are today

2022 Founded in Boston, MA

7 pharma and biotech clients

23 Phase I–III studies processed

5 team members (all clinical data background)
