Building the data infrastructure that clinical trials should have had ten years ago.
Sarah Lindgren spent six years as a biostatistician at a mid-size CRO running Phase II and III oncology trials. The work that frustrated her most wasn't the statistics — it was the month she lost every database lock cycle waiting for clean data. EDC exports arrived late. Query listings were generated by hand in Excel. SDTM mapping specs lived in shared drives that three different people maintained independently.
She left in 2022 to build the tool she had been describing to colleagues for years: a pipeline that treated clinical trial data as structured data to be validated and transformed programmatically, not as documents to be reviewed by hand. MLPipeKit started as an internal toolset for two Boston-area sponsors and became a product when three more sponsors requested access within three months of hearing about it.
Today, MLPipeKit supports Phase I through Phase III data operations for seven pharmaceutical and biotech organizations. The team is five people. The product works because everyone on it came from clinical data management or biostatistics, not from generic enterprise software.
Every transformation MLPipeKit performs is traceable to a rule, a version, and a timestamp. We do not build features that obscure how a value changed. If you cannot explain a data point to an FDA reviewer, your software should not have produced it.
We have seen too many clinical teams forced to adapt their workflows to generic ETL platforms that were not built for regulated data environments. MLPipeKit uses CDISC terminology, SDTM domain conventions, and ICH E6 compliance framing because that is the language clinical data management actually uses.
Regulatory validation of computer systems is often treated as a one-time event. We ship a living GAMP 5 validation package that updates with each platform release. Our customers do not need to re-validate from scratch when we add a feature — they receive impact assessments and pre-populated change control documentation.
Clinical trial data errors are not just process problems. They delay regulatory decisions that affect patients waiting for approved therapies, and that is why we take data quality seriously.
A 220-patient Phase III trial across 14 sites was running a 6-week data cleaning cycle. Manual query generation took 3 data managers two weeks per data cut. After integrating MLPipeKit, query listings were generated in under 4 hours. The final database lock arrived 18 days ahead of the original timeline.
A sponsor preparing their first NDA submission used MLPipeKit to run Pinnacle 21 conformance checks throughout the SDTM build process rather than at the end. Twenty-three conformance issues that would have triggered an FDA Technical Rejection Criteria notice were caught and corrected during mapping — not after submission.
A CRO managing a 3-arm Phase II study needed to deliver interim analysis data to a DSMB on a compressed timeline. Using MLPipeKit's incremental validation mode, they produced a clean interim analysis data cut within 5 days of the DSMB request, compared to the 3-week estimate under their prior workflow.
Every demo is with someone who has run clinical data operations, not a generic sales representative.
Get in Touch