Simplifying the Complex: Optimizing the Data Model for Clinical Data Cleaning

Part II of the series: “Taking a Modern Approach to Aggregate, Clean, and Transform Clinical Data”

Anyone working in clinical data management today knows how difficult it is to aggregate study data from multiple sources. Each source often uses its own unique combination of column names, column orders, and data types to represent the study data it has gathered.

In addition, the data itself is heterogeneous. Different sources describe common study data elements differently. For example, visit information might be described as visits or events, while the values in that field could range from “visit 1” to “v1” or “Visit01” for the same clinical encounter. This level of variability complicates downstream data analysis and data cleaning.
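To make the problem concrete, here is a minimal Python sketch, using hypothetical visit labels, of the kind of normalization that heterogeneous source values force on downstream teams:

```python
import re

# Hypothetical raw visit labels from different sources, all referring to the same encounter
raw_visit_values = ["visit 1", "v1", "Visit01"]

def normalize_visit(label: str) -> str:
    """Collapse common visit-label variants (e.g. 'v1', 'Visit01') to one canonical form."""
    match = re.search(r"(\d+)", label)          # pull out the visit number, if present
    if match:
        return f"VISIT {int(match.group(1))}"   # 'Visit01' -> 'VISIT 1'
    return label.strip().upper()                # otherwise just tidy the original label

print({value: normalize_visit(value) for value in raw_visit_values})
# {'visit 1': 'VISIT 1', 'v1': 'VISIT 1', 'Visit01': 'VISIT 1'}
```

Multiply this by every field in every source and the scale of the harmonization effort becomes clear.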

As a result, many data management teams have turned to the Study Data Tabulation Model (SDTM), a standard designed to support data analysis, as a way to consolidate source data. Some data managers reason: if you’re going to have to transform your data to SDTM anyway, why not do it as soon as possible? That way, study teams can use the consolidated data to support data reconciliation and cleaning, as well as to generate near-term deliverables such as DMC tables, listings, and figures.

Unfortunately, producing SDTM data turns out to be hard, too. Although some teams claim to have automated the process of getting raw data into SDTM, typically only about 70% of the SDTM variables can be auto-generated; programming skills and human artistry are needed for the remaining 30%. In the end, using the SDTM model takes time because of the number of transformations required. That delay is felt again whenever changed data is uploaded, since it must be re-transformed before data cleaning can even begin.

As dataset sizes increase and the number of data sources grows, the time required to complete the raw-to-SDTM transformation stretches from minutes to multiple hours or days.

SDTM Lite: better, but it’s no cure-all

Many teams have mitigated these delays by using “SDTM Lite,” an SDTM-like data model, instead of full-fat SDTM to support data cleaning. But even here, the large number of data manipulations required delays data-cleaning activities.

What happens when the mega-trial they’re working on, closely watched by C-level execs, is approaching a critical analysis date and the team needs to fix a source data issue? In large studies, it takes time to extract the data from EDC, load it to a server, re-run the SDTM programs, re-run validation, and then bring the data back to the data management team. Communication is required at each of these steps, further extending the timeline. As a result, the data management team can’t immediately see a small source-data change reflected in their data-cleaning data model. Instead, they must wait, typically for days, for the SDTM-like datasets to be produced, and then confirm that the source change has fed through as expected and fixed the issue.

Each data variable that requires manipulation or transformation adds fragility to the process, particularly where the transformation code is not suitably defensive. When an unexpected data value turns up, the code used to produce the SDTM-like data falls over. To remedy this, programmers are called in to get the data-cleaning process up and running again, further increasing the time required.
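As an illustration only, not any team’s actual transformation code, the difference between brittle and defensive mapping logic might look like this hypothetical Python sketch:

```python
# A brittle mapping raises as soon as an unexpected value appears, halting the whole run:
SEX_MAP = {"MALE": "M", "FEMALE": "F"}

def map_sex_brittle(value: str) -> str:
    return SEX_MAP[value]                 # KeyError on e.g. "Unknown" stops the pipeline

# A defensive version keeps the load running and flags the value for human review instead:
def map_sex_defensive(value: str, issues: list[str]) -> str:
    key = value.strip().upper()
    if key not in SEX_MAP:
        issues.append(f"Unexpected SEX value: {value!r}")
        return ""                         # leave blank and report, rather than fall over
    return SEX_MAP[key]

issues: list[str] = []
mapped = map_sex_defensive("Unknown", issues)
print(mapped, issues)                     # prints an empty mapped value plus the logged issue
```

Defensive code softens the blow, but every additional transformation is still one more place for the pipeline to stumble.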

There is another challenge that has become more pronounced with the move toward more decentralized sources of clinical data: smaller providers (e.g., those offering device data) are not familiar with the SDTM structure and therefore do not use it, creating more potential disconnects. In a typical scenario, less experienced data reviewers comment on discrepancies in the SDTM variables without referring to the raw data model, which is what the sites need in order to interpret the query correctly. At this point, someone has to translate the SDTM back to raw form to make sense of what needs cleaning. There has to be a better way.

AI/ML to the rescue? Not so fast …

To err is human, and even technically minded programming types are fallible. Couldn’t we just use artificial intelligence and/or machine learning (AI/ML) to generate SDTM-like data?

This might increase the reliability with which data can be aggregated and reduce the effort required, but it wouldn’t reduce the data processing time, nor would it entirely eliminate the fragility introduced by the sheer number of data transformations. In addition, a black-box approach needs time to win the trust of its users, as it’s often not possible to see what data transformations are being made, or why.

Furthermore, findings generated on SDTM data are still managed in spreadsheets, which must then be entered as queries in EDC or distributed to data providers. Ultimately, your SDTM or SDTM-like data will be easier to produce if your source data has already been cleaned.

Keep it simple

At Veeva, we’ve taken a fresh look at what’s needed to support data cleaning alone, and asked: why force data cleaning onto a data model that was designed to support downstream data analysis? Data managers are involved in the study build, even if they are not handling the build themselves, so they should already be familiar with the raw data structure. The study backbone in Veeva CDB requires only the bare minimum of data transformations to support cleaning and review, so raw study data can be brought together in a faster, more durable way.

As data is loaded into CDB, it is immediately mapped to the study backbone. The EDC data is treated as the trusted source, since it typically receives the greatest level of scrutiny. Non-EDC data is then married up with EDC data using site, subject, and visit identifiers, which become common keys across all the study data.
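As a rough illustration of the idea, not of CDB’s internals, the sketch below uses pandas and made-up EDC and lab records to show how site, subject, and visit identifiers can act as common keys across sources:

```python
import pandas as pd

# Hypothetical EDC data, treated as the trusted source
edc = pd.DataFrame({
    "site_id":    ["101", "101", "102"],
    "subject_id": ["101-001", "101-002", "102-001"],
    "visit":      ["VISIT 1", "VISIT 1", "VISIT 2"],
    "visit_date": ["2023-04-01", "2023-04-03", "2023-04-10"],
})

# Hypothetical central-lab data arriving in its own layout
lab = pd.DataFrame({
    "site_id":    ["101", "102"],
    "subject_id": ["101-001", "102-001"],
    "visit":      ["VISIT 1", "VISIT 2"],
    "lab_test":   ["ALT", "ALT"],
    "lab_result": [32, 54],
})

# Marry the lab records up with the EDC data using the common site/subject/visit keys
keys = ["site_id", "subject_id", "visit"]
combined = lab.merge(edc, on=keys, how="left", indicator=True)
print(combined)
```

Because no variable-by-variable remodeling is needed, the join can happen as soon as the data lands.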

With this backbone established, Veeva’s Clinical Query Language (CQL), customized for clinical trials, can be used to build or clone various flavors of listings, and even to create checks that function like edit checks in the EDC but run in CDB, making them actionable against all study data, including non-EDC sources. CQL checks can be cloned from a standard library and deployed instantly to new studies.
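CQL has its own syntax, so the following is only a plain-Python analogue of the concept: a small library of reusable, study-agnostic checks that can be run against any study’s combined data, much as standard checks are cloned into new studies. All column names and thresholds here are hypothetical:

```python
import pandas as pd

# A tiny "standard library" of reusable checks; each returns the offending rows.
def check_missing_visit_date(df: pd.DataFrame) -> pd.DataFrame:
    """Flag records that have a visit but no visit date."""
    return df[df["visit"].notna() & df["visit_date"].isna()]

def check_out_of_range(df: pd.DataFrame, column: str, low: float, high: float) -> pd.DataFrame:
    """Flag numeric results outside an expected range."""
    return df[(df[column] < low) | (df[column] > high)]

STANDARD_CHECKS = {
    "missing_visit_date": check_missing_visit_date,
    "alt_out_of_range": lambda df: check_out_of_range(df, "lab_result", 0, 200),
}

def run_checks(study_df: pd.DataFrame) -> dict[str, pd.DataFrame]:
    """Run every standard check against one study's combined data."""
    return {name: check(study_df) for name, check in STANDARD_CHECKS.items()}

# Example: the same checks apply to any study's combined EDC + non-EDC data
study = pd.DataFrame({
    "subject_id": ["101-001", "101-002"],
    "visit":      ["VISIT 1", "VISIT 1"],
    "visit_date": ["2023-04-01", None],
    "lab_result": [32, 640],
})
for name, findings in run_checks(study).items():
    print(name, "->", len(findings), "finding(s)")
```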

Because study backbone mappings are light-touch, there’s almost no chance that a data source will fail to load when an unexpected data value comes up. This, together with the configurable data-load process, gives durability to data aggregation in CDB. When a failure does occur, perhaps because a subject ID was captured incorrectly, the bad record is identified automatically and reported to both the third-party data provider and the sponsor or CRO data management team for resolution as soon as the data is processed.
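A simplified sketch of that kind of load-time validation, using hypothetical data and not reflecting how CDB implements it, might look like:

```python
import pandas as pd

# Trusted subject IDs, as captured in EDC (hypothetical)
edc_subjects = {"101-001", "101-002", "102-001"}

# Incoming third-party records; the second subject ID was mis-keyed at the source
incoming = pd.DataFrame({
    "subject_id": ["101-001", "101-0O2"],
    "visit":      ["VISIT 1", "VISIT 1"],
    "lab_result": [32, 47],
})

is_known = incoming["subject_id"].isin(edc_subjects)
bad, good = incoming[~is_known], incoming[is_known]

# Bad records are reported for resolution; valid records continue through the load
for _, row in bad.iterrows():
    print(f"Report to data provider and DM team: unknown subject_id {row.subject_id!r}")
```

The key point is that one bad record is quarantined and flagged, rather than bringing the whole load to a halt.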

Using a productized solution and a light-touch data transformation approach, mega-trial data can be made available for data cleaning in minutes, not hours, with automated data reconciliation findings generated in-stream.

In tests, CDB checks and automation have led to a >30% reduction in the need to identify and handle queries manually. Meanwhile, CQL allows for bulk actions across all study data.

This new approach can aggregate data from any third-party provider and lets providers deliver patient data in whatever way is most convenient for them, eliminating data-cleaning delays and speeding cross-source data cleaning and reconciliation. It also eliminates the spreadsheets that have commonly been used to track these activities.

Clinical data management will never be easy, but Veeva CDB offers a simpler way to free data management teams from unnecessary manual processes and give study teams the complete and concurrent data they need to make data cleaning as quick and easy as possible.

Learn more about Veeva CDB.