For example, consider a scenario where a data team in a life sciences organization needs to prepare a data product for a machine learning model that detects potential adverse events associated with drugs, thereby improving patient safety. Here, the data might reside in a data lake, encrypted for security reasons and compressed to save on archival costs. It may be stored in an open format such as Parquet, while the ML model requires CSVs. Additionally, the data may need to be pre-processed to remove duplicates and null rows, or to replace certain special characters. Other details may need to be addressed as well, such as detecting partitions at the source or providing a custom schema to read the files. The data team might also choose to enable parallel processing to speed up data preparation.
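To make these steps concrete, here is a minimal PySpark sketch of such a preparation job. The schema fields, paths, and partition count are hypothetical, and decryption is assumed to be handled transparently by the storage layer (for example, server-side encryption on the lake):

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StructField, StringType, DateType

spark = SparkSession.builder.appName("adverse-event-prep").getOrCreate()

# Hypothetical custom schema for the adverse-event records
schema = StructType([
    StructField("patient_id", StringType(), True),
    StructField("drug_name", StringType(), True),
    StructField("event_description", StringType(), True),
    StructField("report_date", DateType(), True),
])

df = (
    spark.read
    .schema(schema)                        # supply a custom schema instead of inferring it
    .parquet("s3://lake/adverse_events/")  # partition columns are detected from the directory layout
)

cleaned = (
    df.dropDuplicates()       # remove duplicate rows
      .dropna(how="all")      # drop rows where every column is null
      .withColumn(            # replace non-printable/special characters with spaces
          "event_description",
          F.regexp_replace("event_description", r"[^\x20-\x7E]", " "),
      )
)

(
    cleaned.repartition(32)   # spread the write across 32 parallel tasks
           .write.mode("overwrite")
           .option("header", True)
           .csv("s3://lake/adverse_events_csv/")  # CSV output for the ML model
)
```

Even this simplified version involves format conversion, cleansing, and tuning decisions that have to be revisited for every new source and use case.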
To ensure the data product meets the requirements of the model, business-specific Data Quality rules need to be applied as early in the pipeline as possible, so problems are caught before they consume the data team's valuable resources. This is a complicated and time-consuming process. Now, imagine doing the same thing for millions of files, across scores of use cases, under ever more demanding timelines from the business. Writing code can take the team only so far, leaving business users to wonder why it takes so long to prepare data products for their ML models.
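As an illustration of what even a couple of hand-coded quality rules look like, here is a sketch that continues from the `cleaned` DataFrame above. The rules themselves are hypothetical examples of business-specific checks, failing fast so bad records never reach the model:

```python
from pyspark.sql import functions as F

# Hypothetical business rules: every record must name a drug,
# and report dates cannot lie in the future.
rule_violations = {
    "missing_drug_name": cleaned.filter(F.col("drug_name").isNull()).count(),
    "future_report_date": cleaned.filter(F.col("report_date") > F.current_date()).count(),
}

failed = {rule: count for rule, count in rule_violations.items() if count > 0}
if failed:
    # Stop the pipeline early rather than publish a flawed data product
    raise ValueError(f"Data quality checks failed: {failed}")
```

Multiply this by every rule, every source, and every use case, and the limits of a purely code-driven approach become clear.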