Daisy the Data Scientist
- Daisy needs to partner with clinical researchers and stakeholders to answer data-driven questions about Fred Hutch and its patients using real-world data.
- She struggles to efficiently explore data, conduct analyses, and distribute results on the organization’s data infrastructure, and to align her practices with data governance policies.
- We can help her by creating a secure platform with sufficient compute resources and clear data governance policies for doing data science with regulated, sensitive data.
Daisy needs a platform for data science that aligns with data classification policies
Clinical researchers and leaders at Fred Hutch have many questions about Fred Hutch’s patients and their healthcare. Enter Daisy the data scientist! Daisy is here to help answer those questions using data from the many clinical applications and programs used to deliver care, known as “real-world data” (RWD). It’s a breeze for Daisy to work with data in structured, discrete formats. Unfortunately, the real world is messy, and so are its data structures.
Daisy struggles to make use of the plethora of unstructured data formats produced in a healthcare organization. It seems like many of the out-of-the-box natural language processing (NLP) systems that she needs are not designed for healthcare. Who knew? Daisy does her best to work with them, but she has neither the required compute resources nor a secure workspace to adapt these models and evaluate their performance. And then there are the many stakeholders who come to her with questions about AI and look for her help in applying large language models (LLMs) to clinical data. Daisy wishes her institution would offer guidance on how to use these new models, a platform suitable for development, and a professional community to help adapt her skills to this new data science paradigm.
As a data scientist, Daisy needs to make sure that good foundations are in place for advanced analytics, model development, and experimentation before she starts applying fancy new models and tools. How can she help stakeholders develop sound epidemiological questions? Does she have the tools for reproducible analysis? How can she find the right data to apply to the right question at the right time?
Collaborators: Data Engineers, DJ the Data Governance Analyst, Alex the BI/Analytics Engineer, Melissa the Clinical Analyst
Downstream users: Carina the Clinical Researcher, Bobby the Biostatistician, Program and Service Line Managers
Key Challenges
- Understanding the landscape of clinical data applications at Fred Hutch, where data is stored, and how to acquire access
- Local machines are not the best computing environment for clinical data science: some clinical databases cannot be accessed from a Mac, and many computing environments for reproducible analysis cannot be re-created on Windows
- Educating and nudging researchers towards best practices for clinical data science
- Lack of self-service tools and of a consistent set of data governance policies for the use of multimodal clinical data across the institution
- There is no unified system with all the relevant data; data must be collated from multiple systems
Needs and Wants
- An efficient way to store and retrieve past models/queries for future reference
- A more efficient way to access, integrate, and analyze multimodal clinical data that…
  - is PHI-approved
  - displays information about provenance, lineage, and data governance (e.g., whether a column contains PHI, what access restrictions are on the data)
  - supports best practices for dataset documentation
- Cloud computing environments for managing statistical/machine learning workflows
- Secure platform to publish and share deliverables (e.g. Quarto/Jupyter notebooks, dashboards, datasets)
- A way to help users help themselves, expanding the capacity of the department
Types of data used
- Structured and unstructured data from current and legacy EHR systems
- Cancer registry data
- Lab and radiation oncology data
- Clinical Trials Management system data
- Novel, non-clinically reported data, such as research-use-only genetic assay results
- Survey and case report form type datasets
- Validated lists of genomic data such as tumor mutations or structural variants
Image attribution: “Women In Tech - 53” by wocintechchat.com is licensed under CC BY 2.0.
last updated July 2024