Hello Pratham Mahajan,
Extracting structured toxicology data from complex, unstructured reports is a tough challenge, particularly when you’re dealing with scientific nuance and 250+ parameters. You’re right that out-of-the-box models like GPT-4o often struggle with highly detailed or technical domain-specific data.
Here are some ideas that could help improve your process:
Fine-tune on Domain Data: If you can, create a custom fine-tuning dataset pairing unstructured input with the exact structured output you want (even a few hundred rows helps). Fine-tuned models, whether on OpenAI, Azure OpenAI, or another provider, tend to produce domain-specific output with much higher accuracy.
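For reference, OpenAI’s fine-tuning endpoint expects a JSONL file of chat-style examples. Here’s a minimal sketch of building one; the report snippet, field names, and file name are placeholders for illustration:

```python
import json

# Each training example pairs raw report text with the structured JSON you
# want back. These fields are illustrative placeholders; real pairs would
# come from reports you have already structured by hand.
examples = [
    {
        "report_text": "The NOAEL was determined to be 50 mg/kg bw/day in male rats.",
        "structured": {"NOAEL": "50 mg/kg bw/day", "species": "rat", "sex": "male"},
    },
]

with open("toxicology_finetune.jsonl", "w") as f:
    for ex in examples:
        record = {
            "messages": [
                {"role": "system", "content": "Extract toxicology parameters as JSON."},
                {"role": "user", "content": ex["report_text"]},
                {"role": "assistant", "content": json.dumps(ex["structured"])},
            ]
        }
        f.write(json.dumps(record) + "\n")
```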
Consider Domain-Specific Models: You might want to explore models trained for biomedical or scientific text, such as BioGPT, SciBERT, or even open models hosted on Hugging Face. These are designed to deal with clinical and scientific language.
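As a rough sketch of how that could look with Hugging Face’s transformers pipeline (the checkpoint name below is a hypothetical placeholder; substitute a biomedical NER model you find on the Hub):

```python
from transformers import pipeline

# Token-classification pipeline for entity extraction from scientific text.
# "my-org/biomedical-ner" is NOT a real checkpoint; swap in one from the Hub.
ner = pipeline(
    "token-classification",
    model="my-org/biomedical-ner",
    aggregation_strategy="simple",  # merge sub-word tokens into whole entities
)

text = "Rats received 50 mg/kg bw/day of the test substance for 90 days."
for entity in ner(text):
    print(entity["entity_group"], entity["word"], round(entity["score"], 2))
```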
Break Down the Task: Instead of having the model pull all 250 parameters at once, break extraction down into smaller, more focused prompts. For instance, extract one section or a handful of related variables at a time. You could also use a two-stage process: first, let the model find relevant sections, then have another pass that extracts the precise values.
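A minimal sketch of that two-stage pattern using the OpenAI Python SDK; the prompts, model choice, file name, and field names are illustrative:

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def ask(instruction: str, text: str) -> str:
    """One focused call: a narrow instruction over a small slice of the report."""
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": instruction},
            {"role": "user", "content": text},
        ],
    )
    return response.choices[0].message.content

with open("study_report.txt") as f:  # placeholder file name
    report = f.read()

# Stage 1: locate the passage that discusses the parameters of interest.
section = ask(
    "Return only the verbatim passage describing dose levels and NOAEL/LOAEL findings.",
    report,
)

# Stage 2: extract a small group of related fields from that passage alone.
values = ask(
    "From this passage, return JSON with keys NOAEL, LOAEL, species, and duration. "
    "Use null for anything not stated.",
    section,
)
print(values)
```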
Use Post-Processing: Models will sometimes give you “almost right” results that can be cleaned up with rule-based scripts. Combine the language model’s output with regular expressions or Python code to standardize units, enforce value ranges, or match patterns from your Excel template’s structure.
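For example, a small rule-based normalizer for dose values might look like this; the unit aliases and plausibility range are assumptions you’d replace with the conventions from your Excel template:

```python
import re

# Map common unit spellings onto one canonical form (illustrative values).
UNIT_ALIASES = {"mg/kg/day": "mg/kg bw/day", "mg/kg bw/d": "mg/kg bw/day"}

def normalize_dose(raw: str) -> str | None:
    """Parse strings like 'NOAEL: 50 mg/kg/day' into a canonical value + unit."""
    match = re.search(r"(\d+(?:\.\d+)?)\s*(mg/kg(?:\s*bw)?/(?:day|d))", raw)
    if not match:
        return None  # no parse: flag for manual review instead of guessing
    value, unit = float(match.group(1)), match.group(2)
    unit = UNIT_ALIASES.get(unit, unit)
    if not 0 < value <= 10_000:  # enforce a plausible dose range (assumed bound)
        return None
    return f"{value:g} {unit}"

print(normalize_dose("NOAEL: 50 mg/kg/day"))  # -> 50 mg/kg bw/day
```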
Human-in-the-Loop: For critical or ambiguous fields, consider flagging low-confidence results for manual review. LLMs work best when they augment, rather than completely replace, expert judgment, especially in regulatory or scientific environments.
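A tiny sketch of that triage step, assuming your pipeline attaches a confidence score to each field (e.g. from token log-probabilities, or from whether post-processing rules succeeded); the threshold is illustrative and worth tuning against a manually verified sample:

```python
# Route low-confidence extractions to a reviewer queue instead of trusting them.
CONFIDENCE_THRESHOLD = 0.85  # illustrative; tune against verified samples

extracted = [
    {"field": "NOAEL", "value": "50 mg/kg bw/day", "confidence": 0.97},
    {"field": "LOAEL", "value": "150 mg/kg bw/day", "confidence": 0.62},
]

needs_review = [row for row in extracted if row["confidence"] < CONFIDENCE_THRESHOLD]
for row in needs_review:
    print(f"Review needed: {row['field']} = {row['value']} ({row['confidence']:.2f})")
```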
Check for HAWC, SRT, or OHAT Workflows: Investigate whether these established toxicology workflows offer open-source scripts, annotation sets, or guidance on automating Appendix E population. Sometimes, adapting what’s already out there saves weeks of work.
It sounds like you’re already piecing together a solid system. A bit of domain adaptation—via fine-tuning, smarter prompt design, or hybrid automation—should boost accuracy. Good luck! Your use case is important, and every incremental enhancement to quality here can have a big impact.
Best Regards,
Jerald Felix