Pesticide Biomonitoring & External Exposure Context Expansion Plan

Version: 0.2 (Draft) – Updated schema alignment & loader cascade clarification
Generated: 2025-11-09
Related Instructions: .github/copilot-instructions.md

1. Purpose

Introduce a structured, testable ingestion + exploration layer for NHANES pesticide biomonitoring analytes and high-value external exposure context datasets (USGS agricultural use, PDP residue monitoring, TRI releases, etc.). Provide an integrated Streamlit tab enabling temporal trends, disparity views, and exploratory health linkage while preserving the project’s exploratory, non-inferential positioning.

2. High-Level Objectives

Internal ingestion of NHANES pesticide laboratory analytes (urine & serum) with harmonized schema.
Streamlit “Pesticide Biomonitoring” tab: multi-analyte trends, distribution heatmap, demographic disparities, and preliminary health correlations.
External context scaffolding for agricultural use and residue prevalence without premature geospatial microdata linkage claims.
Reusable data contracts + registry for each external source with schema versioning.
Transparent UI & docs disclaimers about simplified weighting, exploratory correlations, and absent complex survey design adjustments.

3. Scope (In / Out)

IN:
- NHANES pesticide analyte ingestion (DAP metabolites, pyrethroids, glyphosate/AMPA, organochlorines).
- Per-cycle concentration extraction + log transform + detection flag.
- Basic summary metrics: geometric mean, median, percent detected, 95th percentile.
- Demographic stratification: age range, gender, race/ethnicity, income (INDFMPIR quartiles).
- External dataset ingestion (USGS use; PDP residues) with stable schemas.
- Minimal exploratory correlation panels (e.g., analyte vs BMI / systolic BP).
- Test coverage for ingestion helpers & external source normalization.

OUT (Future Phases):
- Full survey design variance estimation (strata/PSU).
- Restricted-use geographic linkage (county/ZIP) – requires separate data access.
- Causal inference / regression modeling.
- Mixture exposure modeling (weighted quantile sum, Bayesian kernels).
- Occupation-level risk dashboards (deferred until occupational code mapping has stabilized).

4. Phased Milestones

Phase	Milestone	Deliverables	Success Criteria
1	Internal analyte ingestion	`laboratory_pesticides.py`, schema docs	Load ≥8 key analytes across ≥5 cycles; no ingestion crashes
2	Biomonitoring tab (core)	New Streamlit tab with trends, heatmap, disparity chart	Tab loads <3s cached; user can select ≥4 analytes & cycles
3	External source scaffolding	`external/usgs_use.py`, `external/pdp_residues.py`, registry	USGS + PDP functions return non-empty normalized DataFrames
4	Context integration	Tab section: agricultural use overlay + commodity residue table	Overlay chart renders; detection table downloadable
5	Exploratory health linkage	BMI/BP correlation panel + disclaimers	Correlation updates on filter; clearly marked exploratory
6	Mixture / co-exposure matrix	Analyte correlation heatmap + network graph	Graph renders for ≥6 analytes; performance acceptable

Current Progress (2025-11-09):
- Reference restructuring complete (data/reference/ hierarchical layout).
- Core minimal analyte list established (108 analytes).
- CAS verification implemented via PubChem synonyms endpoint (78 verified, 72%).
- CDC Fourth Report classification enrichment integrated (35 classified, 32%).
- Script consolidation under scripts/pesticides/ for maintainability.
- Backward compatibility shim (pesticide_reference.csv) + legacy stubs to keep tests green.
- Next active milestone: Implement laboratory ingestion module and first pass of per-cycle DataFrame assembly (Phase 1 still in progress).

5. Architecture Overview (Updated)

pophealth_observatory/
  laboratory_pesticides.py            # NHANES pesticide lab ingestion (planned)
  pesticide_context.py                # Analyte reference loading + lookup (updated with new fields)
  rag/                                # Retrieval scaffolding (future narrative integration)
  external/                           # (Planned) external contextual sources
    usgs_use.py                       # Agricultural use data (planned)
    pdp_residues.py                   # Commodity residue monitoring (planned)
    registry.py                       # Source registry & schema metadata (planned)
    exposure_context.py               # Context aggregator, build_exposure_context (planned)
apps/
  streamlit_app.py                    # Will host 'Pesticide Biomonitoring' tab
scripts/
  pesticides/                         # Consolidated pesticide maintenance scripts
    build_minimal_pesticide_reference.py
    verify_minimal_reference_cas.py
    add_cdc_classifications.py
    derive_parent_pesticide_mapping.py
    discover_nhanes_pesticides.py
    curate_pesticide_reference.py
    validate_pesticide_reference.py
data/
  reference/
    minimal/                          # Core 108-analyte reference (78 CAS verified)
    classified/                       # CDC Fourth Report classifications (35 analytes)
    legacy/                           # Archived curated AI-derived reference
    discovery/                        # Raw NHANES variable discovery output
    evidence/                         # Parent mapping attempt artifacts
    config/                           # Source registry yaml
    pesticide_reference.csv           # Compatibility shim (copy of minimal)
tests/
  test_pesticide_context.py           # Reference loading & integrity
  test_laboratory_pesticides.py       # (placeholder / future expansion)

5.1 Reference Loader Cascade (New)

The analyte reference is resolved via an ordered cascade to maximize robustness when distribution artifacts omit nested files:

data/reference/classified/pesticide_reference_classified.csv (enriched classification)
data/reference/minimal/pesticide_reference_minimal.csv (canonical minimal)
data/reference/pesticide_reference.csv (flat shim for backward compatibility)
Legacy flat minimal / classified files if accidentally retained
Any glob-discovered pesticide_reference_*.csv (last resort)

If key analytes (e.g., 3-PBA, DMP) are absent because packaging omitted them, temporary placeholder rows are injected at runtime as a CI safeguard. The success criteria include removing this injection once packaging reliably includes the nested reference CSVs.
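
The ordered lookup described above could be sketched roughly as follows. This is a minimal illustration, not the project's loader: `resolve_reference_path` and `CASCADE` are hypothetical names, the legacy-file step is skipped because its paths are not specified here, and the final glob is sorted only to make the result deterministic.

```python
from pathlib import Path
from typing import Optional

# Candidate locations in priority order, relative to data/reference/.
CASCADE = (
    "classified/pesticide_reference_classified.csv",
    "minimal/pesticide_reference_minimal.csv",
    "pesticide_reference.csv",
)


def resolve_reference_path(reference_root: Path) -> Optional[Path]:
    """Return the first reference CSV found under reference_root, else None.

    Hypothetical helper: the name and the None-on-miss convention are assumptions.
    """
    for relative in CASCADE:
        candidate = reference_root / relative
        if candidate.is_file():
            return candidate
    # Last resort: any pesticide_reference_*.csv, sorted for determinism.
    matches = sorted(reference_root.rglob("pesticide_reference_*.csv"))
    return matches[0] if matches else None
```

Returning `None` rather than raising leaves the decision about placeholder injection to the caller, which keeps that safeguard easy to remove later.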

6. Internal Data Model (Updated)

We now distinguish between:

Analyte Reference Schema (static descriptive metadata)
Laboratory Measurement Schema (per-participant concentration records – forthcoming)

6.1 Analyte Reference Schema (Aligned With `PesticideAnalyte.to_dict()`)

Field	Type	Description
variable_name	str	Original NHANES variable code (e.g. `URX3PBA`)
analyte_name	str	Canonical short name (e.g. `3-PBA`)
cas_rn	str	CAS Registry Number if verified
cas_verified_source	str	Verification source (`pubchem_api` or blank)
matrix	str	Biological matrix (`urine`, `serum`, `unknown`)
unit	str	Reporting unit (e.g. `ug/L`)
cycle_first	int	Earliest cycle year the analyte appears
cycle_last	int	Latest cycle year observed
cycle_count	int	Number of distinct cycles observed (0 if placeholder)
data_file_description	str	Short narrative describing the data file / analyte context
chemical_class	str	High-level class (CDC Fourth Report)
chemical_subclass	str	Subclass/group (CDC Fourth Report)
classification_source	str	Classification provenance (`cdc_fourth_report`)
metabolite_class	str	(Deprecated – kept blank) legacy compatibility field
parent_pesticide	str	(Deprecated – kept blank) legacy compatibility field
current_measurement_flag	bool	Legacy flag (always `True` placeholder)

Deprecation Notes:
- metabolite_class, parent_pesticide, and current_measurement_flag are retained only to avoid breaking downstream legacy scripts/tests; they will be scheduled for removal once classification coverage ≥90%.
- Placeholder analytes (3-PBA, DMP) are runtime-injected only when absent from packaged references; removal of injection is a stability milestone.

Coverage Metrics (2025-11-09):
- Classification coverage: 35 / 108 analytes (32.4%).
- CAS verification coverage: 78 / 108 analytes (72.2%).

6.2 Laboratory Measurement Schema (Planned)

Field	Type	Description
participant_id	int	NHANES SEQN identifier
cycle	str	Survey cycle (e.g. `2017-2018`)
analyte_name	str	Must join to reference analyte_name
matrix	str	`urine` or `serum`
concentration_raw	float	Reported concentration (original units)
unit	str	Measurement unit (e.g. `µg/L`, `ng/g lipid`)
log_concentration	float	ln(concentration_raw) where concentration_raw > 0
detected_flag	bool	concentration_raw > LOD when an LOD is available; otherwise concentration_raw > 0
lod	float	Limit of detection (if parseable)
source_file	str	Originating XPT filename

Removed (legacy) fields from planned measurement schema: parent_pesticide, metabolite_class (superseded by structured classification in reference layer).
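
As a rough illustration of how the two derived fields in the planned schema relate to the raw value, the sketch below computes `log_concentration` and `detected_flag` for a single measurement. `derive_fields` is a hypothetical helper, and the rule for combining the LOD with the positivity check is an assumption about how the schema's description would be applied.

```python
import math
from typing import Optional


def derive_fields(concentration_raw: float, lod: Optional[float] = None) -> dict:
    """Derive log_concentration and detected_flag for one measurement.

    Assumption: when an LOD is known, detection means concentration_raw > lod;
    otherwise any positive value counts as detected.
    """
    positive = concentration_raw > 0
    return {
        "concentration_raw": concentration_raw,  # the raw value is always kept
        # The natural log is undefined for zero or negative values, so leave it empty.
        "log_concentration": math.log(concentration_raw) if positive else None,
        "detected_flag": concentration_raw > lod if lod is not None else positive,
        "lod": lod,
    }
```

Keeping the raw value alongside an empty log field means a zero or negative record stays visible in counts while dropping out of log-scale summaries.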

7. External Data Contracts

USGS Use

{ year:int, state_fips:str, state_name:str, pesticide_active_ingredient:str, lbs_ai:float }

PDP Residues

{ year:int, commodity:str, analyte:str, detect_freq_pct:float, mean_detect_level:float|None, max_detect_level:float|None }

TRI Releases (Future)

{ year:int, cas_rn:str, state:str, release_lbs:float }

Sales (CA DPR - Future)

{ year:int, county:str, ai_name:str, lbs_sold:float }
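
One way to make these contracts testable is to express each as a field-to-type mapping and check records against it. The sketch below does this for the USGS Use contract only; `USGS_USE_CONTRACT` and `contract_violations` are hypothetical names, and a real registry might validate whole DataFrames instead of single records.

```python
from typing import Any

# Field -> expected Python type, copied from the USGS Use contract above.
USGS_USE_CONTRACT: dict[str, type] = {
    "year": int,
    "state_fips": str,
    "state_name": str,
    "pesticide_active_ingredient": str,
    "lbs_ai": float,
}


def contract_violations(record: dict[str, Any], contract: dict[str, type]) -> list[str]:
    """List human-readable problems; an empty list means the record conforms."""
    problems = [f"missing field: {name}" for name in contract if name not in record]
    problems += [
        f"{name}: expected {expected.__name__}, got {type(record[name]).__name__}"
        for name, expected in contract.items()
        if name in record and not isinstance(record[name], expected)
    ]
    problems += [f"unexpected field: {name}" for name in record if name not in contract]
    return problems
```

Note that `state_fips` is typed as a string in the contract, so a check like this would catch a FIPS code that lost its leading zero by being parsed as an integer.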

8. Functions & Contracts (Initial)

import pandas as pd  # needed by every module sketched below

# laboratory_pesticides.py
def get_pesticide_metabolites(cycle: str) -> pd.DataFrame:
    """Return harmonized pesticide analyte DataFrame for a cycle.
    Raises ValueError for malformed cycle; returns empty DataFrame if files missing."""

# external/usgs_use.py
def fetch_usgs_state_use(year: int) -> pd.DataFrame:
    """Download or load cached agricultural pesticide use estimates (state-level)."""

def normalize_usgs(df: pd.DataFrame) -> pd.DataFrame:
    """Ensure canonical columns & types; drop rows with missing active ingredient or lbs."""

# external/pdp_residues.py
def fetch_pdp_summary(year: int) -> pd.DataFrame:
    """Retrieve commodity-level residue frequency and concentrations."""

# external/exposure_context.py
def build_exposure_context(year_range: tuple[int,int], analytes: list[str]) -> dict[str, pd.DataFrame]:
    """Aggregate external context slices keyed by source name."""
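
The "raises ValueError for malformed cycle" part of the `get_pesticide_metabolites` contract could be handled by a small validator like the one below. This is a sketch under assumptions: `validate_cycle` is a hypothetical helper, and it assumes every valid cycle label has the form `YYYY-YYYY` with consecutive years, as in the `2017-2018` example. Any cycle label that does not fit that pattern would need an explicit allow-list.

```python
import re

_CYCLE_PATTERN = re.compile(r"^(\d{4})-(\d{4})$")


def validate_cycle(cycle: str) -> tuple[int, int]:
    """Parse a cycle label such as '2017-2018' into (start_year, end_year).

    Assumption: a well-formed cycle spans two consecutive years.
    """
    match = _CYCLE_PATTERN.match(cycle.strip())
    if match is None:
        raise ValueError(f"Malformed cycle {cycle!r}; expected 'YYYY-YYYY'")
    start, end = int(match.group(1)), int(match.group(2))
    if end != start + 1:
        raise ValueError(f"Cycle {cycle!r} must span two consecutive years")
    return start, end
```

Validating before any file lookup keeps the two failure modes distinct: a malformed label raises, while a well-formed label with no data yields an empty DataFrame.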

9. Streamlit Tab Responsibilities

Inputs: analyte list, cycle range, demographics, metric type, weights toggle.
Data pipeline: multi-cycle stack → filter → summarize → visualize.
Visual components:
Line Trends: geometric mean or detection frequency over cycles.
Heatmap: analytes × cycles (metric).
Demographic Bars: exposure by race/income quartile.
Correlation Panel (experimental): analyte concentration vs BMI / BP.
Context Overlay: USGS use trend vs population biomonitoring trend.
Export: CSV (long format: cycle, analyte, mean, n, detected_flag_rate).
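
The "summarize" step and the long-format export could look roughly like this. It is an unweighted sketch using only the standard library; `summarize` is a hypothetical name, and treating the exported `mean` column as the geometric mean of positive concentrations is an assumption.

```python
import statistics
from collections import defaultdict


def summarize(records: list[dict]) -> list[dict]:
    """Collapse measurement records into one row per (cycle, analyte).

    Unweighted sketch: survey weights are deliberately ignored here.
    """
    groups: dict[tuple[str, str], list[dict]] = defaultdict(list)
    for rec in records:
        groups[(rec["cycle"], rec["analyte_name"])].append(rec)

    rows = []
    for (cycle, analyte), recs in sorted(groups.items()):
        positive = [r["concentration_raw"] for r in recs if r["concentration_raw"] > 0]
        rows.append({
            "cycle": cycle,
            "analyte": analyte,
            # The geometric mean is only defined for positive values.
            "mean": statistics.geometric_mean(positive) if positive else None,
            "n": len(recs),
            "detected_flag_rate": sum(bool(r["detected_flag"]) for r in recs) / len(recs),
        })
    return rows
```

Each output row matches the export columns listed above, so the same structure can feed the trend lines, the heatmap, and the CSV download.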

10. Edge Cases & Handling (Updated)

Scenario	Handling
Missing cycle file	Return empty DataFrame; log info message
Zero or negative concentrations	Exclude from log transform; keep raw
No analytes selected	Disable plots; show instruction message
Single cycle selected for trends	Show warning (need ≥2 cycles)
External source fetch timeout	Return empty DataFrame with schema; display caution banner
Weight application without weight column	Graceful fallback to unweighted aggregation
Detection limits unavailable in raw XPT	Placeholder `lod` null; future enhancement to parse doc pages
Partial classification coverage	Display "Unclassified" bucket; surface coverage % in docs
Partial CAS verification	Soft-fail with blank CAS; expose progress metric
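
The "graceful fallback to unweighted aggregation" row could be implemented along these lines. `weighted_mean` is a hypothetical helper; returning a flag alongside the value is one possible design that lets the UI tell the user which path was taken.

```python
from typing import Optional, Sequence


def weighted_mean(
    values: Sequence[float], weights: Optional[Sequence[float]] = None
) -> tuple[float, bool]:
    """Return (mean, used_weights).

    Falls back to the unweighted mean when weights are missing, mismatched in
    length, or do not sum to a positive number, so callers can show a notice
    instead of failing.
    """
    if len(values) == 0:
        raise ValueError("values must not be empty")
    usable = (
        weights is not None
        and len(weights) == len(values)
        and sum(weights) > 0
    )
    if not usable:
        return sum(values) / len(values), False
    return sum(v * w for v, w in zip(values, weights)) / sum(weights), True
```

For example, `weighted_mean([1.0, 3.0])` takes the fallback path, while passing weights of equal length uses them.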

11. Testing Strategy

Unit:
Synthetic XPT fixture ingestion (column rename, log transform correct).
External fetch mocked responses (schema check, normalization).
Integration:
build_exposure_context returns dict with expected keys.
Streamlit tab caching (simulate first vs second call).
Edge-case asserts: empty cycle, analyte not present in early cycles, zero-only concentration vector.

12. Performance Considerations

Cache per-cycle pesticide ingestion with `@st.cache_data(ttl=3600)`.
Avoid full multi-year reprocessing: incremental stacking on selection changes.
Limit default analyte set (3–5) to keep initial render fast (<1.5s).
Defer correlation heatmap rendering until user toggles advanced section.

13. Disclaimers (UI + Docs)

“Survey weights simplified (exam weights only); no strata/PSU variance estimation.”
“Correlations exploratory; no causal inference.”
“External agricultural and residue data are context indicators, not direct exposure determinants.”
“Detection frequency may shift due to analytical method changes; interpret longitudinal changes cautiously.”

14. PR Breakdown & Labels

PR	Title	Label Suggestions	Summary
1	feat: ingest pesticide lab analytes	feat, labs	Add `get_pesticide_metabolites` + schema docs
2	feat: pesticide biomonitoring tab	feat, ui	Add new Streamlit tab (trends + heatmap)
3	feat: external USGS ingestion	feat, external-data	Add USGS module + registry entry
4	feat: PDP residue ingestion	feat, external-data	Commodity residue loader + tests
5	feat: exposure context builder	feat, orchestration	`build_exposure_context` aggregator
6	feat: disparity & correlation views	feat, analysis	Add demographic bar + simple correlations
7	feat: mixture correlation matrix	feat, analysis	Add analyte correlation heatmap + network
8	chore: docs & disclaimers update	docs	README + release notes + tab disclaimers

15. Acceptance Criteria (Phase 1–2)

Running get_pesticide_metabolites("2017-2018") returns DataFrame with ≥5 analytes and required columns.
Biomonitoring tab displays multi-line trend for ≥3 analytes across ≥4 cycles.
Heatmap renders without error; missing analyte-cycle pairs show NA.
External USGS function returns non-empty DataFrame for a known recent year (mocked in CI).
All new tests pass; coverage for new modules ≥80%.
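
For the heatmap criterion above (missing analyte-cycle pairs show NA), a small pivot step can make the gaps explicit before plotting. The sketch below assumes summary rows shaped like the export format (`cycle`, `analyte`, plus a metric column); `pivot_metric` is a hypothetical name.

```python
def pivot_metric(rows: list[dict], metric: str = "mean") -> dict[str, dict[str, object]]:
    """Build an analyte -> cycle -> value grid, filling absent pairs with None."""
    cycles = sorted({r["cycle"] for r in rows})
    analytes = sorted({r["analyte"] for r in rows})
    lookup = {(r["analyte"], r["cycle"]): r[metric] for r in rows}
    # dict.get returns None for pairs that never occurred, which renders as NA.
    return {a: {c: lookup.get((a, c)) for c in cycles} for a in analytes}
```

Every analyte gets an entry for every cycle, so an analyte that is absent from early cycles produces empty cells instead of a ragged grid.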

16. Risks & Mitigations

Risk	Impact	Mitigation
Inconsistent file naming across cycles	Missing ingestion	Build a small mapping table; fallback empty frame
Detection limits unavailable in raw XPT	Incomplete detection frequency	Introduce placeholder flag; future enhancement to parse doc pages
Large memory usage stacking many cycles	Slow UI	Limit default cycle selection; offer advanced multi-cycle toggle
External API schema drift	Break ingestion	Add schema validation & version pin in registry
Misinterpretation of exploratory correlations	Reputational risk	Prominent disclaimers + UI badge + docs alignment
Packaged distribution omits nested reference CSVs	Placeholder injection persists; risk of silent schema drift	Add explicit package data include; CI test asserts presence; remove runtime injection

17. Future Enhancements (Beyond Current Plan)

Creatinine adjustment helper for urinary analytes.
Automated analytic method change flag (LOD tracking per cycle) + store per-cycle LOD metadata table.
Occupational linkage (once occupation code ingestion has stabilized).
RAG integration: embedding pesticide trend narrative with snippet retrieval.
Export Parquet snapshots for R survey design analysis.
Classification coverage completion pass (target ≥90%).
Semi-automated synonym expansion for unverified CAS resolution.

18. Implementation Order Rationale

Start with internal analyte ingestion as the foundation. Progress to UI integration for immediately visible value. External data ingestion comes next to enrich context. Analytic overlays (disparities, correlations) are deferred until baseline ingestion is stable, so that ingestion bugs and analysis bugs do not have to be debugged at the same time.

19. Open Questions

Should glyphosate/AMPA cycles with partial missing data be excluded or flagged? (Decision: flag rows with partial_cycle=True).
Do we harmonize lipid-adjusted vs. raw serum concentrations into a single field? (Decision: keep unit explicit; no conversion yet.)
Is creatinine normalization required for all urinary analytes up-front? (Decision: optional later; display raw only initially.)

20. Next Immediate Steps (Actionable, Updated)

Implement laboratory_pesticides.get_pesticide_metabolites with cycle validation + empty-on-miss.
Add tests: ingestion happy path (mock single cycle), empty cycle edge case, reference join integrity.
Create mapping table for XPT filenames → analyte short names (minimize hard-coded branching).
Add classification & CAS coverage badges to docs (auto-generated snippet optional).
Begin Streamlit tab scaffolding (load + simple table) behind feature flag.
Extend CDC classification enrichment to attempt additional matching heuristics (e.g., case-insensitive substring, hyphen normalization) – log unresolved analytes.
Draft external source registry skeleton (USGS & PDP placeholders returning empty typed DataFrames).
Add packaging test ensuring classified & minimal CSV presence; assert no placeholder injection occurred.
Draft deprecation timeline for legacy fields (announce in CHANGELOG once coverage threshold set).

21. Classification Expansion Strategy (New)

Goal: Raise classification coverage from 32% → ≥90% while minimizing manual curation.

Heuristic Layers:
1. Case-insensitive exact name match against CDC list.
2. Hyphen/apostrophe normalization (p,p'-DDE → ppDDE) for resilient matching.
3. Synonym expansion using minimal curated synonym map (avoid uncontrolled third-party scraping initially).
4. Length + token similarity scoring for ambiguous abbreviations (retain low-confidence matches separately).

Workflow:
1. Generate unresolved analyte list after each enrichment pass.
2. Persist evidence in data/reference/evidence/ with columns: analyte_name, attempted_tokens, candidate_matches, confidence_score.
3. Manual review only for low-confidence group (< threshold, e.g. 0.6 similarity score).
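
The normalization and scoring layers, together with the evidence columns from the workflow, could be prototyped as below. This is only a sketch: `normalize` and `best_match` are hypothetical names, `difflib.SequenceMatcher` stands in for whatever similarity measure is eventually chosen, the `needs_review` field is an illustrative addition, and the synonym layer is omitted.

```python
import re
from difflib import SequenceMatcher

REVIEW_THRESHOLD = 0.6  # example threshold from the workflow above


def normalize(name: str) -> str:
    """Lowercase and strip everything except letters and digits."""
    return re.sub(r"[^a-z0-9]", "", name.lower())


def best_match(analyte_name: str, cdc_names: list[str]) -> dict:
    """Return one evidence row for analyte_name against a candidate list."""
    target = normalize(analyte_name)
    # Score every candidate on normalized text, best score first.
    scored = sorted(
        ((SequenceMatcher(None, target, normalize(c)).ratio(), c) for c in cdc_names),
        reverse=True,
    )
    score = scored[0][0] if scored else 0.0
    return {
        "analyte_name": analyte_name,
        "attempted_tokens": target,
        "candidate_matches": [c for _, c in scored[:3]],
        "confidence_score": round(score, 3),
        "needs_review": score < REVIEW_THRESHOLD,  # hypothetical extra column
    }
```

Because normalization removes case, hyphens, commas, and apostrophes before scoring, an exact match after normalization scores 1.0 and never lands in the manual review group.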

Success Metrics:
- Classification coverage ≥90% with zero false positive assignments in spot-check sample.
- Evidence artifacts versioned; reproducible enrichment script documented.

Exit Criteria:
- Placeholder legacy classification fields removable.
- CHANGELOG entry outlines migration path for downstream consumers.

This plan is a living document. Version 0.2 now reflects aligned schema, loader cascade transparency, packaging risk addition, and classification expansion strategy.