PopHealth Observatory v0.6.0 Release Notes

Date: 2025-11-07
Package: pophealth_observatory
Version: 0.6.0
Python compatibility: 3.10–3.13 (tested); requires-python will be >=3.10 going forward.

Overview

PopHealth Observatory provides a reproducible, survey-cycle–aware toolkit for:

  • NHANES component ingestion with resilient multi-URL fallbacks
  • Harmonization of participant identifiers and core measurement variables
  • Derived simple health metrics (BMI category, averaged blood pressure)
  • Structured metadata manifest generation for NHANES component pages
  • Pesticide analyte text snippet extraction (JSONL artifacts) for retrieval and experimental RAG
  • A minimal Streamlit application for exploratory inspection of selected NHANES and BRFSS indicators

This version continues to emphasize transparent data access and lightweight population-level pattern scanning. It is not designed for formal inferential statistics, regulatory submissions, or litigation support.

Added (0.6.0)

  • Streamlit exploratory app:
      • Cross-sectional tab with demographic filtering (age range, gender, race/ethnicity).
      • Multi-cycle trend tab with parallel loading and basic 95% CI approximation (mean ± 1.96 * SD / sqrt(n)).
      • Bivariate scatter (optional OLS trendline) for simple associations.
      • Geographic tab (BRFSS state-level choropleth, animated or single-year).
  • Optional exam weight application in summaries (simplified weighted mean/variance; does not account for strata/PSU or complex survey design—see Experimental section).
  • Robust import fallback for application environments where editable install fails (sys.path injection + warning).
  • Dynamic package version display in the app sidebar (v{__version__}).
  • Cached data layers separated by latency cost:
      • Raw multi-component NHANES merged slice
      • Filtered demographic subset
      • Aggregated summaries
      • Visualization helpers (uncached, low-latency)
  • Enhanced BRFSS indicator normalization (state code, state name, year coercion, numeric value parsing).
  • Pesticide snippet pipeline retained (no breaking changes) with JSONL output format stability.
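The 95% CI approximation named for the trend tab (mean ± 1.96 * SD / sqrt(n)) can be illustrated with a short standalone sketch. The function name `approx_ci95` and the sample values are hypothetical, and the use of the sample standard deviation (n - 1 denominator) is an assumption; the app's own implementation may differ.

```python
import math
import statistics


def approx_ci95(values):
    """Return (mean, lower, upper) using mean ± 1.96 * SD / sqrt(n).

    Normal-approximation interval only; it ignores survey weights,
    strata, and PSUs, so it is descriptive rather than inferential.
    """
    n = len(values)
    if n < 2:
        raise ValueError("need at least two observations to estimate SD")
    mean = statistics.fmean(values)
    sd = statistics.stdev(values)  # sample SD (n - 1); an assumption
    half_width = 1.96 * sd / math.sqrt(n)
    return mean, mean - half_width, mean + half_width


# Made-up BMI values for a single cycle, for illustration only.
bmi_sample = [22.0, 24.0, 26.0, 28.0, 30.0]
mean, lower, upper = approx_ci95(bmi_sample)
print(f"mean={mean:.2f}, 95% CI=({lower:.2f}, {upper:.2f})")
```

Because the interval treats observations as a simple random sample, it will generally be narrower than one that accounts for the NHANES design.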

Changed

  • Title and sidebar explanatory scope refined to explicitly note non-inferential intent.
  • Streamlit layout standardized to wide mode; consistent use of neutral white Plotly theme.
  • Box/violin plot captions now clarify what is shown, with simpler phrasing for end-user interpretation.
  • Manifest generation still requires manual invocation, but the internal parsing helpers (component table extraction) have stabilized.

Fixed

  • Import fallbacks prevent ModuleNotFoundError when the "." line (which installs the local package) is omitted from requirements.txt or editable install constraints are present.
  • Streamlit API parameter harmonization for broader version compatibility (use_column_width usage).
  • Dependency pinning (e.g., SciPy/Statsmodels compatibility) to avoid runtime scientific stack import errors in hosted environments.
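The import fallback (sys.path injection plus a warning) can be sketched as follows. This is a minimal illustration under stated assumptions, not the app's actual code: the helper name `import_with_fallback` and the fallback directory argument are hypothetical.

```python
import importlib
import sys
import warnings
from pathlib import Path


def import_with_fallback(module_name, fallback_dir):
    """Import module_name; if it is not installed, add fallback_dir to
    sys.path, emit a warning, and retry the import once."""
    try:
        return importlib.import_module(module_name)
    except ModuleNotFoundError:
        fallback = str(Path(fallback_dir).resolve())
        if fallback not in sys.path:
            sys.path.insert(0, fallback)
        # Make sure the import system notices the newly added directory.
        importlib.invalidate_caches()
        warnings.warn(
            f"{module_name!r} is not installed; importing from {fallback} instead.",
            RuntimeWarning,
            stacklevel=2,
        )
        return importlib.import_module(module_name)


# An installed module takes the normal path and triggers no warning.
json_module = import_with_fallback("json", ".")
print(json_module.__name__)
```

In an app script, the fallback directory would typically be the repository root, computed relative to the script's own location.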

Data Coverage (Current)

| Domain | Status | Notes |
|---|---|---|
| Demographics (DEMO) | Ingested & merged | Age, gender, race/ethnicity harmonized labels. |
| Body Measures (BMX) | Ingested & merged | BMI, height, weight, waist + BMI category. |
| Blood Pressure (BPX) | Ingested & merged | Averaged systolic/diastolic + categorical flag. |
| BRFSS Indicators | App-level fetch (API, limited subset) | Normalized prevalence fields; year filtering. |
| Pesticide Text Snippets | Extracted (JSONL) | CAS RN + analyte context sentences. |
| Laboratory Panels | Manifest only | No structured ingestion/derivation in this release. |
| Dietary Intake | Planned (mapping stub) | Not yet included in merged dataset. |
| Questionnaire (Medical History, Smoking, Alcohol, PAQ) | Mapped identifiers only | Getter methods not implemented yet. |
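The derived BMX/BPX fields in the table (BMI category, averaged blood pressure) can be illustrated with a small sketch. The cutoffs below are the commonly used WHO adult BMI thresholds and the category labels are placeholders; both are assumptions, since these notes do not state the thresholds or labels the package applies. The function names are hypothetical.

```python
def bmi_category(bmi):
    """Map a BMI value to a category (assumed WHO adult cutoffs)."""
    if bmi is None:
        return None
    if bmi < 18.5:
        return "Underweight"
    if bmi < 25:
        return "Normal"
    if bmi < 30:
        return "Overweight"
    return "Obese"


def average_readings(readings):
    """Average the non-missing blood pressure readings; None if all are missing."""
    valid = [r for r in readings if r is not None]
    return sum(valid) / len(valid) if valid else None


# Made-up participant values for illustration.
category = bmi_category(27.3)
avg_systolic = average_readings([120, None, 124])
print(category, avg_systolic)
```

Skipping missing readings rather than treating them as zero keeps the average meaningful for participants with fewer than the full set of measurements.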

Experimental

  • Retrieval-Augmented Generation (RAG) scaffold (pesticide context embeddings, index build) remains experimental; API subject to change.
  • Weighted summaries use simple weighted mean/variance without complex survey design variance estimation (no strata/PSU handling).
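A simple weighted mean and variance of the kind described above might look like the following sketch. The exact formula used by the app is not given in these notes, so the population-style variance (weights as the denominator) is an assumption, and the function name `weighted_mean_var` is hypothetical.

```python
def weighted_mean_var(values, weights):
    """Return (weighted mean, weighted variance).

    mean = sum(w * x) / sum(w)
    var  = sum(w * (x - mean) ** 2) / sum(w)

    No strata or PSU information is used, so this is not a
    design-based variance estimate.
    """
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    total_w = sum(weights)
    if total_w <= 0:
        raise ValueError("weights must sum to a positive number")
    mean = sum(w * x for x, w in zip(values, weights)) / total_w
    var = sum(w * (x - mean) ** 2 for x, w in zip(values, weights)) / total_w
    return mean, var


# Made-up systolic readings with made-up exam weights.
w_mean, w_var = weighted_mean_var([120.0, 130.0, 140.0], [1.0, 1.0, 2.0])
print(w_mean, w_var)
```

With equal weights, the result reduces to the ordinary mean and population variance.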

Known Limitations

  • No advanced survey design inference (no Taylor series linearization, replicate weights).
  • No laboratory normalization or unit harmonization beyond BMX/BPX scope.
  • No longitudinal reconstruction (NHANES is cross-sectional; panel logic not attempted).
  • No dependency on external databases; all network access is direct HTTPS/FTP to CDC endpoints.
  • Large accelerometer (PAXMIN) files are referenced but not ingested (size constraints).
  • Dietary second day (DR2TOT) and supplement merges absent.
  • Error handling currently prints/warns; structured logging not yet implemented.

API Surface (Stable Portions)

| Class / Function | Purpose |
|---|---|
| NHANESExplorer.download_data(cycle, component) | Multi-pattern XPT fetch with fallback attempts. |
| NHANESExplorer.get_demographics_data(cycle) | DEMO ingestion + label mapping. |
| NHANESExplorer.get_body_measures(cycle) | Weight, height, BMI, waist + BMI category. |
| NHANESExplorer.get_blood_pressure(cycle) | Individual readings + averaged values + category classification. |
| NHANESExplorer.create_merged_dataset(cycle) | Merge DEMO, BMX, BPX on participant_id. |
| NHANESExplorer.generate_summary_report(df) | Textual descriptive summary of merged dataset. |
| NHANESExplorer.get_detailed_component_manifest(...) | Metadata harvesting of component file listings. |
| Pesticide ingestion module (pesticide_ingestion.py) | Regex-based snippet extraction into Snippet dataclasses. |
| RAG pipeline (rag.pipeline.RAGPipeline) | Preparation and retrieval interface (experimental). |
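The "multi-pattern fetch with fallback attempts" idea behind download_data can be sketched generically: try each candidate URL in order and return the first one that succeeds. This is an illustration only; the helper name `fetch_first_available`, the injected `fetch` callable, and the placeholder URLs are all hypothetical and do not reflect the package's internal URL patterns.

```python
from typing import Callable, Iterable


def fetch_first_available(
    urls: Iterable[str], fetch: Callable[[str], bytes]
) -> tuple[str, bytes]:
    """Try each URL in order; return (url, payload) for the first success."""
    errors = []
    for url in urls:
        try:
            return url, fetch(url)
        except OSError as exc:  # urllib.error.URLError subclasses OSError
            errors.append(f"{url}: {exc}")
    raise RuntimeError("all candidate URLs failed:\n" + "\n".join(errors))


def fake_fetch(url: str) -> bytes:
    """Stand-in for a real downloader so the sketch runs offline."""
    if url.endswith("/b/COMPONENT.XPT"):
        return b"xpt-bytes"
    raise OSError("404 Not Found")


# Placeholder URLs; real NHANES paths vary by cycle and are not shown here.
candidates = [
    "https://example.invalid/a/COMPONENT.XPT",
    "https://example.invalid/b/COMPONENT.XPT",
]
url, payload = fetch_first_available(candidates, fake_fetch)
print(url, len(payload))
```

Passing the downloader in as a callable keeps the fallback logic testable without network access; a real fetcher could wrap `urllib.request.urlopen(url, timeout=30).read()`.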

Upgrade Notes (from 0.5.x)

  1. Regenerate your environment using the pinned requirements.txt if deploying the app (ensure the "." line, which installs the local package, is preserved).
  2. No breaking change to existing snippet JSONL or manifest formats.
  3. If you previously referenced “Explorer” for BRFSS only, note that the updated app now unifies NHANES + BRFSS views; no action is needed.

Usage (Minimal)

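from pophealth_observatory.observatory import NHANESExplorer

# Build the merged DEMO + BMX + BPX dataset for one survey cycle.
explorer = NHANESExplorer()
df = explorer.create_merged_dataset("2017-2018")

# Preview the first rows, then print the textual descriptive summary.
print(df.head())
print(explorer.generate_summary_report(df))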

Run the Streamlit app (from repository root):

streamlit run apps/streamlit_app.py

Data Ethics & Interpretation Boundary

Outputs are descriptive and exploratory:

  • Not adjusted for complex survey design beyond basic weight averaging.
  • Not intended for clinical decision making or protocol-grade regulatory submissions.
  • Serve as directional RWE scanning for internal prioritization or early hypothesis shaping.

Roadmap (Indicative)

Planned near-term additions (subject to reprioritization):

  • Getter methods for dietary (DR1TOT/DR2TOT) and smoking/alcohol questionnaires.
  • Basic lab panel ingestion (lipids, glucose, renal markers) with derived ratios (TC/HDL, ACR).
  • Optional metabolic syndrome flag computation.
  • Expanded manifest persistence including Questionnaire & Dietary components by default.
  • Lightweight Arrow Parquet exports for R/Python interchange (≤10 MB artifacts, cycle-specific).
  • Structured logging and retry analytics for failed downloads.

Citation / Reference

If you reference PopHealth Observatory internally, use: “PopHealth Observatory (v0.6.0): NHANES/BRFSS exploratory data ingestion and descriptive analytics toolkit.”

Support & Issues

Issue tracking: GitHub repository issue tracker. No SLA; responses and enhancements are best-effort.


End of v0.6.0 release notes.