PopHealth Observatory v0.6.0 Release Notes
Date: 2025-11-07
Package: pophealth_observatory
Version: 0.6.0
Python compatibility: 3.10–3.13 (tested), requires-python will be >=3.10 going forward.
Overview
PopHealth Observatory provides a reproducible, survey-cycle–aware toolkit for:
- NHANES component ingestion with resilient multi-URL fallbacks
- Harmonization of participant identifiers and core measurement variables
- Derived simple health metrics (BMI category, averaged blood pressure)
- Structured metadata manifest generation for NHANES component pages
- Pesticide analyte text snippet extraction (JSONL artifacts) for retrieval and experimental RAG
- A minimal Streamlit application for exploratory inspection of selected NHANES and BRFSS indicators
This version continues to emphasize transparent data access and lightweight population-level pattern scanning. It is not designed for formal inferential statistics, regulatory submissions, or litigation support.
Added (0.6.0)
- Streamlit exploratory app:
  - Cross-sectional tab with demographic filtering (age range, gender, race/ethnicity).
  - Multi-cycle trend tab with parallel loading and basic 95% CI approximation (mean ± 1.96 * SD / sqrt(n)).
  - Bivariate scatter (optional OLS trendline) for simple associations.
  - Geographic tab (BRFSS state-level choropleth, animated or single-year).
  - Optional exam weight application in summaries (simplified weighted mean/variance; does not account for strata/PSU or complex survey design; see Experimental section).
- Robust import fallback for application environments where editable install fails (`sys.path` injection + warning).
- Dynamic package version display in the app sidebar (`v{__version__}`).
- Cached data layers separated by latency cost:
  - Raw multi-component NHANES merged slice
  - Filtered demographic subset
  - Aggregated summaries
  - Visualization helpers (uncached, low-latency)
- Enhanced BRFSS indicator normalization (state code, state name, year coercion, numeric value parsing).
- Pesticide snippet pipeline retained (no breaking changes) with JSONL output format stability.
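
The trend tab's interval, mean ± 1.96 * SD / sqrt(n), can be sketched in a few lines. This is a minimal illustration, not the app's own code: the function name `approx_ci95` is hypothetical, and using the sample standard deviation is an assumption, since these notes do not say which SD the app computes.

```python
import math
import statistics


def approx_ci95(values):
    """Return (mean, lower, upper) using mean ± 1.96 * SD / sqrt(n).

    Hypothetical helper for illustration; ignores survey weights and design.
    """
    # Drop missing values (None or NaN) before computing anything.
    data = [v for v in values if v is not None and not math.isnan(v)]
    n = len(data)
    if n < 2:
        raise ValueError("need at least two non-missing values")
    mean = statistics.fmean(data)
    sd = statistics.stdev(data)  # sample SD (n - 1 denominator); an assumption
    half_width = 1.96 * sd / math.sqrt(n)
    return mean, mean - half_width, mean + half_width
```

Because this treats observations as a simple random sample, the interval is only a rough guide for NHANES data and will typically be narrower than a design-based interval.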
Changed
- Title and sidebar explanatory scope refined to explicitly note non-inferential intent.
- Streamlit layout standardized to wide mode; consistent use of neutral white Plotly theme.
- Box/violin plot captions now clarify what is shown, with simpler phrasing for end-user interpretation.
- Manifest generation still requires manual invocation, but the internal parsing helpers (component table extraction) have stabilized.
Fixed
- Import fallbacks prevent `ModuleNotFoundError` when the `.` line in `requirements.txt` is omitted or editable install constraints are present.
- Streamlit API parameter harmonization for broader version compatibility (`use_column_width` usage).
- Dependency pinning (e.g., SciPy/Statsmodels compatibility) to avoid runtime scientific stack import errors in hosted environments.
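
The import fallback (a `sys.path` injection plus a warning) follows a common pattern, sketched below. The helper name `import_with_fallback` and its arguments are hypothetical; the app's actual fallback code may be structured differently.

```python
import importlib
import sys
import warnings
from pathlib import Path


def import_with_fallback(module_name, fallback_dir):
    """Import module_name; if it is not installed, retry from fallback_dir.

    Hypothetical helper: fallback_dir is the directory that contains the
    package source (for example, a repository checkout).
    """
    try:
        return importlib.import_module(module_name)
    except ModuleNotFoundError:
        path = str(Path(fallback_dir).resolve())
        if path not in sys.path:
            sys.path.insert(0, path)  # make the source tree importable
        warnings.warn(
            f"{module_name} is not installed; importing from {path} instead",
            RuntimeWarning,
            stacklevel=2,
        )
        importlib.invalidate_caches()
        return importlib.import_module(module_name)
```

The warning keeps the fallback visible, so a missing install does not go unnoticed in hosted environments.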
Data Coverage (Current)
| Domain | Status | Notes |
|---|---|---|
| Demographics (DEMO) | Ingested & merged | Age, gender, race/ethnicity harmonized labels. |
| Body Measures (BMX) | Ingested & merged | BMI, height, weight, waist + BMI category. |
| Blood Pressure (BPX) | Ingested & merged | Averaged systolic/diastolic + categorical flag. |
| BRFSS Indicators | App-level fetch (API, limited subset) | Normalized prevalence fields; year filtering. |
| Pesticide Text Snippets | Extracted (JSONL) | CAS RN + analyte context sentences. |
| Laboratory Panels | Manifest only | No structured ingestion/derivation in this release. |
| Dietary Intake | Planned (mapping stub) | Not yet included in merged dataset. |
| Questionnaire (Medical History, Smoking, Alcohol, PAQ) | Mapped identifiers only | Getter methods not implemented yet. |
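
The two derived metrics in the table (BMI category and averaged blood pressure) can be illustrated with a small sketch. The function names and category labels are hypothetical, and the cutoffs (18.5, 25, 30) are the commonly used adult BMI thresholds, assumed here because these notes do not list the package's own cutoffs.

```python
import math


def bmi_category(bmi):
    """Map a BMI value (kg/m^2) to a label; None for missing input."""
    if bmi is None or math.isnan(bmi):
        return None
    if bmi < 18.5:
        return "Underweight"
    if bmi < 25:
        return "Normal"
    if bmi < 30:
        return "Overweight"
    return "Obese"


def average_readings(readings):
    """Average the non-missing readings; None when all are missing."""
    valid = [r for r in readings if r is not None and not math.isnan(r)]
    return sum(valid) / len(valid) if valid else None
```

For example, `average_readings([120, None, 124])` averages only the two recorded systolic readings.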
Experimental
- Retrieval-Augmented Generation (RAG) scaffold (pesticide context embeddings, index build) remains experimental; API subject to change.
- Weighted summaries use simple weighted mean/variance without complex survey design variance estimation (no strata/PSU handling).
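
A simple weighted mean and variance of the kind described above might look like the following. This is a sketch under stated assumptions: the function name `weighted_mean_var` is hypothetical, and normalizing the variance by the sum of weights is one common choice, not necessarily the one the package uses.

```python
def weighted_mean_var(values, weights):
    """Return (weighted mean, weighted variance) for paired values/weights.

    Hypothetical helper: ignores strata and PSU, so the variance is not a
    design-based estimate.
    """
    # Keep only pairs with a recorded value and a positive weight.
    pairs = [
        (x, w)
        for x, w in zip(values, weights)
        if x is not None and w is not None and w > 0
    ]
    total = sum(w for _, w in pairs)
    if total == 0:
        raise ValueError("no observations with positive weight")
    mean = sum(x * w for x, w in pairs) / total
    var = sum(w * (x - mean) ** 2 for x, w in pairs) / total
    return mean, var
```

This describes the weighted spread of the observed values; it is not the sampling variance of the mean, which is why design-based inference remains out of scope.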
Known Limitations
- No advanced survey design inference (no Taylor series linearization, replicate weights).
- No laboratory normalization or unit harmonization beyond BMX/BPX scope.
- No longitudinal reconstruction (NHANES is cross-sectional; panel logic not attempted).
- No dependency on external databases; all network access is direct HTTPS/FTP to CDC endpoints.
- Large accelerometer (PAXMIN) files are referenced but not ingested (size constraints).
- Dietary second day (DR2TOT) and supplement merges absent.
- Error handling currently prints/warns; structured logging not yet implemented.
API Surface (Stable Portions)
| Class / Function | Purpose |
|---|---|
| `NHANESExplorer.download_data(cycle, component)` | Multi-pattern XPT fetch with fallback attempts. |
| `NHANESExplorer.get_demographics_data(cycle)` | DEMO ingestion + label mapping. |
| `NHANESExplorer.get_body_measures(cycle)` | Weight, height, BMI, waist + BMI category. |
| `NHANESExplorer.get_blood_pressure(cycle)` | Individual readings + averaged values + category classification. |
| `NHANESExplorer.create_merged_dataset(cycle)` | Merge DEMO, BMX, BPX on participant_id. |
| `NHANESExplorer.generate_summary_report(df)` | Textual descriptive summary of merged dataset. |
| `NHANESExplorer.get_detailed_component_manifest(...)` | Metadata harvesting of component file listings. |
| Pesticide ingestion module (`pesticide_ingestion.py`) | Regex-based snippet extraction into Snippet dataclasses. |
| RAG pipeline (`rag.pipeline.RAGPipeline`) | Preparation and retrieval interface (experimental). |
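
The "fetch with fallback attempts" behavior can be pictured as trying candidate URLs in order until one succeeds. The sketch below is illustrative only: `fetch_first_available` and `_http_get` are hypothetical names, and the URL patterns that `download_data` actually tries are not listed in these notes.

```python
from urllib.error import URLError
from urllib.request import urlopen


def _http_get(url, timeout=30):
    """Default fetcher: return the response body as bytes."""
    with urlopen(url, timeout=timeout) as resp:
        return resp.read()


def fetch_first_available(candidate_urls, fetch=_http_get):
    """Try each URL in order; return (url, payload) for the first success."""
    errors = []
    for url in candidate_urls:
        try:
            return url, fetch(url)
        except (URLError, OSError, ValueError) as exc:
            # Record the failure and move on to the next candidate.
            errors.append(f"{url}: {exc}")
    raise RuntimeError("all candidate URLs failed:\n" + "\n".join(errors))
```

Passing the fetcher as a parameter keeps the fallback logic testable without network access.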
Upgrade Notes (from 0.5.x)
- Regenerate your environment using the pinned `requirements.txt` if deploying the app (ensure the `.` line is preserved).
- No breaking change to existing snippet JSONL or manifest formats.
- If you previously referenced “Explorer” for BRFSS only, the updated app now unifies NHANES + BRFSS views; no action needed.
Usage (Minimal)
```python
from pophealth_observatory.observatory import NHANESExplorer

explorer = NHANESExplorer()
# Merge DEMO, BMX, and BPX for a single survey cycle.
df = explorer.create_merged_dataset("2017-2018")
print(df.head())
print(explorer.generate_summary_report(df))
```
Run the Streamlit app (from repository root):
```shell
streamlit run apps/streamlit_app.py
```
Data Ethics & Interpretation Boundary
Outputs are descriptive and exploratory:
- Not adjusted for complex survey design beyond basic weight averaging.
- Not intended for clinical decision making or protocol-grade regulatory submissions.
- Serve as directional RWE scanning for internal prioritization or early hypothesis shaping.
Roadmap (Indicative)
Planned near-term additions (subject to reprioritization):
- Getter methods for dietary (DR1TOT/DR2TOT) and smoking/alcohol questionnaires.
- Basic lab panel ingestion (lipids, glucose, renal markers) with derived ratios (TC/HDL, ACR).
- Optional metabolic syndrome flag computation.
- Expanded manifest persistence including Questionnaire & Dietary components by default.
- Lightweight Arrow Parquet exports for R/Python interchange (≤10 MB artifacts, cycle-specific).
- Structured logging and retry analytics for failed downloads.
Citation / Reference
If you reference PopHealth Observatory internally, use: “PopHealth Observatory (v0.6.0): NHANES/BRFSS exploratory data ingestion and descriptive analytics toolkit.”
Support & Issues
Issue tracking: GitHub repository issue tracker. No SLA; responses and enhancements are best-effort.
End of v0.6.0 release notes.