PopHealth Observatory – Setup & Usage Guide

(Synced from root SETUP_GUIDE.md – do not edit this file directly. Update the root file and re-run sync.)

Human-facing setup instructions for using the Python toolkit. Internal agent / generation rules live in .github/copilot-instructions.md and are intentionally excluded here.


1. Prerequisites

| Requirement | Notes |
| --- | --- |
| Python 3.10+ | Tested on 3.11 / 3.12 |
| Git | For cloning + version control |
| Quarto (required) | Required for scientific authoring workflows and manuscript rendering |
| SciClaw 0.2.8+ | Minimum supported version for agentic scientific authoring workflows |
| Optional: virtual env | python -m venv .venv or conda create -n pophealth python=3.11 |
| Optional: Streamlit | For the interactive app (pip install streamlit) |

R is not required. A future optional R layer will use Apache Arrow for file interchange (no Python–R bridging needed).

Verify system dependencies:

quarto check
sciclaw --version

SciClaw should report version 0.2.8 or newer.
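
If you want to script the version check, the comparison can be sketched as follows. This is a minimal sketch that assumes the output of sciclaw --version contains a dotted numeric version somewhere in the text; the exact output format is not documented here, and parse_version and meets_minimum are hypothetical helper names, not part of SciClaw or this toolkit.

```python
import re


def parse_version(text: str) -> tuple[int, ...]:
    """Extract the first dotted numeric version (e.g. '0.2.8') from text."""
    match = re.search(r"\d+(?:\.\d+)+", text)
    if match is None:
        raise ValueError(f"no version number found in {text!r}")
    return tuple(int(part) for part in match.group().split("."))


def meets_minimum(text: str, minimum: tuple[int, ...] = (0, 2, 8)) -> bool:
    """Return True when the version found in text is at least `minimum`."""
    # Tuples compare element by element, so (0, 2, 10) >= (0, 2, 8) is True,
    # which a plain string comparison would get wrong.
    return parse_version(text) >= minimum


# Assumed usage: feed in whatever the CLI prints, for example via
# subprocess.run(["sciclaw", "--version"], capture_output=True, text=True).stdout
print(meets_minimum("sciclaw 0.2.8"))
```

Comparing integer tuples rather than raw strings avoids ranking 0.2.10 below 0.2.8.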


2. Install

git clone https://github.com/paulboys/PopHealth-Observatory.git
cd PopHealth-Observatory
python -m venv .venv
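./.venv/Scripts/Activate.ps1  # Windows PowerShell
# source .venv/bin/activate   # macOS / Linux alternative
pip install -e ".[dev]"       # quotes stop shells such as zsh from globbing the brackets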

Minimal verification:

pytest -q
python -c "import pophealth_observatory as p; print(p.__version__)"

3. Core Workflow

from pophealth_observatory.observatory import NHANESExplorer

explorer = NHANESExplorer()

# Download & merge demographics, body measures, and blood pressure
# Note: create_merged_dataset() currently merges all three components by default
df = explorer.create_merged_dataset(cycle="2017-2018")

# Validate integrity against CDC metadata
report = explorer.validate(cycle="2017-2018", components=["demographics", "body_measures", "blood_pressure"])
print(f"Validation status: {report['status']}")

# Weighted mean (experimental survey helper - auto-detects survey weights)
result = explorer.calculate_weighted_mean(df, variable="body_mass_index")
print(f"Weighted BMI mean: {result['weighted_mean']:.2f}")

4. Key Concepts

| Concept | Description | Output |
| --- | --- | --- |
| Ingestion | Robust multi-URL NHANES file download | DataFrames |
| Harmonization | Column selection + semantic renaming + derived metrics | Standardized schema |
| Manifest | Structured inventory of component listing tables | JSON / DataFrame |
| Validation | Row count & source checks for integrity | Report object |
| Pesticide snippets | Regex-based analyte sentence windows | JSONL lines |
| RAG (experimental) | Embedding + similarity retrieval of snippet context | Ranked snippet dicts |
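
To make the "pesticide snippets" concept concrete, here is a minimal sketch of regex-based sentence windows written out as JSONL. It is an illustration only, not the toolkit's implementation: the Snippet fields, the analyte list, and the helper names iter_snippets and to_jsonl are all assumptions made for this example.

```python
import json
import re
from dataclasses import asdict, dataclass
from typing import Iterable, Iterator


@dataclass
class Snippet:
    # Hypothetical record layout; the real schema may differ.
    analyte: str
    sentence_index: int
    text: str


# Compiled once at import time. The analyte names are placeholders;
# a real pattern would be built from a reference file.
ANALYTE_PATTERN = re.compile(r"\b(chlorpyrifos|glyphosate|atrazine)\b", re.IGNORECASE)
SENTENCE_SPLIT = re.compile(r"(?<=[.!?])\s+")


def iter_snippets(text: str, window: int = 1) -> Iterator[Snippet]:
    """Yield one Snippet per analyte mention, with `window` sentences of context on each side."""
    sentences = [s.strip() for s in SENTENCE_SPLIT.split(text) if s.strip()]
    for i, sentence in enumerate(sentences):
        for match in ANALYTE_PATTERN.finditer(sentence):
            start = max(0, i - window)
            end = min(len(sentences), i + window + 1)
            yield Snippet(match.group(1).lower(), i, " ".join(sentences[start:end]))


def to_jsonl(snippets: Iterable[Snippet]) -> str:
    """Serialize snippets as one JSON object per line."""
    return "\n".join(json.dumps(asdict(s)) for s in snippets)


sample = "Background text. Urinary glyphosate was measured. Results follow. Unrelated."
print(to_jsonl(iter_snippets(sample)))
```

The sentence splitter here is deliberately naive (it breaks on any whitespace after ., ! or ?), so abbreviations would need extra handling in practice.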

5. Data Outputs & Locations

| Artifact | Path | Format |
| --- | --- | --- |
| Manifest JSON | manifests/ | .json |
| Pesticide snippets | data/processed/pesticides/ | .jsonl |
| Reference tables | data/reference/ | .csv / .yml |
| Raw pesticide text | data/raw/pesticides/ | .txt |

Parquet caching for large multi-cycle merges is planned; cached files will be located under a future shared_data/ or data/processed/ subdirectory, with date-stamped filenames.


6. Validation Strategy

  1. Programmatic: validate() compares ingested data vs. CDC published metadata (rows, availability).
  2. Analytical (in progress): reproducibility/ notebooks re-derive published aggregate stats to confirm end-to-end correctness.

Edge cases handled: a failed download yields an empty DataFrame; a mismatch in participant IDs triggers a warning; missing expected columns raise a ValueError.
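
The programmatic check and its edge cases can be sketched in plain Python as shown below. This is a simplified illustration under stated assumptions, not the code behind validate(): tables are represented as lists of row dicts instead of DataFrames, and the function name, the participant_id column name, the report keys other than status, and the PASS / FAIL values are all hypothetical.

```python
import warnings


def validate_components(tables, expected_rows, id_column="participant_id"):
    """Compare ingested row counts with expected counts.

    tables: {component name: list of row dicts}
    expected_rows: {component name: row count taken from published metadata}
    """
    report = {"status": "PASS", "components": {}}
    id_sets = {}
    for name, rows in tables.items():
        # Missing expected column -> ValueError (empty tables are let through).
        if rows and id_column not in rows[0]:
            raise ValueError(f"{name}: missing expected column {id_column!r}")
        expected = expected_rows.get(name)
        matches = expected is not None and len(rows) == expected
        report["components"][name] = {
            "actual_rows": len(rows),
            "expected_rows": expected,
            "match": matches,
        }
        if not matches:
            report["status"] = "FAIL"
        id_sets[name] = {row[id_column] for row in rows}
    # Participant ID mismatch between components -> warning, not an error.
    if id_sets:
        common = set.intersection(*id_sets.values())
        if any(ids != common for ids in id_sets.values()):
            warnings.warn("participant IDs differ between components", stacklevel=2)
    return report


demo = {
    "demographics": [{"participant_id": 1}, {"participant_id": 2}],
    "body_measures": [{"participant_id": 1}, {"participant_id": 2}],
}
print(validate_components(demo, {"demographics": 2, "body_measures": 2})["status"])
```

Treating an ID mismatch as a warning rather than a failure mirrors the behaviour described above, where only missing columns are fatal.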


7. Survey Weights (Experimental Helpers)

Functions:

explorer.get_survey_weight(cycle: str, component: str) -> str
explorer.calculate_weighted_mean(data: pd.DataFrame, variable: str, weight_var: str = None, min_weight: float = 0) -> dict

The calculate_weighted_mean function auto-detects the survey weight column (exam_weight, interview_weight, or dietary_day1_weight) when weight_var is not specified. It returns a dictionary with the keys weighted_mean, unweighted_mean, n_obs, and sum_weights.

Currently covers standard 2-year weights. Planned: variance estimation & multi-cycle normalized weighting.
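
For readers who want to cross-check a result by hand, the arithmetic behind a weighted mean (sum of weight × value divided by sum of weights) can be sketched without the toolkit. This is an independent illustration, not the library's code: weighted_mean_summary is a hypothetical name, and the treatment of min_weight (dropping rows whose weight is less than or equal to it) is an assumption.

```python
import math


def weighted_mean_summary(values, weights, min_weight=0.0):
    """Return weighted/unweighted means over rows with a usable value and weight."""
    pairs = [
        (v, w)
        for v, w in zip(values, weights)
        if v is not None
        and w is not None
        and not math.isnan(v)
        and not math.isnan(w)
        and w > min_weight  # assumed rule: weights at or below the threshold are dropped
    ]
    if not pairs:
        raise ValueError("no observations with a usable value and weight")
    sum_weights = sum(w for _, w in pairs)
    return {
        "weighted_mean": sum(v * w for v, w in pairs) / sum_weights,
        "unweighted_mean": sum(v for v, _ in pairs) / len(pairs),
        "n_obs": len(pairs),
        "sum_weights": sum_weights,
    }


# Toy numbers, not NHANES data: (20*1 + 30*1 + 40*2) / 4 = 32.5
print(weighted_mean_summary([20.0, 30.0, 40.0], [1.0, 1.0, 2.0]))
```

The output keys match the ones documented above, so a result from this sketch can be compared field by field with the helper's output on the same rows.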


8. Development Tasks

# Lint & format
ruff check .
black --check .

# Tests & coverage
pytest -q
coverage run -m pytest && coverage report -m

# Build docs
mkdocs build

Install/update pre-commit hooks:

pre-commit install
pre-commit run --all-files

9. Contributing

Use the consolidated contributor guide in CONTRIBUTING.md for:

- Branching and conventional commits
- Required local quality checks (pre-commit, Ruff, Black, pytest)
- Documentation and changelog expectations
- PR review checklist and merge expectations


10. Planned R Layer (Future)

The R layer will read harmonized Parquet outputs via arrow, produce advanced survey or longitudinal analyses, and optionally write derived metrics back to shared Parquet files. There is no S4 implementation yet; ignore any historical Bioconductor references found in older commits.


11. FAQ

| Question | Answer |
| --- | --- |
| Do I need R? | No; pure Python usage today. |
| Why JSONL for snippets? | Efficient line-wise streaming & indexing. |
| How do I regenerate a manifest? | Use get_detailed_component_manifest or save_detailed_component_manifest. |
| Can I add new analyte domains? | Follow the pattern in pesticide_ingestion.py (compile regex once, yield dataclass instances). |
| How do I trust weights? | Helpers are early-stage; cross-check with NHANES analytic guidance. |
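
The line-wise streaming that motivates JSONL can be illustrated with the standard library alone. The sketch below is an assumption-laden example, not toolkit code: stream_jsonl is a hypothetical helper, and the analyte and text fields are placeholders for whatever the snippet records actually contain.

```python
import io
import json
from typing import Iterable, Iterator


def stream_jsonl(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one parsed record per non-blank line without loading the whole file."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


# An in-memory stand-in for an open .jsonl file; open(path) works the same way.
fake_file = io.StringIO(
    '{"analyte": "glyphosate", "text": "first"}\n'
    "\n"
    '{"analyte": "atrazine", "text": "second"}\n'
)
matches = [rec for rec in stream_jsonl(fake_file) if rec["analyte"] == "atrazine"]
print(matches)
```

Because each line is an independent JSON object, a reader can filter or index records one at a time and stop early, which is not possible with a single large JSON array.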

12. Troubleshooting Quick Reference

| Symptom | Likely Cause | Action |
| --- | --- | --- |
| Empty DataFrame | All URL attempts failed | Re-run with network available; inspect cycle code |
| Validation mismatch | Source metadata changed | Open issue; update scraper logic |
| No snippet matches | Patterns too strict | Inspect reference file; broaden regex tokens |
| Slow merge | Large multi-cycle join | Future caching; consider subset of cycles |
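
The "all URL attempts failed" symptom is easier to diagnose with a picture of how a multi-URL fallback behaves. The following is a generic sketch of that pattern, not the toolkit's downloader: fetch_first_available is a hypothetical name, the fetch callable is injected so the example runs offline, and the example.invalid URLs are placeholders.

```python
from typing import Callable, Iterable, Optional


def fetch_first_available(
    urls: Iterable[str], fetch: Callable[[str], bytes]
) -> tuple[Optional[bytes], list[tuple[str, str]]]:
    """Try each URL in order; return the first payload plus a log of failed attempts."""
    failures: list[tuple[str, str]] = []
    for url in urls:
        try:
            return fetch(url), failures
        except OSError as exc:  # urllib.error.URLError is a subclass of OSError
            failures.append((url, str(exc)))
    # Every candidate failed: the caller decides how to represent "no data".
    return None, failures


def fake_fetch(url: str) -> bytes:
    """Offline stand-in for a real downloader; the primary URL always fails."""
    if "primary" in url:
        raise OSError("unreachable")
    return b"payload"


payload, failures = fetch_first_available(
    ["https://example.invalid/primary.xpt", "https://example.invalid/mirror.xpt"],
    fake_fetch,
)
print(payload, failures)
```

With a real network, fetch could be something like lambda url: urllib.request.urlopen(url, timeout=30).read(). Keeping the list of failed attempts makes it clear whether the problem was connectivity or a wrong cycle code in the constructed URLs.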

13. Ethical / Usage Notes

Not a clinical decision tool. Verify methodology when publishing. Cite NHANES appropriately.


14. Next Milestones

See ROADMAP.md for: harmonization registry, time trend utilities, caching backend, coverage gating.


15. License

MIT License – see the LICENSE file on GitHub or the in-site copy on the License page.


Last updated: 2025-11-03


Last sync (source mtime): 2026-04-22 12:23 UTC