PopHealth Observatory – Setup & Usage Guide
(Synced from root SETUP_GUIDE.md – do not edit this file directly. Update the root file and re-run sync.)
Human-facing setup instructions for using the Python toolkit. Internal agent / generation rules live in .github/copilot-instructions.md and are intentionally excluded here.
1. Prerequisites
| Requirement | Notes |
|---|---|
| Python 3.10+ | Tested on 3.11 / 3.12 |
| Git | For cloning + version control |
| Quarto | Required for scientific authoring workflows and manuscript rendering |
| SciClaw 0.2.8+ | Minimum supported version for agentic scientific authoring workflows |
| Optional: virtual env | python -m venv .venv or conda create -n pophealth python=3.11 |
| Optional: Streamlit | For the interactive app (pip install streamlit) |
R is not required. A future optional R layer will use Apache Arrow for file interchange (no Python–R bridging needed).
Verify system dependencies:
quarto check
sciclaw --version
SciClaw should report version 0.2.8 or newer.
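The same checks can be scripted. The sketch below is illustrative and not part of the toolkit: `meets_minimum` and `check_prerequisites` are hypothetical helper names, it assumes the executables are called `git`, `quarto`, and `sciclaw` on your PATH (as in the commands above), and it assumes version strings are purely numeric and dot-separated, such as 0.2.8.

```python
import shutil
import sys


def meets_minimum(version: str, minimum: str) -> bool:
    """Compare dotted numeric versions as integer tuples, so 0.2.10 > 0.2.8."""
    def as_tuple(text: str) -> tuple[int, ...]:
        return tuple(int(part) for part in text.strip().split("."))
    return as_tuple(version) >= as_tuple(minimum)


def check_prerequisites(tools=("git", "quarto", "sciclaw")) -> dict[str, bool]:
    """Report whether Python is 3.10+ and whether each tool is found on PATH."""
    status = {"python>=3.10": sys.version_info[:2] >= (3, 10)}
    for tool in tools:
        status[tool] = shutil.which(tool) is not None
    return status


if __name__ == "__main__":
    for name, ok in check_prerequisites().items():
        print(f"{name}: {'ok' if ok else 'MISSING'}")
```

Comparing integer tuples rather than raw strings matters here: as strings, "0.2.10" sorts before "0.2.8", which would wrongly reject a newer SciClaw release.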
2. Install
git clone https://github.com/paulboys/PopHealth-Observatory.git
cd PopHealth-Observatory
python -m venv .venv
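./.venv/Scripts/Activate.ps1 # Windows PowerShell
# macOS / Linux: source .venv/bin/activate
pip install -e ".[dev]" # quoted so shells such as zsh do not expand the brackets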
Minimal verification:
pytest -q
python -c "import pophealth_observatory as p; print(p.__version__)"
3. Core Workflow
from pophealth_observatory.observatory import NHANESExplorer
explorer = NHANESExplorer()
# Download & merge demographics, body measures, and blood pressure
# Note: create_merged_dataset() currently merges all three components by default
df = explorer.create_merged_dataset(cycle="2017-2018")
# Validate integrity against CDC metadata
report = explorer.validate(cycle="2017-2018", components=["demographics", "body_measures", "blood_pressure"])
print(f"Validation status: {report['status']}")
# Weighted mean (experimental survey helper - auto-detects survey weights)
result = explorer.calculate_weighted_mean(df, variable="body_mass_index")
print(f"Weighted BMI mean: {result['weighted_mean']:.2f}")
4. Key Concepts
| Concept | Description | Output |
|---|---|---|
| Ingestion | Robust multi-URL NHANES file download | DataFrames |
| Harmonization | Column selection + semantic renaming + derived metrics | Standardized schema |
| Manifest | Structured inventory of component listing tables | JSON / DataFrame |
| Validation | Row count & source checks for integrity | Report object |
| Pesticide snippets | Regex-based analyte sentence windows | JSONL lines |
| RAG (experimental) | Embedding + similarity retrieval of snippet context | Ranked snippet dicts |
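To make the RAG row concrete, here is a minimal sketch of "embed, then rank by similarity". It is not the toolkit's implementation: `embed`, `cosine`, and `retrieve` are hypothetical names, the bag-of-words `embed` is a toy stand-in for a real embedding model, and the snippet sentences are invented placeholders rather than NHANES text.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy embedding: lowercase token counts (a stand-in for a real model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query: str, snippets: list[dict], top_k: int = 2) -> list[dict]:
    """Return the top_k snippets, each with a score, ranked by similarity."""
    q = embed(query)
    scored = [{**s, "score": cosine(q, embed(s["text"]))} for s in snippets]
    return sorted(scored, key=lambda s: s["score"], reverse=True)[:top_k]


# Invented placeholder snippets, for illustration only
snippets = [
    {"analyte": "chlorpyrifos", "text": "Chlorpyrifos metabolites were measured in urine."},
    {"analyte": "glyphosate", "text": "Glyphosate was detected in a subset of samples."},
    {"analyte": "permethrin", "text": "Permethrin exposure was assessed by questionnaire."},
]
ranked = retrieve("urine metabolites of chlorpyrifos", snippets)
print(ranked[0]["analyte"])  # chlorpyrifos
```

Swapping `embed` for a dense embedding model would leave `retrieve` unchanged apart from the similarity function's input type.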
5. Data Outputs & Locations
| Artifact | Path | Format |
|---|---|---|
| Manifest JSON | manifests/ | .json |
| Pesticide snippets | data/processed/pesticides/ | .jsonl |
| Reference tables | data/reference/ | .csv / .yml |
| Raw pesticide text | data/raw/pesticides/ | .txt |
Parquet caching for large multi-cycle merges is planned; cached files will be located under a future shared_data/ or data/processed/ subdirectory, with date-stamped filenames.
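Since the caching layer does not exist yet, the following is only a sketch of what a date-stamped filename scheme could look like. The function name `cache_path`, the `merged_` prefix, and the `YYYYMMDD` stamp format are all assumptions, not the project's decided convention.

```python
from datetime import date
from pathlib import Path


def cache_path(cycles, base_dir="data/processed", stamp=None) -> Path:
    """Build a date-stamped Parquet filename for a set of cycles (hypothetical scheme)."""
    stamp = stamp or date.today()
    label = "_".join(sorted(cycles))  # sorted so cycle order does not change the name
    return Path(base_dir) / f"merged_{label}_{stamp:%Y%m%d}.parquet"


example = cache_path(["2017-2018", "2015-2016"], stamp=date(2025, 1, 31))
print(example.as_posix())  # data/processed/merged_2015-2016_2017-2018_20250131.parquet
```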
6. Validation Strategy
- Programmatic: validate() compares ingested data vs. CDC published metadata (rows, availability).
- Analytical (in progress): reproducibility/ notebooks re-derive published aggregate stats to confirm end-to-end correctness.
Edge cases handled: a failed download yields an empty DataFrame; a mismatch in participant IDs triggers a warning; missing expected columns raise ValueError.
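The programmatic check and its edge cases can be pictured with a small stand-alone sketch. This is not the body of validate(): `validate_component` is a hypothetical function, rows are plain dicts instead of a DataFrame, the `participant_id` column name is an assumption, and the `PASS` / `WARN` / `FAIL` status values are invented for illustration.

```python
import warnings


def validate_component(rows, expected_rows, required_columns,
                       reference_ids=None, id_column="participant_id"):
    """Compare ingested rows with expected metadata and return a small report."""
    if not rows:  # e.g. every download attempt failed
        return {"status": "FAIL", "n_rows": 0, "expected_rows": expected_rows}

    # Missing expected columns are a hard error
    missing = [c for c in [id_column, *required_columns] if c not in rows[0]]
    if missing:
        raise ValueError(f"Missing expected columns: {missing}")

    # A participant-ID mismatch against another component is only a warning
    ids = {row[id_column] for row in rows}
    if reference_ids is not None and ids != set(reference_ids):
        warnings.warn("Participant IDs differ from the reference component", stacklevel=2)

    status = "PASS" if len(rows) == expected_rows else "WARN"
    return {"status": status, "n_rows": len(rows), "expected_rows": expected_rows}


rows = [{"participant_id": 1, "bmi": 22.0}, {"participant_id": 2, "bmi": 27.5}]
print(validate_component(rows, expected_rows=2, required_columns=["bmi"])["status"])  # PASS
```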
7. Survey Weights (Experimental Helpers)
Functions:
explorer.get_survey_weight(cycle: str, component: str) -> str
explorer.calculate_weighted_mean(data: pd.DataFrame, variable: str, weight_var: str = None, min_weight: float = 0) -> dict
If weight_var is not specified, calculate_weighted_mean auto-detects the survey weight column (exam_weight, interview_weight, or dietary_day1_weight). It returns a dictionary with weighted_mean, unweighted_mean, n_obs, and sum_weights.
Currently covers standard 2-year weights. Planned: variance estimation & multi-cycle normalized weighting.
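For reference, the weighted mean is the sum of weight times value divided by the sum of weights. The sketch below reproduces the documented return keys on plain dict rows; it is not the library's code. `weighted_mean` is a hypothetical name, and two behaviors are assumptions: the detection order follows the order the weight columns are listed above, and rows are kept only when their weight is strictly greater than `min_weight`.

```python
WEIGHT_CANDIDATES = ("exam_weight", "interview_weight", "dietary_day1_weight")


def weighted_mean(rows, variable, weight_var=None, min_weight=0.0):
    """Weighted mean over a list of dict rows, returning the documented keys."""
    if weight_var is None:
        # Auto-detect: first candidate column present in the data (assumed order)
        weight_var = next((w for w in WEIGHT_CANDIDATES if rows and w in rows[0]), None)
        if weight_var is None:
            raise ValueError("No survey weight column found")

    # Keep rows that have a value and a weight above the threshold (assumed semantics)
    kept = [
        (row[variable], row[weight_var])
        for row in rows
        if row.get(variable) is not None
        and row.get(weight_var) is not None
        and row[weight_var] > min_weight
    ]
    if not kept:
        raise ValueError("No usable observations")

    sum_weights = sum(w for _, w in kept)
    return {
        "weighted_mean": sum(x * w for x, w in kept) / sum_weights,
        "unweighted_mean": sum(x for x, _ in kept) / len(kept),
        "n_obs": len(kept),
        "sum_weights": sum_weights,
    }


rows = [
    {"body_mass_index": 20.0, "exam_weight": 1.0},
    {"body_mass_index": 30.0, "exam_weight": 3.0},
    {"body_mass_index": 40.0, "exam_weight": 0.0},   # dropped: weight not above min_weight
    {"body_mass_index": None, "exam_weight": 5.0},   # dropped: missing value
]
result = weighted_mean(rows, "body_mass_index")
print(result["weighted_mean"], result["unweighted_mean"])  # 27.5 25.0
```

The gap between 27.5 and 25.0 in this toy input shows why the helper reports both means: ignoring weights shifts the estimate toward under-weighted respondents.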
8. Development Tasks
# Lint & format
ruff check .
black --check .
# Tests & coverage
pytest -q
coverage run -m pytest && coverage report -m
# Build docs
mkdocs build
Install/update pre-commit hooks:
pre-commit install
pre-commit run --all-files
9. Contributing
Use the consolidated contributor guide in CONTRIBUTING.md for:
- Branching and conventional commits
- Required local quality checks (pre-commit, Ruff, Black, pytest)
- Documentation and changelog expectations
- PR review checklist and merge expectations
10. Planned R Layer (Future)
The R layer will read harmonized Parquet outputs via arrow, produce advanced survey or longitudinal analyses, and optionally write derived metrics back to shared Parquet files. There is no S4 implementation yet; ignore any historical Bioconductor references found in older commits.
11. FAQ
| Question | Answer |
|---|---|
| Do I need R? | No. The toolkit is pure Python today. |
| Why JSONL for snippets? | Efficient line-wise streaming & indexing. |
| How do I regenerate a manifest? | Use get_detailed_component_manifest or save_detailed_component_manifest. |
| Can I add new analyte domains? | Follow the pattern in pesticide_ingestion.py (compile regex once, yield dataclass instances). |
| How do I trust weights? | Helpers are early-stage; cross-check with NHANES analytic guidance. |
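The "compile regex once, yield dataclass instances" pattern, together with line-wise JSONL output, can be sketched as follows. This is an illustration of the pattern only and the real pesticide_ingestion.py may differ: the `Snippet` fields, `iter_snippets`, `to_jsonl`, the two-analyte regex, and the naive sentence splitter are all assumptions, and the sample text is invented.

```python
import json
import re
from dataclasses import asdict, dataclass
from typing import Iterator

# Compiled once at import time; the analyte list is illustrative only
ANALYTE_RE = re.compile(r"\b(chlorpyrifos|glyphosate)\b", re.IGNORECASE)
SENTENCE_SPLIT_RE = re.compile(r"(?<=[.!?])\s+")  # naive sentence boundary


@dataclass
class Snippet:
    analyte: str
    sentence_index: int
    text: str


def iter_snippets(document: str, window: int = 1) -> Iterator[Snippet]:
    """Yield each matching sentence with `window` sentences of context per side."""
    sentences = SENTENCE_SPLIT_RE.split(document.strip())
    for i, sentence in enumerate(sentences):
        for match in ANALYTE_RE.finditer(sentence):
            start, stop = max(0, i - window), min(len(sentences), i + window + 1)
            yield Snippet(match.group(1).lower(), i, " ".join(sentences[start:stop]))


def to_jsonl(snippets) -> str:
    """Serialize snippets as one JSON object per line."""
    return "\n".join(json.dumps(asdict(s)) for s in snippets)


text = "Samples were collected in spring. Glyphosate was measured in urine. Results follow."
for line in to_jsonl(iter_snippets(text)).splitlines():
    print(json.loads(line)["analyte"])  # glyphosate
```

Because each line is an independent JSON object, a consumer can process the file one line at a time without loading it whole, which is the streaming benefit mentioned in the FAQ.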
12. Troubleshooting Quick Reference
| Symptom | Likely Cause | Action |
|---|---|---|
| Empty DataFrame | All URL attempts failed | Re-run with network available; inspect cycle code |
| Validation mismatch | Source metadata changed | Open issue; update scraper logic |
| No snippet matches | Patterns too strict | Inspect reference file; broaden regex tokens |
| Slow merge | Large multi-cycle join | Caching is planned; meanwhile, merge a subset of cycles |
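The "Empty DataFrame" row follows from how a multi-URL download falls back. A minimal sketch of that idea, under assumptions: `first_successful` and `http_fetch` are hypothetical names, not the toolkit's API, and the caller is assumed to turn a `None` result into an empty DataFrame.

```python
import urllib.request
from typing import Callable, Iterable, Optional


def http_fetch(url: str, timeout: float = 30.0) -> bytes:
    """Download one URL with the standard library."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read()


def first_successful(urls: Iterable[str],
                     fetch: Callable[[str], bytes] = http_fetch) -> Optional[bytes]:
    """Try each candidate URL in order; return the first payload, or None if all fail."""
    errors = []
    for url in urls:
        try:
            return fetch(url)
        except OSError as exc:  # urllib's URLError and HTTPError subclass OSError
            errors.append(f"{url}: {exc}")
    for message in errors:
        print(f"download failed -> {message}")
    return None
```

Passing the fetcher as a parameter keeps the fallback logic testable offline, and the printed failures show which candidate URLs were tried when diagnosing an empty result.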
13. Ethical / Usage Notes
Not a clinical decision tool. Verify methodology when publishing. Cite NHANES appropriately.
14. Next Milestones
See ROADMAP.md for: harmonization registry, time trend utilities, caching backend, coverage gating.
15. License
MIT License – see the LICENSE file on GitHub or the in-site copy on the License page.
Last updated: 2025-11-03
Last sync (source mtime): 2026-04-22 12:23 UTC