PopHealth Observatory – Setup & Usage Guide

(Synced from root SETUP_GUIDE.md – do not edit this file directly. Update the root file and re-run sync.)

Human-facing setup instructions for using the Python toolkit. Internal agent / generation rules live in .github/copilot-instructions.md and are intentionally excluded here.

1. Prerequisites

Requirement	Notes
Python 3.10+	Tested on 3.11 / 3.12
Git	For cloning + version control
Quarto (required)	Used for scientific authoring workflows and manuscript rendering
SciClaw 0.2.8+	Minimum supported version for agentic scientific authoring workflows
Optional: virtual env	`python -m venv .venv` or `conda create -n pophealth python=3.11`
Optional: Streamlit	For the interactive app (`pip install streamlit`)

R is not required. A future optional R layer will use Apache Arrow for file interchange (no Python–R bridging needed).

Verify system dependencies:

quarto check
sciclaw --version

SciClaw should report version 0.2.8 or newer.
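
The version check above can also be scripted. The following is a minimal sketch, assuming the output of `sciclaw --version` contains a MAJOR.MINOR.PATCH number somewhere in its text; that output format is an assumption, and `parse_version` and `meets_minimum` are hypothetical helper names, not part of the toolkit.

```python
import re

# Minimum SciClaw version stated in the prerequisites table.
MIN_SCICLAW = (0, 2, 8)


def parse_version(text: str) -> tuple[int, int, int]:
    """Extract the first MAJOR.MINOR.PATCH triple found in CLI output."""
    match = re.search(r"(\d+)\.(\d+)\.(\d+)", text)
    if match is None:
        raise ValueError(f"no version number found in {text!r}")
    major, minor, patch = (int(part) for part in match.groups())
    return (major, minor, patch)


def meets_minimum(text: str, minimum: tuple[int, int, int] = MIN_SCICLAW) -> bool:
    """Compare numerically, so that 0.10.0 counts as newer than 0.2.8."""
    return parse_version(text) >= minimum
```

Comparing integer tuples rather than raw strings avoids the lexicographic trap where "0.10.0" sorts before "0.2.8". To check a real installation, the text could come from `subprocess.run(["sciclaw", "--version"], capture_output=True, text=True).stdout`.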

2. Install

git clone https://github.com/paulboys/PopHealth-Observatory.git
cd PopHealth-Observatory
python -m venv .venv
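./.venv/Scripts/Activate.ps1  # Windows PowerShell
# source .venv/bin/activate   # macOS / Linux (bash or zsh)
pip install -e ".[dev]"       # quotes keep zsh from globbing the brackets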

Minimal verification:

pytest -q
python -c "import pophealth_observatory as p; print(p.__version__)"

3. Core Workflow

from pophealth_observatory.observatory import NHANESExplorer

explorer = NHANESExplorer()

# Download & merge demographics, body measures, and blood pressure
# Note: create_merged_dataset() currently merges all three components by default
df = explorer.create_merged_dataset(cycle="2017-2018")

# Validate integrity against CDC metadata
report = explorer.validate(cycle="2017-2018", components=["demographics", "body_measures", "blood_pressure"])
print(f"Validation status: {report['status']}")

# Weighted mean (experimental survey helper - auto-detects survey weights)
result = explorer.calculate_weighted_mean(df, variable="body_mass_index")
print(f"Weighted BMI mean: {result['weighted_mean']:.2f}")

4. Key Concepts

Concept	Description	Output
Ingestion	Robust multi-URL NHANES file download	DataFrames
Harmonization	Column selection + semantic renaming + derived metrics	Standardized schema
Manifest	Structured inventory of component listing tables	JSON / DataFrame
Validation	Row count & source checks for integrity	Report object
Pesticide snippets	Regex-based analyte sentence windows	JSONL lines
RAG (experimental)	Embedding + similarity retrieval of snippet context	Ranked snippet dicts
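
To make the "Pesticide snippets" row concrete, here is a minimal sketch of regex-based sentence windows written out as JSONL. It is not the code in `pesticide_ingestion.py`: the analyte list, the `Snippet` fields, and the function names are all assumptions chosen for illustration.

```python
import json
import re
from dataclasses import asdict, dataclass
from typing import Iterable, Iterator

# Hypothetical analyte list; real patterns would come from the reference files.
ANALYTE_PATTERN = re.compile(r"\b(chlorpyrifos|atrazine)\b", re.IGNORECASE)
# Naive sentence splitter: break on whitespace that follows ., ! or ?
SENTENCE_SPLIT = re.compile(r"(?<=[.!?])\s+")


@dataclass
class Snippet:
    analyte: str
    sentence_index: int
    text: str


def extract_snippets(document: str, window: int = 1) -> Iterator[Snippet]:
    """Yield each analyte mention with `window` sentences of context per side."""
    sentences = SENTENCE_SPLIT.split(document.strip())
    for i, sentence in enumerate(sentences):
        for match in ANALYTE_PATTERN.finditer(sentence):
            start = max(0, i - window)
            end = min(len(sentences), i + window + 1)
            yield Snippet(match.group(1).lower(), i, " ".join(sentences[start:end]))


def to_jsonl(snippets: Iterable[Snippet]) -> str:
    """Serialize snippets as one JSON object per line."""
    return "\n".join(json.dumps(asdict(s)) for s in snippets)
```

Both patterns are compiled once at module level, and the extractor is a generator, so a large text can be processed without holding every snippet in memory.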

5. Data Outputs & Locations

Artifact	Path	Format
Manifest JSON	`manifests/`	`.json`
Pesticide snippets	`data/processed/pesticides/`	`.jsonl`
Reference tables	`data/reference/`	`.csv` / `.yml`
Raw pesticide text	`data/raw/pesticides/`	`.txt`

Parquet caching for large multi-cycle merges is planned; cached files will live under a future shared_data/ or data/processed/ subdirectory with date-stamped filenames.

6. Validation Strategy

Programmatic: validate() compares ingested data vs. CDC published metadata (rows, availability).
Analytical (in progress): reproducibility/ notebooks re-derive published aggregate stats to confirm end-to-end correctness.
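
The programmatic check can be pictured as a comparison of observed row counts against expected ones. The sketch below is only an illustration of that idea, not the implementation behind `validate()`: the function name, the per-component layout, and the status strings other than the top-level `status` key are assumptions, and the counts are placeholders rather than real NHANES figures.

```python
def validate_row_counts(observed: dict[str, int], expected: dict[str, int]) -> dict:
    """Compare observed row counts per component against expected counts."""
    components = {}
    for name, expected_rows in expected.items():
        actual = observed.get(name)
        if actual is None:
            state = "MISSING"  # component was never ingested
        elif actual == expected_rows:
            state = "PASS"
        else:
            state = "FAIL"
        components[name] = {
            "expected": expected_rows,
            "observed": actual,
            "status": state,
        }
    all_pass = all(c["status"] == "PASS" for c in components.values())
    return {"status": "PASS" if all_pass else "FAIL", "components": components}


# Placeholder counts for illustration only.
report = validate_row_counts(
    observed={"demographics": 100, "body_measures": 95},
    expected={"demographics": 100, "body_measures": 100},
)
print(f"Validation status: {report['status']}")  # FAIL: body_measures is short
```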

Edge cases handled: a failed download yields an empty DataFrame; mismatched participant IDs trigger a warning; missing expected columns raise ValueError.
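
Two of these edge cases can be sketched without pandas, using a list of dicts as a stand-in for a DataFrame. This is a hedged illustration of the behavior described, not the toolkit's code: `load_first_available`, `require_columns`, the injected `fetch` callable, and the `participant_id` column name are all hypothetical.

```python
from typing import Callable, Iterable

# A row is a plain dict; a list of rows stands in for a DataFrame here.
Row = dict[str, object]


def load_first_available(
    urls: Iterable[str], fetch: Callable[[str], list[Row]]
) -> list[Row]:
    """Try each candidate URL in order; return an empty table if all fail."""
    for url in urls:
        try:
            return fetch(url)
        except OSError:  # urllib.error.URLError is a subclass of OSError
            continue
    return []


def require_columns(rows: list[Row], expected: set[str]) -> None:
    """Raise ValueError when a non-empty table lacks expected columns."""
    if not rows:
        return  # an empty table is handled by the caller, not here
    missing = expected - set(rows[0])
    if missing:
        raise ValueError(f"missing expected columns: {sorted(missing)}")
```

Passing `fetch` in as a parameter keeps the fallback logic testable offline: a test can supply a fake that raises `OSError` for some URLs and returns rows for others.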

7. Survey Weights (Experimental Helpers)

Functions:

explorer.get_survey_weight(cycle: str, component: str) -> str
explorer.calculate_weighted_mean(data: pd.DataFrame, variable: str, weight_var: str = None, min_weight: float = 0) -> dict

The calculate_weighted_mean function auto-detects survey weights (exam_weight, interview_weight, or dietary_day1_weight) if not specified. Returns a dictionary with weighted_mean, unweighted_mean, n_obs, and sum_weights.
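
For intuition about the returned fields, the weighted mean is the sum of value × weight divided by the sum of weights. The sketch below reproduces the four dictionary keys with plain Python lists. It is an illustration, not the library's implementation: the function name is hypothetical, and dropping rows whose weight is not strictly greater than `min_weight` is an assumption about how that parameter behaves.

```python
from typing import Optional, Sequence


def weighted_mean_summary(
    values: Sequence[Optional[float]],
    weights: Sequence[Optional[float]],
    min_weight: float = 0.0,
) -> dict:
    """Summarize values with survey-style weights, skipping unusable rows."""
    # Keep rows with a value and a weight strictly above min_weight (assumed rule).
    pairs = [
        (v, w)
        for v, w in zip(values, weights)
        if v is not None and w is not None and w > min_weight
    ]
    if not pairs:
        raise ValueError("no observations with usable weights")
    sum_weights = sum(w for _, w in pairs)
    return {
        "weighted_mean": sum(v * w for v, w in pairs) / sum_weights,
        "unweighted_mean": sum(v for v, _ in pairs) / len(pairs),
        "n_obs": len(pairs),
        "sum_weights": sum_weights,
    }


summary = weighted_mean_summary([20.0, 30.0, 40.0], [1.0, 1.0, 2.0])
print(f"Weighted mean: {summary['weighted_mean']:.2f}")  # 32.50
```

Reporting the unweighted mean alongside the weighted one makes it easy to see how much the weights shift the estimate (30.0 versus 32.5 in this toy input).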

Currently covers standard 2-year weights. Planned: variance estimation & multi-cycle normalized weighting.

8. Development Tasks

# Lint & format
ruff check .
black --check .

# Tests & coverage
pytest -q
coverage run -m pytest && coverage report -m

# Build docs
mkdocs build

Install/update pre-commit hooks:

pre-commit install
pre-commit run --all-files

9. Contributing

Use the consolidated contributor guide in CONTRIBUTING.md for:

- Branching and conventional commits
- Required local quality checks (pre-commit, Ruff, Black, pytest)
- Documentation and changelog expectations
- PR review checklist and merge expectations

10. Planned R Layer (Future)

The R layer will read harmonized Parquet outputs via arrow, produce advanced survey or longitudinal analyses, and optionally write derived metrics back to shared Parquet files. There is no S4 implementation yet; ignore any historical Bioconductor references found in older commits.

11. FAQ

Question	Answer
Do I need R?	No. Usage today is pure Python.
Why JSONL for snippets?	Efficient line-wise streaming & indexing.
How do I regenerate a manifest?	Use `get_detailed_component_manifest` or `save_detailed_component_manifest`.
Can I add new analyte domains?	Follow the pattern in `pesticide_ingestion.py` (compile regex once, yield dataclass instances).
How do I trust weights?	Helpers are early-stage; cross-check with NHANES analytic guidance.
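
The JSONL answer above refers to line-wise streaming: each line is a complete JSON object, so a reader can process one record at a time. A minimal sketch follows, assuming only that each non-blank line is valid JSON; `iter_jsonl` is a hypothetical helper and the field names in the sample are made up for illustration.

```python
import io
import json
from typing import IO, Iterator


def iter_jsonl(handle: IO[str]) -> Iterator[dict]:
    """Yield one parsed record per non-blank line, without loading the whole file."""
    for line in handle:
        line = line.strip()
        if line:
            yield json.loads(line)


# In-memory stand-in for a .jsonl file; field names are illustrative only.
sample = io.StringIO('{"analyte": "atrazine", "text": "a"}\n\n{"analyte": "chlorpyrifos", "text": "b"}\n')
analytes = [record["analyte"] for record in iter_jsonl(sample)]
print(analytes)  # ['atrazine', 'chlorpyrifos']
```

With a file on disk, the same generator works on `open(path, encoding="utf-8")`, so filtering or indexing can stop early without parsing the remainder.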

12. Troubleshooting Quick Reference

Symptom	Likely Cause	Action
Empty DataFrame	All URL attempts failed	Re-run with network access; check the cycle string (e.g. 2017-2018)
Validation mismatch	Source metadata changed	Open issue; update scraper logic
No snippet matches	Patterns too strict	Inspect reference file; broaden regex tokens
Slow merge	Large multi-cycle join	Merge a subset of cycles; caching is planned

13. Ethical / Usage Notes

Not a clinical decision tool. Verify methodology when publishing. Cite NHANES appropriately.

14. Next Milestones

See ROADMAP.md for: harmonization registry, time trend utilities, caching backend, coverage gating.

15. License

MIT License – see the LICENSE file on GitHub or the in-site copy on the License page.

Last updated: 2025-11-03

Last sync (source mtime): 2026-04-22 12:23 UTC