Feature Status
Comprehensive breakdown of implemented vs planned capabilities.
✅ Implemented Features
Data Acquisition
- Multi-URL NHANES Download: Resilient XPT file retrieval with automatic fallback patterns across CDC hosting changes
- In-Memory Caching: Session-level cache for downloaded component data (avoids redundant network calls)
- Cycle/Component Mapping: Letter suffix resolution for NHANES cycles (1999-2000 → A, 2017-2018 → J, etc.)
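
The cycle-to-letter mapping above can be sketched as follows. This is a minimal illustration, not the package's own code: the function name `cycle_suffix` is hypothetical, and it assumes regular two-year cycles starting in 1999. Irregular releases (for example combined or pre-pandemic files) would need explicit handling in a real implementation.

```python
def cycle_suffix(start_year: int) -> str:
    """Return the letter for the two-year NHANES cycle starting in start_year.

    Hypothetical helper: assumes cycles begin in odd years from 1999 onward,
    so 1999-2000 maps to "A", 2001-2002 to "B", and so on.
    """
    offset = start_year - 1999
    if offset < 0 or offset % 2 != 0:
        raise ValueError(f"{start_year} does not start a two-year cycle")
    index = offset // 2
    if index > 25:
        raise ValueError("ran out of single-letter suffixes")
    return chr(ord("A") + index)


if __name__ == "__main__":
    for year in (1999, 2017):
        print(f"{year}-{year + 1}: {cycle_suffix(year)}")
```

Raising on even start years keeps a mistyped cycle from silently resolving to the wrong file.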
Data Harmonization & Derivation
- Demographics (DEMO): Download, column selection, semantic renaming, gender/race labels
- Body Measures (BMX): Weight, height, BMI with categorical bins (Underweight/Normal/Overweight/Obese)
- Blood Pressure (BPX): Multi-reading averages + hypertension staging (Normal/Elevated/Stage 1/Stage 2)
- Merged Datasets: Participant-level merge across DEMO, BMX, BPX via participant_id
- Pesticide Laboratory (UPHOPM / OPD / PP): Multi-series file discovery, analyte harmonization (parent pesticide, metabolite class, matrix, unit), derived log_concentration + detected_flag
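
The derived BMI and blood pressure categories listed above can be illustrated with a short sketch. The thresholds are assumptions, since this page does not state the cutoffs the tool uses: the BMI bins follow the common adult cutoffs (18.5, 25, 30), and the staging follows the 2017 ACC/AHA categories. All function names are hypothetical.

```python
from typing import Optional, Sequence


def bmi_category(bmi: float) -> str:
    """Bin a BMI value using assumed adult cutoffs (18.5 / 25 / 30)."""
    if bmi < 18.5:
        return "Underweight"
    if bmi < 25.0:
        return "Normal"
    if bmi < 30.0:
        return "Overweight"
    return "Obese"


def mean_reading(readings: Sequence[Optional[float]]) -> Optional[float]:
    """Average the non-missing readings; return None if all are missing."""
    valid = [r for r in readings if r is not None]
    return sum(valid) / len(valid) if valid else None


def bp_stage(systolic: float, diastolic: float) -> str:
    """Stage averaged blood pressure using assumed 2017 ACC/AHA thresholds."""
    if systolic >= 140 or diastolic >= 90:
        return "Stage 2"
    if systolic >= 130 or diastolic >= 80:
        return "Stage 1"
    if systolic >= 120:
        return "Elevated"
    return "Normal"


if __name__ == "__main__":
    systolic = mean_reading([128.0, None, 122.0])
    diastolic = mean_reading([76.0, 78.0, None])
    print(bmi_category(27.3), bp_stage(systolic, diastolic))
```

Checking the most severe stage first means a reading that meets several criteria is assigned the highest applicable stage.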
Metadata & Manifesting
- Component Table Parsing: Extract file listings (XPT/ZIP/FTP) from Demographics, Examination, Laboratory, Dietary, Questionnaire pages
- Schema Versioning: Manifest outputs include schema_version (semver) and generated_at (UTC ISO timestamp)
- Filtering: Year range overlap + file type subsetting
- Local Filename Derivation: Canonical naming with cycle years appended (e.g., DEMO_2017_2018.xpt)
- Summary Aggregation: Nested counts by component and file type
- Manifest Persistence: JSON serialization with optional flattened DataFrame attachment
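
To make the manifest steps above concrete, here is a minimal sketch of year-range filtering, local filename derivation, nested summary counts, and JSON serialization. Only schema_version, generated_at, and the DEMO_2017_2018.xpt naming pattern come from this page. The entry fields (`component`, `code`, `start_year`, `end_year`, `file_type`) and the function names are assumptions made for illustration.

```python
import json
from datetime import datetime, timezone


def years_overlap(file_start, file_end, want_start, want_end):
    """True when the file's cycle intersects the requested year range."""
    return file_start <= want_end and want_start <= file_end


def local_filename(code, start_year, end_year, ext="xpt"):
    """Append cycle years to the file code, e.g. DEMO_2017_2018.xpt."""
    return f"{code}_{start_year}_{end_year}.{ext.lower()}"


def build_manifest(entries, want_start, want_end, file_types=("XPT",)):
    """Filter entries, derive local names, and count by component and type."""
    files, summary = [], {}
    for entry in entries:
        if entry["file_type"] not in file_types:
            continue
        if not years_overlap(entry["start_year"], entry["end_year"],
                             want_start, want_end):
            continue
        name = local_filename(entry["code"], entry["start_year"],
                              entry["end_year"], entry["file_type"])
        files.append({**entry, "local_filename": name})
        by_type = summary.setdefault(entry["component"], {})
        by_type[entry["file_type"]] = by_type.get(entry["file_type"], 0) + 1
    return {
        "schema_version": "1.0.0",  # placeholder semver for the sketch
        "generated_at": datetime.now(timezone.utc).isoformat(),
        "files": files,
        "summary": summary,
    }


if __name__ == "__main__":
    sample = [{"component": "Demographics", "code": "DEMO",
               "start_year": 2017, "end_year": 2018, "file_type": "XPT"}]
    print(json.dumps(build_manifest(sample, 2015, 2018), indent=2))
```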
Analytics Helpers
- Demographic Stratification: Group-wise descriptive stats (count, mean, median, std, min, max) for any metric by demographic variable
- Summary Report Generation: Text-based participant count, age distribution, gender/race breakdowns, health metric summaries
- Visualization: Boxplots + bar charts for metric distributions by demographic groups (lazy matplotlib/seaborn import)
- Survey Weight Support: Helper methods to identify correct survey weights (get_survey_weight) and calculate weighted means (calculate_weighted_mean)
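
As an illustration of what a weighted mean involves, the sketch below computes sum(w * x) / sum(w) over rows where both the value and the weight are present. It is not the signature of calculate_weighted_mean, which this page does not show. The function name `weighted_mean` and its arguments are hypothetical, and it produces a point estimate only, with no design-based variance.

```python
from typing import Optional, Sequence


def weighted_mean(values: Sequence[Optional[float]],
                  weights: Sequence[Optional[float]]) -> Optional[float]:
    """Return sum(w * x) / sum(w), skipping rows with a missing value or weight.

    Hypothetical helper for illustration; returns None when no usable rows
    remain or when the usable weights sum to zero.
    """
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    total = 0.0
    weight_sum = 0.0
    for x, w in zip(values, weights):
        if x is None or w is None:
            continue  # drop incomplete rows rather than treating them as zero
        total += w * x
        weight_sum += w
    if weight_sum == 0:
        return None
    return total / weight_sum


if __name__ == "__main__":
    bmi = [22.0, None, 31.0]
    survey_weight = [1500.0, 900.0, 500.0]
    print(weighted_mean(bmi, survey_weight))
```

Dropping incomplete rows, instead of counting them as zero, keeps missing measurements from pulling the estimate downward.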
Testing & Quality
- Programmatic Validation: validate() method to verify data integrity against official CDC metadata (URL correctness, row counts)
- Analytical Validation Framework: Reproducibility notebooks to validate tool output against published research (reproducibility/)
- Pytest Suite: Expanded tests (observatory coverage 30% → 81%) covering HTML parsing, manifest filtering, weighted means, merged dataset assembly, pesticide ingestion
- NumPy-Style Docstrings: Comprehensive Parameter/Returns/Raises documentation across all modules
- Lint/Format Config: Ruff + Black with notebook exclusion, 120-char line length
- Pre-commit Hooks: Automated code formatting and linting with Black, Ruff, and file hygiene checks
Documentation
- MkDocs Site: Material theme with navigation sections
- Getting Started Guide: Installation, first manifest, Streamlit app launch
- Usage Examples: Manifest generation, quick start snippets, data validation guide
- API Reference: High-level method listing (inline docstrings authoritative)
- Copilot Instructions: Global, Python-specific, and R-specific (future) guidance files
Applications
- Streamlit App: Interactive cycle selection, metric/demographic aggregation, manifest sampling, raw data preview
- Reproducibility Notebooks: Executable studies that validate tool correctness against published statistics
🚧 Planned Features
Near-Term (Q4 2025)
- Laboratory Panel Expansion: Lipids, glucose tolerance, inflammatory markers with dedicated loaders
- Parquet/DuckDB Caching: Persistent local backend for multi-cycle assemblies (optional)
- CLI Utility: Command-line interface for manifest generation, data download, component listing
- Manifest Delta: Compare manifests across dates to detect new/updated files
Mid-Term (Q1 2026)
- Cross-Cycle Harmonization Registry: Variable name mapping + recoding rules for longitudinal analysis
- Automated Data Dictionary Merger: Extract variable documentation from PDF/HTML component pages
- Time Trend Utilities: Multi-cycle joins with alignment & weighting
- Additional Components: Dietary day 2, accelerometer, environmental exposures (dedicated loaders)
- Retention Policy: Configurable cache artifact cleanup (size/time-based)
Long-Term
- Multi-Dataset Adapters: Unified API for BRFSS, NHIS, other public health surveys
- Interactive Cohort Builder: Criteria → derived dataset manifest with provenance
- Plugin Interface: Register custom metric calculators and derivation functions
- Cloud Deployment Recipe: Serverless manifest builder + cache API
- Provenance Tracking: Content hashing, reproducibility metadata, lineage graphs
Quality & Tooling
- Auto API Reference: MkDocs integration with docstring extraction (partial: site exists, automation pending)
- Coverage Gating: Fail CI builds below threshold
- Example Notebooks Gallery: Binder/Codespaces links for interactive demos
Stretch Ideas
- Web UI: Next.js + FastAPI for manifest browsing
- ML Feature Extraction: Standardized pipelines from harmonized datasets
- Synthetic Data Generator: Teaching/demo datasets with privacy preservation
📦 Component Loader Status
| Component | Code Mapped | Loader Method | Column Harmonization | Derived Metrics |
|---|---|---|---|---|
| Demographics (DEMO) | ✅ | ✅ get_demographics_data() | ✅ | Gender/race labels, survey weights |
| Body Measures (BMX) | ✅ | ✅ get_body_measures() | ✅ | BMI categories |
| Blood Pressure (BPX) | ✅ | ✅ get_blood_pressure() | ✅ | BP staging, averages |
| Cholesterol (TCHOL) | β | β | β | β |
| Diabetes (GLU) | β | β | β | β |
| Dietary (DR1TOT) | β | β | β | β |
| Physical Activity (PAQ) | β | β | β | β |
| Smoking (SMQ) | β | β | β | β |
| Alcohol (ALQ) | β | β | β | β |
Legend:
- ✅ Implemented
- β Planned (code path exists for generic download via download_data(), but no dedicated convenience method)
🧪 RAG Pipeline Maturity
| Capability | Status | Notes |
|---|---|---|
| Text ingestion | ✅ Implemented | Sentence segmentation, regex token matching |
| Snippet serialization | ✅ Implemented | JSONL format |
| Reference analyte loading | ✅ Implemented | CSV + YAML source registry |
| Embedding abstraction | ✅ Implemented | DummyEmbedder + SentenceTransformerEmbedder |
| Vector index | ✅ Implemented | In-memory NumPy cosine similarity |
| Retrieval | ✅ Implemented | Top-k snippet ranking |
| Prompt assembly | ✅ Implemented | Length-capped context formatting |
| Generator integration | ✅ Implemented | External callable pattern |
| FAISS backend | 🚧 Optional | Partial support via dependency marker |
| Hybrid retrieval (lexical+vector) | 🚧 Planned | BM25 + embedding fusion |
| Streaming answers | 🚧 Planned | Token-by-token generation helpers |
| Multi-document sources | 🚧 Planned | Expand beyond PDP excerpts |
Legend:
- ✅ Implemented and tested
- 🚧 Planned or partially available
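
The vector index and retrieval rows in the table above describe in-memory cosine similarity with top-k ranking. A minimal sketch of that technique with NumPy follows. The function name `top_k_cosine` and its signature are hypothetical and do not reflect the package's actual index class.

```python
import numpy as np


def top_k_cosine(query, matrix, k=3):
    """Rank rows of `matrix` by cosine similarity to `query`.

    Returns a list of (row_index, score) pairs, best match first.
    Rows or queries with zero norm receive a score of 0.0.
    """
    query = np.asarray(query, dtype=float)
    matrix = np.asarray(matrix, dtype=float)
    # Product of norms is the denominator of the cosine formula.
    denom = np.linalg.norm(matrix, axis=1) * np.linalg.norm(query)
    scores = np.divide(matrix @ query, denom,
                       out=np.zeros(len(matrix)), where=denom > 0)
    order = np.argsort(-scores)[:k]  # indices of the k highest scores
    return [(int(i), float(scores[i])) for i in order]


if __name__ == "__main__":
    snippet_vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    for index, score in top_k_cosine([1.0, 0.0], snippet_vectors, k=2):
        print(index, round(score, 4))
```

A brute-force scan like this is linear in the number of snippets, which suits a small in-memory corpus. An approximate index such as the optional FAISS backend becomes relevant as the corpus grows.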
Data Exchange
| Protocol | Status | Notes |
|---|---|---|
| JSONL (snippets) | ✅ Implemented | Text snippet artifacts |
| JSON (manifests) | ✅ Implemented | Component metadata |
| Parquet (cross-language) | 🚧 Planned | shared_data/ directory reserved; Arrow interchange protocol documented |
| CSV | ❌ Not planned | Discouraged for structured exchange |
| R/Python Interop (Arrow) | 🚧 Planned | Future Parquet-based exchange; no reticulate bridging |
Documentation Coverage
| Artifact | Status | Location |
|---|---|---|
| README | ✅ Complete | README.md |
| Getting Started | ✅ Complete | docs/getting-started.md |
| Quick Start | ✅ Complete | docs/usage/quickstart.md |
| Data Validation Guide | ✅ Complete | docs/usage/validation.md |
| Manifest Reference | ✅ Complete | docs/usage/manifest.md |
| API Overview | ✅ Complete | docs/api.md |
| Feature Status | ✅ Complete | docs/features.md (this page) |
| Inline Docstrings | ✅ Complete | All public functions/classes (NumPy style) |
| Copilot Instructions | ✅ Complete | .github/copilot-instructions.md, scoped files |
| CHANGELOG | ✅ Current | CHANGELOG.md |
| ROADMAP | ✅ Current | ROADMAP.md |
| Pesticide Biomonitoring Plan | ✅ Current | docs/pesticide_biomonitoring_plan.md |
| Auto API Reference | 🚧 Planned | MkDocs plugin integration pending |
Continuous Integration
| Step | Status | Notes |
|---|---|---|
| Lint (Ruff) | ✅ Implemented | Via pre-commit hooks and autofix-pr workflow |
| Format (Black) | ✅ Implemented | Via pre-commit hooks and autofix-pr workflow |
| Test (Pytest) | ✅ Passing | 18 tests (basic, context, RAG, validation) |
| Coverage | 🚧 Configured | coverage tool installed; gating not enforced |
| Build Artifacts | ✅ Implemented | release.yml workflow handles tagged build/publish |
| Pre-commit Hooks | ✅ Implemented | .pre-commit-config.yaml with Black, Ruff, etc. |
| Auto-Versioning | ✅ Implemented | auto-version.yml bumps version on merge to main |
Usage Readiness
| Use Case | Readiness | Requirements |
|---|---|---|
| Explore single-cycle demographics + anthropometrics | ✅ Production-ready | Install from source or PyPI |
| Generate component file manifests with filtering | ✅ Production-ready | BeautifulSoup4 optional for HTML parsing |
| Build interactive Streamlit dashboard | ✅ Production-ready | Streamlit installed |
| Perform weighted survey analyses | 🧪 Experimental | Helper methods implemented; complex variance not yet supported |
| Cross-cycle trend analysis | ❌ Not ready | Harmonization registry + time utilities pending |
| Pesticide RAG question answering | 🧪 Experimental | Functional but API may evolve; test coverage limited |
| Export harmonized data for R analysis | 🚧 Partially ready | Parquet protocol documented; no R source yet |
Legend:
- ✅ Production-ready: Stable API, tested, documented
- 🧪 Experimental: Functional but evolving API
- 🚧 Partially ready: Infrastructure exists, full workflow incomplete
- ❌ Not ready: Planned but not implemented
Last Updated: 2026-04-22
Version Coverage: 1.0.0
For implementation timelines, see ROADMAP.md. For change history, see CHANGELOG.md. For pesticide domain planning details, see Pesticide Biomonitoring Plan.