A modern, daily-runnable CLI for mirroring bulk biomedical datasets. It tracks every file in a SQLite state DB so re-runs do the minimum work: already-verified immutable files are skipped with no network request beyond the directory/manifest listing.

    pip install -e .

Or use the Makefile:

    make install
    make dev

Run a sync:

    litsync --data-root /data/literature --email you@institute.org

Common options:

    litsync --data-root /data/literature --email you@institute.org \
      --sources pubmed pmc fda clinicaltrials \
      --fda-endpoints drug/event drug/label

    --sources pubmed pmc fda clinicaltrials   # which corpora (default: all four)
    --fda-endpoints drug/event drug/label     # default: all openFDA endpoints
    --pmc-groups oa_comm oa_noncomm oa_other
    --pmc-formats xml txt                     # default: xml
    --workers 4                               # concurrent downloads (keep modest; be polite)
    --dry-run                                 # plan only, download nothing
    --reverify                                # re-download local files (integrity audit)
    --prune                                   # delete local files no longer on the server
    --count-articles                          # count articles in already-downloaded files (no network)
    --no-rich                                 # disable Rich progress bars / tables

On-disk layout:

    /data/literature/
      pubmed/baseline/                     pubmed26nXXXX.xml.gz (+ .md5 verified)
      pubmed/updatefiles/                  daily citation deltas
      pmc/oa_bulk/<group>/<fmt>/           baseline + dated incremental .tar.gz
      pmc/oa_file_list.csv                 PMCID <-> PMID id map
      fda/<category>/<endpoint>/           openFDA bulk snapshot zips + extracted JSON
      clinicaltrials/ctg-public-xml.zip    ClinicalTrials.gov full XML dump
      clinicaltrials/ctg-public-xml/       extracted study XML files
      _state/state.sqlite                  file ledger (status, size, mtime, md5, etag, attempts)
      _state/logs/                         dated run logs
      _state/litsync.lock                  run lock (prevents overlapping cron runs)
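
The file ledger in `_state/state.sqlite` is what lets a re-run skip work. The sketch below shows one way such a ledger could be laid out and queried. The column names come from the layout above; the table name `files`, the `path` key, the `"verified"` status value, and the helper names are assumptions for illustration, not litsync's actual schema.

```python
import sqlite3

# Hypothetical schema: one row per mirrored file, keyed by its relative path.
SCHEMA = """
CREATE TABLE IF NOT EXISTS files (
    path     TEXT PRIMARY KEY,
    status   TEXT NOT NULL,
    size     INTEGER,
    mtime    REAL,
    md5      TEXT,
    etag     TEXT,
    attempts INTEGER NOT NULL DEFAULT 0
)
"""


def open_ledger(db_path=":memory:"):
    """Open (or create) the ledger database and ensure the table exists."""
    conn = sqlite3.connect(db_path)
    conn.execute(SCHEMA)
    return conn


def record(conn, path, status, size=None, mtime=None, md5=None, etag=None):
    """Store the outcome of one download attempt, bumping the attempt counter."""
    cur = conn.execute(
        "UPDATE files SET status = ?, size = ?, mtime = ?, md5 = ?, etag = ?, "
        "attempts = attempts + 1 WHERE path = ?",
        (status, size, mtime, md5, etag, path),
    )
    if cur.rowcount == 0:
        # First time this file is seen: insert it with attempts = 1.
        conn.execute(
            "INSERT INTO files (path, status, size, mtime, md5, etag, attempts) "
            "VALUES (?, ?, ?, ?, ?, ?, 1)",
            (path, status, size, mtime, md5, etag),
        )
    conn.commit()


def needs_download(conn, path, remote_size):
    """Skip only files already verified at the size the server reports."""
    row = conn.execute(
        "SELECT status, size FROM files WHERE path = ?", (path,)
    ).fetchone()
    if row is None:
        return True
    status, size = row
    return not (status == "verified" and size == remote_size)
```

With a ledger like this, the planning step needs only the remote listing plus one indexed lookup per file, which matches the "minimum work" behaviour described at the top.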

Example cron entry (daily at 02:30):

    30 2 * * * /path/to/venv/bin/litsync --data-root /data/literature --email you@institute.org >> /data/literature/_state/cron.log 2>&1

Run the extractor:

    litsync-extract --data-root /data/literature --out /data/corpus \
      --sources pubmed pmc fda clinicaltrials

Or with Make:

    make extract DATA_ROOT=/data/literature CORPUS_OUT=/data/corpus
    make extract-test DATA_ROOT=/data/literature

- PubMed: every `.xml.gz` is verified against its NCBI `.md5` sidecar.
- PMC: bulk packages have no md5 sidecar, so they are verified by `Content-Length`, and an `ETag` is recorded for change detection.
- openFDA / ClinicalTrials.gov: these sources publish full snapshots. The downloader detects changed snapshots via `ETag`/`Last-Modified`/`Content-Length` and only re-downloads when the snapshot changes. When a snapshot changes, it is extracted again next to the zip file.
- Downloads are atomic (`.part` -> rename) and resumable via HTTP Range.
- Exit code is non-zero if any file failed, so cron/monitoring can alert.
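
The verification and atomic-download steps above can be sketched as follows. This is a minimal illustration rather than litsync's own code: the helper names are hypothetical, the sidecar parser simply takes the first 32-digit hex string it finds (an assumption that covers both `MD5(file)= <hex>` and `<hex>  file` layouts), and discarding a `.part` file on checksum mismatch is one possible policy.

```python
import hashlib
import os
import re
from pathlib import Path


def md5_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large archives never load fully into memory."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def parse_md5_sidecar(text):
    """Extract the first 32-hex-digit token from a sidecar file's text."""
    match = re.search(r"\b[0-9a-fA-F]{32}\b", text)
    if match is None:
        raise ValueError("no MD5 digest found in sidecar")
    return match.group(0).lower()


def resume_headers(part_path):
    """Build an HTTP Range header that resumes after the bytes already on disk."""
    part_path = Path(part_path)
    size = part_path.stat().st_size if part_path.exists() else 0
    return {"Range": f"bytes={size}-"} if size > 0 else {}


def finalize_download(part_path, final_path, expected_md5):
    """Promote a finished .part file to its final name only if the MD5 matches."""
    if md5_of(part_path) != expected_md5.lower():
        os.remove(part_path)  # assumed policy: a corrupt download is discarded
        return False
    os.replace(part_path, final_path)  # atomic rename on the same filesystem
    return True
```

Because the final name only ever appears through `os.replace`, a reader of the data directory never sees a half-written file; an interrupted run leaves only a `.part` file that the next run can resume from.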
- openFDA bulk data is zipped JSON. The manifest is fetched from `https://api.fda.gov/download.json`. Each endpoint partition becomes one downloaded/extracted unit.
- ClinicalTrials.gov bulk data is the full public XML dump from `https://clinicaltrials.gov/api/legacy/public-xml?format=zip`. One XML file per study.
- Both sources are snapshots, not daily deltas. Daily runs are still cheap because unchanged snapshots are skipped; changed snapshots are replaced in full.
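
To make the snapshot-skipping idea concrete, here is a small sketch of a change check based on the `ETag`, `Last-Modified`, and `Content-Length` response headers. The function name, the shape of the stored record, and the exact comparison rule (any differing validator means "changed"; nothing comparable also means "changed") are assumptions for illustration; the real downloader may weigh these headers differently.

```python
# Response headers used as change validators, stored under lowercase keys.
VALIDATORS = ("etag", "last-modified", "content-length")


def snapshot_changed(stored, headers):
    """Return True when the remote snapshot appears to differ from the stored one.

    `stored` is the previously recorded validator dict (lowercase keys);
    `headers` is the header mapping from a fresh HEAD or GET response.
    """
    remote = {name.lower(): value for name, value in headers.items()}
    compared = 0
    for name in VALIDATORS:
        old, new = stored.get(name), remote.get(name)
        if old is None or new is None:
            continue  # validator missing on one side: cannot compare it
        compared += 1
        if str(old).strip() != str(new).strip():
            return True
    # With nothing to compare, assume a change and download to be safe.
    return compared == 0
```

For example, `snapshot_changed({"etag": '"abc"'}, {"ETag": '"abc"'})` returns `False`, so a daily run would skip that snapshot, while a first run with an empty stored record returns `True`.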