Last updated June 18, 2026

Methodology

GDELT Cloud structures and enriches public event data so APIs, dashboards, and agents can use it without re-deriving the pipeline. This page explains where the data comes from, how often it updates, how we process it, and where it should not be the sole source.

Source universe

GDELT Cloud's primary raw inputs are the public GDELT 2.0 datasets: Events, the Global Knowledge Graph (GKG), and the Global Entity Graph (GEG). These are sourced from Google BigQuery, which hosts the official GDELT public datasets.

The hourly pipeline can also ingest native-source metadata from selected Arabic and Chinese RSS/news sitemap sources. These rows retain source language, country/region, compliance status, and provenance separately from the GDELT raw tables; compliance-gated sources are logged but not admitted to product intake.

Energy infrastructure and heavy-industry data are sourced directly from Global Energy Monitor's published registries — power and fuel assets (coal plants, renewable capacity, pipelines, LNG terminals, nuclear, and related categories) alongside GEM's Heavy Industry trackers (iron & steel, cement, chemicals, and iron ore). We obtain GEM data offline, separately from the GDELT ingest pipeline, and resolve asset owners through GEM's ownership graph into our unified entity registry so an entity's page shows the plants and mines it owns.

Corporate filings data (admin preview) is sourced from SEC EDGAR — the public-domain US securities filing system. We ingest the public filing index and structured XBRL financials hourly, within the SEC's published fair-access limits. EDGAR content is public domain; we cite the SEC as source. Corporate relationships (subsidiaries, suppliers, customers, jurisdictions) surfaced from filing text are derived through our proprietary AI pipeline and resolved into our entity registry — these derived relations are GDELT Cloud's interpretation, not statements by the filer.

Macro-economic time series (admin preview) are official U.S. economic statistics drawn from the Federal Reserve (FRED / ALFRED) and the originating agencies that publish them — the Bureau of Labor Statistics, Bureau of Economic Analysis, Census Bureau, and Treasury, among others. Observations are stored point-in-time (vintaged), so an as-of query returns the value as it was known on that date and never looks ahead. A minority of series carry third-party copyright (e.g. S&P, ICE, Dow Jones); we identify these and never store, serve, or use them, and macro text is never added to any embedding or fine-tuning corpus. This product uses data from the Federal Reserve Bank of St. Louis (FRED) but is not endorsed or certified by it.
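
As an illustration of the point-in-time behavior described above, here is a minimal sketch of an as-of lookup. It assumes each observation period keeps a list of (vintage date, value) pairs sorted by vintage date; the `value_as_of` name, the in-memory layout, and the sample numbers are hypothetical and do not describe GDELT Cloud's actual storage or API.

```python
from bisect import bisect_right
from datetime import date
from typing import Optional

# Hypothetical layout: all published vintages of ONE observation period,
# as (vintage_date, value) pairs sorted by vintage_date.
Vintages = list[tuple[date, float]]


def value_as_of(vintages: Vintages, as_of: date) -> Optional[float]:
    """Return the latest value published on or before `as_of`.

    Returns None when nothing had been published yet, so the lookup
    never uses a later revision (no look-ahead).
    """
    vintage_dates = [vintage_date for vintage_date, _ in vintages]
    idx = bisect_right(vintage_dates, as_of)
    if idx == 0:
        return None
    return vintages[idx - 1][1]


# Illustrative numbers only, not real statistics.
example_vintages: Vintages = [
    (date(2026, 4, 29), 1.2),  # first release
    (date(2026, 5, 28), 1.4),  # first revision
    (date(2026, 6, 25), 1.3),  # second revision
]

print(value_as_of(example_vintages, date(2026, 5, 1)))
```

A query dated between the first release and the first revision sees only the first release, even though later revisions exist in the store.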

Maritime vessel-flow signals (admin preview) are derived from public and commercial AIS vessel-tracking sources and expose only derived measures — chokepoint transit counts, dwell time, AIS-dark gaps, and last-known vessel positions and identity — across 11 maritime chokepoints (Hormuz, Bab-el-Mandeb, Malacca, Suez, Panama, Bosphorus, Gibraltar, Dover, Kerch, Taiwan, and the Danish Straits). We do not redistribute a raw AIS position feed or full track history. Maritime signals accrue from launch (June 2026) forward and are not backfilled; queries for dates before launch return no maritime coverage. Coverage is terrestrial-AIS — vessels are observed when within range of shore receivers, so open-ocean segments (which would require satellite AIS) are not captured. Vessels are matched to Global Energy Monitor's LNG-carrier registry. The ports reference behind the port-disruption and proximity queries is the U.S. National Geospatial-Intelligence Agency's World Port Index (Pub 150) — a public-domain U.S. Government publication of roughly 3,800 ports with coordinates, harbor characteristics, and terminal facilities — with UN/LOCODEs from the UNECE code list.
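
To make the idea of an AIS-dark gap concrete, the sketch below flags intervals between consecutive position reports for one vessel that exceed a threshold. The `dark_gaps` function and the six-hour default are assumptions chosen for illustration; the text does not state how GDELT Cloud defines or computes these gaps.

```python
from datetime import datetime, timedelta


def dark_gaps(report_times, threshold=timedelta(hours=6)):
    """Return (start, end) pairs where consecutive reports are more than
    `threshold` apart. The threshold is a hypothetical value."""
    ordered = sorted(report_times)
    return [
        (earlier, later)
        for earlier, later in zip(ordered, ordered[1:])
        if later - earlier > threshold
    ]


# Illustrative timestamps for a single vessel.
reports = [
    datetime(2026, 6, 20, 0, 0),
    datetime(2026, 6, 20, 1, 0),
    datetime(2026, 6, 20, 12, 0),  # 11 hours after the previous report
    datetime(2026, 6, 20, 13, 0),
]

for start, end in dark_gaps(reports):
    print(start.isoformat(), "->", end.isoformat())
```

Because coverage is terrestrial-AIS, a gap found this way may simply mean the vessel was out of range of shore receivers, so it is best read as a prompt for further checking.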

AI-compute data (admin preview) exposes Epoch AI's published research datasets — notable AI models, ML hardware, AI data centers, chip-sales estimates, and AI-company metrics (funding, revenue, staff, and compute spend). These are loaded offline, separately from the GDELT ingest pipeline, and Epoch's organizations are resolved into our unified entity registry so an entity's page can show the models it developed, the hardware it makes, the data centers it owns, and its cumulative chip sales. Epoch AI data is published under CC-BY 4.0 and attributed to Epoch AI; it is a data source, not a subprocessor.

GDELT Cloud does not directly observe world events. We structure and enrich what reporters, public registries, and the GDELT Project's automated systems have already published.

Update frequency

The main ingest runs hourly, at the top of the hour. Supporting jobs reconcile entity links, cluster membership, and GEG entity coverage on the same cadence, and daily and weekly entity reconciliation passes widen the window.

Native-source metadata is admitted under a capped incremental latency budget so it does not slow the main hourly cycle.

Typical end-to-end latency from GDELT publication to availability in GDELT Cloud is about 15 minutes — the duration of the ingest pipeline. Variable upstream lag in GEG entity coverage can extend this for some entity-linked fields. Real-time freshness for each pipeline stage is published on the data status page.

Public entity, event, and story pages are cached for speed: data for the current UTC day refreshes on the hourly cadence, while pages for prior-day events and stories — which are settled — may be served from cache for up to 24 hours and refresh when reconciliation revises them. The REST API and MCP reflect the same hourly ingest cadence.
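
The caching rule above can be read as a simple policy keyed on the UTC day of the page's content. The sketch below is one interpretation under stated assumptions: the `max_cache_age` name is hypothetical, and mapping "hourly cadence" to a one-hour maximum age is an assumption, not a documented cache setting.

```python
from datetime import date, datetime, timedelta, timezone


def max_cache_age(page_day: date, now: datetime) -> timedelta:
    """Longest time a cached page might be served before refresh.

    `page_day` is the UTC day of the event or story; `now` must be a
    timezone-aware datetime. Both values returned are assumptions that
    mirror the prose (hourly for today, up to 24 hours for prior days).
    """
    today_utc = now.astimezone(timezone.utc).date()
    if page_day >= today_utc:
        return timedelta(hours=1)
    return timedelta(hours=24)


now = datetime(2026, 6, 18, 23, 30, tzinfo=timezone.utc)
print(max_cache_age(date(2026, 6, 18), now))
print(max_cache_age(date(2026, 6, 17), now))
```

Converting `now` to UTC before comparing matters: a client in a timezone behind UTC can already be in the next UTC day while its local calendar still shows the previous one.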

Data ingestion status

Event construction

Within each UTC ingest day, articles are grouped into stories and coded into structured events through a proprietary pipeline that combines AI classification, entity resolution, and our CAMEO+/ACLED taxonomies.

Coding spans both ACLED-style conflict events and CAMEO+ structured events across ten domains (political, crime, economic, corporate, technology, infrastructure, environment, health, demographic, and information). The taxonomies and output fields are documented; the fusion and classification logic that produces them is proprietary.

Event taxonomy reference

Taxonomy and scoring

GDELT Cloud uses two event taxonomies: ACLED-aligned conflict event types (political violence, protests, riots, strategic developments) and a CAMEO+ taxonomy that extends CAMEO across ten domain categories.

Scoring fields include the Goldstein scale (-10 to +10, populated for political and conflict events), magnitude (0-10, domain-specific severity), and the systemic_importance, propagation_potential, and market_sensitivity scores. Quad class and event-root codes are inherited from the underlying CAMEO/CAMEO+ structure. The fields and their ranges are documented; how each score is computed is proprietary.
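
Integrators may want to sanity-check the documented ranges on the client side. The sketch below validates only the two ranges stated above (Goldstein scale and magnitude) and treats missing values as acceptable, since the Goldstein scale is populated only for political and conflict events. The dictionary keys `goldstein_scale` and `magnitude` are hypothetical field names, not confirmed schema names.

```python
from typing import Mapping, Optional

# Ranges taken from the text above. Other scores are left out because
# their ranges are not stated here.
SCORE_RANGES = {
    "goldstein_scale": (-10.0, 10.0),  # hypothetical key name
    "magnitude": (0.0, 10.0),          # hypothetical key name
}


def out_of_range_fields(event: Mapping[str, Optional[float]]) -> list[str]:
    """Return the names of score fields whose values fall outside range.

    Missing or None values are skipped rather than reported.
    """
    bad = []
    for field, (low, high) in SCORE_RANGES.items():
        value = event.get(field)
        if value is None:
            continue
        if not low <= value <= high:
            bad.append(field)
    return bad


print(out_of_range_fields({"goldstein_scale": None, "magnitude": 11.0}))
```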

Story scope and clustering

Each story (cluster) carries a scope: local, national, regional, or international, derived from the geographic and source distribution of articles in the cluster.

Clustering operates on a UTC ingest-day window. Articles published on different days currently do not merge into the same cluster, even when they describe the same ongoing incident. Cross-day continuity is on the roadmap and is flagged in the known limitations section below.
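
The effect of a UTC day window can be illustrated by partitioning articles on a UTC date key before any clustering takes place. This is a sketch only: it assumes one timezone-aware timestamp per article, the function names are hypothetical, and the clustering step itself is proprietary and not shown.

```python
from collections import defaultdict
from datetime import datetime, timedelta, timezone


def utc_day_key(ts: datetime) -> str:
    """Return the UTC calendar day of a timezone-aware timestamp."""
    return ts.astimezone(timezone.utc).date().isoformat()


def partition_by_utc_day(articles):
    """Group (article_id, timestamp) pairs into per-UTC-day buckets.

    Clustering would then run inside each bucket, so articles in
    different buckets can never share a cluster.
    """
    buckets = defaultdict(list)
    for article_id, ts in articles:
        buckets[utc_day_key(ts)].append(article_id)
    return dict(buckets)


# Two reports on the same incident, two minutes apart across midnight UTC.
articles = [
    ("a1", datetime(2026, 6, 17, 23, 59, tzinfo=timezone.utc)),
    ("a2", datetime(2026, 6, 18, 0, 1, tzinfo=timezone.utc)),
]
print(partition_by_utc_day(articles))
```

The two articles land in separate buckets despite being two minutes apart, which is the behavior cross-day continuity is meant to address.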

Known limitations

Historical coverage is strongest from March 2026 forward. Earlier history is being backfilled but is incomplete; queries that span pre-March 2026 dates may return uneven coverage.

Source bias is real. GDELT's source universe over-indexes English-language and globally indexed outlets relative to local-language reporting, and some regions are systematically under-covered. GDELT Cloud inherits this bias.

Classification and deduplication are imperfect. CAMEO+ and ACLED coding relies on automated systems and may misclassify edge cases. Cross-cluster duplicates are flagged via a duplicate field but are not always resolved.

Cluster labels are generated by language models, and their consistency is not guaranteed across days or topics.

The story_count field is the number of distinct stories within the queried window; it is not a sum across dates. Summing story_count values from separate per-date queries, without deduplicating stories, will overcount.
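
The difference between a distinct count and a sum of per-date counts can be shown with a small sketch. It assumes each story has a stable identifier that can appear in more than one date's results; the identifiers and function names below are hypothetical.

```python
def distinct_story_count(story_ids_by_date: dict[str, set[str]]) -> int:
    """Count each story once across all dates in the window."""
    seen: set[str] = set()
    for story_ids in story_ids_by_date.values():
        seen |= story_ids
    return len(seen)


def summed_story_count(story_ids_by_date: dict[str, set[str]]) -> int:
    """Add up per-date counts; stories seen on several dates count twice."""
    return sum(len(story_ids) for story_ids in story_ids_by_date.values())


# Illustrative identifiers: story "s2" appears on both dates.
window = {
    "2026-06-16": {"s1", "s2"},
    "2026-06-17": {"s2", "s3"},
}
print(distinct_story_count(window), summed_story_count(window))
```

Here the distinct count is 3 while the summed count is 4, because the shared story is counted once per date in the sum.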

Appropriate use

GDELT Cloud is built for monitoring, triage, alerting, research acceleration, signal discovery, country and sector trend analysis, and agent workflows that benefit from structured event data with clean schema, entity links, and category coverage.

Not appropriate as the sole source for

Legal determinations, sanctions screening, emergency-response decisions, sole-source investment decisions, individual-level risk decisions, or any claim requiring evidentiary certainty.

Treat GDELT Cloud output as analytical signal, not ground truth. High-impact conclusions should be corroborated with primary sources, official statements, or trusted reporting.

Schema and classifier changes

We may update schemas, classifiers, scoring logic, deduplication, clustering, and enrichment methods over time. Material changes that affect integrators are documented in the changelog. Where feasible, we preserve backward-compatible fields and provide migration notes.

Changelog

Methodology questions?

For deeper questions about data lineage, schema changes, or appropriate use, email us.