Metrics "Number of Datasets per org" not updating · Issue #5027 · GSA/data.gov · GitHub
Closed
tdlowden opened this issue Jan 6, 2025 · 4 comments
Labels: bug (Software defect or bug), metrics (Stats, metrics and data visualizations of catalog)

Comments

@tdlowden
Member

tdlowden commented Jan 6, 2025

The "Number of Datasets per org" table on /metrics is not up to date with the actual CKAN records.

For example, on 1/6, /metrics showed:

Image

But on Harvest info section in CKAN:

Image

How to reproduce

  1. View a number of datasets by org on data.gov/metrics
  2. Compare it to the corresponding Harvest metrics page in CKAN

Expected behavior

These numbers should match, and the /metrics table should update monthly with the rest of the reports.

Actual behavior

The table is stale

Sketch

@tdlowden added the bug and metrics labels Jan 6, 2025
@rshewitt moved this to 🏗 In Progress [8] in data.gov team board Jan 6, 2025
@rshewitt self-assigned this Jan 6, 2025
@rshewitt
Contributor

rshewitt commented Jan 7, 2025

  • total datasets in the metrics dashboard may get calculated here in harvester.
  • both the data.gov/metrics and org page for noaa display 105,246 datasets.
  • the number of harvest sources (71) for noaa is the same between ckan and the metrics dashboard
  • on staging, noaa has 76617 packages. running the equivalent sql command (see below) of the harvester code i linked to in the first bullet produces a dataset count of 92284
SELECT COUNT(*)
FROM package
JOIN harvest_object
ON package.id = harvest_object.package_id
WHERE harvest_object.harvest_source_id in (select id from harvest_source where url like '%noaa%')
  AND harvest_object.current = TRUE
  AND package.state = 'active'
  AND package.private = FALSE;

 count 
-------
 92284
(1 row)

my hunch is that the join between packages and harvest objects returns extra rows (i.e. harvest objects are the culprit). compare the above result to:

select count(*) 
from package 
where owner_org = '5f4f1195-e770-4a2a-8f75-195cd98860ce'  -- noaa
and private = FALSE 
and state = 'active';

 count 
-------
 76688
(1 row)

this still isn't the same as the 76617 packages on staging (off by 71), but it's closer
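One way to see why the two counts above can disagree is to diff the two package sets instead of comparing totals. The sketch below is a toy reproduction in SQLite, not the staging database: the table layout is cut down to the columns the queries above use, and the IDs, URLs, and org names are made up. It assumes that filtering harvest sources by `url like '%noaa%'` is only a proxy for org membership, so packages can fall on either side of the gap.

```python
import sqlite3

# Toy stand-in for the CKAN tables used in the queries above (all rows are invented).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE harvest_source (id TEXT PRIMARY KEY, url TEXT);
CREATE TABLE package (id TEXT PRIMARY KEY, owner_org TEXT, state TEXT, private INTEGER);
CREATE TABLE harvest_object (id TEXT PRIMARY KEY, package_id TEXT,
                             harvest_source_id TEXT, current INTEGER);

INSERT INTO harvest_source VALUES
  ('s1', 'https://example.noaa.gov/waf/'),
  ('s2', 'https://example.org/other/');
INSERT INTO package VALUES
  ('p1', 'org-noaa',  'active', 0),   -- harvested from s1
  ('p2', 'org-noaa',  'active', 0),   -- harvested from s2, whose URL lacks the org name
  ('p3', 'org-noaa',  'active', 0),   -- no harvest object at all
  ('p4', 'org-other', 'active', 0);   -- different org, but harvested from s1
INSERT INTO harvest_object VALUES
  ('h1', 'p1', 's1', 1),
  ('h2', 'p2', 's2', 1),
  ('h4', 'p4', 's1', 1);
""")

# Package IDs selected by owner_org (the second query above).
BY_ORG = """
SELECT id FROM package
WHERE owner_org = 'org-noaa' AND private = 0 AND state = 'active'
"""

# Package IDs selected through the harvest-source URL filter (the first query above).
BY_SOURCE = """
SELECT package.id FROM package
JOIN harvest_object ON package.id = harvest_object.package_id
WHERE harvest_object.harvest_source_id IN
      (SELECT id FROM harvest_source WHERE url LIKE '%noaa%')
  AND harvest_object.current = 1
  AND package.state = 'active'
  AND package.private = 0
"""

# EXCEPT returns the IDs present in the left query but missing from the right one.
only_by_org = sorted(r[0] for r in conn.execute(f"{BY_ORG} EXCEPT {BY_SOURCE}"))
only_by_source = sorted(r[0] for r in conn.execute(f"{BY_SOURCE} EXCEPT {BY_ORG}"))

print("owned by org but not matched via source URL:", only_by_org)
print("matched via source URL but owned elsewhere:", only_by_source)
```

On real data the two `EXCEPT` lists would show which packages account for the difference, which is more actionable than the raw totals.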

interestingly, when I count the number of unique package IDs using the following query I get 76614 packages, which is 3 fewer than the 76617 "expected"

SELECT COUNT( DISTINCT package.id )
FROM package
JOIN harvest_object
ON package.id = harvest_object.package_id
WHERE harvest_object.harvest_source_id in (select id from harvest_source where url like '%noaa%')
  AND harvest_object.current = TRUE
  AND package.state = 'active'
  AND package.private = FALSE;

 count 
-------
 76614
(1 row)

running ^ this same command for census-gov produces the same count found on the census bulk process page, which is different from what is displayed on the organization page
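The gap between the `COUNT(*)` and `COUNT( DISTINCT package.id )` results above is what a join produces when one package matches more than one harvest object. Whether that is what happens on staging is not confirmed here; the sketch below only shows the mechanism on invented rows in SQLite, and adds a `GROUP BY ... HAVING` query that lists the packages responsible.

```python
import sqlite3

# Toy tables with only the columns the join needs (all rows are invented).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE package (id TEXT PRIMARY KEY, state TEXT, private INTEGER);
CREATE TABLE harvest_object (id TEXT PRIMARY KEY, package_id TEXT, current INTEGER);

INSERT INTO package VALUES ('p1', 'active', 0), ('p2', 'active', 0);
INSERT INTO harvest_object VALUES
  ('h1', 'p1', 1),
  ('h2', 'p1', 1),   -- second current harvest object for the same package
  ('h3', 'p2', 1),
  ('h4', 'p2', 0);   -- not current, so filtered out by the WHERE clause
""")

# Shared FROM/WHERE clause, mirroring the filters in the queries above.
JOINED = """
FROM package
JOIN harvest_object ON package.id = harvest_object.package_id
WHERE harvest_object.current = 1
  AND package.state = 'active'
  AND package.private = 0
"""

# COUNT(*) counts joined rows; COUNT(DISTINCT ...) counts packages.
row_count = conn.execute("SELECT COUNT(*) " + JOINED).fetchone()[0]
distinct_count = conn.execute(
    "SELECT COUNT(DISTINCT package.id) " + JOINED
).fetchone()[0]

# Packages that contribute more than one joined row.
duplicated = conn.execute(
    "SELECT package.id, COUNT(*) " + JOINED +
    " GROUP BY package.id HAVING COUNT(*) > 1"
).fetchall()

print(row_count, distinct_count, duplicated)  # 3 2 [('p1', 2)]
```

If the same `GROUP BY ... HAVING` query returned rows on staging, it would support the hunch that harvest objects inflate the joined count.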

@tdlowden
Member Author

tdlowden commented Jan 8, 2025

Ah super fascinating, thanks @rshewitt! Ok, I am comparing the November & December downloads with some I have from July and October

December: global__datasets_per_org.2024-12-31.csv

November: global__datasets_per_org.2024-11-30.csv

October: Oct_2024_global__datasets_per_org.csv

July: global__datasets_per_org.csv

It is WILD to me that they all have 113 orgs with datasets. The numbers do vary, but in that time we've had movement on active orgs from our org audit in Oct/Nov. Can it be coincidence that there are 113 rows of data for every month?
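A quick way to tell whether the 113 rows are the same 113 orgs every month, or just the same number of rows, is to compare the org sets between two downloads. This is a rough sketch: the column names `organization` and `count` are assumptions (the real CSV headers are not shown in this thread), and the inline strings stand in for the downloaded files.

```python
import csv
import io


def org_counts(csv_text, org_col="organization", count_col="count"):
    """Parse a datasets-per-org CSV into {org: dataset_count}.

    The column names are hypothetical; adjust them to the real headers.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    return {row[org_col]: int(row[count_col]) for row in reader}


# Invented sample data standing in for two monthly downloads.
july_csv = "organization,count\nnoaa-gov,10\ngsa-gov,5\nold-org,1\n"
december_csv = "organization,count\nnoaa-gov,12\ngsa-gov,5\nnew-org,2\n"

july = org_counts(july_csv)
december = org_counts(december_csv)

# Same number of rows does not imply the same orgs.
same_row_count = len(july) == len(december)
added = sorted(set(december) - set(july))
removed = sorted(set(july) - set(december))

# Orgs present in both months whose counts moved.
changed = {
    org: (july[org], december[org])
    for org in sorted(set(july) & set(december))
    if july[org] != december[org]
}

print(same_row_count, added, removed, changed)
# True ['new-org'] ['old-org'] {'noaa-gov': (10, 12)}
```

If `added` and `removed` both come back empty for every pair of months despite the org audit, that would point to the org list itself being stale rather than a coincidence.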

@FuhuXia
Member

FuhuXia commented Jan 8, 2025

One big factor in the count difference is collection data. Take GSA, for example: we have 338 datasets in total, and 167 when collection datasets are excluded.

338 is what you see in the Harvest report on the catalog.data.gov metrics dashboard; 167 is what you see on data.gov metrics.

For NOAA data, however, there is more to it. We can open another ticket on it based on the query result, but the count difference between the catalog.data.gov metrics dashboard and data.gov metrics is expected. We can modify one end to make the numbers consistent if that is important.
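To make the two numbers concrete, here is a small sketch of counting an org's datasets with and without collection members. It assumes collection membership can be read from a single nullable column, here a hypothetical `collection_parent_id`; the real catalog may record this differently, and whether the collection's parent record is itself excluded is also an assumption. The rows are invented.

```python
import sqlite3

# Toy table; collection_parent_id is a hypothetical column, NULL for
# datasets that are not members of a collection.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE package (id TEXT PRIMARY KEY, owner_org TEXT, collection_parent_id TEXT);
INSERT INTO package VALUES
  ('parent', 'org-a', NULL),
  ('child1', 'org-a', 'parent'),
  ('child2', 'org-a', 'parent'),
  ('solo',   'org-a', NULL);
""")

# One pass: total rows, and rows that are not collection members.
total, standalone = conn.execute("""
SELECT COUNT(*),
       SUM(CASE WHEN collection_parent_id IS NULL THEN 1 ELSE 0 END)
FROM package
WHERE owner_org = 'org-a'
""").fetchone()

in_collections = total - standalone

print(total, standalone, in_collections)  # 4 2 2
```

With the GSA figures quoted above, the same arithmetic gives 338 - 167 = 171 datasets that belong to collections.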

@rshewitt
Contributor

rshewitt commented Jan 8, 2025

@FuhuXia thanks for the info!

@rshewitt closed this as completed Jan 8, 2025
@github-project-automation bot moved this from 🏗 In Progress [8] to ✔ Done in data.gov team board Jan 8, 2025

3 participants







