We have a series of status updates for projects, and the last update for a given project (its current status) is the one we want to report on in several ways. For instance:
ProjectID | DateTime | EventDescription |
---|---|---|
001 | 2024-12-07 11:34 | New |
001 | 2024-12-07 11:36 | Submitted |
002 | 2024-12-07 11:40 | New |
003 | 2024-12-07 12:34 | New |
001 | 2024-12-07 14:02 | Approved |
002 | 2024-12-07 14:55 | Submitted |
004 | 2024-12-07 15:02 | New |
004 | 2024-12-07 15:44 | Submitted |
001 | 2024-12-07 16:03 | Completed |
In our actual data, there are, of course, thousands of projects and many more status updates.
THE GOAL: Use an aggregation to grab the last status (the current status) of each project within a datetime range, then summarize those current statuses as a count per status.
For the data above, we want to get:
Status | Project Count |
---|---|
New | 1 |
Submitted | 2 |
Completed | 1 |
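To pin down the semantics we're after, here is a minimal Python sketch of the "latest update per project, then count by status" logic applied to the sample rows above (this is just illustration, not how we want to solve it):

```python
from collections import Counter

# Sample status updates: (project_id, datetime, event_description)
updates = [
    ("001", "2024-12-07 11:34", "New"),
    ("001", "2024-12-07 11:36", "Submitted"),
    ("002", "2024-12-07 11:40", "New"),
    ("003", "2024-12-07 12:34", "New"),
    ("001", "2024-12-07 14:02", "Approved"),
    ("002", "2024-12-07 14:55", "Submitted"),
    ("004", "2024-12-07 15:02", "New"),
    ("004", "2024-12-07 15:44", "Submitted"),
    ("001", "2024-12-07 16:03", "Completed"),
]

# Keep only the latest update per project
# (these timestamp strings sort correctly lexicographically)
latest = {}
for project_id, ts, status in updates:
    if project_id not in latest or ts > latest[project_id][0]:
        latest[project_id] = (ts, status)

# Count projects by their current status
counts = Counter(status for _, status in latest.values())
```

Running this on the sample data yields New: 1, Submitted: 2, Completed: 1, matching the table above.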
We are looking for a way to do this in a single query; we need it in several places, and this is just one example. Using a transform is not a viable option for us at this time.
In addition to simple counts, we next hope to figure out how to aggregate these status updates into per-day bucket counts to graph status across a series of days: how many projects are New, Submitted, etc. on each day. But we would be thrilled just to get the status counts accurate.
We believe this requires a pipeline aggregation, but have not been able to get it working.
Our working aggregation query to get the latest status for each project:
GET journaling*/_search
{
"query": {
"bool": {
"filter": [
{ "range": {
"DATETIME": {
"gte":"2024/11/01 00:00:00.000",
"lte":"2024/11/30 23:59:59.000"
}
}},
{
"match": {
"ACCOUNT": "12345"
}
}
]
}
},
"size": 0,
"aggs": {
"ProjectStatusSummary": {
"terms": {
"field": "PROJECTID"
},
"aggs": {
"group": {
"top_hits": {
            "size": 1,
"_source": {
"includes": [
"DATETIME",
"PROJECTID",
"EVENTDESCRIPTION",
"PROJECTSTART"
]
},
"sort": {
"DATETIME": {
"order": "desc"
}
}
}
}
}
}
}
}
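For reference, this is roughly the client-side fold over that query's response that we are trying to replace with a pure aggregation. The bucket structure follows the standard `top_hits` response shape; the response below is a hand-trimmed example, not real output:

```python
from collections import Counter

# Trimmed example response for the query above (one top_hits hit per project bucket)
response = {
    "aggregations": {
        "ProjectStatusSummary": {
            "buckets": [
                {"key": "001", "group": {"hits": {"hits": [
                    {"_source": {"EVENTDESCRIPTION": "Completed"}}]}}},
                {"key": "002", "group": {"hits": {"hits": [
                    {"_source": {"EVENTDESCRIPTION": "Submitted"}}]}}},
                {"key": "003", "group": {"hits": {"hits": [
                    {"_source": {"EVENTDESCRIPTION": "New"}}]}}},
                {"key": "004", "group": {"hits": {"hits": [
                    {"_source": {"EVENTDESCRIPTION": "Submitted"}}]}}},
            ]
        }
    }
}

# Each bucket's single top hit is the project's latest event; count by status
buckets = response["aggregations"]["ProjectStatusSummary"]["buckets"]
counts = Counter(
    b["group"]["hits"]["hits"][0]["_source"]["EVENTDESCRIPTION"] for b in buckets
)
```

Note also that with thousands of projects, the default `terms` bucket size of 10 means this client-side approach needs a much larger `size` on the terms aggregation (or a `composite` aggregation with pagination) to see every project.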