It can be challenging to satisfy users’ running time expectations when designing report processes that require accessing data from a remote database. In certain situations, even a latency of a few milliseconds can lead to running times that leave users frustrated.
This challenge is particularly relevant for recursive batch reports. By recursive reports, I mean that data is retrieved by a recursive function. By batch reports, I mean that the report covers a large number of inputs.
Recursive Reporting
For many reporting requirements, it is possible to produce the report output by scanning over one or more data sets once or at least some relatively small number of times. Much financial reporting, for example, involves summing part of a data set, possibly across different groups:
select
customer_id,sum(sale_amount) total_sales
from customer_sales
group by customer_id
Of course, a query like this can be executed directly by any typical RDBMS. But queries like these can also be implemented efficiently on a remote server. In such a scenario, the reporting process can simply iterate over all rows from customer_sales
and maintain a sum for each customer_id
it encounters, reporting out the final sums after iteration is complete. As long as data is retrieved from the remote database in reasonably-sized batches (to minimize database roundtrip latency) and there is sufficient memory to hold the intermediate sums, this query should perform about as well as it would if executed directly on the database server.
However, there are some reporting requirements that follow unpredictable data access patterns. One class of such reports is recursive reports. In a recursive report, data access is controlled by a recursive function. Where the customer sales query above required a simple iteration over all of the rows in the table, a recursive report cannot predict ahead of time in which order rows will need to be accessed. Instead, a recursive function takes some information based on the last row that was retrieved and returns the next row to be retrieved based on that.
A concrete example should help. Suppose we need to produce a report that simply outputs the nodes of a bill of materials (BOM) hierarchy as a tree. Sample output might look like:
- Part A
- Part B
- Part C
- Part C
- Part D
- Part C
- Part B
Without any information about the part we are producing the tree for, do you know anything about which parts you will need information about, or in what order?
If the only piece of information we know about the data set in general is that the hierarchies are trees (and not graphs), then we know that we will not need to access information about a particular part more than once going down a single branch of the tree, but we do not know which parts we will encounter or in what order. Nor do we know if we will need to access information about a particular part more than once in a single hierarchy.
There are a few possible approaches to minimize the effects of data access latency in a case like this. We will cover those a future post. But first, we look at the second half of the problem.
Recursive Batch Reporting
When producing recursive reports for a single input, the effects of data access latency may be relatively small. In a small hierarchy like the one above, with only six nodes, our running time will be small, even if we naively access data about each part every time we encounter it. Even if the roundtrip time between the database server and the reporting server is 20 ms, we only spend 120 ms on data access.
Where the latency becomes a problem is when we try to produce a report that covers thousands of hierarchies. If the hierarchies average about six nodes each, we spend 120 × 1000 = 120000 milliseconds, or two minutes, just on data access to produce a report for 1000 hierarchies.
For some requirements, this is just too long to run what is a relatively small report.
In a future post, we will cover the possible approaches in dealing with these challenges to designing performant recursive batch reports.