Question

For example, our application tracks animal movements and prices for a farm. To get the current stock count, the simplest solution is to take a starting number and then add up all the movements in and out until we reach the current figure. But this is memory-intensive and gets slower and slower as the number of movements grows year after year.
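
A minimal sketch of that naive approach, assuming a hypothetical movements(animal_type, moved_at, quantity) table where quantity is positive for movements in and negative for movements out:

```python
# Naive running balance: load every movement ever recorded and sum it in the
# application. This is exactly what becomes slower and more memory-hungry as
# the history grows. Schema and table names are assumptions for illustration.
import sqlite3

conn = sqlite3.connect("farm.db")

def current_stock(animal_type: str, opening_balance: int) -> int:
    rows = conn.execute(
        "SELECT quantity FROM movements WHERE animal_type = ?",
        (animal_type,),
    ).fetchall()
    return opening_balance + sum(q for (q,) in rows)
```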

We don't have the luxury of "freezing" a year so that it can no longer accept changes; the system must be able to handle changes to movements at any point in time and then show the updated numbers in real time.

This is not just stock numbers; we have to track a large number of variables like this and write reports for each period (day, week, month, year) that include summary calculations based on these variables.

What is the most common, preferred, "best", fastest, elegant way to handle data streams that cross multiple years for calculation and reporting purposes? How would the database design and the architecture relate in this scenario (i.e. would using an ORM be fine as long as the database schema was well designed?). The critical requirements here are optimal performance and real-time availability.

I have seen that in large-scale systems this kind of work is split up into time slices, e.g. week, month, and year aggregate tables. I am particularly interested in whether there is a common design pattern for solving this problem.

Solution

I would aggregate in the DB, as that's usually something databases are very good at.

Have a look at OLAP (vs OLTP) database design.
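
As a hedged illustration of "aggregate in the DB", here is a sketch that pushes the summarising into the engine, reusing the hypothetical movements table from the question; only one row per animal type and month comes back to the application:

```python
# The GROUP BY runs inside the database engine, so the application never
# touches the individual movement rows.
import sqlite3

conn = sqlite3.connect("farm.db")

monthly_totals = conn.execute(
    """
    SELECT animal_type,
           strftime('%Y-%m', moved_at) AS month,
           SUM(quantity)               AS net_movement
    FROM movements
    GROUP BY animal_type, strftime('%Y-%m', moved_at)
    """
).fetchall()
```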

Other tips

I'd go with an SQL database (PostgreSQL). RDBMSs are quite fast with these things :)

Pulling all the history as ORM objects and then summing it in the application might not work in the long run. You'll have to go with SQL queries that do most of the statistics work inside the RDBMS. You can of course still use the ORM for displaying and editing objects, though.
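
A sketch of that split, assuming SQLAlchemy as the ORM layer (the answer only says "ORM") and the same hypothetical movements table; the report query bypasses the object layer entirely:

```python
# Reports go straight to PostgreSQL as SQL; ORM-mapped classes remain for
# displaying and editing individual movements.
from sqlalchemy import create_engine, text

engine = create_engine("postgresql+psycopg2://user:pass@localhost/farm")

def yearly_net_movement(animal_type: str):
    # Let the database do the statistics work and return one row per year.
    with engine.connect() as conn:
        return conn.execute(
            text("""
                SELECT date_trunc('year', moved_at) AS year,
                       SUM(quantity)                AS net_movement
                FROM movements
                WHERE animal_type = :animal_type
                GROUP BY date_trunc('year', moved_at)
                ORDER BY year
            """),
            {"animal_type": animal_type},
        ).fetchall()
```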

I think this solution should be quite safe with the expected amount of data, and an RDBMS can be made to scale with proper indexing and more memory.

You can also generate crazy amounts of random data and test scalability beforehand.

There is probably only one general approach: split the work.

You can split it in time: compute the aggregates periodically during a low-load window and store them in separate summary tables. For some aggregation functions (SUM, COUNT, MIN, MAX) you can even compute the long-period aggregates from the short-period ones without any loss of precision, as the sketch below shows.
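
A sketch of that time split, assuming hypothetical daily_totals and monthly_totals summary tables alongside the raw movements table; a scheduled job (cron or similar) would run this during the low-load window:

```python
import sqlite3

conn = sqlite3.connect("farm.db")

with conn:  # one transaction, so readers never see half-built aggregates
    # Rebuild the daily aggregate from the raw movements.
    conn.execute("DELETE FROM daily_totals")
    conn.execute(
        """
        INSERT INTO daily_totals (animal_type, day, net_movement)
        SELECT animal_type, date(moved_at), SUM(quantity)
        FROM movements
        GROUP BY animal_type, date(moved_at)
        """
    )
    # SUM rolls up exactly, so the monthly table can be built from the daily
    # one instead of rescanning the raw history.
    conn.execute("DELETE FROM monthly_totals")
    conn.execute(
        """
        INSERT INTO monthly_totals (animal_type, month, net_movement)
        SELECT animal_type, strftime('%Y-%m', day), SUM(net_movement)
        FROM daily_totals
        GROUP BY animal_type, strftime('%Y-%m', day)
        """
    )
```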

You can also split it in space: there are solutions that combine a distributed database with a map-reduce engine; look at Apache Pig, for example. This approach would require a lot of learning and unlearning, but you should get better scalability.
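
A toy sketch of the map-reduce idea behind that approach: each partition produces partial sums independently, and a reduce step combines them. Engines such as Apache Pig do this for you at scale; this plain-Python version only illustrates the concept:

```python
from collections import Counter

def map_partition(movements):
    # movements: iterable of (animal_type, quantity) local to one partition.
    partial = Counter()
    for animal_type, quantity in movements:
        partial[animal_type] += quantity
    return partial

def reduce_partials(partials):
    # Combine the per-partition results; Counter.update adds counts together.
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total

# Two partitions processed independently, then combined.
shard_a = [("sheep", 5), ("cows", -2)]
shard_b = [("sheep", -1), ("cows", 4)]
print(reduce_partials([map_partition(shard_a), map_partition(shard_b)]))
# Counter({'sheep': 4, 'cows': 2})
```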

The first thing you should know is your read:write ratio and the kinds of queries you'd like to run.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow