Ancho is for modeling systems, including their uncertainties, so you can simulate the range of probable outcomes. But what would an Ancho model look like? How would you specify your model in code? What features do you need in the framework to make that job easier?
I took a stab at writing that article. I started from the use cases I've talked about so far. I listed out all the features they might require, identified the similarities I saw between them, and tried to cobble that together into an API.
Following that path taught me two things:
- Descriptions of the object-oriented design process make for truly awful prose.
- Nothing expresses an object-oriented design quite as well as program code.
So, I scrapped that draft. Instead, I'll list out the features I came up with, try to explain how they would work, and why I think they're needed. Further detail is going to have to wait until I publish code examples, because these ideas are still in flux.
The Use Cases
I do, however, want to briefly touch on the use cases I started from. They are, in no particular order:
- Risk Management. Modeling a company's future losses, as well as the implications of various financing and risk control strategies.
- Personal Finance. Modeling your own financial future, starting from presently known information and guesses about what is to come.
- Business Planning. Similar, but not identical, to personal finance modeling. Source data is probably richer, and uncertainties are definitely higher.
- Project Management. In some respects the one that's most "different" from the others, because we're modeling how long things will take more than how much they will cost.
There are other use cases that are perfectly valid (for instance, I talked to someone recently who is using Monte Carlo methods to price bond portfolios) but these will give me a lot to start with.
Terms and Structure
To quickly review, here are the terms that I came up with before in a post called You Know, For Kids! An execution of your Model is called a Run, which consists of multiple Sequences, each of which consists of (potentially) multiple Frames in a time series. Each Sequence has a State that persists between Frames. Each Sequence is independent of others, except for some initial parameters and shared setup code.
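To make that Run/Sequence/Frame structure concrete, here's a minimal sketch in plain Python. The function names and signatures here are my own placeholders for illustration, not Ancho's actual API:

```python
def run(setup_fn, frame_fn, n_sequences, n_frames):
    """Execute a Run: many independent Sequences, each a series of Frames."""
    results = []
    for _ in range(n_sequences):
        state = setup_fn()               # fresh State for each Sequence
        frames = []
        for t in range(n_frames):
            frames.append(frame_fn(t, state))  # each Frame emits Values
        results.append(frames)
    return results

# Toy model: a balance that grows by 1.0 each Frame.
def setup():
    return {"balance": 0.0}

def frame(t, state):
    state["balance"] += 1.0                   # State persists across Frames
    return {"balance": state["balance"]}      # the emitted Values for this Frame

runs = run(setup, frame, n_sequences=2, n_frames=3)
```

Note that each Sequence starts from its own fresh State, which is what makes Sequences independent of one another.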
That part has stayed the same, but my conception of how you'd write the models themselves has changed quite a bit over the last several weeks. Initially, I thought that models would be mostly unstructured bundles of variables and functions, but the more I thought about it, the less that seemed to make sense. I looked at having functions return increasingly complicated data structures, such as maps or multidimensional arrays, but that didn't make much sense either. There are some classes of behavior that would just be too complicated to implement that way, and you'd be giving up a lot of the inherent power of Python.
Instead, I now believe that most of your model will probably live inside things that look more like normal Python modules and class definitions. Inside those, your code will access Ancho's randomization system and other services by reaching out to them in a more explicit, declarative fashion. This way of doing things seems more Pythonic.
These are the main structural concepts I've hit upon (though I'm not settled on the terminology). All the other features build on these.
In each Sequence, there is one State object. If you're familiar with web frameworks, you could think of it as being like a Session object. (If not, don't sweat it.) Basically, you put something into it in your setup code, or in Frame 0, and it's there in Frames 1, 2, 3, and so on.
For each Frame, the pieces of data that Ancho actually records for further analysis are Values. Not every aspect of your model will be recorded for analysis; only these emitted Values.
OK. If I haven't lost you yet, these are the features I'm thinking we'll need.
Probabilistic Variables
That's a ten-dollar term, but it means specifically what I want to say: something that could be different in every frame, and is specified in your model as a probability distribution (or probability function) instead of a single value. You might specify such a variable with something like "this value will be normally distributed, with a mean of 100 and a standard deviation of 10." NumPy provides an easy way to specify variables like this and generate the randomized values.
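For example, that "mean of 100, standard deviation of 10" variable is a one-liner with NumPy's random generator:

```python
import numpy as np

rng = np.random.default_rng(seed=42)

# "Normally distributed, with a mean of 100 and a standard deviation of 10."
# One draw per Frame; here we draw 1000 at once, one per Sequence.
samples = rng.normal(loc=100.0, scale=10.0, size=1000)

# The sample mean will land close to 100 for a sample this large.
```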
Unfortunately, when you start to think about how to specify any of our use cases in those terms, it gets a lot more complicated. You need to model things like:
- Additional transformations beyond the probability function. If you're modeling future expenses for a business, you'll need to account for inflation, interest rates, and other time-value factors. The real effect of one randomized variable on your simulation will often depend on the effects of other randomized variables, each of which has their own peccadilloes.
- Fluctuations. Sometimes the value won't truly be random, but the difference between the value at time index n and time index n + 1 will be. We'll need to model things like this often enough that the framework should provide explicit support.
- Limits. Sometimes you'll encounter variables that simply cannot be outside a given range. Interest rates never go below zero; efficiency never goes above 100%. We'll need some way of specifying that in our model.
- Trends and time-dependent variables. Sometimes the randomness will be fluctuations around a trend; that trend could be specified by any linear, polynomial or other function. Sometimes the value will vary based on absolute time, or in some cases the value might vary based on time relative to some other event.
To make a long story short, pure probabilistic variables are almost never useful on their own. More often than not, we'll need to use them as inputs in some other function, or do some other kind of transformation on the randomized value before we use it.
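Here's a rough sketch of what combining those pieces might look like for an interest-rate variable: a trend, random frame-to-frame fluctuations around it, and a hard lower limit. The parameters are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(seed=7)
n_frames = 120

# Trend: a slow linear drift that the random fluctuations move around.
trend = 0.04 + 0.0001 * np.arange(n_frames)

# Fluctuations: the *step* from frame t to t+1 is random,
# not an independent draw in every frame.
steps = rng.normal(loc=0.0, scale=0.002, size=n_frames)
rate = trend + np.cumsum(steps)

# Limits: interest rates never go below zero.
rate = np.clip(rate, 0.0, None)
```

The raw normal draw is just the input; the value the model actually uses comes out of the transformation pipeline.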
Scenario Variables
Not all of the unknown elements in our model are things that we can assign a probability to. Sometimes we'll be using our models to try to decide between alternatives, so they're not "independent" variables anymore. Then again, sometimes we will just have no earthly idea what the probabilities might be.
Modeling variables like this as if they were independent variables, or assigning probabilities to them when in fact those probabilities are unknown, would generate erroneous results when the models are run. (When you are accounting for probability, you have to actually account for it, everywhere.)
Therefore, sometimes we're going to need to explicitly declare these variables as "scenario variables" so they can be handled correctly by the framework.
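One way this might work (the structure here is my own speculation, not a settled design): scenario variables get enumerated and swept exhaustively, while probabilistic variables get sampled within each scenario, so alternatives can be compared instead of being averaged away:

```python
from itertools import product

# Hypothetical scenario variables: each is an enumerated set of alternatives,
# not a probability distribution.
scenarios = {
    "financing": ["self-funded", "bank-loan"],
    "expansion": ["now", "wait-two-years"],
}

# Build one scenario combination per Run; every combination would then get
# its own full set of randomized Sequences.
plan = [dict(zip(scenarios, combo)) for combo in product(*scenarios.values())]
```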
Source Data
Some use cases will need to access source data. I can think of at least two broad categories of such use:
- Using source data to identify probability distributions. Eventually I would like this to be a somewhat automatic feature of Ancho: looking at past data, asking you some questions about it, and identifying (or at least suggesting) the trend, distribution type, parameters, etc.
- Using historical data for back-testing. This is not something that would be directly useful in the use cases identified above, but imagine modeling something like an investment strategy. In that case, you'll want to test your model against real-world data as well as randomized model data. Ancho could select a historical start time at random, and transpose a "window" of historical data for evaluating your strategy. Perhaps your strategy that works perfectly starting in 1982 would fail miserably if you had done the same thing starting in 1968.
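The windowing idea above can be sketched in a few lines. The "history" here is fabricated random data standing in for a real series, and the API is a placeholder:

```python
import numpy as np

rng = np.random.default_rng(seed=3)

# Stand-in for a real historical series: 480 "months" of fake returns.
history = rng.normal(loc=0.006, scale=0.04, size=480)

window = 120  # evaluate the strategy over a 10-year window

# Pick a historical start time at random, and transpose that window
# of real-world data into the model in place of randomized data.
start = rng.integers(0, len(history) - window)
sample = history[start : start + window]
```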
I haven't really considered issues of file formats, database access, etc. in this regard. My inclination is to support any kind of data that is supported by NumPy, or any kind of data source that Python can connect to and read. These features are probably pretty far down the roadmap, anyway.
Complex State Objects
In a time-series model, we're almost always going to have some "state" of the present that needs to be passed along in the Sequence from one Frame to the next. Often these are not simply going to be key-value pairs; they're going to be big, complicated data structures with behavior that needs to be enforced. These are not so much parts of your simulation, per se, but ways to keep track of the results of the simulated system.
For instance, in the risk management, business plan or personal finance use cases, you might need a fairly full accounting package as a state object -- a general ledger, chart of accounts, reporting, the whole works. If you're modeling more than one legal entity, you might need more than one instance of such an accounting object.
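As a very stripped-down sketch of what such an accounting state object might look like (the `Ledger` class and its methods are illustrative placeholders, nothing like a full accounting package):

```python
from collections import defaultdict

class Ledger:
    """Minimal double-entry ledger, carried in State from Frame to Frame."""

    def __init__(self):
        self.balances = defaultdict(float)

    def post(self, debit, credit, amount):
        """Record a transaction: debit one account, credit another."""
        self.balances[debit] += amount
        self.balances[credit] -= amount

    def balance(self, account):
        return self.balances[account]

# One instance per legal entity being modeled.
ledger = Ledger()
ledger.post("cash", "revenue", 1000.0)
ledger.post("rent", "cash", 250.0)
```

The point is that this object has behavior to enforce (here, that every posting balances), not just key-value pairs to store.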
Agents
I'm not sure if "agent" is really the right term, but sometimes what you're modeling is not so much the randomized behavior of a system, but the decisions you will make in response to that behavior. You want to see if a certain set of decision rules will work in the various scenarios that may arise. The concept of a software agent comes fairly close to what I'm thinking. I may also have inadvertently stolen some of these ideas from Marvin Minsky's Society of Mind.
Your agents would get called in each Frame. They would be able to read the current State and the system's history, and would have the opportunity to generate actions that affect the State -- including spawning new Agents or telling other Agents what to do. For instance, you might have an Agent in your business plan model that monitors some accounting ratio. When it exceeds a given number, it executes a trade, or makes a decision to expand and buy a second building, or it tells your Business Plan agent "move to Phase 2."
In your model, these could be implemented as a single monolithic object, or as a list of more "flyweight" objects -- it would be up to you. The smaller the pieces can be, the more reusable they will probably be in other contexts.
These decision-making rules could be arbitrarily complex. It could be a simple If-Then statement in Python, or it could be a neural network module -- it's up to you as the modeler.
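A "flyweight" agent might be no more than a class with one method that gets called each Frame. All the names here are illustrative, not Ancho's API:

```python
class RatioWatcher:
    """Watches an accounting ratio in State and acts when it crosses a threshold."""

    def __init__(self, threshold):
        self.threshold = threshold

    def act(self, frame, state):
        ratio = state["cash"] / max(state["expenses"], 1e-9)
        if ratio > self.threshold:
            state["phase"] = 2  # e.g. tell the business plan: "move to Phase 2"

# Called once per Frame by the framework; here we call it by hand.
state = {"cash": 500.0, "expenses": 100.0, "phase": 1}
agent = RatioWatcher(threshold=4.0)
agent.act(0, state)
```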
Reusable Libraries
This is a critical feature for Ancho. Without libraries of reusable functions, you'd be reinventing the wheel every time you wanted to build a model, and Ancho would be useless to its intended audience. But imagine, for instance, that there was a reusable library that implemented my "accounting" complex state object as described above, and another one that implemented several functions for modeling financial variables such as interest rates, consumer price indexes, educational and medical inflation, etc. If all those pieces were already done for me, I could build very rich models of my own without doing a lot of my own programming work.
As Jeff Jarvis might say, "Do what you do best, and link to the rest."
Value Type Support
Finally, I believe it's going to be necessary to include some type support in the values that we're collecting from each Sequence. Time, Currency, Integers, choices from some enumerated set… not everything can be modeled with floating point values.
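Python's standard library already covers most of these types, so the emitted Values could carry explicit types rather than bare floats. This is only a sketch of the idea; Ancho's actual type declarations could look quite different:

```python
from decimal import Decimal
from enum import Enum

class Phase(Enum):
    """A choice from an enumerated set -- not meaningful as a float."""
    STARTUP = 1
    GROWTH = 2

# A Frame's emitted Values, with types that survive into analysis.
values = {
    "month": 37,                 # integer time index
    "cash": Decimal("1043.25"),  # currency: exact decimal arithmetic
    "phase": Phase.GROWTH,       # enumerated choice
}
```

`Decimal` matters for the currency case in particular: summing thousands of float-valued postings across a long Run accumulates rounding error that exact decimal arithmetic avoids.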