Empirical
chrisaycock

Empirical is a language for time-series analysis. It has built-in Dataframes (tables) and integrated queries. It's fully interactive and already works on Repl.it.

Empirical version 0.6.5
Copyright (C) 2019--2020 Empirical Software Solutions, LLC
Type '?' for help. Type '\help' for magic commands.

>>> let trades = load("trades.csv")

>>> trades
 symbol                  timestamp    price size
   AAPL 2019-05-01 09:30:00.578802 210.5200  780
   AAPL 2019-05-01 09:30:00.580485 210.8100  390
    BAC 2019-05-01 09:30:00.629205  30.2500  510
    CVX 2019-05-01 09:30:00.944122 117.8000 5860
   AAPL 2019-05-01 09:30:01.002405 211.1300  320
   AAPL 2019-05-01 09:30:01.066917 211.1186  310
   AAPL 2019-05-01 09:30:01.118968 211.0000  730
    BAC 2019-05-01 09:30:01.186416  30.2450  380
    CVX 2019-05-01 09:30:01.639577 118.2550 2880
    ...                        ...      ...  ...

Empirical is a normal language with variables, types, functions, etc. Dataframes are just values, but there is plenty of syntactic sugar for using them. For example, here is a simple aggregation on stock trades:

>>> from trades select volume = sum(size) by symbol
 symbol volume
   AAPL 135760
    BAC 223590
    CVX 507580

I can run any expression, including user-defined functions. This computes the weighted average (wavg) given a set of weights (ws) and values (vs):

>>> func wavg(ws, vs) = sum(ws * vs) / sum(ws)

Now I can compute the volume-weighted average price (VWAP), a common metric in finance. I'm going to do this for every five minutes (5m).

>>> from trades select vwap = wavg(size, price) by symbol, bar(timestamp, 5m)
 symbol           timestamp       vwap
   AAPL 2019-05-01 09:30:00 210.305724
    BAC 2019-05-01 09:30:00  30.483875
    CVX 2019-05-01 09:30:00 119.427733
   AAPL 2019-05-01 09:35:00 202.972440
    BAC 2019-05-01 09:35:00  30.848397
    CVX 2019-05-01 09:35:00 119.431601
   AAPL 2019-05-01 09:40:00 204.671388
    BAC 2019-05-01 09:40:00  30.217362
    CVX 2019-05-01 09:40:00 117.224763
    ...                 ...        ...
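For readers who don't know Empirical, here is a rough Python sketch of what the VWAP query computes. It is only an illustration, not how Empirical implements the query. The `bar` and `vwap_by_bar` helpers are hypothetical names, and the assumption that bars are aligned to midnight is mine.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def wavg(ws, vs):
    """Weighted average of values vs with weights ws (mirrors the post's wavg)."""
    return sum(w * v for w, v in zip(ws, vs)) / sum(ws)


def bar(ts, width):
    """Round a timestamp down to the start of its bar.

    Hypothetical helper: assumes bars are aligned to midnight of the same day.
    """
    midnight = ts.replace(hour=0, minute=0, second=0, microsecond=0)
    return midnight + ((ts - midnight) // width) * width


def vwap_by_bar(trades, width=timedelta(minutes=5)):
    """Group (symbol, timestamp, price, size) rows by symbol and bar, then apply wavg."""
    groups = defaultdict(lambda: ([], []))
    for symbol, ts, price, size in trades:
        sizes, prices = groups[(symbol, bar(ts, width))]
        sizes.append(size)
        prices.append(price)
    return {key: wavg(sizes, prices) for key, (sizes, prices) in groups.items()}


# The first three trades from the table at the top of the post.
sample = [
    ("AAPL", datetime(2019, 5, 1, 9, 30, 0, 578802), 210.52, 780),
    ("AAPL", datetime(2019, 5, 1, 9, 30, 0, 580485), 210.81, 390),
    ("BAC", datetime(2019, 5, 1, 9, 30, 0, 629205), 30.25, 510),
]
for (symbol, start), vwap in vwap_by_bar(sample).items():
    print(symbol, start, round(vwap, 6))
```

The grouping key is the pair (symbol, bar start), which plays the role of the by clause in the query.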

I can also perform joins across time series. Let's take a look at some quotes (bid and ask). Notice how these timestamps don't line up with the trades; a stock exchange will change its quotes constantly before a trade occurs.

>>> let quotes = load("quotes.csv")

>>> quotes
 symbol                  timestamp     bid    ask
   AAPL 2019-05-01 09:30:00.410166 210.450 211.02
   AAPL 2019-05-01 09:30:00.491133 210.800 211.15
    CVX 2019-05-01 09:30:00.544263 117.760 118.34
    BAC 2019-05-01 09:30:00.585684  30.240  30.27
    CVX 2019-05-01 09:30:01.096591 118.220 118.54
   AAPL 2019-05-01 09:30:01.131702 210.925 211.20
   AAPL 2019-05-01 09:30:01.185615 210.980 211.21
   AAPL 2019-05-01 09:30:01.349968 210.730 211.34
   AAPL 2019-05-01 09:30:01.404082 211.150 211.40
    ...                        ...     ...    ...

Empirical can line up timestamps automatically. Here I ask for the latest quote for each trade:

>>> join trades, quotes on symbol asof timestamp
 symbol                  timestamp    price size    bid    ask
   AAPL 2019-05-01 09:30:00.578802 210.5200  780 210.80 211.15
   AAPL 2019-05-01 09:30:00.580485 210.8100  390 210.80 211.15
    BAC 2019-05-01 09:30:00.629205  30.2500  510  30.24  30.27
    CVX 2019-05-01 09:30:00.944122 117.8000 5860 117.76 118.34
   AAPL 2019-05-01 09:30:01.002405 211.1300  320 210.80 211.15
   AAPL 2019-05-01 09:30:01.066917 211.1186  310 210.80 211.15
   AAPL 2019-05-01 09:30:01.118968 211.0000  730 210.80 211.15
    BAC 2019-05-01 09:30:01.186416  30.2450  380  30.24  30.27
    CVX 2019-05-01 09:30:01.639577 118.2550 2880 118.26 118.37
    ...                        ...      ...  ...    ...    ...
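The idea behind an "as of" join can be sketched in plain Python with a binary search per row. This is an illustration, not Empirical's implementation. The `asof_join` function is a hypothetical name, and I am assuming that a quote with exactly the same timestamp as the trade counts as a match and that the right-hand rows are already sorted by time.

```python
from bisect import bisect_right
from collections import defaultdict


def asof_join(left, right, key, on):
    """Attach to each left row the latest right row with the same key and on <= left's on.

    Rows are dicts. Right rows must already be sorted by `on`.
    Unmatched rows get None (standing in for null) in the right-hand columns.
    """
    right_cols = [c for c in (right[0] if right else {}) if c not in (key, on)]
    by_key = defaultdict(list)
    for row in right:
        by_key[row[key]].append(row)
    times = {k: [r[on] for r in rows] for k, rows in by_key.items()}

    joined = []
    for row in left:
        k = row[key]
        # Number of right rows at or before this row's time (assumed inclusive).
        i = bisect_right(times.get(k, []), row[on])
        match = by_key[k][i - 1] if i > 0 else {}
        joined.append({**row, **{c: match.get(c) for c in right_cols}})
    return joined


# The first trade and the first two AAPL quotes from the tables in the post.
trades = [{"symbol": "AAPL", "timestamp": "2019-05-01 09:30:00.578802",
           "price": 210.52, "size": 780}]
quotes = [
    {"symbol": "AAPL", "timestamp": "2019-05-01 09:30:00.410166", "bid": 210.45, "ask": 211.02},
    {"symbol": "AAPL", "timestamp": "2019-05-01 09:30:00.491133", "bid": 210.80, "ask": 211.15},
]
print(asof_join(trades, quotes, key="symbol", on="timestamp"))
```

Timestamps are kept as fixed-width strings here because strings in this format sort in time order; real code would more likely parse them.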

The time-series joins can go in different directions. Here are some made-up events that occurred throughout the trading day:

>>> let events = load("events.csv")

>>> events
 symbol           timestamp code
    CVX 2019-05-01 09:30:03   a1
    BAC 2019-05-01 09:30:04   e3
   AAPL 2019-05-01 09:30:06   f7
    CVX 2019-05-01 09:30:07   h9

I want to know the closest event for each trade. I'm also going to limit the search to within three seconds (3s).

>>> join trades, events on symbol asof timestamp nearest within 3s
 symbol                  timestamp    price size code
   AAPL 2019-05-01 09:30:00.578802 210.5200  780
   AAPL 2019-05-01 09:30:00.580485 210.8100  390
    BAC 2019-05-01 09:30:00.629205  30.2500  510
    CVX 2019-05-01 09:30:00.944122 117.8000 5860   a1
   AAPL 2019-05-01 09:30:01.002405 211.1300  320
   AAPL 2019-05-01 09:30:01.066917 211.1186  310
   AAPL 2019-05-01 09:30:01.118968 211.0000  730
    BAC 2019-05-01 09:30:01.186416  30.2450  380   e3
    CVX 2019-05-01 09:30:01.639577 118.2550 2880   a1
    ...                        ...      ...  ...  ...
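A nearest-match lookup with a tolerance can be sketched in Python as follows. Again, this is only an illustration of the idea, not Empirical's algorithm. The `nearest_within` function is a hypothetical name, and treating the tolerance as inclusive and breaking ties toward the earlier event are my assumptions.

```python
from bisect import bisect_left
from datetime import datetime, timedelta


def nearest_within(sorted_times, t, tolerance):
    """Return the index of the time closest to t, or None if none is within tolerance."""
    i = bisect_left(sorted_times, t)
    best = None
    for j in (i - 1, i):  # the neighbours on either side of t
        if 0 <= j < len(sorted_times):
            if best is None or abs(sorted_times[j] - t) < abs(sorted_times[best] - t):
                best = j
    if best is not None and abs(sorted_times[best] - t) <= tolerance:
        return best
    return None


# CVX events and the first CVX trade from the tables in the post.
cvx_events = [datetime(2019, 5, 1, 9, 30, 3), datetime(2019, 5, 1, 9, 30, 7)]
cvx_codes = ["a1", "h9"]
trade_time = datetime(2019, 5, 1, 9, 30, 0, 944122)

hit = nearest_within(cvx_events, trade_time, timedelta(seconds=3))
print(cvx_codes[hit] if hit is not None else None)  # prints a1
```

With the AAPL event at 09:30:06, the first AAPL trade is more than three seconds away, so the lookup returns None, which corresponds to the blank cells in the output above.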

The result is a lot of blanks (null values) for items that do not line up. Empirical handles missing data automatically, so the nulls will cascade forward through any later computation on these results.

Static typing

What makes Empirical unique is that it is statically typed. The compiler knows whether user code is valid before running it.

>>> 1 + 2
3

>>> 'a' + 'b'
Error: unable to match overloaded function +
  candidate: (Int64, Int64) -> Int64
    argument type at position 0 does not match: Char vs Int64
  candidate: (Float64, Float64) -> Float64
    argument type at position 0 does not match: Char vs Float64
  candidate: (Int64, Float64) -> Float64
    argument type at position 0 does not match: Char vs Int64
  ...
  <53 others>

This is extremely useful for catching typos. Suppose I want to sort my quotes by the bid-ask spread:

>>> sort quotes by (asks - bid) / bid
Error: symbol asks was not found

Here it caught the misspelled asks. I can correct it to ask:

>>> sort quotes by (ask - bid) / bid
 symbol                  timestamp      bid      ask
    BAC 2019-05-01 09:32:46.313487  30.5650  30.5650
    BAC 2019-05-01 09:32:53.738446  30.6124  30.6124
    BAC 2019-05-01 09:39:24.459415  31.0600  31.0600
   AAPL 2019-05-01 09:45:51.931597 206.9400 206.9500
   AAPL 2019-05-01 09:43:59.903292 206.3200 206.3300
    BAC 2019-05-01 09:32:50.369746  30.6400  30.6417
    CVX 2019-05-01 09:32:57.242072 119.7732 119.7800
   AAPL 2019-05-01 09:38:18.980026 205.1100 205.1222
   AAPL 2019-05-01 09:38:19.978890 205.1100 205.1251
    ...                        ...      ...      ...

All of this error checking is performed before running the user's code. This is beneficial for writing large scripts.

I used to run overnight simulations back during my quantitative finance days, only to find out in the morning that the program had crashed after a few hours because of a typo later in the script. I made Empirical specifically to prevent this from ever happening again.

How it works

The real "magic" here is that Empirical can infer a Dataframe's type from an external source at compile time. Let's look back at how the table is loaded:

let trades = load("trades.csv")

The path to the file is known at compile time. Empirical samples the file to figure out what's in it. In fact, any value that can be solved at compile time is acceptable.

let filename = "trades.csv"
let trades = load("./" + filename)

The load() function is actually a macro that invokes a templated function on a templated type:

csv_load{CsvProvider{"trades.csv"}}("trades.csv")

The CsvProvider invokes an internal function that determines the types:

>>> _csv_infer("trades.csv")
"{symbol: String, timestamp: Timestamp, price: Float64, size: Int64}"

And it is this inferred type that is automatically compiled into the user's code. The entire process maintains static typing even though the user didn't explicitly list the types!
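To give a feel for what sampling a file to infer types might involve, here is a small Python sketch. It is not Empirical's `_csv_infer`; the `infer_type` and `csv_infer` names, the sample size, and the order in which types are tried are all assumptions made for illustration.

```python
import csv
import io
from datetime import datetime


def infer_type(values):
    """Pick the first type, from narrowest to widest, that parses every sampled value."""
    def all_parse(parse):
        try:
            for v in values:
                parse(v)
            return True
        except ValueError:
            return False

    if not values:
        return "String"  # nothing to sample, so fall back to the widest type
    if all_parse(int):
        return "Int64"
    if all_parse(float):
        return "Float64"
    if all_parse(datetime.fromisoformat):
        return "Timestamp"
    return "String"


def csv_infer(text, sample_rows=100):
    """Infer a {column: type} schema from the header and first rows of CSV text."""
    reader = csv.DictReader(io.StringIO(text))
    sample = [row for _, row in zip(range(sample_rows), reader)]
    return {name: infer_type([row[name] for row in sample])
            for name in (reader.fieldnames or [])}


text = """symbol,timestamp,price,size
AAPL,2019-05-01 09:30:00.578802,210.5200,780
BAC,2019-05-01 09:30:00.629205,30.2500,510
"""
print(csv_infer(text))
```

The important difference is when this runs: in a dynamic language the inference happens while the program executes, whereas Empirical runs it during compilation so the resulting type can be checked against the rest of the script.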

So what happens when the file path cannot be determined at compile time? This occurs when using an external variable, like argv when running a script from the command line. (As with many programming languages, argv is the list of the user's command-line arguments.)

For example, running this in a script:

let trades = load(argv[1])

gives the following error:

Error: macro parameter filename requires a comptime literal
Error: unable to determine type for trades

So now we must specify the type and invoke the templated function directly.

data Trade: symbol: String, timestamp: Timestamp, price: Float64, size: Int64 end
let trades = csv_load{Trade}(argv[1])

I can run my overnight simulation with the confidence that my script doesn't have common typos.

Under the hood

Fuller details can be found in a pair of blog posts (1, 2), but broadly speaking Empirical compiles to a virtual machine, which dispatches to a runtime.

Launching Empirical on the command line with --dump-vvm will show what the Vector Virtual Machine (VVM) is doing. VVM has its own assembly language that you can code in (it's how I do regression tests).

Empirical's

load("trades.csv")

becomes VVM's

$0 = {"symbol": Sv, "timestamp": Tv, "price": f64v, "size": i64v}
@1 = "trades.csv"
load @1 $0 %0

This may look funky, but it simply defines the type ($0), sets a global register for the string that represents the filename (@1), and then invokes the load opcode. The result is saved to the local/temporary register %0.

The VWAP example

func wavg(ws, vs) = sum(ws * vs) / sum(ws)
from trades select vwap = wavg(size, price) by symbol, bar(timestamp, 5m)

becomes a ton of VVM:

$1 = {"symbol": Sv, "timestamp": Tv}
$2 = {"symbol": Sv, "timestamp": Tv, "vwap": f64v}
@3 = def wavg([Int64], [Float64])("ws": i64v, "vs": f64v) f64s:
  mul_i64v_f64v %0 %1 %2
  sum_f64v %2 %3
  sum_i64v %0 %4
  div_f64s_i64s %3 %4 %5
  ret %5
end
alloc $1 %1
member @1 0 %2
member %1 0 %3
assign %2 Sv %3
member @1 1 %4
unit_m_i64s 5 %5
bar_Tv_Ds %4 %5 %6
member %1 1 %7
assign %6 Tv %7
group $0 @1 $1 %1 $2 %8 %9 %10
assign 0 i64s %11
lt_i64s_i64s %11 %10 %12
bfalse %12 86
member %9 %11 %13
member %13 3 %14
member %13 2 %15
call @3 3 %14 %15 %16
member %8 2 %17
append %16 f64s %17
add_i64s_i64s %11 1 %11
br 47
repr %8 $2 %18
save %18

This monster groups the Dataframe according to the user's criteria (the by clause) and then repeatedly invokes the wavg function on the individual groups.

VVM executes everything in a runtime. Each procedure is vector-aware by default, meaning that the entire column of the Dataframe is handled by one function call in the runtime. That is, the virtual machine performs just one dispatch for each column routine, which amortizes the cost of using a VM.
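The amortization argument can be illustrated with a toy interpreter in Python. This is not VVM. The opcode names are simplified stand-ins and the register layout is hypothetical, loosely following the wavg body in the listing above. The point is only that the number of dispatches depends on the program, not on the length of the columns.

```python
import operator

# One runtime routine per opcode; each call handles a whole column (hypothetical names).
OPCODES = {
    "mul_v": lambda a, b: [x * y for x, y in zip(a, b)],
    "sum_v": lambda a: sum(a),
    "div_s": operator.truediv,
}


def run(program, registers):
    """Execute (opcode, input registers..., output register) tuples.

    Returns the number of dispatches the interpreter performed.
    """
    dispatches = 0
    for opcode, *inputs, output in program:
        registers[output] = OPCODES[opcode](*(registers[r] for r in inputs))
        dispatches += 1
    return dispatches


# Body of wavg(ws, vs) = sum(ws * vs) / sum(ws) as four column-level instructions.
WAVG = [
    ("mul_v", 0, 1, 2),
    ("sum_v", 2, 3),
    ("sum_v", 0, 4),
    ("div_s", 3, 4, 5),
]

registers = {0: [780, 390], 1: [210.52, 210.81]}  # ws (sizes) and vs (prices)
count = run(WAVG, registers)
print(count, registers[5])
```

Running the same program on columns with a million rows still performs four dispatches; all of the per-row work happens inside the routines, which in Empirical's case are C++.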

Empirical's entire stack is written in C++. Because of the amortized cost, Empirical is about as fast as hand-written C++ if the Dataframes are large enough.

Where to get it

You can run Empirical on Repl.it right now, or download it for your own machine. Be sure to read the tutorial.

The source code is available under the AGPL with the Commons Clause.


Info for the Jam

This submission is on behalf of the Empirical Software team. I am the creator of Empirical and Andrew is my beta tester.

The timing of the Jam straddled my previous sprint (metaprogramming (1, 2)) and my next sprint (streaming computation).

The change log and more granular commit history should serve as proof of work for the Jam period.

Voters
AnhTuong
zclarke
matthewproskils
Vandesm14
dillonjoshua68
rediar
XroixHD
ZippeyKeys12
AndrewCarr2
mkhoi
Comments
HahaYes

DANG beats Cookeylang any day
maybe our lang :(

JDOG787
TheDrone7

Moving to share since this is just an existing project being ported to repl.it. Even though it has been worked on in the jam duration, it doesn't satisfy the requirements as specified in the blog post.

chrisaycock

@TheDrone7 I went by what was in the blog's FAQ:

Can I remix or improve on an existing language?

Yes, as long as you're adding original ideas and putting an effort to meaningfully change or improve the language.

TheDrone7

@chrisaycock we know that but the changes made within the jam duration didn't seem to impact the language enough.

chrisaycock

@TheDrone7 As I stated in the linked change log and commit history, my work during the Jam period added metaprogramming components that allowed for some of the features highlighted in my post here. For example:

  • extensions to generic functions allowed for wavg() to omit explicit types
  • inlining and macros allowed for the seamless load() (it used to be a hardcoded function up until a couple weeks ago)
  • global variables didn't even exist

It's your call to disqualify, but I find it very strange when you claim I didn't make changes to the language over the last three weeks.

TheDrone7

@chrisaycock I never said you didn't make changes, but the majority of the submissions are entirely new languages built within the duration, so we expect you to make changes that impact the language that much. Compared to entirely new languages being built, your changes weren't enough.

[deleted]

@TheDrone7 I'm not a participant or anything, but as an outsider of this contest, I must say I find it weird how these submissions are disqualified. This project has literally been worked on every day during the contest period.

TheDrone7

@alfredbirk these submissions are great projects that did require a lot of effort, yes. They are only being disqualified for not being what we were looking for, we set up some requirements and these don't meet them is all. Being disqualified from the jam doesn't mean you didn't make something good, it just means you worked on something we weren't aiming for in the jam.

[deleted]

@TheDrone7 Hmm, I would say those changes are an effort to meaningfully improve the language, but your call..

[deleted]

@TheDrone7 Possibly not very impactful changes, but definitely an effort to meaningfully improve the language. But I guess it's okay to change the rules?

TheDrone7

@alfredbirk I cannot change the rules. The disqualification is due to lack of impact but it will still be showcased as a submission. And possibly, as an honorable mention because I do like the idea behind this.

chrisaycock

@TheDrone7 I'm not asking you to change the rules. I'm asking you to judge me by the merits of what I did during the Jam. Generics are such a difficult feature in statically typed languages that Go and Zig don't have them.

The whole promise of this contest was that I would be judged by language experts.

TheDrone7

@chrisaycock you would normally be judged by the experts but I'm just here to make sure the submissions being forwarded to them fulfil the requirements for the jam.
Moreover, I realised this when I went over all your GitHub commits, even if I were to allow this submission, you would've still failed to satisfy the requirement of working as a team of at least 2 members through the jam. All of the commits were made by you. And you're also the only member of the team on repl.it
So I'm afraid, despite all your hard efforts, this would still not be a valid submission.

I do appreciate all that hard work, trust me. But rules are rules and I'm only here to enforce them.

chrisaycock

@TheDrone7 The rules don't say that commits have to be done by two people. It literally does not say that.

My teammate is @AndrewCarr2, who is on Repl.it. He tests my work using the binaries. He used to file issues on GitHub (e.g. 1, 2); now he messages me directly and I just CC him.

I hate to belabor this point since you've made up your mind, but you clearly are not "enforcing the rules" when you are making this up.

TheDrone7

@chrisaycock you did not mention your teammate in the post as the submission guidelines suggested. You're clear there then.

As for the language though, I still cannot allow it as it was a mixed decision of 3 people not just mine alone.

chrisaycock

@TheDrone7 I did mention my teammate in my post. Let me quote it for you:

This submission is on behalf of the Empirical Software team. I am the creator of Empirical and Andrew is my beta tester.

TheDrone7

@chrisaycock mentioning refers to using the at symbol @ followed by the username, such as - @TheDrone7 it gives us a link to the user profile of your teammates which we can use to ensure that it's not your alternate account since we've had people do that in older jams we organised like this one.

AndrewCarr2

I never thought my existence would be questioned. I am definitely real and happy to prove it. @isaiah08

https://andrewnc.github.io

chrisaycock

@AndrewCarr2 Yeah, this whole experience has been a disappointment.

TheDrone7

@chrisaycock I would like to apologise on their behalf. I've issued a warning against their behaviour. Also a simple note, try not to take any struck-through comments seriously. It usually means they're saying it as a joke and don't mean it. But as we can both see it can offend others sometimes so I have decided to take action in this case.

HahaYes

@AndrewCarr2 oof its just a kid

TheDrone7

@HahaYes I wouldn't recommend relating it to repl.it in any way.

HahaYes

@TheDrone7 Sorry! I'll delete it.

isaiah08

I am extremely sorry, I didn't mean to offend anyone. @AndrewCarr2 @chrisaycock

AndrewCarr2

@isaiah08 I know you didn't 👍🏻 No problem here. Feelings are pretty high in this thread, but no harm done.

HahaYes

yo you are into stocks too? Same!

chrisaycock

@HahaYes After my PhD, I spent a decade working for hedge funds and proprietary trading firms. I specialized in statistical arbitrage and high-frequency trading.

HahaYes

@chrisaycock whoa. I'm a teen that invests. (Accidentally did an options trade so glad I didn't lose money XD)

fuzzyastrocat

@HahaYes I wAtcH thE StoCKs seGmEnt oN thE nEWs!

HahaYes

ok this is too good

chrisaycock

@HahaYes If you want some more information, I announced the first beta on Hacker News here. I put the project on hold for over a year while I dealt with other things. I picked it up again recently to implement some things I've been obsessively thinking about.

HahaYes

@chrisaycock lol nice. Teens vs Adults. What a showdown

fuzzyastrocat

@HahaYes Go teens!

hg0428

This language is almost as good as ours, I am going to need to hurry up on the object-oriented stuff.

Navinor

underrated af