Following up on last week’s post, one of the areas where we see (saw!) our integration service running CPU-hot was in the core of what it does: diffing the list of data we receive from an EHR integration against our own knowledge of that data (aka a sync process). When the data set was in the thousands of records, the diff calculations took effectively a couple of milliseconds, but as data sets reached 10k+ records, we often saw diffs in production take 50-60+ seconds.
Our original implementation of this diff algorithm was pretty simple. It took the inbound list and, for each record, ran a nested Array filter over our list to check whether a match existed. Here’s a snippet of the code:
const onlyInInbound = inboundList.filter(currentInbound => {
  return lumaList.filter(currentLuma => {
    return currentLuma.externalId.value == currentInbound.id;
  }).length === 0;
});
The operation was basically O(n*m). In one customer’s account, that implementation took an average of 54,844ms to run. Not good. In synthetic tests we’d see the function speed up over time as the JIT caught up to the work, but it was still pathetically slow.
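For context, a synthetic harness along these lines is enough to reproduce the shape of the problem at 10k records. This is just a sketch, not our actual test code; makeLists and the record shapes are illustrative, mirroring the snippets in this post:

// Build two lists of n records shaped like the ones we diff.
// Offset the second list by half so roughly half the inbound
// records have no match.
function makeLists (n) {
  const inboundList = [];
  const lumaList = [];
  for (var i = 0; i < n; i++) {
    inboundList.push({ id: String(i) });
    lumaList.push({ externalId: { value: String(i + n / 2) } });
  }
  return { inboundList, lumaList };
}

const { inboundList, lumaList } = makeLists(10000);

// Time the naive filter-inside-filter diff from above.
const start = process.hrtime();
const onlyInInbound = inboundList.filter(currentInbound => {
  return lumaList.filter(currentLuma => {
    return currentLuma.externalId.value == currentInbound.id;
  }).length === 0;
});
const [s, ns] = process.hrtime(start);
console.log(`diff took ${(s * 1e3 + ns / 1e6).toFixed(1)}ms, kept ${onlyInInbound.length}`);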
Our first pass at optimizing this borrowed a technique from fast.js’s array methods: skip the built-in Array functional operators and use plain for loops instead. From reading around, the built-in iterators have to deal with things like sparse arrays, so you end up spending a lot of time on edge-case checking. We know exactly what our input data sets look like, so we eventually moved to an implementation that looked like this:
function filter (subject, fn) {
  let result = [];
  for (var i = 0; i < subject.length; i++) {
    if (fn(subject[i])) {
      result.push(subject[i]);
    }
  }
  return result;
}

const onlyInInbound = filter(inboundList, currentInbound => {
  return filter(lumaList, currentLuma => {
    return currentLuma.externalId.value == currentInbound.id;
  }).length === 0;
});
This implementation was much, much faster, and brought the operation in that same customer account down to 20,316ms on average. Not amazing by any stretch, but far faster than before. As we kept writing synthetic tests, one of the big things we noticed was that the JIT wasn’t able to fully lower these functions if the comparisons weren’t between values of the same type. If the comparisons were mixed representations of the same value (e.g., comparing '1' to 1), we’d get no JIT benefit (on Node 10). Unfortunately, given the dirty nature of the data we ingest from all the EHRs we integrate with, we have to assume a level of variable typing in our data pipeline, so the JIT could only save us so much.
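To illustrate the monomorphism point: if you can afford to normalize types once at the edge, the hot comparison stays string === string and the JIT has a fighting chance. This is only a sketch (normalizeId is a hypothetical helper, not something from our pipeline), reusing the filter function above:

// Hypothetical normalization pass: coerce every id to a string once,
// so the comparison inside the hot loop is always string === string.
function normalizeId (value) {
  return value == null ? '' : String(value);
}

const onlyInInbound = filter(inboundList, currentInbound => {
  // Hoist the coercion out of the inner loop.
  const inboundId = normalizeId(currentInbound.id);
  return filter(lumaList, currentLuma => {
    return normalizeId(currentLuma.externalId.value) === inboundId;
  }).length === 0;
});

In our case we couldn’t guarantee clean input this far down the pipeline, which is why we kept looking.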
The last and final implementation (which is what runs in production now) made the classic tradeoff of memory for CPU. It iterates through both lists once and converts them to objects, so we can do direct key lookups instead of repeatedly iterating over the data. Here’s a snippet of the final implementation:
// Index both lists by id so membership checks become O(1) lookups.
const newInboundList = {};
for (var i = 0; i < inboundList.length; i++) {
  newInboundList[inboundList[i].id] = inboundList[i];
}

const newLumaList = {};
for (var i = 0; i < lumaList.length; i++) {
  newLumaList[lumaList[i].externalId.value] = lumaList[i];
}

// Any inbound id with no entry in the Luma index is new to us.
const onlyInInbound = [];
for (const inbound in newInboundList) {
  if (!newLumaList[inbound]) {
    onlyInInbound.push(newInboundList[inbound]);
  }
}
As you can see, we trade a little bit of time up front on setup (building two object-based representations of the data) and then do an O(n) pass through the comparison data. And voilà! The final implementation came in at 72.5ms, a 761x improvement over the original implementation.
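One footnote on the final version: if we were writing it today, a Map would sidestep a couple of plain-object sharp edges. This is a sketch of the same lookup-table idea, not what actually shipped:

// Same memory-for-CPU tradeoff, using Map instead of a plain object.
// A plain object silently loses a record whose id is '__proto__';
// a Map does not.
const lumaIds = new Map();
for (const luma of lumaList) {
  lumaIds.set(luma.externalId.value, luma);
}

const onlyInInbound = [];
for (const inbound of inboundList) {
  if (!lumaIds.has(inbound.id)) {
    onlyInInbound.push(inbound);
  }
}

One caveat: object keys are always coerced to strings, so the object version quietly papers over the '1' versus 1 mismatch; Map keys are not, so with Map you’d want a normalization pass like the one sketched earlier.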