Performance Implications When Comparing Types in Node.js

As in any weakly typed language, you can’t avoid the fact that performing comparisons across types will cost you CPU cycles.

Consider the following code which does a .filter on an array of 5M entries, all of which are Numbers:

let arrOfNumbers = Array(5000000).fill(1);
console.time('eqeq-number')
arrOfNumbers.filter(a => a == 1)
console.timeEnd('eqeq-number')
console.time('eqeqeq-number')
arrOfNumbers.filter(a => a === 1)
console.timeEnd('eqeqeq-number')

On my Mac, they’re roughly equivalent, with a marginal difference in the performance in the eqeq and eqeqeq case:

eqeq-number: 219.409ms
eqeqeq-number: 225.197ms

I would have assumed that the eqeqeq would have been faster given there’s no possibility of data type coercion, but it’s possible the VM knew that everything in the array and the test value was a number, so, meh, about the same.

Now, for the worst case scenario, consider the following code: the same .filter, but the array is now full of 5M strings of the value “1”:

let arrOfStrings = Array(5000000).fill('1');
console.time('eqeq-string')
arrOfStrings.filter(a => a == 1)
console.timeEnd('eqeq-string')
console.time('eqeqeq-string')
arrOfStrings.filter(a => a === 1)
console.timeEnd('eqeqeq-string')

The eqeq costs about the same as the original example with the weakly typed Number to Number comparison, but now the eqeqeq is significantly faster:

eqeq-string: 258.572ms
eqeqeq-string: 72.275ms

In this case it’s clear that the eqeqeq case doesn’t have to do any data coercion: since the types don’t match, the evaluation is automatically false without having to muck the String into a Number. If you were to continue to mess around and have the .filters compare eqeq and eqeqeq to a String ‘1’, the results are again the same as the first few tests.

Conclusion? Save the VM work if you can. This is a really contrived example, as the eqeqeq can quickly shortcut the comparison to “false” since the types don’t match, but anywhere you can save effort when working on large data sets, it’s helpful to do so, and typing is an easy win when you can take it.
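
One way to take that win when the data shows up as strings is to normalize the types once and compare strictly afterwards. This is only a minimal sketch: the `rawIds` and `normalized` names are made up for illustration, and it assumes every value is a numeric string.

```javascript
// Hypothetical input: numeric values that arrived as strings.
const rawIds = ['1', '2', '1', '3'];

// Convert once up front, so every later comparison is Number to Number.
const normalized = rawIds.map(Number);

// Strict equality now finds the matches without coercing on each comparison.
const matches = normalized.filter(a => a === 1);

console.log(matches.length);
```

Whether this is a net win depends on how many comparisons follow the one-time conversion, so it is worth timing against your own data.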

Optimizing Array Lookups in Node.js

Following up on last week’s post, one of the areas we see (saw!) our integration service running CPU hot was when it was doing the core part of what it does: diffing the list of data we receive from an EHR integration with our own knowledge of the data (aka a sync process). When the data set was in the 1000s of records, the diff calculations took effectively a couple of milliseconds, but as the data sets reached 10k+ records, we often saw in production that the diffs could take 50 to 60+ seconds.

Our original implementation of this diff algorithm was pretty simple. It took the inbound list and did an Array filter against it, and then a nested Array filter on the other list to see if there were matches. Here’s a snippet of the code:

const onlyInInbound = inboundList.filter(currentInbound => {
	return lumaList.filter(currentLuma => {
		return currentLuma.externalId.value == currentInbound.id;
	}).length === 0;
});

The operation was basically O(n*m). In one customer’s account, that implementation took an average of 54,844ms to run. Not good. In synthetic tests we’d see the function run faster over time as the JIT caught up to the work, but it was pathetically slow.

Our first pass at optimizing this was to use a technique similar to fast.js’s array methods, which is to not use the built in Array functional operators and switch to more vanilla for loops. From reading a bunch, the built in iterators have to deal with things like sparse arrays, so you end up spending a lot of time in edge case checking. We know for sure what the input data sets look like, so we eventually moved to an implementation that looked like this:

function filter (subject, fn) {
	let result = [];
	for (var i = 0; i < subject.length; i++) {
		if (fn(subject[i])) {
			result.push(subject[i]);
		}
	}
	return result;
}

const onlyInInbound = filter(inboundList, currentInbound => {
	return filter(lumaList, currentLuma => {
		return currentLuma.externalId.value == currentInbound.id;
	}).length === 0;
});

This implementation was much faster, and brought the operation in that same customer account down to 20,316ms on average. Not amazing by any stretch, but far faster than before. As we kept writing synthetic tests, one of the big things we noticed was the JIT wasn’t able to fully lower these functions if the comparisons weren’t on the same data type. If the comparisons were mixed representations of the same value (e.g. compare ‘1’ to 1), we’d get no JIT benefit (on Node 10). Unfortunately, due to the dirty nature of the data we ingest from all the EHRs we integrate with, we have to assume a level of variable typing in our data pipeline, so the JIT could only save us so much.

The final implementation we made (which is what is running in production now) was to do the classic tradeoff of memory versus CPU. It iterated through both lists and converted them to objects so we could do direct lookups instead of iterating over the data. Here’s a snippet of the final implementation:

const newInboundList = {};
for (var i = 0; i < inboundList.length; i++){
	newInboundList[inboundList[i].id] = inboundList[i];
}
const newLumaList = {};
for (var i = 0; i < lumaList.length; i++){
	newLumaList[lumaList[i].externalId.value] = lumaList[i];
}
const onlyInInbound = [];

for(const inbound in newInboundList) {
	if (!newLumaList[inbound]) {
		onlyInInbound.push(newInboundList[inbound]);
	}
}

As you can see, we trade a little bit of time to do the setup (by creating two object-based representations of the data) and then do an O(n) iteration through the list of comparison data. And voilà! The final implementation went to 72.5ms, roughly a 756x improvement over the original implementation.
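
For comparison, the same lookup idea can be sketched with a Set. This is not what we run in production, just an illustration: the `findOnlyInInbound` name and the sample data are made up, and it assumes the ids can safely be compared as strings (which sidesteps the ‘1’ versus 1 problem).

```javascript
// Returns the inbound records whose id has no match in lumaList.
// Keys are normalized to strings so '1' and 1 are treated as the same id.
function findOnlyInInbound(inboundList, lumaList) {
	const lumaIds = new Set();
	for (const luma of lumaList) {
		lumaIds.add(String(luma.externalId.value));
	}
	const onlyInInbound = [];
	for (const inbound of inboundList) {
		if (!lumaIds.has(String(inbound.id))) {
			onlyInInbound.push(inbound);
		}
	}
	return onlyInInbound;
}

// Made-up sample data with mixed id types.
const inboundSample = [{ id: 1 }, { id: '2' }, { id: 3 }];
const lumaSample = [{ externalId: { value: '1' } }, { externalId: { value: 2 } }];

console.log(findOnlyInInbound(inboundSample, lumaSample));
```

Unlike the object-based version, this keeps inbound records that share an id as separate entries and preserves the inbound order, which may or may not be what you want.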

Monitoring the Node.js Event Loop with InfluxDB

One of our services (our integration engine) at Luma Health has recently been encountering odd timeouts when making outbound connections to another service it depends on. The receiving service has plenty of resources to spare, so we’ve been working through the theory that the event loop in Node might be starved before its callback and timer phases get a chance to run.

To test this, we’ve been playing with monitoring timer performance and putting the data into InfluxDB in order to aggregate and monitor it. To do that, we simply set up a setInterval and use a high resolution timer to write the delta between when we expected to be called and when the interval was actually called.

const Influx = require('influx');
// snapshot the package's name
const packageName = require(process.cwd() + '/package.json').name;
const measurement = 'event_loop_interval_delay';
const fs = require('fs');
const influx = new Influx.InfluxDB(process.env.INFLUXDB);
const { exec } = require('child_process');

let serviceVersion = null;

// snap out the gitsha
exec('git rev-parse HEAD', (err, version) => {
	serviceVersion = version.toString().trim();
});

// and the docker container ID
const hostname = fs.existsSync('/etc/hostname') ?
	fs.readFileSync('/etc/hostname').toString().trim() :
	'localhost';

let startAt = process.hrtime();
const intervalDelay = 500;

// set up an interval to run every 500ms
setInterval(() => {
	const calledAt = process.hrtime(startAt);
	const nanoseconds = calledAt[0] * 1e9 + calledAt[1];
	const milliseconds = nanoseconds / 1e6;
	influx
		.writePoints([{
			measurement,
			tags: {
				service: packageName,
				serviceVersion,
				hostname
			},
			fields: {
				// toFixed returns a string; convert back so a numeric value is written
				delayTime: Number((milliseconds - intervalDelay).toFixed(4))
			},
		}])
		.then(() => {})
		.catch(() => {});
	startAt = process.hrtime();
}, intervalDelay);

I thought it’d be fun to share how we’re using Influx to monitor Node internals. We’ve been monitoring the data and generally seeing Node able to keep up but there are times when the integration engine is under high load and the intervals come anywhere from 500ms to multiple seconds (!!!) late.
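
If you want to try the measurement without InfluxDB, the core of it can be sketched on its own. This assumes a Node version that has process.hrtime.bigint(); the `lateByMs` and `watchEventLoop` names and the `report` callback are made up for illustration.

```javascript
// Convert a nanosecond BigInt delta into how late a timer fired, in milliseconds.
function lateByMs(startNs, calledNs, expectedDelayMs) {
	const elapsedMs = Number(calledNs - startNs) / 1e6;
	return elapsedMs - expectedDelayMs;
}

// Starts the interval; `report` is a hypothetical callback that receives each delay.
function watchEventLoop(expectedDelayMs, report) {
	let startNs = process.hrtime.bigint();
	return setInterval(() => {
		const calledNs = process.hrtime.bigint();
		report(lateByMs(startNs, calledNs, expectedDelayMs));
		startNs = process.hrtime.bigint();
	}, expectedDelayMs);
}

// A timer expected after 500ms that actually fired after 512.5ms was 12.5ms late.
console.log(lateByMs(0n, 512500000n, 500));
```

Calling watchEventLoop(500, delay => console.log(delay)) prints one delay per tick, and the callback is where a write to InfluxDB (or anything else) would go.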

Swift: println isn’t NSLog

After banging my head against this the last few days, I thought I might share a little insight as I delve deeper into Swift. As tempting as it may sound to believe it, println IS NOT NSLog.

If you’re looking to use the Devices function in the iOS Simulator to test an application launching via a push notification, or perhaps via a location update, you’re stuck trying to race to connect the debugger or, better yet, relying on printed statements and watching them.

The magic trick here is to remember that println does not show up in the iOS Simulator Console, but NSLog does. Further, NSLog also works on device, so you can play with the app while disconnected from your dev machine, then plug it back in and pull the logs via the Devices tool in Xcode.

Some additional details are here, which I sadly found far after figuring this out myself.

Music Top 10 from 2014

Like last year and the year before, here’s the list of music I’ve listened to most as scrobbled to Last.fm.

Top Artists

  1. Above & Beyond – 120 listens
  2. Tensnake – 67 listens
  3. Brika – 57 listens
  4. Chromeo – 56 listens
  5. She & Him – 54 listens
  6. Pink Martini – 50 listens
  7. Phantogram – 45 listens
  8. Rise Against – 42 listens
  9. Porter Robinson – 39 listens
  10. Tycho – 38 listens

Top Songs

  1. Brika – Options – 27 plays
  2. Indifferent Guy – Danger – 15 plays
  3. Above & Beyond – We’re All We Need (feat. Zoë Johnston) – 13 plays
  4. Styles – Good Times – 12 plays
  5. Ed Sheeran – Sing – 12 plays
  6. Mystique – Brand New – 12 plays
  7. Brika – Expectations – 11 plays
  8. Naughty Boy – La La La – 10 plays
  9. Kiesza – Hideaway – 10 plays
  10. Chromeo – Over Your Shoulder – 9 plays

Observations

  • While I’ve been continuing to listen to house and electronica, the style has changed quite a bit to be less four on the floor and dancey and more low key and mellow. Examples include Mystique’s Brand New and Danger by Indifferent Guy.
  • There’s only one hip hop song this year and that’s quite a throwback with Styles’ Good Times.
  • Brika was one of my favorite new artists last year and I’m excited that she just released a full album which includes two songs from the most played list: Expectations and Options.

Loading Shapefiles in to MongoDB

I’ve been playing a bit recently with a small geospatial/location based app. After wrangling with a bunch of tools and MongoDB, here are a few tips on importing a set of ESRI Shapefiles into MongoDB. I’m looking at SF Street Sweeping data but you can use any Shapefiles you wish.

Get the shapefiles

For example, grab the SF Street Sweeping data, and download and unzip those.

Convert Shapefiles to WGS84 Projection

The SF shapefiles are in a Northern California specific projection (2227). Lat/long coordinates are in the WGS84 projection (4326). Download the GDAL tools to get access to the ogr2ogr tool to directly convert them. Or use QGIS to load the shapefile as a vector, then export it out using the WGS84 projection.

Convert Shapefile to GeoJSON

ogr2ogr -f geoJSON sweeping.json sfsweeproutes_in_wgs84.shp

Clean up the resulting GeoJSON

MongoImport doesn’t like the ogr2ogr generated GeoJSON. Remove the first two lines:

{
"type": "FeatureCollection",

and the last line:

}

and save that to `sweeping_clean.json`
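
If hand-editing the file gets tedious, a small Node script is another way to get something mongoimport will take. This is just a sketch of an alternative to the manual steps above: the `featuresToLines` name is made up, and it assumes the ogr2ogr output is a standard FeatureCollection that fits in memory. It writes each feature as its own line of JSON, one document per line.

```javascript
const fs = require('fs');

// Turn a GeoJSON FeatureCollection string into one JSON document per line.
function featuresToLines(geojsonText) {
	const collection = JSON.parse(geojsonText);
	return collection.features.map(feature => JSON.stringify(feature)).join('\n') + '\n';
}

// File names follow the ones used in this post; only runs when the input exists.
if (fs.existsSync('sweeping.json')) {
	const cleaned = featuresToLines(fs.readFileSync('sweeping.json', 'utf8'));
	fs.writeFileSync('sweeping_clean.json', cleaned);
}
```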

Import the data to Mongo

mongoimport --db sfstreets --collection streets < sweeping_clean.json

Create a 2dsphere spatial index

Mongo needs an index to query on geospatial data. To create it from the mongo command line (older MongoDB shells call this ensureIndex instead of createIndex):

db.streets.createIndex({"geometry":"2dsphere"})

Where streets is your collection name and `geometry` is the object in your document that contains the GeoJSON location data.

That’s it!
You now have a streets collection in the sfstreets database. I’ll follow up in the next post on how to query this data.

We’ve raised $15M in funding from Kleiner Perkins

What an exciting morning — and really an exciting year here at Remind101. This morning we announced that we’ve raised $15M in funding from Kleiner Perkins and John Doerr has joined our board of directors. This was the first time I was at all involved in a fund raise and I’d be lying if I didn’t say it was a learning experience. From helping to put together the pitch deck and the models to helping out on pitches and follow-up, it’s a pretty daunting task for any team to go through. I’m thrilled with how it turned out for us and to have such an amazing partner at Kleiner in John Doerr.

If you’re interested in reading any of the coverage, here’s a list I’ll try to keep up to date:

Music Top 10 from 2013

Like last year, here’s the list of music I’ve listened to most, as recorded by my scrobbling to Last.fm.

Top Artists

  1. Above & Beyond – 135 listens
  2. Avicii – 98 listens
  3. Daft Punk – 96 listens
  4. Jay-Z – 77 listens
  5. The Delfonics – 71 listens
  6. Alt-J – 69 listens
  7. Delorean & Hardwell – tied at 60 listens
  8. Morgan Page & Toro y Moi – tied at 54 listens
  9. Hall and Oates & CHVRCHES – tied at 52 listens
  10. Bastille – 50 listens

Top Songs

  1. Bastille – Pompeii – 31 plays
  2. Hardwell – Spaceman – 28 plays
  3. Zedd – Clarity – 27 plays
  4. Swedish House Mafia – Don’t You Worry Child – 16 plays
  5. Avicii – Addicted To You – 16 plays
  6. Hardwell – Apollo – 15 plays
  7. Stevie Nicks – Edge of Seventeen – 13 plays
  8. CHVRCHES – The Mother We Share and Walking Def – Running All My Life – tied at 12 plays
  9. Fastball – The Way and Frank Sinatra – I Love Paris and Swedish House Mafia – Greyhound and Alt-J – Something Good and Hall and Oates – Private Eyes and Avicii – Wake Me Up – tied at 11 plays
  10. Kansas – Carry on Wayward Son – 10 plays

Observations:

  • Clearly the trend that started towards more electronica (specifically house and trance) continues this year.
  • Jay Z made a strong showing, largely due to me going to the Magna Carter world tour, which sparked a re-interest in his epic catalog.
  • A couple of really old songs and artists showed up, like Kansas, Hall and Oates and Frank Sinatra. This is because I started using Spotify in addition to iTunes/Amazon, which has made getting to older music much easier and made it fun to rediscover stuff I haven’t heard in a long time.

Static Site Hosting on Heroku with Node.js

I’ve been moving a lot of my web content off of a personal server kept in my apartment and on to various hosting services while on break this year. Sites like Ask An Asian Person and other small inside jokes I used to host on a Windows 2003 Server with IIS on a Dell machine that ran in my closet. That setup is/was so very, well, 2003. In addition, it’s always a good move to reduce and remove any ingress points to my home network.

So for a bunch of the silly small sites I have, I’ve moved them over to one-dyno free hosting on Heroku. To do that, I made up a little template to use called static-heroku-node. It’s a tiny 10 line Node.js + Express application that serves a site out of the /public/ folder in the app. It’s quick and easy to use, and I managed to move a few sites over in short order.

As an aside, I moved my blog over to DreamHost. I looked at Heroku for hosting WordPress — there are a bunch of options on how to do it, but any production setup (e.g. > 1 dyno and any of their production level Postgres databases) would cost something like $25-$50 per month which is a bit rich for just a blog. DreamHost’s 1-click WordPress setup is much cheaper and more flexible than trying to scaffold the same thing up on Heroku/Dotcloud/etc.

Track of the Week: Hey Now by London Grammar

This week’s track is Hey Now by London Grammar off of Metal and Dust, and the Arty Remix of the same. Not a new song by any measure, but I heard it for the first time on KCRW earlier last week, and then again in a house podcast in a house remix. So today’s track of the week is presented in two parts. First, the haunting yet slightly electro-poppy original, and then the dance remix which ups the tempo, adds in a classic four on the floor dance line and synths.