
How does a hash map work, anyway?

I was reading this article about programming interviews a little while ago and one of the things they mentioned was that “A startlingly high percentage of interview questions reduce to breadth-first search or the use of a hash table to count uniques. You need to be able to write a BFS cold, and you need to understand how a hash table is implemented.” I saw that and started thinking wait a minute, how is a hash table implemented? I’m sure we covered that in college, but that was quite a while ago and I’ve forgotten a lot of stuff.


Quick note: Hash tables are also known as hash maps, dictionaries, associative arrays, and probably more. I usually call them hash maps because that’s what they’re called in Java.

For data structures, hash maps are really kind of cool. Under the hood, the data is actually stored in an array of linked lists. The interesting part is how the hash map decides where to put new elements. In a hash map, each array index is a bucket to put elements in, not the location of a single element. It uses a hash function (according to Wikipedia, “any function that can be used to map data of arbitrary size to data of fixed size”; basically, you give it an object and it gives you a number in a certain range) to decide which bucket to put any given item in. Hash % array length = array index to add the item to. If there’s already an item in that bucket, it walks down the linked list to the last item and adds the new item to the end. Because there can be multiple items in each “bucket,” the hash map creates an entry element that contains both the key and the value, so it can find the right value for a given key when you ask for an element back.
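To make that concrete, here’s a minimal sketch in Java. The class and method names are my own inventions, not the real java.util.HashMap internals, and for brevity new entries go at the head of each bucket’s list instead of the tail (lookups work the same either way):

```java
import java.util.Objects;

class SimpleHashMap<K, V> {
    // Each entry stores both the key and the value, plus a link to the
    // next entry in the same bucket.
    private static class Entry<K, V> {
        final K key;
        V value;
        Entry<K, V> next;
        Entry(K key, V value) { this.key = key; this.value = value; }
    }

    private final Entry<K, V>[] buckets;

    @SuppressWarnings("unchecked")
    SimpleHashMap(int bucketCount) {
        buckets = new Entry[bucketCount];
    }

    // Hash % array length = the bucket this key belongs in.
    // Math.floorMod keeps the index non-negative even for negative hashes.
    private int indexFor(K key) {
        return Math.floorMod(Objects.hashCode(key), buckets.length);
    }

    public void put(K key, V value) {
        int i = indexFor(key);
        // Walk the bucket's linked list; if the key is already there,
        // just overwrite its value.
        for (Entry<K, V> e = buckets[i]; e != null; e = e.next) {
            if (Objects.equals(e.key, key)) {
                e.value = value;
                return;
            }
        }
        // Otherwise add a new entry to this bucket's list.
        Entry<K, V> entry = new Entry<>(key, value);
        entry.next = buckets[i];
        buckets[i] = entry;
    }

    public V get(K key) {
        // Only one bucket needs to be searched, not the whole map.
        for (Entry<K, V> e = buckets[indexFor(key)]; e != null; e = e.next) {
            if (Objects.equals(e.key, key)) {
                return e.value;
            }
        }
        return null;
    }
}
```

Notice that get only ever looks at one bucket, which is the whole trick.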

Dividing the whole collection of keys and values into buckets is how hash maps work so quickly. If you had to search the whole list of keys each time, it would take much longer to get an element back from the hash map, but because of the bucketing system you only have to look at the elements in one bucket. It’s also very fast to add elements because you don’t need to worry about re-sorting the list; you just find the right bucket and add your new element to the end of its list.

There is a complication, though: the more items in each bucket, the longer it takes to find an item. If your “buckets” fill up, you’ll need to expand the hash map by making a longer array and recalculating which bucket to put each and every item into. That can be a pretty bad performance hit if you already have enough items in your hash map. You also don’t want to avoid that problem by just starting with a very large array: if you have so many buckets that you never end up putting anything in most of them, you’re eating up memory for no good reason.
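The reason everything has to be re-bucketed is that the bucket index is computed from the array length, so when the length changes, so do the indexes. A quick demonstration (the names here are invented for illustration):

```java
import java.util.List;

class RehashDemo {
    // The same formula a hash map uses to pick a bucket.
    static int bucketFor(Object key, int bucketCount) {
        return Math.floorMod(key.hashCode(), bucketCount);
    }

    // Count how many keys land in a different bucket after the array grows.
    static int movedKeys(List<String> keys, int oldSize, int newSize) {
        int moved = 0;
        for (String key : keys) {
            if (bucketFor(key, oldSize) != bucketFor(key, newSize)) {
                moved++;
            }
        }
        return moved;
    }
}
```

Growing from 4 to 8 buckets moves about half the keys, which is why a resize means touching every entry in the map.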

Because of the bucketing system, hash maps are a great way to count uniques – being able to quickly find the right bucket and look through only a few items means you can add items or see whether the hash map already contains the item you want to add. On the other hand, hash maps aren’t very useful if you care about the order of your items or if you’re just going to process all of them in a loop. Getting a list of every element in a hash map involves walking each linked list in each bucket, which can take some time (and memory!) if you have enough items. If you’re going to process everything in your collection, skip the overhead of a hash map and just use an array list or linked list.

Okay, so given all of those implementation details, why are hash maps interesting to ask about in an interview? While it’s generally a good idea to understand data structures because they’re so core to what we do as programmers, I’m suspicious they get asked about in interviews because those interviewers were asked about them. Unless you’re interviewing a recent grad and want to make sure they paid attention in class, I’m not convinced that you really learn anything interesting about someone’s ability to code by asking for more details than what a hash map is good for and when you should use one. I mean, it’s been years since I forgot how a hash map actually works and I manage to write code that does what I meant most of the time :)

Mongo tip of the day

Mongo can be very weird to adjust to if you’re used to “normal” (SQL) databases. One thing that tripped me up a little was discovering that mongo throws a DuplicateKeyException when you try to insert a duplicate into a field that has a unique index but is not a key. If you see that exception and there’s nothing obviously wrong with your _id field (mongo’s version of a primary key), have a look at any fields you have with a unique index.
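For example, in the mongo shell (the users collection and email field here are made up for illustration):

```javascript
// A unique index on an ordinary field -- not the _id key:
db.users.createIndex({ email: 1 }, { unique: true })

// From now on, inserting two documents with the same email fails with
// an E11000 duplicate key error, even though both _id values are unique.
```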

Bonus tip: if you’re new to mongo I recommend their recorded Thinking In Documents webinar. It’s a pretty quick (1 hour) overview of how documents and querying work in mongo; I wish I’d found it before I spent so long fumbling around figuring mongo out on my own.

Rubber duck debugging

Rubber duck debugging is one of those things that sounds completely ridiculous and is actually really helpful. To summarize the wikipedia page quickly, rubber duck debugging is when you figure out what’s wrong with your code by explaining it very carefully to an inanimate object.

You’re probably wondering why you would bother explaining your code to an object like a rubber duck or a teddy bear if you work as part of a team and could just ask one of the other devs for help. Sometimes you should go directly to another developer. If there’s a serious bug in production that affects a lot of users, worrying about interrupting someone would be silly; it’s much more important to get production working again than to let Hypothetical Amy finish off her refactoring task before you bother her.

In other situations, it’s more important to give it a try yourself before interrupting anyone. The reason interruptions are such a big deal is that context switching is expensive, and it’s even worse when switching to a programming task because of the number of details programmers have to “reload” into their working memory before they can get back into their work. Numbers on exactly how long it takes to get back up to speed after you’ve been interrupted vary: this Fast Company article says it takes a little over 23 minutes to get back on task, while this New York Times article says it’s more like 25 minutes. If you ask someone to help you for just five minutes, it’s really not just five minutes; it’s also the time it takes for them to get back into what they were doing. That’s why programmers tend to get so cranky when you interrupt them :)

One way you can try to solve your own problem without interrupting anyone is rubber duck debugging. Having to explain all of the context around your problem, like what you’re trying to accomplish, what your code is supposed to do to get to that end result, what seems to be going wrong, where you think the problem is, what you’ve already tried, etc., is one of the most useful parts of asking another person for help, and often the only part you need to solve your problem. Something about the process of explaining a problem to someone else helps you see parts that you missed before, whether it’s a log file you didn’t check or a logic error you didn’t catch. That explaining process doesn’t actually require a person to explain things to; it can work just as well if you explain it to the rubber duck, or the teddy bear, or a voice recording app on your phone, or whatever works for you.

Personally, I like to write an email to the person I would ask for help if I really couldn’t figure it out myself. Putting a name at the top seems to help me get into the mindset of thinking about what that person would ask me about the problem and what they would likely suggest I try next. Most of the time I figure it out before I finish the email, but when I don’t, hey, I already have a nice tidy explanation of the problem that I can send to the person I would’ve asked for help anyway :)

If you’re stuck and don’t have anyone around to ask or just want to try everything you can before you ask for help, give rubber duck debugging a try. The worst case scenario is you end up with a good description of your problem that you could send to a friend or post in a forum.

Programming is actually a creative field

There’s this stereotype that programming isn’t a creative field, that programmers do nothing but mechanically assemble code all day. I find that really sad; I think if we did a better job of explaining how creative programming actually is, many more people would be willing to give it a shot.

If you’ve only ever had a total beginner intro to programming, it might be really hard to see where the creativity comes in. Honestly, variables are pretty boring and conditionals aren’t much better. Programming gets a lot more interesting once you’ve mastered the basics, I swear :)

Programming is a bit like building things with lego blocks, except that you have to make all the blocks yourself. Building the blocks – that is, writing an if statement or creating a variable – isn’t that interesting, I’ll be honest. But once you get good at creating those blocks, that’s when you get to be creative. Just like you can build a castle or a siege engine or an entire lego Westeros out of simple little blocks, you can build amazing things out of variables and loops and conditionals if you’re patient.

It’s also a bit like pottery or woodworking. Just because you’re constrained to making a usable cup or chair doesn’t mean you can’t be incredibly creative within those constraints. Even in visual art or writing you need sentences that make sense and a combination of shapes and colours that work. Jokes about modern art aside, you can’t just throw paint in the direction of the canvas and expect anyone to care what you’ve done.

There are always constraints on anything you make, programming just has particularly rigid ones. An imperfect sentence is still intelligible, but a missing semicolon will keep your code from compiling at all. For some people that’s more frustration than they care to deal with, for others it’s an interesting challenge.

Building your own project that does anything you want it to is pretty obviously creative, but what about programming at work where you’re given assignments?

Even the least interesting internal application still requires creative problem solving to add new features or fix bugs. There have been times when I’ve had to be very creative to change an application in a way that meets the new requirements without breaking anything that used to work and without making a horrifying mess of the code. There are often quick and dirty ways to make a change that just leave you with more trouble in the long run, and sometimes (in an emergency, for example) they’re the least bad option, but usually you think about not just how you can make the change that’s necessary right now but how you can set things up so that you can make more changes in the future without tearing your hair out.

There’s never just one way to do that, either. When you’re working through a beginner tutorial it will probably look like there’s one right way to build any application and you can’t be creative at all once you’ve decided what you’re going to build and what the user interface should look like. That’s totally untrue. Once you’re working on anything larger than a simple assignment to write a for loop, you enter the world of trade-offs. There’s never only one way to solve a problem and each solution has its own pros and cons.

For example, optimizing code so it runs faster involves a lot of decisions about trade-offs and a lot of creative problem solving. Optimizing code usually makes it harder to read, which makes it harder to update if you need a new feature or find a bug. This matters a whole lot when you’re running a business because programmer time is so expensive. On the other hand, if your program runs so slowly that no one wants to use it (and pay for it!), it doesn’t matter how nice the code is to work with. Sometimes you absolutely need your code to run as fast as possible and unclear code is worth it to get your game to run at a decent frame rate. Other times performance isn’t your highest priority and what you really need is to be able to read and change your code quickly because you get requests for new features all the time.

In short, I build things all day. How is that not creative?

Development is maintenance

Professional programming and the kind of programming you learn in college/university/bootcamp/etc are actually very different things. Despite what you learned in school, development is really maintenance. In other words, I’m here to crush your dreams :)

So, you know how in school you started new projects from the ground up all the time? Yeah, you’ll hardly ever do that at work.

Now, sometimes you will need to research new technologies and/or frameworks and starting a new prototype project is usually a part of that, and sometimes even the largest and slowest moving organization needs to start something completely new, but that’s generally a very small part of the job.

What you’ll actually spend most of your time doing as a professional developer is adding features and fixing bugs in an existing product. That’s such a large part of the job that I wish we’d done more (or any) of it in school. For anyone reading this who is learning to code, I strongly recommend taking sample projects or open source projects or whatever you can get your hands on and adding new features or fixing bugs. Learning to read other people’s code is hugely important and you may only barely touch on it in your studies.

To be fair, just learning to code takes a lot of time and you can only cram so much into any one program without keeping students there for years and years, but I wish we put a little more emphasis on what software developers actually do at work most of the time. I also wish we’d spent more time on why design is such a big deal.

One of the consequences of hardly ever starting completely new projects at work is that the few projects that do get started are extremely long lived. Instead of a tiny throw-away project that you spend maybe a week building and then never touch again, you’ll work with applications that live for years or even decades. This can be really weird to adjust to since the lifespan of those projects means every tiny decision you made in five minutes can come back to haunt you for years to come :)

On the other hand, long lived projects have a much greater impact than tiny little throw-aways. If you do a good job, the code you write can make people’s lives a little bit easier for years and years. You can also build much larger things, whether they’re applications, games, frameworks, or something else entirely, when you have years to work on something. Corporate software development isn’t all bad, you get to work on things that you could never build on your own.

Another way professional development is different from school projects is that requirements always, always change. Even if every feature you add is perfect and bug free (ha!), your users are going to ask for new things and/or discover that the feature they asked for isn’t actually what they needed and the business might expand into new areas and the laws that your business has to follow might change. Sometimes technical requirements even drive changes: if a new version of your database or framework or a library you depend on comes out, eventually you’re going to want to switch to that.

The requirements changes can be infuriating, I’m not going to sugar-coat that. But at least you get to work on something that people care enough about to ask for changes, even if sometimes it seems like they have no idea what they actually want. If you never had to change a piece of software, all that would mean was that absolutely nobody was using it. I don’t know about you, but I think it’s pretty cool that people actually use the stuff I build.

Real world software development is very, very different from what you do in school, so don’t be surprised if it takes you a little while to find your feet. As much as it can be frustrating sometimes, there are some really cool upsides too.

Dev tool of the day

You know what’s incredibly helpful? RequestBin! Why is it so great? Because testing webhooks sucks and RequestBin makes it easy. Logging your output is a good start but that can’t tell you which IP your request is actually coming from. RequestBin can, which is awesome when you’re trying to figure out whether the Elastic IP you set up in AWS is working correctly. It also shows you all of your output (in a nice human-readable format, no less), which is handy if you really want to know exactly what your client receives.

It’s also free! On the downside your destination url only lasts for 48 hours and your data may be wiped out at any time (if you need a permanent solution look at Runscope’s Request Captures – seems only fair to plug the paid version when I’m talking about how helpful the free one is), but the price is right :)

If it’s hard to explain, it’s probably a bad idea

One of the things I struggled with when I was new to programming was how to tell whether a given piece of code is good or not. When everything is new and confusing, how do you tell bad confusing from normal confusing?

One thing that will give you a very helpful hint is if your code is hard to explain. As this Python style guide puts it:

If the implementation is hard to explain, it’s a bad idea. This is a general software engineering principle — but applies very well to Python code. Most Python functions and objects can have an easy-to-explain implementation. If it’s hard to explain, it’s probably a bad idea. Usually you can make a hard-to-explain function easier-to-explain via “divide and conquer” — split it into several functions.

Basically, if something you’re working on is hard to explain, that’s a sign that it needs to be re-thought. Some problems are just unavoidably complex, but it should still be possible to explain what you’re doing at a high level.
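Here’s a small Java sketch of that “divide and conquer” advice; all the names are invented for illustration. Instead of one method that trims, validates, and saves in a single tangle, each step gets a method you can explain in one sentence:

```java
class SignupService {
    // The top-level method reads like a description of what happens.
    String register(String email) {
        String normalized = normalize(email);
        validate(normalized);
        return save(normalized);
    }

    // "Clean up the address" is one sentence...
    private String normalize(String email) {
        return email.trim().toLowerCase();
    }

    // ..."reject anything that isn't an email" is another...
    private void validate(String email) {
        if (!email.contains("@")) {
            throw new IllegalArgumentException("not an email: " + email);
        }
    }

    // ...and "store it" is the last one.
    private String save(String email) {
        // stand-in for a real persistence call
        return "saved:" + email;
    }
}
```

Each piece is now easy to test and easy to describe, which is usually a good sign the design makes sense.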

That applies to higher level application logic just as much as to individual functions. At a previous job I worked on a multiplayer game project that involved putting groups of players into rooms together for each round, then closing down that room when the round was over and creating a new one. Our first implementation seemed like a good idea at the time, but when we got the game to a point where the team could play together, we had a terrible time explaining to the artists and the project manager how players were sorted into rooms.

The non-programmers on the team were by no means stupid people and had been working on the game for quite a while by that point. The fact that we couldn’t explain our room selection scheme to them was a very strong sign that what we were doing just didn’t make sense. As we kept playing our in-progress game, it also turned out that it was extremely difficult to get the whole team into the same room. There were fewer than a dozen of us working on that game, so there was really no good reason for it to be that hard to play together.

In the end, we admitted our room selection logic wasn’t working and rewrote it to be much simpler. Players sometimes had to wait a bit longer for a round to start, but they could play with their friends more easily and stay in the same room with the people they played with last round. The simpler logic that was easier to explain was also a better experience for the players.

I’m not going to pretend you’ll never run into anyone who is invested in not understanding what you’re trying to explain to them, but if you give someone an overview of what you’re up to and they don’t follow it, think about whether you’re doing something overly complicated before you assume the person you’re trying to explain it to is just dumb. Complicated code is harder to test, harder to debug, and harder to change when you get a new feature request. It’s worth paying attention to seemingly unimportant signs like having a hard time explaining your code to someone else because it can save you so much time in the long run.

What does SOLID really mean? Part 5

First, a quick recap: the SOLID principles of object oriented design are, to quote Uncle Bob:

The Single Responsibility Principle A class should have one, and only one, reason to change.
The Open Closed Principle You should be able to extend a class’s behavior, without modifying it.
The Liskov Substitution Principle Derived classes must be substitutable for their base classes.
The Interface Segregation Principle Make fine grained interfaces that are client specific.
The Dependency Inversion Principle Depend on abstractions, not on concretions.

Last time I talked about the fourth letter in SOLID, the Interface Segregation Principle. Now I’m moving on to the Dependency Inversion Principle.

The Dependency Inversion Principle says you should depend on abstractions, not concrete classes. Great, what does that mean? Basically that you want to hide the details of what you’re doing not just behind a separate class but behind an interface so you don’t even have to know which class is actually doing the work.

If you only have one class that actually does the work, this probably seems like a total waste of time. Honestly, for some situations it probably is. If there are business reasons that you’re never, ever going to change database vendors, then don’t worry too much about hiding which database driver you’re using. In other situations where things might change or will definitely change (which is most situations, if requirements would just stay put software would be easy), dependency inversion can really help you out.

Unrelated photo from Pexels
Unrelated photo from Pexels

Let’s take sending email as an example. In a web app that you sell to other businesses, you often need to notify their customers of things directly – if you sell an appointment reminder system the entire point is that your customer doesn’t have to manually send emails to their customers, your app takes care of that for them. Sending email sounds simple enough, right? Either you set up your own SMTP server and send emails directly or you use a service like MailChimp or SendGrid or Amazon SES or Mailgun or ___ and you leave it alone.

Not so fast! What if some of your customers want to send email through their own SendGrid account so they can customize their own emails without going through you and see all their stats and everything? What if other customers already have their own SMTP server and want to send email through that? Now you’ve really got to hide all the details so that your code can trigger an email without even knowing whether that email is going to be sent directly to a mailserver or to a service like Mailgun.

If you built your app following the dependency inversion principle from the get-go, this is going to be really simple. All you have to do is add another implementation of the email handler interface you already have and you’re set. Best of all, you know you didn’t break your existing email handling because it’s in a separate class that you don’t have to mess with.
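Here’s roughly what that looks like in Java. The interface and class names are invented, but the shape is the point: the code that triggers an email only ever knows about the interface.

```java
// The abstraction the rest of the app depends on.
interface EmailSender {
    void send(String to, String subject, String body);
}

// One implementation talks to the customer's own SMTP server...
class SmtpEmailSender implements EmailSender {
    public void send(String to, String subject, String body) {
        // real SMTP details would go here
        System.out.println("SMTP -> " + to + ": " + subject);
    }
}

// ...another talks to a service like SendGrid or Mailgun.
class ApiEmailSender implements EmailSender {
    public void send(String to, String subject, String body) {
        // real API call would go here
        System.out.println("API -> " + to + ": " + subject);
    }
}

// The business logic never knows which implementation it's using.
class ReminderService {
    private final EmailSender sender;

    ReminderService(EmailSender sender) {
        this.sender = sender;
    }

    void remind(String customerEmail) {
        sender.send(customerEmail, "Appointment reminder", "See you tomorrow!");
    }
}
```

Supporting a new customer’s SMTP server just means writing one more EmailSender implementation; ReminderService never changes.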

If you let your app depend directly on one email service, though, you’ve got a mess to deal with. Not only do you have to add another email handler, but you have to make pretty major changes to your code to pull your existing email handling into a separate class. This can really suck if you let your code deal with too many implementation details, like how to react to different error codes. It also makes the change riskier and more expensive (in both time and money) because any time you change existing code you might introduce new bugs and because you’ll need to retest all of the existing email handling as well as the new feature to make sure everything still works.

Even if you doubt a certain feature is going to change, it’s still worth thinking about dependency inversion. If the code that triggers an email can only talk to an interface, that’s going to change the way you pass along data like the to and from addresses. It’s also going to change how you report and recover from errors. You might still decide to let your code depend directly on your email service, which is perfectly fine if you’ve thought that decision through. The Dependency Inversion Principle isn’t meant to be an ironclad rule, it’s just a principle to help you avoid painting yourself into a corner.

That’s it for SOLID! If there’s a particular design principle you’d like me to cover next, let me know in the comments.

What does SOLID really mean? Part 4

First, a quick recap: the SOLID principles of object oriented design are, to quote Uncle Bob:

The Single Responsibility Principle A class should have one, and only one, reason to change.
The Open Closed Principle You should be able to extend a class’s behavior, without modifying it.
The Liskov Substitution Principle Derived classes must be substitutable for their base classes.
The Interface Segregation Principle Make fine grained interfaces that are client specific.
The Dependency Inversion Principle Depend on abstractions, not on concretions.

Last time I talked about the third letter in SOLID, the Liskov Substitution Principle. Now I’m moving on to the Interface Segregation Principle.

Another way to state the Interface Segregation Principle is that no client should be forced to depend on methods it does not use (thanks wikipedia). That is, if you have methods in your interface that are different enough that no single client would use both of them, those methods probably belong in separate interfaces. This is similar to but not quite the same as the Single Responsibility Principle – a class can have a single responsibility and still have public methods that will be used by some clients but not others.

Take this example of a Job class from the Wikipedia page on the Interface Segregation Principle.

The ISP was first used and formulated by Robert C. Martin while consulting for Xerox. Xerox had created a new printer system that could perform a variety of tasks such as stapling and faxing. The software for this system was created from the ground up. As the software grew, making modifications became more and more difficult so that even the smallest change would take a redeployment cycle of an hour, which made development nearly impossible.

The design problem was that a single Job class was used by almost all of the tasks. Whenever a print job or a stapling job needed to be performed, a call was made to the Job class. This resulted in a ‘fat’ class with multitudes of methods specific to a variety of different clients. Because of this design, a staple job would know about all the methods of the print job, even though there was no use for them.

The solution suggested by Martin utilized what is called the Interface Segregation Principle today. Applied to the Xerox software, an interface layer between the Job class and its clients was added using the Dependency Inversion Principle. Instead of having one large Job class, a Staple Job interface or a Print Job interface was created that would be used by the Staple or Print classes, respectively, calling methods of the Job class. Therefore, one interface was created for each job type, which were all implemented by the Job class.
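The fix described above can be sketched in Java like this. The interface names follow the Wikipedia description; the method details are invented:

```java
// One small interface per job type...
interface PrintJob {
    String print(String document);
}

interface StapleJob {
    String staple(String document);
}

// ...all still implemented by the single Job class.
class Job implements PrintJob, StapleJob {
    public String print(String document) { return "printed " + document; }
    public String staple(String document) { return "stapled " + document; }
}

// A staple client only sees StapleJob, so it can't accidentally get
// tangled up with printing methods it has no use for.
class StapleClient {
    private final StapleJob job;

    StapleClient(StapleJob job) { this.job = job; }

    String run(String document) { return job.staple(document); }
}
```

The Job class can still do everything, but each client only depends on the slice of it that it actually uses.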

Just because the Job class only changes when we have a new or different type of job doesn’t mean the interface isn’t a mess. Of course, you could also argue that “Job” is too broad and that the Job class does have multiple responsibilities because a staple job and a print job are separate things, but I think there’s still something to be gained from looking at the breadth of your interface and thinking about whether it needs to be broken up into separate interfaces.

Even if you have a single class that implements all of those interfaces, it’s still cleaner for the clients of that class only to know about the methods they actually need. The more things your interface does, the more likely that separate clients accidentally get tangled up because it’s so easy to just call another method on an interface you already have access to. Splitting your interface into separate pieces forces you to think about what each client really needs to have access to and whether you’ve split your clients up the right way.

In most cases, it’s probably better to let separate subclasses implement the different parts of each single purpose interface. If two clients are different enough to use completely separate interfaces, then a single change probably should not affect them both. Sometimes the change you need to make is at such a fundamental level that it is reasonable for all clients to be affected, but that’s something you should avoid if at all possible. Programming: where there’s never a simple right answer.

Another reason to have smaller interfaces is to make life easier for maintenance programmers :) The more methods you have in an interface, the harder the maintenance programmer has to work to figure out which one is actually right for what they’re doing. That might sound silly, but take a look at the Java Collections API. Collections are meant to be generic so they do need a pretty broad interface but that’s still a lot of stuff to dig through when you just want to know which method to use to update some of the elements in your collection.

Next up, the last letter in SOLID: the Dependency Inversion Principle.

What does SOLID really mean? Part 3

First, a quick recap: the SOLID principles of object oriented design are, to quote Uncle Bob:

The Single Responsibility Principle A class should have one, and only one, reason to change.
The Open Closed Principle You should be able to extend a class’s behavior, without modifying it.
The Liskov Substitution Principle Derived classes must be substitutable for their base classes.
The Interface Segregation Principle Make fine grained interfaces that are client specific.
The Dependency Inversion Principle Depend on abstractions, not on concretions.

Last time I talked about the second letter in SOLID, the Open Closed Principle. Now I’m moving on to the Liskov Substitution principle.

The Liskov Substitution Principle is named that because it was created by Barbara Liskov, currently an Institute Professor at the Massachusetts Institute of Technology and Ford Professor of Engineering in its School of Engineering‘s electrical engineering and computer science department. The principle also doesn’t lend itself well to a short description, so it’s just easier to name it after the person who invented it.

The Liskov substitution principle says that it must be possible to substitute a subclass for the base class without changing anything. Basically, your object hierarchy should make sense :) If you have a subclass that requires special handling and can’t just be dropped in where the base class is used, something is wrong with your design. Why is that so bad? Because it means every time you use that subclass you have to remember to add the special handling bit and/or remember which subclass has which side effects.

If you need some of the functionality of the base class but you have to do some special stuff that means you can’t just create a subclass that is substitutable, create a new class and give it an instance of the base class to use. If it can’t behave like a real subclass, don’t try to force it to be one; it’s just going to cause trouble in the long run.

The typical example of a Liskov Substitution Principle violation is a Square class and a Rectangle class. If they both have setters for width and height, then you can get yourself into trouble if the calling code got a rectangle when it expected a square or vice versa. Say you’re trying to lay out a screen and you know you have a space left that’s x by y so you set your screen object’s width to x and its height to y. If your screen object is a rectangle, everything is cool. But if your object is a square, suddenly its width also got set to y when you set its height to y. Now your layout is all messed up and you’re frustrated because your code looks perfectly reasonable even though it’s clearly not working correctly.
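Here’s that scenario sketched in Java. The setters and the layout method are invented, but this is the standard Square/Rectangle illustration:

```java
class Rectangle {
    protected int width, height;
    void setWidth(int w) { width = w; }
    void setHeight(int h) { height = h; }
    int area() { return width * height; }
}

// A square has to keep width and height equal, so each setter
// quietly changes both dimensions...
class Square extends Rectangle {
    @Override void setWidth(int w) { width = w; height = w; }
    @Override void setHeight(int h) { width = h; height = h; }
}

class LayoutDemo {
    // ...which breaks code that looks perfectly reasonable for any Rectangle.
    static int fillSpace(Rectangle r) {
        r.setWidth(5);
        r.setHeight(3);
        return r.area();
    }
}
```

fillSpace expects an area of 15, but hand it a Square and it gets 9, because setting the height silently overwrote the width it just set.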

Another way to state the Liskov Substitution Principle is that your code shouldn’t contain surprises. No matter how sensible and obvious something seems while you’re writing it, in six months when you come back to add a new feature you will have forgotten all the details. If your code doesn’t have surprise side effects or special handling, then you’re much more likely to be able to add that new feature quickly and move on. If you run into a surprise, you could spend ages figuring out why the code behaves that way.

If you have class hierarchies in your code, be nice to your future self and obey the Liskov Substitution Principle.