
Working Through The Efficient Three Sum Algorithm

This morning I spent an hour or so recreating the Three Sum algorithm in Java (I know because, for the sake of motivation, I timed myself).  I wanted to recreate the solution suggested in the class I'm following: go through all the possible pairs in the array and, using binary search, select from the remaining values every element that sums to zero with the pair.[1]

That's the solution that takes you from O(n^3) to O(n^2 log n) -- if I'm getting the notation right.  Because binary search over N elements is a log N algorithm: each step halves the range you're searching, and you can only halve N things about log N times before you're down to a single element.  (8 elements => 4 => 2 => 1 is three halvings, and log2 8 = 3.)

Again, I had a few problems:

  • I forgot to put my midpoint value inside the loop (it was not being recalculated each time the high and low points changed).
  • I had the loop running while the low was less than the high, not less than or equal to the high.
That second mistake screwed up my results by quite a bit.  Where I was supposed to find 80 triples, for instance, I only found 40.
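
For the record, here's a minimal sketch of the binary search with both fixes in place -- my own reconstruction, not the course code:

static int binarySearch(int[] sorted, int key) {
    int low = 0;
    int high = sorted.length - 1;
    while (low <= high) {                  // fix #2: <=, not <
        int mid = low + (high - low) / 2;  // fix #1: recomputed every iteration
        if (sorted[mid] < key) {
            low = mid + 1;
        } else if (sorted[mid] > key) {
            high = mid - 1;
        } else {
            return mid;                    // found it
        }
    }
    return -1;                             // not in the array
}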

The hard thing about debugging algorithms in Java, for me, is that there's a lot more infrastructure to set up before you can actually run the algorithm.  Java still feels clunkier than JavaScript -- just because it's less permissive.  One simple example: I wanted to feed an array into a function as test data.  In JavaScript you would just write:

foo([1,2,3,4,5,6])

but after a half-hearted Google search and consultation with a friend, I learned that Java wants you to write

foo(new int[]{1,2,3,4,5,6})
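
or, as a complete snippet you can actually compile (foo here is just a hypothetical stand-in that prints its input):

public class Demo {
    static void foo(int[] data) {
        for (int x : data) {
            System.out.println(x);
        }
    }

    public static void main(String[] args) {
        foo(new int[]{1, 2, 3, 4, 5, 6});  // inline array creation
    }
}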

That new int[]{...} thing is an array creation expression -- I was thinking of anonymous objects from C#, which isn't quite right.  Either way, I didn't think of it beforehand.  So it's one of these languages that's just picky.  And then there are different methods inside of a class that have to be called.  And then there's all this data going through the methods.  So I also have a lesson for today:

If you want to debug an algorithm running on a big data set, find a representative portion of the data that you understand and work with that.

Otherwise you'll just get lost.

I'll still have to think a bit more to figure out how to debug recursive functions -- because in that case, the call-stack kind of becomes your big data set.  And it's hard to think about what things might be going wrong inside a very big call stack.

One more obvious moral for the day as a parting gift:

Don't run binary search on arrays that haven't been sorted.

LOL.
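
In Java, the standard-library version of that moral is a one-liner -- sort before you search:

import java.util.Arrays;

public class SortFirst {
    public static void main(String[] args) {
        int[] a = {31, -4, 12, 0, -19};
        Arrays.sort(a);  // Arrays.binarySearch is undefined on unsorted input
        System.out.println(Arrays.binarySearch(a, 12));  // prints 3
    }
}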

[1] Another thing I thought about was how you make sure you get all the unique triples that sum to zero.  That's why, in the algorithm I'm following, each inner loop starts at one plus the index of the outer loop.  I think a long time ago I tried to solve this problem by deleting elements from the array once the program had found mates for them -- but that couldn't have been the most efficient way to do it.  So in the final version of the algorithm, you look at each element in the array, pair it with each of the elements following it, and accept only binary search results that land later in the array than the pair.  And that raises another issue -- you can only run this version of the algorithm on a set.  If there are duplicate values that sum to zero with a pair, binary search might land on any one of them, not necessarily one that comes after the pair.  That's when you would need to start thinking about deleting things from the array -- but that would screw up the rest of the book-keeping.
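
Putting the footnote together, here's a sketch of how I understand the whole O(n^2 log n) version -- my reconstruction under the no-duplicates assumption, not the course code:

import java.util.Arrays;

public class ThreeSum {
    // Counts triples i < j < k with a[i] + a[j] + a[k] == 0.
    // Assumes all values are distinct, for the reason given above.
    static int count(int[] a) {
        Arrays.sort(a);  // binary search needs sorted input
        int count = 0;
        for (int i = 0; i < a.length; i++) {
            for (int j = i + 1; j < a.length; j++) {  // inner loop starts one past the outer
                int k = Arrays.binarySearch(a, -(a[i] + a[j]));
                if (k > j) {  // accept only matches that come after the pair
                    count++;
                }
            }
        }
        return count;
    }

    public static void main(String[] args) {
        // Small, hand-checkable test data: the four triples are
        // (-40, 0, 40), (-40, 10, 30), (-20, -10, 30), (-10, 0, 10).
        System.out.println(count(new int[]{30, -40, -20, -10, 40, 0, 10, 5}));  // prints 4
    }
}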

Comments

  1. "If you want to debug an algorithm running on a big data set, find a representative portion of the data that you understand and work with that."

    Yes, that is a great lesson to learn. Always start with the simplest representation of the problem and master that before adding additional layers to it.

