Esquire Theme by Matthew Buchanan
Social icons by Tim van Damme

hit counter
hit counter



Sum it Up


Problem: you are given a sequence of numbers from 1 to n-1 with one of the numbers repeating only once. (example: 1 2 3 3 4 5). how can you find the repeating number? what if i give you the constraint that you can’t use a dynamic amount of memory (i.e. the amount of memory you use can’t be related to n)?
what if there are two repeating numbers (and the same memory constraint?)

Read More



Road to Data Science

Post by Adriano Stephan:

Sort of like the Paris metro map for Machine Learning :-) see the full-blown version here:
View Post on Quora

Scalable machine learning :: video lectures by Alex Smola

Post by Adriano Stephan:

Scalable machine learning :: video lectures by Alex Smola
View Post on Quora

How can I use Nate Silver’s methods to accurately predict future events?

Answer by Jay Wacker:

1.) gather an immense amount of data
2.) perform relatively simple statistical analyses
3.) add expert knowledge to fix naive use of statistics

The only thing that Nate Silver (and a dozen or so different groups) showed is that the polls were accurate and treating them in a straightforward honest manner gives a more accurate answer than any single poll.

To be fair, Nate Silver corrected for numerous subtleties of the polls having to do with systematic bias, whether possibly intentional or not. They deweighted polls that were historically off or systematically disagreed with other polls.   However, all of this expert knowledge corrections they introduced was independently created by several other groups, showing that standard practices gave the right result.

The key thing about taking this method forward is that without the polls (he had hundreds of polls, each of which with thousands of properly sampled and corrected for demographic populations), Nate Silver had exactly bumpkis, which he would say as well.  The polls were his data.  Getting the data is expensive and hard.  Analyzing it correctly, if not easy, is something that hundreds of thousands of scientists are trained in doing.

So if you have extensive polling, I’d advise averaging them together combining them weighting them inversely proportional to their error.  If you have historical information, identify ways that the polls systematically skew and correct for that.   However, these events are few and far between.
View Answer on Quora

What is a simplified explanation and proof of the Johnson-Lindenstrauss lemma?

Answer by Alex Clemmer:

Why are random projections probably nearly as effective as carefully designed ones?

This question you have asked in the “details” section is the key question of the JL lemma. And the unfortunate truth is that they aren’t. Or rather, they are, but only when your data are all over the place.

To see why, consider this toy example I drew. (Note that this example was picked because we can visualize 3 dimensions, and not because the math of the JL lemma is useful here! It is for intuition only.)

This brings us to our first point: in order to understand a “simplified” version of the JL lemma, we must understand:

What is the JL lemma really saying about projections?

Intuitively, the JL lemma says is this: if you pick a random subspace and project onto it, the scaled pairwise distances between points will probably be preserved.

This will be true regardless of the pointset you have. But you’ll notice that in the example on the right, some planes seem to be “better” than others. In the pointset on the left, you could probably project onto any plane, and it would almost certainly be equally bad. But the data on the right seem to lie close to a plane, so intuitively, the planes “close” to the data seem to be “less bad.”

So on the one hand, the JL lemma is telling us that pairwise distances are probably not distorted. But on the other hand, geometry is telling us that some projections are “better” than others. And this mismatch tells us something interesting about random projections:
  • Pairwise distance does not tell us everything there is to know about dimensionality reduction. The JL lemma by itself is not able to tell us why some projections are worst than others in the dataset on the right. All it tells us is that the scaled pairwise distance is not distorted too much.
  • But it’s still pretty useful. For example, if you were running approximate nearest neighbors or something, then you could pick a random projection and reduce the dimension significantly, but still be mostly correct.

So in some sense, the JL lemma seems to work because pairwise distances are not quite as important for dimensionality reduction as we might have hoped. Still, it is interesting that they would be this resilient to random projection, and it seems worth wondering why.

A more formal definition of the JL lemma

First I’ll state the JL lemma, and then demonstrate the intuition using a caricature of a proof, which should give you a good idea why it is true.

Proposition 1 (the JL lemma): For some k \ge O(\log m / \varepsilon^2) (where \varepsilon is our chosen error tolerance), with high probability, map f : \mathbb{R}^d \rightarrow \mathbb{R}^k does not change the pairwise distance between any two points more than a factor of (1 \pm \varepsilon), after scaling by \sqrt{n/k}).

There are a couple of things that are good to notice about the JL lemma that might help you to know when it’s applicable to your problems.
  • It makes statements for reduction from high space to “medium” space. It does not really work in the same way for extremely small space, like \mathbb{R}^1.
  • According to the JL lemma, our choice of k (recall we’re mapping from \mathbb{R}^d to \mathbb{R}^k) should depend only on the number of points m and our error tolerance \varepsilon.

A Caricature of a Proof

[Note: I believe this “caricature” is due to Avrim Blum, but I can’t find it, and so can’t be sure.]

The gist of the proof is this: the JL lemma follows from the fact that the squared length of a vector is sharply concentrated around its mean when projected onto a random k-dimensional subspace. What we aim to show in this section is why this would even be true or useful for proving the JL lemma.

We will begin as something that doesn’t really resemble the JL lemma, but by the end it will hopefully become clear what the relationship is.

First, let’s say we randomly sample m points that lie on the surface of a d-dimensional sphere. These points can also be seen as random unit-length vectors.

We would like to look at how individual coordinates behave. So we will simplify the picture somewhat.

Since they lie on a d-dimensional spere, they are points in \mathbb{R}^d. But, notice that, regardless of the size of d, we can say that these points lie in an m-dimensional subspace, because there are only m points. So we’ll say that they lie in \mathbb{R}^m.

Let’s start by looking at the first coordinate. By standard probability theory, we know that \mathbb{E}[x^2_1] = 1/m. Note that as m gets bigger, the concentration rises precipitously, meaning that the actual values of the coordinates will be sharply concentrated around this value.

Now we’d like to extend this intuition to look at all of the coordinates. Since the value of these coordinates will be sharply concentrated around 1/m for large m, we can see that the coordinates kind of sort of look iid. They’re not really, because if one coordinate is large, the others are necessarily small, but this “sharpness” means that they are “almost” iid. This is not a real proof, right? So let’s say they’re iid for illustrative purposes.

If they’re iid, then we can apply our favorite Chernoff/Hoeffding bound to say that, with really high probability, all the coordinates will be really really close to being 1/m in size. For example, p(|(x_1^2 + \ldots + x_k^2) - k/n| \ge \varepsilon k/n] \le 1/(e^{O(k \varepsilon^2)} ). Remember, this is if they’re iid, which of course they aren’t, but they kind of look like they are. This is intuition.

At this point we’re ready to look at random projections. The basic idea is, we’re going to take a random plane and project our unit vectors onto it. Here’s are a series of “random” examples that I made up (note that the plane of projection is translated to the origin).
But it turns out that we can look at vector projections in a more interesting way. Projecting from \mathbb{R}^m to \mathbb{R}^k is basically the same thing as randomly rotating the vector, and then reading off the first k coordinates.

Now we get to the JL lemma: in order for the JL lemma to be basically true, the distance between the two vectors must be almost the same after we scale it accordingly! In the picture above, the projected vectors each have a dotted line going between them, depicting the distance. So, in other words, that vector needs to be almost the same scaled length as it would be in the original example.

Unsurprisingly, it turns out that this is true. Here’s the math that justifies it.

Using the above Chernoff-Hoeffding bound, we see that at k = O(\frac{1}{\varepsilon^2} \log n), then with probability 1 - O(n^p) (for almost whatever positive integer choice of p you like), the projection to the first k coordinates has length \sqrt{k/n} \cdot (1 \pm \varepsilon).

Now, let’s look at the distance between some pair of vectors. Let’s say our regular non-projected vectors are called \vec{v}_1 and \vec{v}_2. Then the vector that represents the dotted line between the original vectors would be \vec{d} = \vec{v}_2 - \vec{v}_1.

And here’s the punch line. We can take that “distance vector” that goes between our original vectors, and use the same projection argument as above. Thus, with high probability (by the union bound), the length of all these “distance vectors” \vec{d} project to length \sqrt{\frac{k}{n}} \cdot (1 \pm \varepsilon) ||\vec{d}||_2.

Toward more general arguments.

So that’s the intuition for why the JL lemma is true. As we said before,  the  JL lemma follows from the fact that the squared length of a vector is  sharply concentrated around its mean when projected onto a random k-dimensional subspace. This last section, I hope, explains roughly both why this would help us prove the JL lemma, and is a beginning on why it’s true on a more general basis.

If you want to know more, I hope this arms you nicely to figuring out the proofs floating around out there. I would say that Alexandre Passos's answer is a reasonable start. An excellent treatment exists in Foundations of Machine Learning, by Mohri et al. It’s an excellent book anyway, and you should read it just because.
View Answer on Quora

What are some good Undergraduate projects in Quantum Computing and algorithms?

Answer by Igor Markov:

Figure out how Shor’s algorithm works and implement a simulator for it in Java or C++. Then try to reduce the size of the circuit by various modifications and see how this affects the result of the algorithm (there are some modifications that would only slightly degrade the performance).

Figure out how the stabilizer formalism (Heisenberg representation) works and implement a simulator for stabilizer circuits, following the work of Daniel Gottesman or his 2004 paper with Scott Aaronson.  Find recent papers by Richard Jozsa on arxiv that propose a different (simpler) simulation technique, implement it, and compare empirically to the original algorithm in terms of runtime and memory usage.

Given a quantum circuit (either a circuit design or a physical device - different cases), how do you check that it does what it should do? If it misses one gate, can you find which gate is missing ? There isn’t much literature on this, but you can find a few reasonable papers.
View Answer on Quora

What is the most efficient way for a programmer to get good at algorithms without participating in programming competitions?

Answer by Igor Markov:

First, you need to be clear on the syntax of the language you are using. Even when you plan to focus on CS theory, it seems impractical to focus on algorithms w/o having a realistic language to express them (and use pseudocode only).  When teaching algorithms with C++, I see a lot of students who don’t understand pointers, arrays, references, templates, etc in C++, and this limits what they can do in practice (i.e., in projects).

Second, you need to understand standard algorithms.
The Cormen-Leiserson-Rivest-Stein book (3rd+ ed) is good for this (but does not explain how to implement anything in a particular language). Knuth’s TAOCP is the nuclear option for studying algorithms - even if you finish, it will take many years (also note that Knuth is still working on new volumes).

Third, you need to understand how to implement and use standard algorithms with a given language, e.g., using the STL in C++. The book by Josuttis on the C++ standard library (2nd ed) is excellent. You may also get a book on algorithms that shows C++ implementations, such as the one by Weiss (3rd ed). There are other sources that use Java or C#. However, experience shows that students who went through a C++ based course can migrate to Java and C#, but not as easily the other way around.

Fourth, practice algorithmic problem solving. This is harder, but the CLRS book has a number of great exercises. Solve them, implement them, and test your solutions. There are also books with algorithmic/programming problems, including interview questions. Should be easy to find on amazon. But becoming good at algorithmic problem-solving will take time.

Fifth, algorithm analysis. The CLRS book is also good for this.

Sixth, large-scale algorithm design and development. This comes with experience, including a lot of reading (research papers), trying out different ideas (research projects) and/or graduate studies. Probably beyond the scope of your question.

Update: in teaching algorithms and data structures (e.g., in the current semester at the University of Michigan), an extremely efficient tool is an autograder. Students submit their project online and get the results of testing on hidden testcases within 10 minutes. This provides a feedback loop that is just not possible otherwise. Sometimes students cannot get the right output, sometimes their algorithms/programs are too slow. However, when they know just how good /poor their solution is, they can focus on improving their solution. If you are not taking a course at a university, you can get a similar environment in online programming competitions in training mode. Google Code Jam would be one of those.
View Answer on Quora

What data structures/matching algorithims does vimdiff use?

Answer by Anirudh Joshi:

Diffs are part of the Longest common subsequence problem ( This is an NP-hard problem that basically tries to find continuous sequences in arbitrary data - i.e. many source files/DNA shot gun sequencing.

However diffs only operates on 2 sequences reducing the complexity of the search to O(m*n) with m/n being the length of the files (thanks to Peter Scott).

I’m pretty sure vimdiff uses the standard diff tool ( to attack this problem:

This only works when a standard “diff” command is available.  See ‘diffexpr’.


On Unix-based systems, Vim should work without problem because there should be a “standard” diff program available


Diff mainly uses the algorithm described in the paper, An O(ND) Difference Algorithm and its Variations (, found from

Other references are included in the source code found here :

The basic algorithm is described in “An O(ND) Difference Algorithm and its Variations”, Eugene W. Myers, ‘Algorithmica’ Vol. 1 No. 2, 1986, pp. 251-266; and in “A File Comparison Program”, Webb Miller and Eugene W. Myers, ‘Software—Practice and Experience’ Vol. 15 No. 11, 1985, pp. 1025-1040. The algorithm was independently discovered as described in “Algorithms for Approximate String Matching”, E. Ukkonen, `Information and Control’ Vol. 64, 1985, pp. 100-118.

Code for various languages can be found at

More discussion here:
View Answer on Quora

How does golang’s memory management compare to Java?

Answer by William Ting:

Go’s garbage collector is fairly rudimentary for now. It is a stop-the-world, mark-and-sweep, non-generational garbage collector. One of the biggest complaints are the periodic pauses when the GC runs.[0] This lag is an issue for latency sensitive applications likes games, audio processing, etc.

On the other hand, Java provides a few different GC implementations that allow for different performance optimizations (e.g. throughput vs latency) including:

Also, tuning the JVM / GC is a fairly common Java practice while Go’s GC lacks any tuning knobs.[1]

Keep in mind that Go is a relatively young language with a single company backer. Go 1.1 was released in the last month, and they’re showing >30% improvements in certain benchmarks. That’s a huge flag that the implementation is still immature and there is a lot of room for growth.

In contrast, Java has been heavily used in enterprise for >15 years and the JVM has received a lot of corporation support over that time. I don’t have hard numbers, but there’s easily been 10-100x more developer time and money spent on Java compared to Go.

[0]: go-language-discuss Garbage collecting stop the world for about 10 seconds
[1]: Java SE 6 HotSpot[tm] Virtual Machine Garbage Collection Tuning
View Answer on Quora

Why doesn’t Google use Ruby?

Answer by User:

Google writes almost all of its infrastructure from scratch. We cant even use jQuery (not because its third party software because it is approved for production use, and has been used), but because of the infrastructure. It would need to support testing (in the google way), parsing protocol buffers, creating Google UI elements (inspect their drop down boxes and you will notice they aren’t native html inputs), xsrf, internationalization, etc. etc. etc.

Closure is a POS. I don’t really care what anyone else at Google says. It’s slooooooooow and extremely verbose. The benefits you get from the compiler are negated by the dependencies of that one package you include which has a dozen other dependencies. Still I would choose closure over Jquery because there would be a mountain of work I would need to do to get Jquery to work on Googles stack. Also all the plugins for jquery would not be authorized, and further negate the benefits.

Ruby is in the same boat. There’s a butt load of infrastructure it would need to support, and Google simply does not care enough to authorize Ruby as a usable language. If one wanted to use Ruby at Google you would have to be in charge of adding support for all the services and infrastructure Google has.

Ruby also does not scale as well as Java/c++. Engineers at Google like to scale one order of magnitude beyond what they anticipate their products load being. This means doing things like writing back-ends in C++ for services with < 100 qps because some day we might have 1000qps and our java backend which is easier to maintain, and much more pleasant to code won’t be able to handle the 1000qps unless we deploy one more cluster….

The final nail in the coffin is the fact that Ruby does not compile to a binary. In order for code to be deployed in Googles data centers, it must be in a binary form. So you would need some sort of binary (like the ruby interpreter).
View Answer on Quora