Answer by Rajath Shashidhara:
In my view, being a GSoC’er is about grasping code quickly and learning independently.
Follow these steps:
1. Install Linux. Master the shell. Learn things on your own. Google stuff.
2. Choose a programming language and work on it. Understand its general structure, in the sense that if you are given a new API tomorrow and asked to code with it, you should be able to read the API’s documentation and use it in your code.
3. Once you have mastered the art of learning code independently, choose an open source organization. Subscribe to its mailing list. Introduce yourself as a newbie to open source development. Don’t be shy. In my experience, open source communities are very supportive and helpful. You will experience a few frustrating days. Don’t give up. Set up the development environment. Then browse the bug list for bugs that catch your eye and that look like ones you can reproduce. Browse the code, and look for something called a devguide, which describes the purpose of each variable or method. Then try to relate the code to the bug. Get help from the community at each step; they will guide you. At times this can get very frustrating, so stay focused. If at any point you feel that you are not enjoying this challenge, you should probably rethink it and make a wise decision. This way you get used to bug fixing. Version control is another important thing: learn a version control system and use it. Once you have done this, you have reached a higher level. Now you are an open source contributor. Congrats! Bond with the developer community.
Choose an organization smartly. If you have done the above for a considerable amount of time, you have established yourself as an open source developer. Then getting into GSoC will not be difficult at all. Your community will be happy to accept you. :)
For the detailed experience, read my blog post “My first step into OpenSource Development” on My Explorations.
Answer by Prakash Gamit:
There are five types of I/O models.
Diagrams are for network I/O, but they would be similar for disk I/O or any other I/O operation.
1. Blocking I/O model
Block if request cannot be completed immediately.
2. Nonblocking I/O model
Do not block if the request cannot be completed immediately; return an error (EWOULDBLOCK) instead.
This mode is enabled by setting the O_NONBLOCK flag using fcntl(). It is often a waste of CPU time, as it uses polling.
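As a rough illustration of the nonblocking model, here is a minimal C sketch. The helper names `set_nonblocking` and `try_read` are made up for this example, and the `-2` return code for "would block" is an arbitrary convention of the sketch, not part of any API.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* Put a descriptor into nonblocking mode by adding O_NONBLOCK to its
 * file status flags. Returns 0 on success, -1 on error (errno is set). */
int set_nonblocking(int fd)
{
    int flags = fcntl(fd, F_GETFL, 0);
    if (flags == -1)
        return -1;
    return fcntl(fd, F_SETFL, flags | O_NONBLOCK) == -1 ? -1 : 0;
}

/* Hypothetical helper: attempt a single read.
 * Returns the number of bytes read (0 means end of file), -1 on a real
 * error, or -2 when no data is available yet, i.e. the call would have
 * blocked on a blocking descriptor. */
ssize_t try_read(int fd, void *buf, size_t len)
{
    assert(buf != NULL);
    ssize_t n = read(fd, buf, len);
    if (n == -1 && (errno == EAGAIN || errno == EWOULDBLOCK))
        return -2;
    return n;
}
```

A caller that loops on `try_read` until it stops returning `-2` is doing exactly the polling that wastes CPU time.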
3. I/O Multiplexing model
Call select() or poll() and block in one of these system calls, instead of blocking in the actual I/O system call.
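A minimal sketch of the multiplexing model using poll() might look like the following. The function names, the `MAX_WATCHED` cap, and the return-code conventions are assumptions made for this example.

```c
#include <assert.h>
#include <poll.h>
#include <stddef.h>
#include <unistd.h>

#define MAX_WATCHED 16  /* arbitrary cap chosen for this sketch */

/* Block in poll() until one of the n descriptors can be read without
 * blocking, or until the timeout (in milliseconds, -1 = wait forever)
 * expires. Returns the index of the first such descriptor, -1 on
 * timeout, or -2 on error. */
int wait_readable(const int *fds, size_t n, int timeout_ms)
{
    struct pollfd watched[MAX_WATCHED];

    assert(fds != NULL && n > 0 && n <= MAX_WATCHED);
    for (size_t i = 0; i < n; i++) {
        watched[i].fd = fds[i];
        watched[i].events = POLLIN;
        watched[i].revents = 0;
    }

    int ready = poll(watched, (nfds_t)n, timeout_ms);
    if (ready < 0)
        return -2;
    if (ready == 0)
        return -1;

    for (size_t i = 0; i < n; i++) {
        /* POLLHUP means the peer closed; read() will return without blocking. */
        if (watched[i].revents & (POLLIN | POLLHUP))
            return (int)i;
    }
    return -2;  /* only error conditions (POLLERR/POLLNVAL) were reported */
}

/* Wait for any descriptor, then perform the actual read on it.
 * Stores the chosen index in *which. Returns read()'s result, or -1 if
 * nothing became readable. */
ssize_t read_first_ready(const int *fds, size_t n, int timeout_ms,
                         void *buf, size_t len, size_t *which)
{
    int idx = wait_readable(fds, n, timeout_ms);
    if (idx < 0)
        return -1;
    *which = (size_t)idx;
    return read(fds[idx], buf, len);
}
```

The process blocks in one place (poll) for many descriptors, and the read() that follows returns promptly because the kernel has already reported the descriptor as ready.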
4. Signal-driven I/O model
Tell the kernel to notify the application with the SIGIO signal when the descriptor is ready.
5. Asynchronous I/O model
Tell the kernel to start the operation and to notify the application when the entire operation (including the copy of data from the kernel to the application buffer) is complete.
The following functions are available in C for asynchronous I/O on a Linux 3.5 system (they might be available in earlier versions also):
aio_read(3) Enqueue a read request. This is the asynchronous analog of read(2).
aio_write(3) Enqueue a write request. This is the asynchronous analog of write(2).
aio_fsync(3) Enqueue a sync request for the I/O operations on a file descriptor. This is the asynchronous analog of fsync(2) and fdatasync(2).
aio_error(3) Obtain the error status of an enqueued I/O request.
aio_return(3) Obtain the return status of a completed I/O request.
aio_suspend(3) Suspend the caller until one or more of a specified set of I/O requests completes.
aio_cancel(3) Attempt to cancel outstanding I/O requests on a specified file descriptor.
lio_listio(3) Enqueue multiple I/O requests using a single function call.
The aiocb (“asynchronous I/O control block”) structure defines parameters that control an I/O operation. An argument of this type is employed with all of the functions listed above. This structure has the following form:
```c
#include <aiocb.h>

struct aiocb {
    /* The order of these fields is implementation-dependent */

    int             aio_fildes;     /* File descriptor */
    off_t           aio_offset;     /* File offset */
    volatile void  *aio_buf;        /* Location of buffer */
    size_t          aio_nbytes;     /* Length of transfer */
    int             aio_reqprio;    /* Request priority */
    struct sigevent aio_sigevent;   /* Notification method */
    int             aio_lio_opcode; /* Operation to be performed;
                                       lio_listio() only */

    /* Various implementation-internal fields not shown */
};

/* Operation codes for 'aio_lio_opcode': */

enum { LIO_READ, LIO_WRITE, LIO_NOP };
```
When any of the above functions is called, it creates a new thread and returns to the user process immediately. The new thread waits for the I/O operation to complete while the user process continues its execution. When the I/O operation completes, the user process is notified by delivery of the signal specified in the aiocb struct.
The current Linux POSIX AIO implementation is provided in userspace by glibc. This has a number of limitations, most notably that maintaining multiple threads to perform I/O operations is expensive and scales poorly.
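To show how these calls fit together, here is a minimal sketch that reads part of a file with aio_read(3) and waits for completion with aio_suspend(3). The function name `aio_read_file` is made up for this example; the sketch includes `<aio.h>`, the header programs normally use, and it asks for no signal notification (SIGEV_NONE) to stay short. On older glibc versions you may need to link with -lrt.

```c
#include <aio.h>
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <signal.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: read up to len bytes of the file at path,
 * starting at offset, using POSIX AIO.
 * Returns the number of bytes read, or -1 on failure. */
ssize_t aio_read_file(const char *path, void *buf, size_t len, off_t offset)
{
    struct aiocb cb;
    const struct aiocb *pending[1];

    assert(path != NULL && buf != NULL);
    int fd = open(path, O_RDONLY);
    if (fd == -1)
        return -1;

    memset(&cb, 0, sizeof cb);
    cb.aio_fildes = fd;
    cb.aio_buf = buf;
    cb.aio_nbytes = len;
    cb.aio_offset = offset;
    cb.aio_sigevent.sigev_notify = SIGEV_NONE;  /* no signal; we wait explicitly */

    if (aio_read(&cb) == -1) {      /* enqueue the request; returns at once */
        close(fd);
        return -1;
    }

    /* A real program would do other useful work here. This sketch simply
     * sleeps in aio_suspend() until the request is no longer in progress. */
    pending[0] = &cb;
    while (aio_error(&cb) == EINPROGRESS)
        (void)aio_suspend(pending, 1, NULL);

    int err = aio_error(&cb);       /* 0 on success, else an errno value */
    ssize_t n = aio_return(&cb);    /* must be called exactly once */
    close(fd);
    return err == 0 ? n : -1;
}
```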
Sources:
1. Unix Network Programming, Volume 1, 3rd edition, W. Richard Stevens (Chapter 6)
2. aio(7) (aio man page)
Answer by Daniel Kinzler:
Lots of pretty diagrams here! Let me add mine to the mix :)
Answer by Aditya Jain:
Cosmin Negruseri has already given a good list that covers most of the books I will be mentioning. In answering this question, I am assuming that you are familiar with most of the material in Introduction to Algorithms by Cormen et al.
From here on, depending on your specific needs and interests, you can pick up any of the following books:
1. Advanced Data Structures
Foundations of Multidimensional and Metric Data Structures (The Morgan Kaufmann Series in Computer Graphics), Hanan Samet (ISBN 9780123694461)
This book probably has every data structure you would ever need. The book is huge, so you might just want to read the relevant sections.
2. Randomized Algorithms
Randomized Algorithms, Rajeev Motwani and Prabhakar Raghavan (ISBN 9780521474658)
Probability and Computing: Randomized Algorithms and Probabilistic Analysis, Michael Mitzenmacher and Eli Upfal (ISBN 9780521835404)
3. Approximation Algorithms
Approximation Algorithms, Vijay V. Vazirani (ISBN 9783642084690)
4. Computational Geometry
Computational Geometry: Algorithms and Applications, Mark de Berg, Otfried Cheong, Marc van Kreveld, and Mark Overmars (ISBN 9783642096815)
5. String Algorithms
Algorithms on Strings, Trees and Sequences: Computer Science and Computational Biology, Dan Gusfield (ISBN 9780521585194)
6. Algorithmic Game Theory
Algorithmic Game Theory, Noam Nisan, Tim Roughgarden, Eva Tardos, and Vijay V. Vazirani (ISBN 9780521872829)
7. Distributed Algorithms
Distributed Algorithms (The Morgan Kaufmann Series in Data Management Systems), Nancy A. Lynch (ISBN 9781558603486)
Answer by Alex Clemmer:
In order to address the stated question (“Why isn’t supervised machine learning more automated?”), we have to begin to understand how complicated supervised learning really is. (We’ll get to the other questions in the “details” section in a minute.)
Here’s how complicated it is.
Say you work at a lab somewhere. Your boss, Professor George O’Jungle, hands you a box marked “data”. He mumbles, “use machine learning to classify this data”, and then walks away. Here is a nifty picture:
[fig 1: George, pictured here in a skirt due to it being International Subvert Patriarchy Day]
You’re pretty busy, and groan out loud:
Q: Is there some way to automate this learning task?
A: No, at least not at this point in time. You don’t even know what’s in the box. It could be hamsters or something equally ridiculous. Before we do machine learning, we must know what our observable phenomena are.
Ok, so you decide to find out. You open the box. Inside are papers with numbers on them.
Great, the empirical phenomena are something sane. If it was, like, bird migration patterns, we’d have to turn real observations about that into useful learnable data. But here we have just a bunch of numbers. Computers are good at doing stuff with numbers, so it looks like we’re getting something for free.
So now you ask:
Q: Now can I automate this learning task?
A: Still no. Your data are just numbers on a paper. In order to do machine learning, there must be a meaningful interpretation of our data. Interpreting the data is often called “cleaning” the data. Sometimes this step is trivial. For example, if your data are JPEG images, then you should probably just “interpret” each binary as a JPEG image. In other cases, it is less straightforward. For example, if your data are emails, then you’ll want to remove the headers, HTML tags, images, and other vestigial data that could needlessly harm your algorithm. In other words, you are “interpreting” the data as a set of emails, where “email” is really just “text in the body” or something. Not doing this correctly can literally ruin your ability to do machine learning, so don’t ignore it!
Anyway, you now set off to find out how all these pages of numbers should be interpreted. You ask Professor George. He explains that they are images. He shows you how to use his expensive image-interpreting machine. You feed all the papers into this image-interpreting machine:
[Cow image sources: File:Cow female black white.jpg, File:Cow-IMG 2050.JPG]
Now you have a nice stack of images. So once again, you come back to your original question:
Q: Ok, now can I automate this learning task?
A: Sorry, still no. You may have a useful interpretation of your data, but you have not specified your hypothesis set. A hypothesis basically maps things to some set of classes (i.e., it “classifies” things), and a hypothesis set is the set of possible hypotheses. Examples of hypothesis sets are the SVM and the perceptron. Examples of a hypothesis are weight vectors that “classify” your data. Note that hypothesis sets make assumptions about the data (for example, independence assumptions), so choosing the right model is often a balance between hypotheses that are (1) tractable to learn, and (2) expressive. In order to do machine learning, you need to know which hypothesis class you’re using, what your learning algorithm is (e.g., gradient descent), and what classes you’re mapping to.
So once more you consult Professor O’Jungle. He explains that your task is to separate pictures into piles of those that contain cows and those that do not. You can use whatever hypothesis set and learning algorithm you like. “Great!” you think. “I choose the SVM, with whatever learning algorithm is fastest, and the rest will be easy.” You sit back and grin.
Q: … Because now I can completely automate this learning task, right?
A: Wrong. You may have cleaned data, and you may have a task, but you did not specify which features of the images are relevant to learning this task. The reason is that your hypothesis set is making assumptions about your data. In particular, most hypothesis sets designed for classification assume that each data point is a d-dimensional real vector, which can be interpreted as a point in d-dimensional space; these points need to be “separated.” Why and how is a technical question for another time. The question for now is: how do we take images and turn them into vectors?
You head back yet again to prof. O’Jungle. Your consolation seems to be that we must be most of the way to automating the learning by now. Or you grimly hope so anyway. You ask what features he wants to use. He slaps his forehead and exclaims that he is an idiot for forgetting to show you his feature-extract-o-tron. It’s not really automatic, because someone had to make it especially for this problem. And in general, it is not clear how to automatically select these features. One good attempt is convolutional deep belief nets, but they’re not quite “there” yet. And anyway, while this solution isn’t really automatic, it is automatic for you. And that’s good enough.
You slump back at your desk. This had better work. You feed all the images in and get back some nice vectors of real numbers:
Finally, you think:
Q: Ok, surely now I can automate this learning task, right?
A: Sorry, but no. Now you have to do the parameter tuning.
The good news is that this is pretty much automatic. First you randomly split your data into “training” and “testing” sets. Then pick a set of values that seems reasonable for your parameters. Then you train your model on the “training” data. Then you test the model on the “test” data. Finally, you select whichever model had the best results.
There probably isn’t a way to completely automate this step away in general, but it’s already pretty close.
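The tuning loop described above can be sketched in a few lines of C. This is only an illustration under stated assumptions: it swaps the SVM from the story for a tiny one-feature k-nearest-neighbour classifier, the parameter being tuned is k, the `example` type and all function names are made up, and the train/test split is assumed to have been done already by the caller.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_TRAIN 64  /* arbitrary cap chosen for this sketch */

typedef struct {
    double x;      /* single real-valued feature */
    int    label;  /* class: 0 or 1 */
} example;

static double distance(double a, double b)
{
    return a > b ? a - b : b - a;
}

/* Classify x by majority vote among its k nearest training examples.
 * Ties in distance go to the example that appears first in train. */
int knn_predict(const example *train, size_t n, int k, double x)
{
    int used[MAX_TRAIN] = {0};
    int votes = 0;

    assert(n <= MAX_TRAIN && k > 0 && (size_t)k <= n);
    for (int picked = 0; picked < k; picked++) {
        size_t best = n;  /* n means "none chosen yet" */
        for (size_t i = 0; i < n; i++) {
            if (used[i])
                continue;
            if (best == n || distance(train[i].x, x) < distance(train[best].x, x))
                best = i;
        }
        used[best] = 1;
        votes += train[best].label;
    }
    return 2 * votes > k;
}

/* Fraction of test examples that the model with parameter k gets right. */
double accuracy(const example *train, size_t n_train, int k,
                const example *test, size_t n_test)
{
    size_t correct = 0;

    assert(n_test > 0);
    for (size_t i = 0; i < n_test; i++)
        correct += knn_predict(train, n_train, k, test[i].x) == test[i].label;
    return (double)correct / (double)n_test;
}

/* Try every candidate value of k and keep the one with the best test
 * accuracy (the first one wins on ties). Stores that accuracy in *best_acc. */
int tune_k(const example *train, size_t n_train,
           const example *test, size_t n_test,
           const int *candidates, size_t n_candidates, double *best_acc)
{
    int best_k = 0;

    assert(n_candidates > 0);
    *best_acc = -1.0;
    for (size_t c = 0; c < n_candidates; c++) {
        double acc = accuracy(train, n_train, candidates[c], test, n_test);
        if (acc > *best_acc) {
            *best_acc = acc;
            best_k = candidates[c];
        }
    }
    return best_k;
}
```

The only human input left is the list of candidate values, which is why this step feels close to automatic.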
So you end up with a well-tuned model and a nice result. You look back at the process. None of the steps were really automatic, and in fact, it seems like it would be pretty hard to automate any of them.
You grumble to your colleagues about this: why do machine learning at all if it can’t be completely automatic?
And the answer is simple: because it actually saved you a lot of time already. Imagine if you had tried to program your cow detector by hand, without using ML. Sure, it’s not completely automatic, but no free lunch, right? (Actually it is provable that there is no free lunch, but that’s for a different post.)
You might complain that this is how you know machines aren’t “really intelligent”, but the bottom line is that if we didn’t have ML, we would not be able to quickly build things like search engines or cow detectors. That is, we could build them, but we’d build them much slower.
So: we use ML because it helps us to quickly deal with large volumes of arbitrary data. It isn’t magic, but it works well enough.
Misc: the “details” section of the question:
The question details ask why we can’t completely automate (1) parameter tuning and (2) feature selection. In the case of (1), the short answer is that parameter tuning is already the most automatic thing about supervised ML — you get a working classifier, then optimize it by running it over some range of parameters, and then select the “best” model. Easy. (2) is much harder, and the short answer is that we’re still working on automatic feature selection. Convolutional deep belief nets have great promise, but it will be many years before this is viable, if it ever turns out to be.
If you thought this was useful and you enjoy ML, you might also enjoy my Quora blog, which is primarily concerned with ML-ish problems.
Answer by Eric Talevich:
Go is a systems programming language. Google created it to safely solve three specific problems that C++ was biting them with: concurrency, memory management and compilation time on large systems. It’s concise and readable, like Scala and Julia, but a bit fussy and low-level, unlike Julia. You would use it to write safe, well-performing services that your main application might call to. It would certainly be possible to write scientific software in Go, much as you would with C or C++, but it doesn’t have the same flexibility and interactivity as, say, Python or Julia, or even Scala. If you’re trying to use Go for data analysis, the fussiness and lack of interactivity will disappoint you.
Julia is a scientific applications programming language, designed to be a good replacement for Matlab and Python+SciPy. It’s very new, so I won’t attempt to predict which specific features it will or won’t pick up, but the underlying motivations should drive it to be a very good programming language for scientific software — not syntactically optimized for statistical operations on data arrays like R, but good to write a computationally intensive program that uses multiple CPUs and generates pretty graphics.
Scala is a general-purpose language that sits between the other two, capable of dealing with either end of the spectrum. Twitter uses it for services. Scientists use it for aligning protein structures. It’s a good language, a few years older than the other two, and is built on the JVM, which has its own pros and cons. If you want your code to be something other programmers can use and/or build on, Scala is your best choice at the moment — it creates .jar files which can be used directly from plain Java, whereas Go doesn’t seem to be capable of creating reusable shared object (.so) files yet.
Answer by Jay Kreps:
There are a few reasons:
The first is that Kafka does only sequential file I/O. To enable this, Kafka enforces end-to-end ordering of messages in delivery. This means the consumer has a single position in the message stream, and this position can be stored lazily. Typically, messaging systems keep some kind of per-message state about what has been consumed and have to update it. This introduces all kinds of random updates to mark messages consumed. By contrast, Kafka keeps a single pointer into each partition of a topic, rather than per-message state. All messages prior to the pointer are considered consumed, and all messages after it are considered unconsumed. This eliminates most of the random I/O in acknowledging messages, since by moving the pointer forward many messages at a time we can implicitly acknowledge them all. As a side benefit, retaining order is good for other reasons (often the ordering has meaning). The reason most messaging systems don’t do this is that it is hard: it requires coordination among the consumers to “elect” consumers for each partition. We lean on ZooKeeper to manage this process of matching consumers to partitions of data on servers and keeping this matching up to date as the set of available consumers and brokers changes.
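The single-pointer idea can be illustrated with a toy in-memory model. This is not Kafka's code or API; the types `partition_log` and `consumer_position` and all function names are invented for the sketch, and the real log lives in files rather than an array.

```c
#include <assert.h>
#include <stddef.h>

#define LOG_CAPACITY 128  /* arbitrary cap chosen for this sketch */

/* Append-only log standing in for one partition of a topic. */
typedef struct {
    const char *messages[LOG_CAPACITY];
    size_t      length;  /* number of messages appended so far */
} partition_log;

/* The whole consumption state for one partition: a single position. */
typedef struct {
    size_t offset;  /* every message before this index counts as consumed */
} consumer_position;

/* Append a message at the end of the log. Returns 0, or -1 if full. */
int log_append(partition_log *log, const char *msg)
{
    if (log->length == LOG_CAPACITY)
        return -1;
    log->messages[log->length++] = msg;
    return 0;
}

/* Copy up to max unconsumed messages, in order, into out.
 * Does not move the position. Returns how many were copied. */
size_t log_fetch(const partition_log *log, const consumer_position *pos,
                 const char **out, size_t max)
{
    size_t n = 0;
    while (n < max && pos->offset + n < log->length) {
        out[n] = log->messages[pos->offset + n];
        n++;
    }
    return n;
}

/* Acknowledge count messages at once by moving the pointer forward.
 * No per-message flag is touched. */
void commit(consumer_position *pos, const partition_log *log, size_t count)
{
    assert(pos->offset + count <= log->length);
    pos->offset += count;
}

/* A message is consumed exactly when it lies before the pointer. */
int is_consumed(const consumer_position *pos, size_t index)
{
    return index < pos->offset;
}
```

Acknowledging a thousand messages costs the same single update as acknowledging one, which is where the saving in random I/O comes from.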
The second reason is that Kafka supports end-to-end batching of messages. Computers love linear scans and transfers with big arrays; they hate little bursty random messages. One prerogative of an asynchronous messaging system is the ability to introduce just a little delay to allow what would have been small bursty messages to turn into big fat ones. This speeds up network transfers, disk operations, and even in-memory iteration. We expose this as tunable parameters, so people who can stand a little extra latency can get a lot of extra throughput.
Finally, Kafka leans heavily on the OS pagecache for data storage. Although the question says that Kafka writes to disk immediately, that is not completely true. Actually, Kafka just writes to the filesystem immediately, which is really just writing to the kernel’s memory pool, which is asynchronously flushed to disk. There are a couple of reasons this is a good idea:
- Kafka runs on the JVM, and keeping data in the heap of a garbage-collected language isn’t wise. There are a couple of reasons for this. One is the GC overhead of continually scanning your in-memory cache; the other is the object overhead (in Java, a hash table of small objects tends to be mostly overhead, not data).
- Modern operating systems reserve all free memory as “pagecache”: basically contiguous chunks of memory that soak up reads and writes to disk. The nice thing about this is that on a 32GB machine you get access to virtually all of that memory automatically, without having to worry about the possibility of running out of memory and swapping.
- Unix has optimizations that allow you to write data in pagecache directly to a socket without any additional copying (aka sendfile). Any data sent on a socket has to cross the process/kernel memory boundary anyway. This means that if you keep data in your process and need to deliver that data to multiple consumers, you need to recopy it into kernel space, buffering on both sides, each time. This approach gets rid of all the buffering and copying and uses a single structure.
If you are interested in this stuff, there are a couple of more detailed write-ups. There is a more complete design document that discusses the trade-offs in more detail [1]. There is also a more recent write-up on the use of Kafka at LinkedIn which gives some performance and operational statistics [2].
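For readers who have not used it, the sendfile optimization mentioned in the list above looks roughly like this at the system call level. This is a Linux-specific sketch, not Kafka's code (Kafka runs on the JVM); the function name `send_file_to` is made up, and `out_fd` would normally be a connected socket.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <sys/sendfile.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: send the whole file at path to out_fd using
 * sendfile(2), so the data moves from the pagecache to the destination
 * inside the kernel without passing through a user-space buffer.
 * Returns the number of bytes sent, or -1 on failure. */
ssize_t send_file_to(int out_fd, const char *path)
{
    struct stat st;
    off_t offset = 0;

    assert(path != NULL);
    int in_fd = open(path, O_RDONLY);
    if (in_fd == -1)
        return -1;
    if (fstat(in_fd, &st) == -1) {
        close(in_fd);
        return -1;
    }

    /* sendfile() may send less than requested, so loop until done.
     * It advances offset by the number of bytes it sent. */
    while (offset < st.st_size) {
        ssize_t sent = sendfile(out_fd, in_fd, &offset,
                                (size_t)(st.st_size - offset));
        if (sent == -1 && errno == EINTR)
            continue;
        if (sent <= 0)
            break;  /* error or unexpected end of file */
    }

    close(in_fd);
    return offset == st.st_size ? (ssize_t)offset : -1;
}
```

Calling this once per consumer reuses the same cached pages each time, with no read() into the process and no write() back out.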
[1] http://incubator.apache.org/kafka/design.html
[2] http://sites.computer.org/debull/A12june/A12JUN-CD
Post by Gayle Laakmann McDowell:
Which vs. That
Post by Gayle Laakmann McDowell:
The difference between e.g. and i.e.
Answer by Charles H Martin:
When I am asked to interview people, I try to ascertain whether they know the math or not, and how to apply it in a real world context. I also look to see if they understand high performance computing and not just vanilla coding.
I was asked to do this as a consultant, acting as a subject matter expert to help interview junior people for the firm.
In our interviews, we asked a candidate to present some code they had written and to talk through it. For an ML person, it would be some kind of ML code.
So, for example, I was involved in an interview with a Physics PhD from MIT discussing some NMF code he wrote in JavaScript. The JavaScript was very good, and he would be fine doing GUI work, Node.js work, etc. Certainly not something I could do.
But can he do machine learning? Mind you, he has a PhD in a math-heavy subject from one of the top 10 schools in the world. So he should know the math.
I wanted to see if he knew how to get it to converge properly. He did not. He knew it was non-convex, but he did not know how to seed it, nor did he know about the convex variants. He tried to give me some nonsense about it being bi-convex and whatnot. Dude, just use k-means++ to seed it. That’s it. That’s all you had to say. This got totally past the VP of engineering and the CTO. (They were just impressed that machine learning involved computing a first-order derivative, something neither had seen since college calculus.)
So here, he knew some basic methods, but did not really know the most important ideas in the field, the important developments, how to really code this. It is clear that he had never done anything like this in his former work, nor did he really understand numerical methods.
This means that his solution would never work in production and — more importantly — that he would have no idea how to evaluate it or how to fix it. I see this a lot. Also, he did not know the available open source codes, how they worked internally, and which one to use, or how to evaluate their performance. For being a PhD from MIT, this was unacceptable to me.
There was also a code evaluation. For me, one needs to know what runs fast and what does not. What good is a method that only runs on 300 data points?! In this case, the interviewee had written his own JavaScript matrix library. Did he know the BLAS libraries and how they work? Or an alternative? This is critical because you can’t run anything in production if the code is too slow. I see the same problem with most Ruby coders: they just don’t know numerical computing.
I was not looking to evaluate 10,000 lines of complex code, or whether he used Agile or unit testing. Nor did I care about solving some high school brain teaser. I just wanted to see a small piece of code with good engineering choices, a good understanding of the math, and an idea of how to make this solution work in a modest production environment.
I’d rather see old-fashioned spectral clustering with a Fortran library, which can scale, than a “fancy” method like NMF or LDA that you cannot get to work in production at scale. (I’m not saying they don’t scale. I am saying you had better know how to get them to scale if you choose to use them.)
In another interview, the candidate was again a PhD (Ukrainian, I think) who was very bright, had solved some good problems, and had experience. He was using an off-the-shelf SVM tool, a tool I know very well. I asked a very basic question: how do you adjust the cost parameter for the SVM regularizer? I rephrased the question a couple of different ways to give him a chance. FAIL. In other words: did you read the documentation of the tool, and did you understand which parameters to tweak and which ones to leave at the default settings? (I kinda would like the person to have read the entire source code of the tool and know how it works.) Again, this demonstrates a failure to grasp the most basic mathematical concept in ML, regularization, and how it applies in production. Tuning this parameter can increase accuracy by 10-15% (or more). Again, just simple stuff, but important stuff. This also shows a lack of attention to the important details of the work. We actually offered this guy the job, and he asked for a salary way out of the ballpark. If he had not missed this critical question, he might have been able to make the case for the salary.
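For readers who want the math behind that question: in the standard textbook soft-margin formulation (not taken from the particular tool discussed here), the cost parameter is the constant C that weighs margin violations against the norm of the weight vector. The symbols w, b, and the slack variables are the usual ones and are introduced here only for illustration.

```latex
\min_{w,\,b,\,\xi}\ \frac{1}{2}\lVert w\rVert^{2} + C\sum_{i=1}^{n}\xi_i
\quad\text{subject to}\quad
y_i\,\bigl(w^{\top}x_i + b\bigr) \ge 1 - \xi_i,\qquad \xi_i \ge 0,\qquad i = 1,\dots,n.
```

A large C punishes every misclassified or in-margin training point heavily, which means less regularization and a greater risk of overfitting. A small C tolerates such points in exchange for a wider margin. That trade-off is why the default value is rarely the best one for a given data set.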
Having shared all this, I would add that I think that, for you, the market is very good and you will probably not encounter anything like this. Why? All you need to do is know more machine learning than the VP and the CTO, and here the bar is very low. Everyone and his brother has funding to do machine learning, and they usually just need to solve one small problem and get the product out the door. Most (i.e., 7/10) CTOs and VPs know absolutely nothing about even basic machine learning, so they have no clue even what to ask. (Newton-Raphson will blow them away, and they will think you are too expensive if you try to compare stochastic gradient descent to interior point methods.) They got their startup funded based on the market potential of the idea, and they are expected to hire people to invent their IP.
(Obviously if you are interviewing at Google or Lockheed Martin, disregard all of this and hire me once you get in)
P.S. I was once asked by some VP/CTO evaluating me what the volume of a rectangular prism is. All I could think of was the old Pink Floyd album The Dark Side of the Moon, with the prism on it:
http://en.wikipedia.org/wiki/The_Dark_Side_of_the_Moon
I would never ask this kind of question, but you will probably get asked many puzzle questions like this if you are fresh out of school (or an old man like me, I guess). I seem to recall there are books and/or web sites with tons of these.
Good Luck