Need faster machine learning? Take a set-oriented approach

http://feedproxy.google.com/~r/oreilly/radar/atom/~3/JLE7EciE-Mw/faster-machine-learning.html

We recently faced the type of big data challenge we expect to become increasingly common: scaling up the performance of a machine learning classifier for a large set of unstructured data.

Machine learning algorithms can help make sense of data by classifying, clustering and summarizing items in a data set. In general, performance has limited the opportunities to apply machine learning to big or messy data sets. Analysts need to budget time for speeding up off-the-shelf algorithms, or even ask whether a machine learning pass will complete in a timely manner. While using smaller random samples can help mitigate performance issues, some problems yield better results when the algorithm is applied to more data.

Here we share our experience implementing a set-oriented approach to machine learning that led to huge performance increases (more detail is available in a related post at O'Reilly Answers). Applying a set-oriented approach can help you expand the opportunities to gain the benefits of machine learning on larger, unstructured and complex data sets.

We are working with the US Department of Health and Human Services (HHS) on a project to look for trends in demand for jobs related to Electronic Medical Records (EMR) and Health Information Technology (HIT). The twist, and the reason we decided to build a classifier, is that we wanted to separate jobs for those using EMR systems from those building, implementing, running and selling EMR systems. While many jobs easily fit in one of the two buckets, plenty of job descriptions had duties and company descriptions that made classifying the jobs difficult even for humans with domain expertise.

Identifying the approximately 400,000 jobs with EMR and related references was achieved with high accuracy using a regular expression rule-base. All the job description data is stored on a multi-node Greenplum Massively Parallel Processing (MPP) database cluster, running a Postgres engine. Having an MPP database has been critical for analyzing the large, 1.4-billion-record data set we work with — we can generally run investigative queries against the full data set in minutes.
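As a rough illustration of what a regular expression rule base for this kind of filtering might look like, here is a minimal Python sketch. The patterns, the `EMR_RULES` list and the `mentions_emr` function are hypothetical and are not the rules used in the project (which ran against data stored in Greenplum); they only show the general shape of the technique.

```python
import re

# Hypothetical rule base: these patterns are illustrative assumptions,
# not the rules used in the project described above.
EMR_PATTERNS = [
    r"\belectronic (medical|health) records?\b",
    r"\b(EMR|EHR)s?\b",
    r"\bhealth information technology\b",
]
EMR_RULES = [re.compile(pattern, re.IGNORECASE) for pattern in EMR_PATTERNS]


def mentions_emr(description: str) -> bool:
    """Return True if any rule in the rule base matches the job description."""
    return any(rule.search(description) for rule in EMR_RULES)


print(mentions_emr("Seeking a nurse with EMR experience"))
print(mentions_emr("Line cook wanted for busy restaurant"))
```

The sketch deliberately leaves out a bare "HIT" pattern, since a case-insensitive match on that abbreviation would also fire on the ordinary word "hit".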

After some discussion, we decided a Naive Bayes classifier seemed appropriate for the task. While there are some open source Naive Bayes classifiers available in Python, such as those in NLTK and Orange, I decided to use the algorithm in Toby Segaran’s "Programming Collective Intelligence" so I could tweak the code and play with different feature arrangements. Toby does a great job of tying the code to the principles behind the Naive Bayes algorithm, and I thought that would help with modding and tuning the classifier for our purposes.

We had a tricky data set with categories that could be only subtly different. We wanted the classifier to be fast enough to iterate through the data many times so we could spend enough time training and tuning the algorithm to optimize classifier accuracy. Starting with a training set of 1,800 categorized jobs (phew...) and a random sample of 1,850 jobs, we set to work trying and reviewing different sets of feature combinations.

We ran into a Python-related problem early on that I think is worth sharing. Because job descriptions contain so many words, the probabilities the Naive Bayes algorithm multiplies together become exceedingly small, so small that Python's floating-point arithmetic rounded them to zero, producing suspiciously strange results. Luckily, I complained about this problem to a friend with a doctorate in math, who suggested taking the log of the probabilities, since the logs of very small numbers are moderate negative numbers rather than vanishingly small ones. Worked like a charm.

That’s when it hit me: job descriptions have lots of words, and they are often not carefully entered. That creates a large set of words and probabilities to work with, slowing down the algorithm. And the algorithm was written to explain how Naive Bayes works, not for maximum efficiency. With training and classifying the sample data running for more than six hours, we needed to do something to speed up the process to handle all 400,000 records we wanted to classify.
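The underflow problem and the log fix described above can be reproduced in a few lines of Python. This is a minimal sketch with made-up numbers: the per-word probability of 1e-4 and the 200-word description are assumptions chosen only to trigger the effect, not values from our data.

```python
import math

# Hypothetical per-word probabilities for one long job description.
word_probs = [1e-4] * 200

# Multiplying the raw probabilities underflows: the true product is 1e-800,
# far below the smallest positive float Python can represent.
product = 1.0
for p in word_probs:
    product *= p

# Summing the logs instead keeps the score in a comfortable range.
log_score = sum(math.log(p) for p in word_probs)

print(product)    # the product has collapsed to zero
print(log_score)  # a large but perfectly representable negative number
```

Because the logarithm is monotonic, the category with the largest sum of logs is also the category with the largest product, so the classification decision is unchanged.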

I contacted Daisy Zhe Wang, an EECS doctoral student at UC Berkeley and a consultant at Bayes Informatics, because of her focus on scaling in-database natural language and machine learning algorithms.

Daisy, together with Bayes Informatics founder Milenko Petrovic, developed a set-oriented approach to implementing the Naive Bayes algorithm. It treats the data derived from the training set (the features, i.e., words, and their counts) as a single entity, and it converts the Naive Bayes algorithm to Python User Defined Functions (UDFs). Since Greenplum is a distributed database platform, the UDFs let us parallelize the classifier process.

The result: The training set was processed and the sample data set classified in six seconds. We were able to classify the entire 400,000-record data set in under six minutes, an improvement in records processed per minute of more than four orders of magnitude (26,000-fold). A process that would have run for days in its initial implementation now ran in minutes! The performance boost let us try out different feature options and thresholds to optimize the classifier. On the latest run, a random sample showed the classifier working with 92% accuracy.

My simple understanding of their algorithm is that the training set results are treated as a model and stored as a single row/column in the database. The model is parsed into a permanent Python data structure once, while each job description is parsed into another, temporary data structure. The Python UDFs compare the words in the temporary data structure to the words in the model. The result is one database read for each job description and a single write once the probabilities are compared and the classification assignment is made. That's quite a contrast from reading and writing each word in the training set and the unassigned job.
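To make the idea above concrete, here is a minimal Python sketch of a Naive Bayes classifier whose whole model lives in a single value. It is not Daisy and Milenko's implementation: the function names (`train`, `load_model`, `classify`), the JSON encoding, the add-one smoothing and the toy training data are all assumptions made for illustration, and the database and UDF plumbing is omitted.

```python
import json
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Split text into lowercase alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())


def train(labeled_docs):
    """Return the whole model as one JSON string (it could sit in one cell)."""
    word_counts = defaultdict(Counter)
    doc_counts = Counter()
    for text, category in labeled_docs:
        doc_counts[category] += 1
        word_counts[category].update(tokenize(text))
    return json.dumps({"docs": doc_counts, "words": word_counts})


def load_model(model_json):
    """Parse the stored model once into log probabilities for reuse."""
    raw = json.loads(model_json)
    vocab = {w for counts in raw["words"].values() for w in counts}
    total_docs = sum(raw["docs"].values())
    categories = {}
    for category, n_docs in raw["docs"].items():
        counts = raw["words"][category]
        denom = sum(counts.values()) + len(vocab)  # add-one smoothing
        categories[category] = {
            "log_prior": math.log(n_docs / total_docs),
            "log_unseen": math.log(1 / denom),
            "log_word": {w: math.log((c + 1) / denom) for w, c in counts.items()},
        }
    return {"vocab": vocab, "categories": categories}


def classify(model, text):
    """Score one document against the in-memory model; no per-word lookups."""
    words = [w for w in tokenize(text) if w in model["vocab"]]
    scores = {}
    for category, m in model["categories"].items():
        scores[category] = m["log_prior"] + sum(
            m["log_word"].get(w, m["log_unseen"]) for w in words
        )
    return max(scores, key=scores.get)


# Toy training data, invented for this example.
training = [
    ("nurse charting patient visits in the EMR system", "user"),
    ("physician documents patient care using EMR", "user"),
    ("software engineer building EMR system interfaces", "builder"),
    ("sales executive selling EMR software to hospitals", "builder"),
]
model = load_model(train(training))
print(classify(model, "registered nurse, patient care"))  # user
print(classify(model, "engineer selling software"))       # builder
```

The point of the design is that `load_model` runs once, after which `classify` touches only in-memory dictionaries, so each document costs one read and one write rather than a round trip per word.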

Why does the set-oriented approach to machine learning matter? Performance and scale issues have long been a problem when trying to fully apply machine learning to large or unruly unstructured data sets. Set-oriented machine learning provides a straightforward way to bypass performance roadblocks, making machine learning a viable option for categorizing, clustering or summarizing large data sets or data sets with big chunks of data (e.g., descriptions or items with large numbers of features).

With any data set, speeding up machine learning processes allows quicker iterations through the data. That creates room to run experiments that improve accuracy and more time to focus on and interpret results to gain insights about the data. Quicker processing reduces the risk of applying machine learning to new topics by reducing the time investment to determine if results are worthwhile.

In summary, the performance boost provided by set-oriented machine learning makes for:

  • Handling larger and more diverse data sets
  • Applying machine learning to a larger set of problems
  • Faster turnarounds
  • Less risk
  • Better focus on a problem
  • Improved accuracy, greater understanding and more usable results

Strata: Making Data Work, being held Feb. 1-3, 2011 in Santa Clara, Calif., will cover many similar topics related to big data, machine learning, analytics and visualization.



On-line, voluntary control of human temporal lobe neurons (Nature)

Moran Cerf (1,2,3), Nikhil Thiruvengadam (1,4), Florian Mormann (1,5), Alexander Kraskov (1), Rodrigo Quian Quiroga (1,6), Christof Koch (1,7,11) & Itzhak Fried (2,8,9,10,11)

  1. Computation and Neural Systems, California Institute of Technology, Pasadena, California 91125, USA
  2. Department of Neurosurgery, University of California, Los Angeles, California 90095, USA
  3. Stern School of Business, New York University, New York, New York 10012, USA
  4. School of Computer Science, Carnegie Mellon University, Pittsburgh, Pennsylvania 15213, USA
  5. Department of Epileptology, University of Bonn, Bonn 53105, Germany
  6. Department of Engineering, University of Leicester, Leicester LE1 7RH, UK
  7. Department of Brain and Cognitive Engineering, Korea University, Seoul, 136-713, Korea
  8. Semel Institute for Neuroscience and Human Behavior, University of California, Los Angeles, California 90095, USA
  9. Functional Neurosurgery Unit, Tel-Aviv Medical Center, Tel-Aviv 64239, Israel
  10. Sackler Faculty of Medicine, Tel-Aviv University, Tel-Aviv 69978, Israel
  11. These authors contributed equally to this work.

Correspondence to: Moran Cerf (1,2,3) Email: moran@klab.caltech.edu

Correspondence to: Christof Koch (1,7,11) Email: koch@klab.caltech.edu

Correspondence to: Itzhak Fried (2,8,9,10,11) Email: ifried@mednet.ucla.edu

Abstract

Daily life continually confronts us with an exuberance of external, sensory stimuli competing with a rich stream of internal deliberations, plans and ruminations. The brain must select one or more of these for further processing. How this competition is resolved across multiple sensory and cognitive regions is not known; nor is it clear how internal thoughts and attention regulate this competition [1, 2, 3, 4]. Recording from single neurons in patients implanted with intracranial electrodes for clinical reasons [5, 6, 7, 8, 9], here we demonstrate that humans can regulate the activity of their neurons in the medial temporal lobe (MTL) to alter the outcome of the contest between external images and their internal representation. Subjects looked at a hybrid superposition of two images representing familiar individuals, landmarks, objects or animals and had to enhance one image at the expense of the other, competing one. Simultaneously, the spiking activity of their MTL neurons in different subregions and hemispheres was decoded in real time to control the content of the hybrid. Subjects reliably regulated, often on the first trial, the firing rate of their neurons, increasing the rate of some while simultaneously decreasing the rate of others. They did so by focusing onto one image, which gradually became clearer on the computer screen in front of their eyes, and thereby overriding sensory input. On the basis of the firing of these MTL neurons, the dynamics of the competition between visual images in the subject’s mind was visualized on an external display.

Four short links: 23 July 2010

http://feedproxy.google.com/~r/oreilly/radar/atom/~3/23huDbc-Wa4/four-short-links-23-july-2010.html

  1. 5 Reputation Missteps (and how to avoid them) (YouTube) -- a Google Tech Talk from one of the authors of the O'Reilly-published Building Web Reputation Systems.
  2. Solr on EC2 Tutorial -- the tutorial shows how to index Wikipedia with Solr. (via Matt Biddulph)
  3. clive -- a command line utility for extracting (or downloading) videos from YouTube and other video sharing Web sites. It was originally written to bypass the Adobe Flash requirement needed to view the hosted videos.
  4. ChinaSmack -- how to talk smack online in Chinese. (via BoingBoing)

shogun | A Large Scale Machine Learning Toolbox


The machine learning toolbox's focus is on large scale kernel methods and especially on Support Vector Machines (SVM) [1]. It provides a generic SVM object interfacing to several different SVM implementations, among them the state-of-the-art OCAS [21], Liblinear [20], LibSVM [2], SVMLight [3], SVMLin [4] and GPDT [5]. Each of the SVMs can be combined with a variety of kernels. The toolbox not only provides efficient implementations of the most common kernels, like the Linear, Polynomial, Gaussian and Sigmoid Kernel, but also comes with a number of recent string kernels, e.g., the Locality Improved [6], Fisher [7], TOP [8], Spectrum [9] and Weighted Degree Kernel (with shifts) [10] [11] [12]. For the latter, the efficient LINADD [12] optimizations are implemented. For linear SVMs, the COFFIN framework [22] [23] allows feature spaces to be computed on demand, even allowing sparse, dense and other data types to be mixed. Furthermore, SHOGUN offers the freedom of working with custom pre-computed kernels. One of its key features is the combined kernel, which can be constructed as a weighted linear combination of a number of sub-kernels, each of which does not necessarily work on the same domain. An optimal sub-kernel weighting can be learned using Multiple Kernel Learning [13] [14] [18] [19]. Currently SVM one-class, 2-class and multiclass classification and regression problems can be dealt with. However, SHOGUN also implements a number of linear methods like Linear Discriminant Analysis (LDA), Linear Programming Machine (LPM) and (Kernel) Perceptrons, and features algorithms to train hidden Markov models. The input feature objects can be dense, sparse or strings, of type int/short/double/char, and can be converted into different feature types. Chains of preprocessors (e.g. subtracting the mean) can be attached to each feature object, allowing for on-the-fly pre-processing.
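The combined kernel mentioned above, a weighted linear combination of sub-kernels, can be sketched in plain Python. This sketch does not use or mirror the SHOGUN API: the function names, the fixed weights and the textbook Gaussian form exp(-||x - y||^2 / (2 sigma^2)) are assumptions made for illustration. In Multiple Kernel Learning the weights would be learned rather than fixed by hand.

```python
import math


def linear_kernel(x, y):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(x, y))


def gaussian_kernel(x, y, sigma=1.0):
    """Textbook Gaussian kernel: exp(-||x - y||^2 / (2 * sigma^2))."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2 * sigma ** 2))


def combined_kernel(x, y, weighted_kernels):
    """Weighted linear combination of sub-kernels: sum_m beta_m * k_m(x, y)."""
    return sum(weight * kernel(x, y) for weight, kernel in weighted_kernels)


# Hand-picked illustrative weights.
sub_kernels = [(0.3, linear_kernel), (0.7, gaussian_kernel)]

print(combined_kernel([1.0, 0.0], [1.0, 0.0], sub_kernels))
print(combined_kernel([1.0, 0.0], [0.0, 1.0], sub_kernels))
```

With non-negative weights, a combination of valid kernels is itself a valid kernel, which is why the combined object can be handed to an SVM like any single kernel.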

SHOGUN is implemented in C++ and interfaces to Matlab(tm), R, Octave and Python and is proudly released as Machine Learning Open Source Software.