- **Batch Prediction**: Generating many predictions using an ML model at once, generally saving the results of predictions in offline storage
- **Entity Data**: Data which is needed to compute features (for example age of users).
- **Entity Data Decoration**: Process of fetching *entity data* given an *entity ID* (for example fetching user age given a username). Once decorated, we get raw features.
- **Entity IDs**: Unique identifiers for the entities used for prediction. For example, for a home page ranking request, the entity IDs would be the username and shelf ID.
- **Feature Gallery**: A view of features that have been produced by ML teams.
- **Feature Store**: A store of objects (usually raw features) consistent with a standardized specification for data format and data schema (the feature store specification). The choice of storage system for a “feature store” depends on the intended access pattern (e.g. GCS for offline, BigTable for online).
- **Features**: Set of data about an instance we wish to do a prediction on (in a context of supervised ML). See *Raw Features* vs. *Transformed Features*.
- **Feature Transformation Stage**: Stage in the ML workflow where raw features are mapped into a format that is useful and consumable by an ML model, for example one-hot encoding categorical data.
- **Raw Features**: Features, before going through the feature transformation stage.
- **TFDV / TensorFlow Data Validation**: Library for exploring and validating machine learning data (cf. [tensorflow/data-validation](https://github.com/tensorflow/data-validation)).
- **TFX (TensorFlow Extended)**: Google’s TensorFlow-based platform/effort to productionize machine learning workflows. See the [whitepaper](https://dl.acm.org/citation.cfm?id=3098021) and [website](https://www.tensorflow.org/tfx/).
- **tf.example**: Standard [TensorFlow Example protobuf](https://github.com/tensorflow/tensorflow/blob/master/tensorflow/core/example/example.proto) format for storing features for ML training and inference (cf. the [proto definition](https://github.com/tensorflow/tensorflow/blob/0076f2830e639efb7be60cd7abf358fe8fab7610/tensorflow/core/example/example.proto#L13-L86)); see the sketch after this glossary.
- **tf.record** - [TFRecord file format](https://www.tensorflow.org/api_guides/python/python_io#tfrecords-format-details) is a simple [record-oriented binary format](https://docs.google.com/document/d/1L_QpnE-OGVkg2KiRlgGoUpL7bCZIGS1CP6hMlBgd2Io/edit#bookmark=id.1nwvsyhwz9dl) that many TensorFlow applications use.
- **Training/Serving skew**: Difference between performance during training and performance during serving (very often due to differences in how features are handled). See the dedicated paragraph in [Rules of ML](https://developers.google.com/machine-learning/guides/rules-of-ml/#training-serving_skew).
- **Transformed Features**: Features suitable to be used as input to ML model. Most often, this means an array of floating point numbers.
- **columnar storage format** - storage format that embraces columnar nature of data, akin to [Parquet](https://parquet.apache.org/) or [Capacitor](https://cloud.google.com/blog/big-data/2016/04/inside-capacitor-bigquerys-next-generation-columnar-storage-format). [Wiki page](https://en.wikipedia.org/wiki/Column-oriented_DBMS).
- **data model** - method used to describe/model the structure of data; think Avro record, CSV, or JSON.
- **data format** - whenever we say data format, we mean the combination of both storage format and data model, though not necessarily as a single entity.
- **features** - input to a model. Features are the data after the featurization process (the process of taking raw feature data and transforming it into (usually) a vector that a model can digest).
- **input data** - by “input data to machine learning pipeline” we mean raw features (data before featurization).
- **record-oriented format** - row/record oriented storage format, think [Avro](https://avro.apache.org/) or TSV.
- **storage format** - format used to encode data on some form of storage; examples would be [Parquet](https://parquet.apache.org/) or [Ogg](https://en.wikipedia.org/wiki/Ogg). The storage could be a filesystem, in which case you would use a “file format”.
- **Online Prediction**: Generating predictions on demand using an ML model, usually as the response to service calls
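To make the tf.example and tf.record entries concrete, here is a minimal sketch that builds a tf.train.Example protobuf and writes it to a TFRecord file; the feature names and values are made up:

```python
import tensorflow as tf

# Build a tf.train.Example with made-up raw features (names are illustrative only).
example = tf.train.Example(features=tf.train.Features(feature={
    "user_age": tf.train.Feature(int64_list=tf.train.Int64List(value=[34])),
    "avg_session_minutes": tf.train.Feature(float_list=tf.train.FloatList(value=[12.5])),
    "country": tf.train.Feature(bytes_list=tf.train.BytesList(value=[b"US"])),
}))

# Serialize it and store it in the record-oriented TFRecord format.
with tf.io.TFRecordWriter("features.tfrecord") as writer:
    writer.write(example.SerializeToString())
```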
- Regression (LASSO regression, logistic regression, ridge regression)
- Decision trees (gradient boosting, random forests)
- Neural networks
- SVM
- Naïve Bayes
- Nearest neighbors
- Gaussian Process
Unsupervised learning occurs when a model is trained on unlabeled data. Unsupervised learning algorithms usually segment data into groups of examples or groups of features. Groups of examples are often called “clusters.” Combining training features into a smaller, more representative group of features is called “feature extraction,” and finding the most important subset of input features is called “feature selection.” Unsupervised learning can be the end goal of a machine learning task (as it is in market segmentation), or it can be a preliminary or preprocessing step in a supervised learning task in which clusters or preprocessed features are used as inputs to a supervised learning algorithm. A small sketch of these ideas follows the list below.
- Apriori rules
- Association rule learning
- Eclat
- Clustering (k-means clustering, mean shift clustering, spectral clustering)
- Hierarchical Cluster Analysis (HCA)
- Kernel density estimation
- Nonnegative matrix factorization
- PCA (kernel PCA, sparse PCA)
- Locally-Linear Embedding (LLE)
- t-distributed Stochastic Neighbor Embedding (t-SNE)
- Singular value decomposition
- SOM (Self-Organizing Maps)
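A minimal sketch of the three ideas above (clustering, feature extraction, feature selection), assuming scikit-learn and synthetic data; the parameters are illustrative only:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.feature_selection import VarianceThreshold

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 10))  # unlabeled data with 10 raw features

clusters = KMeans(n_clusters=3, n_init=10).fit_predict(X)       # clustering: group examples
X_extracted = PCA(n_components=2).fit_transform(X)               # feature extraction: fewer, denser features
X_selected = VarianceThreshold(threshold=0.5).fit_transform(X)   # feature selection: keep informative columns
```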
Semi-supervised learning algorithms generally train on small portions of labeled data combined with larger amounts of unlabeled data, blending supervised and unsupervised learning in situations where labeled data are difficult or expensive to procure. Several stand-alone machine learning algorithms have also been described as semi-supervised because they use both labeled and unlabeled training examples as inputs. Although they are less common, semi-supervised algorithms are garnering acceptance among business practitioners. A small sketch of the labeled-plus-unlabeled setup follows the list below.
- Prediction and classification
- Clustering
- EM (Expectation Maximization)
- TSVM
- Manifold regularization
- Auto-encoders (Multilayer perceptron, restricted Boltzmann machines)
- Deep Belief Networks (DBNs)
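A minimal sketch of the labeled-plus-unlabeled setup, assuming scikit-learn's LabelPropagation (used here only as a convenient stand-in, not necessarily one of the algorithms listed above); unlabeled examples are marked with -1:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.semi_supervised import LabelPropagation

X, y = make_classification(n_samples=300, random_state=0)
y_partial = y.copy()
y_partial[30:] = -1                            # pretend only the first 30 examples are labeled

model = LabelPropagation().fit(X, y_partial)   # learns from labeled and unlabeled points together
print((model.predict(X) == y).mean())          # accuracy against the held-back labels
```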
In reinforcement learning, the learning system, called an agent in this context, observes the environment, selects and performs actions, and gets rewards in return (or penalties in the form of negative rewards). It must then learn by itself the best strategy, called a policy, to get the most reward over time. A policy defines what action the agent should choose when it is in a given situation.
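A minimal tabular Q-learning sketch on a made-up two-state toy environment, just to show the agent / action / reward / policy loop described above; all names and numbers are illustrative:

```python
import random

# Made-up toy environment: 2 states, 2 actions; action 1 taken in state 1 pays off.
def step(state, action):
    reward = 1.0 if (state == 1 and action == 1) else 0.0
    next_state = action                # the chosen action decides the next state
    return next_state, reward

Q = [[0.0, 0.0], [0.0, 0.0]]           # Q[state][action]: the learned value table
alpha, gamma, epsilon = 0.1, 0.9, 0.1
state = 0
for _ in range(10_000):
    # Epsilon-greedy policy: mostly exploit the table, sometimes explore.
    if random.random() < epsilon:
        action = random.randrange(2)
    else:
        action = max((0, 1), key=lambda a: Q[state][a])
    next_state, reward = step(state, action)
    # Q-learning update: move toward reward plus discounted best future value.
    Q[state][action] += alpha * (reward + gamma * max(Q[next_state]) - Q[state][action])
    state = next_state

print(Q)   # the policy is read off as argmax over actions for each state
```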
In batch learning, the system is incapable of learning incrementally: it must be trained using all the available data. This will generally take a lot of time and computing resources, so it is typically done offline: first the system is trained, then it is launched into production and runs without learning anymore; it just applies what it has learned. This is called offline learning. If you want a batch learning system to know about new data (such as a new type of spam), you need to train a new version of the system from scratch on the full dataset (not just the new data, but also the old data), then stop the old system and replace it with the new one. Fortunately, the whole process of training, evaluating, and launching a machine learning system can be automated fairly easily, so even a batch learning system can adapt to change: simply update the data and train a new version of the system from scratch as often as needed. This solution is simple and often works fine, but training on the full set of data can take many hours, so you would typically train a new system only every 24 hours or even just weekly. If your system needs to adapt to rapidly changing data (e.g. to predict stock prices), you need a more reactive solution.
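A minimal sketch of that retrain-from-scratch loop; load_full_dataset, train, and deploy are hypothetical stand-ins for whatever your pipeline actually does:

```python
import time

def retrain_and_deploy(load_full_dataset, train, deploy):
    """One batch-learning cycle: train from scratch on all data, then swap models."""
    data = load_full_dataset()   # old data plus the newly collected data
    model = train(data)          # full retraining; typically the slow, offline part
    deploy(model)                # stop the old system and replace it with the new one

def run_daily(load_full_dataset, train, deploy):
    """Hypothetical schedule: retrain once every 24 hours."""
    while True:
        retrain_and_deploy(load_full_dataset, train, deploy)
        time.sleep(24 * 60 * 60)
```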
In online learning (incremental learning), the system is trained incrementally by feeding it data instances sequentially, either individually or in small groups called mini-batches. Each learning step is fast and cheap, so the system can learn about new data on the fly, as it arrives. Online learning is great for systems that receive data as a continuous flow (e.g. stock prices) and need to adapt to change rapidly or autonomously. It is also well suited if you have limited computing resources: once an online learning system has learned about new data instances, it does not need them anymore, so you can discard them (unless you want to be able to roll back to a previous state and “replay” the data). This can save a huge amount of space. Online learning algorithms can also be used to train systems on huge datasets that cannot fit in one machine’s main memory (this is called out-of-core learning). The algorithm loads part of the data, runs a training step on that data, and repeats the process until it has run on all of the data.
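A minimal sketch of incremental (and out-of-core) learning, assuming scikit-learn's partial_fit on a stream of synthetic mini-batches:

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

model = SGDClassifier()
classes = np.array([0, 1])                       # all classes must be declared up front
rng = np.random.default_rng(0)

for _ in range(100):                             # 100 mini-batches arriving over time
    X_batch = rng.normal(size=(32, 5))
    y_batch = (X_batch[:, 0] > 0).astype(int)    # synthetic labels
    model.partial_fit(X_batch, y_batch, classes=classes)
    # Each mini-batch can now be discarded (unless you want to replay it later).
```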
Ingest
Ingestion is a set of software engineering techniques for handling high volumes of data that arrive rapidly (often via streaming).
Kafka
RabbitMQ
Fluentd
Sqoop
Kinesis (AWS)
Modeling is a set of data architecture techniques to create data storage that is appropriate for a particular domain
- Relational
MySQL
Postgres
RDS (AWS)
- Key-Value
Redis
Riak
DynamoDB (AWS)
- Columnar
Cassandra
HBase
Redshift (AWS)
- Document
MongoDB
ElasticSearch
CouchBase
- Graph
Neo4J
OrientDB
ArangoDB
Query refers to extracting data from storage and modifying that data to accommodate anomalies such as missing data
- Batch
MapReduce
Spark
Elastic MapReduce (AWS)
- Batch SQL
Hive
Presto
Drill
- Streaming
Storm
Spark Streaming
Samza
Analyze is a broad category that includes techniques from computer science, mathematical modeling, artificial intelligence, statistics, and other disciplines.
- Statistics
SPSS
SAS
R
Statsmodels
SciPy
Pandas
- Optimization and Mathematical Modeling (SciPy and other libraries)
Linear, integer, and dynamic programming
Gradient and Lagrange methods
- Machine learning
Batch
H2O
Mahout
SparkML
Interactive
scikit-learn
Visualize refers to transforming data into visually attractive and informative formats.
matplotlib
seaborn
Bokeh
pandas
D3
Tableau
Leaflet
Highcharts
Kibana
Machine Learning Models
Non-parametric statistics gave us quantiles, but offers so much more. Generally, non-parametric describes any statistical construct that does not make assumptions about probability distribution, e.g. normal or binomial. This means it has the most broadly-applicable tools in the descriptive statistics toolbox. This includes everything from the familiar histogram to the sleeker kernel density estimation (KDE). There’s also a wide variety of nonparametric tests aimed at quantitatively discovering your data’s distribution and expanding into the wide world of parametric methods.
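A minimal sketch of the non-parametric tools mentioned above (quantiles, a histogram, and a KDE), assuming NumPy and SciPy and synthetic data:

```python
import numpy as np
from scipy import stats

data = np.random.default_rng(0).lognormal(size=1_000)   # synthetic, skewed "latencies"

quantiles = np.percentile(data, [50, 90, 99])    # median, p90, p99
counts, bin_edges = np.histogram(data, bins=30)  # the familiar histogram
kde = stats.gaussian_kde(data)                   # the sleeker kernel density estimate
density_at_one = kde(1.0)                        # evaluate the smoothed density anywhere
```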
Parametric statistics contrast with non-parametric statistics in that the data is presumed to follow a given probability distribution. If you’ve established or assumed that your data can be modeled as one of the many published distributions, you’ve given yourself a powerful set of abstractions with which to reason about your system. We could do a whole article on the probability distributions we expect from different parts of our Python backend services (hint: expect a lot of fish and phones). Teasing apart the curves inherent in your system is quite a feat, but we never drift too far from the real observations. As with any extensive modeling exercise, heed the cautionary song of the black swan.
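A minimal sketch of assuming a distribution and fitting its parameters with SciPy; the exponential here is only an example, not a claim about any particular system:

```python
import numpy as np
from scipy import stats

samples = np.random.default_rng(0).exponential(scale=2.0, size=1_000)

loc, scale = stats.expon.fit(samples)               # fit the assumed distribution's parameters
p99 = stats.expon.ppf(0.99, loc=loc, scale=scale)   # reason about tails via the fitted model
```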
Inferential statistics contrast with descriptive statistics in that the goal is to develop models and predict future performance. Applying predictive modeling, like regression and distribution fitting, can help you assess whether you are collecting sufficient data, or if you’re missing some metrics. If you can establish a reliable model for your service and hook it into monitoring and alerting, you’ll have reached SRE nirvana. In the meantime, many teams make do with simply overlaying charts with the last week. This is often quite effective, diminishing the need for mathematical inference, but does require constant manual interpretation, doesn’t compose well for longer-term trend analysis, and really doesn’t work when the previous week isn’t representative (i.e., had an outage or a spike).
Categorical statistics contrast with numerical statistics in that the data is not mathematically measurable. Categorical data can be big, such as IPs and phone numbers, or small, like user languages. Our key non-numerical metrics are around counts, or cardinality, of categorical data. Some components have used HyperLogLog and Count-Min sketches for distributable streaming cardinality estimates. While reservoir sampling is much simpler, and can be used for categorical data as well, HLL and CMS offer increased space efficiency, and more importantly, proven error bounds. After grasping reservoir sampling, but before delving into advanced cardinality structures, you may want to have a look at boltons ThresholdCounter, the heavy hitters counter used extensively in Python services. Regardless, be sure to take a look at this ontology of basic statistical data types.
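A minimal sketch of reservoir sampling (Algorithm R), which keeps a fixed-size uniform sample of a stream and works for categorical data; HyperLogLog and Count-Min sketches are not shown here:

```python
import random

def reservoir_sample(stream, k):
    """Keep a uniform random sample of k items from a stream of unknown length."""
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            reservoir.append(item)      # fill the reservoir first
        else:
            j = random.randint(0, i)    # replace existing items with decreasing probability
            if j < k:
                reservoir[j] = item
    return reservoir

print(reservoir_sample((f"ip-{n % 97}" for n in range(100_000)), k=10))
```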
Multivariate statistics allow you to analyze multiple output variables at a time. It’s easy to go overboard with multiple dimensions, as there’s always an extra dimension if you look for it. Nevertheless, a simple, practical exploration of correlations can give you a better sense of your system, as well as inform you as to redundant data collection.
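A minimal sketch of exploring correlations across a few synthetic metrics with NumPy; a strongly correlated pair hints at redundant data collection:

```python
import numpy as np

rng = np.random.default_rng(0)
latency = rng.normal(100, 10, size=1_000)
cpu = latency * 0.5 + rng.normal(0, 2, size=1_000)      # deliberately redundant metric
errors = rng.poisson(3, size=1_000).astype(float)

corr = np.corrcoef(np.vstack([latency, cpu, errors]))   # 3x3 correlation matrix
print(corr.round(2))
```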
Multimodal statistics abound in real-world data: multiple peaks or multiple distributions packed into a single dataset. Consider response times from an HTTP service: we often have several curves overlaid, with a few obvious peaks. Such a mixture makes it clear that maintaining a single set of summary statistics can do the data a great injustice. Two peaks really narrow down the field of effective statistical techniques, and three or more will present a real challenge. There are times when you will want to discover and track datasets separately for more meaningful analysis. Other times it makes more sense to bite the bullet and leave the data mixed.
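A small synthetic illustration of the problem: a mixture with distinct peaks where a single mean or median describes none of them well; all the numbers are made up:

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic response times: cache hits, ordinary requests, and slow near-timeouts.
fast = rng.normal(5, 1, size=7_000)
ordinary = rng.normal(50, 10, size=2_500)
slow = rng.normal(500, 50, size=500)
times = np.concatenate([fast, ordinary, slow])

print(times.mean(), np.median(times))   # neither summary matches any of the three peaks
```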
Time-series statistics transforms measurements by contextualizing them into a single, near-universal dimension: time intervals. At PayPal, time series are used all over, from per-minute transaction and error rates sent to OpenTSDB, to the Python team’s homegrown $PYPL Pandas stock price analysis. Not all data makes sense as a time series. It may be easy to implement certain algorithms over time series streams, but be careful about overapplication. Time-bucketing contorts the data, leading to fewer ways to safely combine samples and more shadows of misleading correlations.
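A minimal sketch of time-bucketing with pandas resample; the one-minute buckets and synthetic events are illustrative only:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
timestamps = pd.date_range("2021-01-01", periods=10_000, freq="s")    # hypothetical events
values = pd.Series(rng.normal(100, 15, size=10_000), index=timestamps)

per_minute = values.resample("1min").agg(["count", "mean", "max"])    # time-series buckets
print(per_minute.head())
```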
Moving metrics, sometimes called rolling or windowed metrics, are another powerful class of calculation that can combine measurement and time. One example is the exponentially-weighted moving average (EWMA).
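A minimal EWMA sketch, using the update s = alpha * x + (1 - alpha) * s; the alpha value is illustrative:

```python
def ewma(values, alpha=0.3):
    """Exponentially-weighted moving average: recent points count more than old ones."""
    smoothed = []
    s = values[0]
    for x in values:
        s = alpha * x + (1 - alpha) * s   # each new value pulls the average toward it
        smoothed.append(s)
    return smoothed

print(ewma([10, 12, 11, 50, 13, 12]))   # the spike decays instead of dominating
```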
This output packs a lot of information into a small space, and is very cheap to track, but it takes some knowledge and understanding to interpret correctly. EWMA is simultaneously familiar and nuanced. It’s fun to consider whether you want time-series-style discrete buckets or the continuous window of a moving statistic. For instance, do you want the counts for yesterday, or the past 24 hours? Do you want the previous hour or the last 60 minutes? Based on the questions people ask about our applications, PayPal Python services keep few moving metrics, and generally use a lot more time series.
Survival analysis is used to analyze the lifetimes of system components, and must make an appearance in any engineering article about reliability. Invaluable for simulations and post-mortem investigations, even a basic understanding of the bathtub curve can provide insight into lifespans of running processes. Failures are rooted in causes at the beginning, middle, and end of expected lifetime, which when overlaid, create a bathtub aggregate curve. When the software industry gets to a point where it leverages this analysis as much as the hardware industry, the technology world will undoubtedly have become a cleaner place.
- Linear Regression
Linear regression
Linear regression with penalty (LASSO regression, Ridge regression)
Nonlinear regression
Hierarchical Linear Models
- Nonparametric
K-Nearest Neighbors
- Logistic Regression
Logistic regression
Logistic regression with penalty
Logistic regression with calibration
Multinomial logistic regression
Conditional Logistic Regression
- Generalized Linear Models and Exponential Family
Ordered/Cumulative Logit Model
Probit Model
Tobit Model
GAM
GLMMIX
- Time Series Analysis
ARIMA
Stepwise Autoregressive Model
Multivariate Autoregressive State-Space Models
Exponential Smoothing Model
Multiplicative Seasonal Exponential Smoothing Model
Additive Seasonal Exponential Smoothing Model
- Directed Graphical Models (Bayes Nets)
Naive Bayes
Graphical models
Directed Graphical models
Markov and Hidden Markov
Structural Equation Model (SEM), Covariance-based SEM (CB-SEM) and Partial Least Squares-based SEM (PLS-SEM)
- Tree-based Models
Decision Tree
Random forests
Gradient Boosted Decision Tree (GBDT)
- Bayesian Models
Beta Binomial Model
Dirichlet Multinomial Model
Mixtures of Gaussians
Mixtures of Multinoullis
- Latent Models
Factor Analysis
Principal Component Analysis (kernel PCA, sparse PCA)
Singular value decomposition
Latent Dirichlet Allocation (LDA)
- Kernel Models
Support Vector Machine
Kernel density estimation
Kernelized KNN
- Web Analytics
TF-IDF
String Kernel
Sentiment Analysis
- Stochastic Process
Brownian Motion
Jump Diffusion
Gaussian Process
Latent Dirichlet Allocation (LDA)
Conditional Random Field (CRF)
- Clustering
Hierarchical clustering
K-means clustering
Gaussian mixture clustering
Mean shift clustering
Spectral clustering
SOM (Self-Organizing maps)
- Neural Networks
Artificial Neural Networks (ANNs)
Convolutional Neural Networks (CNNs)
Association Rule, Apriori rules (support, confidence, lift)
Rapid deployment of ML (machine learning) applications of any complexity, with flexibility, scalability, and state-of-the-art technology.
- Maximum Entropy (MaxEnt)
- Statistical Language Model (SLM)
- Vowpal Wabbit (VW)