Tuesday, January 31, 2017

Introduction to Scikit Flow

Scikit Flow is a simplified interface for TensorFlow, to get people started on predictive analytics and data mining. It helps smooth the transition from the Scikit-learn world of one-liner machine learning into the more open world of building different shapes of ML models. You can start by using fit/predict and slide into TensorFlow APIs as you are getting comfortable. It’s Scikit-learn compatible so you can also benefit from Scikit-learn features like GridSearch and Pipeline.

source activate tensorflow
ipython
import tensorflow


Deep Learning Models

Deep Neural Network
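
As a minimal sketch of the fit/predict workflow on a deep neural network, assuming TensorFlow 1.x where Scikit Flow lives on as tf.contrib.learn (exact API names vary between versions; the hidden layer sizes and step count below are arbitrary choices for illustration):

import tensorflow as tf
from sklearn import datasets, metrics

# Illustrative only: a 3-layer DNN on the Iris data with the familiar
# Scikit-learn-style fit/predict interface.
iris = datasets.load_iris()

feature_columns = tf.contrib.learn.infer_real_valued_columns_from_input(iris.data)
classifier = tf.contrib.learn.DNNClassifier(feature_columns=feature_columns,
                                            hidden_units=[10, 20, 10],
                                            n_classes=3)

classifier.fit(iris.data, iris.target, steps=200)
predictions = list(classifier.predict(iris.data))
print(metrics.accuracy_score(iris.target, predictions))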


Monday, January 30, 2017

Online Random Sampling

Real-time ‘A/B testing’ of models, on its own or combined with more advanced online selection and learning techniques, can still suffer from selection bias. Before running A/B tests, the KPI metric stat report can be used to monitor prior test performance, so you can respond or adapt if performance degrades for some reason.
Time is a very precious commodity when you are running A/B tests online. Make sure you go in with agreed upper bounds on how long you’re going to spend testing a given experiment, based on the prior KPI metric stat report. During the test, if you’re not seeing trends in the p-value that look encouraging, it’s time to pull the plug; a sketch of this kind of p-value monitoring appears below.
Users who are supposedly assigned at random to either your control or your treatment group might not be assigned randomly after all, so audit the systems, using the KPI metric stat report, to make sure there is no selection bias in the actual assignment of users to the control or treatment groups.
Make sure the user assignment is sticky. If you are measuring the effect of a change over an entire session, make sure users are not switching groups before or during the testing period.
Look at the same exact group of users and check whether you observe different engagement values, to make sure there is no inherent bias or other problem, e.g., session leakage, that you need to address.
If you are running multiple tests simultaneously, make sure they don't conflict with one another across all user buckets.
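
As a concrete illustration of the p-value monitoring mentioned above, here is a minimal sketch (not from the original post) that recomputes the cumulative p-value of a two-proportion z-test day by day as A/B data accumulates; it assumes statsmodels is available and uses simulated conversion counts.

import numpy as np
from statsmodels.stats.proportion import proportions_ztest

rng = np.random.RandomState(42)

# Simulated daily data: control converts at 5%, treatment at 5.5%.
days = 14
control_n = np.full(days, 1000)
treatment_n = np.full(days, 1000)
control_c = rng.binomial(control_n, 0.05)
treatment_c = rng.binomial(treatment_n, 0.055)

# Recompute the cumulative two-proportion z-test each day; a p-value that
# never trends downward is a hint to stop the experiment early.
for day in range(1, days + 1):
    counts = np.array([control_c[:day].sum(), treatment_c[:day].sum()])
    nobs = np.array([control_n[:day].sum(), treatment_n[:day].sum()])
    _, p_value = proportions_ztest(counts, nobs)
    print("day %2d  cumulative p-value: %.4f" % (day, p_value))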

https://eng.uber.com/xp-background-push/

In the background, our A/B testing platform, Morpheus, constructs the payload that needs to be sent to users based on the new configuration. This payload is sent to an internal service called GroupPusher, along with a key identifying the flag which needs to be rolled back. GroupPusher is a service which takes some key identifying a set of users in Cassandra and a payload and proceeds to send the payload to all of the users in the identified set. GroupPusher pulls the list of affected users from Cassandra based on this key.

GroupPusher next calls Pusher, an internal service that sends the payload to users at a rate of about 3,000 pushes per second. Pusher is responsible for sending push notifications to users, e.g., to let them know that their driver is approaching their pickup location: "Your Uber is arriving now." Pusher then sends the payload down to the mobile clients via APNs (for iOS) and GCM (for Android).

Dipping into the stream

The twist is that we want to sample from an unknown population, considering only one data point at a time. This use case calls for a special corner of computer science: online algorithms, a subclass of streaming algorithms. “Online” implies only individual points are considered in a single pass. “Streaming” implies the program can only consider a subset of the data at a time, but can work in batches or run multiple passes. Fortunately, Donald Knuth helped popularize an elegant approach that enables random sampling over a stream: Reservoir sampling.

First we designate a counter, which will be incremented for every data point seen. The reservoir is generally a list or array of predefined size. Now we can begin adding data. Until we encounter size elements, elements are added directly to reservoir. Once reservoir is full, incoming data points have a size / counter chance to replace an existing sample point. We never look at the value of a data point and the random chance is guaranteed by definition. This way reservoir is always representative of the dataset as a whole, and is just as likely to have a data point from the beginning as it is from the end. All this, with bounded memory requirements, and very little computation. See the instrumentation section below for links to Python implementations.
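
A minimal sketch of the reservoir algorithm just described might look like this in Python; the implementations linked in the instrumentation section below are more complete.

import random

class Reservoir(object):
    """Fixed-size uniform random sample over a stream of unknown length."""

    def __init__(self, size):
        self.size = size       # capacity of the reservoir
        self.counter = 0       # number of data points seen so far
        self.sample = []       # the reservoir itself

    def add(self, value):
        self.counter += 1
        if len(self.sample) < self.size:
            # Until we encounter `size` elements, add directly to the reservoir.
            self.sample.append(value)
        else:
            # Afterwards, each new point has a size/counter chance of
            # replacing a randomly chosen existing sample point.
            j = random.randrange(self.counter)
            if j < self.size:
                self.sample[j] = value

# Example: sample 10 values from a stream of 1,000,000 integers.
r = Reservoir(10)
for x in range(1000000):
    r.add(x)
print(r.sample)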

In simpler terms, the reservoir progressively renders a scaled-down version of the data, like a fixed-size thumbnail. Reservoir sampling’s ability to handle populations of unknown size fits perfectly with tracking response latency and other metrics of a long-lived server process.

Interpretation

Once you have a reservoir, what are the natural next steps?
  • Look at the min, max, and other quantiles of interest (generally median, 95th, 99th, 99.9th percentiles).
  • Visualize the CDF and histogram to get a sense for the shape of the data, usually by loading the data in a Jupyter notebook and using Pandas, matplotlib, and occasionally Bokeh.
Reservoir sampling does have its shortcomings. In particular, like an image thumbnail, accuracy is only as good as the resolution configured. In some cases, the data near the edges gets a bit blocky. Good implementations of reservoir sampling will already track the maximum and minimum values, but for engineers interested in the edges, we recommend keeping an increased set of the exact outliers. For example, for critical paths, we sometimes explicitly track the n highest response times observed in the last hour.

Depending on your runtime environment, resources may come at a premium. Reservoir sampling requires very little processing power, provided you have an efficient PRNG. Even your Arduino has one of those. But memory costs can pile up. Generally speaking, accuracy scales with the square root of size. Twice as much accuracy will cost you four times as much memory, so there are diminishing returns.

Transitions

Usually, reservoirs get us what we want and we can get on with non-statistical development. But sometimes, the situation calls for a more tailored approach.

There are q-digests, biased quantile estimators, and plenty of other advanced algorithms for handling performance data. After a lot of experimentation, two approaches remain our go-tos, both of which are much simpler than one might presume.

The first approach, histogram counters, establishes ranges of interest, called bins or buckets, based on statistics gathered from a particular reservoir. While reservoir counting is data agnostic, looking only at a random value to decide where to put the data, bucketed counting looks at the value, finds the bucket whose range includes that value, and increments the bucket’s associated counter. The value itself is not stored. The code is simple, and the memory consumption is even lower, but the key advantage is the execution speed. Bucketed counting is so low overhead that it allows statistics collection to permeate much deeper into our code than other algorithms would allow.
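
A minimal sketch of such a bucketed counter follows, with boundaries one might take from a reservoir's observed quantiles (the thresholds below are made up for illustration):

import bisect

class HistogramCounter(object):
    """Counts observations into preconfigured value ranges (buckets)."""

    def __init__(self, boundaries):
        # Bucket boundaries (e.g., latency thresholds in milliseconds chosen
        # from quantiles observed in a reservoir). Values above the last
        # boundary land in an overflow bucket.
        self.boundaries = sorted(boundaries)
        self.counts = [0] * (len(self.boundaries) + 1)

    def add(self, value):
        # Find the bucket whose range includes the value and increment its
        # counter; the value itself is never stored.
        self.counts[bisect.bisect_right(self.boundaries, value)] += 1

# Hypothetical latency buckets (ms) derived from a reservoir's quantiles.
hist = HistogramCounter([1, 5, 10, 50, 100, 500])
for latency_ms in [0.7, 3.2, 42.0, 8.9, 730.0]:
    hist.add(latency_ms)
print(hist.counts)  # [1, 1, 1, 1, 0, 0, 1]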

The second approach, Piecewise Parabolic Quantile Estimation (P2 for short), is an engineering classic. A product of the 1980s electronics industry, P2 is a pragmatic online algorithm originally designed for simple devices. When we look at a reservoir’s distribution and decide we need more resolution for certain quantiles, P2 lets us specify the quantiles ahead of time, and maintains those values on every single observation. The memory consumption is very low, but due to the math involved, P2 uses more CPU than reservoir sampling and bucketed counting. Furthermore, we’ve never seen anyone attempt combination of P2 estimators, but we assume it’s nontrivial. The good news is that for most distributions we see, our P2 estimators are an order of magnitude more accurate than reservoir sampling.

These approaches both take something learned from the reservoir sample and apply it toward doing less. Histograms provide answers in terms of preconfigured value ranges; P2 provides answers at preconfigured quantile points of interest.

Instrumentation

We focused a lot on statistical fundamentals, but how do we generate relevant datasets in the first place? Our answer is through structured instrumentation of our components. With the right hooks in place, the data will be there when we need it, whether we’re staying late to debug an issue or when we have a spare cycle to improve performance.

Much of our services’ robustness can be credited to a reliable remote logging infrastructure, similar to, but more powerful than, rsyslog. Still, before we can send data upstream, the process must collect internal metrics. We leverage two open-source projects, both fast approaching a major release:

Faststat – Optimized statistical accumulators
Faststat operates on a lower level. True to its name, Faststat is a compiled Python extension that implements accumulators for the measures described here and many more. This includes everything from geometric/harmonic means to Markov-like transition tracking to a metametric that tracks the time between stat updates. At just over half a microsecond per point, Faststat’s low-overhead allows it to permeate into some of the deepest depths of our framework code. Faststat lacks output mechanisms of its own, so our internal framework includes a simple web API and UI for browsing statistics, as well as a greenlet that constantly uploads faststat data to a remote accumulation service for alerting and archiving.

Lithoxyl – Next-generation logging and application instrumentation
Lithoxyl is a high-level library designed for application introspection. It’s intended to be the most Pythonic logging framework possible. This includes structured logging and various accumulators, including reservoir sampling, P2, and others. But more importantly, Lithoxyl creates a separate instrumentation aspect for applications, allowing output levels and formats to be managed separately from the instrumentation points themselves.

One of the many advantages to investing in instrumentation early is that you get a sense for the performance overhead of data collection. Reliability and features far outweigh performance in the enterprise space. Many critical services I’ve worked on could be multiple times faster without instrumentation, but removing this aspect would render them unmaintainable, which brings me to my next point.

Good work takes cycles. All the methods described here are performance-minded, but you have to spend cycles to regain cycles. An airplane could carry more passengers without all those heavy dials and displays up front. It’s not hard to see why logging and metrics are, for most services, second in priority only to the features themselves. Always remember and communicate that having to choose between features and instrumentation does not bode well for the reliability record of the application or organization.

For those who really need to move fast and would prefer to reuse or subscribe, there are several promising choices out there, including New Relic and Prometheus. Obviously we have our own systems, but those offerings do have percentiles and histograms.

An Introduction to Bayesian Timeseries Analysis with Python

https://www.chrisstucchio.com/blog/2016/has_your_conversion_rate_changed.html

Hypothesis Test: conversion rate changed or not?

Computing the likelihood of a time series
Computing the likelihood with Python
n = array([ 1000., 1000., 1000., 1000., 1000., 1000., 1000., 1000., 1000., 1000., 1000., 1000., 1000., 1000., 1000., 1000., 1000., 1000., 1000., 1000.])
c = array([51, 40, 51, 41, 44, 39, 54, 41, 61, 52, 65, 58, 44, 49, 34, 39, 24, 28, 36, 43])
theta = array([ 0.05, 0.05, 0.05, 0.05, 0.05, 0.05, 0.05, 0.05, 0.05, 0.05, 0.05, 0.05, 0.05, 0.05, 0.05, 0.05, 0.05, 0.05, 0.05, 0.05])

def log_likelihood(n, c, theta):
    return sum(binom.logpmf(c, n, theta))

def likelihood(n, c, theta):
    return exp(log_likelihood(n, c, theta))

In [1]: from pylab import *

In [2]: from scipy.stats import binom

In [3]: def log_likelihood(n, c, theta):
   ...:     return sum(binom.logpmf(c, n, theta))
   ...: 

In [4]: def bayesian_jump_detector(n, c, base_cr=0.05, null_prior=0.98, post_jump_cr=0.03):
   ...:     """ Returns a posterior describing our beliefs on the probability of a
   ...:     jump, and if so when it occurred.
   ...: 
   ...:     First return value is probability null hypothesis is true, second return
   ...:     value is array representing the probability of a jump at each time.
   ...:     """
   ...:     theta = full(n.shape, base_cr)
   ...:     likelihood = zeros(shape=(n.shape[0] + 1,), dtype=float) #First element represents the probability of no jump
   ...: 
   ...:     likelihood[0] = null_prior #Set likelihood equal to prior
   ...:     likelihood[1:] = (1.0-null_prior) / n.shape[0] #Remainder represents probability of a jump at a fixed increment
   ...: 
   ...:     likelihood[0] = likelihood[0] * exp(log_likelihood(n, c, theta))
   ...:     for i in range(n.shape[0]):
   ...:         theta[:] = base_cr
   ...:         theta[i:] = post_jump_cr
   ...:         likelihood[i+1] = likelihood[i+1] * exp(log_likelihood(n, c, theta))
   ...:     likelihood /= sum(likelihood)
   ...:     return (likelihood[0], likelihood[1:])
   ...: 

In [5]: n = full((20,), 100)
   ...: c = binom(100, 0.05).rvs(20) #No jump in CR occurs
   ...: 
   ...: bayesian_jump_detector(n, c, null_prior=0.99)
   ...: 
Out[5]: 
(0.99909917827083117,
 array([  1.76896708e-09,   1.84711683e-09,   5.58550880e-09,
          5.83226645e-09,   1.03635573e-08,   3.13384293e-08,
          1.92289232e-08,   9.89510066e-08,   5.09196569e-07,
          1.07887083e-07,   9.44781935e-07,   1.40796089e-05,
          8.63909656e-06,   7.56537499e-05,   3.89310137e-04,
          1.40370734e-04,   5.06124581e-05,   8.99350387e-05,
          3.24272250e-05,   9.80569000e-05]))

In [6]: n = full((20,), 100)
   ...: c = binom(100, 0.05).rvs(20)
   ...: c[13:] = binom(100, 0.03).rvs(7) #Jump occurs at t=13
   ...: 
   ...: bayesian_jump_detector(n, c, null_prior=0.99)
   ...: 
Out[6]: 
(0.42826882885196343,
 array([ 0.05374295,  0.09549772,  0.02023378,  0.0359541 ,  0.03754249,
         0.03920105,  0.02405334,  0.01475887,  0.02622555,  0.0466011 ,
         0.04865985,  0.05080955,  0.0311762 ,  0.01912938,  0.01997448,
         0.00423213,  0.00259679,  0.00093631,  0.00019838,  0.00020715]))


Tuesday, January 24, 2017

Topic Modeling using LDA

Topic Modeling

Gibbs sampling works by performing a random walk in such a way that reflects the characteristics of a desired distribution. Because the starting point of the walk is chosen at random, it is necessary to discard the first few steps of the walk (as these do not correctly reflect the properties of distribution). This is referred to as the burn-in period. We set the burn-in parameter to 4000. Following the burn-in period, we perform 2000 iterations, taking every 500th iteration for further use. The reason we do this is to avoid correlations between samples. We use 5 different starting points (nstart=5) – that is, five independent runs. Each starting point requires a seed integer (this also ensures reproducibility), so I have provided 5 random integers in my seed list. Finally I’ve set best to TRUE (actually a default setting), which instructs the algorithm to return results of the run with the highest posterior probability.

Some words of caution are in order here. It should be emphasised that the settings above do not guarantee the convergence of the algorithm to a globally optimal solution. Indeed, Gibbs sampling will, at best, find only a locally optimal solution, and even this is hard to prove mathematically in specific practical problems such as the one we are dealing with here. The upshot of this is that it is best to do lots of runs with different parameter settings to check the stability of your results. The bottom line is that our interest is purely practical, so it is good enough if the results make sense. We’ll leave issues of mathematical rigour to those better qualified to deal with them.

As mentioned earlier, there is an important parameter that must be specified upfront: k, the number of topics that the algorithm should use to classify documents. There are mathematical approaches to this, but they often do not yield semantically meaningful choices of k (see http://stackoverflow.com/questions/21355156/topic-models-cross-validation-with-loglikelihood-or-perplexity/21394092 for an example). From a practical point of view, one can simply run the algorithm for different values of k and make a choice by inspecting the results. This is what we’ll do.



################################################################################
## https://eight2late.wordpress.com/2015/09/29/a-gentle-introduction-to-topic-modeling-using-r/
#################################################################################
require(tm)
setwd("/Users/tkmaemd/Desktop/R/KLAB/1_24_2017")

#load files into corpus
#get listing of .txt files in directory
#include facebook, instagram, pinterest, twitter
filenames <- list.files(getwd(),pattern="*.txt")
filenames

#read files into a character vector
files <- lapply(filenames, readLines)


#create corpus from vector
docs <- Corpus(VectorSource(files))
inspect(docs)

#inspect a particular document in corpus
writeLines(as.character(docs[[1]]))
writeLines(as.character(docs[[2]]))
writeLines(as.character(docs[[3]]))
writeLines(as.character(docs[[4]]))

#start preprocessing
#Transform to lower case
docs <-tm_map(docs, content_transformer(tolower))

#remove potentially problematic symbols
# toSpace <- content_transformer(function(x, pattern) { return (gsub(pattern, " ", x))})
# docs <- tm_map(docs, toSpace, "-")
# docs <- tm_map(docs, toSpace, "'")
# docs <- tm_map(docs, toSpace, ".")

#remove punctuation
docs <- tm_map(docs, removePunctuation)
#Strip digits
docs <- tm_map(docs, removeNumbers)
#remove stopwords
docs <- tm_map(docs, removeWords, stopwords("english"))
#remove whitespace
docs <- tm_map(docs, stripWhitespace)

#Good practice to check every now and then
writeLines(as.character(docs[[1]]))

#keep it as plaintextdocument
docs <- tm_map(docs, PlainTextDocument)
#Stem document
docs <- tm_map(docs,stemDocument)
#change to the character type
docs <- lapply(docs, as.character)
#create the object
docs <- Corpus(VectorSource(docs))
docs <- tm_map(docs, PlainTextDocument)
dtm <- DocumentTermMatrix(docs)
#convert rownames to filenames
rownames(dtm) <- filenames

##Create a WordCloud to Visualize the Text Data
freq <- sort(colSums(as.matrix(dtm)), decreasing=TRUE)
#length should be total number of terms
length(freq)
head(freq, 20)

# Create the word cloud
require(wordcloud)
require(RColorBrewer)
wf <- data.frame(word=names(freq), freq=freq)
head(wf)
set.seed(123)
pal <- brewer.pal(11, "Spectral")
wordcloud(words = wf$word,
          freq = wf$freq,
          scale = c(3.5, 1.2),
          random.order = FALSE,
          colors = pal,
          max.words = 100)

#################################################################################
## Topic Models
#################################################################################
## What is the meaning of each topic?
## How prevalent is each topic?
## How do the topics relate to each other?
## How do the documents relate to each other?
#Set parameters for Gibbs sampling
burnin <- 400
iter <- 200
thin <- 50
seed <-list(2003,5,63)
nstart <- 3
best <- TRUE

#Number of topics
k <- 3

#Run LDA using Gibbs sampling
require(topicmodels)
ldaOut <- LDA(dtm, k, method="Gibbs", control=list(nstart=nstart, seed=seed,
                                                   best=best, burnin=burnin, iter=iter, thin=thin))

## Show document-topic distribution
ldaOut.topics <- as.matrix(topics(ldaOut))
write.csv(ldaOut.topics, file=paste("LDAGibbs",k,"DocsToTopics.csv"))

## Show term-topic distribution
#top 20 terms in each topic
ldaOut.terms <- as.matrix(terms(ldaOut,20))
write.csv(ldaOut.terms,file=paste('LDAGibbs',k,'TopicsToTerms.csv'))


#probabilities associated with each topic assignment
topicProbabilities <- as.data.frame(ldaOut@gamma)
topicProbabilities
write.csv(topicProbabilities, file=paste("LDAGibbs",k,"TopicAssignmentProbabilities.csv"))

#Find relative importance of top 2 topics
topic1ToTopic2 <- lapply(1:nrow(dtm),function(x) sort(topicProbabilities[x,])[k]/sort(topicProbabilities[x,])[k-1])
topic1ToTopic2

#Find relative importance of second and third most important topics
topic2ToTopic3 <- lapply(1:nrow(dtm),function(x) sort(topicProbabilities[x,])[k-1]/sort(topicProbabilities[x,])[k-2])
topic2ToTopic3

Thursday, January 19, 2017

Google Spreadsheets Python API

The official Google Spreadsheets API is flexible, but it can take a considerable amount of work if all you want is to access the data in a spreadsheet and do a few operations on it.

To leverage all that API work and help us all, there is gspread, which wraps up the API and exposes clear, pragmatic functions.

Using OAuth2 for Authorization
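
A minimal sketch of the OAuth2 service-account flow with gspread and oauth2client, as it worked around this time; the keyfile name and spreadsheet title are placeholders:

import gspread
from oauth2client.service_account import ServiceAccountCredentials

# Placeholders: a service-account JSON keyfile and a spreadsheet title.
scope = ['https://spreadsheets.google.com/feeds',
         'https://www.googleapis.com/auth/drive']
credentials = ServiceAccountCredentials.from_json_keyfile_name('service_account.json', scope)

gc = gspread.authorize(credentials)

# Open a spreadsheet by title and read the first worksheet.
worksheet = gc.open('My Test Spreadsheet').sheet1
rows = worksheet.get_all_values()        # list of lists, one per row
records = worksheet.get_all_records()    # list of dicts keyed by the header row

print('%d rows' % len(rows))
worksheet.update_acell('B1', 'hello')    # write a single cell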

Wednesday, January 18, 2017

Use American Community Survey (ACS) Data

#############################
## state.R
#############################
## require(choroplethr)

?df_pop_state
data(df_pop_state)
head(df_pop_state)
dat10 <- merge(dat9, df_pop_state, by.x="states", by.y="region")
dat10$perc <- dat10$cust/dat10$value
percent <- function(x, digits = 2, format = "f", ...) {
  paste0(formatC(100 * x, format = format, digits = digits, ...), "%")
}
dat10$statelabel <- paste(dat10$demand_state, "\n", percent(dat10$perc,2,"f"),  sep="")
head(dat10)

p9 <- ggplot() +
  geom_map(data=states, map=states,
           aes(x=long, y=lat, map_id=region),
           fill="#ffffff", color="#ffffff", size=0.15) +
  geom_map(data=dat10, map=states,
           aes(fill=perc, map_id=states),
           color="#ffffff", size=0.15) +
  coord_fixed(1.3) +
  scale_fill_continuous(low = "thistle2", high = "darkred", guide="colorbar") +
  geom_text(data=dat10, aes(x=long, y=lat, label=statelabel), colour="black", size=4 ) +
  ggtitle("Kohl's Customers from 11/1/2014 to 10/31/2016") + ylab("") + xlab("") +
  theme(plot.title = element_text(face = "bold", size = 20)) +
  theme(axis.text.x = element_text(face = "bold", size = 14)) +
  theme(axis.text.y = element_text(face = "bold", size = 14)) +
  theme(axis.title.x = element_text(face = "bold", size = 16)) +
  theme(strip.text.x = element_text(face = "bold", size = 16)) +
  theme(axis.title.y = element_text(face = "bold", size = 16, angle=90)) +
  guides(fill=FALSE)


## collect acs table names and column names
install.packages("acs")
require(acs)

acs.lookup(endyear=2013, span=5, table.name="Age by Sex")
lookup_B01001 <- acs.lookup(endyear=2013, span=5, table.number="B01001")
# the results slot holds the variable codes and names returned by the lookup
write.csv(lookup_B01001@results, "tmp.csv")

#############################
## demo.py
#############################
# https://github.com/CommerceDataService/census-wrapper
# https://www.socialexplorer.com/data/ACS2013_5yr/documentation/12e2e690-0503-4fb9-a431-4ce385fc656a
# http://www.census.gov/geo/maps-data/data/relationship.html
import csv
import pandas as pd
from census import Census
from us import states
# Datasets
# acs5: ACS 5 Year Estimates (2013, 2012, 2011, 2010)
# acs1dp: ACS 1 Year Estimates, Data Profiles (2012)
# sf1: Census Summary File 1 (2010, 2000, 1990)
# sf3: Census Summary File 3 (2000, 1990)

API_KEY = '0c9d762f43fe6dd071ec0f5a3bfdd19b478c2381'
c = Census(API_KEY, year=2013)


# example: convert names and fips
#c.acs5.get(('NAME', 'B25034_010E'),{'for': 'state:{}'.format(states.MD.fips)})


# Total number for all states:
records = ['B01001_031E', 'B01001_032E', 'B01001_033E', 'B01001_034E', 'B01001_035E', 'B01001_036E']
flag = 0
for record in records:
    if flag == 0:
        tmp = c.acs5.get(record, {'for': 'state:*'})
        df = pd.DataFrame(tmp)
        flag = 1
    else:
        tmp1 = c.acs5.get(record, {'for': 'state:*'})
        df1 = pd.DataFrame(tmp1)
        df = df.merge(df1, on='state', how='outer')

df.index = df['state']
df.sort_index(inplace=True)
del df['state']
df1 = df.applymap(int)
df1['tot'] = df1.apply(sum, axis=1) / 2
df1['tot'] = df1['tot'].astype(int)

# Get state label
statelabel = []
for state in df1.index:
    statelabel.append(states.lookup(state).abbr)
df1['statelabel'] = statelabel

df1.to_csv('~/Desktop/output.csv')
print 'Done'

Wednesday, January 11, 2017

Mastering Social Media Mining 5 - Mining Topic Analysis on Google+



How to interact with the Google+ API with the help of Python

How to search for people or pages on Google+

How to use the web framework, Flask, to visualize search results in a web GUI

How to process content from a user's post to extract interesting keywords

Mastering Social Media Mining 4 - Mining Facebook Posts, Pages, and User Interactions


Creating an app to interact with the Facebook platform

Interacting with the Facebook Graph API

Mining posts from the authenticated user

Mining Facebook Pages, visualizing posts, and measuring engagement

Building a word cloud from a set of posts

Mastering Social Media Mining 3 - Mining Twitter Users, Followers, and Communities on Twitter


How to download a list of friends and followers for a given user

How to analyze connections between users, mutual friends, and so on

How to measure influence and engagement on Twitter

Clustering algorithms and how to cluster users using scikit-learn

Network analysis and how to use it to mine conversations on Twitter

How to create dynamic maps to visualize the location of tweets

Mastering Social Media Mining 2 - Mining Twitter Hashtags, Topics, and Time Series

The curve that we can observe represents an approximation of a power law (https://en.wikipedia.org/wiki/Power_law).

In statistics, a power law is a functional relationship between two quantities; in this case, the frequency of a term and its position within the ranking of terms by frequency.

This type of distribution always shows a long tail (https://en.wikipedia.org/wiki/Long_tail), meaning that a small portion of frequent items dominate the distribution, while there is a large number of items with smaller frequencies.

Another name for this phenomenon is the 80-20 rule or Pareto principle (https://en.wikipedia.org/wiki/Pareto_principle), which states that roughly 80% of the effect comes from 20% of the cause (in our context, 20% of the unique terms account for 80% of all term occurrences).
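
A quick way to sanity-check this on your own term frequencies is to rank the counts and measure what share of all occurrences the top 20% of unique terms account for; a small sketch (with made-up tokenized tweets) follows.

from collections import Counter

# Made-up tokenized tweets; in practice these come from your Twitter data.
tweets = [
    ['good', 'morning', 'coffee'],
    ['coffee', 'break', 'soon'],
    ['good', 'coffee', 'good', 'mood'],
]

freq = Counter(term for tokens in tweets for term in tokens)
ranked = [count for term, count in freq.most_common()]

total = float(sum(ranked))
top_terms = ranked[:max(1, int(0.2 * len(ranked)))]
share = sum(top_terms) / total
print('top 20%% of terms cover %.1f%% of all occurrences' % (100 * share))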


Tuesday, January 10, 2017

Mastering Social Media Mining 1 - Python tools for data science


Machine learning

#################################################################################
# Chap01/demo_sklearn.py
#################################################################################

from sklearn import datasets
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt

if __name__ == '__main__':
    # Load the data
    iris = datasets.load_iris()
    X = iris.data
    petal_length = X[:, 2]
    petal_width = X[:, 3]
    true_labels = iris.target

    # Apply KMeans clustering
    estimator = KMeans(n_clusters=3)
    estimator.fit(X)
    predicted_labels = estimator.labels_

    # Color scheme definition: red, yellow and blue
    color_scheme = ['r', 'y', 'b']

    # Markers definition: circle, "x" and "plus"
    marker_list = ['o', 'x', '+']

    # Assign colors/markers to the predicted labels
    colors_predicted_labels = [color_scheme[lab] for lab in predicted_labels]
    markers_predicted = [marker_list[lab] for lab in predicted_labels]

    # Assign colors/markers to the true labels
    colors_true_labels = [color_scheme[lab] for lab in true_labels]
    markers_true = [marker_list[lab] for lab in true_labels]

    # Plot and save the two scatter plots
    for x, y, c, m in zip(petal_width, petal_length, colors_predicted_labels, markers_predicted):
        plt.scatter(x, y, c=c, marker=m)
    plt.savefig('iris_clusters.png')

    plt.figure()  # start a fresh figure so the two plots don't overlap
    for x, y, c, m in zip(petal_width, petal_length, colors_true_labels, markers_true):
        plt.scatter(x, y, c=c, marker=m)
    plt.savefig('iris_true_labels.png')
    print(iris.target_names)

Natural language processing
#################################################################################

Social network analysis

pip install networkx

#################################################################################
# Chap01/demo_networkx.py
import networkx as nx
from datetime import datetime

if __name__ == '__main__':
    g = nx.Graph()
    g.add_node("John", {'name': 'John', 'age': 25})
    g.add_node("Peter", {'name': 'Peter', 'age': 35})
    g.add_node("Mary", {'name': 'Mary', 'age': 31})
    g.add_node("Lucy", {'name': 'Lucy', 'age': 19})

    g.add_edge("John", "Mary", {'since': datetime.today()})
    g.add_edge("John", "Peter", {'since': datetime(1990, 7, 30)})
    g.add_edge("Mary", "Lucy", {'since': datetime(2010, 8, 10)})
    print(g.nodes())
    print(g.edges())
    print(g.has_edge("Lucy", "Mary"))



Processing data in Python

#################################################################################
# Chap01/demo_json.py
import json

if __name__ == '__main__':
    user_json = '{"user_id": "1", "name": "Marco"}'
    user_data = json.loads(user_json)

    print(user_data['name'])  # Marco
    user_data['likes'] = ['Python', 'Data Mining']
    user_json = json.dumps(user_data, indent=4)
    print(user_json)

Thursday, January 5, 2017

Mobile Advertising and Social Advertising

Mobile Advertising

It is no surprise that digital advertising will surpass television ads for the first time this year. When it comes to digital, advertisers are taking note as consumers spend an average of 3 hours and 8 minutes per day on their mobile devices. Not only is mobile the future of e-commerce, it is a platform that supports the entire shopping journey, from research to payment.

Looking at the recent boom in mobile advertising, a clear trend is emerging. In-app ads tend to be the most successful and offer marketers a better opportunity to target consumers at the right time, as apps are often enhanced by location data. Being able to offer highly targeted promotions is a huge advantage when it comes to consumer engagement. From Google's launch of AMP for Ads, aimed to encourage faster ads, to Facebook's ability to bypass ad blockers, these companies remain best in class.

With non-app advertising, incorporating video is key. Mobile users are incredibly responsive to this type of content and spend a significant amount of time watching videos. Spotify successfully increased subscription intent by 3% when it ran a video campaign promoting its top artists of the year on Snapchat. Others are using live-streaming capabilities to attract consumers. Twitter will also begin to live-stream NFL games starting this September.

Regardless of the type of ad and platform, engaging the audience as opposed to interrupting them is the biggest takeaway when it comes to mobile advertising. Actively helping the target audiences with shorter, simpler ads drives sales.

Social Advertising

Social media has gained a lot of attention as a way to understand fashion, fashion trends, and how different fashion brands approach brand marketing.

Through extensive research, CMG identified Generation Z as social media power-users that value authenticity and inclusivity. With a young team of editors experimenting with the delivery, distribution and creation of content, the brand has made timely strides in finding a unique voice. Obsessee has emerged as a safe space for young girls to have fun, interact and consume content through a Gen-Z lens.

CROSSING CONTENT CHANNELS

1 Case Study of Obsessee - tells real stories about real girls
2 Case Study of Sweetgreen - healthy food
3 Case Study of Country Road - lifestyle brand
4 Case Study of SoulCycle - Indoor cycling studio
5 Case Study of Starbucks - Coffeehouse chain
6 Case Study of Glossier - Beauty brand
7 Case Study of Muji


Unlike its social-only competitors (think Hearst’s Sweet, published exclusively on Snapchat Discover), Obsessee tells compelling stories across ten platforms. By teasing a small amount of context on one platform, they entice audiences to navigate to another. With a team of editors devoted to experimenting with new methods of storytelling (think emerging platforms), Obsessee is aiming to become more interconnected.

Youtube
  • Case 2 - They have not posted new content since July 2015, but previous videos featured new store openings, farmer and employee profiles and videos from the Sweetlife Festival.
  • Case 3 - Content mostly includes model profiles that coincide with campaigns and how-to's
  • Case 4 - Fun promotional videos featuring instructors relating to pop culture or seasonal trends. Documents its efforts in giving back across different communities. 
  • Case 5 - Informative videos featuring their social efforts, recipes and company history
  • Case 6 - Informative product tutorials and morning routine series titled 'Get Ready With Me' 
  • Case 7 - Videos of new products put in daily use context, branding videos, behind-the-scenes for in-store art projects, manufacturing process of wooden home products
Snapchat
  • Case 1 - The channel is referred to as 'Obsessee TV' and features takeovers from editors and influencers. Content ranges from cooking and recipe instructions to 'a day in the life of'.
  • Case 2 - This platform is engaging, dynamic and audience-focused. They post various trivia questions about food, encouraging users to take screenshots for a chance to win prizes.
  • Case 3 - Live coverage of charity events. Celebrates its different studios' anniversaries with a customised Snap filter. Every-day in-studio experience captures its instructors' and employees' personalities.
  • Case 5 - Light and playful original stories that incorporate a product directed towards a younger audience
  • Case 6 - Product teases prior to launch, takeovers, behind-the-scenes peek into its HQ office in Soho 
Twitter
  • Case 1 - Original imagery from its features; links to product pushes from editors; links to a wide range of news outlets covering topics that vary across each platform.
  • Case 2 - Used to communicate with customers and respond to enquiries. It has also become a portal for understanding changing customer needs and sourcing talent.
  • Case 3 - Links to product pushes, signature hashtag #countryroadstyle, links to blog posts and campaign imagery, links to company careers page, teasers for strategic partnerships
  • Case 4 - Announces themed classes by hashtagging the SoulCycle studio location and tagging the instructor. Posts daily inspirational quotes with a separate hashtag: #dailySOUL. Redirects to its blog. 
  • Case 5 - Retweets from pleased fans and other owned accounts as well as cross channel promotion
  • Case 6 - Retweets from happy customer comments, brand and publication reviews. Glossier's popularity paired with its limited production has enabled the arrival of product in the mail to be the ultimate unboxing experience. Customers feel compelled to share their purchases with their personal online community using a Glossier hashtag. The brand's social accounts are heavily weighted with user-generated content from its loyal community (it's estimated that Twitter is roughly 50% user-generated), permitting the Glossier voice to be predominantly consumer-led.
  • Case 7 - Store openings, promotional campaigns, links to Instagram contests, new products, designer quotes
Facebook
  • Case 1 - Features platform-specific contents; links to girl-power-themed articles about current affairs, activism, fashion, beauty and life hacks from outlets like Bustle and Her Campus.
  • Case 2 - Features on this platform include event invites, direct correspondence with customers who communicate their complaints and content promoted across its other platforms.
  • Case 3 - Sales pushes, campaign shots, customer feedback and customer service initiatives, and links to the blog
  • Case 4 - Redirects to the Community tab on its website featuring: inspiring stories on how SoulCycle has guided its members, healthy lifestyle advice and behind-the-scenes videos of instructors off-duty. 
  • Case 5 - Product-heavy imagery and dialogue with consumers highlighting social issues. With a host of regional Facebook accounts, Starbucks can directly engage with local consumers in that market. Examples of such accounts include Starbucks New Zealand and Starbucks Canada. This approach proves that the Starbucks customer (wherever they may be) is at the forefront of their strategy.
  • Case 6 - Live make-up tutorials, product promotion, redirection to Into The Gloss blog
  • Case 7 - Videos of new products put in daily use context, promotional campaigns, live streaming of designer talks, new store openings, collaborations with designers and educational institutes, outfit suggestions
Instagram
  • Case 1 - Features with extended text and imagery that interact with their growing community; product reviews and recommendations; inspirational quotes from female influencers.
  • Case 2 - The content on this platform is equal parts inspirational and approachable.
  • Case 3 - Major push on sales and discounted merchandise accompanied by beautiful campaign imagery, product shots, push to blog posts
  • Case 5 - Mostly user-generated content and festive product shots celebrating the current season
  • Case 6 - Inspirational imagery and mood boards, user-generated content, product pushes. Glossier launched on Instagram, capitalising on the visual preferences of its targeted demographic. This was a marketing tactic used to drive dialogue and hype. Instagram has challenged traditional norms enabling anyone to become famous and once unknown brands, like Glossier, to emerge as industry-leaders in the making with a cult-like status.
  • Case 7 - Original professional editorials and product shots with minimalistic styling, videos of new products put in daily use context
Pinterest
  • Case 1 - Boards with themes including: "Under $100 Gift Ideas", "Make-up Ideas", "Hairstyle Ideas", "Pastel Hair", "Natural Hairstyles", "Style Icons", "Travel + Adventure" and more.
  • Case 3 - 24 boards in total with themes including: simple things, woman, man, child, travel, culture, home, living
  • Case 5 - 23 boards ranging from coffee and tea photography to store design and cup art
  • Case 6 - 24 total boards with themes including: Skin Is In, Glossier Pink, Beauty How-Tos
  • Case 7 - Board titles include: Muji To Go, Art & Craft, myMuji, Fun Stationery, Aroma, Travel, Product Fitness 80, and Muji Food
Spotify
  • Case 1 - Curated and guest-edited public playlists (from girls featured on other platforms): "Bowie's Best", "Post-punk", "Warm 'n' Guzzier", "All that Glitters" and more.
Tumblr
  • Case 1 - Mostly beauty inspiration, images of female icons, interesting landscapes, dogs, food and fashion with a hint of nostalgia.
Actions

SPEAKING GEN-Z
Editors often use humorous slanguage with words like "bae", "woke" and "lit" casually worked into editorial posts. By using this jargon (a viral depiction of youth culture), the content feels conversational, inclusive and friendly. The brand's original imagery is not hyper-glamourised, (even appearing blurry at times) and products featured are accessible, reinforcing its stand as the approachable publisher catering to all of Gen-Z's "feels".

SOCIAL STORYTELLING
Obsessee uses its Instagram and Facebook accounts as storytelling tools, marrying original images and long-form copy in an easily digestible format. On Instagram, the story is spread across multiple images and the text is written as captions. On Facebook, the same story is published via the platform's ‘notes’ function, streamlining the piece into a glossy display (resembling a blog), which readers can read without leaving the platform.

SOCIAL CURRENCY
In July 2016, Obsessee opened a pop-up shop at the Grove in LA. For three days, customers could purchase from brand partners like Keds and Ban.do. Customers ‘paid’ via social media posts on Instagram, Facebook and Snap. Platforms were valued differently, with Snap being the least valuable. Before leaving the store with merch, customers had to upload their images, tag Obsessee and use the hashtag #IGotFreeStuff.  

NAVIGATING NATIVE ADS
Obsessee has invested in a native ad model, taking to Instagram to partner with Hawaiian Tropic, Swatch and others. Obsessee has diligently disclosed its participation in these initiatives, using the hashtag #ad, in compliance with the FTC's guidelines (particularly relevant as Gen-Z thrives on transparency and 'wokeness'). As this content blends seamlessly into an Instagram feed, it can be difficult for young audiences with untrained eyes to differentiate.  

CHALLENGE ACCEPTED
Obsessee encourages user participation on Instagram, leading to an organically engaged audience. Through its monthly photo challenges, audiences post imagery aligning with the given word of each day, tag Obsessee and then use #happyscrolling to be featured. Obsessee doesn't always attach an incentive, having also posted about tidying up, digital detoxing and making a difference, the latter aligning perfectly with Gen Z priorities.  

TUNING OUT THE NOISE
Obsessee's Spotify account is filled with editor-approved playlists. Using Instagram to help get the word out, posts feature songs that are holiday-themed or provide the "feels" following a particular current event (the election, for example). With a corresponding lengthy synopsis (deeply personal or informative) often detailing the reason for the selection, audiences can relate and socially share emotions through music.  

Python Instagram API:

Data Retrieval:

Users: http://instagr.am/developer/endpoints/users/

Relationships: http://instagr.am/developer/endpoints/relationships/

Media: http://instagr.am/developer/endpoints/media/

Comments: http://instagr.am/developer/endpoints/comments/

Likes: http://instagr.am/developer/endpoints/likes/

Tags: http://instagr.am/developer/endpoints/tags/

Locations: http://instagr.am/developer/endpoints/locations/

Geographies: http://instagr.am/developer/endpoints/geographies/

Important statistics include number of likes, comments, caption, geolocation, hashtags.
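
As an illustration, here is a hedged sketch of pulling those statistics through the Media endpoint listed above with plain requests; the access token and user ID are placeholders, and the field names (likes.count, comments.count, caption.text, tags, location) follow the v1 JSON responses.

import requests

ACCESS_TOKEN = 'YOUR_ACCESS_TOKEN'   # placeholder
USER_ID = 'TARGET_USER_ID'           # placeholder

url = 'https://api.instagram.com/v1/users/%s/media/recent/' % USER_ID
resp = requests.get(url, params={'access_token': ACCESS_TOKEN, 'count': 20})
resp.raise_for_status()

# Extract likes, comments, caption, hashtags, and geolocation per post.
for media in resp.json().get('data', []):
    caption = media.get('caption') or {}
    print({
        'likes': media['likes']['count'],
        'comments': media['comments']['count'],
        'caption': caption.get('text', ''),
        'hashtags': media.get('tags', []),
        'location': media.get('location'),
    })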

Brand Behavior on Instagram

- Average frequency of posts, hashtags, favorites, and retweets of posts made by a brand on the platform

- Identify popular trends or topics

Use the captions or text attached to all the posts of a brand and try to understand how the topics vary across all the brands on the two platforms.

- Detect visual features

Use transfer learning, where networks trained on one task are used to create representations and analysis for other tasks: learn a distributed representation and map the images onto visual features such as objects, color, texture, etc.
