Wednesday, August 9, 2017

d3

 brew install n
 n lts
 sudo n lts
 node --version
 npm --version
 mkdir d3-project
 cd d3-project
 npm init -y

 npm install "babel-core@^6" "babel-loader@^6" "babel-preset-es2017@^6" "babel-preset-stage-0@^6" "webpack@^2" "webpack-dev-server@^2" css-loader style-loader json-loader --save-dev


python -m http.server 8000

Collapsible Tree -
https://bl.ocks.org/mbostock/4339083

Tuesday, June 20, 2017

Data Innovation



Data innovation will provide competitive differentiation and enhance the end user experience

Delivering compelling, differentiated customer experiences in our key brands is essential to winning market share, and many of these experiences require extensive, innovative use of data. Taking the mail property as an example, this section looks at that property’s roadmap for 2017 and highlights the key roadmap features that will require significant data support. The degree to which data is linked to differentiated capabilities here is not because mail is different from our other applications; when fully factored into strategy, the innovative use of data is expected to play a central role in promoting growth in most areas of the business.

* Rich Compose - This feature helps users take advantage of rich content when sending messages, including attachments, GIFs, images, and video. Content recommendations would be data-driven, requiring an understanding of the user and their immediate intent.

* Contacts - Building up an understanding of a mail user’s contacts requires sophisticated analysis of that user’s emails, including who they send mail to and receive mail from. Additionally, notification emails that social networks send to the user can be analyzed to gain knowledge of the user’s social network contacts. Deriving the contact details associated with a comprehensive set of contacts (name, relationship, email, phone number, address) from the data in each inbox brings powerful benefits to the mail user. From the mobile mail application we can often also inherit contact lists, call logs and message histories, with associated caller ID; this data can improve the handling of unknown or undesired callers. Advanced caller-ID experiences require data analysis, digestion and recommendations, particularly for B2C scenarios, for example when your airline calls you. Finally, we have the opportunity to construct a “global graph” from all of our users, which can be leveraged both to protect each individual user and to provide differentiated features for them.

* Coupons - Many users care a great deal about deals that have the potential to reduce the cost of goods and services. Our mailboxes contain an enormous number of discount offers made available to our customers. Organizing these coupons and surfacing them to customers at the appropriate time, based for example on relevance and expiration date, has the potential to delight our customers and to increase mail usage. Developing powerful coupon recommendation capabilities will require understanding when coupons are set to expire, where geographically they are valid, and when they are of highest value (for example, we could alert users when they are near the store where their coupon is valid). Finally, our goal is to develop a global coupon extraction system, so all of our users can receive recommendations that tap into the overall coupon pool.

* Photos - Understanding the content within a photo has significant benefits for search relevance. A mail user who searches for “beach vacation” would be delighted to find the sought-after email where the only relevant information was an attached photo showing a tropical beach. Leveraging vision systems to analyze user photos enables a set of compelling use cases (including search, auto-tagging, and object recognition). Providing the ability to group photos by similarity, date, or included objects has the potential to allow powerful new ways for users to leverage their mail.

* Content Organization - To assist our users in organizing their mailboxes so that content can be more easily accessed, we are working to provide differentiated experiences. Email classification underpins some of the planned improvements. Our goal is to build browsing experiences for common use cases, such as flights, hotels, subscriptions, and finance. Providing a simplified way to unsubscribe from a subscription is an example of a specific goal.

* Personalization - Characterizing our users and giving them a personalized mail experience based on their usage has the potential to make mail more usable and efficient. For example, users who are running a business on our mail system have very different needs and expectations than grandparents who are staying in touch with their families. Recognizing and categorizing users and personalizing their experience has the potential to drive higher satisfaction and retention.

* Notifications - For all of our planned scenarios, we also need to consider when to trigger notifications, at the right time and at the right interval. It is imperative that we not overload users with such signals. We need to analyze all available data to understand when a notification will be welcome and will trigger engagement, rather than have the opposite effect. Collecting GPS signals can be very helpful for generating a notification at the optimal time to inform a user about a relevant local deal or coupon.

Great brands are powered by cutting-edge technology.
  • Technology fuels the mobile experiences people love—from the software that delivers the content you crave to the code that powers your favorite communication apps.
  • We’re engineering one of the industry’s most comprehensive ad technology stacks across mobile, video, search, native and programmatic to maximize results for our partners.
  • Our research and engineering talent solves some of the industry’s biggest challenges in infrastructure, data, AI, machine learning and more.
  • We’re building on our rich foundation of technical innovation as we break new ground with Verizon.

We design for consumers first.
  • We listen to consumers through research, user engagement and user feedback to build the experiences they love.
  • We build mobile experiences for everyone. Products that are developed with every individual in mind, including users with disabilities, are better products.
  • We abide by the highest standards of accountability and transparency to protect our consumers’ and customers’ data.
  • We benefit from Verizon’s experience and resources to further strengthen our security.

We build technology for scale.
  • We build products that reach one billion users, and share our technologies with the world.
  • Every part of our business is fueled by machine learning, improving our brands and products with better image recognition, advertising targeting, search rankings, content personalization and e-commerce recommendations.
  • We’re a partner to all. We frequently work with the open source community, tech industry counterparts and academic peers to build the best possible products.

Sunday, June 4, 2017

Keras

import numpy as np
import matplotlib.pyplot as plt
plt.rcParams['figure.figsize'] = (7,7) # Make the figures a bit bigger


from keras.datasets import mnist
from keras.models import Sequential
from keras.layers.core import Dense, Dropout, Activation
from keras.utils import np_utils

# ## Load Training Data
nb_classes = 10 # number of outputs = number of digits

# the data, shuffled and split between train and test sets
(X_train, y_train), (X_test, y_test) = mnist.load_data()
print("X_train original shape", X_train.shape)
print("y_train original shape", y_train.shape)


# plot the first nine training images with their class labels
for i in range(9):
    plt.subplot(3,3,i+1)
    plt.imshow(X_train[i], cmap='gray', interpolation='none')
    plt.title("Class {}".format(y_train[i]))

# ## Format the data for training

## reshape 28*28 = 784
X_train = X_train.reshape(60000, 784)
X_test = X_test.reshape(10000, 784)
X_train = X_train.astype('float32')
X_test = X_test.astype('float32')


## normalize
X_train /= 255
X_test /= 255
print("Training matrix shape", X_train.shape)
print("Testing matrix shape", X_test.shape)


## one-hot format - convert class vectors to binary class matrix
Y_train = np_utils.to_categorical(y_train, nb_classes)
Y_test = np_utils.to_categorical(y_test, nb_classes)

# ## Build the NN
model = Sequential()
model.add(Dense(512, input_shape=(784,)))
model.add(Activation('relu'))
model.add(Dropout(0.2))

model.add(Dense(512))
model.add(Activation('relu'))
model.add(Dropout(0.2))

model.add(Dense(10))
model.add(Activation('softmax'))
model.summary()

# ## Compile the model
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
# ## Train the model
history = model.fit(X_train, Y_train,
                    batch_size=128,
                    epochs=4,
                    verbose=1,
                    validation_data=(X_test, Y_test))

# ## Evaluate the performance
score = model.evaluate(X_test, Y_test, verbose=1)
print("\nTest score:", score[0])
print('Test accuracy:', score[1])

# ## Inspecting the output
predicted_classes = model.predict_classes(X_test)
# Check which items we got right / wrong
correct_indices = np.nonzero(predicted_classes == y_test)[0]
incorrect_indices = np.nonzero(predicted_classes != y_test)[0]
plt.figure()
for i, correct in enumerate(correct_indices[:9]):
    plt.subplot(3,3,i+1)
    plt.imshow(X_test[correct].reshape(28,28), cmap='gray', interpolation='none')
    plt.title("Predicted {}, Class {}".format(predicted_classes[correct], y_test[correct]))
plt.show()

plt.figure()
for i, incorrect in enumerate(incorrect_indices[:9]):
    plt.subplot(3,3,i+1)
    plt.imshow(X_test[incorrect].reshape(28,28), cmap='gray', interpolation='none')
    plt.title("Predicted {}, Class {}".format(predicted_classes[incorrect], y_test[incorrect]))

plt.show()

# ## List all data in history
print(history.history.keys())

# summarize history for accuracy
plt.plot(history.history['acc'])
plt.plot(history.history['val_acc'])
plt.title('model accuracy')
plt.ylabel('accuracy')
plt.xlabel('epoch')
plt.legend(['train', 'test'], loc='upper left')
plt.show()

# summarize history for loss
plt.plot(history.history['loss'])
plt.plot(history.history['val_loss'])
plt.title('model loss')
plt.ylabel('loss')
plt.xlabel('epoch')
plt.legend(['train', 'test'], loc='upper left')
plt.show()


https://github.com/wxs/keras-mnist-tutorial/blob/master/MNIST%20in%20Keras.ipynb



Wednesday, May 24, 2017

Bayesian using Stan

#########################################################
## stan
#########################################################
data{
  int<lower=0> N;
  vector[N] Value;
  vector[N] SqFt;
}
parameters
{
  real alpha;
  real beta;
  real<lower=0> sigma;
}

model
{
  Value ~ normal(alpha + beta*SqFt, sigma);
}



#########################################################
## R
#########################################################
library(rstan)

housing <- read.table("housing1.csv", sep=",", header=TRUE, stringsAsFactors=FALSE)
head(housing)
mod1 <- lm(ValuePerSqFt ~ SqFt, data=housing)
summary(mod1)

fit <- stan(file='y.stan',
            data=list(N=nrow(housing),
                      Value=housing$Value,
                      SqFt=housing$SqFt),
            iter=10)


Monday, May 8, 2017

Windowing in Hive

1 Partition specification: It includes a column reference from the table. It cannot be an aggregation or another window specification.

- SELECT fname,ip, COUNT(pid) OVER (PARTITION BY ip) FROM sales;
- SELECT fname,ip,zip,pid, COUNT(pid) OVER (PARTITION BY ip, zip) FROM sales;
select user_id, 
  client_session_id,
  event_name,
  count(event_name) over (partition by user_id, client_session_id)
from tmp_churn_mar2017_little_sister
where user_id = 255177207;

2 Order specification: It comprises a combination of one or more columns. The ordering can be ASC or DESC; the default is ASC.

- SELECT fname,pid, COUNT(pid) OVER (PARTITION BY ip ORDER BY fname) FROM sales;
- SELECT fname,ip,pid, COUNT(pid) OVER (PARTITION BY ip, pid ORDER BY fname) FROM sales;
- select user_id, 
  client_session_id,
  event_name,
  count(event_name) over (partition by user_id, client_session_id order by event_timestamp)
from tmp_churn_mar2017_little_sister
where user_id = 255177207;

3 Window frame: A frame has a start boundary and an optional end boundary. Frame type: a window frame can be of type ROWS or RANGE.

- SELECT fname, ip, COUNT(pid) OVER (PARTITION BY ip ORDER BY fname ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) FROM sales;
select user_id, 
  client_session_id,
  event_name,
  count(event_name) over (partition by user_id, client_session_id order by event_timestamp ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)
from tmp_churn_mar2017_little_sister
where user_id = 255177207;

Frame boundary: A frame is associated with a direction or an amount. A direction value could be PRECEDING or FOLLOWING and the amount could be an integer value or keyword UNBOUNDED.

Effective window frames:
BETWEEN <start boundary> AND CURRENT ROW: When only the start boundary of a frame is specified.
RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW: When only the order is specified but no window frame is specified.
ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING: When no order and no window frame are specified.

- SELECT fname, ip, COUNT(pid) OVER (PARTITION BY ip ORDER BY fname ROWS BETWEEN 2 PRECEDING AND CURRENT ROW) FROM sales;
- SELECT fname, ip ,COUNT(pid) OVER (PARTITION BY ip ORDER BY fname ROWS BETWEEN 2 PRECEDING AND 2 FOLLOWING) FROM sales;
- SELECT fname, ip, COUNT(pid) OVER (PARTITION BY ip ORDER BY fname ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING) FROM sales;

4 LEAD and LAG: A row R in the input table belongs to a partition as defined in the partition specification. LEAD returns the value of a column from a following row in that partition, and LAG returns it from a preceding row.

SELECT fname,pid, LEAD(pid) OVER (PARTITION BY ip ORDER BY ip) FROM sales;
SELECT fname,pid, LAG(pid) OVER (PARTITION BY ip ORDER BY ip) FROM sales;

select user_id,
  client_session_id,
  event_name,
  event_timestamp,
  LAG(event_name) OVER (PARTITION BY user_id, client_session_id order by event_timestamp) previous_event_name,
  count(event_name) over (partition by user_id, client_session_id order by event_timestamp) event_ts_order
from tmp_churn_mar2017_little_sister
where user_id = 255177207;

5 Second largest
select max(Salary)
from Employee
where Salary < (select max(salary) from Employee)

6 Nth largest
select a.Salary
from Employee a
join Employee b
  on a.Salary <= b.Salary
group by a.Salary
having count(distinct b.Salary) = N 

7 Top three Salaries
select a.dept, a.Salary
from Employee a
join Employee b
  on a.dept = b.dept and a.Salary <= b.Salary
group by a.dept, a.Salary
having count(distinct b.Salary) <=3

8 Median
select Id, Company, Salary
from Employee e
where abs(
    (select count(*) from Employee e1 where e.Company = e1.Company and e.Salary >= e1.Salary) -
    (select count(*) from Employee e2 where e.Company = e2.Company and e.Salary <= e2.Salary)
  ) <= 1
group by Id, Company

Sunday, April 30, 2017

Basic Bootstrap

Q: How do we derive distribution estimates from a relatively small sample?

A: Use the bootstrap to estimate the sampling distribution of a statistic or model parameter.

The basic idea of bootstrapping is that inference about a population from sample data (sample → population) can be modeled by resampling the sample data and performing inference about the sample from the resampled data (resampled → sample).

The bootstrap does not necessarily involve any assumptions about the data, or the sample statistic, being normally distributed. In practice, it is not necessary to actually replicate the sample a huge number of times; we simply replace each observation after each draw, that is, we sample with replacement. In this way, we effectively create an infinite population in which the probability of an element being drawn remains unchanged from draw to draw. The algorithm for a bootstrap resampling of the mean, for a sample of size N, is as follows (a short Python sketch follows the list):
  1.  Draw a sample value, record, replace it
  2.  Repeat N times
  3.  Record the mean of the N resampled values
  4.  Repeat steps 1-3 B times
  5.  Use the B results to:
    1.  Calculate their standard deviation (this estimates sample mean standard error)
    2.  Produce a histogram or boxplot
    3.  Find a confidence interval
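
A minimal NumPy sketch of this procedure (the sample values and the number of resamples B below are made up purely for illustration):

import numpy as np

def bootstrap_mean(sample, B=1000):
    # steps 1-4: draw N values with replacement, record the mean, repeat B times
    n = len(sample)
    means = np.empty(B)
    for b in range(B):
        resample = np.random.choice(sample, size=n, replace=True)
        means[b] = resample.mean()
    return means

sample = np.array([12.1, 9.8, 11.4, 10.2, 13.5, 9.9, 10.7, 12.8])  # hypothetical small sample
boot_means = bootstrap_mean(sample, B=2000)

print("bootstrap estimate of the standard error of the mean:", boot_means.std(ddof=1))  # step 5.1
print("95% confidence interval:", np.percentile(boot_means, [2.5, 97.5]))               # step 5.3
# plt.hist(boot_means) would give the histogram in step 5.2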

Thursday, March 30, 2017

Facebook Prophet

pip install pystan

pip install fbprophet


import pandas as pd
import numpy as np
from fbprophet import Prophet
import matplotlib.pyplot as plt


df = pd.read_csv("~/downloads/res2.csv", index_col=[0])
df.head(2)

data = pd.DataFrame({'ds': df.dt,
                   'y': np.log(df['amount'])})
data.head(2)


m = Prophet()
m.fit(data)


future = m.make_future_dataframe(periods=4)
future.tail()


forecast = m.predict(future)
forecast[['ds', 'yhat', 'yhat_lower', 'yhat_upper']].tail()

m.plot(forecast)

m.plot_components(forecast);
plt.show()