Thursday, July 5, 2012

Collinearity and VIF


Collinearity occurs when two or more variables are highly associated. Including them in a linear model can result in confusing, nonsensical, or misleading results, because the model cannot differentiate the contribution from each of them.

For example, because visits and transactions are so highly related, and because a linear model assumes that effects are additive, an effect attributed to one variable (such as transactions) is not available in the model to be attributed jointly to another that is highly correlated with it (visits). This inflates the standard errors of the predictors, which means that the coefficient estimates will be highly uncertain or unstable. As a practical consequence, coefficient estimates may differ dramatically from sample to sample due to minor variation in the data, even when the underlying relationships are the same.

The degree of collinearity in data can be assessed with the variance inflation factor (VIF). It estimates how much the standard error (variance) of a coefficient in the linear model is increased because of shared variance with other variables, compared to the situation in which the variables are uncorrelated or a simple single-predictor regression is performed.

The VIF provides a measure of shared variance among variables in a model. A common rule of thumb is that VIF > 5.0 indicates the need to mitigate collinearity.
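
For illustration only (simulated data standing in for the visits/transactions example above, not part of the original analysis), VIF can be read straight off a fitted linear model with car::vif():

## Minimal sketch on made-up data: two deliberately collinear predictors
set.seed(123)
visits       <- rnorm(1000, mean = 20, sd = 5)
transactions <- 0.8 * visits + rnorm(1000, sd = 1)        # strongly tied to visits
spend        <- 10 + 2 * transactions + rnorm(1000, sd = 4)

m1 <- lm(spend ~ visits + transactions)
library(car)        # provides vif()
vif(m1)             # both VIFs land well above 5, flagging the collinearity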

There are three general strategies for mitigating collinearity:
- Omit variables that are highly correlated.
- Eliminate correlation by extracting principal components or factors for sets of highly correlated predictors.
- Use a method that is robust to collinearity, i.e., something other than traditional linear modeling, e.g., a random forest, which uses only a subset of variables at a time.
In all, the most common approaches are to omit highly correlated variables and to use principal components or factor scores instead of the individual items, as sketched below.
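
A sketch of the principal components strategy, continuing the same made-up data (again only an illustration, not the original analysis):

## Replace two collinear predictors with their first principal component
set.seed(123)
visits       <- rnorm(1000, mean = 20, sd = 5)
transactions <- 0.8 * visits + rnorm(1000, sd = 1)
spend        <- 10 + 2 * transactions + rnorm(1000, sd = 4)

pc1 <- prcomp(cbind(visits, transactions), scale. = TRUE)$x[, 1]   # shared variance in one score
m2  <- lm(spend ~ pc1)       # a single composite predictor, so no collinearity remains
summary(m2)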







SEM to estimate Structural Equation Models in R

Structural models are helpful when your modeling needs meet any of the following conditions:
- To evaluate interconnection of multiple data points that do not map neatly to the division between predictors and an outcome variable
- To include unobserved latent variables such as attitudes and estimate their relationships to one another or to observed data
- To estimate the overall fit between observed data and a proposed model with latent variables or complex connections.

Structural models are closely related both to linear modeling, because they estimate associations and model fit, and to factor analysis, because they use latent variables.

With regard to latent variables, the models can be used to estimate the association between outcomes such as purchase behavior and underlying attitudes that influence those, such as brand perception, brand preference, likelihood to purchase, and satisfaction.

With SEM, it is feasible to do several things that improve our models: to include multiple influences, to posit unobserved concepts that underlie the observed indicators (i.e., constructs such as brand preference, likelihood to purchase, and satisfaction), to specify how those concepts influence one another, to assess the model's overall congruence to the data, and to determine whether the model fits the data better than alternative models.

SEM starts from a graphical path diagram of influences and then estimates the strength of the relationship for each path in the model. Such paths often concern two kinds of variables: manifest variables that are observed, i.e., that have data points, and latent variables that are conceived to underlie the observed data. The set of relationships among the latent variables is called the structural model, while the linkage between those elements and the observed, manifest variables is the measurement model.
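
In lavaan syntax the two parts are written separately; the following is a generic, hypothetical sketch (all variable names invented), with =~ defining the measurement model and ~ the structural model:

library(lavaan)
## Hypothetical model: two latent constructs measured by survey items q1-q6,
## and an observed purchase-intent variable driven by both constructs
hypModel <- "
  # measurement model: latent =~ manifest indicators
  Preference   =~ q1 + q2 + q3
  Satisfaction =~ q4 + q5 + q6
  # structural model: relationships among the constructs and the outcome
  intent ~ Preference + Satisfaction
"
# fit <- sem(hypModel, data = surveyData)   # surveyData is a placeholder data frame
# summary(fit, fit.measures = TRUE)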

Structural equation models are similar to linear regression models, but differ in three regards.
First, they assess the relationships among many variables, with models that may be more complex than simply predictors and outcomes.
Second, those relationships allow for latent variables that represent underlying constructs that are thought to be manifested imperfectly in the observed data.
Third, the models allow relationships to have multiple 'downstream' effects.

Two general approaches to SEM are the covariance-based approach (CB-SEM), which attempts to model the relationships among the variables at once and thus is a strong test of the model, and the partial least squares approach (PLS-SEM), which fits parts of the data sequentially and has less stringent requirements. 

After specifying a CB-SEM model, simulate a data set using simulateData() from lavaan with reasonable guesses as to the variable loadings. Use the simulated data to determine whether your model is likely to converge for the sample size you expect.
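
A sketch of that check, assuming lavaan and placeholder loadings (the model and sample size here are invented for illustration):

library(lavaan)
## Population model with guessed loadings and one structural path
popModel <- "
  Preference   =~ 0.8*q1 + 0.7*q2 + 0.6*q3
  Satisfaction =~ 0.8*q4 + 0.7*q5 + 0.6*q6
  Satisfaction ~ 0.5*Preference
"
set.seed(98112)
simDat <- simulateData(popModel, sample.nobs = 200)   # the sample size you expect to collect

## Fit the analysis model (loadings now free) to see whether it converges at that N
fitModel <- "
  Preference   =~ q1 + q2 + q3
  Satisfaction =~ q4 + q5 + q6
  Satisfaction ~ Preference
"
fit <- sem(fitModel, data = simDat)
summary(fit, fit.measures = TRUE)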

Plot your specified model graphically and inspect it carefully to check that it is the model you intended to estimate. 
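
One way to do this (an assumption here: the semPlot package and the fit object from the sketch above) is semPaths(), which draws the path diagram from a fitted lavaan object:

library(semPlot)
## Inspect the drawn paths against the model you intended to specify
semPaths(fit, what = "est", edge.label.cex = 0.8, rotation = 2)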

Whenever possible, specify one or two alternative models and check those in addition to your model. Before accepting a CB-SEM model, use compareFit() (from the semTools package) to demonstrate that your model fits the data better than the alternatives.
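
A sketch of such a comparison, assuming semTools and the simulated data above; the one-factor alternative is invented purely as a rival structure:

library(lavaan)
library(semTools)
## Alternative model: a single factor behind all six items
altModel <- "
  General =~ q1 + q2 + q3 + q4 + q5 + q6
"
fitAlt <- cfa(altModel, data = simDat)

## The proposed model should show the better fit indices before it is accepted
compareFit(fit, fitAlt)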


If you have data of varying quality, nominal categories, small sample, or problems converging a CB-SEM model, consider partial least squares SEM (PLS-SEM). 


##################################################################
## SEM
##################################################################
## install.packages("sem")
require(sem)

## Read the lower-triangular correlation matrix for the ten observed variables
R.DHP <- readMoments(diag=FALSE, names=c("ROccAsp", "REdAsp", "FOccAsp",
                                         "FEdAsp", "RParAsp", "RIQ", "RSES", "FSES", "FIQ", "FParAsp"),
                     text="
                     .6247
                     .3269 .3669
                     .4216 .3275 .6404
                     .2137 .2742 .1124 .0839
                     .4105 .4043 .2903 .2598 .1839
                     .3240 .4047 .3054 .2786 .0489 .2220
                     .2930 .2407 .4105 .3607 .0186 .1861 .2707
                     .2995 .2863 .5191 .5007 .0782 .3355 .2302 .2950
                     .0760 .0702 .2784 .1988 .1147 .1021 .0931 -.0438 .2087
                     ")
## Specify the structural equations; RGenAsp and FGenAsp are latent variables
model.dhp.1 <- specifyEquations(covs="RGenAsp, FGenAsp", text="
                                RGenAsp = gam11*RParAsp + gam12*RIQ + gam13*RSES + gam14*FSES + beta12*FGenAsp
                                FGenAsp = gam23*RSES + gam24*FSES + gam25*FIQ + gam26*FParAsp + beta21*RGenAsp
                                ROccAsp = 1*RGenAsp
                                REdAsp = lam21(1)*RGenAsp # to illustrate setting start values
                                FOccAsp = 1*FGenAsp
                                FEdAsp = lam42(1)*FGenAsp
                                ")
## Fit the model to the correlation matrix with N = 329; exogenous variables are fixed
sem.dhp.1 <- sem(model.dhp.1, R.DHP, 329,
                 fixed.x=c('RParAsp', 'RIQ', 'RSES', 'FSES', 'FIQ', 'FParAsp'))
summary(sem.dhp.1)


Wednesday, February 8, 2012

LME to estimate Mixed Effect Models in R

In common marketing discussion, a hierarchical model estimates both group-level effects and individual differences in effects. Such models are popular in marketing because they provide insight into differences among customers (heterogeneity) and the distribution of preference. Hierarchical linear models (HLM) are exemplified when we estimate the importance of effects for individuals as well as for the overall population.

Effects that are associated with all observations are known as fixed effects, and those that differ across various grouping levels are known as random effects. 

These models are also known as mixed effect models, because the total effect for each person is composed of the effect for the overall population (the fixed effect) plus the per-individual (random) effect.

The difference between estimating hierarchical effects and simply including the grouping variable as a factor in a standard linear model is that a hierarchical model estimates every specified effect for each individual or group, not just a single adjustment term.

The formula for a mixed-effects model includes a grouping term, + (... | group). Common models allow a different intercept by group using (1 | group), or different intercepts and slopes for predictors within each group using (predictor | group). To estimate an individual-level model, the grouping term is typically the respondent identifier.
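
A minimal sketch on simulated data (every name here is hypothetical), assuming the lme4 package; it contrasts a plain linear model, which fits one slope for everyone, with a hierarchical model that also estimates an intercept and price slope per respondent:

library(lme4)

## Hypothetical data: 50 respondents, each rating a product at 10 price points
set.seed(4567)
d <- data.frame(resp.id = factor(rep(1:50, each = 10)),
                price   = rep(1:10, times = 50))
d$rating <- 8 + rep(rnorm(50, sd = 0.5), each = 10) +                 # per-respondent intercepts
            (-0.4 + rep(rnorm(50, sd = 0.2), each = 10)) * d$price +  # per-respondent slopes
            rnorm(nrow(d), sd = 1)

m.fixed <- lm(rating ~ price, data = d)                        # one overall slope only
m.hier  <- lmer(rating ~ price + (price | resp.id), data = d)  # overall effects + per-respondent effects
fixef(m.hier)                  # population-level (fixed) intercept and slope
head(ranef(m.hier)$resp.id)    # per-respondent (random) deviations from them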


Hierarchical models can be used to group observations at levels other than the individual. For example, we might wish to group by store, advertising campaign, salesperson, or some other factor, if we want to estimate effects that are specific to such a grouping. 


Hierarchical models in marketing are often estimated with Bayesian methods that are able to pool information and produce best estimates of both group and individual effects using potentially sparse data. 

Model coefficients from a hierarchical model are inspected using summaries of the many estimates that are collected in an mcmc object.
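
As one option (an assumption for illustration, not necessarily the sampler used here), the same kind of model can be fit by MCMC with rstanarm, reusing the simulated data frame d from the sketch above, and the collected draws summarized:

library(rstanarm)
## Bayesian version of the hierarchical model above
b.hier <- stan_lmer(rating ~ price + (price | resp.id), data = d,
                    chains = 2, iter = 1000, seed = 4567)
summary(b.hier, pars = "price")   # population-level price effect
draws <- as.matrix(b.hier)        # one column per parameter, including per-respondent terms
dim(draws)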

library(nlme)
# Intercept-only mixed model: an overall (fixed) intercept plus a random intercept per group;
# hsb is assumed to be an already-loaded data frame with outcome mathach and grouping variable id
model1 <- lme(mathach ~ 1, random = ~ 1 | id, data = hsb)
summary(model1)

Saturday, February 4, 2012

SQL to Compute User Age


drop table lh_tmptbl;
create table lh_tmptbl as
select aa.userhashkey, min(aa.datekey) as firstdt
from factevent aa
where aa.userhashkey % 100 = 0 and aa.eventtypeid = 1 and aa.uac = 1 and aa.datekey <= 20130728
group by 1
having min(aa.datekey)>=20130522 and min(aa.datekey)<=20130628;

drop table tmp2;
create table tmp2 as
select  tbl1.firstdt,
        TO_DATE(cast(tbl1.datekey as VARCHAR), 'YYYYMMDD')-  TO_DATE(cast(tbl1.firstdt as VARCHAR), 'YYYYMMDD') as age,
        count(*) as imps
from (
select b.firstdt, a.datekey, a.userhashkey, a.eventkey
from factevent a,
     lh_tmptbl b
where a.userhashkey = b.userhashkey and a.eventtypeid = 1 and a.uac = 1
and a.datekey <= 20130728
) tbl1
group by 1,2
;

drop table tmp3;
create table tmp3 as
select tbl1.firstdt,
count(distinct userhashkey) as uusers,
count(*) as imps
from (
select b.firstdt, a.datekey, a.userhashkey, a.eventkey
from factevent a,
     lh_tmptbl b
where a.userhashkey = b.userhashkey and a.eventtypeid = 1 and a.uac = 1
and a.datekey <= 20130728
) tbl1
group by 1;

# R for triangle output
library(RODBC)      # sqlQuery(); con is an existing ODBC connection
library(reshape2)   # dcast()
sql <- paste("select a.*, b.uusers, b.imps as totimp
             from tmp2 a,
                  tmp3 b
             where a.firstdt = b.firstdt
             order by a.firstdt, a.age;" , sep ='');
data <- sqlQuery(con, sql, as.is = T, max =0 )
data$age <- as.integer(data$age)
head(data)
x <- dcast(data, firstdt+uusers+totimp ~ age, value.var = "imps", fill = 0);
head(x)
write.csv(x, "totimp.csv")

#break down by weeks;
sql <- paste("select a.firstdt, b.uusers,
             (case when a.age = 0 then 1
                  when a.age >=1 and a.age <=7 then 2
                  when a.age >=8 and a.age <=30 then 3
                  when a.age >=31 then 4 end) as agegrp,
             sum(a.imps) as grpimp
             from tmp2 a,
                  tmp3 b
             where a.firstdt = b.firstdt
             and a.firstdt < 20130729
             group by 1,2,3
             order by 1,2,3;" , sep ='');
data <- sqlQuery(con, sql, as.is = T, max =0 )
head(data)
x <- dcast(data, firstdt+uusers ~ agegrp, value.var = "grpimp", fill = 0);
head(x)
write.csv(x, "totimp2.csv")

-- 30-30-30 rules
create table userfiltering as
select userhashkey as userid, sectionid, datekey
from database1 join database2 using(sectionkey)
where userhashkey % 1000 = 0 and userhashkey <> 0;

create table userfiltering2 as
select *,
row_number() over(partition by userid order by userid, datekey, sectionid) user_cumcnt,
row_number() over(partition by userid, sectionid order by userid, datekey, sectionid) user_fact_cumcnt,
row_number() over(partition by userid, datekey order by userid, datekey, sectionid) user_dt_cumcnt,
row_number() over(partition by userid, datekey, sectionid order by userid, datekey, sectionid) user_fact_dt_cumcnt
from userfiltering
order by userid, datekey, sectionid;
select min(datekey), max(datekey)
from userfiltering2;

--cutoff1=20111128, cutoff2=20111229, gaps=31;
create table userfiltering3 as
select userid,
min(datekey) as first_dt,
max(datekey) as last_dt,
to_date(max(datekey),'YYYYMMDD') - to_date(min(datekey),'YYYYMMDD')as gaps,
(case when (to_date(min(datekey),'YYYYMMDD') - to_date(20111128,'YYYYMMDD'))<=0  then 1
 when (to_date(min(datekey),'YYYYMMDD') - to_date(20111128,'YYYYMMDD'))>0 and (to_date(min(datekey),'YYYYMMDD') - to_date(20111128,'YYYYMMDD'))<=31 then 2
 when (to_date(min(datekey),'YYYYMMDD') - to_date(20111128,'YYYYMMDD'))>31  then 3 end) as index1,
(case when (to_date(max(datekey),'YYYYMMDD') - to_date(20111128,'YYYYMMDD'))<=0  then 1
when (to_date(max(datekey),'YYYYMMDD') - to_date(20111128,'YYYYMMDD'))>0 and (to_date(max(datekey),'YYYYMMDD') - to_date(20111128,'YYYYMMDD'))<=31 then 2
when (to_date(max(datekey),'YYYYMMDD') - to_date(20111128,'YYYYMMDD'))>31  then 3 end) as index2
from userfiltering2
group by userid
order by userid;

--before 30-30-30 rules;
select count(*),
sum(case when gaps>=0 then 1 else 0 end)/count(*) as duration0,
sum(case when gaps>=1 then 1 else 0 end)/count(*) as duration1,
sum(case when gaps>=3 then 1 else 0 end)/count(*) as duration3,
sum(case when gaps>=7 then 1 else 0 end)/count(*) as duration7,
sum(case when gaps>=14 then 1 else 0 end)/count(*) as duration14,
sum(case when gaps>=30 then 1 else 0 end)/count(*) as duration30,
sum(case when gaps>=60 then 1 else 0 end)/count(*) as duration60
from userfiltering3;

--apply 30-30-30 rules;
select count(*),
sum(case when gaps>=0 then 1 else 0 end)/count(*) as duration0,
sum(case when gaps>=1 then 1 else 0 end)/count(*) as duration1,
sum(case when gaps>=3 then 1 else 0 end)/count(*) as duration3,
sum(case when gaps>=7 then 1 else 0 end)/count(*) as duration7,
sum(case when gaps>=14 then 1 else 0 end)/count(*) as duration14,
sum(case when gaps>=30 then 1 else 0 end)/count(*) as duration30,
sum(case when gaps>=60 then 1 else 0 end)/count(*) as duration60
from userfiltering3
where index1=2 and index2=2;

--break down by first-dt cohort;
select
first_dt,
count(*),
sum(case when gaps>=0 then 1 else 0 end)/count(*) as duration0,
sum(case when gaps>=1 then 1 else 0 end)/count(*) as duration1,
sum(case when gaps>=3 then 1 else 0 end)/count(*) as duration3,
sum(case when gaps>=7 then 1 else 0 end)/count(*) as duration7,
sum(case when gaps>=14 then 1 else 0 end)/count(*) as duration14,
sum(case when gaps>=30 then 1 else 0 end)/count(*) as duration30,
sum(case when gaps>=60 then 1 else 0 end)/count(*) as duration60
from userfiltering3
where index1=2 and index2=2
group by 1
order by 1;

Monday, January 16, 2012

SQL to Self Join

-- Mail Solicitation;
-- create features;
drop table ANALYTICS_STG..CAMPAIGN_ADDRESS_3;
-- 2,038,831,646 rows affected
create table ANALYTICS_STG..CAMPAIGN_ADDRESS_3 as
select a.*,
row_number() over(partition by a.EMAIL_ADDRESS_ID order by a.EMAIL_ADDRESS_ID, a.SEND_TS) seq_freq,
extract(DAY from a.SEND_TS - LAG(a.SEND_TS, 1) OVER (partition by a.EMAIL_ADDRESS_ID ORDER BY a.EMAIL_ADDRESS_ID, a.SEND_TS)) as recency,
b.reach_freq as tot_freq
from ANALYTICS_STG..CAMPAIGN_ADDRESS_2 a
join (select EMAIL_ADDRESS_ID, count(*) as reach_freq from ANALYTICS_STG..CAMPAIGN_ADDRESS_2 group by 1) b
on a.EMAIL_ADDRESS_ID = b.EMAIL_ADDRESS_ID;


-- Mail Response;
-- Dedupe response;
create table ANALYTICS_STG..RESPONSE_ADDRESS_2 as
select a.*,
case when upper(a.FEEDBACK_EVENT_CD)='OPEN' then 1 else 0 end FEEDBACK_EVENT_CD_1,
case when upper(a.FEEDBACK_EVENT_CD)='CLICK' then 1 else 0 end FEEDBACK_EVENT_CD_2,
case when upper(a.FEEDBACK_EVENT_CD)='UNDEL' then 1 else 0 end FEEDBACK_EVENT_CD_3,
case when upper(a.FEEDBACK_EVENT_CD)='CONVERSION' then 1 else 0 end FEEDBACK_EVENT_CD_4,
case when upper(a.FEEDBACK_EVENT_CD)='UNSUB' then 1 else 0 end FEEDBACK_EVENT_CD_5,
case when upper(a.FEEDBACK_EVENT_CD)='SPAM' then 1 else 0 end FEEDBACK_EVENT_CD_6
from (
select *, row_number()
over (partition by CAMPAIGN_ID, ELECTRONIC_ADDRESS_ID, CONCEPT_ID, FEEDBACK_EVENT_CD, MAIL_TS
order by  CAMPAIGN_ID, ELECTRONIC_ADDRESS_ID, CONCEPT_ID, FEEDBACK_EVENT_CD, MAIL_TS, FEEDBACK_EVENT_TS
) row_id
from ANALYTICS_STG..RESPONSE_ADDRESS_1
) a
where a.row_id = 1;

-- create response features
create table ANALYTICS_STG..RESPONSE_ADDRESS_3 as
select a.*,
row_number() over (partition by a.ELECTRONIC_ADDRESS_ID order by a.ELECTRONIC_ADDRESS_ID, a.MAIL_TS, a.FEEDBACK_EVENT_TS) seq_engagement,
b.tot_engagement,
b.tot_open,
b.tot_click,
b.tot_unsub,
extract(DAY from (a.FEEDBACK_EVENT_TS - a.MAIL_TS)) as feedback_day
from ANALYTICS_STG..RESPONSE_ADDRESS_2 a
inner join (select ELECTRONIC_ADDRESS_ID,
count(*) as tot_engagement,
sum(FEEDBACK_EVENT_CD_1) as tot_open,
sum(FEEDBACK_EVENT_CD_2) as tot_click,
sum(FEEDBACK_EVENT_CD_5) as tot_unsub
from ANALYTICS_STG..RESPONSE_ADDRESS_2
where FEEDBACK_EVENT_CD_1 = 1 or FEEDBACK_EVENT_CD_2 = 1 or FEEDBACK_EVENT_CD_5 = 1
group by 1
) b
on a.ELECTRONIC_ADDRESS_ID = b.ELECTRONIC_ADDRESS_ID
where a.FEEDBACK_EVENT_CD_1 = 1 or a.FEEDBACK_EVENT_CD_2 = 1 or a.FEEDBACK_EVENT_CD_5 = 1;

-- Join solicitations and responses
create table ANALYTICS_STG..CAMPAIGN_ADDRESS_RESPONSE_1 as
select a.*,
b.FEEDBACK_EVENT_TS, 
b.FEEDBACK_EVENT_CD, 
b.FEEDBACK_EVENT_CD_1,
b.FEEDBACK_EVENT_CD_2,
b.FEEDBACK_EVENT_CD_5,
b.SEQ_ENGAGEMENT,
b.TOT_CLICK,
b.TOT_OPEN,
b.TOT_ENGAGEMENT,
b.TOT_UNSUB,
b.FEEDBACK_DAY
from  ANALYTICS_STG..CAMPAIGN_ADDRESS_3 a
inner join ANALYTICS_STG..RESPONSE_ADDRESS_3 b
on a.CAMPAIGN_ID = b.CAMPAIGN_ID
and a.CONCEPT_ID = b.CONCEPT_ID
and a.EMAIL_ADDRESS_ID = b.ELECTRONIC_ADDRESS_ID
and a.SEND_TS = b.MAIL_TS
;

-- Last attributions;
create table ANALYTICS_STG..CAMPAIGN_ADDRESS_RESPONSE_2 as
select a.*,
b.FEEDBACK_EVENT_TS, 
b.FEEDBACK_EVENT_CD, 
b.FEEDBACK_EVENT_CD_1, 
b.FEEDBACK_EVENT_CD_2, 
b.FEEDBACK_EVENT_CD_5, 
b.TOT_ENGAGEMENT, 
b.SEQ_ENGAGEMENT, 
b.TOT_CLICK, 
b.TOT_OPEN, 
b.TOT_UNSUB,
b.FEEDBACK_DAY,
SUM(case when b.FEEDBACK_EVENT_CD_1=1 then 1 else 0 end) OVER(partition by a.EMAIL_ADDRESS_ID ORDER BY a.EMAIL_ADDRESS_ID, a.SEQ_FREQ, b.SEQ_ENGAGEMENT rows unbounded preceding)/a.SEQ_FREQ CUM_OPEN_PERC,
SUM(case when b.FEEDBACK_EVENT_CD_2=1 then 1 else 0 end) OVER(partition by a.EMAIL_ADDRESS_ID ORDER BY a.EMAIL_ADDRESS_ID, a.SEQ_FREQ, b.SEQ_ENGAGEMENT rows unbounded preceding)/a.SEQ_FREQ CUM_CLICK_PERC,
SUM(case when b.FEEDBACK_EVENT_CD_5=1 then 1 else 0 end) OVER(partition by a.EMAIL_ADDRESS_ID ORDER BY a.EMAIL_ADDRESS_ID, a.SEQ_FREQ, b.SEQ_ENGAGEMENT rows unbounded preceding)/a.SEQ_FREQ CUM_UNSUB_PERC
from ANALYTICS_STG..CAMPAIGN_ADDRESS_3 a
left join ANALYTICS_STG..CAMPAIGN_ADDRESS_RESPONSE_1 b
on a.CAMPAIGN_ID=b.CAMPAIGN_ID
and a.CONCEPT_ID=b.CONCEPT_ID
and a.EMAIL_ADDRESS_ID = b.EMAIL_ADDRESS_ID
and a.SEQ_FREQ = b.SEQ_FREQ

;

SQL to Batch Load

library(RODBC)            # odbcConnect(), sqlQuery(), odbcClose()
dsn <- "*********"
sql.con <- odbcConnect(dsn)
sql.db <- "*********"
advertiserid <- "4009"

rtg.facts <- unlist(sqlQuery(sql.con, paste("select distinct t.FactId
      from ", sql.db, ".dbo.Segment s with (nolock)
                                            inner join ", sql.db, ".dbo.SegmentTaxonomy st with (nolock)
                                            on s.SegmentId = st.SegmentId
                                            inner join ", sql.db, ".dbo.Taxonomy t with (nolock)
                                            on st.TaxonomyId = t.TaxonomyId
                                            where s.AdvertiserId = ", advertiserid, sep = "")))
str <- paste(rtg.facts, collapse= ",")
pfacts <- sqlQuery(sql.con, paste("exec ", sql.db, ".dbo**** ", advertiserid, sep = ""))
odbcClose(sql.con)

#save(str, file="str.rda",ascii=FALSE)
load(file = "str.rda")
load(file = "pfacts.rda")

pfacts <- data.frame(factid = as.integer(pfacts[,1]))
dsn <- "********"
con <- odbcConnect(dsn)
#sqlSave(con, pfacts1, tablename="Emily", rownames=FALSE, colnames= FALSE, append=TRUE, varTypes=varTypes)
odbcClose(con)

write.table(coef, file = "ling_coef.csv", row.names = F, col.names = F, sep = ",")

system("cd .\\SVM");
system("winscp.bat");

sqlQuery(con, "drop table lh_coef")
sqlQuery(con, "CREATE TABLE lh_coef (
                    Distance double precision,
                    center double precision,
                    scale double precision,
                    FactId int distkey
                    )
                    sortkey(factid)")
sqlQuery(con, "copy lh_coef from \'/home/coef.csv\' with delimiter ','")

SQL to Merge Imps and Convs


select count(*) from lh_imps;
select * from lh_imps limit 10;
--drop table lh_imps;
create table lh_imps as
select a.userhashkey, a.datekey,  a.revenue,
       c.adid, d.sectionid,
       to_timestamp(a.datekey * 1000000.0 + a.timekey - 1000000, 'YYYYMMDDHHMISS') as impdatetime
from databasefactevent a
join databasedimcampaign b using(campaignkey)
join databasedimad c using(adkey)
join databasedimsection d using(sectionkey)
where a.eventtypeid = 1
and b.campaignid = 57464
and a.datekey between 20120901 and 20120930
order by a.userhashkey, impdatetime;


select count(*) from lh_imps2;
select * from lh_imps2 limit 10;
--drop table lh_imps2;
create table lh_imps2 as
select a.*,
       row_number() over(partition by userhashkey order by userhashkey, impdatetime, adid, sectionid) frequency,
       sum(revenue) over(partition by userhashkey order by userhashkey, impdatetime, adid, sectionid) cum_revenue
from lh_imps a;

select count(*) from lh_conv;
select * from lh_conv limit 10;
--drop table lh_conv;
create table lh_conv as
select a.eventkey, a.datekey as convdate, a.userhashkey, c.adid, d.sectionid,
       to_timestamp(a.datekey * 1000000.0 + a.timekey - 1000000, 'YYYYMMDDHHMISS') as convdatetime
from databasefactconversion a
join databasedimcampaign b using(campaignkey)
join databasedimad c using(adkey)
join databasedimsection d using(sectionkey)
where b.campaignid = 57464
and a.datekey between 20120901 and 20120930;

select count(*) from lh_conv2;
select * from lh_conv2 limit 10;
--drop table lh_conv2;
create table lh_conv2 as
select b.eventkey, b.userhashkey, b.adid, b.sectionid, b.convdate,  b.convdatetime,
       max(a.impdatetime) as lastImpTime,
       max(a.datekey) as lastImpdatekey,
       max(a.frequency) as lastImpfrequency
from lh_imps2 a
join lh_conv b
on a.userhashkey = b.userhashkey and a.adid = b.adid and a.sectionid = b.sectionid
where a.impdatetime < b.convdatetime
group by 1, 2, 3, 4, 5, 6;

select count(*) from lh_conv3;
select count(*) from lh_conv3 where converted = 1;
select * from lh_conv3 where converted = 1 limit 10;
--drop table lh_conv3;
create table lh_conv3 as
select a.*,
       (case when b.userhashkey is not null then 1 else 0 end ) converted,
       (case when b.userhashkey is not null then b.convdate - a.datekey  else -1 end ) days2conv
from lh_imps2 a
left join lh_conv2 b
on a.userhashkey = b.userhashkey and a.adid = b.adid and a.sectionid = b.sectionid and a.impdatetime = b.lastImpTime;

select userhashkey, adid, sectionid, impdatetime, count(*)
from lh_conv3
where converted = 1
group by 1,2,3,4
having count(*)>2;

select count(*) from lh_conv4;
select count(*) from lh_conv4 where converted = 1;
select * from lh_conv4 where converted = 1 limit 10;
select * from lh_conv4 where converted = 0 limit 10;
--drop table lh_conv4;
create table lh_conv4 as
select a.*,
       (case when b.userhashkey is not null then 1 else 0 end ) converted,
       (case when b.userhashkey is not null then b.convdate - a.datekey  else -1 end ) days2conv
from lh_imps2 a
left join (
select userhashkey, adid, sectionid, max(convdate) as convdate,
       max(convdatetime) as convdatetime,
       max(lastImpTime) as lastImpTime,
       max(lastImpdatekey) as lastImpdatekey
from lh_conv2
group by 1,2,3
) b
on a.userhashkey = b.userhashkey and a.adid = b.adid and a.sectionid = b.sectionid and a.impdatetime = b.lastImpTime;