Wednesday, July 31, 2013

Shell awk Scripting

Introduction
awk is a tool for processing and summarizing structured, record-oriented data; it is particularly suited for filtering, summarizing, and rearranging columns of text.

- cat fruits.txt | awk '{ print }' (write all lines to stdout)
- cat fruits.txt | awk '{ print $1}' (write 1 col of all lines to stdout)
- cat fruits.txt | awk '{ print $3, $1}' (write 3 col and then 1 col of all lines to stdout)
- cat fruits.txt | awk '/apple/{print}' (print all lines containing apple)
- cat fruits.txt | awk '/^a/{print}' (print all lines starting with a)
- cat fruits.txt | awk '! /apple/{print}' (print all lines not containing apple)
- cat fruits.txt | awk '$1 == "apple" {print}' (print all lines whose first field is apple)
- cat numbers.txt | awk '{var1 = $1; var2 = $2; avg = (var1+var2)/2; print avg}'
- cat numbers.txt | awk 'NR %2 == 0 {print}'

login3 15:50:31 ~ $ cat vartot.awk
BEGIN {
 var1 = 0
 var2 = 0
 var3 = 0
}
{
 var1 += $1
 var2 += $2
 var3 += $3
}
END {
 print var1, var2, var3
}
- cat numbers.txt | awk -f vartot.awk (sum each of the three columns and print the totals)

Patterns and Actions
Patterns can be any expression that returns a value, including regexes and ranges of the form expr1, expr2.
awk maintains built-in variables, such as the record and field separators, whose values awk sets dynamically.
awk interprets each pattern-action pair in turn.
NF: number of fields in the current record, split on whitespace by default.
NR: number of the current record (the line number by default).
RS: the record separator (newline by default)
FS: the field separator (whitespace by default)
- cat numbers.txt | awk 'NR %2 == 0 {print}' (print even-numbered lines)
- cat numbers.txt | awk '/^10/, NR %2 == 0 {print}' (range pattern: print from a line matching /^10/ through the next even-numbered line)
- cat numbers.txt | awk 'BEGIN{ FS = ":"} {print $1}' (print the first column, given the field separator :)
- cat numbers.txt | awk 'BEGIN{ FS = ":"} NR %2 == 0 {print $1}' (print the first column of even-numbered lines, given the field separator :)
- cat numbers.txt | awk 'BEGIN{ RS = "\n\nFrom: " ; FS = "\n"} {print $1}' (print the first field, given the specified field separator and record separator)
- cat months.txt | awk 'BEGIN{ tot = 0} { tot = tot + $2} NR%12 == 0 {print tot/12; tot = 0}' (average of the second column over each block of 12 lines)
- echo $((502/12)) (integer division truncates to an integer)
- cat months.txt | awk '{ tot = tot + $2} NR%12 == 0 {print tot/12; tot = 0}' (same average; awk initializes tot to 0 automatically, so the BEGIN block can be dropped)
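NF from the variable list above never appears in these examples; a minimal sketch with inline sample data (printf stands in for a real input file):

```shell
# Print the record number and field count for each line of sample data.
printf 'alpha beta\ngamma delta epsilon\n' | awk '{ print NR, NF }'
# $NF refers to the last field of each record.
printf 'alpha beta\ngamma delta epsilon\n' | awk '{ print $NF }'
```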
- cat annual.awk
{
tot = tot + $2
}
NR%12 == 0 {
print tot/12;
tot = 0
}
- cat months.txt | awk -f annual.awk (monthly avg of second col; able to drop initialization of tot)

- match(string, regexp, array) (the three-argument form is a gawk extension that stores capture groups in array)
- gsub(regexp, replacement, target) (replace every match in target, which defaults to $0)
...
match($1, /:(.*),(.*)/, resultmatch)
gsub(/ +/, " ", resultmatch[1]) (squeeze runs of spaces into one)
gsub(/ +/, " ", resultmatch[2]) (squeeze runs of spaces into one)
if (tolower(resultmatch[1]) ~ result) (match the regex in result against the lower-cased first capture)
{
  print $1;
}
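The fragment above relies on the three-argument match(), which only gawk provides. A self-contained variant using plain awk's RSTART/RLENGTH instead (the sample input line is invented for illustration):

```shell
# Pull out the text after the colon with match()/substr(), squeeze runs
# of spaces with gsub(), and test it case-insensitively with tolower().
echo 'name:  John   Smith' | awk '{
    if (match($0, /:.*/)) {
        s = substr($0, RSTART + 1, RLENGTH - 1)  # text after the colon
        gsub(/ +/, " ", s)                       # squeeze spaces into one
        sub(/^ /, "", s)                         # drop the leading space
        if (tolower(s) ~ /john/) print s
    }
}'
# John Smith
```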

Tuesday, July 30, 2013

Shell sed Scripting

Unix commands such as uniq, sort, and nl read from standard input, transform the text, and write to standard output. sed and awk also transform text: like any filter they can be part of a pipeline, they are programmable, and each has its own scripting language for specifying transformation rules.

sed supports a number of operations.
The s command (short for substitute) is the most commonly used command;
the d command deletes the lines its address selects (for example, empty lines).

Regular expressions are used to identify patterns, similar to glob expressions. While globs match pathnames, regexes match any general text data. For example, [a-zA-Z0-9] matches letters and digits, and the meta-characters include * . + ? $ | ():
* matches zero or more occurrences of the preceding expression,
+ matches one or more occurrences of the preceding expression,
^ anchors the match to the beginning of the record,
$ anchors the match to the end of the record.
capture groups use \( and \), and back-references use \1.
\b is a word boundary,
\w is a single word character,
\b\w\+\b matches a word.
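Capture groups and back-references can be exercised with sed one-liners (the sample strings are invented for the demo):

```shell
# \( \) captures "day"; \1 re-inserts it in the replacement.
echo 'daytime' | sed 's/\(day\)time/\1light/'     # daylight
# A back-reference inside the pattern matches a literal repeat,
# so "hot hot" collapses to a single "hot".
echo 'hot hot soup' | sed 's/\(hot\) \1/\1/'      # hot soup
```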

sed 's/a/b/' - s:substitution, a:search string, b:replacement string
Syntax is s/REGEXP/REPLACEMENT/FLAGS.
Examples
echo 'upstream' | sed 's/up/down/' (replace up with down)
echo 'upstream and upward' | sed 's/up/down/' (replace once)
echo 'upstream and upward' | sed 's/up/down/g' (replace globally)
echo 'upstream and upward' | sed 's|up|down|g' (the delimiter after s can be almost any character, such as | or :)
echo 'Mac OS X/Unix: awesome.' | sed 's|Mac OS X/Unix:|sed is |' (an alternate delimiter avoids escaping the forward slash with a backslash)
sed 's/apple/mango/' fruits.txt (replace apple with mango first one each line)
sed 's/apple/mango/g' fruits.txt (replace apple with mango globally)
echo 'During daytime we have sunlight.' | sed 's/day/night/'
echo 'During daytime we have sunlight.' | sed -e 's/day/night/' -e 's/sun/moon/' (add second command by -e, e stands for edits)
echo 'who needs vowels?' |sed 's/[aeiou]/_/g' (substitute aeiou using _)
echo 'who needs vowels?' |sed -E 's/[aeiou]+/_/g' (use extended features)
sed 's/^a/  A/g' fruits.txt (capitalize and indent lines starting with a)
sed 's/^/>  /g' fruits.txt (indent everything)
sed 's/^/>(ctrl+v+Tab)/g' fruits.txt (indent using Tab)
sed -E 's/<[^<>]+>//g' homepage.html (take out html tags)
echo 'daytime' | sed 's/\(...\)time/daylight/' (change ...time to daylight; note the escaped \( \))
echo 'daytime' | sed -E 's/(...)time/\1light/' (extended syntax; \1 reuses the captured word, giving daylight)
echo 'Dan Stevens' | sed -E 's/([A-Za-z]+) ([A-Za-z]+)/\2, \1/'(change the order of the first and second words)

- cat fruits.txt | sed 's/Apple/apple/' (replace Apple with apple)
- cat fruits.txt | sed 's/Apple/apple/i' (flag to specify case insensitive using /i)
- cat fruits.txt | sed 's/\(A\)pple/apple/' (\(A\) captures the A; only the first match on each line is replaced)
- cat fruits.txt | sed 's/\(A\)pple/apple/g' (globally occurrence, match more than one match)
- cat fruits.txt | sed 's/\(A\)pple/\1pple/g' (globally occurrence, match using back-reference \1)
- cat fruits.txt | sed 's/\b\(A\)pple\b/\1pple/g' (find word boundary)
- cat fruits.txt | sed 's/\(\b\w\+\b\) \1/\1/g' (find the repeated word and keep the first one)
- cat fruits.txt | sed -e '9s/\(A\)pple/apple/g' (execute the s command for only line 9)
- cat fruits.txt | sed -e '9!s/\(A\)pple/apple/g' (execute the s command on every line except line 9)
- cat fruits.txt | sed 's/^apple/xx/g' (replace for the beginning of the streams)
- cat fruits.txt | sed 's/apple$/xx/g' (replace for the end of the streams)
- cat fruits.txt | sed 's/^$/xx/g' (replace for empty lines)
- cat fruits.txt | sed 's/^ *$/xx/g' (* matches zero or more of the preceding character, unlike + which matches one or more)
- cat fruits.txt | sed -E 's/apple|Apple/xx/g' (| means regex1 or regex2; alternation needs extended syntax)
- cat fruits.txt | sed -e 's/apple/xx/g' -e 's/Apple/xx/g' (apply the substitutions one after another)
- cat fruits.txt | sed 'd' (delete the lines)
- sed -e 's/^-1/0 |f/' |sed -e 's/^+1/1 |f/' (Converts the labels from "-1" to "0" and from "+1" to "1" and put the features into a namespace called "f")
- sed -e 's/$/ const:.01/' (Adds a constant feature called "const" with value ".01")
- sed -i '18164755d' data.reduced (delete line 18164755 in place)
- sed -n '1,1000p' (export first 1000 lines)

Tuesday, July 23, 2013

Basic Bash Shell Scripting


Bash shell is a programming environment in which commands and control structures are combined. Scripts conventionally end with a .sh extension.
Make it executable: chmod u+x myscript.sh
The first line of the script must be #!/bin/bash
login3 09:39:50 ~ $ cat myscript.sh
#!/bin/bash
echo "Listing of current directory"
ls
echo "Listing of root directory"
cd /
ls
Run myscript.sh: ./myscript.sh

Pipes
Connect stdout and stdin using “|” : command 1 | command 2
sort command & uniq command
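A minimal sketch of the sort | uniq pipeline, with printf supplying inline data in place of a file:

```shell
# uniq only drops *adjacent* duplicates, so sort first to group them.
printf 'pear\napple\npear\n' | sort | uniq
# apple
# pear
```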

Redirection
Use the shell redirect operator “>”
command > file.txt
login3 10:01:15 ~ $ cat uniquefruits.sh
#!/bin/bash
sort fruits.txt | uniq | nl > uniquefruits.txt

Command Line Arguments
$1 has value of first command line parameter
$2 the second command line parameter
$0 the command name itself
 sort “$1” | uniq | nl > uniquefruits.txt
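A throwaway script (hypothetical name showargs.sh, written to /tmp for the demo) makes the positional parameters visible:

```shell
# Create and run a script that echoes its own name and arguments.
cat > /tmp/showargs.sh <<'EOF'
#!/bin/bash
echo "command: $0"
echo "first:   $1"
echo "second:  $2"
echo "count:   $#"
EOF
chmod u+x /tmp/showargs.sh
/tmp/showargs.sh apple banana
```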

Exit Status
An integer indicates success or failure: 0 is success, and non-zero indicates failure.
grep command (grep [OPTIONS] PATTERN [FILE…])
With PATTERN as a literal word, grep prints all lines that match the word.
grep apple fruits.txt (matches including substrings)
grep -w apple fruits.txt (exactly matched)
Suppress output by redirecting stdout of grep to /dev/null.
     login3 10:39:58 ~ $ cat isscript.sh
#!/bin/bash
if grep "/bin/bash" "$1" > /dev/null
then
  echo "File is a shell script"
elif grep "/bin/python" "$1" > /dev/null
then
  echo "File is a python script"
else
  echo "File is not a shell or python script"
fi
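The exit status the script above branches on is also available directly in the special variable $?; a quick sketch:

```shell
# grep exits 0 when it finds a match and 1 when it does not.
echo 'apple' | grep -q apple ; echo "match: $?"      # match: 0
echo 'apple' | grep -q mango ; echo "no match: $?"   # no match: 1
```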

Conditions
Use test or [ ] to examine conditions
= tests string equality; != tests inequality; < and > compare lexicographically
-eq -ne -gt -lt compare arithmetically
Combine conditions with && (and) and || (or)
- True if file/directory exists
if [ -e "$1" ]
- True if file/directory does not exist
if [ ! -e "$1" ]
- True if no argument is passed
if [ "$1" = "" ] (note the spaces around =; [ -z "$1" ] is equivalent)
login3 07:48:32 ~ $ cat isscript.sh
#!/bin/bash
if [ -e "$1" ]
then
  echo "File exists"
fi
a=1
b=1
if [ $a -eq 1 ] && [ $b -eq 1 ]
then
 echo "var a and var b are ones"
else
 echo "one of var or var b is not one"
fi
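The difference between string and arithmetic comparison is easy to see with leading zeros; a short sketch:

```shell
# "=" compares the characters, -eq compares the numeric values:
# "01" and "1" are different strings but the same number.
if [ "01" = "1" ]; then echo "strings equal"; else echo "strings differ"; fi
if [ "01" -eq 1 ]; then echo "numbers equal"; fi
```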

Looping: while
The shift command shifts the values of all command-line argument variables left: $2 becomes $1, and so on.
login3 08:04:16 ~ $ cat uniquefruits.sh
#!/bin/bash
while [ -n "$1" ] # -n: true if $1 is a non-empty string; -z tests for empty
do
 sort "$1" | uniq
 shift
done

Looping: for
$@ expands to all command line arguments
${} delimits a variable name inside a larger string
 login3 08:23:30 ~ $ cat uniquefruits.sh
#!/bin/bash

for filename in "$@"
do
 sort "$filename" | uniq > "${filename}.output1"
done
while [ -n "$1" ]
do
 sort "$1" | uniq  > "${1}.output2"
 shift
done

Glob Expressions
*  ?   [] are meta-characters.
* matches any run of characters. ? matches one character. [] matches any one character in the list.
*txt matches any name ending in txt
r??l matches r, any two characters, then l
r[!0-9]l matches r, any one character that is not a digit 0-9, then l
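The three glob forms above can be sketched in a scratch directory (the file names are invented for the demo):

```shell
# Create sample files, then let the shell expand each pattern.
mkdir -p /tmp/globdemo && cd /tmp/globdemo
touch real.txt reel.txt r9l notes.txt
echo *.txt       # notes.txt real.txt reel.txt
echo r??l.txt    # real.txt reel.txt
echo r[!0-9]l    # no match: r9l contains a digit, so the pattern prints as-is
```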

Command Substitution
Command substitution - $(command), $(seq 10)
Arithmetic expansion - $((Expression)), ivar=10, echo $((ivar+10)).
grep subroutine *.f  > subs.txt
grep apple *.txt  > apple.txt
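Combining the two expansions in one pipeline; seq and wc are standard tools, and the numbers are arbitrary:

```shell
# $(command) captures output; $((expression)) evaluates integer math.
n=$(seq 3 | wc -l)        # n holds the line count, 3
echo $((n * 10 + 2))      # prints 32
```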
- the read builtin reads a single line from stdin and assigns it to a variable.
read iword
two words
echo $iword
-tr command
Translate or delete characters
login3 10:39:46 ~ $ echo $iword
one two three
login3 10:39:58 ~ $ echo "Unix is simple" | tr "i" "a"
Unax as sample
login3 10:42:52 ~ $ echo "Unix is simple" | tr "is" "aT"
Unax aT Tample
login3 10:43:05 ~ $ echo "Unix is simple" | tr "a-z" "A-Z"
UNIX IS SIMPLE
login3 10:43:24 ~ $ echo "Unix is simple" | tr -d "i"
Unx s smple
login3 10:43:38 ~ $ echo "Unix is simple" | tr -d "is" #remove i and s
Unx  mple

Bash shell example
login3 11:30:38 ~ $ cat myscript.sh
#!/bin/bash
searchword=$1
searchword="$( echo $searchword | tr "A-Z" "a-z" )"
count=0
while read inputline
do
  inputline="$( echo $inputline | tr -d ".,()'\"" )"
  inputline="$( echo $inputline | tr "A-Z" "a-z" )"
  for word in $inputline
  do
     if [ $word = $searchword ]
     then
         count=$(($count + 1))
     fi
  done
done
echo $count
login3 11:30:52 ~ $ cat fruits.txt | ./myscript.sh Apple
16
5

2 File System basic
# / - root
# /bin - Binaries, programs
# /sbin - System binaries, system programs
# /dev - Devices: hard drives, keyboard, mouse, etc.
# /etc - system configurations
# /home - user home directories
# /lib - libraries of code
# /tmp - temporary files
# /var - various, mostly files the system uses
# /usr - user programs, tools and libraries (not files)
# /usr/bin
# /usr/etc
# /usr/lib
# /usr/local

login3 10:42:38 ~ $ pwd
/homes/lingh

# parent's parent's directory
cd ../..

# user directory
cd ~

# toggle directory

cd -

3 Working with Files and Directories
cd "Application Support"
cd Application\ Support

touch fruits.txt
vi fruits.txt

#small files
cat fruits.txt

#large files
less -N fruits.txt
less -M fruits.txt

#sample lines from files
head -1 fruits.txt
tail -1 fruits.txt
tail -f /var/log/apache2/error_log

#create directory
mkdir -p testdir/test1/test2
mkdir -vp testdir/test1/test3

#move files and directory
mv newfile.txt ../newfile.txt
mv newfile.txt ..
mv -nv overwrite1.txt overwrite2.txt (not overwriting)
mv -fv overwrite1.txt overwrite2.txt (overwriting)
mv -i overwrite1.txt overwrite2.txt (interactive overwriting)

#copy files and directories
cp newfile.txt newerfile.txt
cp -nv newfile.txt newerfile.txt (not overwriting)
cp -fv newfile.txt newerfile.txt (overwriting)
cp -i newfile.txt newerfile.txt (interactive overwriting)
cp -R test1 test1_copy_dir

#delete files and directories
rm somefile.txt
rmdir somedir
rm -R somedir (delete files and directories recursively)

#finder aliases/links and symbolic links
ln fruits.txt hardlink (ln filetolink hardlink - a second directory entry pointing at the same file data)
ln -s fruits.txt symlink (ln -s filetolink symlink - references a file or directory path)
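The difference shows up when the original name is removed; a sketch in /tmp with an invented file name:

```shell
# A hard link is a second name for the same inode; a symlink stores a path.
cd /tmp && echo hello > linkdemo.txt
ln -f linkdemo.txt hardlink_demo
ln -sf linkdemo.txt symlink_demo
rm linkdemo.txt
cat hardlink_demo                     # hello: the data is still reachable
cat symlink_demo || echo 'dangling'   # the symlink's target path is gone
```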

#search for files and directories
# (*) is zero or more characters(glob)
# (?) is any one character
# ([]) is any character in the brackets
find ~/Documents -name 'someimage.jpg' (find path expression)
find ~/Sites -name 'index.html'
find ~/Sites -name 'index.???'
find ~/Sites -name 'index.*'
find ~ -name *.plist
find ~ -name *.plist -and -not -path *QuickTime
find ~ -name *.plist -and -not -path *QuickTime -and -not -path *Preference
find /homes/lingh/ -name '*.*'

4 Ownership and Permissions

#who am I
whoami
cd ~

#Unix group
groups
chown lingh:users fruits.txt
chown -R lingh:users fruits (change all files ownership in a dir)
sudo chown user1:users ownership.txt

# group categories (user, group, other)
# permission read(r-4), write(w-2), execute(x-1)
# user(rwx), group(rw-), other(r--)
chmod mode filename
chmod ugo=rwx filename
chmod u=rwx,g=rw,o=r filename
chmod ug+w filename
chmod o-w filename
chmod ugo+rw filename
chmod a+rw filename
chmod -R g+w testdir

chmod 777 filename (all permissions)
chmod 764 filename (rwx+rw+r)
chmod 755 filename (rwx+rx+rx)
chmod 700 filename (rwx+ + )
chmod 000 filename (no permission)

# root user
sudo ls -lla
sudo chown lingh file.txt

5 Commands and Programs

#show command path
whereis echo
which echo
whatis echo
echo $PATH

#computer system set up
date
uptime
#logged-in users (users lists names; who lists sessions)
users
who
#system running on info
uname
uname -mnrsvp
hostname
domainname

#disk free space
df
df -h
df -H
#disk usage - allocation of hard disk
du
du ~
du -h ~/ (only directory)
du -ah ~/ (all files and directories)
du -hd 1 ~/ (descend only one directory level)
du -hd 0 ~/ (total for the current directory only)

#viewing processes
ps (process status)
ps -a
ps aux (a: all processes, u: column showing the process user, x: show the background processes)
ps aux | grep lingh
top
top -n 10 (top 10 processes)
top -n 10 -o cpu -s 3 -U lingh (top 10 processes of lingh, sorted by CPU, refreshed every 3 seconds)

#Stopping processes
kill pid
kill -9 pid (force to kill the process id)

#Text File Helpers
wc (word count)
login3 13:13:28 ~ $ wc fruits.txt
      25      22     156 fruits.txt (25 lines, 22 words, 156 characters)
sort (sort lines)
sort fruits.txt
sort -f fruits.txt (fold case: sort uppercase and lowercase letters together)
sort -r fruits.txt (reverse sort)
sort -u fruits.txt (sorted and unique)
uniq (filter in/out repeated lines)
uniq fruits.txt
uniq -d fruits.txt (print only the duplicated lines)
uniq -u fruits.txt (print only the lines that are not duplicated)

#Utility programs
cal/ncal (calendar)
cal
cal 12 2013
cal -y 2014
bc (arbitrary-precision calculator)
scale=100
100/9
quit
expr (expression evaluator)
expr 1 + 1
expr 1122 \* 3344
units (unit conversion)
login3 13:32:19 ~ $ units
586 units, 56 prefixes
You have: 1 foot
You want: meters
        * 0.3048
        / 3.2808399

#Command history
#!3 - references history command #3
#!-2 - references the command which was 2 commands back
#!cat - references the most recent command beginning with "cat"
#!! - references the previous command
#!$ - references the previous command's arguments
cat .bash_history
history
!1 (run 1st line of command in the history)
!-5 (run a command that was five commands ago)
!expr (most recent command that began with expr)
!! (edit lines and re-execute)
chown lingh fruit.txt
sudo !! (that is, sudo chown lingh fruit.txt)
cat fruits.txt
less !$ (run the command with the arguments from the previous command)
history -d 10 (delete the 10th line of history)
history -c (clean all history)

6 Directing Input and Output

#Standard input -stdin, keyboard, /dev/stdin
#standard output -stdout, text terminal /dev/stdout
sort fruits.txt > sortedfruits.txt
uniq sortedfruits.txt > uniquefruits.txt
cat apple.txt apple2.txt > applecat.txt
echo 'fruits.txt' >> apple.txt (appended to the file)
#Directing input form a file
sort < fruits.txt
sort < fruits.txt > sortedfruits.txt
#Piping output to input
echo "HELLO WORLD" | wc
echo "(3+4)*9" | bc
cat fruits.txt | sort
cat fruits.txt | sort | uniq
sort < fruits.txt | uniq  > uniquefruits.txt
ps aux | less
#Suppressing output
#/dev/null - 'null device', 'bit bucket', 'black hole'
#similar to special files /dev/stdin and /dev/stdout, unix discards any data sent there
ls -lah > /dev/null

7 Configuring Your Working Environment

When you login (type username and password) via Terminal, either sitting at the machine, or remotely via ssh, .bash_profile is executed to configure your shell before the initial command prompt.

If you've already logged into your machine and open a new terminal window (xterm) inside a client other than Mac OSX Terminal, then .bashrc is executed before the window command prompt.

#Upon login to a bash shell - This only runs on user login
#/etc/profile - system configurations with master default commands
#~/.bash_profile, ~/.bash_login, ~/.profile, ~/.login
#Add to ~/.bash_profile and then put all shell configuration in ~/.bashrc
if [ -f ~/.bashrc ]; then
source ~/.bashrc
fi

#Upon starting a new bash subshell - This loads in the configuration .bashrc, put all configuration there.
#~/.bashrc (typein shell and open sub-shell using other resources)
source .bashrc (run .bashrc file)

#Upon logging out of bash shell
#~/.bash_logout

#Setting command aliases
alias
alias lah='ls -lah' (define a new alias)
unalias lah (delete an alias)

#Setting environment variables
USER=lingh (define a variable in bash)
export USER (every time Unix launches a program, it will make the variable available)

#Setting PATH variables
#PATH is a colon delimited list of file paths that Unix uses when it's trying to locate a command that you want it to run.
export PATH=/usr/local/bin:/usr/local/sbin:$PATH

#Configuring history with variables
export HISTSIZE=1000
export HISTFILESIZE=1000000
export HISTTIMEFORMAT='%b %d %I:%M %p       '
export HISTCONTROL=ignoreboth (ignore dups and ignore space)
export HISTIGNORE="history:pwd:ls -lah:exit" (ignore commands in the history)

#Customizing the command prompt with format codes
#\u - username
#\s - current shell
#\w - current working directory
#\W - basename of current working directory
#\d - date in 'weekday month date' format
#\D{format} - date in strftime format {"%Y-%m-%d"}
#\A time in 24-hour HH:MM format
#\t time in 24-hour HH:MM:SS format
#\@ time in 12-hour HH:MM am/pm format
#\T time in 12-hour HH:MM:SS format
#\H hostname
#\h hostname up to first "."
#\! history number of this command
#\$ when UID is 0 (root), a "#", otherwise a "$"
#\\ a literal backslash
PS1='--> ' (change the main command prompt)
PS1='\h:\W \u\$ '

8 Unix Power Tools

# grep - global regular expression print
grep apple fruit.txt (return the line containing apple in fruit.txt)
grep -i apple fruit.txt (case insensitive)
grep -w apple fruit.txt (find only word apple)
grep -v apple fruit.txt (reverse match, that is, find lines don't matched)
grep -n apple fruit.txt (find the matched lines with line numbers)
grep -c apple fruit.txt (find counts of the matched lines)
grep -R apple . (find word apple in all files under the current directory)
grep -Rn apple . (find word apple in all files with line number under the current directory)
grep -Rl apple . (list only the names of files containing apple under the current directory)
grep -RL apple . (list only the names of files that did not match under the current directory)
grep apple fruits.* (find word apple in selected files)
ps aux | grep lingh
history | grep ls
history | grep pig | less
grep --color apple fruit.txt
export GREP_COLOR="34;47"
export GREP_OPTIONS="--color=auto"
grep --color=auto apple fruit.txt

# regular expression - basics
.  (wild card, any one character except line breaks, gre.t)
[] (character set, any one character listed inside [], gr[ea]y)
[^] (negative character set, any one character not listed inside [], [^aeiou])
- (range indicator, when inside a character set, [A-Za-z0-9])
* (Preceding element can occur zero or more times, files_*name)
+ (Preceding element can occur one or more times, gro+ve)
? (Preceding element can occur zero or one time, colou?r)
| (Alernation, OR operator, jpg|gif|png)
^ (start of line anchor, ^Hello)
$ (End of line anchor, World$)
\ (Escape the next character (\+ is literal, + character), image\.jpg)
\d (any digit, 20\d\d-06-09)
\D (Anything not a digit ^\D+)
\w (any word character, alphanumeric + underscore, \w+_export\.sql)
\W (anything not a word character, \w+\W\w+)
\s (Whitespace, space, tab, line break, \w+\s\w+)
\S (Anything not whitespace, \S+\s\S+)
#regular expression character classes
[:alpha:] (alphabetic characters)
[:digit:] (numeric characters)
[:alnum:] (alphanumeric characters)
[:lower:] (lower-case alphabetic characters)
[:upper:] (upper-case alphabetic characters)
[:punct:] (punctuation characters)
[:space:] (space characters: space, tab, new line)
[:blank:] (whitespace character)
[:print:] (printable characters, including space)
[:graph:] (printable characters, not including space)
[:cntrl:] (control characters, not printing)
[:xdigit:] (Hexadecimal characters 0-9, A-F, a-f)
grep 'apple' fruits.txt
grep 'a..le' fruits.txt (. is wildcard character for regular expression)
grep '.a.a' fruits.txt (return word by any character followed by 'a' and then any character followed by 'a')
grep 'ea[cp]' fruits.txt (return word by any character contain ea and followed by c or p)
grep '^a' fruits.txt (return word start of a)
grep 'le$' fruits.txt (return word end of le)
echo 'berry bush berry' | grep --color 'berry$'
echo 'ABcDdefg' | grep --color [:upper:]
echo 'ABcDdefg' | grep --color [[:upper:]] (look up character set make up of character class)
echo 'AB,Dd:fg' | grep --color [[:punct:]] (look up character set made up of the punctuation class)
grep 'ap*le' fruits.txt (in a regex, * applies to the preceding character: a, zero or more p, then le)
grep 'ap+le' fruits.txt (in basic regex, + is a literal + sign)
grep -E 'ap+le' fruits.txt (extended set: + means one or more p)

#tr - translate characters
echo 'a,b,c' | tr ',' '-' (translate , to -)
echo '1435478956780' | tr '123456789' 'ABCDEFGHI' (position matched)
echo 'This is ROT-13 encrypted.' | tr 'A-Za-z' 'N-ZA-Mn-za-m'
echo 'Guvf vf EBG-13 rapelcgrq.' | tr 'N-ZA-Mn-za-m' 'A-Za-z'
echo 'already daytime' | tr 'day' 'night' (character-for-character: d→n, a→i, y→g, not a word replacement)
tr 'A-Z' 'a-z' < fruits.txt (change from uppercase to lowercase)
tr '[:upper:]' '[:lower:]' < fruits.txt
tr ' ' '\t' < fruits.txt
#tr: deleting and squeezing characters
# -d delete characters in listed set
# -s squeeze repeats in listed set
# -c use complementary set
# -dc delete characters not in listed set
# -sc squeeze characters not in listed set
echo 'abc123deee567f' | tr -d [:digit:] (delete digits)
echo 'abc123deee567f' | tr -dc [:digit:] (delete everythings except for digits)
echo 'abc12333333deee567f' | tr -s [:digit:] (squeeze digits)
echo 'abc123deee567f' | tr -sc [:digit:] (squeeze everything not digits)
echo 'abc123deee567f' | tr -ds [:digit:] [:alpha:] (delete digits and then squeeze letters)
echo 'abc123deee567f' | tr -dsc [:digit:] [:digit:] (-c applies only to the delete set)
tr -dc [:print:] < file1 > file2 (remove non-printable characters from file 1)
tr -d '\015\032' < windows_file > unix_file (remove surplus carriage return and DOS end-of-file characters)
tr -s ' ' < file1 > file2 (remove double spaces from file1)

#sed - Stream Editor
sed 's/a/b/' - s:substitution, a:search string, b:replacement string
echo 'upstream' | sed 's/up/down/' (replace up with down)
echo 'upstream and upward' | sed 's/up/down/' (replace once)
echo 'upstream and upward' | sed 's/up/down/g' (replace globally)
echo 'upstream and upward' | sed 's|up|down|g' (the delimiter after s can be almost any character, such as | or :)
echo 'Mac OS X/Unix: awesome.' | sed 's|Mac OS X/Unix:|sed is |' (an alternate delimiter avoids escaping the forward slash with a backslash)
sed 's/apple/mango/' fruits.txt (replace apple with mango first one each line, each line was treated as stream)
sed 's/apple/mango/g' fruits.txt (replace apple with mango globally)
echo 'During daytime we have sunlight.' | sed 's/day/night/'
echo 'During daytime we have sunlight.' | sed -e 's/day/night/' -e 's/sun/moon/' (add second command by -e, e stands for edits)
echo 'who needs vowels?' |sed 's/[aeiou]/_/g' (substitute aeiou using _)
echo 'who needs vowels?' |sed -E 's/[aeiou]+/_/g' (use extended features)
sed 's/^a/  A/g' fruits.txt (indent apple)
sed 's/^/>  /g' fruits.txt (indent everything)
sed 's/^/>(ctrl+v+Tab)/g' fruits.txt (indent using Tab)
sed -E 's/<[^<>]+>//g' homepage.html (take out html tags)
echo 'daytime' | sed 's/\(...\)time/daylight/' (change ...time to daylight; note the escaped \( \))
echo 'daytime' | sed -E 's/(...)time/\1light/' (extended syntax; \1 reuses the captured word, giving daylight)
echo 'Dan Stevens' | sed -E 's/([A-Za-z]+) ([A-Za-z]+)/\2, \1/'(change the order of the first and second words)

#cut: Cutting select text portions
#(-c characters, -b bytes, -f fields)
ls -lah | cut -c 2-10 (grab characters from 2 to 10)
echo '     4 lingh  users   '|wc
ls -lah | cut -c 2-10,32-37,53
history | grep 'fruit' | cut -c 10-
ps aux | grep 'lingh' | cut -c 66-
ps aux | grep 'lingh' | cut -f 1 (grab field 1)
cut -f 2,6 -d "," us_presidents.csv (grab fields 2, 6 by changing the delimiter as ,)

#diff: Comparing files
diff fruits.txt.output1 fruits.txt.output2 (original txt file typically first, c indicates change, a indicates append)
# -i (case insensitive)
# -b (ignore changes to blank characters)
# -w (ignore all whitespace)
# -B (ignore blank lines)
# -r (recursively compare directories)
# -s (show identical files)
# -c (copied context)
# -u (unified context)
# -y (side-by-side)
# -q (only whether files differ)

diff -c fruits.txt.output1 fruits.txt (show context difference, + added, ! changed, - deleted)
diff -y fruits.txt.output1 fruits.txt (side by side)
diff -u fruits.txt.output1 fruits.txt (unified comparison)
diff -q fruits.txt.output1 fruits.txt

#xargs: passing argument lists to commands
wc fruits.txt
echo 'fruits.txt' | wc (counts the text of the name, not the file)
echo 'fruits.txt' | xargs -t wc (run and show running commands)
echo 'fruits.txt apple.txt' | xargs -t wc (run and show running commands)
echo 'fruits.txt apple.txt' | xargs -t -n1 wc (loop running commands with one arguments each time)
echo 1 2 3 4 | xargs -t -n2
head fruits.txt | xargs -L 2 (combine every two input lines into one invocation)
head fruits.txt | xargs -n 2 (pass at most two arguments per invocation)
cat fruits.txt | xargs -I {} echo "buy more: {}" ({} indicates positions)
cat fruits.txt | xargs -I :FRUIT: echo "buy more: :FRUIT:" (define :FRUIT:)
cat fruits.txt | grep 'apple.*' | xargs -0 -n1
cat fruits.txt | sort | uniq | xargs -I {} mkdir ~lingh/{}
ps aux | grep 'lingh' | cut -c 10-15 | xargs kill -9 (kill all processes)
grep -l 'apple' *fruits.txt | xargs wc
grep -l 'apple' *fruits.txt | xargs wc
find ~lingh/ -type f -print0 | xargs -0 chmod 755 (-print0 makes sure the null character is used to separate names; -0 makes xargs split on null characters)
find ~lingh/ "*fruits.txt" -print0 | xargs -0 -I {} cp {}  ~lingh/{}.backup
find  ~lingh/ -name "*.backup" -print0 | xargs -p -0 -n1 rm

# Creating an archive using tar command
tar cvf archive_name.tar dirname/
tar cvfj archive_name.tar.bz2 dirname/
# z - filter the archive through gzip
# j - filter the archive through bzip2
# Extracting (untar) an archive using tar command
#Extract a *.tar file using option xvf
tar xvf archive_name.tar
#Extract a gzipped tar archive ( *.tar.gz ) using option xvzf
tar xvfz archive_name.tar.gz
#Extracting a bzipped tar archive ( *.tar.bz2 ) using option xvjf
tar xvfj archive_name.tar.bz2



Thursday, July 18, 2013

Hadoop Streaming with Python



--install python
yinst install ypython27 --nosudo

chmod +x wc_map.py

./hello.py

-- Part I: Map/Reduce on Local Machine
cat wc_map.py
#!/usr/local/bin/python
import os, sys, string

what = sys.argv[1]
for ln in sys.stdin:
        ww = ln.rstrip().split("\t")
        ct = reduce (lambda x,y: x+y, [what == w and 1 or 0 for w in ww])
        if ct:
                print "%s\t%d" % (what, ct)

cat wc_reduce.py
#!/usr/local/bin/python
import os, sys, string

ct_ttl=0
for ln in sys.stdin:
        ww = ln.strip().split("\t")
        ct_ln = int(ww[1])
        ct_ttl += ct_ln

print ct_ttl

cat data1 | python wc_map.py dog

cat data1 | python wc_map.py dog > intermediate

cat intermediate | python wc_reduce.py

cat data1 | python wc_map.py dog | python wc_reduce.py

Part II: Map/Reduce on Hadoop Cluster
hadoop jar $HADOOP_PREFIX/share/hadoop/tools/lib/hadoop-streaming.jar \
-Dmapred.job.queue.name=unfunded \
-mapper "python wc_map.py dog  "  \
-reducer "python wc_reduce.py"  \
-input data1  \
-output whatever  \
-file wc_map.py  \
-file wc_reduce.py  \
-jobconf mapred.map.tasks=2  \
-jobconf mapred.reduce.tasks=1 


For example, cat mapper.py
import sys

# input comes from STDIN (standard input)
#f=open("linux.words", "r")
for line in sys.stdin:
#for line in f:
    # remove leading and trailing whitespace
    line = line.strip()
    # split the line into words
    words = line.split()
    # increase counters
    for word in words:
        # write the results to STDOUT (standard output);
        # what we output here will be the input for the
        # Reduce step, i.e. the input for reducer.py
        #
        # tab-delimited; the trivial word count is 1
        print '%s\t%s' % (word, 1)
# testing echo "foo foo quux labs foo bar quux" | python mapper.py


For example, cat reducer.py
#!/usr/bin/env python

from operator import itemgetter
import sys

current_word = None
current_count = 0
word = None

# input comes from STDIN
for line in sys.stdin:
    # remove leading and trailing whitespace
    line = line.strip()

    # parse the input we got from mapper.py
    word, count = line.split('\t', 1)

    # convert count (currently a string) to int
    try:
        count = int(count)
    except ValueError:
        # count was not a number, so silently
        # ignore/discard this line
        continue

    # this IF-switch only works because Hadoop sorts map output
    # by key (here: word) before it is passed to the reducer
    if current_word == word:
        current_count += count
    else:
        if current_word:
            # write result to STDOUT
            print '%s\t%s' % (current_word, current_count)
        current_count = count
        current_word = word

# do not forget to output the last word if needed!
if current_word == word:
    print '%s\t%s' % (current_word, current_count)
   

# testing: echo "foo foo quux labs foo bar quux" | python mapper.py | sort -k1,1 | python reducer.py

Wednesday, July 17, 2013

Google Python Class Day2

Copy & Paste Python code from https://developers.google.com/edu/python/regular-expressions

#Python Regular Expressions;
import re

match = re.search('iig', 'called piiig')
print match.group()

def Find(pat, text):
    match = re.search(pat, text)
    if match: print match.group()
    else: print 'Not Found'

# Basic Examples
. indicate any char
\w word char
\d digit
\s whitespace
\S non-whitespace
+ 1 or more
* 0 or more

Find('iig', 'called piiig')
Find('...g', 'callled piiig')
Find('..gs', 'called piiig')
Find('x..g', 'callled piig  much better :xyzgs')
Find(r'c\.l', 'c.llled piig  much better :xyzgs')
Find(r':\w\w\w', 'blah :cat blah blah')
Find(r'\d\d\d', 'blah : 123****')
Find(r'\d\s\d\s\d', 'blah : 1 2 3 ****')
Find(r'\d\s+\d\s+\d', '1               2            3')
Find(r':\w+', 'blah blah :kitten blabh blah')
Find(r':\w+', 'blah blah :kitten123 blabh blah')
Find(r':\w+', 'blah blah :kitten& blabh blah')
Find(r':.+', 'blah blah :kitten blabh blah')
Find(r':\w+', 'blah blah :kitten123123 blabh blah')
Find(r':\S+', 'blah blah :kitten123123&a=123&yatta blabh blah')
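A few of the Find calls above, rewritten in Python 3 syntax with their expected matches spelled out (the helper is renamed `find` to return the match instead of printing it):

```python
import re

def find(pat, text):
    """Return the matched text, or None if the pattern is not found."""
    m = re.search(pat, text)
    return m.group() if m else None

# Demonstrating the basic tokens from the cheat sheet above
assert find(r'iig', 'called piiig') == 'iig'            # literal chars
assert find(r'..g', 'called piiig') == 'iig'            # . = any char
assert find(r'\d\s*\d\s*\d', 'blah: 1 2 3') == '1 2 3'  # digits + whitespace
assert find(r':\w+', 'blah :kitten123 blah') == ':kitten123'   # \w stops at &
assert find(r':\S+', 'blah :kitten123&a=1 blah') == ':kitten123&a=1'
print('all regex examples matched')
```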

# Email examples
Find(r'\w+@\w+', 'purple alice-b@google.com monkey dishwasher')
Find(r'\w[\w.-]*@[\w.-]+', 'purple alice-b@google.com monkey dishwasher')

# Group Extraction
m = re.search(r'([\w.]+)@([\w.]+)', 'purple alice-b@google.com monkey dishwasher ')
print m.group()
print m.group(1)
print m.group(2)

# findall With Files
m = re.findall(r'[\w.-]+@[\w.]+', 'purple alice-b@google.com monkey dishwasher foo@bar')
m = re.findall(r'([\w.-]+)@([\w.]+)', 'purple alice-b@google.com monkey dishwasher foo@bar')
print m

# findall and Groups
str = 'purple alice@google.com, blah monkey bob@abc.com blah dishwasher'
tuples = re.findall(r'([\w\.-]+)@([\w\.-]+)', str)
print tuples  ## [('alice', 'google.com'), ('bob', 'abc.com')]
for tuple in tuples:
  print tuple[0]  ## username
  print tuple[1]  ## host
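The findall-with-groups pattern above, as a runnable Python 3 sketch (the loop variable is renamed to avoid shadowing the built-in `tuple`):

```python
import re

text = 'purple alice@google.com, blah monkey bob@abc.com blah dishwasher'
# Two parenthesized groups, so each match becomes a (username, host) tuple
pairs = re.findall(r'([\w.-]+)@([\w.-]+)', text)
print(pairs)  # [('alice', 'google.com'), ('bob', 'abc.com')]
for username, host in pairs:
    print(username, host)
```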

# Utilities
# File System -- os, os.path, shutil
import os
dir(os)
help(os.listdir)

# Hello.py contain examples
# Running External Processes -- commands
print os.path.exists('/tmp/foo')
import shutil
#shutil.copy(source, dest)

import commands
#cmd = 'ls'
#print commands.getstatusoutput(cmd)

# Exceptions

# HTTP -- urllib and urlparse
import urllib
uf = urllib.urlopen('http://google.com')
uf.read()
urllib.urlretrieve("http://google.com/...*gif", 'blah.gif')

# cat Hello.py
import sys
import os
import commands

def list(dir):
 
    # File System -- os, os.path, shutil
    filenames = os.listdir(dir)
    print filenames
       
    for filename in filenames:
        print filename 
        print os.path.join(dir, filename)
        print os.path.abspath(os.path.join(dir, filename))
   
    # Running External Processes -- commands
    cmd = 'ls -l ' + dir
    (status, output) = commands.getstatusoutput(cmd)
    if status:
        sys.stderr.write(output)
        sys.exit(1)
    print output
   
def Hello(name):
    if name == 'Alice' or name == 'Nick':
        name = name + '?????'
        DoesNotExist  # undefined name, raises NameError on purpose
    else:
        name = name + "!!!!!"
    print 'Hello', name


def Cat(filename):
    try:
       f = open(filename, 'rU')
       print '-----', filename
       #for line in f:
       #    print line,
   
       #lines = f.readlines()
       #print lines
       text = f.read()
       print text,
       f.close()
    except IOError:
        sys.stderr.write('problem reading:' + filename)

def main():
    print Cat(sys.argv[1])
    #print Hello(sys.argv[1])
    #print list(sys.argv[1])
   
if __name__=='__main__':
    main()

    # string .format examples (a and b are assumed to be defined)
    print 'this is {1}, that is {0}'.format(a,b)
    print 'this is {0}, that is {1}'.format(a,b)
    print 'this is {0}, that is {1}, this too is {1}'.format(a,b)
    print 'this is {bob} and that is {fred}'.format(bob = a, fred = b)
    d = dict(bob = a, fred = b)
    print 'this is {bob} and that is {fred}'.format(**d)
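The .format calls above never define a and b; a runnable Python 3 version with placeholder values (the strings 'spam' and 'eggs' are my own):

```python
a, b = 'spam', 'eggs'   # placeholder values; the notes leave a and b undefined

s1 = 'this is {1}, that is {0}'.format(a, b)                    # positional, reordered
s2 = 'this is {bob} and that is {fred}'.format(bob=a, fred=b)   # named args
d = dict(bob=a, fred=b)
s3 = 'this is {bob} and that is {fred}'.format(**d)             # unpacked dict
print(s1)
print(s2)
print(s3)
```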

Google Python Class Day1


Copy & Paste Python codes from https://developers.google.com/edu/python/introduction

dir(sys)
help(sys)
help(sys.exit) 
help(len)

# Python Introduction
len('Hello')

# Python Program Hello.py
#!/usr/bin/python
import sys

def main():
    print 'Hello there', sys.argv[1]

if __name__ == '__main__':
    main()

# python hello.py Guido

# Modules and Imports
import sys
sys.exit(0)

# Python Strings
a = 'HELLO "poor" and '
b = "Too fancy ''"
c = "\* little"

a.lower()
b.upper()
a.find('e')

# make a copy (list operations; assume a = [1, 2, 3], not the string above)
b = a[:]

b.append(10)
a.pop()        # pop(10) would raise IndexError on a short list
del a
del b[1]

# String Slices
s = 'Hello'
print s[:]
print s[1:3]
print s[-3:]
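The slice results above, checked in Python 3 syntax:

```python
s = 'Hello'
assert s[:] == 'Hello'       # full copy
assert s[1:3] == 'el'        # indexes 1 and 2, end index excluded
assert s[-3:] == 'llo'       # last three chars
assert s[:2] + s[2:] == s    # any split point partitions the string
print('slice checks passed')
```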

text = "%d little pigs come out or I'll %s and %s and %s" % (3, 'huff', 'puff', 'blow down')
print text

# Strings (Unicode)
ustring = u'A unicode \u018e string \xf1'
ustring
s = ustring.encode('utf-8')
print s
t =  unicode(s, 'utf-8')
print t

# If Statement
if speed >= 80: print 'You are so busted'
else: print 'Have a nice day'

# Python Lists
a = [4,2,3,1]
sorted(a)

# FOR and IN
result = []
for s in a: result.append(s)

s = 'this is a string'
for c in s:
       if c == 's': continue
       print c,
      
print '\n'
  
for c in s:
        if c == 's': break
        print c,
  
for c in s:
    print c,
else:
    print 'else'
  
i = 0
while i < len(s):
    print s[i],
    i += 1
else:
    print 'else'

# Range
range(20)
for i in range(20): print i

# Python Sorting
a =['aaaa', 'bbb', 's', 'bb']
sorted(a, key=len, reverse=True)
b=':'.join(a)
b.split(':')

# Tuples
(1,2,3)
(x,y) = (1,2)
a = [(1,"b"), (2,"a"), (1,"a")]
sorted(a)

# Python Dict
d = {}
d['a'] = 'alpha'
d['o'] = 'omega'
d['g'] = 'gamma'

d.get('s')
d.get('a')

'a' in d

d.keys()
d.values()
d.items()

for k in sorted(d.keys()): print 'key:', k, '->', d[k]
for k in d.items(): print k
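The dict idioms above, bundled into one runnable Python 3 sketch:

```python
d = {'a': 'alpha', 'o': 'omega', 'g': 'gamma'}

assert d.get('s') is None     # missing key: get() returns None, no KeyError
assert d.get('a') == 'alpha'
assert 'a' in d

# Traversal in sorted key order
lines = ['key: %s -> %s' % (k, d[k]) for k in sorted(d)]
for line in lines:
    print(line)
```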

# Del
var = 6
del var  # var no more!
 
list = ['a', 'b', 'c', 'd']
del list[0]    
del list[-2:]
print list     

dict = {'a':1, 'b':2, 'c':3}
del dict['b']  
print dict     

# Files
f = open('foo.txt', 'rU')
for line in f: 
   print line,   
f.close()
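The open/iterate/close pattern above can also be written with a `with` block, which closes the file even if an exception is raised. A self-contained sketch (it writes its own temp file rather than assuming foo.txt exists):

```python
import os
import tempfile

# Create a small file to read back
path = os.path.join(tempfile.mkdtemp(), 'foo.txt')
with open(path, 'w') as f:
    f.write('line 1\nline 2\n')

# `with` closes the handle automatically; no explicit f.close() needed
with open(path) as f:
    lines = [line.rstrip('\n') for line in f]
print(lines)
```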

Tuesday, July 16, 2013

R Markdown Quick Reference

Content Weekly Trend
========================================================

### Read-in content raw data

```{r echo=FALSE, results='hide',message=FALSE}
memory.limit()
memory.size(max = TRUE)
rm(list=ls(all=T))

require(data.table)
require(xtable)
require(stringr)
require(ggplot2)

setwd("C:\\Users\\lingh\\Desktop\\Ad effectness\\Y! Index\\Yindex\\Ling\\content awareness");

data <- fread("content_week.csv", head=F, sep=',');
setnames(data, c("brandid", "pv", "wk"));
regexp <- "([[:digit:]]+)"
data$wk1 <- as.integer(str_extract(data$wk, regexp))
str(data)
```

```{r echo=TRUE, results="asis"}
data <- data[,wk:=NULL]
print(xtable(head(data)),type='html')
```

### Plot page views by brand

```{r echo=FALSE, results='hide',message=FALSE}
data1<-data[,sum(pv), by=brandid][order(brandid)]
setnames(data1, c("brandid", "pv"))
brand_meta <- fread("brand_meta.txt", head=F, sep='\t')
setnames(brand_meta, c("brandid", "brand_type"))
setkey(data1, "brandid")
setkey(brand_meta, "brandid")
plot1<-merge(data1,brand_meta)
plot1<-plot1[order(pv)]
plot1$group <- 1:length(plot1$pv)

plot2 <- plot1
plot2 <- transform(plot2, brand_type = factor(brand_type))
plot2 <- transform(plot2, brand_type = reorder(brand_type, rank(group)))
```

```{r echo=TRUE, results="asis"}
print(xtable(plot2),type='html')
```

```{r echo=FALSE, results='hide',message=FALSE, fig.width=20, fig.height=8}
ggplot(plot2, aes(x=brand_type, y=pv)) +
  geom_bar(stat = "identity", fill = "blue") + coord_flip() +
  geom_text(aes(x=brand_type, y=pv, label = paste(round(pv/10^6),"M") , hjust=-.2, face = "bold", size=20)) +
  ggtitle("US Page View Counts") +
  ylab("Page View Counts") + xlab("Brand") +
  theme(plot.title = element_text(face = "bold", size = 20)) +
  theme(axis.text.x = element_text(face = "bold", size = 16)) +
  theme(axis.text.y = element_text(face = "bold", size = 16)) +
  theme(axis.title.x = element_text(face = "bold", size = 16)) +
  theme(axis.title.y = element_text(face = "bold", size = 16, angle = 90)) +
  theme(legend.position = "none")
```


### Plot weekly page views
```{r echo=FALSE, results='hide',message=FALSE}
data2<-data[,sum(pv), by=wk1][order(wk1)]
setnames(data2, c("wk", "pv"))
week_meta <- fread("week_meta.txt", head=F, sep='\t')
setnames(week_meta, c("wk", "week"))
setkey(data2, "wk")
setkey(week_meta, "wk")
plot3<-merge(data2,week_meta)
plot3<-plot3[order(wk)]

plot4 <- plot3
plot4 <- transform(plot4, week = factor(week))
plot4 <- transform(plot4, week = reorder(week, rank(wk)))
```


```{r echo=TRUE, results="asis"}
print(xtable(plot4),type='html')
```

```{r  echo=FALSE, results='hide',message=FALSE, fig.width=20, fig.height=8}
ggplot(plot4, aes(x=week, y=pv)) +
  geom_bar(stat = "identity", fill = "blue")+
  geom_abline(intercept = mean(plot3$pv), color = 'red', size = 1, lty = 3) +
  ggtitle("Daily Page view Counts") +
  ylab("Page Views") + xlab("Week") +
  theme(plot.title = element_text(face = "bold", size = 20)) +
  theme(axis.text.x = element_text(face = "bold", size = 16, angle=-90)) +
  theme(axis.text.y = element_text(face = "bold", size = 16)) +
  theme(axis.title.x = element_text(face = "bold", size = 16)) +
  theme(axis.title.y = element_text(face = "bold", size = 16, angle = 90)) +
  theme(legend.position = "top") +
  theme(legend.key = element_rect(colour = NA)) +
  theme(legend.title = element_blank()) +
  theme(legend.position = "none")
```

Gamma
========================================================

This is an R Markdown document. Markdown is a simple formatting syntax for authoring web pages (click the **MD** toolbar button for help on Markdown).

When you click the **Knit HTML** button a web page will be generated that includes both content as well as the output of any embedded R code chunks within the document. You can embed an R code chunk like this:


Markdown Basics
-------------------------


Emphasis
-------------------------

*italic*

**bold**

_italic_ 

__bold__


Lists
-------------------------

### Unordered List

* Item 1
* Item 2
  * Item 2a
  * Item 2b

### Ordered List

1. Item 1
2. Item 2
3. Item 3
   * Item 3a
   * Item 3b


Manual Line Breaks
-------------------------

Roses are red,    
Violets are blue. 
To you, 
I am always true. 


Links
-------------------------

http://example.com

[example](http://example.com)


Images
-------------------------

![image](http://example.com/logo.png)

![Capture](C:\\Users\\*****\\Desktop\\capture.png)


Blockquotes
-------------------------

A friend once said:

> It's always better to give  
> than to receive.


R Code Blocks
-------------------------

```{r}
summary(cars)
summary(cars$dist)
summary(cars$speed)
```

You can also embed plots, for example:

```{r fig.width=7, fig.height=6}
plot(cars)
```


Inline R Code
-------------------------

There were `r nrow(cars)` cars studied



Plain Code Blocks
-------------------------

```
# This text is displayed verbatim / preformatted
```


Inline Code
-------------------------

We defined the `add` function to
compute the sum of two numbers.


LaTeX Equations
-------------------------


### Inline Equation

$equation$
$\lambda$

$latex equation$
$latex l_a$

\( equation \)
\(\beta+\gamma+\sum_i\)

### Display Equation

$$ equation $$
$$ \lambda $$

$$latex equation $$
$$latex l_a $$

\[ equation \]
\[\beta+\gamma+\sum_i\]


Horizontal Rule / Page Break
-------------------------

Three or more asterisks or dashes:

******

------


Tables
-------------------------

First Header  | Second Header
------------- | -------------
Content Cell  | Content Cell
Content Cell  | Content Cell


Reference Style Links and Images
-------------------------

### Links
A [linked phrase][id].

At the bottom of the document:

[id]: http://example.com/ "Title"

### Images
![alt text][id]

At the bottom of the document:

[id]: C:\\Users\\*****\\Desktop\\capture.png "Title"


Miscellaneous
-------------------------

superscript^2

~~strikethrough~~


Typographic Entities
-------------------------

ASCII characters are transformed into typographic HTML entities: 
Straight quotes ( " and ' ) into “curly” quotes 
Backtick quotes (``like this'') into “curly” quotes 
Dashes (“--” and “---”) into en- and em-dash entities 
Three consecutive dots (“...”) into an ellipsis entity 
Fractions 1/4, 1/2, and 3/4 into ¼, ½, and ¾. 
Symbols (c), (tm), and (r) into ©, ™, and ® 

Wednesday, July 10, 2013

Check User Retention and User Churn Using R


rm(list=ls(all=TRUE))
sessionInfo()

require(data.table)
require(plyr)      # ddply below comes from plyr
require(ggplot2)   # plotting below


#########################################################
## Active Customers
#########################################################
user_act=fread("~/Desktop/Jobs/Tumblr/user_act.csv", header = T)
str(user_act)
user_act$datetime = as.POSIXct(as.numeric(as.character(user_act$ts)),origin="1970-01-01",tz="UTC")
user_act$date= format(user_act$datetime, format = "%m/%d/%Y")
user_act$time=format(user_act$datetime, format = "%H %p")
user_act$label = paste(user_act$date, user_act$time, sep="-")
dim(user_act)

data <- ddply(user_act, .(date, label), summarise, tot=length(user_id))
summary(data$tot)
ggplot(data, aes(y=tot, x=label, fill=date)) +
  geom_bar(stat="identity") +
  ggtitle("Active Users by Time") +
  geom_text(aes(label=data$tot), hjust=-.05, size=4, fontface = "bold") +
  geom_hline(yintercept=median(data$tot), color="blue", linetype = 2) +
  ylab("Hourly Active Users") +
  xlab("Time") +
  coord_flip() +
  theme(plot.title = element_text(face = "bold", size = 20)) +
  theme(axis.text.x = element_text(face = "bold", size = 14)) +
  theme(axis.text.y = element_text(face = "bold", size = 14)) +
  theme(axis.title.x = element_text(face = "bold", size = 16)) +
  theme(strip.text.x = element_text(face = "bold", size = 16)) +
  theme(axis.title.y = element_text(face = "bold", size = 16, angle=90)) +
  guides(fill=FALSE)


#########################################################
## Register Customers
#########################################################
user_regi=fread("~/Desktop/Jobs/Tumblr/user_regi.csv", header = T)
str(user_regi)
user_regi$datetime = as.POSIXct(as.numeric(as.character(user_regi$regi_ts)),origin="1970-01-01",tz="GMT")
user_regi$date= format(user_regi$datetime, format = "%m/%d/%Y")
user_regi$time=format(user_regi$datetime, format = "%H:%M")
dim(user_regi)


Customer retention refers to the ability of a company or product to retain its customers over a specified period. High retention means customers tend to return, continue buying, or otherwise neither defect to a competing product or business nor lapse into non-use entirely.
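The cnt2/cnt1 ratio computed by the SQL loop below (actives on a day who reappear after a gap, over all actives on that day) can be sketched in Python; the `retention` helper and the toy event log are my own illustration, not the production query:

```python
def retention(events, day, gap):
    """Share of users active on `day` who appear again more than `gap` days later.

    events: iterable of (user, day) pairs, with day as an integer.
    """
    base = {u for u, d in events if d == day}        # cnt1: distinct day-`day` actives
    if not base:
        return 0.0
    returned = {u for u, d in events                  # cnt2: those seen after the gap
                if d > day + gap and u in base}
    return len(returned) / len(base)

# Toy log: u1 and u2 return after day 2, u3 never does
events = [('u1', 1), ('u2', 1), ('u3', 1), ('u1', 3), ('u2', 5)]
print(retention(events, day=1, gap=1))  # 2 of 3 users returned
```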

dt <- c(20130409:20130430, 20130501:20130531, 20130601:20130607)
rate <- c()
for (jj in 0:30){
  cat("\n----------------- gap = ", jj, "--------------------\n");
  data <- c()
  for (ii in 1:30) {
    cnt1 <- unlist(sqlQuery(con, paste("select count(*) from
                                     (select distinct userhashkey from
                                     (select userhashkey from vdw.dev.lh_tmp where datekey = ", dt[ii], ") a) b", sep ='')));     

    cnt2 <- unlist(sqlQuery(con, paste("select count(*) from
                                   (select distinct a.userhashkey from
                                   (select userhashkey from dev.tmp where datekey = ", dt[ii], ") a
                                   join (select userhashkey from dev.tmp where datekey > ", dt[ii + jj], ") b using(userhashkey)) a", sep = '')));
    data <- c(data, cnt2 / cnt1)
    cat(ii, cnt2 / cnt1, ' ', cnt2, ' ', cnt1, '\n')
  }
  rate <- c(rate, max(data))
  cat(jj, max(data))
}

data <- data.frame(1:length(rate)-1, c(1, rate[1:30]));
names(data) <- c("days", "sr")
write.csv(data, "sr3.csv")


#########################################################
## User Retention Modeling
#########################################################
head(all)
names(all)

n=nrow(all)
ntrain <- round(n*0.7)
set.seed(333)
tindex <- sample(n, ntrain)
all$y <- all$y==1
train <- all[tindex, ]
test <- all[-tindex, ]
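The seeded 70/30 split above has a direct Python analogue (illustrative only; Python's RNG will not reproduce R's `sample` with `set.seed`):

```python
import random

def split_indices(n, train_frac=0.7, seed=333):
    """Shuffle 0..n-1 with a fixed seed and cut into train/test index lists."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    ntrain = round(n * train_frac)
    return idx[:ntrain], idx[ntrain:]

train_idx, test_idx = split_indices(100)
print(len(train_idx), len(test_idx))  # 70 30
```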

summary(cbind(all$pageviews1, all$follows1, all$likes1, all$reblogs1, all$original_posts1, all$searches1, all$unfollows1, all$received_engagments1))
summary(cbind(train$pageviews1, train$follows1, train$likes1, train$reblogs1, train$original_posts1, train$searches1, train$unfollows1, train$received_engagments1))
summary(cbind(test$pageviews1, test$follows1, test$likes1, test$reblogs1, test$original_posts1, test$searches1, test$unfollows1, test$received_engagments1))


# "y"
# "is_verified"         "pageviews1"           "follows1"             "likes1"            
# "reblogs1"             "original_posts1"  
# "searches1"            "unfollows1"           "received_engagments1"


#########################################################
## Modeling -- KNN: only for continuous values;
#########################################################
train_x <- data.frame(is_verified=as.numeric(train$is_verified),
                      pageviews1=as.numeric(train$pageviews1),
                      follows1=as.numeric(train$follows1),
                      likes1=as.numeric(train$likes1),
                      reblogs1=as.numeric(train$reblogs1),
                      original_posts1=as.numeric(train$original_posts1),
                      searches1=as.numeric(train$searches1),
                      unfollows1=as.numeric(train$unfollows1),
                      received_engagments1=as.numeric(train$received_engagments1),
                      time1=as.numeric(train$time1),
                      regi_geo1=as.numeric(train$regi_geo1))
train_y=train$y
test_x <- data.frame(is_verified=as.numeric(test$is_verified),
                     pageviews1=as.numeric(test$pageviews1),
                     follows1=as.numeric(test$follows1),
                     likes1=as.numeric(test$likes1),
                     reblogs1=as.numeric(test$reblogs1),
                     original_posts1=as.numeric(test$original_posts1),
                     searches1=as.numeric(test$searches1),
                     unfollows1=as.numeric(test$unfollows1),
                     received_engagments1=as.numeric(test$received_engagments1),
                     time1=as.numeric(test$time1),
                     regi_geo1=as.numeric(test$regi_geo1))

tmp <- knn(train_x, test_x, train_y, k=5)
test$score=factor(tmp, levels=c(FALSE, TRUE))
#calculate auc;
AUC(test$score, test$y)
# KS is the maximum difference between the cumulative true positive and cumulative false positive rate.
KS(test$score, test$y)
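AUC() and KS() here come from an R helper package; for reference, both metrics can be computed from first principles. A Python sketch (the function names `auc`/`ks` are my own):

```python
def auc(scores, labels):
    """AUC = probability a random positive outscores a random negative (ties count 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y]
    neg = [s for s, y in zip(scores, labels) if not y]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def ks(scores, labels):
    """KS = max gap between cumulative true positive and false positive rates."""
    pos = [s for s, y in zip(scores, labels) if y]
    neg = [s for s, y in zip(scores, labels) if not y]
    best = 0.0
    for t in sorted(set(scores)):          # sweep every observed threshold
        tpr = sum(s >= t for s in pos) / len(pos)
        fpr = sum(s >= t for s in neg) / len(neg)
        best = max(best, abs(tpr - fpr))
    return best

scores = [0.9, 0.8, 0.4, 0.3]
labels = [1, 1, 0, 1]
print(auc(scores, labels), ks(scores, labels))
```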

ggplot(test, aes(pageviews1, follows1)) +
  geom_point(aes(color = y))


#########################################################
## Modeling --  Support Vector Machines
#########################################################
## 0.6148956 0.2297911
svm <- svm(y ~ is_verified + pageviews1 + follows1 + likes1 + reblogs1 + original_posts1
           + searches1 + unfollows1 + received_engagments1 + time1 + regi_geo1, data=train,
           method = "C-classification", kernel = "radial", cost = 100, gamma = 1)
test$score <- predict(svm, test)
AUC(test$score, test$y)
KS(test$score, test$y)

## 0.621205 0.2424101
svm <- svm(y ~ is_verified + pageviews1 + follows1 + likes1 + reblogs1 + original_posts1
           + searches1 + unfollows1 + received_engagments1 + time1 + regi_geo1, data=train,
           method = "C-classification", kernel = "radial", cost = 1, gamma = 1)
test$score <- predict(svm, test)
AUC(test$score, test$y)
KS(test$score, test$y)


#########################################################
## Modeling -- Classification Trees
#########################################################
# str(train)
# dtre1 <- tree(as.factor(y)~is_verified + pageviews1 + follows1 + likes1
#               + reblogs1 + original_posts1 + searches1 + unfollows1 + received_engagments1, data=train)
# plot(dtre1)
# text(dtre1)
# summary(dtre1)
# test$score <- predict(dtre1, test, type='class')
# prop.table(table(test$score, test$y))
ct <- rpart(as.factor(y) ~ is_verified + pageviews1 + follows1 + likes1
            + reblogs1 + original_posts1 + searches1 + unfollows1 + received_engagments1
            + time1 + regi_geo1,
            data = train, minsplit = 5)
summary(ct)
plot(ct)
text(ct)

test$score = predict(ct, newdata = test, type = "prob")[,2]
#calculate auc;
AUC(test$score, test$y)
# KS is the maximum difference between the cumulative true positive and cumulative false positive rate.
KS(test$score, test$y)


#########################################################
## Modeling -- Naive Bayes
#########################################################
train$y=factor(train$y, levels=c(TRUE, FALSE))
newdata <- data.frame(y=test$y,
                      is_verified=test$is_verified,
                      pageviews1=test$pageviews1,
                      follows1=test$follows1,
                      likes1=test$likes1,
                      reblogs1=test$reblogs1,
                      original_posts1=test$original_posts1,
                      searches1=test$searches1,
                      unfollows1=test$unfollows1,
                      received_engagments1=test$received_engagments1,
                      time1=test$time1,
                      regi_geo1=test$regi_geo1)
# nb1 <- NaiveBayes(y ~ is_verified + pageviews1 + follows1 + likes1 + reblogs1 + original_posts1
#                   + searches1 + unfollows1 + received_engagments1 + time1 + regi_geo1, data=train)
# tmp <- predict(nb1, newdata = newdata)
# table(tmp$class, test$y)
# test$score = tmp$posterior
# #calculate auc;
# pred = prediction(test$score, test$y);
# auc.tmp <- performance(pred,"auc");
# auc <- as.numeric(auc.tmp@y.values); auc;
# perf <- performance(pred,"tpr","fpr");
# plot(perf);
# abline(0, 1, lty = 2);
# # KS is the maximum difference between the cumulative true positive and cumulative false positive rate.
# max(attr(perf,'y.values')[[1]]-attr(perf,'x.values')[[1]])
nb1 <- NaiveBayes(y ~ is_verified + pageviews1 + follows1 + likes1 + reblogs1 + original_posts1
                  + searches1 + unfollows1 + received_engagments1 + time1 + regi_geo1, data=train,
                  kernel = "gaussian", userkernel=TRUE)
summary(nb1)
pred <- predict(nb1, newdata = newdata)
table(pred$class, test$y)
test$score = pred$posterior[,1]
#calculate auc;
AUC(test$score, test$y)
KS(test$score, test$y)


#########################################################
## Modeling -- Logistic Regression
#########################################################
## Logistic Regression Model
glm1 <- glm(y ~ is_verified + pageviews1 + follows1 + likes1, data=train, family=binomial)
summary(glm1)

glm2 <- glm(y ~ is_verified + pageviews1 + follows1 + likes1
            + reblogs1 + original_posts1 + searches1 + unfollows1 + received_engagments1
            + regi_geo1 + time1, data=train, family=binomial)
summary(glm2)
test$score<-predict(glm2,type='response',test)
#calculate auc;
AUC(test$score, test$y)
# pred<-prediction(test$score,test$y);
# auc.tmp <- performance(pred,"auc");
# auc <- as.numeric(auc.tmp@y.values); auc;
# KS is the maximum difference between the cumulative true positive and cumulative false positive rate.
# max(attr(perf,'y.values')[[1]]-attr(perf,'x.values')[[1]])
KS(test$score, test$y)

slm1 <- step(glm2)
summary(slm1)
slm1$anova


#########################################################
## Modeling -- Neural Networks
#########################################################
train$y=as.integer(train$y)
train$y1=1-train$y
train$is_verified=as.integer(train$is_verified)
train$time1=as.integer(train$time1)
train$regi_geo1=as.integer(train$regi_geo1)

nn1 <- neuralnet(y ~ is_verified + pageviews1 + follows1 + likes1 + reblogs1 + original_posts1
                 + searches1 + unfollows1 + received_engagments1 + time1 + regi_geo1,
                 data=train, hidden=c(4))
plot(nn1)

test_x <- data.frame(is_verified=as.integer(test$is_verified),
                     pageviews1=as.integer(test$pageviews1),
                     follows1=as.integer(test$follows1),
                     likes1=as.integer(test$likes1),
                     reblogs1=as.integer(test$reblogs1),
                     original_posts1=as.integer(test$original_posts1),
                     searches1=as.integer(test$searches1),
                     unfollows1=as.integer(test$unfollows1),
                     received_engagments1=as.integer(test$received_engagments1),
                     time1=as.integer(test$time1),
                     regi_geo1=as.integer(test$regi_geo1))
prediction <- compute(nn1, test_x)
prediction <- prediction$net.result

test$score <- prediction[,1]
AUC(test$score, test$y)
KS(test$score, test$y)


#########################################################
## Modeling -- Random Forest
#########################################################
rf <- randomForest(factor(y)~is_verified + pageviews1 + follows1 + likes1
                   + reblogs1 + original_posts1 + searches1 + unfollows1 + received_engagments1
                   + time1 + regi_geo1, data=train, ntree=500, mtry=2, importance=TRUE)
plot(rf)
pred <- predict(rf, newdata = test, type = 'class')
prop.table(table(pred, test$y))

rf <- randomForest(factor(y) ~ ., data = train, ntree = 100, importance=T)
varImpPlot(rf, main="Importance of Variables")  

test$score <- predict(rf, newdata=test, type='class')
table(test$score, test$y)
AUC(test$score, test$y)
KS(test$score, test$y)