Wednesday, May 29, 2013

Transform to VW format

import sys

# Convert tab-separated rows on stdin to Vowpal Wabbit (VW) input lines.
# Column 0 is the label, column 1 is continuous, the rest are categorical.
for line in sys.stdin:
    line = line.strip()
    toks = line.split('\t')
    pline = toks[0].strip() + " | "
    # continuous: emitted as index:value
#    for i in range(1, int(sys.argv[3])):
    for i in range(1, min(2, len(toks))):
        if len(toks[i].strip()) == 0:
            continue
        pline = pline + str(i) + ":" + toks[i].strip() + '\t'
    # categorical: one indicator feature per (column, value) pair
    for i in range(2, len(toks)):
        if len(toks[i].strip()) == 0:
            continue
        pline = pline + str(i) + "cell" + toks[i].strip() + ":1" + '\t'
    print(pline)
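
To make the output format concrete, here is a minimal sketch of the same per-line logic written as a function, so it can be tried on a single row without Hadoop. The name to_vw_line, the n_continuous parameter and the sample row are made up for illustration. It follows the script above (label, one continuous column, then categorical columns), but as an assumption it joins features with a single space instead of a tab.

```python
def to_vw_line(line, n_continuous=1):
    """Turn one tab-separated row into a VW-style line.

    Hypothetical helper: column 0 is the label, the next n_continuous
    columns are numeric, and every remaining column is categorical.
    """
    toks = [t.strip() for t in line.strip().split('\t')]
    features = []
    for i, tok in enumerate(toks):
        if i == 0 or not tok:
            continue  # skip the label and empty cells
        if i <= n_continuous:
            features.append("%d:%s" % (i, tok))       # numeric: index:value
        else:
            features.append("%dcell%s:1" % (i, tok))  # categorical indicator
    return toks[0] + " | " + " ".join(features)


# Sample row (made up): label 1, numeric 0.5, then "red", an empty cell, "blue".
print(to_vw_line("1\t0.5\tred\t\tblue"))
# prints: 1 | 1:0.5 2cellred:1 4cellblue:1
```

The empty fourth cell is skipped, so column indices in the output stay tied to the original column positions.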
 
hadoop  jar hadoop-streaming.jar \
-input $1 \
-output $2 \
-mapper "python csv2vm2.py" \
-reducer NONE \
-file csv2vm2.py \
-jobconf mapred.reduce.tasks=5 \
-jobconf mapred.job.queue.name=***;

./csv2vm.sh /train1/* /vwtrain1/
./csv2vm.sh /test1/* /vwtest1/
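
If you would rather drive the two runs from Python than from a shell wrapper, something like the following might work. It is only a sketch: build_streaming_cmd is a made-up helper that assembles the same flags as the script above and nothing more, and the queue name is a placeholder because it is elided in the original.

```python
def build_streaming_cmd(input_path, output_path, queue, mapper="csv2vm2.py"):
    """Return the argument list for the streaming job shown above.

    Hypothetical helper: it only assembles the same flags as the shell
    script and does not contact a cluster.
    """
    return [
        "hadoop", "jar", "hadoop-streaming.jar",
        "-input", input_path,
        "-output", output_path,
        "-mapper", "python " + mapper,
        "-reducer", "NONE",
        "-file", mapper,
        "-jobconf", "mapred.reduce.tasks=5",
        "-jobconf", "mapred.job.queue.name=" + queue,
    ]


# The queue name is elided in the post, so the caller has to supply it.
for src, dst in [("/train1/*", "/vwtrain1/"), ("/test1/*", "/vwtest1/")]:
    cmd = build_streaming_cmd(src, dst, queue="QUEUE_NAME_PLACEHOLDER")
    print(" ".join(cmd))  # to submit, pass cmd to subprocess.check_call(cmd)
```

Passing the arguments as a list means the local shell never expands the `*` in the input path; whether that matters depends on your setup.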



Running Mincemeat Example on Windows

Lightweight MapReduce in Python: https://github.com/michaelfairley/mincemeatpy

server

E:\Python27>python example.py

client (run in a second console once the server is waiting)

E:\Python27>python mincemeat.py -p changeme localhost

Word count example

import glob
import mincemeat

#text_files=glob.glob('hw3data/*')
text_files=glob.glob('hw3data/c0001')
print(text_files)

def file_contents(file_name):
    f=open(file_name,'rb')
    try:
        return f.read()
    finally:
        f.close()

source=dict((file_name,file_contents(file_name))
    for file_name in text_files)

# setup map and reduce functions
def mapfn(key, value):
    for line in value.splitlines():
        for word in line.split():
            yield word.lower(), 1

def reducefn(key, value):
    return key, len(value)

# start the server
s = mincemeat.Server()
s.datasource = source
s.mapfn = mapfn
s.reducefn = reducefn

results = s.run_server(password="changeme")
print(results)
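
To see what the map and reduce functions produce without starting a server or client, they can be run through a small local loop. This is a sketch only: run_local is a made-up stand-in that maps, groups by key and reduces in one process, and the in-memory "file" is invented. It assumes the server stores whatever reducefn returns under each key.

```python
from collections import defaultdict


def mapfn(key, value):
    for line in value.splitlines():
        for word in line.split():
            yield word.lower(), 1


def reducefn(key, values):
    return key, len(values)


def run_local(source, mapfn, reducefn):
    """Hypothetical stand-in for the server: map, group by key, reduce."""
    grouped = defaultdict(list)
    for name, text in source.items():
        for out_key, out_value in mapfn(name, text):
            grouped[out_key].append(out_value)
    return dict((k, reducefn(k, vs)) for k, vs in grouped.items())


# Made-up input: one tiny "file" held in memory.
results = run_local({"doc1": "Hello world\nhello"}, mapfn, reducefn)
print(results)
```

Because reducefn returns key, len(values), each entry in the result is a (word, count) pair, for example ('hello', 2). Returning just len(values) would give plain counts instead.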

Saturday, May 18, 2013

Install VM for Data Science Class


go here

https://my.vmware.com/web/vmware/free#desktop_end_user_computing/vmware_player/5_0

download and install VMWare Player

then download the .ova file from the spark-public link you posted
it will be approximately 2,451,525 KB (about 2.33 GB, or roughly 2.51 billion bytes)

then start VMWare Player and select Open Virtual Machine
select the .ova file you downloaded

it WILL give you an error

it might say "invalid/corrupted file"
in which case the file didn't completely download
so you'll have to try and download it again

but even if it downloads completely and correctly, it will say "file didn't pass .ovf specification, try again"

for the 2nd message, just hit ok and it will try again
it will take a few seconds while it sets up

then an 800 x 600 Linux window will appear

and you're good to go.
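
If you keep hitting the "invalid/corrupted file" message, a quick size check can tell you whether the download was cut short before you try to open it again. The sketch below is an assumption-heavy heuristic: looks_complete is a made-up helper, and the expected size is just the 2.33 GB figure from above, since the exact byte count isn't known.

```python
import os


def looks_complete(path, expected_bytes, tolerance=0.02):
    """Heuristic: True if the file size is within tolerance of expected_bytes.

    Hypothetical helper; a matching size does not prove the file is intact.
    """
    actual = os.path.getsize(path)
    return abs(actual - expected_bytes) <= tolerance * expected_bytes


# Assumption: the "2.33 GB" figure from the post, taken as 2.33 * 1024**3 bytes.
EXPECTED_OVA_BYTES = int(2.33 * 1024 ** 3)
```

Call it as looks_complete(path_to_your_ova, EXPECTED_OVA_BYTES), where the path is wherever you saved the download.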

Since the example won’t work on Windows, you’ll need to use PowerShell to get the first 20 lines of the output file. Here’s how you do it.
Open PowerShell: click Start and, in “Search programs and files”, type “PowerShell”. You’ll see it come up in your list of programs.
Launch PowerShell and type the following:
Get-Content -Path output.json | select-object -first 20 > output.tx
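
If Python is already installed (as in the E:\Python27 examples above), the same "first N lines" step can be done without PowerShell. This is a minimal sketch: head is a made-up function name and the file paths are whatever you pass in.

```python
from itertools import islice


def head(src_path, dst_path, n=20):
    """Copy the first n lines of src_path to dst_path; return lines written."""
    # Binary mode copies the bytes unchanged, whatever the file's encoding.
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        lines = list(islice(src, n))
        dst.writelines(lines)
    return len(lines)
```

One difference worth knowing: in Windows PowerShell the `>` redirect writes UTF-16 by default, while this function leaves the original encoding alone.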