Sunday 11 October 2015

Wrong buses - part two - SVM

After diligently collecting a pile of bus movement data, it is time to see whether we can recognise the 'outlier buses'. The idea was that the system should be able to predict where in Chicago a bus ought to be, based on 4 earlier measurement points (one measurement every 2 minutes). I decided to feed the data to the scikit-learn SVM (support vector machine) algorithm, using only the data for buses with more than 50 recorded observations. It seems to work quite nicely. For example, I get these results:

0 est= [ 41.87393951] real= 41.9331611883 perc= [ 0.14122874] est= [-87.62910461] real= -87.6452115168 perc= [ 0.01837739]
1 est= [ 41.87393951] real= 41.9370801714 perc= [ 0.15056045] est= [-87.62910461] real= -87.6482599046 perc= [ 0.02185473]
2 est= [ 41.87393951] real= 41.9398941161 perc= [ 0.15725982] est= [-87.62910461] real= -87.6506749713 perc= [ 0.02460946]
3 est= [ 41.87393951] real= 41.9439892769 perc= [ 0.16700787] est= [-87.62910461] real= -87.6538715363 perc= [ 0.02825537]
4 est= [ 41.87393951] real= 41.9488455455 perc= [ 0.17856518] est= [-87.62910461] real= -87.6577396393 perc= [ 0.03266685]
5 est= [ 41.87393951] real= 41.9534422633 perc= [ 0.18950233] est= [-87.62910461] real= -87.6614710796 perc= [ 0.03692211]
6 est= [ 41.87393951] real= 41.962390813 perc= [ 0.21078708] est= [-87.62910461] real= -87.6661007621 perc= [ 0.0422012]
7 est= [ 42.01862335] real= 41.9673858056 perc= [ 0.12208896] est= [-87.62910461] real= -87.667195012 perc= [ 0.04344886]
8 est= [ 42.01862335] real= 41.9742720468 perc= [ 0.10566307] est= [-87.62910461] real= -87.6682257516 perc= [ 0.04462408]
9 est= [ 42.01862335] real= 41.8739395142 perc= [ 0.34552239] est= [-87.62910461] real= -87.6305847168 perc= [ 0.00168903]
10 est= [ 42.01862335] real= 41.8739395142 perc= [ 0.34552239] est= [-87.62910461] real= -87.6305847168 perc= [ 0.00168903]
11 est= [ 41.87393951] real= 41.8726501465 perc= [ 0.00307926] est= [-87.62910461] real= -87.630531311 perc= [ 0.00162808]
12 est= [ 41.87393951] real= 41.8745476943 perc= [ 0.00145239] est= [-87.62910461] real= -87.6291046143 perc= [ 0.]
13 est= [ 41.87393951] real= 41.8781782532 perc= [ 0.01012159] est= [-87.62910461] real= -87.629161377 perc= [  6.47760393e-05]
14 est= [ 41.87393951] real= 41.8800046709 perc= [ 0.01448223] est= [-87.62910461] real= -87.6294408728 perc= [ 0.00038373]
15 est= [ 41.87393951] real= 41.881796893 perc= [ 0.01876084] est= [-87.62910461] real= -87.629375009 perc= [ 0.00030857]
16 est= [ 41.87393951] real= 41.884943702 perc= [ 0.02627242] est= [-87.62910461] real= -87.6291711981 perc= [  7.59836012e-05]
17 est= [ 41.87393951] real= 41.8869228003 perc= [ 0.03099604] est= [-87.62910461] real= -87.629644286 perc= [ 0.00061586]
18 est= [ 41.87393951] real= 41.889896046 perc= [ 0.0380916] est= [-87.62910461] real= -87.6295939359 perc= [ 0.0005584]
19 est= [ 41.87393951] real= 41.8905435249 perc= [ 0.03963666] est= [-87.62910461] real= -87.629737854 perc= [ 0.00072263]
20 est= [ 41.87393951] real= 41.8931026459 perc= [ 0.04574293] est= [-87.62910461] real= -87.6297836304 perc= [ 0.00077487]

The percentages give the deviation between the prediction and the real position: the first est/real/perc group on each line is the latitude, the second the longitude. The next step is to work out how many kilometres that distance amounts to.
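A rough sketch of that conversion, assuming the haversine (great-circle) formula and a mean Earth radius of 6371 km. The function name haversine_km and the constant are made up for this example and are not part of the original script:

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; an assumption, fine at city scale


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


# Row 0 of the output above: estimated position versus real position.
d = haversine_km(41.87393951, -87.62910461, 41.9331611883, -87.6452115168)
print(round(d, 2), 'km')
```

For row 0 this comes out at roughly 6.7 km, even though the latitude deviates by only 0.14%. A percentage of a coordinate around 41.9 hides a lot of distance, so kilometres are probably the more useful measure for spotting outliers.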


Here is the code used:

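from __future__ import print_function  # lets the print() calls below also run on Python 2

import numpy as np
from sklearn import svm

try:
    import joblib  # standalone package used alongside current scikit-learn
except ImportError:
    from sklearn.externals import joblib  # location in the 2015 scikit-learn releases

# Column 0 is the bus id and column 1 the sort key (presumably the timestamp).
# Judging by the output, column 2 holds the latitude and column 3 the longitude.
Data = joblib.load('/Users/DWW/Documents/Bus_outlier.pkl')
Data = np.array(sorted(Data, key=lambda x: (x[0], x[1])))
busses = sorted(set(Colm[0] for Colm in Data))

def perc(est, real):
    """Absolute deviation of est from real, as a percentage of real."""
    if real == 0:
        return 0
    return np.absolute(100 * (real - est) / real)

# Sliding windows: four consecutive measurements (X) predict the fifth (Y).
# The "l" arrays are built from column 2, the "b" arrays from column 3.
Xl, Yl, Xb, Yb = [], [], [], []
for i in busses:
    HData = Data[Data[:, 0] == i]
    print(i, len(HData))
    if len(HData) > 50:  # Only buses with more than 50 observations.
        for j in range(len(HData) - 4):
            Xl.append([float(HData[k][2]) for k in range(j, j + 4)])
            Yl.append(float(HData[j + 4][2]))
            Xb.append([float(HData[k][3]) for k in range(j, j + 4)])
            Yb.append(float(HData[j + 4][3]))
Xl, Yl, Xb, Yb = np.array(Xl), np.array(Yl), np.array(Xb), np.array(Yb)

# Train on the first 90% of the windows and test on the last 10%.
# The split index must be an integer: a float slice index is an error in current NumPy.
si = -int(0.1 * len(Xl))

# Note: SVC is a classifier, so every distinct coordinate value is treated as a class
# label and the estimates can only repeat values seen during training. Recent
# scikit-learn releases may reject continuous targets like these; svm.SVR is the
# regression counterpart.
clf_l = svm.SVC(gamma=0.01, C=100)
clf_l.fit(Xl[:si], Yl[:si])
clf_b = svm.SVC(gamma=0.01, C=100)
clf_b.fit(Xb[:si], Yb[:si])

for i in range(len(Xl[si:])):
    # predict() expects a 2-D array, so reshape the single window to one row.
    lpred = clf_l.predict(Xl[si + i].reshape(1, -1))
    bpred = clf_b.predict(Xb[si + i].reshape(1, -1))

    print(i, 'est=', lpred, 'real=', Yl[si + i], 'perc=', perc(lpred, Yl[si + i]),
          'est=', bpred, 'real=', Yb[si + i], 'perc=', perc(bpred, Yb[si + i]))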
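
The estimates in the output stay stuck on a few values (41.87393951 shows up again and again), which is what you would expect when a classifier picks from the positions it has already seen. Since a position is a continuous quantity, a regression model might fit the task better. Below is a minimal sketch of that idea, assuming scikit-learn's SVR and a synthetic latitude track in place of the real bus data. The helper names make_windows and scale, the track itself and all parameter values are assumptions for illustration, not something taken from the script above:

```python
import numpy as np
from sklearn.svm import SVR


def make_windows(series, width=4):
    """Turn a 1-D series into (X, y): `width` consecutive values predict the next one."""
    series = np.asarray(series, dtype=float)
    X = np.array([series[j:j + width] for j in range(len(series) - width)])
    y = series[width:]
    return X, y


# Synthetic latitude track (NOT the real bus data): a bus shuttling north and
# south along a route, sampled 300 times, with a little measurement noise.
rng = np.random.RandomState(0)
t = np.arange(300)
track = 41.9 + 0.05 * np.sin(t / 10.0) + rng.normal(0.0, 0.0001, t.size)

X, y = make_windows(track)
split = int(0.9 * len(X))  # first 90% for training, last 10% for testing

# Standardise with training statistics only. The coordinates vary by just a few
# hundredths of a degree, which is below SVR's default epsilon of 0.1.
mu, sigma = X[:split].mean(), X[:split].std()


def scale(a):
    """Map raw coordinates to zero mean and unit spread (training statistics)."""
    return (a - mu) / sigma


model = SVR(kernel='rbf', C=100, gamma='scale', epsilon=0.01)
model.fit(scale(X[:split]), scale(y[:split]))

# Undo the scaling to get predictions back in degrees.
pred = model.predict(scale(X[split:])) * sigma + mu
mae = np.mean(np.abs(pred - y[split:]))
print('mean absolute latitude error (degrees):', mae)
```

The same windows could be built per bus, with one model for latitude and one for longitude, as in the script. Whether SVR actually beats the classifier on the real data is something that would have to be measured.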
