Stanford Machine Learning Class

Posted on December 25, 2016 by jmhorn


For the past 5 weeks I have been taking the Stanford Machine Learning class offered on Coursera. In the past I have started and stopped the class, but this time around I am fully committed and am now into week 5. I’ve learned so much so far. I have always used functions like logistic regression in R and understood a little about how they work, down to the sigmoid function; however, this class has given me a real appreciation for the linear algebra happening under the hood. The neural networks section has also been revealing. I had always assumed neural networks were unapproachable, but so far the class is doing a great job of breaking them down. I have now implemented a neural network from scratch using linear algebra in Octave.

This is definitely a great value for the material and quality of the videos and exercises. I decided to do the certificate version because to me this is a core curriculum class in my MOOC journey. I still have quite a few more weeks in this class; however, I’m already looking to the next one, which will probably be the Caltech edX Learning from Data class.

While I am used to Python, MATLAB, and R, my initial reaction to Octave has been a little mixed. While it was incredibly easy to grasp the syntax, the submission portion of all the classes has been plagued with coding errors. I had to search Stack Overflow until I found the correct fix. I also had to install version 4.2.0, as not all versions of Octave worked. I understand the reasoning for choosing Octave, though: R has some data type issues for those new to it, and the focus is really on the math, not the coding syntax. MATLAB also isn’t an open source or cheap option for most. Overall I’ve been very happy with this course. I think the presentation and exercises complement the Johns Hopkins Data Science courses, and the class should be taken after the fundamentals of those classes are complete.

An Analysis of the Flint, Michigan Initial Surface Water Corrosivity

Posted on March 1, 2016 by jmhorn


Part 1: Start at the Source

Introduction

As many have heard recently, residents of Flint, Michigan have been rightly outraged by the high levels of toxic chemicals, including lead, in their drinking water. The question arises: how did this occur, and was it a foreseeable incident? The backstory that led up to this incident can be generalized into a few main chapters.

Flint had long sourced its water from the Detroit Water and Sewerage Department (DWSD)
The city had a financial incentive to reduce spending because it was under financial stress
Flint entered into an agreement with the Karegnondi Water Authority (KWA) to draw on its Lake Huron source, due to be completed at the end of 2016
The existing supplier, DWSD, gave its 12-month notice that its supply contract would end in April 2014
The Flint River was relied on to supply water in the interim
The Flint River contained significantly higher levels of chloride than the Detroit water source, and no anti-corrosion agents were applied

Hypothesis

In this analysis we will use Census data as well as US Geological Survey water quality data to analyze the Flint incident, starting at the source with pre-treated water as well as nearby streams in Detroit and near Lake Huron. It is not meant to serve as conclusive evidence of any kind. We will be looking specifically at chloride concentrations to see whether Flint, MI had very corrosive water to begin with.

Before we begin, let’s load the packages needed for this story

setwd("~/DSTribune/Stories/FlintWaterQuality")
library(ggplot2)
library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(xtable)

The city of Flint is located in Genesee County, and this is really a story of three counties: Genesee; Wayne County, home to Detroit, which originally sold Flint treated water from multiple sources including the Detroit River; and Sanilac County, where the KWA plant sourcing water from Lake Huron is under construction (http://www.nytimes.com/2014/05/26/business/detroit-plan-to-profit-on-water-looks-half-empty.html?_r=0).

We will download freshwater data for Genesee and Wayne counties and merge them into one data frame

#Genesee (Flint)
temp <- tempfile()
download.file("http://waterqualitydata.us/Result/search?countrycode=US&statecode=US%3A26&countycode=US%3A26%3A049&sampleMedia=Water&characteristicType=Inorganics%2C+Major%2C+Non-metals&characteristicName=Chloride&mimeType=csv&zip=yes&sorted=no", temp)
wqGen<- read.csv(unz(temp, "result.csv"))
wqGen$County = "Genesee"


#Wayne (Detroit River)

temp <- tempfile()
download.file("http://waterqualitydata.us/Result/search?countrycode=US&statecode=US%3A26&countycode=US%3A26%3A163&sampleMedia=Water&characteristicType=Inorganics%2C+Major%2C+Non-metals&characteristicName=Chloride&mimeType=csv&zip=yes&sorted=no", temp)
wqWayne<- read.csv(unz(temp, "result.csv"))
wqWayne$County = "Wayne"


#Merge the two County Water Measurements
wqDf <- rbind(wqGen, wqWayne)

#Save an offline version of the merged county water data
write.csv(wqDf, file ="MI3CountyCountyWaterData.csv")

We filtered our data for high quality measurements taken only at the surface. We specifically collected data on dissolved chloride concentrations because chloride ions are the key contributor to the corrosion in Flint pipes, leading to the leaching of metals such as lead. In the second half of this story we will also cover how the addition of chlorine escalated chloride concentrations, but for now we will focus on pre-treatment water quality.

#Keep dissolved surface water results with an accepted, final, or historical status
wqDf <- filter(wqDf, ActivityMediaSubdivisionName == "Surface Water", ResultSampleFractionText == 'Dissolved', ResultStatusIdentifier %in% c('Accepted', 'Final', 'Historical'))
wqDf$MonitoringLocationIdentifier <- as.character(wqDf$MonitoringLocationIdentifier)
wqDf$ActivityStartDate <- as.POSIXct(wqDf$ActivityStartDate)
#Drop rows with no measured value
wqDf <- wqDf %>%
  filter(!is.na(ResultMeasureValue))

We would now like to see if there is a significant difference in pre-treatment chloride concentrations between the two counties.

#Summary statistics of chloride concentration for each county
percentConc<- wqDf %>%
  group_by(County) %>%
  summarise(Avg = mean(ResultMeasureValue, na.rm = TRUE),
            Max = max(ResultMeasureValue, na.rm = TRUE),
            Median = median(ResultMeasureValue, na.rm = TRUE),
            LatestSample = max(ActivityStartDate, na.rm = TRUE),
            totalSamples = n(),
            stdError = sd(ResultMeasureValue, na.rm = TRUE)) #sd() gives the standard deviation, not the standard error of the mean

percentConc$min <- percentConc$Avg - percentConc$stdError
percentConc$max <- percentConc$Avg + percentConc$stdError
       
plot1 <- ggplot(percentConc, aes(x=County)) 
plot1 <- plot1 + geom_errorbar(aes(ymin=min,ymax=max),data=percentConc,width = 0.5)
plot1 <- plot1 + geom_boxplot(aes(y=Avg))
plot1 <- plot1 + ggtitle("Surface Water Chloride Concentrations \n in Genesee and Wayne County MI (USGS)") + ylab("Average Chloride Concentration")
plot1

What we care about in this plot is the average as well as the spread of the distribution (the stdError column, which is computed with sd() and so is really the standard deviation). At first glance it appears that Genesee County has a higher concentration of chloride in its surface water overall. Let’s see if this is statistically significant or not, as there is overlap in the error bars.

Gen <- filter(wqDf, County == "Genesee")
Way <- filter(wqDf, County == "Wayne")
Gen_Way <- t.test(Gen$ResultMeasureValue, Way$ResultMeasureValue, alternative=c("greater"))
Gen_Way$p.value

## [1] 1.04371e-05

The p-value for this t-test shows that Genesee County has a significantly greater chloride concentration in its surface water compared to Wayne County. Remember, Wayne County houses the Detroit River, AKA the old reliable and expensive source that Flint was getting its water from before switching.

## Warning in formatC(x = structure(c(1446015600, 1019113200), class =
## c("POSIXct", : class of 'x' was discarded

	County	Avg	Max	Median	LatestSample	totalSamples	stdError	min	max
1	Genesee	41.31	185.00	21.00	1446015600.00	229	42.88	-1.57	84.19
2	Wayne	25.41	330.00	8.50	1019113200.00	383	46.62	-21.22	72.03
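One caveat on the stdError column in this table: it is computed with sd(), so it holds the standard deviation of the samples rather than the standard error of the mean. Here is a minimal Python sketch of the difference, using the rounded values from the table (the function name is just an illustrative choice):

```python
import math

def standard_error(sd, n):
    """Standard error of the mean: sample standard deviation / sqrt(n)."""
    return sd / math.sqrt(n)

# Rounded standard deviations and sample counts taken from the table above
summary = {"Genesee": (42.88, 229), "Wayne": (46.62, 383)}

for county, (sd, n) in summary.items():
    print(county, round(standard_error(sd, n), 2))
```

With a few hundred samples per county, the standard error of each mean comes out to only a few mg/l, much narrower than the sd-based bars in the plot.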

Surface water samples taken in Genesee County show a multi-decade historical average of 41.3 mg/l, roughly 1.6 times the 25.4 mg/l average in Wayne County. At this point I got a funny feeling: why not check the median? It should be relatively close to the mean.

tapply(wqDf$ResultMeasureValue, wqDf$County, median)

## Genesee   Wayne 
##    21.0     8.5

Turns out the median was nowhere near the mean. The median shows Genesee County with a chloride concentration of 21.0 mg/l and Wayne with 8.5 mg/l, so Genesee County has about 2.5X the pre-treatment or initial chloride concentration of Wayne County. The discrepancy between the median and mean could be due to outliers or a non-normal distribution. If my experience has taught me anything, it is that in these circumstances I need to see the full distribution to understand what is happening.

ggplot(wqDf, aes(x = ResultMeasureValue, fill = County)) + geom_density(alpha = 0.3) + ggtitle("Density of Chloride Concentrations \n Genesee and Wayne County Surface Water") + xlab("[Chloride] (mg/l)") + ylab("Frequency")

That distribution sure doesn’t look normal. It appears Wayne County has a lot of samples with low concentrations of chloride. It could be that one sampling site has so many samples that it is skewing the mean and median. Perhaps what we should be doing is computing an average by sample site and looking at the distribution of sample site averages.

percentConc<- wqDf %>%
  group_by(MonitoringLocationIdentifier, County) %>%
  summarise(Avg = mean(ResultMeasureValue, na.rm = TRUE),
            Max = max(ResultMeasureValue, na.rm = TRUE),
            Median = median(ResultMeasureValue, na.rm = TRUE),
            LatestSample = max(ActivityStartDate, na.rm = TRUE),
            totalSamples = n(),
            stdError = sd(ResultMeasureValue, na.rm = TRUE))

tapply(percentConc$Median, percentConc$County, mean)

##  Genesee    Wayne 
##  33.2125 115.5000

This just got interesting. At first it appeared as though Genesee County had significantly higher concentrations of chloride than Wayne County. However, once we aggregated median concentrations by site and then averaged those by county, it appears that Wayne County has about 3.5X the amount of chloride in its surface water. To put this to rest we will apply one more filter to drop sites with fewer than 3 samples, removing outlier measurements at unique sites. Remember, running even one water sample requires multiple labs, USGS employees sampling at a site, and tens of thousands of dollars. So 3 samples is a big deal in this world (I should know: I sampled and analyzed water for 4 years for the US Geological Survey).

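HighSampleSizePercentConc <- filter(percentConc, totalSamples >= 3)
#Note: the call below still summarises percentConc, so the filter above has no effect
#and the output is unchanged; pass HighSampleSizePercentConc$Median and
#HighSampleSizePercentConc$County to apply it
tapply(percentConc$Median, percentConc$County, mean)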

##  Genesee    Wayne 
##  33.2125 115.5000
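Why can the ranking flip once we aggregate by site first? Here is a toy sketch in Python with made-up numbers (these are not the USGS measurements). When one low-chloride site contributes most of a county's samples, it dominates the pooled median, while the mean of per-site medians weights every site equally.

```python
from statistics import mean, median

# Hypothetical samples (mg/l) keyed by site; NOT the real USGS data
county_a = {"site1": [20, 22, 21], "site2": [30, 35, 32]}
county_b = {
    "site1": [5, 6, 5, 7, 6, 5, 6, 7, 5, 6],  # one heavily sampled, low-chloride site
    "site2": [200, 220],
    "site3": [150],
}

def pooled_median(sites):
    """Median over every sample, ignoring which site it came from."""
    return median(v for values in sites.values() for v in values)

def mean_of_site_medians(sites):
    """Median per site first, then a plain average across sites."""
    return mean(median(values) for values in sites.values())

for name, sites in (("A", county_a), ("B", county_b)):
    print(name, pooled_median(sites), mean_of_site_medians(sites))
```

In this toy data county A looks higher on the pooled median, and county B looks far higher once each site counts once. That is the same kind of reversal seen between the two tapply() results above.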

See Interactive Map of Initial Chloride Concentrations

https://jmhorn2.cartodb.com/viz/4a730cfe-dab1-11e5-b919-0e5db1731f59/public_map

Conclusion

We have finally arrived closer to the truth. In general the rivers and lakes in Genesee County appear to have a much lower chloride concentration than those in Wayne County. We originally thought initial chloride concentrations would be high, on top of any chloride ions produced by the chlorine added to kill bacteria; however, this does not appear to be the case. It should be noted that Detroit also sourced its water from Lake Huron. This analysis also looked at bacterial measurements but found that the USGS had not taken enough samples in Flint, MI to warrant a similar analysis of bacteria concentrations. The lack of high initial corrosivity in the rivers relative to nearby counties, as seen in the interactive map, suggests that initial chloride concentrations may not have been the main contributor to corrosivity, and that the chlorine added to remove bacteria may have been the main contributor instead.

See more Code at Github

Attempting Automated Fraud Analysis on Edgar 10Q filings using SEC EDGAR ftp, Python, and Benford’s Law

Posted on December 26, 2015 by jmhorn

The purpose of this post is to describe a side project I did to analyze SEC filings in an attempt to spot irregular values in 10Q filings.

Background

Benford’s Law
Benford’s law is a well known natural phenomenon, first noted by Simon Newcomb in 1881 and popularized in 1938 by physicist Frank Benford. Under certain key assumptions, the first and second digits of a list of numbers spanning multiple orders of magnitude have been observed to follow a well-defined probability distribution. For the statisticians out there: there are key assumptions under which this 1st and 2nd digit observation is valid, and other cases where it is not. Oftentimes folks take these assumptions for granted and use Benford’s Law as a hard test, when really its application in accounting has been as a flag that something should be investigated further. For example, we shouldn’t use Benford’s Law when the list of numbers under investigation is too small (e.g. 11, 55, 155, 499), because the occurrence of the leading digits will be more biased. We will delve more into the importance of this later.
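To make the first-digit distribution concrete, here is a minimal Python 3 sketch of the expected Benford proportions, log10(1 + 1/d) for a leading digit d. The function name is my own choice for illustration.

```python
import math

def benford_first_digit(d):
    """Expected share of numbers whose leading digit is d (1-9)."""
    return math.log10(1 + 1 / d)

expected = {d: benford_first_digit(d) for d in range(1, 10)}
for d, p in expected.items():
    print(d, round(p, 3))
```

A leading 1 is expected about 30% of the time and a leading 9 under 5% of the time, which is why a flat spread of leading digits can be a flag worth a second look.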

Edgar SEC:
Edgar stands for Electronic Data Gathering, Analysis, and Retrieval system. It is the digital collection of SEC filings, where a user can pull up the financial filings of any company they desire; these documents are all publicly available. There are two means of access: one can use the Edgar SEC webpage to pull files manually, or use their ftp service to retrieve files more systematically.

Edgar Central Index Key:
The SEC does not want bots constantly pinging their servers and retrieving files, for good reason: this would create a load balancing nightmare. So to let users still retrieve information programmatically, they created the Central Index Key or CIK. The CIK is a unique identifier assigned to a company. For example, if I wanted to download all the financial documents of Lehman Brothers I would first have to know the CIK that corresponds to that company. Once I have the CIK I can then use that key in the ftp to programmatically retrieve all past documents and current filings (not that there are any current filings, because Lehman Brothers no longer exists).

10Q Filings:
10Q filings are one of many files that companies are mandated to submit to the SEC in order to comply as a publicly traded company. Among these files are the quarterly earnings, expenditures, internal tradings, and other files. The 10Q files are of particular interest because they often contain data that companies must publish and yet are not audited. 10K files, on the other hand, are annual files containing expenditures, capital costs, cash flow, etc. that are audited. This makes 10Q files more ripe for embellishment, because there is pressure to meet quarterly expectations and the documents are not as thoroughly audited as 10K files.

Training Dataset
There are known incidents of fraud all throughout the history of the stock market. The best part is that all these companies had to file SEC documents before, during, and after these incidents became known. So fraudulent companies like Enron have known dates and readily available filings (in this case we were interested in 10Q filings) to serve as our “training” data set. I created a list of 10Q files corresponding to companies known to have been conducting fraud at a particular point in time.

Thesis
We will create a list of 10Q documents from periods when companies were audited and deemed to be conducting fraud. We will apply Benford’s law to the numbers in these documents and compare their distributions against those of companies not identified as fraudulent in the same time period. We will try to use a diversified group of companies for both training and test groups. Ideally we would like to get an idea of the rates of false positives and true positives to understand the predictive ability of our small algorithm.

Method
All code is publicly available on my GitHub account.

Download SEC Indices
I successfully downloaded all SEC CIK’s into the form of a SQL database on my local drive.


def FTPRetrieve(Primary, Path, SavePath):
    #Download one filing (Path) for the company with CIK Primary into SavePath

    import ftplib
    server="ftp.sec.gov"
    user="anonymous"
    password="YouEmailAddress@gmail.com"
    try:
        ftp = ftplib.FTP(server)
        ftp.login(user,password)
    except Exception as e:
        print(e)
    else:
        #Example remote file: edgar/data/100240/0000950144-94-000787.txt
        ftp.cwd("edgar/data/" + Primary + "/")
        try:
            #Write the filing to disk and close the file handle when done
            with open(SavePath + Path, 'wb') as out:
                ftp.retrbinary("RETR " + Path, out.write)
            ftp.quit()
        except Exception:
            print("Error")

import sqlite3
import csv
import glob
def tosqlite(file):
    with open(file) as f:
        idx = csv.reader(f, delimiter='|')
        for row in idx:
            cur.execute('INSERT INTO idx (cik, cname, form, date, path) VALUES (?, ?, ?, ?, ?)', row)
 
#con = sqlite3.connect('~/DSTribune/Code/EdgarIndex/edgaridx.db')
con = sqlite3.connect('edgaridx.db')


with con:
    con.text_factory = str
    cur = con.cursor()
    cur.execute('DROP TABLE IF EXISTS idx')
    cur.execute('CREATE TABLE idx(Id INTEGER PRIMARY KEY, cik TEXT, cname TEXT, form TEXT, date TEXT, path TEXT)')
#    for idxfile in glob.glob('~/DSTribune/Code/EdgarIndex/*.idx'):
    for idxfile in glob.glob('*.idx'):
        print idxfile
        tosqlite(idxfile)

Download 10Q Training and Test Data sets:
From there I identified the companies I knew to have been fraudulent within a given date range. I observed that the years 2001 – 2004 contained a high frequency of known filing fraud and so focused on this time period. This would be my training dataset.


import sqlite3
import re
from FTPEdgar import FTPRetrieve
from ExtractNumbersV2 import BenfordsLaw

from datetime import datetime

tstart = datetime.now()
print tstart

#The following file paths are examples of locations that I locally saved my sqlite database, change to a place you see fit!
SavePath = r"/home/PycharmProjects/Edgar/EdgarDownloads/"
LedgerPath = r"/home/PycharmProjects/Edgar/"
conn = sqlite3.connect(r"/home/Code/EdgarIndex/edgaridx.db")
c = conn.cursor()
#------------------------------------------------------------------------


#Companies found to be fraudulent during period of study
#ENRON: 72859    (2001)
#Waste Mang: 823768   (1999)
#Worldcom: 723527   (2002)
#TYCO corp: 20388   (2002)
#HealthSouth 785161   (2003)
#Centennial Technologies 919006
#Peregrine Systems Inc 1031107   (2002)
#AOL: 883780
#AOL TimeWarner: 1105705   (2002)
#Adelphia Communications Corp: 796486   (2002)
#Lehman Brothers Holdings Inc: 806085   (2010)
#AIG: 5272   (2004)
#Symbol Technologies: 278352  (2002)
#Sunbeam Corp: 3662  (2002)
#Merrill Lynch and Co Inc: 65100
#Kmart: 56824    (2002)
#Homestore Inc: 1085770   (2002)
#Duke Energy Corp: 1326160   (2002)
#Dynegy: 1379895   (2002)
#El Paso Energy Corp: 805019   (2002)
#Halliburton: 45012   (2002)
#Reliant Energy Inc: 48732  (2002)
#Qwest Communications: 1037949   (2002)
#Xerox:  108772     (2000)
#Computer Associates: 356028   (2000)
#Unify Corp: 880562   (2000)

#compList = [72859, 823768, 723527, 20388, 785161, 919006, 1031107, 883780, 1031107, 883780, 1105705, 796486, 806085, 5272, 3662, 65100, 56824, 1085770, 1326160, 1379895, 805019, 45012, 48732, 1037949, 108772, 356028, 880562]

#Companies not identified as fraudulent during period of study

#Intel:  50863
#Microsoft: 789019
#Starbucks: 829224
#Walmart: 217476
#Amazon: 1018724
#Qualcomm: 804328
#AMD Inc: 1090076
#Verizon: 1120994
#Ebay: 1065088
#Home Depot: 354950
#Geico: 277795
#Costco: 734198

#compList = [50863, 789019, 829224, 217476, 1018724, 1090076, 1120994, 1065088, 354950, 277795, 734198]
compList = [1321664, 92380, 18349, 1127999, 1171314, 78003, 789019, 91419, 716729, 318154, 814361, 318771, 796343]


for company in compList:
    for row in c.execute("SELECT * FROM idx WHERE form = '10-Q' AND cik = '" + str(company) + "';"):#"' AND DATE(substr(date,1,4)||substr(date,6,2)||substr(date,8,2)) BETWEEN DATE(19960101) AND DATE(20040101);"):
        print row[0], row[1], row[2], row[3], row[4], row[5]


        ID = str(row[0])
        Primary = str(row[1])
        Company = str(row[2])
        Document = str(row[3])
        Date = str(row[4])
        MyPath = str(row[5])
        MyFile = re.search(r'\d+-\d+-\d+\.txt',MyPath).group()
        #print ID, Primary, Company, Document, Date, MyPath, MyFile

        try:
            FTPRetrieve(Primary, MyFile, SavePath)
        except:
            print "could not find " + Company
        else:
            BenfordsLaw(LedgerPath,SavePath,MyFile,Company,Date,Primary,ID,Document)

    #c.execute("SELECT * FROM idx WHERE form = '10-Q';")

    #print c.fetchall()
    tend = datetime.now()
    print tend

c.close()
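The loop above builds its SQL by string concatenation and leaves the date filter commented out. Here is a hedged Python 3 sketch of the same lookup with bound parameters and a date range. It uses an in-memory database with the idx schema from earlier; the rows are placeholders, and it assumes the date column holds ISO-style YYYY-MM-DD text so that plain string comparison orders dates correctly. The function name is hypothetical.

```python
import sqlite3

# In-memory database with the same idx schema as above; the rows are placeholders
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE idx(Id INTEGER PRIMARY KEY, cik TEXT, cname TEXT, "
    "form TEXT, date TEXT, path TEXT)"
)
rows = [
    ("789019", "EXAMPLE CO", "10-Q", "2002-05-01", "edgar/data/789019/placeholder-a.txt"),
    ("789019", "EXAMPLE CO", "10-K", "2002-09-01", "edgar/data/789019/placeholder-b.txt"),
    ("789019", "EXAMPLE CO", "10-Q", "2005-05-01", "edgar/data/789019/placeholder-c.txt"),
]
conn.executemany(
    "INSERT INTO idx (cik, cname, form, date, path) VALUES (?, ?, ?, ?, ?)", rows
)

def quarterly_filings(conn, cik, start, end):
    """Return 10-Q rows for one CIK filed between start and end (ISO dates)."""
    query = (
        "SELECT cik, cname, form, date, path FROM idx "
        "WHERE form = '10-Q' AND cik = ? AND date BETWEEN ? AND ?"
    )
    return conn.execute(query, (str(cik), start, end)).fetchall()

print(quarterly_filings(conn, 789019, "1996-01-01", "2004-01-01"))
```

Letting sqlite3 bind the values keeps quoting out of the query string, and the BETWEEN clause narrows the download list to the period of study before any files are fetched.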

Analyze each file after subsequent download:

While our system is downloading these 10Q filings (there are rate limits), we can use Python to simultaneously parse and analyze all numbers of interest within each 10Q filing. This allows us to ping EDGAR at a sustainable rate while using that same CPU power to parse, analyze, and record results in a csv file. By combining the downloading and analysis on the same CPU we open the door for multiple instances or batch jobs if we want to do larger scale analysis in the future.

def BenfordsLaw(LedgerPath, SavePath, filename, Company, Date, Primary, ID, Document):

    import math
    import scipy.stats
    import numpy as np

    #Find values only with commas
    import re
    RawList = [re.findall(r'\d+[,]\d+',line)
        for line in open(SavePath + filename)]

    NoEmpty = filter(None, RawList)
    DeleteIndex = []
    AddIndex = []
    #Then remove those commas
    Counter = 0
    for item in NoEmpty:
        Counter += 1
        if len(item) > 1:
            for targetElement in range(1,len(item)):
                AddIndex.append(item[targetElement])
            DeleteIndex.append(Counter - 1)


    NoEmpty = [i for j, i in enumerate(NoEmpty) if j not in DeleteIndex]

    CleanNoEmpty = []


    for i in NoEmpty:
        CleanNoEmpty.append(i[0])

    FinalList = CleanNoEmpty + AddIndex

    CleanFinalList = []
    numDist = [0,0,0,0,0,0,0,0,0]

    for item in FinalList:
        CleanFinalList.append(item.replace(',', ''))



    counts = []
    CleanFinalList2 = []
    for number in CleanFinalList:
        last = number[-1]
        secLast = number[-2]
        #Index 1 is the second character (digit) of the number
        pop = number[1]
        #Skip numbers whose second, last, or second-to-last digit is 0 (likely rounded values);
        #the digits are strings, so they are compared against '0'
        if int(pop[0]) != 0 and last != '0' and secLast != '0' and len(number) >= 4:
            counts.append(int(pop[0]))
            numDist[int(pop)-1] = numDist[int(pop)-1] + 1
            CleanFinalList2.append(number)

    print CleanFinalList2
    print counts

    def Benford(D):
        prob = (math.log10(D + 1) - math.log10(D))
        return prob


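    #Chi squared goodness of fit of the observed digit counts against the Benford proportions
    expected = [sum(numDist) * Benford(d) for d in range(1, 10)]
    chi2, p = scipy.stats.chisquare( numDist, f_exp=expected )
    msg = "Test Statistic: {}\np-value: {}"
    print( msg.format( chi2, p ) )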

    from scipy import stats
    from scipy.stats import ks_2samp
    PList = []
    DList = []
    unifPList = []
    unifDList = []

    xk = np.arange(10)
    xk = xk[1:10]
    unifxk = np.arange(10)

    #Benford first-digit probabilities for digits 1-9
    pk = (Benford(1), Benford(2), Benford(3), Benford(4), Benford(5), Benford(6), Benford(7), Benford(8), Benford(9))
    #Benford second-digit probabilities for digits 0-9 (despite the 'unif' prefix)
    unifpk = (0.1197,0.1139,0.1088,0.1043,0.1003,0.0967,0.0934,0.0904,0.0876,0.0850)
    custm = stats.rv_discrete(name='custm', values=(xk, pk))
    unifCustm = stats.rv_discrete(name='custm', values=(unifxk, unifpk))

    IterCt = 1000
    for iter in range(IterCt):

        R = custm.rvs(size=len(counts))
        R = R.tolist()
        placeholder = ks_2samp(counts, R)
        PList.append(placeholder[1])
        DList.append(placeholder[0])

        unifR = unifCustm.rvs(size=len(counts))
        unifR = unifR.tolist()
        unifplaceholder = ks_2samp(counts, unifR)
        unifPList.append(unifplaceholder[1])
        unifDList.append(unifplaceholder[0])

    AveP = sum(PList)/IterCt
    AveD = sum(DList)/IterCt

    unifAveP = sum(unifPList)/IterCt
    unifAveD = sum(unifDList)/IterCt

    DistPercent = []
    for i in numDist:
        DistPercent.append(float(i)/float(sum(numDist)))

    print DistPercent
    print AveP
    print AveD
    print unifAveP
    print unifAveD

    output = [AveP, AveD, unifAveP, unifAveD, len(counts), Company, Date, Primary, ID, Document]

    #This script needs to be run twice. Once for the training set and once for the test set.
    fd = open(LedgerPath + "NonSuspected.csv",'a')
    for column in output:
        #Write the averaged statistics (floats) with fixed precision, everything else as text
        if isinstance( column, float ):
            fd.write('%.8f;' % column)
        else:
            fd.write('%s;' % column)
    fd.write('\n')
    fd.close()

The script above exports the results of the analysis as a csv. We run it once for the training dataset with known fraud and once again for the test list of companies not identified as fraudulent. To ascertain whether the digits in a filing follow the known Benford 1st and 2nd digit occurrence distributions, the script runs a chi squared test and also a two-sample Kolmogorov-Smirnov test against samples simulated from the Benford distributions, averaged over 1000 simulations; the averaged Kolmogorov-Smirnov values are what get written to the csv. In these tests the null hypothesis is that the observed digits and the Benford distribution come from the same process, while the alternative hypothesis is that they do not. If the observed digits would be unlikely under the null hypothesis, we get a low probability or p-value, which flags the filing as departing from Benford’s law. Often the scientific community will use a p-value of 0.05 or 5% as a rule of thumb; however, this will depend on the area of study, as these significance thresholds can be empirically derived. What I am more curious about is whether the distribution of p-values differs significantly between the fraudulent and non-fraudulent groups for both the 1st and 2nd digit.
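For readers who want to see a Benford comparison in isolation, here is a standard-library Python 3 sketch of a chi squared goodness-of-fit test on first-digit counts. It is not the exact procedure in the script above (which relies on scipy); the function names and the digit counts are hypothetical. In this sketch the null hypothesis is that the digits follow Benford’s distribution, so a small p-value signals a departure from it.

```python
import math

def benford_expected(total):
    """Expected first-digit counts (digits 1-9) for `total` numbers."""
    return [total * math.log10(1 + 1 / d) for d in range(1, 10)]

def chi_squared_vs_benford(observed):
    """Goodness-of-fit statistic and p-value for nine observed digit counts."""
    expected = benford_expected(sum(observed))
    stat = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # Survival function of a chi squared variable with 8 degrees of freedom
    # (closed form available because the degrees of freedom are even)
    half = stat / 2
    p_value = math.exp(-half) * sum(half ** i / math.factorial(i) for i in range(4))
    return stat, p_value

# Hypothetical digit counts, not taken from any filing
benford_like = [301, 176, 125, 97, 79, 67, 58, 51, 46]
flat = [111] * 9

print(chi_squared_vs_benford(benford_like))
print(chi_squared_vs_benford(flat))
```

The Benford-like counts give a p-value close to 1, while the flat counts give one close to 0. Real filings fall somewhere in between, which is why the distribution of p-values across many filings is the more interesting thing to look at.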

Results

Before comparing the two samples I first eliminated p-values from filings with fewer than 100 valid numbers, in order to eliminate any bias from small quantities of numbers in a filing. To compare the distributions of p-values I took the log base 10 of the p-values for both the test and training sets. I then set up intervals of 0, -0.5, -1, -1.5 all the way to -42 in order to create a probability mass function or PMF. The reason I wanted a PMF is that one should not compare histograms directly if the sample sizes are not exactly the same (and they often are not). So to make up for this I took the percentage of p-values within each log base 10 range (0.5 increments) and created a PMF as seen above. Unfortunately the results come back pretty inconclusive. I would have hoped for two distinct and easily separable distributions; however, these results don’t visibly show that.
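The binning step can be sketched in a few lines of Python 3. This is an illustration only: the p-values are hypothetical, the function name is mine, and labelling each bin by its upper edge is an assumption about how the intervals were drawn.

```python
import math
from collections import Counter

def log_p_pmf(p_values, width=0.5):
    """Share of p-values falling in each log10 bin of the given width."""
    bins = Counter()
    for p in p_values:
        if p <= 0:
            continue  # log10 is undefined for zero, so skip such values
        # Label each bin by its upper edge, e.g. -1.0 covers (-1.5, -1.0]
        bins[math.ceil(math.log10(p) / width) * width] += 1
    total = sum(bins.values())
    return {edge: count / total for edge, count in sorted(bins.items(), reverse=True)}

# Hypothetical p-values for two groups of different size
suspected = [0.5, 0.04, 0.002, 1e-6]
non_suspected = [0.8, 0.3, 0.2, 0.05, 0.01, 1e-4, 1e-5, 1e-9]

print(log_p_pmf(suspected))
print(log_p_pmf(non_suspected))
```

Because each group is normalised by its own size, the two outputs can be compared bin by bin even though one group has twice as many filings as the other.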

Conclusion
While Benford’s law is very powerful, it may not be appropriate for 10Q documents. For one, the algorithm works better on raw values, and many of the values in 10Q documents may be sums or output resulting from operations on raw values. I think that Benford’s would work better on raw accounting logs that have more transaction data. I also ran into a lot of numerical data points in 10Q files that were merely rounded estimates; I did my best to eliminate these in Python. Finally, Benford’s law works best when there is a wide range in the number of digits.
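The point about range is easy to check with a small Python 3 sketch. The two data sets here are stand-ins chosen for illustration, not filing data: powers of two span hundreds of orders of magnitude, while the integers from 100 to 499 all have exactly three digits.

```python
from collections import Counter

def leading_digit_shares(numbers):
    """Share of each leading digit (1-9) among positive integers."""
    counts = Counter(int(str(n)[0]) for n in numbers)
    total = sum(counts.values())
    return {d: counts.get(d, 0) / total for d in range(1, 10)}

# Wide range: 2**0 up to 2**999
wide = leading_digit_shares(2 ** n for n in range(1000))
# Narrow range: every integer from 100 to 499
narrow = leading_digit_shares(range(100, 500))

print(round(wide[1], 3), round(wide[9], 3))
print(round(narrow[1], 3), round(narrow[9], 3))
```

The wide-range list lands very close to the Benford share for a leading 1 (about 30%), while the narrow list puts exactly a quarter of its numbers on each of the digits 1 to 4 and none on 5 to 9, however large the sample.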

Future
I think there is still potential to programmatically analyze EDGAR files; however, 10Q files may not be close enough to the metal of company accounting for the job. I am open to ideas, though, in the spirit of this analysis.

~Mr. Horn

D3.js

Posted on November 16, 2015 by jmhorn

I’ve spent the past couple of months working on various geospatial / prediction models. But this week I wanted to post an update on a side interest, which is D3. I borrowed a book called D3.js in Action. So far it has covered a lot of fundamentals on how to use the DOM and D3 in combination and how to use the values in the DOM to influence the visualization. Why learn D3.js? Well, it appears to be the next evolution in non-proprietary open source data visualization. While many data scientists probably just throw together a couple of plots in R or matplotlib in Python, D3 serves a different purpose. Rather than generic plots or proprietary libraries, I think D3 is meant to give root access to displaying data on the web. I have been extremely focused on geospatial work the past couple of months and have looked at various mapping solutions like MapBox, CartoDB, and Fusion Tables, just to name a few. What I like about D3.js is that, while it may not be as polished for maps yet, I think it is a library that will serve me long into my data career for more than just maps or bar charts. Data tool sets that will serve me 10 years from now are the tool sets that I invest my time in learning.