
R-Studio: Putting the Lahman Database on Steroids

It’s easy to calculate advanced baseball metrics with the Lahman database. I’ve always compared the database to indoor plumbing, something we all use every day but sometimes take for granted.

The beauty of the database is that it’s not cluttered with useless information, just simple counting stats, but oftentimes we’re looking for more.

If your fingers are getting tired from typing in the formula for plate appearances, ad nauseam, then what you need to do is give your copy of Lahman a healthy shot of PEDs!

The tactic is simple: import your Lahman tables into R-Studio, add your desired calculations and export that bad boy right back into your database.

1. Connect to the Data

I guess you could simply export a csv file, but who has time for that? There are a couple of R packages that will let you connect directly to the database and pull a data frame based on a SQL query. In this case I’m using the RMySQL package to connect to a MySQL server. Alternatively, the RODBC package is pretty good for databases you reach over ODBC, such as SQL Server.

You can download the unaltered tables in csv format here if you don’t want to fool around with a database connection.

library(RMySQL)

drv <- dbDriver("MySQL")
con <- dbConnect(drv, user = "your_name", password = "your_password", dbname = "lahman")
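If you would rather not hard-code your credentials in a script, one option is to read them from environment variables. This is only a sketch: the variable names LAHMAN_USER and LAHMAN_PASSWORD and the helper lahman_args() are made up for this example, and the connection call is left commented out because it needs a running MySQL server.

```r
# Hypothetical helper: collect connection settings from environment variables.
# LAHMAN_USER and LAHMAN_PASSWORD are names invented for this sketch.
lahman_args <- function(dbname = "lahman") {
  list(
    user     = Sys.getenv("LAHMAN_USER"),
    password = Sys.getenv("LAHMAN_PASSWORD"),
    dbname   = dbname
  )
}

# With a MySQL server running, the arguments plug into the usual call:
# library(RMySQL)
# con <- do.call(dbConnect, c(list(MySQL()), lahman_args()))
# ...
# dbDisconnect(con)  # close the connection when you are finished
```

The upside is that the script can be shared or pushed to GitHub without leaking a password.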

Then just grab your data with a simple SQL query. Note, I'm selecting everything because I want to alter the complete table(s).

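#SQL for Batting table
batting <- dbSendQuery(con,
                        "SELECT *
                        FROM Batting")
#Fetch batting into data frame (n = -1 means all rows)
b <- fetch(batting, n = -1)
#Release the result set before sending the next query
dbClearResult(batting)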

Wash, rinse, repeat with the “pitching” and “fielding” tables.
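The "wash, rinse, repeat" step can also be wrapped in a small loop. The sketch below assumes the DBI function dbGetQuery(), which sends a query, fetches every row and clears the result in one call. The helper name pull_tables() and its run_query argument are hypothetical; the argument is only there so the helper can be tried without a live server.

```r
# Hypothetical helper: pull several whole tables into a named list.
# `run_query` defaults to DBI::dbGetQuery and can be swapped out for testing.
pull_tables <- function(con, tables, run_query = DBI::dbGetQuery) {
  out <- lapply(tables, function(tbl) {
    run_query(con, paste("SELECT * FROM", tbl))
  })
  names(out) <- tables
  out
}

# With the connection from step 1:
# lahman <- pull_tables(con, c("Batting", "Pitching", "Fielding"))
# b <- lahman$Batting
# p <- lahman$Pitching
# f <- lahman$Fielding
```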

2. Write Some New Stats

Now we get to the fun part! The full code for this can be found on my GitHub page. I'm just going to give you a few so you get the idea.

b$PA <- (b$AB + b$BB + b$HBP + b$SF + b$SH) #Plate Appearances

b$OBP <- round(((b$H+b$BB+b$HBP)/(b$AB+b$BB+b$HBP+b$SF)), 3) #On base %

b$SLG <- round(((b$H+b$'2B'+2*b$'3B'+3*b$HR)/b$AB), 3) #Slugging

b$ISO <- round((((b$'2B') + (2*b$'3B') + (3*b$'HR')) / b$AB), 3) #Isolated Power

b$OPS <- round(((b$H + b$BB + b$HBP) / (b$AB + b$BB + b$HBP + b$SF)) + (((b$H-b$'2B'-b$'3B'-b$HR)
        + (2 * b$'2B') + (3 * b$'3B') + (4 * b$HR))/b$AB), 3) #On Base Plus Slugging

b$BABIP <- round(((b$H-b$HR)/(b$AB-b$SO-b$HR+b$SF)), 3) #Batting Average on Balls in Play

b$ContactRate <- round(((b$AB-b$SO)/b$AB), 3) #Batter contact rate

b$Kpct <- round((b$SO/b$PA), 3) #Strikeout Rate

b$BBpct <- round((b$BB/b$PA), 3) #Base on Balls rate
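One thing to watch with formulas like these: older seasons in Lahman leave some columns (HBP, SF and SH, for example) empty, and a single NA turns the whole calculation into NA. Rows with zero at bats also produce NaN or Inf. Here is a rough sketch of one way to guard against both, using a made-up toy data frame. The helpers na0() and safe_ratio() are hypothetical, and treating a missing count as zero is an assumption you may not want for every stat.

```r
# Toy rows standing in for the Batting table (values are made up).
toy <- data.frame(AB  = c(500, 0, 400),
                  H   = c(150, 0, 100),
                  BB  = c(50, 1, 40),
                  HBP = c(5, NA, 2),
                  SF  = c(4, NA, NA),
                  SH  = c(1, 0, NA))

# ASSUMPTION: an unrecorded count is treated as zero.
na0 <- function(x) ifelse(is.na(x), 0, x)

# Divide, returning NA instead of NaN/Inf when the denominator is zero.
safe_ratio <- function(num, den, digits = 3) {
  round(ifelse(den == 0, NA_real_, num / den), digits)
}

toy$PA  <- toy$AB + toy$BB + na0(toy$HBP) + na0(toy$SF) + na0(toy$SH)
toy$OBP <- safe_ratio(toy$H + toy$BB + na0(toy$HBP),
                      toy$AB + toy$BB + na0(toy$HBP) + na0(toy$SF))
toy$AVG <- safe_ratio(toy$H, toy$AB)

print(toy)
```

Keeping NA for the zero-at-bat rows also means the value lands in MySQL as NULL rather than as something the database may refuse to store.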

Pretty slick, eh? If you want wOBA and FIP, there's an easy way to do that too!
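For a rough idea of what FIP looks like on the pitching side, here is a minimal sketch using the standard formula, (13*HR + 3*(BB + HBP) - 2*SO) / IP plus a constant. It is not necessarily the easy way mentioned above. The toy values are made up, and the constant of 3.10 is a placeholder assumption: the real one changes from season to season.

```r
# Toy pitching rows (made-up values); column names follow the Lahman Pitching table.
p_toy <- data.frame(HR     = c(20, 5),
                    BB     = c(50, 10),
                    HBP    = c(5, 1),
                    SO     = c(180, 30),
                    IPouts = c(630, 90))

# ASSUMPTION: one constant for every season. The real constant is
# year-specific, so treat 3.10 as a placeholder only.
fip_constant <- 3.10

# Lahman stores innings pitched as outs recorded, so divide by 3.
ip <- p_toy$IPouts / 3

fip_raw <- (13 * p_toy$HR + 3 * (p_toy$BB + p_toy$HBP) - 2 * p_toy$SO) / ip
p_toy$FIP <- round(fip_raw + fip_constant, 2)

print(p_toy)
```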

3. Write the Data Back to the Lahman Database

This is the most important piece of the puzzle. Again, if you like doing things the old-fashioned way, you can simply use the write.csv command, but the database connection we established earlier allows us to write directly to the database.

dbWriteTable(con, name='testbatting', value=b)
dbWriteTable(con, name='testpitching', value=p)
dbWriteTable(con, name='testfielding', value=f)

IMPORTANT: Note that in the code above I'm creating new tables instead of altering the existing ones. The reason for this is simple safety and caution. I name the tables “testbatting” and then go into the database to ensure that they were created properly. Once I feel everything is safe, I can drop the “batting” table and rename my test table.
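If you want an extra safety net for that last step, a slight variation is to rename the original table out of the way instead of dropping it right away. The sketch below builds a MySQL RENAME TABLE statement that does both renames in one go. The helper swap_sql() and the "_old" suffix are made up for this example, and the database calls are commented out because they need the live connection.

```r
# Hypothetical helper: build a MySQL statement that swaps the test table in
# and keeps the original under a backup name instead of dropping it.
swap_sql <- function(live, test, backup = paste0(live, "_old")) {
  sprintf("RENAME TABLE %s TO %s, %s TO %s", live, backup, test, live)
}

print(swap_sql("Batting", "testbatting"))

# With the connection from step 1, after checking the new table:
# stopifnot(dbExistsTable(con, "testbatting"))
# n_db <- dbGetQuery(con, "SELECT COUNT(*) AS n FROM testbatting")$n
# stopifnot(n_db == nrow(b))  # every row made it across
# dbExecute(con, swap_sql("Batting", "testbatting"))
```

Once you are happy with the new table, the backup can be dropped by hand.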

That's it folks. Like I said, not really groundbreaking stuff but I get excited when I figure out any way I can save some time and this method has saved me a ton!

Photo by badspeller

