
Data Prep

For this vignette, we'll be using acquisition loan data from Fannie Mae's public datasets. Sadly, there is not yet a Python script to facilitate reading in all of the data (which are stored as pipe-delimited .txt files, released quarterly on a roughly one-year lag), so I had to infer some of my data cleaning steps from the R code that Fannie Mae makes available for downloading and cleaning the data.

If you are not immediately familiar with these data (or mortgage finance data in general), I highly recommend referencing the first table, "Acquisition File Layout", in the dictionary file here, along with the corresponding glossary here, which explains the meaning of the values for each variable.

For our purposes here, I will only be using acquisition data from the first and fourth quarters of 2017; thus, I will need to read in two .txt files. The code below generalizes to however many text files you want to use; you need only change the path_to_data variable to point at the path that stores your .txt files.

Setup

In [1]:
# basic packages
import numpy as np
import pandas as pd
import datetime

# for data importing
import os
import csv

# for data cleaning
from janitor import clean_names, remove_empty
In [2]:
# store the datetime of the most recent running of this notebook as a form of a log
most_recent_run_datetime = datetime.datetime.now().strftime("%Y-%m-%d %H:%M")
f"This notebook was last executed on {most_recent_run_datetime}"
Out[2]:
'This notebook was last executed on 2019-08-05 22:09'

Reading in the Data

In [3]:
# location of my .txt data files
path_to_data = '../data/'

# instantiate an empty list to store the file names to read in
filelist = []

# loop over each filename in the specified directory
for filename in os.listdir(path_to_data):
    # check if the file path corresponds to a file
    # ensure the file in question is a .txt file
    # ensure we haven't already added that file to the list
    if os.path.isfile(path_to_data + filename) \
        and filename.endswith(".txt") \
        and filename not in filelist:
            # append the file to our list of files
            filelist.append(path_to_data+filename)
In [4]:
# take a look at the files we'll be reading in
filelist
Out[4]:
['../data/Acquisition_2017Q4.txt', '../data/Acquisition_2017Q1.txt']
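
As an aside, the standard-library glob module can build the same list in one pass. A minimal sketch, assuming the same ../data/ layout as above:

import glob

# glob returns full relative paths, so there's no need to concatenate
# the directory and filename by hand; sorted() makes the order deterministic
filelist = sorted(glob.glob(os.path.join(path_to_data, "*.txt")))

Note that os.listdir's ordering is filesystem-dependent, which likely explains why Q4 appears before Q1 in the output above.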
In [5]:
# instantiate an empty df that we'll use to store all our data
acq_df = pd.DataFrame()

# list of variable names for all the fields in the .txt files (adapted from the aforementioned docs)
acq_var_names = ['LOAN_ID', 'ORIG_CHN', 'Seller.Name', 'ORIG_RT', 'ORIG_AMT', 'ORIG_TRM', 'ORIG_DTE','FRST_DTE',
                 'OLTV', 'OCLTV', 'NUM_BO', 'DTI', 'CSCORE_B', 'FTHB_FLG', 'PURPOSE', 'PROP_TYP', 'NUM_UNIT', 
                 'OCC_STAT', 'STATE', 'ZIP_3', 'MI_PCT', 'Product.Type', 'CSCORE_C', 'MI_TYPE', 'RELOCATION_FLG']

# loop over the .txt files, read them in, and append them to make our master acquisitions df
for f in filelist:
    # specify that our delimiter is a pipe, ignore the header, and use pre-specified variable names
    temp_df = pd.read_csv(filepath_or_buffer = f, sep="|", header=None, names=acq_var_names)
    # ensure that concatenation is row-wise and ignore the index values as they don't convey meaning here
    acq_df = pd.concat(objs=[acq_df, temp_df], axis=0, ignore_index=True)
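
One note on the loop above: calling pd.concat inside the loop re-copies the accumulated frame on every pass, which is harmless for two files but slow for many quarters of data. A common alternative, shown here as a sketch, is to collect the per-file frames in a list and concatenate once:

# read each file into its own DataFrame, then concatenate a single time
frames = [pd.read_csv(f, sep="|", header=None, names=acq_var_names) for f in filelist]
acq_df = pd.concat(objs=frames, axis=0, ignore_index=True)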
In [6]:
# taking a look at the data structure we have so far
acq_df.head()
Out[6]:
LOAN_ID ORIG_CHN Seller.Name ORIG_RT ORIG_AMT ORIG_TRM ORIG_DTE FRST_DTE OLTV OCLTV ... PROP_TYP NUM_UNIT OCC_STAT STATE ZIP_3 MI_PCT Product.Type CSCORE_C MI_TYPE RELOCATION_FLG
0 100004724719 R GUILD MORTGAGE COMPANY 4.375 228000 360 09/2017 11/2017 95 95 ... SF 1 P HI 967 30.0 FRM NaN 1.0 N
1 100005682269 R WELLS FARGO BANK, N.A. 3.375 75000 120 10/2017 12/2017 79 79 ... SF 1 P MS 394 NaN FRM 765.0 NaN N
2 100005754525 C WELLS FARGO BANK, N.A. 4.750 79000 360 11/2017 01/2018 80 80 ... SF 1 I MO 653 NaN FRM 799.0 NaN N
3 100006973104 C SUNTRUST BANK 4.250 295000 360 11/2017 01/2018 91 91 ... SF 1 P OH 452 30.0 FRM 789.0 1.0 N
4 100007420970 R FREEDOM MORTGAGE CORP. 4.500 178000 360 09/2017 11/2017 63 63 ... SF 1 P CA 924 NaN FRM NaN NaN N

5 rows × 25 columns

Cleaning the Data

For the most part, our data here are pretty clean. In the section below, we'll just make a few convenience changes, run a couple of checks, and create a few new variables.

In [7]:
# use pyjanitor package to take care of basic data cleaning
acq_df = (
    acq_df
    # clean the column names, remove any leading/trailing underscores
    .clean_names(strip_underscores=True)
    # remove any rows that are entirely NA
    .remove_empty()
)
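
If you'd rather not take on the pyjanitor dependency, the same normalization (lower-casing, converting the dots in names like Seller.Name to underscores, and dropping all-NA rows and columns) can be approximated with plain pandas; a rough equivalent:

# lower-case the names, swap '.' for '_', and strip stray underscores
acq_df.columns = [col.lower().replace('.', '_').strip('_') for col in acq_df.columns]
# drop any rows or columns that are entirely NA, mirroring remove_empty()
acq_df = acq_df.dropna(axis=0, how='all').dropna(axis=1, how='all')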
In [8]:
# create a few new fields as recommended by the aforementioned docs

# find minimum credit score of borrower and co-borrower
acq_df['cscore_min'] = (
    acq_df[['cscore_b','cscore_c']].min(axis=1)
)

# find origination value = origination amount / origination loan-to-value ratio
acq_df['orig_val'] = (
    acq_df['orig_amt'] / (acq_df['oltv']/100)
)

# check if the ocltv is null; if it is, set it to the oltv
acq_df['ocltv'] = (
    np.where(acq_df['ocltv'].isnull(), acq_df['oltv'], acq_df['ocltv'])
)
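
The last step above can also be written as a fillna, since filling from a Series aligns on the index; the two spellings are equivalent here:

# fill missing combined LTVs with the plain LTV
acq_df['ocltv'] = acq_df['ocltv'].fillna(acq_df['oltv'])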
In [9]:
# inspect our final cleaned data
acq_df.head()
Out[9]:
loan_id orig_chn seller_name orig_rt orig_amt orig_trm orig_dte frst_dte oltv ocltv ... occ_stat state zip_3 mi_pct product_type cscore_c mi_type relocation_flg cscore_min orig_val
0 100004724719 R GUILD MORTGAGE COMPANY 4.375 228000 360 09/2017 11/2017 95 95 ... P HI 967 30.0 FRM NaN 1.0 N 693.0 240000.000000
1 100005682269 R WELLS FARGO BANK, N.A. 3.375 75000 120 10/2017 12/2017 79 79 ... P MS 394 NaN FRM 765.0 NaN N 737.0 94936.708861
2 100005754525 C WELLS FARGO BANK, N.A. 4.750 79000 360 11/2017 01/2018 80 80 ... I MO 653 NaN FRM 799.0 NaN N 799.0 98750.000000
3 100006973104 C SUNTRUST BANK 4.250 295000 360 11/2017 01/2018 91 91 ... P OH 452 30.0 FRM 789.0 1.0 N 760.0 324175.824176
4 100007420970 R FREEDOM MORTGAGE CORP. 4.500 178000 360 09/2017 11/2017 63 63 ... P CA 924 NaN FRM NaN NaN N 675.0 282539.682540

5 rows × 27 columns
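
As a quick sanity check on the derived orig_val: the first row has an orig_amt of 228,000 and an oltv of 95, and 228000 / (95/100) = 240,000, which matches the value shown above.

# spot-check the derivation on the first row
assert np.isclose(acq_df.loc[0, 'orig_val'], 228000 / 0.95)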

Final Data Trimming

For the sake of size and efficiency, I am going to create a dataset composed only of loans originated in two months: January 2017 and December 2017.

In [10]:
# filter original acq_df to just loans with origination dates in jan or dec 2017
jan_and_dec_17_acqs = acq_df.loc[
    (acq_df['orig_dte'] == '01/2017') | (acq_df['orig_dte'] == '12/2017')
]

# inspect the features of the resulting dataset
row_count, column_count = jan_and_dec_17_acqs.shape
f"The final dataset filtered to just Jan2017 and Dec2017 originations has {row_count} rows and {column_count} columns."
Out[10]:
'The final dataset filtered to just Jan2017 and Dec2017 originations has 158168 rows and 27 columns.'
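
Equivalently, the same filter can be expressed with isin, which scales more cleanly if you later add more months:

# same filter, expressed as membership in a list of origination dates
jan_and_dec_17_acqs = acq_df.loc[acq_df['orig_dte'].isin(['01/2017', '12/2017'])]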
In [11]:
# lastly, we'll save out this dataset for use elsewhere
jan_and_dec_17_acqs.to_csv(path_or_buf='../data/jan_and_dec_17_acqs.csv', index=False)
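
One caveat on the CSV round trip: column dtypes (including the NaN-driven float upcasting visible above) are not preserved on reload. If you have pyarrow or fastparquet installed, writing a parquet file instead keeps dtypes intact; a minimal sketch:

# parquet preserves dtypes on reload, unlike csv (requires pyarrow or fastparquet)
jan_and_dec_17_acqs.to_parquet('../data/jan_and_dec_17_acqs.parquet', index=False)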
