Extract new rows in updated CSV file: Simple Tricks with Pandas


Extract new rows in updated CSV file: Simple Tricks with Pandas

2019/01/04

Tags: python pandas CSV

Problem

Let’s describe the problem visually. Say, I have a file with the following contents:

Year Code Price
1995 X1 20
1996 X2 30
1997 X4 20
2000 X4 69

Then, we get an updated CSV file with new rows added somewhere:

File with new rows

Year Code Price
1995 X1 20
1996 X2 30
1996 X3 40
1997 X4 20
2000 X4 69
2001 X1 25

We now need a solution that outputs the following based on previous two files:

Year Code Price
1996 X3 40
2001 X1 25

Above rows are new rows that were added, not necessarily appended, to the new CSV file.

Solution using Pandas

I now present a succinct solution to this. I must also warn here that I am a complete beginner in Pandas, and there should be better ways to approach this. Here is the 8-liner that achieves this:

import sys
import pandas as pd

a = pd.read_csv(sys.argv[1])
a.set_index(a.columns.values.tolist(), inplace=True)
b = pd.read_csv(sys.argv[2])
b.set_index(b.columns.values.tolist(), inplace=True)

new_dataframe = a[~a.index.isin(b.index)]
new_dataframe.reset_index().to_csv(sys.argv[3], index=False)

With this script in place, we can do:

python difference.py old_csv_file new_csv_file diffed_rows

The file diffed_rows will now have rows that were not in old_csv_file.

Discussion

Let’s dissect it line by line:

a = pd.read_csv(sys.argv[1])
...
b = pd.read_csv(sys.argv[2])

This reads CSV files specified as command-line arguments to the script.

a.set_index(a.columns.values.tolist(), inplace=True)
b.set_index(b.columns.values.tolist(), inplace=True)

The trick here is that, since we do not have an unique identifier for rows, we create a row index based on all the values of the row. The inplace=True argument will set indices in the existing a and b dataframes rather than creating another copy of them.

With indices in place, we can do index-based queries like:

new_dataframe = a[~a.index.isin(b.index)]

Self-explanatory, this line will create a new dataframe which is a set equivalent of A-B.

Finally, we convert this new dataframe’s index as a data by resettting its index. And, the resulting output is written to a CSV file:

new_dataframe.reset_index().to_csv(sys.argv[3], index=False)

A bit of background

I was maintaining a Django webapp that relied on the data source that periodically provided CSV dump of their database. The original dump had around 10000 rows and each new update would add around 500 new rows randomly somewhere in the file. To create a update logic, few concepts came to mind: using get_or_create(which would have been horribly inefficient) or doing some kind of bulk UPSERTs. And, I was in no mood for severe optimizations. Rather than hitting database to check existing entries and only update new ones, I decided that doing a diff on old and new CSV data would be the most simplest solution for my scenario.