Let’s describe the problem visually. Say, I have a file with the following contents:
Then, we get an updated CSV file with new rows added somewhere:
File with new rows
We now need a solution that outputs the following based on previous two files:
Above rows are new rows that were added, not necessarily appended, to the new CSV file.
Solution using Pandas
I now present a succinct solution to this. I must also warn here that I am a complete beginner in Pandas, and there should be better ways to approach this. Here is the 8-liner that achieves this:
import sys import pandas as pd a = pd.read_csv(sys.argv) a.set_index(a.columns.values.tolist(), inplace=True) b = pd.read_csv(sys.argv) b.set_index(b.columns.values.tolist(), inplace=True) new_dataframe = a[~a.index.isin(b.index)] new_dataframe.reset_index().to_csv(sys.argv, index=False)
With this script in place, we can do:
python difference.py old_csv_file new_csv_file diffed_rows
diffed_rows will now have rows that were not in
Let’s dissect it line by line:
a = pd.read_csv(sys.argv) ... b = pd.read_csv(sys.argv)
This reads CSV files specified as command-line arguments to the script.
a.set_index(a.columns.values.tolist(), inplace=True) b.set_index(b.columns.values.tolist(), inplace=True)
The trick here is that, since we do not have an unique identifier for rows, we create a row index based on all the values of the row. The
inplace=True argument will set indices in the existing
b dataframes rather than creating another copy of them.
With indices in place, we can do index-based queries like:
new_dataframe = a[~a.index.isin(b.index)]
Self-explanatory, this line will create a new dataframe which is a set equivalent of
Finally, we convert this new dataframe’s index as a data by resettting its index. And, the resulting output is written to a CSV file:
A bit of background
I was maintaining a Django webapp that relied on the data source that periodically provided CSV dump of their database. The original dump had around 10000 rows and each new update would add around 500 new rows randomly somewhere in the file. To create a update logic, few concepts came to mind: using
get_or_create(which would have been horribly inefficient) or doing some kind of bulk UPSERTs. And, I was in no mood for severe optimizations. Rather than hitting database to check existing entries and only update new ones, I decided that doing a diff on old and new CSV data would be the most simplest solution for my scenario.