Dealing with Large Data Files and Bad git commits

AKA Git Squash


As I dive deeper into my job search, I’ve taken up a new data science project to widen my skill set and better align it with my career goals. The main goal of this project is to work with satellite imagery and Convolutional Neural Networks. More specifically, I retrieved a dataset from Kaggle and will be classifying cloud formations in order to better understand weather patterns.

The dataset I downloaded, I soon realized, is far larger than I’m used to working with, and I got absolutely stuck trying to push it to GitHub!

I found the issue and thought others might find a written solution helpful here.

What happened:

I had just run ‘git commit’ on the very first changes to my new repo after connecting to the Kaggle API and downloading my dataset, which was over 5 GB.

git push…

My terminal hit me with the following:

remote: error: GH001: Large files detected. You may want to try Git Large File Storage - https://git-lfs.github.com.
remote: error: See http://git.io/iEPt8g for more information.
To https://github.com/oac0de/Understanding_Cloud_Formations
 ! [remote rejected] main -> main (pre-receive hook declined)
error: failed to push some refs to 'https://github.com/oac0de/Understanding_Cloud_Formations'

At this point, I remembered that GitHub blocks any single file larger than 100 MB! I was way over that. Essentially, Git had committed my changes into local history, but GitHub was rejecting the push, for the obvious reason of saving space on their servers.
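A quick way to spot trouble before pushing is to scan the repo for anything over the limit. This is a minimal sketch, assuming a Unix-like shell:

# list files over GitHub's 100 MB limit, skipping git's own data
find . -path ./.git -prune -o -type f -size +100M -print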

What was I to do here? I thought I could just delete the files, re-commit the changes, and then push, right? Nope. I got the same error message.

As I am continually learning, when you ‘git commit’ something, you are literally etching your changes into git stone, for all eternity. Not really. But you are saving all of your changes and commits into your git history, and a push sends that entire history, large files included, which is why deleting the files alone didn’t help. This is actually one of the reasons git is such robust version control software. Thanks to trusty Stack Exchange, there is a workaround.
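You can see this for yourself: even after deleting a large file from your working directory and committing again, the earlier commit still lists it. A read-only way to check which files each recent commit touched:

# show the files changed in the last three commits
git log --stat -3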

To put a band-aid on the git push issue, so that you can at least upload your code, remove the large files from git’s tracking. The --cached flag untracks them while leaving them on your disk:

git rm --cached [your_large_file_name]
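In my case it looked roughly like this (the file name here is hypothetical; substitute your own). Adding the file to .gitignore keeps it from being accidentally staged again later:

# untrack the large file, then ignore it going forward
git rm --cached train_images.zip
echo "train_images.zip" >> .gitignore
git add .gitignore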

Next, you perform what’s known as “squashing”: combining several commits into one. Here we will squash the last two commits, because I had messed up a second time.

git reset --soft HEAD~2
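The --soft flag rewinds the branch pointer two commits without touching your working files; everything from the undone commits stays staged. If you’re not sure how far back to go, a quick read-only look at the log shows how many commits you’ve made since the last successful push:

# show the most recent commits, one line each
git log --oneline -5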

Almost squashed! The staged changes now just need to be re-committed as one. Enter a message for the combined commit:

git commit -m "New message for the combined commit"
git push
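You can verify that the old commits collapsed into a single new one and that the large files are no longer tracked. Both commands are read-only:

# confirm the squash and the clean working tree
git log --oneline -3
git status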

Voilà. Our code is pushed up to our remote repository, just without our large data files.

The more I experiment and research, the more I find that a better knowledge of Git’s inner workings can save a lot of time. But then again, when an unfamiliar problem arises, we can really thank the selfless professional coders who devote their expertise to us noobs on Stack Exchange. Lifesavers!

I’m a recent Data Science graduate with a B.S. in Environmental Science. Currently seeking job opportunities. Constantly learning!
