File too large checked in

From genomewiki

Revision as of 08:14, 25 April 2021

FILE TOO LARGE CHECKED IN and HOW TO FIX IT

When I do git push I see this error:

Exceeds file size limit 2200000. 

WHY BIG FILES ARE NOT ALLOWED

The kent repo has a limit (currently 2.2 MB) on the size of files being checked in. The restriction is implemented as a hook in the central shared repo that developers push to. We already did not want large files checked in, and during the transition from CVS to git many huge test files were removed. GitHub also has size restrictions that have to be honored. Without this limit, people would find the kent repo excessively bloated and hard to use. This is a repository of source code text, which is small.

WHY PEOPLE CHECK IN BIG FILES

Because developers are encouraged to make a standard tests subdirectory for their kent utilities, testing files get checked in, and unless care is exercised, it is very easy for programmers who deal with giant genomics files to check them in accidentally. Also, sometimes people want to check in PDF documents and some reasonably sized JPG or PNG images. Please use JPG for camera images, since it gives better compression and a smaller size. PNG uses lossless compression, which is bigger, and is good for diagrams and other non-photographic things with a small number of colors. And sometimes, people just make a mistake, or forget about the limit.

WHY DO I FIND OUT ABOUT IT SO LATE?

When you clone a repo, hooks are not cloned, so there is no easy way to give them to all users. There are some incomplete and limited ways to add a hook that would detect the problem during git commit. We are looking into ways to improve this so you could get earlier warning about a file being too large.
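
One of the limited options mentioned above is a local pre-commit hook. The following is only a minimal sketch, not the hook used by the central repo: the function name list_oversized_staged is hypothetical, the limit is copied from the push error message, and it assumes that only files strictly larger than the limit are rejected.

```shell
# Sketch of a size check that could live in .git/hooks/pre-commit.
MAX_BYTES=2200000   # value taken from the push error message (assumption: strict ">")

# Print "size path" for every staged (added or modified) file over MAX_BYTES.
# Limitation: paths containing newlines or other unusual characters are
# quoted by git in this output and will not be handled correctly.
list_oversized_staged() {
    git diff --cached --name-only --diff-filter=AM |
    while IFS= read -r path; do
        # ":path" names the version of the file stored in the index
        size=$(git cat-file -s ":$path") || continue
        if [ "$size" -gt "$MAX_BYTES" ]; then
            printf '%s %s\n' "$size" "$path"
        fi
    done
}

# A real hook script would end with something like:
#   oversized=$(list_oversized_staged)
#   [ -z "$oversized" ] || { echo "Too large:"; echo "$oversized"; exit 1; }
```

A pre-commit hook that exits non-zero stops the commit, so the problem would show up before any history needs rewriting. Each developer would still have to install the script by hand, since hooks are not cloned.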

WHY IS IT SO HARD TO FIX

Since git is a powerful source code control system, you might hope that it would easily handle this situation. However, because git builds immutable trees, which are a good thing for so many purposes, removing something or changing it requires changing the git history of the branch. We must avoid pushing large files to the shared repo main branch. Once it goes there, hundreds of users all over the world will pick it up automatically, and there is no way to go around fixing up all of those copies to remove large files from their history.

However, git can indeed fix the history of a branch in your local git tree which has not been pushed. And that is what we are going to do here.

FIXING YOUR LOCAL BRANCH WITH LARGE FILE CHECKED IN

In order to fix your branch, you are going to have to use some form of git rebase on it, otherwise, it could never be fixed.

A common case is where a user realizes the mistaken large file, and uses git rm to remove it, or uses git add to replace it with a smaller version of the file, such as a test file or jpg image or pdf, and git commit. So the large file no longer exists on the tip of their branch. However, it does exist in the history.

As usual with all of this stuff, if you have uncommitted changes, commit them or use stash to clean up your repo for action.

git add someChangedFile   # name the files you want to keep; this is often a good choice.
git commit

or

git stash  # only if needed

SQUASH?

This will only work if you have already removed the large file, which many people may have already done. If not, you can do this:

 git rm someFileLarge

or edit the large file to reduce its size and re-add it

 git add someFileNowSmaller

follow up with the usual

 git commit

So there is no large file on your branch tip. The system is smart enough to skip large files that no longer exist when it does the squash.

People often squash their development branch anyway, which makes code review easier since it is just one big commit.

IF YOUR LARGE FILE IS ON A DEV BRANCH

 git checkout master

As usual, you may have to handle conflicts during any merge.

 git merge --squash myDevBranch
 git commit   # merge --squash only stages the result; it does not commit

Rename the original branch so you know it has been squashed.

 git branch -m myDevBranch myDevBranchSquashed

Eventually, you will need to delete myDevBranchSquashed to recover its space if you care.
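
To see why the squash helps, the steps above can be tried in a throwaway sandbox. This is a sketch under stated assumptions: the repo, the file names and the commit messages are all made up for illustration, and nothing here touches the kent repo.

```shell
set -e
sandbox=$(mktemp -d)
cd "$sandbox"
git init -q .
git symbolic-ref HEAD refs/heads/master   # make sure the first branch is "master"
git config user.name "Sandbox User"
git config user.email "sandbox@example.com"

echo "code" > main.c
git add main.c
git commit -q -m "Initial commit"

# A dev branch that adds a large file and later removes it again.
git checkout -q -b myDevBranch
head -c 3000000 /dev/zero > someLargeFile
git add someLargeFile
git commit -q -m "Oops large file"
git rm -q someLargeFile
echo "more code" >> main.c
git add main.c
git commit -q -m "Remove large file, keep real work"

# Squash the dev branch onto master as one new commit.
git checkout -q master
git merge -q --squash myDevBranch
git commit -q -m "Squashed myDevBranch"
git branch -m myDevBranch myDevBranchSquashed

# Count the commits on master that ever touched the large file.
touched=$(git log --oneline master -- someLargeFile | wc -l | tr -d ' ')
echo "commits on master touching someLargeFile: $touched"
```

Because the squashed commit is built from the final state of the dev branch, the large file that was added and then removed never appears in the history of master.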

IF YOUR LARGE FILE IS ON MASTER BRANCH

Turn your master branch into a dev branch, then create a new master branch and squash the old one onto it. Only do this if it makes sense.

 git fetch  # update origin/master
 git branch -m master tempMaster
 git branch master origin/master

Look at .git/config to fix master branch tracking if needed.

 git checkout master
 git merge --squash tempMaster
 git commit   # merge --squash only stages the result; it does not commit
 git push

After a few days, you can delete tempMaster if you do not need it. This should also allow git garbage collection to clean that large file from your own local repo.

 git branch -D tempMaster
 

The benefit of SQUASH is that it is simple and you are done.

The disadvantage is that you lose your commit history, and all those changes just became one big commit on master branch. This is just right for many users.


GIT CHERRY-PICK?

NOT RECOMMENDED. If you only have a handful of commits, and you know which ones they are, you can try this method. It is tedious. You would have to use git log to find which specific commits need to be saved. You might have to turn the master branch into a dev or temp branch as above, create a new master, and then cherry-pick specific commits from the temp branch onto master. You may still need to do a git rebase -i if you cannot make the large file go away simply by skipping a no-longer-needed commit or two.

GIT REBASE

Use the squash method (see above) if that works for you.

But otherwise, use git rebase.

Git rebase is our friend for crises like this. But it has to be used properly.

HAVE YOU MERGED FROM MASTER?

In particular, you may have merged from master before you noticed the large file error message during a later push. Your branch could then easily hold dozens of your own commits mixed with hundreds of commits made by the rest of the team and pulled in from master. It might even be months since you last pushed successfully, with several pulls in between.

So if you have done even one merge from master before you discovered the problem, which is pretty common to happen, then you should proceed with GIT REBASE TO TIP.

If you are ABSOLUTELY certain that you have NOT git pulled even once on your problem branch, then skip this step and go ahead to the GIT REBASE INTERACTIVE section.

GIT REBASE TO TIP

Rebasing your entire branch onto the tip of the master branch is super useful here because it automatically gathers all of your commits together and puts them at the tip of the branch. This gets rid of the merge commits from master, and simplifies the history. Note that this is just the first step, and does not fix the large file issue itself.

The rebase-to-tip avoids a big problem that you would otherwise have with git rebase interactive, since there could be hundreds of commits made by others from those pulls from master you did earlier. Sadly, git rebase makes you handle merge conflicts, but if all of your commits are gathered together at the end, you are looking at, say, 7 of your own commits rather than 806 commits made by dozens of people working on code that you did not touch, know nothing about, and are in no position to resolve conflicts in. So putting just your own commits together at the master tip totally avoids having to rebase and resolve conflicts through everybody else's work.

Because master is used so commonly, that is what appears here in our example, but it should be easy for developers to adapt this if needed to another branch.

Check out master if you are not already on it, or check out your dev branch if that is the one in need of repair.

git checkout master   # or your dev branch
git fetch  # update origin/master
git show origin/master  # write this sha hash id down as "shaHashIdOnto". We are rebasing onto it.
git rebase origin/master   # this is the magic.

If you get conflicts, you must resolve them. Yes, it is a minor pain, and you think, hey, I already resolved some of these earlier, why do I have to do it again? But rebase is not smart enough to do that for you. We are only doing this because we had no other way to fix the large file issue. Just be glad you do not have to re-do conflicts for other users too. Sometimes you get lucky and the merges are simple.

vi conflicted-file    # resolve conflicts by editing carefully
git add conflicted-file
git rebase --continue

You can use this if something goes horribly wrong:

git rebase --abort

Sometimes the rebase gets stuck on an empty commit where nothing happened or the change was optimized out. Just run this to skip it and proceed.

git rebase --skip

Now all of your commits are together at the tip, and they have not been pushed to master yet of course.
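
The effect of the rebase-to-tip can be checked in a throwaway sandbox. In this sketch a local branch named upstream stands in for origin/master, and commit_file is a hypothetical helper; both are assumptions made only so the example runs without a remote.

```shell
set -e
sandbox=$(mktemp -d)
cd "$sandbox"
git init -q .
git symbolic-ref HEAD refs/heads/master   # make sure the first branch is "master"
git config user.name "Sandbox User"
git config user.email "sandbox@example.com"

commit_file() {   # hypothetical helper: commit_file <file> <message>
    echo "$2" >> "$1"
    git add "$1"
    git commit -q -m "$2"
}

commit_file shared.txt "Shared starting point"
git branch upstream                        # stand-in for origin/master

commit_file mine.txt "My first commit"
git checkout -q upstream
commit_file theirs.txt "Someone else's commit"
git checkout -q master
git merge -q -m "Merge from upstream" upstream   # the earlier "git pull"
commit_file mine.txt "My second commit"

before=$(git rev-list --merges --count upstream..master)
git rebase -q upstream                     # this is the magic
after=$(git rev-list --merges --count upstream..master)
mine=$(git rev-list --count upstream..master)
echo "merge commits before: $before, after: $after, my commits at tip: $mine"
```

In a real repo the same two counts, taken with origin/master..HEAD, give a quick check that only your own commits sit on top of the fetched tip.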


GIT REBASE INTERACTIVE

Look at the history: all your commits are at the tip. Find how far back they go. Then use the sha hash id of the parent of your commits (probably the same as the value in origin/master). The goal here is to focus on which commits contain the bad large files that you do not want. We need to remove the files from those commits so that it is as if they were never added.

git log --stat shaHashIdOnto..HEAD  # confirm these are your commits.

Save the output for use below.
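
As a complement to reading the git log --stat output, the large blobs in a commit range can also be listed directly. This is a sketch: the function name list_large_blobs is hypothetical, and the limit is copied from the push error message.

```shell
MAX_BYTES=2200000   # value taken from the push error message

# Print "size path" for every blob larger than MAX_BYTES that is reachable
# from the given commit range, for example: list_large_blobs shaHashIdOnto..HEAD
list_large_blobs() {
    git rev-list --objects "$1" |
    git cat-file --batch-check='%(objecttype) %(objectsize) %(rest)' |
    awk -v max="$MAX_BYTES" '
        $1 == "blob" && $2 > max {
            size = $2
            sub(/^[^ ]+ [^ ]+ /, "")   # strip type and size, keep the full path
            print size, $0
        }'
}
```

The paths it prints are the ones to look for in the git log --stat output when deciding which commits to delete or edit.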

git rebase -i shaHashIdOnto    # this must be very carefully chosen.

The rebase command is going to pop up a list of commits with the default action "pick". It should contain the full list of all your unpushed commits on this branch.

pick f26dd66 Oops large file
pick ce36c98 Oops large file and other stuff to keep.
pick f772d66 Other good stuff to keep

If you have a line for a commit that is no longer needed, for example, the only thing in that commit was the large file that you are trying to get rid of, then simply delete that line. Then the commit will simply be removed and disappear from the rebase result.

If the commit contains the large file but other stuff you want to keep, change it from "pick" to "edit". The system will stop at that commit, and let you edit it.

In this example we delete the first entry and change the second to "edit", so the list becomes:

edit ce36c98 Oops large file and other stuff to keep.
pick f772d66 Other good stuff to keep

When rebase stops at "Oops large file and other stuff to keep.", remove the offending file from the index.

git rm --cached someLargeFile

Amend the commit. The -C HEAD option tells git to reuse the old commit message.

git commit --amend -C HEAD

Finally, git rebase --continue goes ahead with the rest of the rebase operation.

git rebase --continue

If all else fails, you can run

git rebase --abort
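
The whole interactive sequence can be rehearsed in a throwaway sandbox before trying it on real work. This sketch makes several assumptions: the repo and the files bigA, bigB and good.txt are made up, a tag named shaHashIdOnto stands in for the real hash id, and the todo list is edited by sed through the GIT_SEQUENCE_EDITOR environment variable instead of by hand.

```shell
sandbox=$(mktemp -d)
cd "$sandbox" || exit 1
git init -q .
git symbolic-ref HEAD refs/heads/master   # make sure the first branch is "master"
git config user.name "Sandbox User"
git config user.email "sandbox@example.com"

echo "code" > main.c
git add main.c
git commit -q -m "Base commit"
git tag shaHashIdOnto                      # stand-in for the real hash id

head -c 3000000 /dev/zero > bigA
git add bigA
git commit -q -m "Oops large file"

head -c 3000000 /dev/zero > bigB
echo "keep me" > good.txt
git add bigB good.txt
git commit -q -m "Oops large file and other stuff to keep."

echo "more" >> main.c
git add main.c
git commit -q -m "Other good stuff to keep"

# Scripted version of the manual edit: delete line 1 of the todo list
# and change "pick" to "edit" on line 2.
GIT_SEQUENCE_EDITOR="sed -i.bak -e 1d -e 2s/^pick/edit/" \
    git rebase -q -i shaHashIdOnto

# Rebase has now stopped at the "edit" commit.
git rm -q --cached bigB                    # drop the large file from the index only
GIT_EDITOR=true git commit -q --amend -C HEAD
GIT_EDITOR=true git rebase --continue

remaining=$(git log --oneline shaHashIdOnto..HEAD -- bigA bigB | wc -l | tr -d ' ')
echo "commits still touching the large files: $remaining"
```

Note that git rm --cached leaves the file itself in the working directory as an untracked file, so it is not lost; it is only removed from the rewritten commit.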


FOLLOWUP

Finally without a large file in the branch history, we can push to shared repo. This is the whole reason we did all that work, so that we could do this. (If you repaired a dev branch, you will probably do something else here.)

git push   # of course if others pushed since your last update, you may have to git pull.

If you earlier used git stash to put something aside, you can use it to restore the uncommitted work:

git stash pop   # ONLY if you saved it aside with git stash earlier, and it makes sense.