<?xml version='1.0' encoding='utf-8'?>
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<title>Pro Git - professional version control</title>
<meta content="http://www.w3.org/1999/xhtml; charset=utf-8" http-equiv="Content-Type"/>
<link href="stylesheet.css" type="text/css" rel="stylesheet"/>
<style type="text/css"> @page { margin-bottom: 5.000000pt; margin-top: 5.000000pt; }</style>
</head>
<body class="calibre">
<h2 class="calibre4" id="calibre_pb_72">Git Objects</h2>
Git is a content-addressable filesystem. Great. What does that mean? It means that at the core of Git is a simple key-value data store. You can insert any kind of content into it, and it will give you back a key that you can use to retrieve the content again at any time. To demonstrate, you can use the plumbing command <code class="calibre10">hash-object</code>, which takes some data, stores it in your <code class="calibre10">.git</code> directory, and gives you back the key the data is stored as. First, you initialize a new Git repository and verify that there is nothing in the <code class="calibre10">objects</code> directory:
<pre class="calibre9"><code class="calibre10">$ mkdir test
$ cd test
$ git init
Initialized empty Git repository in /tmp/test/.git/
$ find .git/objects
.git/objects
.git/objects/info
.git/objects/pack
$ find .git/objects -type f
$
</code></pre>
Git has initialized the <code class="calibre10">objects</code> directory and created <code class="calibre10">pack</code> and <code class="calibre10">info</code> subdirectories in it, but there are no regular files. Now, store some text in your Git database:
<pre class="calibre9"><code class="calibre10">$ echo 'test content' | git hash-object -w --stdin
d670460b4b4aece5915caf5c68d12f560a9fe3e4
</code></pre>
The <code class="calibre10">-w</code> tells <code class="calibre10">hash-object</code> to store the object; otherwise, the command simply tells you what the key would be. 
<code class="calibre10">--stdin</code> tells the command to read the content from stdin; if you don't specify this, <code class="calibre10">hash-object</code> expects the path to a file. The output from the command is a 40-character checksum hash. This is the SHA-1 hash - a checksum of the content you're storing plus a header, which you'll learn about in a bit. Now you can see how Git has stored your data:
<pre class="calibre9"><code class="calibre10">$ find .git/objects -type f
.git/objects/d6/70460b4b4aece5915caf5c68d12f560a9fe3e4
</code></pre>
You can see a file in the <code class="calibre10">objects</code> directory. This is how Git stores the content initially - as a single file per piece of content, named with the SHA-1 checksum of the content and its header. The subdirectory is named with the first 2 characters of the SHA, and the filename is the remaining 38 characters. You can pull the content back out of Git with the <code class="calibre10">cat-file</code> command. This command is sort of a Swiss army knife for inspecting Git objects. Passing <code class="calibre10">-p</code> to it instructs the <code class="calibre10">cat-file</code> command to figure out the type of content and display it nicely for you:
<pre class="calibre9"><code class="calibre10">$ git cat-file -p d670460b4b4aece5915caf5c68d12f560a9fe3e4
test content
</code></pre>
Now, you can add content to Git and pull it back out again. You can also do this with content in files. For example, you can do some simple version control on a file. 
First, create a new file and save its contents in your database:
<pre class="calibre9"><code class="calibre10">$ echo 'version 1' > test.txt
$ git hash-object -w test.txt
83baae61804e65cc73a7201a7252750c76066a30
</code></pre>
Then, write some new content to the file, and save it again:
<pre class="calibre9"><code class="calibre10">$ echo 'version 2' > test.txt
$ git hash-object -w test.txt
1f7a7a472abf3dd9643fd615f6da379c4acb3e3a
</code></pre>
Your database contains the two new versions of the file as well as the first content you stored there:
<pre class="calibre9"><code class="calibre10">$ find .git/objects -type f
.git/objects/1f/7a7a472abf3dd9643fd615f6da379c4acb3e3a
.git/objects/83/baae61804e65cc73a7201a7252750c76066a30
.git/objects/d6/70460b4b4aece5915caf5c68d12f560a9fe3e4
</code></pre>
Now you can revert the file back to the first version
<pre class="calibre9"><code class="calibre10">$ git cat-file -p 83baae61804e65cc73a7201a7252750c76066a30 > test.txt
$ cat test.txt
version 1
</code></pre>
or the second version:
<pre class="calibre9"><code class="calibre10">$ git cat-file -p 1f7a7a472abf3dd9643fd615f6da379c4acb3e3a > test.txt
$ cat test.txt
version 2
</code></pre>
But remembering the SHA-1 key for each version of your file isn't practical; plus, you aren't storing the filename in your system - just the content. This object type is called a blob. You can have Git tell you the object type of any object in Git, given its SHA-1 key, with <code class="calibre10">cat-file -t</code>:
<pre class="calibre9"><code class="calibre10">$ git cat-file -t 1f7a7a472abf3dd9643fd615f6da379c4acb3e3a
blob
</code></pre>
<h3 class="calibre5">Tree Objects</h3>
The next type you'll look at is the tree object, which solves the problem of storing the filename and also allows you to store a group of files together. Git stores content in a manner similar to a UNIX filesystem, but a bit simplified. 
All the content is stored as tree and blob objects, with trees corresponding to UNIX directory entries and blobs corresponding more or less to inodes or file contents. A single tree object contains one or more tree entries, each of which contains a SHA-1 pointer to a blob or subtree with its associated mode, type, and filename. For example, the most recent tree in the simplegit project may look something like this:
<pre class="calibre9"><code class="calibre10">$ git cat-file -p master^{tree}
100644 blob a906cb2a4a904a152e80877d4088654daad0c859      README
100644 blob 8f94139338f9404f26296befa88755fc2598c289      Rakefile
040000 tree 99f1a6d12cb4b6f19c8655fca46c3ecf317074e0      lib
</code></pre>
The <code class="calibre10">master^{tree}</code> syntax specifies the tree object that is pointed to by the last commit on your <code class="calibre10">master</code> branch. Notice that the <code class="calibre10">lib</code> subdirectory isn't a blob but a pointer to another tree:
<pre class="calibre9"><code class="calibre10">$ git cat-file -p 99f1a6d12cb4b6f19c8655fca46c3ecf317074e0
100644 blob 47c6340d6459e05787f644c2447d2595f5d3a54b      simplegit.rb
</code></pre>
Conceptually, the data that Git is storing is something like Figure 9-1.
<img src="18333fig0901-tn.png" alt="Figure 9-1. Simple version of the Git data model." title="Figure 9-1. Simple version of the Git data model." class="calibre6"/>
You can create your own tree. Git normally creates a tree by taking the state of your staging area or index and writing a tree object from it. So, to create a tree object, you first have to set up an index by staging some files. To create an index with a single entry - the first version of your test.txt file - you can use the plumbing command <code class="calibre10">update-index</code>. You use this command to artificially add the earlier version of the test.txt file to a new staging area. 
You must pass it the <code class="calibre10">--add</code> option because the file doesn't yet exist in your staging area (you don't even have a staging area set up yet) and <code class="calibre10">--cacheinfo</code> because the file you're adding isn't in your directory but is in your database. Then, you specify the mode, SHA-1, and filename:
<pre class="calibre9"><code class="calibre10">$ git update-index --add --cacheinfo 100644 \
  83baae61804e65cc73a7201a7252750c76066a30 test.txt
</code></pre>
In this case, you're specifying a mode of <code class="calibre10">100644</code>, which means it's a normal file. Other options are <code class="calibre10">100755</code>, which means it's an executable file; and <code class="calibre10">120000</code>, which specifies a symbolic link. The mode is taken from normal UNIX modes but is much less flexible - these three modes are the only ones that are valid for files (blobs) in Git (although other modes are used for directories and submodules). Now, you can use the <code class="calibre10">write-tree</code> command to write the staging area out to a tree object. 
No <code class="calibre10">-w</code> option is needed - calling <code class="calibre10">write-tree</code> automatically creates a tree object from the state of the index if that tree doesn't yet exist:
<pre class="calibre9"><code class="calibre10">$ git write-tree
d8329fc1cc938780ffdd9f94e0d364e0ea74f579
$ git cat-file -p d8329fc1cc938780ffdd9f94e0d364e0ea74f579
100644 blob 83baae61804e65cc73a7201a7252750c76066a30      test.txt
</code></pre>
You can also verify that this is a tree object:
<pre class="calibre9"><code class="calibre10">$ git cat-file -t d8329fc1cc938780ffdd9f94e0d364e0ea74f579
tree
</code></pre>
You'll now create a new tree with the second version of test.txt and a new file as well:
<pre class="calibre9"><code class="calibre10">$ echo 'new file' > new.txt
$ git update-index test.txt
$ git update-index --add new.txt
</code></pre>
Your staging area now has the new version of test.txt as well as the new file new.txt. Write out that tree (recording the state of the staging area or index to a tree object) and see what it looks like:
<pre class="calibre9"><code class="calibre10">$ git write-tree
0155eb4229851634a0f03eb265b69f5a2d56f341
$ git cat-file -p 0155eb4229851634a0f03eb265b69f5a2d56f341
100644 blob fa49b077972391ad58037050f2a75f74e3671e92      new.txt
100644 blob 1f7a7a472abf3dd9643fd615f6da379c4acb3e3a      test.txt
</code></pre>
Notice that this tree has both file entries and also that the test.txt SHA is the "version 2" SHA from earlier (<code class="calibre10">1f7a7a</code>). Just for fun, you'll add the first tree as a subdirectory into this one. You can read trees into your staging area by calling <code class="calibre10">read-tree</code>. 
In this case, you can read an existing tree into your staging area as a subtree by using the <code class="calibre10">--prefix</code> option to <code class="calibre10">read-tree</code>:
<pre class="calibre9"><code class="calibre10">$ git read-tree --prefix=bak d8329fc1cc938780ffdd9f94e0d364e0ea74f579
$ git write-tree
3c4e9cd789d88d8d89c1073707c3585e41b0e614
$ git cat-file -p 3c4e9cd789d88d8d89c1073707c3585e41b0e614
040000 tree d8329fc1cc938780ffdd9f94e0d364e0ea74f579      bak
100644 blob fa49b077972391ad58037050f2a75f74e3671e92      new.txt
100644 blob 1f7a7a472abf3dd9643fd615f6da379c4acb3e3a      test.txt
</code></pre>
If you created a working directory from the new tree you just wrote, you would get the two files in the top level of the working directory and a subdirectory named <code class="calibre10">bak</code> that contained the first version of the test.txt file. You can think of the data that Git contains for these structures as being like Figure 9-2.
<img src="18333fig0902-tn.png" alt="Figure 9-2. The content structure of your current Git data." title="Figure 9-2. The content structure of your current Git data." class="calibre6"/>
<h3 class="calibre5">Commit Objects</h3>
You have three trees that specify the different snapshots of your project that you want to track, but the earlier problem remains: you must remember all three SHA-1 values in order to recall the snapshots. You also don't have any information about who saved the snapshots, when they were saved, or why they were saved. This is the basic information that the commit object stores for you. To create a commit object, you call <code class="calibre10">commit-tree</code> and specify a single tree SHA-1 and which commit objects, if any, directly preceded it. 
Start with the first tree you wrote:
<pre class="calibre9"><code class="calibre10">$ echo 'first commit' | git commit-tree d8329f
fdf4fc3344e67ab068f836878b6c4951e3b15f3d
</code></pre>
Now you can look at your new commit object with <code class="calibre10">cat-file</code>:
<pre class="calibre9"><code class="calibre10">$ git cat-file -p fdf4fc3
tree d8329fc1cc938780ffdd9f94e0d364e0ea74f579
author Scott Chacon <schacon@gmail.com> 1243040974 -0700
committer Scott Chacon <schacon@gmail.com> 1243040974 -0700

first commit
</code></pre>
The format for a commit object is simple: it specifies the top-level tree for the snapshot of the project at that point; the author/committer information pulled from your <code class="calibre10">user.name</code> and <code class="calibre10">user.email</code> configuration settings, with the current timestamp; a blank line, and then the commit message. Next, you'll write the other two commit objects, each referencing the commit that came directly before it:
<pre class="calibre9"><code class="calibre10">$ echo 'second commit' | git commit-tree 0155eb -p fdf4fc3
cac0cab538b970a37ea1e769cbbde608743bc96d
$ echo 'third commit'  | git commit-tree 3c4e9c -p cac0cab
1a410efbd13591db07496601ebc7a059dd55cfe9
</code></pre>
Each of the three commit objects points to one of the three snapshot trees you created. 
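The commit format described above can be sketched in Ruby, the language this chapter uses for its own storage example. This is a minimal illustration under stated assumptions, not Git's implementation: <code class="calibre10">build_commit</code> is a hypothetical helper, the identity string is copied from the <code class="calibre10">cat-file</code> output above, and the header (type, space, size in bytes, null byte) follows the scheme explained in the Object Storage section of this chapter.

```ruby
require 'digest/sha1'

# Hypothetical helper (not part of Git): assemble the text of a commit object
# from a tree SHA-1, a list of parent SHA-1s, an identity line, and a message.
def build_commit(tree, parents, identity, message)
  lines = ["tree #{tree}"]
  parents.each { |parent| lines << "parent #{parent}" }
  lines << "author #{identity}"
  lines << "committer #{identity}"
  # A blank line separates the header fields from the commit message.
  lines.join("\n") + "\n\n" + message + "\n"
end

# Values copied from the `git cat-file -p fdf4fc3` output shown above.
identity = 'Scott Chacon <schacon@gmail.com> 1243040974 -0700'
commit = build_commit('d8329fc1cc938780ffdd9f94e0d364e0ea74f579', [],
                      identity, 'first commit')

# Git keys an object by hashing "<type> <size in bytes>\0" plus the content.
store = "commit #{commit.bytesize}\0" + commit
puts commit
puts Digest::SHA1.hexdigest(store)
```

If the assembled text matches what Git stored byte for byte, the printed digest should equal the commit's key; any difference in name, email, timestamp, or message produces a different SHA-1.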
Oddly enough, you have a real Git history now that you can view with the <code class="calibre10">git log</code> command, if you run it on the last commit SHA-1:
<pre class="calibre9"><code class="calibre10">$ git log --stat 1a410e
commit 1a410efbd13591db07496601ebc7a059dd55cfe9
Author: Scott Chacon <schacon@gmail.com>
Date:   Fri May 22 18:15:24 2009 -0700

    third commit

 bak/test.txt |    1 +
 1 files changed, 1 insertions(+), 0 deletions(-)

commit cac0cab538b970a37ea1e769cbbde608743bc96d
Author: Scott Chacon <schacon@gmail.com>
Date:   Fri May 22 18:14:29 2009 -0700

    second commit

 new.txt  |    1 +
 test.txt |    2 +-
 2 files changed, 2 insertions(+), 1 deletions(-)

commit fdf4fc3344e67ab068f836878b6c4951e3b15f3d
Author: Scott Chacon <schacon@gmail.com>
Date:   Fri May 22 18:09:34 2009 -0700

    first commit

 test.txt |    1 +
 1 files changed, 1 insertions(+), 0 deletions(-)
</code></pre>
Amazing. You've just done the low-level operations to build up a Git history without using any of the front ends. This is essentially what Git does when you run the <code class="calibre10">git add</code> and <code class="calibre10">git commit</code> commands - it stores blobs for the files that have changed, updates the index, writes out trees, and writes commit objects that reference the top-level trees and the commits that came immediately before them. These three main Git objects - the blob, the tree, and the commit - are initially stored as separate files in your <code class="calibre10">.git/objects</code> directory. 
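To make "stored as separate files" concrete, here is a rough Ruby sketch that writes one loose object and reads it back. It assumes the layout covered in the Object Storage section of this chapter (a zlib-compressed header plus content, filed under the first two characters of the SHA-1). The helpers <code class="calibre10">loose_object_path</code> and <code class="calibre10">read_loose_object</code> are hypothetical names, and the demo uses a scratch directory rather than a real <code class="calibre10">.git/objects</code>.

```ruby
require 'zlib'
require 'digest/sha1'
require 'fileutils'
require 'tmpdir'

# Hypothetical helper: map a 40-character SHA-1 to its loose-object path
# (first 2 characters = subdirectory, remaining 38 = filename).
def loose_object_path(objects_dir, sha1)
  File.join(objects_dir, sha1[0, 2], sha1[2, 38])
end

# Hypothetical helper: inflate a loose object and return [type, size, content].
def read_loose_object(objects_dir, sha1)
  raw = Zlib::Inflate.inflate(File.binread(loose_object_path(objects_dir, sha1)))
  header, content = raw.split("\0", 2)   # the header ends at the first null byte
  type, size = header.split(' ', 2)
  [type, size.to_i, content]
end

# Demo in a scratch directory, so no real repository is touched.
Dir.mktmpdir do |objects_dir|
  content = "test content\n"
  store = "blob #{content.bytesize}\0" + content
  sha1 = Digest::SHA1.hexdigest(store)
  path = loose_object_path(objects_dir, sha1)
  FileUtils.mkdir_p(File.dirname(path))
  File.binwrite(path, Zlib::Deflate.deflate(store))
  p read_loose_object(objects_dir, sha1)
end
```

Pointing <code class="calibre10">read_loose_object</code> at a real <code class="calibre10">.git/objects</code> directory should work the same way for objects that have not been packed; packed objects live in a different format that this sketch does not handle.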
Here are all the objects in the example directory now, commented with what they store:
<pre class="calibre9"><code class="calibre10">$ find .git/objects -type f
.git/objects/01/55eb4229851634a0f03eb265b69f5a2d56f341 # tree 2
.git/objects/1a/410efbd13591db07496601ebc7a059dd55cfe9 # commit 3
.git/objects/1f/7a7a472abf3dd9643fd615f6da379c4acb3e3a # test.txt v2
.git/objects/3c/4e9cd789d88d8d89c1073707c3585e41b0e614 # tree 3
.git/objects/83/baae61804e65cc73a7201a7252750c76066a30 # test.txt v1
.git/objects/ca/c0cab538b970a37ea1e769cbbde608743bc96d # commit 2
.git/objects/d6/70460b4b4aece5915caf5c68d12f560a9fe3e4 # 'test content'
.git/objects/d8/329fc1cc938780ffdd9f94e0d364e0ea74f579 # tree 1
.git/objects/fa/49b077972391ad58037050f2a75f74e3671e92 # new.txt
.git/objects/fd/f4fc3344e67ab068f836878b6c4951e3b15f3d # commit 1
</code></pre>
If you follow all the internal pointers, you get an object graph something like Figure 9-3.
<img src="18333fig0903-tn.png" alt="Figure 9-3. All the objects in your Git directory." title="Figure 9-3. All the objects in your Git directory." class="calibre6"/>
<h3 class="calibre5">Object Storage</h3>
I mentioned earlier that a header is stored with the content. Let's take a minute to look at how Git stores its objects. You'll see how to store a blob object - in this case, the string "what is up, doc?" - interactively in the Ruby scripting language. You can start up interactive Ruby mode with the <code class="calibre10">irb</code> command:
<pre class="calibre9"><code class="calibre10">$ irb
>> content = "what is up, doc?"
=> "what is up, doc?"
</code></pre>
Git constructs a header that starts with the type of the object, in this case a blob. 
Then, it adds a space followed by the size of the content in bytes and finally a null byte. The example uses <code class="calibre10">bytesize</code> rather than <code class="calibre10">length</code> because the byte count differs from the character count for multi-byte text:
<pre class="calibre9"><code class="calibre10">>> header = "blob #{content.bytesize}\0"
=> "blob 16\000"
</code></pre>
Git concatenates the header and the original content and then calculates the SHA-1 checksum of that new content. You can calculate the SHA-1 value of a string in Ruby by including the SHA1 digest library with the <code class="calibre10">require</code> command and then calling <code class="calibre10">Digest::SHA1.hexdigest()</code> with the string:
<pre class="calibre9"><code class="calibre10">>> store = header + content
=> "blob 16\000what is up, doc?"
>> require 'digest/sha1'
=> true
>> sha1 = Digest::SHA1.hexdigest(store)
=> "bd9dbf5aae1a3862dd1526723246b20206e5fc37"
</code></pre>
Git compresses the new content with zlib, which you can do in Ruby with the zlib library. First, you need to require the library and then run <code class="calibre10">Zlib::Deflate.deflate()</code> on the content:
<pre class="calibre9"><code class="calibre10">>> require 'zlib'
=> true
>> zlib_content = Zlib::Deflate.deflate(store)
=> "x\234K\312\311OR04c(\317H,Q\310,V(-\320QH\311O\266\a\000_\034\a\235"
</code></pre>
Finally, you'll write your zlib-deflated content to an object on disk. You'll determine the path of the object you want to write out (the first two characters of the SHA-1 value being the subdirectory name, and the last 38 characters being the filename within that directory). In Ruby, you can use the <code class="calibre10">FileUtils.mkdir_p()</code> function to create the subdirectory if it doesn't exist. 
Then, open the file in binary mode with <code class="calibre10">File.open()</code> and write out the previously zlib-compressed content to the file with a <code class="calibre10">write()</code> call on the resulting file handle:
<pre class="calibre9"><code class="calibre10">>> path = '.git/objects/' + sha1[0,2] + '/' + sha1[2,38]
=> ".git/objects/bd/9dbf5aae1a3862dd1526723246b20206e5fc37"
>> require 'fileutils'
=> true
>> FileUtils.mkdir_p(File.dirname(path))
=> ".git/objects/bd"
>> File.open(path, 'wb') { |f| f.write zlib_content }
=> 32
</code></pre>
That's it - you've created a valid Git blob object. All Git objects are stored the same way, just with different types - instead of the string blob, the header will begin with commit or tree. Also, although the blob content can be nearly anything, the commit and tree content are very specifically formatted.
</body>
</html>