As some of you might be aware, there's been a lot of banter on Twitter about BranchCache and its de-duplication (de-dupe) feature. So since my dear old friend Johan Arwidmark asked me to type something up, I thought I would do so, although I can't promise to keep it 10-tweets short. BranchCache is a complex beast, but I will do my best to keep it short. If you have questions later, ping me on Twitter or any other way.
First off, some trivia. Many might not know that the de-dupe feature of BranchCache was the very thing that pushed us to start the company 2Pint Software. It's just that awesome!
Where should we start? Well, Johan has a set of very large .vhd files that he needs to transfer to his student computers whenever he arrives at a new destination to do some of his awesome training. He is then faced with two issues:
- X machines multiplied by Y data = buckets of data, and even on a gigabit network that takes some time. 20 machines with 107GB of data each is 2140GB, roughly 2TB, or 2,191,360MB. At a real-world throughput of about 100MB/s on a gigabit switch, that's 2,191,360 / 100 ≈ 21,914 seconds, which is about 6 hours (quick PowerShell check after this list). A tad bit slow! Hence why he wants to speed things up.
- Zipping and unzipping/de-duplicating the images takes time; wouldn't it be useful if that could be done during the transfer somehow?
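Just to sanity check that napkin math in PowerShell (numbers taken straight from above, nothing fancy):

```powershell
# 20 machines x 107GB each, pushed at ~100MB/s over a gigabit link
$totalMB = 20 * 107 * 1024      # 2,191,360 MB in total
$seconds = $totalMB / 100       # ~21,914 seconds at 100MB/s
$hours   = $seconds / 3600
"{0:N1} hours" -f $hours        # ~6.1 hours
```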
Enter BranchCache - Full De-Dupe Smartness!
Not a lot of people know (or understand) that BranchCache doesn't just do P2P stuff; it also de-dupes the transfer and its own cache storage. This killer feature came with Windows 8, and while most people were bitching and moaning about the new Start menu, some fundamental changes landed in a lot of the networking stack. One of them being BranchCache.
So how do we think Johan should do it? We have a better solution for him, one that's free and part of the OS he already runs, so he already owns this technology.
- Johan starts off by making sure his master VHDs are stored on a Windows Server. He makes sure the BranchCache feature is installed on this server, along with IIS. He then creates a virtual directory that lets him access the files, even over the internet. Locked down with ACLs, of course.
- He then runs the BC export commands in his favorite prompt, PowerShell of course (the server side is sketched after this walkthrough). The trick is to use the HTTP export feature and not the SMB one; more on that over in our FAQ section.
- He imports the exported BC cache data onto his laptop using PowerShell (also sketched below), after setting his client secret key to the same value as the server's (more on why later). The imported data is about the same size as the de-duped files on the server, so about 16GB. If he doesn't have a local cache he cares about, he could do a straight file copy from the server's BC store instead, but that is 1337 stuff! One step at a time.
- He flies over to Seattle with his laptop, taking care not to spill coffee on it.
- He plugs it into the network and kicks off the build/download or whatever on the lab machines. He then starts a BITS transfer of the images from the server back home! The server sends the BranchCache hashes of the files to each client that requests them, and the clients then go out on the local network and ask if any machine has data matching those hashes (client side sketched below).
Johan's laptop pipes up and serves out the cached data. Each client then only transfers the cache data and uses the hashes to re-generate the full 107GB locally. The nice thing here is that there is no separate compression/decompression step, since that happens during the transfer. And since BranchCache is also P2P, the load of the 16GB of cache data gets spread around as it is transferred between the classroom computers. So you get bandwidth-aware, de-duped P2P, the finest form of file transfer there is.
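Here is roughly what the server side looks like in PowerShell. This is a minimal sketch, not a copy-paste recipe: the site name, paths, staging folder and passphrase are all made-up examples, and your IIS and ACL setup will differ.

```powershell
# On the Windows Server holding the master VHDs (names and paths are examples)
Install-WindowsFeature BranchCache, Web-Server

# Expose the VHD folder as a virtual directory (lock it down with ACLs/SSL as needed)
Import-Module WebAdministration
New-WebVirtualDirectory -Site "Default Web Site" -Name "vhds" -PhysicalPath "D:\VHDs"

# Use a known passphrase so the laptop can import/serve matching hashes later
Set-BCSecretKey -Passphrase "SameKeyOnServerAndLaptop"

# Pre-generate BranchCache hashes for the HTTP content and stage the data...
Publish-BCWebContent -Path "D:\VHDs" -StageData -StagingPath "D:\BCStage"

# ...then export the staged cache data as a package Johan can take with him
Export-BCCachePackage -Destination "D:\BCExport" -StagingPath "D:\BCStage"
```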
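And the laptop side, again just a sketch: the package file name and path are examples (use whatever file the export actually produced).

```powershell
# On Johan's laptop (Windows 8 or later), before flying out.
# The secret key has to match the server's, otherwise the hashes won't line up.
Set-BCSecretKey -Passphrase "SameKeyOnServerAndLaptop"

# Run BranchCache in distributed (peer-to-peer) mode so the laptop can serve peers
Enable-BCDistributed

# Import the exported package (copied from the server over the wire or on a USB stick);
# make sure the local cache is big enough to hold the ~16GB of de-duped data
Import-BCCachePackage -Path "D:\BCExport\PeerDistPackage.zip"
```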
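On the classroom machines the whole thing then boils down to a BITS job against the server back home: BITS picks up BranchCache for HTTP downloads when the clients are BranchCache-enabled and the server publishes hashes, so the actual blocks come from Johan's laptop and from the other peers rather than across the WAN. The URL and paths below are hypothetical.

```powershell
# On each classroom client (Windows 8 or later)
Enable-BCDistributed

# Pull the image over BITS from the server back home (example URL);
# the bytes are satisfied from Johan's laptop and the other peers on the LAN
Import-Module BitsTransfer
Start-BitsTransfer -Source "http://trainingserver.example.com/vhds/Lab01.vhd" `
                   -Destination "D:\Labs\Lab01.vhd"

# Sanity check: BranchCache status and how full the local data cache is getting
Get-BCStatus
```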
So in theory the transfer should take about as long as transferring 16GB of data once over the network, as the clients will P2P the shit out of the rest. That's the theory; in practice it will still take some time, but it should be a hell of a lot faster than what he is doing now.
So to reiterate: the 16GB of cache data will be P2P-spread to the other clients, i.e. very fast. The clients then recreate the full 107GB from the hash values plus the 16GB of BC data. Even if Johan updates the images back in Sweden, most of the data will still be recreated from his laptop's cache, as the de-dupe algorithm handles changed files very well. MS has a bunch of patents on this algorithm, and it's wickedly cool.
Pros: fast, P2P, no fiddling!
Cons: requires Windows 8 or later and a bit of IIS server infrastructure (it can be 100% local, but that is very 1337 and requires 2Pint Software tools!).
Next time you build a classroom, Johan, do it 2Pint style! Deal?