Comments on: Copy and Checksum Large Files
REBOL Technologies

Carl Sassenrath, CTO
REBOL Technologies
24-Jun-2006 4:07 GMT

Article #0281

Last week I wrote a short example of how to use checksum ports, and last year I gave an example of how to use the /seek refinement to deal with large files. The code below combines these two concepts in a function that copies a file, even if the file is larger than memory (e.g. MPG, MP3, WAV). It will also compute and return the checksum of the file's data.

This is a robust "commercial quality" file copy function that you can use in your applications. If you find a bug, please let me know and I will correct it here.

REBOL [
    Title: "Copy File with Optional Checksum"
    Author: "Carl Sassenrath"
    License: 'MIT
]

copy-file: func [
    "Copy a file. Return WORD for failure or return optional checksum."
    from [file!]
    dest [file!]
    /sum "checksum the data"
    /local
    data
    path
    ff ; from file port
    tf ; to file port
][
    path: split-path dest

    foreach [block err-word] [
        [make-dir/deep path/1] dir-failed
        [ff: open/binary/read/seek from] read-failed
        [tf: open/binary/write dest] write-failed
        [if sum [sum: open [scheme: 'checksum]]] sum-failed
        [
            while [not tail? ff] [
                ;print index? ff
                data: copy/part ff 100000
                insert tail tf data
                if sum [insert sum data]
                ff: skip ff length? data
            ]
            ;print index? ff
        ] copy-failed
    ][
        if error? try block [
            if port? sum [close sum]
            if tf [close tf]
            if ff [close ff]
            return err-word
        ]
    ]

    data: none
    if sum [
        update sum
        data: copy sum
        close sum
    ]
    close tf
    close ff
    data ; checksum value or none
]

print copy-file/sum %movie.mpg %movie2.mpg
ask "done"

Notes:

  1. The code has only been tested on REBOL 2.6.2. It requires a REBOL version that supports the /seek refinement (Core 2.6).

  2. If you are new to REBOL, note the way the foreach is used to perform error checking for each step and return the appropriate error word for failures.

  3. The make-dir line is correct as written. If you do a source on make-dir you will see that it becomes a no-op if the dir exists. Adding an additional exists? check is not needed.

  4. The "from file" (ff) is opened with /read access. This is done to cause an error if the file cannot be opened. Without it, the file will open as an empty file, even if it does not exist.

  5. The checksum port defaults to the SHA1 (secure hash) algorithm.

  6. The code remembers to close the ports if an error occurs.

  7. File data are copied in chunks of 100000. This number is arbitrary, and you can set it to whatever buffer size you prefer. Smaller numbers may slow the transfer. Larger numbers will require more memory.

  8. Uncomment the print lines if you want to see it working. You could also modify those lines to show a progress bar.
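
Because the function returns either an error word or the optional checksum, a caller has to tell the three outcomes apart. Here is a minimal sketch of one way to do that. The describe-result helper is hypothetical (it is not part of the code above), and the sample values passed to it are made up for illustration:

```rebol
REBOL [Title: "Interpreting copy-file results (illustrative sketch)"]

; describe-result is a hypothetical helper, not part of the article's code.
describe-result: func [
    "Turn a copy-file return value into a message string."
    result [word! binary! none!]
][
    either word? result [
        ; failure: dir-failed, read-failed, write-failed, sum-failed or copy-failed
        rejoin ["copy failed: " result]
    ][
        either binary? result [
            ; success with /sum: the checksum value
            rejoin ["copied, checksum: " mold result]
        ][
            ; success without /sum: copy-file returns none
            "copied, no checksum requested"
        ]
    ]
]

print describe-result 'read-failed     ; sample failure word
print describe-result #{DEADBEEF}      ; made-up checksum value
print describe-result none
```

In a real program you would pass the value returned by copy-file (or copy-file/sum) instead of the sample values.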

Comments:

Dan Lee
24-Jun-2006 12:04
The checksum port and /seek refinements could have a lot of useful applications -- a BitTorrent client comes to mind.

Are there other useful port schemes that haven't been discovered yet? What about a 'compress scheme?

Peter Wood
24-Jun-2006 12:04
:) Will you be adding copy-file to the script library?

Brian Hawley
25-Jun-2006 0:54
It's refreshing to see REBOL code that does error handling well. Nice!

Volker
26-Jun-2006 8:29
But that way of error-handling could use another blog :)

Christophe
26-Jun-2006 8:29
:) Great piece of REBOL programming again :) I like the innovative way 'foreach is used...

I /really/ did need this function for copying large files (GB+) through a bad network.

However, I found a small problem with the function when copying a 2GB+ file locally.

The problem is that the size of the file is reported as a *negative* value (why?), which causes the tail? function to return true, so the copy loop is never entered.

Illustration (with some debug lines added to the function code):

copy-file: func [
    "Copy a file. Return WORD for failure or return optional checksum."
    from [file!]
    dest [file!]
    /sum "checksum the data"
    /local
    data
    path
    ff ; from file port
    tf ; to file port
][
    path: split-path dest

    foreach [block err-word] [
        [?? 1 make-dir/deep path/1] dir-failed
        [?? 2 ff: open/binary/read/seek from] read-failed
        [?? 3 tf: open/binary/write dest] write-failed
        [?? 4 if sum [sum: open [scheme: 'checksum]]] sum-failed
        [
            ?? 5
            ? ff
            print length? ff
            print tail? ff
            while [not tail? ff] [
                ?? 6
                print index? ff
                data: copy/part ff 100000
                insert tail tf data
                if sum [insert sum data]
                ff: skip ff length? data
            ]
            ;print index? ff
        ] copy-failed
    ][
        if error? try block [
            if port? sum [close sum]
            if tf [close tf]
            if ff [close ff]
            return err-word
        ]
    ]

    data: none
    if sum [
        update sum
        data: copy sum
        close sum
    ]
    close tf
    close ff
    data ; checksum value or none
]

print copy-file/sum %/C/catdb.txt %/C/temp/catdb.txt

OUTPUT:

1 2 3 4 5
FF is a port of value: make port! [
    scheme: 'file
    host: none
    port-id: none
    user: none
    pass: none
    target: %catdb.txt
    path: %/C/
    proxy: none
    access: none
    allow: none
    buffer-size: none
    limit: none
    handler: none
    status: 'file
    size: -2025230419
    date: 9-Jun-2006/9:12:19+2:00
    url: none
    sub-port: none
    locals: none
    state: make object! [
        flags: -2147483103
        misc: [3948 -2147483615 "/C/catdb.txt"]
        tail: -2025230419
        num: 0
        with: "^M^/"
        custom: none
        index: 1
        func: 1
        fpos: 0
        inBuffer: none
        outBuffer: none
    ]
    timeout: none
    local-ip: none
    local-service: none
    remote-service: none
    last-remote-service: none
    direction: none
    key: none
    strength: none
    algorithm: none
    block-chaining: none
    init-vector: none
    padding: none
    async-modes: none
    remote-ip: none
    local-port: none
    remote-port: none
    backlog: n
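
The negative size in the dump above is what a byte count between 2 GB and 4 GB looks like when it is stored in a 32-bit signed integer. The sketch below shows that arithmetic. It assumes the real file is smaller than 4 GB, so that the value wrapped around exactly once; the variable names are made up for illustration, and decimal math is used because REBOL 2 integers cannot hold the result:

```rebol
REBOL [Title: "32-bit size wrap-around (illustrative sketch)"]

reported-size: -2025230419      ; size field taken from the port dump

; Assumption: the true size is below 4 GB, so a single 2**32 wrap occurred.
; Decimal math is used because REBOL 2 integers are 32-bit signed.
actual-size: either reported-size < 0 [
    4294967296.0 + reported-size
][
    to decimal! reported-size
]

print ["Reported size:" reported-size]
print ["Likely actual size in bytes:" actual-size]
```

Under that assumption the file is 2269736877 bytes long, a little over 2 GB, which matches the "2GB+" description.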

Brian Hawley
27-Jun-2006 17:42
This code is almost commercial quality - all that is missing is setting file attributes and handling multiple file forks on filesystems like HPFS and NTFS.

MARTIN
10-Nov-2009 20:45:50
Hi, I am new to REBOL. I don't understand the foreach loop used in the sample below. Can someone more knowledgeable explain to me how to read this? It looks quite complicated.

foreach [block err-word] [
    [make-dir/deep path/1] dir-failed
    [ff: open/binary/read/seek from] read-failed
    [tf: open/binary/write dest] write-failed
    [if sum [sum: open [scheme: 'checksum]]] sum-failed
    [
        while [not tail? ff] [
            print index? ff
            data: copy/part ff 100000
            insert tail tf data
            if sum [insert sum data]
            ff: skip ff length? data
        ]
        ;print index? ff
    ] copy-failed
][
    if error? try block [
        if port? sum [close sum]
        if tf [close tf]
        if ff [close ff]
        return err-word
    ]
]

Henrik
11-Nov-2009 0:17:09
Hi, Martin

The foreach loop might look complicated, because it's traversing two elements at a time. There are 3 blocks used as arguments to the foreach. The first block after foreach, the [block err-word] indicates the words that need to be set on traversal.

When foreach encounters a number of words in a block like that, it sets them to the corresponding number of elements in the data block and then jumps that number of elements, before setting new ones.

Thus, something like:

foreach [a b] [1 2 3 4 5 6] [print ["Data:" a b]]

results in:

Data: 1 2
Data: 3 4
Data: 5 6

The second block is the one foreach loops through. If you read it carefully, it consists of a block of data ([make-dir/deep path/1]), then a word (for example dir-failed), then another block of data, and then another word, etc.

In this block, the blocks inside it are actions to perform, and the words inside it are failures to return, but as we know with REBOL, the block is really only harmless data. It only becomes useful "live" code in the action body below.

The third and final block is the action body, where the values of block and err-word gathered in one iteration of the loop are applied:

The try function tries to evaluate the block held in the word block, which in the first iteration is [make-dir/deep path/1].

If there is an error, try returns that error in a way that can be handled by the error? function. If evaluation raises no error, the final value that the block generates is returned, just like normal code.

So if there is an error (error? returns true), all ports are closed and then return err-word returns the word set for err-word, which in this case would be dir-failed.

The code is structured like this because it allows Carl to set up a sequential list of actions to perform and associate each one with an error word to return if the action fails. He also needs to leave both the loop and the copy-file function if an action fails, which is done with return err-word.

To understand the functions better, try to use the ? function on the functions mentioned above. Hope this helps a bit.
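
The same pattern can be tried in miniature without any files or ports. This is only a sketch: run-steps and the step names are hypothetical, and unlike copy-file it returns the word ok when every step succeeds:

```rebol
REBOL [Title: "Step list with error words (illustrative sketch)"]

; run-steps and its step names are hypothetical, for illustration only.
run-steps: func [
    "Run each step in order; return the error word of the first failing step."
    steps [block!] "Pairs of: action block, error word"
][
    foreach [block err-word] steps [
        ; try evaluates the action; on error, leave the function at once
        if error? try block [return err-word]
    ]
    'ok
]

print run-steps [
    [1 + 1]                 add-failed
    [1 / 0]                 divide-failed   ; division by zero raises an error
    [print "never reached"] print-failed
]
```

The second step fails, so the function returns divide-failed and the third step is never evaluated.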

Graham
15-Apr-2010 15:31:09
Please note that for REBOL 2, this only works for files less than 2 GB in size, as it uses 32-bit integers.

JackO
11-Apr-2012 10:11:09
Hi, just saw Graham's comment on 15-Apr-2010 re. the 2 GB size limit. Would this NOT apply if the code was used under R3?


Updated 31-May-2023   -   Copyright Carl Sassenrath   -   WWW.REBOL.COM