    Hi David,

        As you might expect from the response so far, it is nearly
impossible, in the general case, to determine text/binary format with 100%
reliability.  Most attempts check for magic numbers ( like 0xCAFEBABE for
Java classes ) and values greater than 127 to detect non-ASCII values.
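
    As a rough illustration of those two checks, here is a minimal Java sketch.  The class name TextSniffer, the 4096-byte sample size, and the extra test for control characters are all assumptions added for the example, not anything from a library.  It also assumes ASCII-compatible data: EBCDIC text has ordinary letters above 127, so it would be flagged as binary here.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;

// Heuristic sketch only; it will misclassify some files.
class TextSniffer {
    // Hypothetical sample size: only the start of the file is inspected.
    static final int SAMPLE_SIZE = 4096;

    // True if the sample starts with the Java class-file magic number 0xCAFEBABE.
    static boolean hasJavaClassMagic(byte[] b, int len) {
        return len >= 4
                && (b[0] & 0xFF) == 0xCA && (b[1] & 0xFF) == 0xFE
                && (b[2] & 0xFF) == 0xBA && (b[3] & 0xFF) == 0xBE;
    }

    // True if any byte is above 127, or is a control character other than
    // tab, line feed, or carriage return (the control check is an added assumption).
    static boolean hasNonAsciiOrControl(byte[] b, int len) {
        for (int i = 0; i < len; i++) {
            int v = b[i] & 0xFF;
            if (v > 127) return true;
            if (v < 32 && v != '\t' && v != '\n' && v != '\r') return true;
        }
        return false;
    }

    // Combines both checks over the first len bytes of the sample.
    static boolean looksBinary(byte[] b, int len) {
        return hasJavaClassMagic(b, len) || hasNonAsciiOrControl(b, len);
    }

    // Reads up to SAMPLE_SIZE bytes from the file and applies the checks.
    static boolean looksBinary(Path file) throws IOException {
        byte[] buf = new byte[SAMPLE_SIZE];
        int len = 0;
        try (InputStream in = Files.newInputStream(file)) {
            int n;
            while (len < buf.length
                    && (n = in.read(buf, len, buf.length - len)) > 0) {
                len += n;
            }
        }
        return looksBinary(buf, len);
    }
}
```

    Calling TextSniffer.looksBinary(Paths.get("somefile")) returns true when either check fires.  An empty file comes back as "not binary", which may or may not be what you want.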

    The problem is that with Unicode variations and other encodings, along
with user ( programmer, that is ) generated binary files with no formal
headers, the previous techniques have either a lot of misses or invalid
matches.  The AS/400 has our old friend CCSID 65535 to denote binary files,
unlike other systems, but it's not used consistently, and has ended up being
a barrier rather than an aid.
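
    One way to live with an unreliable 65535 tag is to treat it as a hint and cross-check it against the content.  The sketch below is only one possible policy, under my own assumptions: the names CcsidPolicy, Action, and decide are made up for the example, and the caller is assumed to supply both the file's CCSID and the result of whatever content check it trusts.

```java
// Sketch of a decision policy; not part of any toolkit.
class CcsidPolicy {
    // CCSID 65535 marks data that should not be converted.
    static final int CCSID_BINARY = 65535;

    // Hypothetical outcomes for the copy step.
    enum Action { CONVERT_TEXT, COPY_AS_IS, NEEDS_REVIEW }

    // ccsid: the file's CCSID tag, however the caller obtained it.
    // contentLooksBinary: result of a caller-supplied content check.
    static Action decide(int ccsid, boolean contentLooksBinary) {
        if (ccsid == CCSID_BINARY) {
            // Tag and content agree: copy without translation.
            // Tag says binary but content looks like text: the tag may be wrong.
            return contentLooksBinary ? Action.COPY_AS_IS : Action.NEEDS_REVIEW;
        }
        // Text CCSID but binary-looking content: do not convert blindly.
        return contentLooksBinary ? Action.NEEDS_REVIEW : Action.CONVERT_TEXT;
    }
}
```

    Anything that lands in NEEDS_REVIEW is where a human check or a hand-maintained list would come in.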

    This link:

http://www.gsp.com/cgi-bin/man.cgi?section=1&topic=file

gives an idea of what's involved for the "file" command on Unix type
systems.

    There's a Windows program called TrID - File Identifier that uses an
XML-based database and allows you to add your own entries, which may help.
It's at:

http://mark0.net/soft-trid-e.html

    Also, Mozilla has a module that attempts to determine binary/text.  I'd
guess it's probably used in Firefox as well.  The source is available for
these products in C/C++.  You could possibly use or convert that.  This
page:

http://gemal.dk/blog/2003/12/18/autodetect_correct_mime_type_from_textplain_content/

has links to a full discussion.  It uses the term "Firebird", which was the
earlier name for Firefox.  These can be downloaded at:

http://www.mozilla.org/download.html

    All of these are probably still going to miss or mis-identify some
files.  Depending on your volume, it might make sense to have humans
validate CCSIDs and use those, or possibly you could generate a database of
files and directories for inclusion/exclusion.  HTH,


                                                         Joe Sam

Joe Sam Shirah -        http://www.conceptgo.com
conceptGO       -        Consulting/Development/Outsourcing
Java Filter Forum:       http://www.ibm.com/developerworks/java/
Just the JDBC FAQs: http://www.jguru.com/faq/JDBC
Going International?    http://www.jguru.com/faq/I18N
Que Java400?            http://www.jguru.com/faq/Java400


----- Original Message ----- 
From: "David Gibbs" <david@xxxxxxxxxxxx>
To: "Java Programming on and around the iSeries / AS400"
<java400-l@xxxxxxxxxxxx>
Sent: Tuesday, February 14, 2006 11:45 AM
Subject: Determining if a file is text or binary?


> Folks:
>
> Does anyone know of a technique to determine if a file (in the IFS) is
> text or binary (in java)?
>
> I need to copy various files from the IFS to another server ... these
> files can be in any number of CCSIDs.
>
> If the file is text, I want to write it to the other server in the
> target server's native text format ... if the file is binary, I don't
> want to do any translation.
>
> I can use the toolkit's CharConverter routines to convert the text from
> the file's CCSID to the native text format ... but if it's binary, I
> don't want to use that routine.
>
> Any suggestions?
>
> Thanks!
>
> david
> -- 

