Detecting a File's Encoding

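First, text files saved in different encodings can be told apart by a signature (the byte order mark, or BOM) stored in the first few bytes of the file. The signatures are:

ANSI: no signature;

Unicode (UTF-16 little endian): the first two bytes are FF FE;

Unicode big endian (UTF-16 big endian): the first two bytes are FE FF;

UTF-8: the first three bytes are EF BB BF. The signature is optional, so many UTF-8 files have none.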
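As a small illustration of the signatures listed above, the following sketch checks them against an in-memory byte array. The class name `BomSniffer` and the method names `charsetFromBom` and `startsWith` are hypothetical and do not come from the original article; the sketch only recognizes the three signatures in the list and returns null for everything else.

```java
final class BomSniffer {

    /**
     * Returns the charset name implied by a leading signature,
     * or null when none of the three signatures matches.
     */
    static String charsetFromBom(byte[] head) {
        if (startsWith(head, 0xEF, 0xBB, 0xBF)) return "UTF-8";
        if (startsWith(head, 0xFF, 0xFE)) return "UTF-16LE";
        if (startsWith(head, 0xFE, 0xFF)) return "UTF-16BE";
        return null; // no signature: ANSI, or UTF-8 saved without a BOM
    }

    /** True when data begins with the given unsigned byte values. */
    private static boolean startsWith(byte[] data, int... signature) {
        if (data.length < signature.length) return false;
        for (int i = 0; i < signature.length; i++) {
            // Mask to 0..255 because Java bytes are signed.
            if ((data[i] & 0xFF) != signature[i]) return false;
        }
        return true;
    }
}
```

A null result does not mean the file is not UTF-8. It only means no signature is present, which is why detection code usually falls back to inspecting the content.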

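Once you know how the signatures differ, the detection code is straightforward. Because files without a signature are common, the code falls back to scanning the content for UTF-8 byte patterns and otherwise assumes GBK.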

    import java.io.BufferedInputStream;
    import java.io.File;
    import java.io.FileInputStream;
    import java.io.IOException;

    // Place this method inside any class.
    public static String get_charset(File file) {
        String charset = "GBK"; // default when nothing else is recognized
        byte[] first3Bytes = new byte[3];
        // try-with-resources closes the stream even when reading fails
        try (BufferedInputStream bis = new BufferedInputStream(new FileInputStream(file))) {
            boolean checked = false;
            bis.mark(first3Bytes.length); // allow reset() after reading the signature
            int read = bis.read(first3Bytes, 0, 3);
            if (read == -1) return charset; // empty file
            if (first3Bytes[0] == (byte) 0xFF && first3Bytes[1] == (byte) 0xFE) {
                charset = "UTF-16LE";
                checked = true;
            } else if (first3Bytes[0] == (byte) 0xFE && first3Bytes[1] == (byte) 0xFF) {
                charset = "UTF-16BE";
                checked = true;
            } else if (first3Bytes[0] == (byte) 0xEF && first3Bytes[1] == (byte) 0xBB
                    && first3Bytes[2] == (byte) 0xBF) {
                charset = "UTF-8";
                checked = true;
            }
            bis.reset(); // rewind so the scan below starts at the first byte
            if (!checked) {
                // No signature: look for the first byte sequence that settles the question.
                while ((read = bis.read()) != -1) {
                    if (read >= 0xF0) break;
                    // A lone byte in 0x80-0xBF is not a valid UTF-8 start, so treat it as GBK.
                    if (0x80 <= read && read <= 0xBF) break;
                    if (0xC0 <= read && read <= 0xDF) {
                        read = bis.read();
                        // Two-byte pattern (0xC0-0xDF)(0x80-0xBF) may also occur in GB
                        // encodings, so it is inconclusive: keep scanning.
                        if (0x80 <= read && read <= 0xBF) continue;
                        else break;
                    } else if (0xE0 <= read && read <= 0xEF) {
                        // Three-byte pattern: can still be wrong, but the odds are small.
                        read = bis.read();
                        if (0x80 <= read && read <= 0xBF) {
                            read = bis.read();
                            if (0x80 <= read && read <= 0xBF) {
                                charset = "UTF-8";
                            }
                        }
                        break;
                    }
                }
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
        return charset;
    }

From: http://ajava.org/code/I18N/14816.html
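The content scan above is a heuristic: it stops at the first three-byte pattern that looks like UTF-8, so it can misjudge some files. As a stricter alternative (a different technique from the one in the article), the sketch below asks the JDK's own UTF-8 decoder to validate a whole byte array. The class name `Utf8Check` and the method name `isValidUtf8` are hypothetical; the decoder calls are the standard `java.nio.charset` API.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

final class Utf8Check {

    /** Returns true when every byte of data forms well-formed UTF-8. */
    static boolean isValidUtf8(byte[] data) {
        // REPORT makes the decoder throw instead of silently
        // substituting a replacement character for bad input.
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            decoder.decode(ByteBuffer.wrap(data));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }
}
```

Pure ASCII input passes this check too, since ASCII is valid in both UTF-8 and GBK, so a true result for such data does not distinguish the two. The check also requires the bytes to be in memory, which may not suit very large files.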

Original link: https://www.cnblogs.com/xyzlmn/archive/2010/01/02/3168318.html
