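First, text files in different encodings can be told apart by their leading bytes, known as the byte order mark (BOM). The markers are:
ANSI (a legacy code page such as GBK): no marker;
Unicode (UTF-16 little endian): the first two bytes are FF FE;
Unicode big endian (UTF-16 big endian): the first two bytes are FE FF;
UTF-8: the first three bytes are EF BB BF (the BOM is optional, so many UTF-8 files carry no marker at all).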
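As an illustration of the marker list above, here is a minimal sketch that maps the leading bytes of a buffer to a charset. The class name `BomSniffer` and the method `detect` are hypothetical names chosen for this example, and the sketch assumes the caller has already read the first few bytes of the file. It returns `null` when no BOM is present, because that case alone cannot distinguish ANSI from BOM-less UTF-8.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration only; not part of the original article.
class BomSniffer {

    /** Returns the charset announced by a BOM, or null when the data has no BOM. */
    static Charset detect(byte[] head) {
        // UTF-8 BOM is three bytes: EF BB BF
        if (head.length >= 3
                && (head[0] & 0xFF) == 0xEF
                && (head[1] & 0xFF) == 0xBB
                && (head[2] & 0xFF) == 0xBF) {
            return StandardCharsets.UTF_8;
        }
        if (head.length >= 2) {
            int b0 = head[0] & 0xFF;
            int b1 = head[1] & 0xFF;
            if (b0 == 0xFF && b1 == 0xFE) return StandardCharsets.UTF_16LE; // FF FE
            if (b0 == 0xFE && b1 == 0xFF) return StandardCharsets.UTF_16BE; // FE FF
        }
        // No BOM: could be ANSI (e.g. GBK) or UTF-8 without a marker.
        return null;
    }

    public static void main(String[] args) {
        byte[] utf8 = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, 'h', 'i'};
        System.out.println(detect(utf8)); // prints UTF-8
    }
}
```

Masking each byte with `0xFF` avoids the sign-extension pitfall of comparing a Java `byte` directly with an `int` literal such as `0xEF`.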
Once the differences between these markers are known, the detection code is straightforward:
    import java.io.BufferedInputStream;
    import java.io.File;
    import java.io.FileInputStream;
    import java.io.IOException;

    // Place this method inside any class.
    public static String get_charset(File file) {
        String charset = "GBK"; // default when neither a BOM nor UTF-8 evidence is found
        byte[] first3Bytes = new byte[3];
        // try-with-resources closes the stream even if an exception is thrown
        try (BufferedInputStream bis = new BufferedInputStream(new FileInputStream(file))) {
            boolean checked = false;
            bis.mark(3); // allow reset() after reading up to 3 bytes
            int read = bis.read(first3Bytes, 0, 3);
            if (read == -1) return charset; // empty file
            if (first3Bytes[0] == (byte) 0xFF && first3Bytes[1] == (byte) 0xFE) {
                charset = "UTF-16LE";
                checked = true;
            } else if (first3Bytes[0] == (byte) 0xFE && first3Bytes[1] == (byte) 0xFF) {
                charset = "UTF-16BE";
                checked = true;
            } else if (first3Bytes[0] == (byte) 0xEF && first3Bytes[1] == (byte) 0xBB
                    && first3Bytes[2] == (byte) 0xBF) {
                charset = "UTF-8";
                checked = true;
            }
            bis.reset(); // rewind to the start of the file
            if (!checked) {
                // No BOM: scan for the first byte sequence that looks like UTF-8.
                while ((read = bis.read()) != -1) {
                    if (read >= 0xF0) break;
                    // A lone byte in 0x80-0xBF cannot start a UTF-8 sequence: treat as GBK.
                    if (0x80 <= read && read <= 0xBF) break;
                    if (0xC0 <= read && read <= 0xDF) {
                        read = bis.read();
                        // The two-byte pattern (0xC0-0xDF)(0x80-0xBF) can also
                        // occur in GBK, so keep scanning.
                        if (0x80 <= read && read <= 0xBF) continue;
                        else break;
                    } else if (0xE0 <= read && read <= 0xEF) {
                        // This can misfire too, but the probability is small.
                        read = bis.read();
                        if (0x80 <= read && read <= 0xBF) {
                            read = bis.read();
                            if (0x80 <= read && read <= 0xBF) {
                                charset = "UTF-8";
                            }
                        }
                        break;
                    }
                }
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
        return charset;
    }

Source: http://ajava.org/code/I18N/14816.html
Original link: https://www.cnblogs.com/xyzlmn/archive/2010/01/02/3168318.html
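The byte scan in `get_charset` stops at the first three-byte sequence that looks like UTF-8, which its own comment admits can occasionally be wrong. As a stricter alternative (a swapped-in technique, not the original author's method), the sketch below decodes the whole content with a UTF-8 decoder configured to report errors and falls back to GBK if decoding fails. The class name `StrictUtf8Check` and both method names are hypothetical, and the sketch assumes the file is small enough to hold in memory as a byte array.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration only; not part of the original article.
class StrictUtf8Check {

    /** True if the bytes decode as UTF-8 without any malformed sequence. */
    static boolean isValidUtf8(byte[] data) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(data));
            return true;
        } catch (CharacterCodingException e) {
            return false; // at least one byte sequence is not valid UTF-8
        }
    }

    /** Assumption carried over from the article: anything that is not UTF-8 is GBK. */
    static String guessCharset(byte[] data) {
        return isValidUtf8(data) ? "UTF-8" : "GBK";
    }
}
```

Note one difference from `get_charset`: pure ASCII content is reported as UTF-8 here rather than GBK. Both charsets decode ASCII bytes to the same characters, so the choice does not affect the decoded text. The bytes could be obtained with `java.nio.file.Files.readAllBytes(file.toPath())`.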