บทความอื่นๆ ในหมวดเดียวกัน



  Search Articles


    

Reading Excel spreadsheets using ADO.NET

Article By : 
 gobElmo
Reference : 
 http://blog.lab49.com/archives/196
Read/Comment : 
 1138 / 17

Microsoft ADO.NET provides a handy, if quirky way to access Excel spreadsheets from Windows applications. The idea is to treat spreadsheets like databases, with each worksheet represented as a "table". Worksheets are expected to be in a table-like format with column headings in the first row and rows of data beneath. For example, the following code reads worksheet "foo" from spreadsheet file C:BAR.XLS into a DataTable:

DataTable fooData = new DataTable ();
OleDbConnection dbConnection =
  new OleDbConnection
    (@"Provider=Microsoft.Jet.OLEDB.4.0;"
     + @"Data Source=C:BAR.XLS;"
     + @"Extended Properties=""Excel 8.0;HDR=Yes;""");
dbConnection.Open ();
try
{
    OleDbDataAdapter dbAdapter =
        new OleDbDataAdapter
            ("SELECT * FROM [foo$]", dbConnection);
    dbAdapter.Fill (fooData);
}
finally
{
    dbConnection.Close ();
}


If you want to process the data row by row rather than snarfing it into a DataTable, you can do it this way:

OleDbConnection dbConnection = new OleDbConnection (@"Provider=Microsoft.Jet.OLEDB.4.0;Data Source=C:BAR.XLS;Extended Properties=""Excel 8.0;HDR=Yes;""");
dbConnection.Open ();
try
{
    OleDbCommand dbCommand = new OleDbCommand ("SELECT * FROM [foo$]", dbConnection);
    OleDbDataReader dbReader = dbCommand.ExecuteReader ();

    // Say we are interested only in the columns "YearOfBirth" and "Country":
    int yearOfBirthIndex = dbReader.GetOrdinal ("YearOfBirth");
    int countryIndex = dbReader.GetOrdinal ("Country");

    while (dbReader.Read ())
    {
	string yearOfBirth = dbReader.GetValue (yearOfBirthIndex).ToString ();
	string country = dbReader.GetValue (countryIndex).ToString ();

	// ...
    }
}
finally
{
    dbConnection.Close ();
}

But what if you don’t know the name of the sheet you want to read? As you can see from the examples, the ADO.NET interface requires you to name the worksheet - but in many cases you just want to read the first worksheet regardless of its name. It would be cool if ADO.NET provided a suitable notation like this:

// I WISH THIS WORKED
OleDbCommand dbCommand = new OleDbCommand ("SELECT * FROM [0#]", dbConnection);

… but it doesn’t. You must tell ADO.NET the specific name of the sheet you want to read. The solution is therefore to read the spreadsheet schema to find out the sheet names, thus reducing the second problem to the first one, like this:

OleDbConnection dbConnection = new OleDbConnection (@"Provider=Microsoft.Jet.OLEDB.4.0;Data Source=C:BAR.XLS;Extended Properties=""Excel 8.0;HDR=Yes;""");
dbConnection.Open ();
try
{
    // Get the name of the first worksheet:
    DataTable dbSchema = dbConnection.GetOleDbSchemaTable (OleDbSchemaGuid.Tables, null);
    if (dbSchema == null || dbSchema.Rows.Count < 1)
    {
        throw new Exception ("Error: Could not determine the name of the first worksheet.");
    }
    string firstSheetName = dbSchema.Rows [0] ["TABLE_NAME"].ToString ();

    // Now we have the table name; proceed as before:
    OleDbCommand dbCommand = new OleDbCommand ("SELECT * FROM [" + firstSheetName + "]", dbConnection);
    OleDbDataReader dbReader = dbCommand.ExecuteReader ();

    // And so on...
}
finally
{
    dbConnection.Close ();
}

The main quirk about the ADO.NET interface is how datatypes are handled. (You’ll notice I’ve been carefully avoiding the question of which datatypes are returned when reading the spreadsheet.) Are you ready for this? ADO.NET scans the first 8 rows of data, and based on that guesses the datatype for each column. Then it attempts to coerce all data from that column to that datatype, returning NULL whenever the coercion fails!

I suppose that makes a twisted kind of sense if you have spent a lot of time working with relational databases, but there are some unfortunate consequences of this design which aren’t obvious at first. For example, say your spreadsheet contains the following columns:

YearOfBirth    Country	PostalCode
1964	       USA	10005
1970	       USA	10001
1952	       Canada	K2P1R6
1981	       Canada	L3R3R2
1974	       USA	10013

ADO.NET will correctly guess that the YearOfBirth column is numeric, and that the Country column is of type string. But what about the PostalCode column, which contains a mix of numbers and strings? In this case ADO.NET chooses the type based on the majority of the values (with a tie going to numeric). In this example 3 of the 5 postal codes are numeric, so ADO.NET will declare the column to be numeric. Therefore it will attempt to cast each cell to a number, which will fail for the Canadian postal codes - which will therefore come out as NULL values. Ha ha. Isn’t that fun?

Even more entertaining, there is absolutely no way to make this 100% reliable - although with some pain, you can improve the situation. Here’s what you need to do. First add the "IMEX=1" option to your connection string like this:

OleDbConnection dbConnection = new OleDbConnection (@"Provider=Microsoft.Jet.OLEDB.4.0;Data Source=C:BAR.XLS;Extended Properties=""Excel 8.0;HDR=Yes;IMEX=1;""");

That tells ADO.NET to honor the following registry key when reading the spreadsheet:

Hkey_Local_Machine/Software/Microsoft/Jet/4.0/Engines/Excel/ImportMixedTypes

This registry key tells ADO.NET what to do when it encounters mixed types in the first 8 rows. It can either be set to the string "Majority Type" (for the default behavior) or to "Text" (which forces the column to be of type string). Note that you are still screwed if the first 8 postal codes are numeric and the 9th is Canadian. (Not to mention that the "Text" option invokes handling that fails on strings over 255 characters, but let’s skip that for now.)

There’s also a second relevant registry setting (which is honored regardless of the IMEX option):

Hkey_Local_Machine/Software/Microsoft/Jet/4.0/Engines/Excel/TypeGuessRows

That says how many rows to scan to guess the datatype. The default is 8, but you can set it anywhere from 0-16 decimal (0 meaning "scan the first 16384 rows", and all other values meaning what they say). Putting this all together, the most reliable way to read a US/Canadian postal code is to use the following registry settings:

TypeGuessRows = 0
ImportMixedTypes = Text

That’s pretty close to perfect, although it will still fail if the first 16384 postal codes are numeric and any of the subsequent ones aren’t.

This is a Bad Design for so many reasons I don’t know where to start. First, the behavior of the spreadsheet importer should not depend on global registry settings - that is just inviting mysterious, data-dependent errors whenever other applications or users change those settings. All of those settings should be in the connect string and nowhere else. Second, there should be an option to say "I’m not sure what data is coming, but I want all of it - please coerce everything to something universal like an arbitrary-length string". Third, the interface should be stream-based, not file-based. If you are reading the spreadsheet data from the network, you shouldn’t have to save it to a temporary file in order to parse it. Fourth, you shouldn’t have to read the spreadsheet schema if you just want to select the worksheet by index (e.g. you want to read the first worksheet, whatever it happens to be called).

Comment :  1 7/5/2551 18:05:31

Dont forget

using System.Data.OleDb;

  By :  gobElmo
Comment :  2 2/9/2551 13:50:50
  By :  xWnAnJqJzXG
Comment :  3 27/4/2552 12:31:26

Hi,

If you want to read data from a .xlsx file, you have to use Microsoft.ACE.OLEDB.12.0 as the provider. This is the new Access database engine OLE DB driver and is also capable of reading Excel 2003. Use it to read xlsx (Excel 2007) data

  By :  Chetankumar Akarte
Comment :  4 27/4/2552 12:32:54

Get more info on Read and Display Data From an Excel File (.xsl or .xlsx) in ASP.NET here…http://www.tipsntracks.com

  By :  Chetankumar Akarte
Comment :  5 13/5/2552 5:31:35
  By :  AaHMsbJoiU
Comment :  6 4/9/2552 12:52:09
  By :  xLHRMeMuw
Comment :  7 20/9/2552 21:42:30
  By :  LAhPGuYLNYOnH
Comment :  8 6/10/2552 19:31:42
  By :  SUPtqYeEdSSTmOF
Comment :  9 12/10/2552 8:39:47
  By :  ZgVpUZQEmchQyPvA
Comment :  10 18/10/2552 23:19:31
  By :  OEfWQDRjpKVZIZDhEQW
Comment :  11 27/10/2552 1:06:51
  By :  NRaqXKwIxNcCSghHoR
Comment :  12 25/11/2552 3:52:02
  By :  BGxKAQBWQEyHtC
Comment :  13 2/12/2552 19:42:25
  By :  bwOyFV mueioocjcreu, [url=http
Comment :  14 21/12/2552 10:40:12
  By :  comment2, Doxycin,
Comment :  15 14/1/2553 13:24:25
  By :  NlwZhQAYNi
Comment :  16 27/1/2553 7:04:47
  By :  Very nice site!
Comment :  17 12/2/2553 22:35:00
  By : 

Your comment
 
Your name