Saturday, April 10, 2010

SELECT * FROM spreadsheet (or How to parse a CSV file using PL/SQL)



I recently needed to retrieve/download a comma-separated values (CSV) file from a website, and insert the data in an Oracle database table.

After googling around a bit, I found various pieces of the solution on AskTom, ExpertsExchange and other sites, which I put together in the following generic utility package for CSV files.

Usage


Because I have implemented the main parsing routine as a pipelined function, you can process the data either with straight SQL or in a PL/SQL program.



For example, you can download a CSV file as a CLOB directly from the web and return it as a table with a single statement:


select *
from table(csv_util_pkg.clob_to_csv(httpuritype('http://www.foo.example/bar.csv').getclob()))


And maybe do a direct insert via INSERT .. SELECT:

insert into my_table (first_column, second_column)
select c001, c002
from table(csv_util_pkg.clob_to_csv(httpuritype('http://www.foo.example/bar.csv').getclob()))


You can of course also use SQL to filter the results (although this may affect performance):

select *
from table(csv_util_pkg.clob_to_csv(httpuritype('http://www.foo.example/bar.csv').getclob()))
where c002 = 'Chevy'


Or you can do it in a more procedural fashion, like this:

create table x_dump
(clob_value clob,
 dump_date date default sysdate,
 dump_id number);


declare
  l_clob clob;

  cursor l_cursor
  is
  select csv.*
  from x_dump d, table(csv_util_pkg.clob_to_csv(d.clob_value)) csv
  where d.dump_id = 1;

begin

  l_clob := httpuritype('http://www.foo.example/bar.csv').getclob();
  insert into x_dump (clob_value, dump_id) values (l_clob, 1);
  commit;
  dbms_lob.freetemporary (l_clob);

  for l_rec in l_cursor loop
    dbms_output.put_line ('row ' || l_rec.line_number || ', col 1 = ' || l_rec.c001);
  end loop;

end;



Auxiliary functions


There are a few additional functions in the package that are not necessary for normal usage, but that may be useful if you are doing any sort of lower-level CSV parsing. The csv_to_array function operates on a single CSV-encoded line, so to use it you must split the CSV lines yourself first and feed them to this function one by one (see the splitting sketch after the example below):

declare
  l_array t_str_array;
  l_val varchar2(4000);
begin

  l_array := csv_util_pkg.csv_to_array ('10,SMITH,CLERK,"1200,50"');

  for i in l_array.first .. l_array.last loop
    dbms_output.put_line('value ' || i || ' = ' || l_array(i));
  end loop;

  -- should output SMITH
  l_val := csv_util_pkg.get_array_value(l_array, 2);
  dbms_output.put_line('value = ' || l_val);

  -- should give an error message stating that there is no column called DEPTNO because the array does not contain seven elements
  -- leave the column name out to fail silently and return NULL instead of raising exception
  l_val := csv_util_pkg.get_array_value(l_array, 7, 'DEPTNO');
  dbms_output.put_line('value = ' || l_val);

end;
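

If you need the splitting step itself, here is a minimal sketch (assuming Unix-style LF line endings and data that fits in a varchar2; only csv_util_pkg.csv_to_array is from the package, the rest is plain PL/SQL):

declare
  l_csv   varchar2(32767) := '10,SMITH,CLERK' || chr(10) || '20,ALLEN,SALESMAN';
  l_line  varchar2(32767);
  l_array t_str_array;
  l_start pls_integer := 1;
  l_pos   pls_integer;
begin
  loop
    -- find the next line ending, then cut out one line
    l_pos := instr(l_csv, chr(10), l_start);
    if l_pos = 0 then
      l_line := substr(l_csv, l_start);
    else
      l_line := substr(l_csv, l_start, l_pos - l_start);
    end if;

    -- feed the lines one by one to csv_to_array
    l_array := csv_util_pkg.csv_to_array(l_line);
    dbms_output.put_line('first value = ' || l_array(1));

    exit when l_pos = 0;
    l_start := l_pos + 1;
  end loop;
end;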



Installation


In order to compile the package, you will need these SQL types in your schema:

create type t_str_array as table of varchar2(4000);
/

create type t_csv_line as object (
  line_number  number,
  line_raw     varchar2(4000),
  c001         varchar2(4000),
  c002         varchar2(4000),
  c003         varchar2(4000),
  c004         varchar2(4000),
  c005         varchar2(4000),
  c006         varchar2(4000),
  c007         varchar2(4000),
  c008         varchar2(4000),
  c009         varchar2(4000),
  c010         varchar2(4000),
  c011         varchar2(4000),
  c012         varchar2(4000),
  c013         varchar2(4000),
  c014         varchar2(4000),
  c015         varchar2(4000),
  c016         varchar2(4000),
  c017         varchar2(4000),
  c018         varchar2(4000),
  c019         varchar2(4000),
  c020         varchar2(4000)
);
/

create type t_csv_tab as table of t_csv_line;
/
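
For reference, here is a sketch of the package spec entry points used in this post, inferred from the examples above (parameter names and defaults are assumptions; see the actual source linked below for the authoritative version):

create or replace package csv_util_pkg
as

  -- parse a CSV-encoded clob; each line is piped out as a t_csv_line row
  function clob_to_csv (p_csv_clob in clob,
                        p_separator in varchar2 := ',') return t_csv_tab pipelined;

  -- parse a single CSV-encoded line into an array of values
  function csv_to_array (p_csv_line in varchar2,
                         p_separator in varchar2 := ',') return t_str_array;

  -- get a value from the array; pass a column name to raise a descriptive
  -- error if the position does not exist, or omit it to fail silently
  -- and return null
  function get_array_value (p_values in t_str_array,
                            p_position in number,
                            p_column_name in varchar2 := null) return varchar2;

  -- encode an array of values as a single CSV-formatted line
  function array_to_csv (p_values in t_str_array,
                         p_separator in varchar2 := ',') return varchar2;

end csv_util_pkg;
/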


UPDATE 04.04.2012: The latest version of the package itself (CSV_UTIL_PKG) can be found as part of the Alexandria Utility Library for PL/SQL.


Performance


On my test server (not my laptop), it takes about 35 seconds to process 12,000 rows in CSV format. I don't consider this super-fast, but it is probably fast enough for many CSV-processing scenarios.

If you have any performance-enhancing tips, do let me know!

Bonus: Exporting CSV data


You can also use this package to export CSV data, for example with a query like this:

select csv_util_pkg.array_to_csv (t_str_array(company_id, company_name, company_type)) as the_csv_data
from company
order by company_name

THE_CSV_DATA
--------------------------------
260,Acorn Oil & Gas,EXT
261,Altinex,EXT
262,Amerada Hess,EXT
263,Atlantic Petroleum,EXT
264,Beryl,EXT
265,BG,EXT
266,Bow Valley Energy,EXT
267,BP,EXT

This might come in handy, even in these days of XML and JSON ... :-)

34 comments:

Oracle said...

Really, really helpful.

What we are doing is using a shell script and then SQL*Loader.

I will ask my team members to look at it and then implement.

csv4ever said...

Excellent post!
Can you provide a script showing how to use csv_to_array?

Morten Braten said...

@csv4ever: I have updated the post to include an example of using csv_to_array and get_array_value.

@csv4ever said...

You are an inspiration to beginners like me.
Amazing job Morten!!!

Small and Simple Things said...

Hi!

Really cool stuff!

Can you make the code available as a zip file download?

Thanks!

Dave

Jon Trøstheim said...

Very good work, Morten! You have done many good posts. If you need help with drag & drop or copy & paste of data from Excel into tables in Apex, we have 2 customers who use this and it is very useful. Keep up the good work, Morten!

Dan McGhan said...

Morten,

I remember writing about something that does some of the same things before:
http://www.danielmcghan.us/2009/02/easy-csv-uploads-yes-we-can.html

We should compare notes and test.

Email me at dmcghan@gmail.com if you're interested.

Regards,
Dan

Arave said...

Great post, Morten! I wish I had seen it before my last project. Your code performs faster than mine.
Thanks!
Todd

Unknown said...

Very cool solution. You should strongly consider creating a project for this on samplecode.oracle.com as I can see it growing and evolving as more people start to use it. You could also get others to contribute to the project. For a reference point, I run logger.samplecode.oracle.com

Nice work,
Tyler

Morten Braten said...

@Jon: Thanks!

@Dan: Will do when I get some free time... :-)

@Tyler: Thanks! Will consider packaging the code up somewhere if the codebase starts to develop.

Anonymous said...

Our PL/XLS product supports binary XLS; we implemented an XLS parser in PL/SQL. See plpdf.com.

Rathore said...

Hi,
I am very new to Apex. Could you please tell me how to use your code in Apex?

I installed your package, but I don't know how to call your code in an Apex PL/SQL process.

Could you help, please?

Morten Braten said...

@Rathore: It depends on what you are trying to accomplish.

Do you want to download a CSV file from a URL? If so, see my second code example (insert/select).

Or do you want to allow the user to upload a CSV file via Apex, and then store the data in a database table? If so, see the Apex documentation for information about uploading and downloading (http://download.oracle.com/docs/cd/E14373_01/appdev.32/e13363/up_dn_files.htm#CJAHDJDA). Once you have the CSV data stored in a CLOB column, you can use my third code example (skip lines 18-21 as you will already have your data in a table).
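
For the upload case, a minimal sketch of such a process (the upload table and the :P1_FILE item are assumptions; they depend on your Apex version and page):

declare
  l_blob        blob;
  l_clob        clob;
  l_dest_offset integer := 1;
  l_src_offset  integer := 1;
  l_lang_ctx    integer := dbms_lob.default_lang_ctx;
  l_warning     integer;
begin
  -- fetch the uploaded file (table and item names are assumptions)
  select blob_content
  into l_blob
  from wwv_flow_files
  where name = :P1_FILE;

  -- convert the blob to a clob that clob_to_csv can work on
  dbms_lob.createtemporary (l_clob, true);
  dbms_lob.converttoclob (l_clob, l_blob, dbms_lob.lobmaxsize,
                          l_dest_offset, l_src_offset,
                          dbms_lob.default_csid, l_lang_ctx, l_warning);

  -- store it, then parse it as in the third code example above
  insert into x_dump (clob_value, dump_id) values (l_clob, 1);
  commit;

  dbms_lob.freetemporary (l_clob);
end;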

- Morten

Morten Braten said...

@Rathore: As for running your PL/SQL in Apex, what you need is a Process (of type PL/SQL) that runs when the page is submitted (after validations). You would typically set a Condition to only run the Process when a button is clicked.

- Morten

Sean said...

The results when parsing an odd number of double-quotes seem inconsistent. Sometimes they are included, sometimes not.


SELECT * FROM TABLE(csv_util_pkg.clob_to_csv('a,b,c,"d
a,b,c",d
a,b"c,d
"
,"
",
""'))

Unknown said...

I have used your code example to parse a CSV in Apex and it is going really slow. I have tried multiple things to speed it up: straight SQL like in your first example, a cursor, and bulk collect. Do you have any ideas?

declare

  cursor l_cur is
    select csv.*
    from zfimsmgr.csv_dump d, table(zfimsmgr.csv_util_pkg.clob_to_csv(d.clob_value)) csv
    where d.dump_id = 1;

  type l_cur_tab is table of l_cur%rowtype;
  v_l_cur_tab l_cur_tab;

begin
  open l_cur;
  loop
    fetch l_cur bulk collect into v_l_cur_tab limit 800;
    dbms_output.put_line(v_l_cur_tab.count || ' rows');
    exit when l_cur%notfound;
  end loop;
  close l_cur;
end;

Anonymous said...

Do you have a version optimized for 11g, without all of the dreaded GOTO statements?

Anonymous said...

Can you help me understand this article? My professor asked us to import a CSV file via PL/SQL and gave us this link, pointing out that we should compile the script "setup/types.sql" and the package "ora/csv_util_pkg".
Thanks for your help.

Elis said...

Hi, I need your help.
It takes 1 hour to read a file with 66,000 lines. That seems like a very long time; any ideas to improve the performance?

RichardL said...

Hi. As per the previous post, I'm also experiencing slow extraction times. I have 20,000 records, with an original CSV file of 4.5 MB, and it's taking 8 minutes or so to extract from BLOB to CLOB to an Oracle table. Anyone have any ideas on how to speed the extraction up?

Many thanks

Anonymous said...

Hi

In function csv_util_pkg.clob_to_csv, try to remove the concatenations:
p_csv_clob || l_line_separator ---> p_csv_clob /*|| l_line_separator*/

You will have to manage it differently.

Cheers

Sebastian

Unknown said...

Do you have any idea how to speed up the process of uploading a CSV from Apex, inserting it into a table as a BLOB column, and then inserting the CSV columns from that BLOB into another table?
I have tried 514,924 rows; it took more than 1 hour to completely insert all rows.

Anonymous said...

I'm just wondering: how does this package deal with fields that are enclosed in double quotes (")?

So:
1,"my name is, Richard the wise",01-01-2015

And this nice construct:
1,"my name is,
Richard the wise",01-01-2015

I.e. a CRLF or LF in the field?

Most of the time I find myself using shell scripts again, to use SQL*Loader (as it's very fast) and occasionally sed's regular expression features. Many others use Perl.

Thoughts appreciated!

Morten Braten said...

@Anonymous: The package handles values enclosed in quotes as a single column. The package does not currently handle line endings in the middle of lines, whether they are enclosed in quotes or not.
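
For example, using the data from your first case:

select c001, c002, c003
from table(csv_util_pkg.clob_to_csv('1,"my name is, Richard the wise",01-01-2015'))

-- c002 is returned as: my name is, Richard the wise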

- Morten

Michael said...

Hi there,

I have created a new solution called LOB2Table for this task.

See my project at sourceforge: https://sourceforge.net/projects/lob2table/

It's reliable, handles LFs inside delimited fields, and also supports fixed-width formats.

And last but not least: it's fast. In typical scenarios it parses up to 15,000 rows per second.
In my tests LOB2Table is about 10 to 20 times faster than csv_util_pkg.

For example, a CLOB containing data with 20,000 rows each with 20 columns (in total about 5 MBytes) gets parsed in less than 10 seconds. (On my workstation the exact timing is 2.283 seconds).



Peter Veprek said...

Just a tip for large CSV data import

I was once in a situation where I needed to load a huge amount of data (around 250,000,000 rows) from a billing system into a DB for some calculations. The problem with downloading the file is obvious: it is too large, and performance was not very good.
Instead, I saved the file locally and made it an external table (create a directory object pointing to where the file is, then create table ... organization external ...). The point here is that when you do an insert or select, Oracle reads the file through its external table access driver, which is very fast.
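
A minimal sketch of that setup (directory path, file name, and column list are hypothetical):

-- directory object pointing to where the file was saved (path is hypothetical)
create or replace directory csv_dir as '/data/csv';

-- external table describing the CSV layout (columns are hypothetical)
create table company_ext (
  company_id    number,
  company_name  varchar2(100),
  company_type  varchar2(10)
)
organization external (
  type oracle_loader
  default directory csv_dir
  access parameters (
    records delimited by newline
    fields terminated by ','
    optionally enclosed by '"'
  )
  location ('bar.csv')
);

-- loading is then a plain insert ... select
insert into my_table (first_column, second_column)
select company_id, company_name
from company_ext;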

Clau said...

I want to download this package and I can't find the link in this post. When I go to the Alexandria page it redirects me to GitHub, and GitHub redirects me back to this page.
How can I download this? Thanks in advance!

Morten Braten said...

@Claudia: The link to the Alexandria PL/SQL Utility Library on GitHub is

https://github.com/mortenbra/alexandria-plsql-utils

The CSV_UTIL_PKG specifically is here:

https://github.com/mortenbra/alexandria-plsql-utils/blob/master/ora/csv_util_pkg.pks
https://github.com/mortenbra/alexandria-plsql-utils/blob/master/ora/csv_util_pkg.pkb

- Morten

Drew said...

I see you have to create a type for the CSV parsing functionality, and it only has 20 fields for the column data. Does this mean the application is limited to CSV files with 20 or fewer columns? If my CSV file has 30 columns, can I just add 10 more entries to the t_csv_line type?

Morten Braten said...

@Drew: If you need more columns you must add them to the t_csv_line type, but also amend the code in csv_util_pkg.clob_to_csv (see https://github.com/mortenbra/alexandria-plsql-utils/blob/master/ora/csv_util_pkg.pkb#L232 and https://github.com/mortenbra/alexandria-plsql-utils/blob/master/ora/csv_util_pkg.pkb#L277) to add additional columns.
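
As a sketch, the type change for a 30-column file could look like this (the attribute names follow the existing c0nn pattern; the package body changes linked above are still required):

alter type t_csv_line add attribute (
  c021 varchar2(4000),
  c022 varchar2(4000),
  c023 varchar2(4000),
  c024 varchar2(4000),
  c025 varchar2(4000),
  c026 varchar2(4000),
  c027 varchar2(4000),
  c028 varchar2(4000),
  c029 varchar2(4000),
  c030 varchar2(4000)
) cascade;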

- Morten

Eirikur Emilsson said...

Hi Morten

Do you think this approach could handle 200k rows?

Coco Sanchez said...

Hi, Morten.


Do you have a PL/SQL loader that loads by column name rather than by position?
Thanks in advance.

DG said...

Hi, Morten.

I am running a similar kind of query:

"select csv_util_pkg.array_to_csv (t_str_array(company_id, company_name, company_type)) as the_csv_data
from company
order by company_name"

But I am getting this error:
ORA-06553: PLS-306: wrong number or types of arguments in call to 'ARRAY_TO_CSV'

Please help.