Data generation with TPC-H’s dbgen for load testing


2011-06-26 update:

I am not sure if there are any changes in the latest make and gcc packages. Anyway, I noticed when run make, I encountered the message below:

make: g: Command not found
make: [qgen] Error 127 (ignored)

To fix this, find where gcc is at, then created a symbolic link g that points to gcc. All is well afterwards:
[root@ip-10-245-209-196 dbgen]# which gcc
/usr/bin/gcc
[root@ip-10-245-209-196 dbgen]# cd /usr/bin/
[root@ip-10-245-209-196 bin]# ln -s gcc g

End update
Recently I found myself doing some data loading benchmark testing with table partition. Data loading and storing for BI/DW/DSS stuff almost always involves data partitioning. SQL Server partition has a nice feature called partition switch, where you can swap data in and out of a partitioned table. In MySQL, it is called exchange, but it hasn’t been released. Perhaps other RDBMS have similar things as well.

To facilitate testing, I need some data that lends itself easily for partition, which led me to TPC-H’s dbgen tool.

dbgen is pretty easy to build and compile on Linux. It also has a Windows version. I didn’t bother trying because getting it built on Linux was such a quick and easy process. However, the documentation of dbgen, if it can be called as such, leaves a lot to be desired. But by experimenting and surfing around, I was able to get what I needed.

Instructions here are for CentOS. Building it on other flavors of Linux should also be easy.

1. yum install make and yum install gcc if you don’t have that already;

2. wget source code;

3. tar xf tarball;

4. cd dbgen

5. vi makefile.suite and change 4 lines. See uncommented lines below:
[sourcecode language=”text”]
################
## CHANGE NAME OF ANSI COMPILER HERE
################
CC = gcc
# Current values for DATABASE are: INFORMIX, DB2, TDAT (Teradata)
# SQLSERVER, SYBASE, ORACLE
# Current values for MACHINE are: ATT, DOS, HP, IBM, ICL, MVS,
# SGI, SUN, U2200, VMS, LINUX, WIN32
# Current values for WORKLOAD are: TPCH
DATABASE= SQLSERVER
MACHINE = LINUX
WORKLOAD = TPCH
[/sourcecode]

Note: It does not have MySQL and others for DATABASE setting, but that really doesn’t matter, because the file generated, which is delimited by a pipe, |, can be used for loading into any system.

6. make -f makefile.suite
This creats the dbgen binary that we can use in the next step.

7. ./dbgen -h gives you a brief description of switches that can be used. I am just interested in creating the lineitem table, because it has a good combination of integer, decimal, date, and character fields. The date inside the tables spans between 1992-01-01 and 1998-12-31, which is good for a partitioned table based on date. The DDL for this table is inside the dss.ddl file in the same directory. Please modify it as necessary. In fact, Lubor Kollar has it ready for SQL Server here, and Vadim Tkachenko has it for MySQL here.

Here is the command I used and its results:

[sourcecode language=”text”]
[root@centos dbgen]# ./dbgen -v -T L -s 2
TPC-H Population Generator (Version 2.14.0)
Copyright Transaction Processing Performance Council 1994 – 2010
Generating data for lineitem table/
Preloading text … 100%
Do you want to overwrite ./lineitem.tbl ? [Y/N]: Y
done.
[/sourcecode]

-v: verbose, -T L: lineitem only, -s 2: scale factor of 2. My understanding is that it roughly indicates that the end file will be close to 2 gig.

It turns out the end file is about 1.5 gig, with close to 12 million rows in it.

[sourcecode language=”text”]
-rw-r–r– 1 root root 1.5G Mar 30 21:33 lineitem.tbl
[root@centos dbgen]# wc -l lineitem.tbl
11997996 lineitem.tbl
[/sourcecode]

You can play around with different key switches to get the size of file you want. For example, quoting from Lubor Kollar:

dbgen –T L -s 4 -C 3 -S 1

Using the –s option, we set the scale to 4 that generates a Lineitem table of 3GB. Using the –C option we split the table into 3 portions, and then using the –S option we chose only the first 1GB portion of the Lineitem table.

Running the command above, I got a file named lineitem.tbl.1 with close to 8 million rows, about 976 meg.

When trying to load it into SQL Server with SSIS bulk copying task, remember the delimiter is | and line separator is LF (line feed). For MySQL, something like this should do:

[sourcecode language=”text”]
load data local infile ‘/root/dbgen/lineitem.tbl.1’ into table lineitem fields terminated by ‘|’ lines terminated by ‘\n’;
[/sourcecode]

, , , ,

2 responses to “Data generation with TPC-H’s dbgen for load testing”

  1. Thank you for this article, it was very useful for me to build the dbgen.

Leave a Reply

Your email address will not be published. Required fields are marked *

This site uses Akismet to reduce spam. Learn how your comment data is processed.