kde-playground/akonadi/server/tests/enron_email_dataset
2015-09-23 09:37:02 +00:00
..
add_attachments.php akonadi: import 1.13.0 2015-09-23 09:37:02 +00:00
courier_to_kmail_maildirformat.php akonadi: import 1.13.0 2015-09-23 09:37:02 +00:00
create_vcards.php akonadi: import 1.13.0 2015-09-23 09:37:02 +00:00
extract_contacts.php akonadi: import 1.13.0 2015-09-23 09:37:02 +00:00
README akonadi: import 1.13.0 2015-09-23 09:37:02 +00:00
run.sh akonadi: import 1.13.0 2015-09-23 09:37:02 +00:00
to_valid_maildir.php akonadi: import 1.13.0 2015-09-23 09:37:02 +00:00

These scripts are designed to make use of the Enron email dataset (see http://www.cs.cmu.edu/~enron/). They try to retrieve contact information from the messages and create an address book (VCARD format) from them.

To start the whole process, just run './run.sh'. It needs wget, tar, gzip, cat, sort, uniq and PHP.

Following is a list of the scripts:
- extract_contacts.php
  Reads the mail files, analyzes their addressee information (names, company names, email addresses) and outputs it.

  Takes about 4 minutes on a Pentium 4 2.4 GHz to read 128326 mail files.

- create_vcards.php
	Reads the output from extract_contacts.php and creates a VCARD address book from it. Additional information, like telephone numbers, birthdays, photographs, is randomly generated and added to the VCARDs.

  Takes about 20 seconds on a Pentium 4 2.4 GHz to create 32072 VCARDS.

- add_attachments.php
  Randomly adds fake attachments to the email messages, in order to create a more realistic dataset.

(c) 2007 Robert Zwerus <arzie@dds.nl>