Things to check or fix before the migration

Mailman 2 was more lax about message headers than Mailman 3, and we found several problems in old archives that can hinder the migration.

Wrong date format

We found posts whose dates use GMT+00:00, which is not a valid timezone specification. This error is easily fixed with the following one-liner:

sed -ri 's/\(GMT\+00:00\)/(GMT)/' /var/lib/mailman/archives/private/*.mbox/*.mbox
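Where editing the archives in place with sed is not convenient, the same substitution can be sketched in Python. This is only an illustration, not part of the migration tooling: the name fix_gmt_offset is hypothetical, and like the one-liner it rewrites only the first literal "(GMT+00:00)" on a line.

```python
import re

# Matches the literal "(GMT+00:00)" comment targeted by the sed one-liner.
BAD_TZ = re.compile(r"\(GMT\+00:00\)")


def fix_gmt_offset(line: str) -> str:
    """Replace the first "(GMT+00:00)" on a line with "(GMT)", like sed without /g."""
    return BAD_TZ.sub("(GMT)", line, count=1)


if __name__ == "__main__":
    # Hypothetical header line, for illustration only.
    sample = "Date: Mon, 01 Jan 2007 12:00:00 +0000 (GMT+00:00)"
    print(fix_gmt_offset(sample))
```

Lines that do not contain the bad comment are returned unchanged, so the function can be applied to every line of an mbox file.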

Missing Message-Id

Some messages may lack a Message-Id field entirely, and the original value cannot be recovered. Messages without this field cannot be imported.

Hyperkitty ≥ 1.2 fixes this automatically, but earlier versions need the following workaround.

With this Ruby script (and associated Gemfile) you can generate a fake, unique Message-Id field for each post lacking one. The association is kept in the new_message_ids.yml file, so it is safe to run the script multiple times: the generated values stay stable (useful if you sync the list mbox regularly before the final switch to production). The mbox files are expected in the usual /var/lib/mailman/archives/private/ directory.

Procedure to run the script:

bundle install
bundle exec ./mailman2_archive_fix.rb
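To illustrate the idea behind the workaround, here is a minimal Python sketch. It is not the Ruby script and uses a different technique: instead of recording the association in a YAML file, it derives the identifier from a hash of the message content, which also gives the same value on every run. The function names, the "generated-" prefix and the example.invalid domain are assumptions made for this example.

```python
import hashlib
from email import message_from_string
from email.message import Message


def stable_message_id(msg: Message, domain: str = "example.invalid") -> str:
    """Derive a deterministic Message-Id from the message content."""
    digest = hashlib.sha256(msg.as_bytes()).hexdigest()[:32]
    return f"<generated-{digest}@{domain}>"


def ensure_message_id(msg: Message) -> bool:
    """Add a Message-Id if it is missing or empty; return True when one was added."""
    if msg.get("Message-Id", "").strip():
        return False
    del msg["Message-Id"]  # drop an empty header if present (no-op otherwise)
    msg["Message-Id"] = stable_message_id(msg)
    return True


if __name__ == "__main__":
    # Hypothetical post without a Message-Id field.
    post = message_from_string("From: a@example.com\nSubject: hi\n\nbody\n")
    print(ensure_message_id(post), post["Message-Id"])
```

Because the identifier depends only on the message content, two runs over the same mbox produce the same values, which is the property the YAML file provides in the Ruby script.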

Cleaning the previous search index

If you attempted an import previously, it is recommended to purge the old index first: regenerating the index only adds data to it, and it can take up quite a lot of space.

rm -rf /var/www/mailman/fulltext_index
mkdir /var/www/mailman/fulltext_index
chown mailman_webui: /var/www/mailman/fulltext_index
chmod 0755 /var/www/mailman/fulltext_index

Launching the import process

To loop over each mailing list and simplify the process, it is recommended to use a script made by Fedora folks and installed by the mailman3 role:

/var/www/mailman/bin/ -d <mail-domain> /var/lib/mailman/

If you need to skip some lists during the import, you can provide a comma-separated list of their names using the --exclude option.
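The looping and exclusion logic can be pictured with a short Python sketch. It is not the Fedora script; it only assumes the Mailman 2 layout used earlier on this page (archives/private/<list>.mbox/<list>.mbox), and the name lists_to_import is hypothetical.

```python
from pathlib import Path


def lists_to_import(archive_root, exclude=()):
    """Yield (list_name, mbox_path) for each archived list that is not excluded."""
    skipped = set(exclude)
    for list_dir in sorted(Path(archive_root).glob("*.mbox")):
        name = list_dir.name[: -len(".mbox")]
        mbox_file = list_dir / list_dir.name  # <list>.mbox/<list>.mbox
        if name in skipped or not mbox_file.is_file():
            continue
        yield name, mbox_file
```

Calling lists_to_import("/var/lib/mailman/archives/private", exclude="spam,test".split(",")) would yield one pair per remaining list; the list names in this call are made up for the example.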

Afterwards, the search index needs to be regenerated:

ionice -c3 django-admin update_index --pythonpath /var/www/mailman/config --settings settings_admin

This can take many hours depending on the size of the imported data, but the installation can go to production without waiting for it to complete.

Solutions for migration problems

UnicodeDecodeError: ‘ascii’ codec can’t decode byte 0x?? in position ??: ordinal not in range(??)

This is caused by badly encoded mail headers. So far, in our experience, only spam has produced such broken emails.

On Hyperkitty 1.1.5, it is possible to skip these emails and continue importing the rest of the mailbox using this patch.
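To find such messages before importing, one option is to check whether the header section contains raw non-ASCII bytes. The sketch below is an assumption about what "badly encoded" means here (unencoded 8-bit bytes in headers); it is not the patch mentioned above, and the name has_non_ascii_headers is hypothetical.

```python
import re

# The header section ends at the first blank line (LF or CRLF line endings).
_HEADER_END = re.compile(rb"\r?\n\r?\n")


def has_non_ascii_headers(raw: bytes) -> bool:
    """Return True if the header section of a raw message contains non-ASCII bytes."""
    headers = _HEADER_END.split(raw, maxsplit=1)[0]
    try:
        headers.decode("ascii")
    except UnicodeDecodeError:
        return True
    return False


if __name__ == "__main__":
    # Hypothetical messages: the first has a raw 0xE9 byte in its Subject.
    print(has_non_ascii_headers(b"Subject: caf\xe9\n\nbody\n"))
    print(has_non_ascii_headers(b"Subject: ok\n\nbody \xe9\n"))
```

Only the headers are checked; non-ASCII bytes in the body are ignored because they do not trigger this particular error.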

DataError: invalid byte sequence for encoding “UTF8”:…

This is a variant of the previous problem, but in this case the importer script skips the bad email despite the traceback.

Nevertheless, the previous patch is probably still necessary, as the import script is otherwise likely to stop processing further lists.

RuntimeError: maximum recursion depth exceeded while calling a Python object

Hyperkitty links the posts of every thread so that readers can navigate between them. If a thread is very long (>1000 posts), the program crashes; we found this situation in archives of CI build notifications. It is possible to increase the maximum recursion depth using this patch.
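The mechanism can be shown with a small Python sketch using the standard sys.setrecursionlimit call. This is not the patch itself: the value 5000, the helper names and the toy thread structure are assumptions for illustration. Very large limits can crash the interpreter instead of raising an error, so the limit should stay modest.

```python
import sys


def raise_recursion_limit(minimum: int) -> int:
    """Raise the recursion limit to at least `minimum`; never lower it."""
    if sys.getrecursionlimit() < minimum:
        sys.setrecursionlimit(minimum)
    return sys.getrecursionlimit()


def chain_length(post):
    """Recursively count the posts in a reply chain (one stack frame per post)."""
    return 0 if post is None else 1 + chain_length(post["reply"])


def build_chain(n):
    """Build a toy thread of n posts, each pointing to the next reply."""
    post = None
    for _ in range(n):
        post = {"reply": post}
    return post


if __name__ == "__main__":
    raise_recursion_limit(5000)  # arbitrary value chosen for this example
    print(chain_length(build_chain(1500)))
```

With the default limit of 1000, walking the 1500-post toy thread recursively would fail with a recursion error; after raising the limit it completes.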