Hello everyone, this is the fourth index of Dixto, and in this one we will do something a bit more technical: we will preprocess one of the datasets that we downloaded in the previous index.

As usual, here is the video link from YouTube, but this article will be more detailed.

If you want to follow along, get the raw dataset here.

First, we will be using modules that come with Python, namely os and csv.

In this tutorial we will use Jupyter Notebook, which comes with Anaconda. To install Anaconda, visit https://www.anaconda.com/, download the installer and run it. Then open a terminal and type:

jupyter notebook

It will launch the notebook and you will be able to write and debug your app easily.

Firstly, we need to import the packages that we will be using:

import os
import csv

Before coding, it is really important to browse the data and understand its format and structure in order to choose the processing method.

Here is the structure of our folders:

emails => enron1 => ham / spam => (individual files, one email each)

So we need to browse those folders, and for that we will use Python's os module. Its os.walk function lets us go through the folders and get the files inside.
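To see what os.walk returns before pointing it at the real dataset, here is a minimal sketch that builds a tiny throwaway folder tree with the same layout and counts the files per label. The temporary directory, the file names and the helper count_files are made up for illustration; they are not part of the dataset or of the original code.

```python
import os
import tempfile


def count_files(folder):
    """Count the files found under folder, including its subfolders."""
    total = 0
    # os.walk yields (current_dir, subdirectory_names, file_names) per directory
    for _root, _dirs, files in os.walk(folder):
        total += len(files)
    return total


# Build a throwaway tree that mimics emails/enron1/{ham,spam} (made-up files).
base = tempfile.mkdtemp()
for label, n in [("ham", 3), ("spam", 2)]:
    folder = os.path.join(base, "enron1", label)
    os.makedirs(folder)
    for k in range(n):
        with open(os.path.join(folder, f"{k}.txt"), "w") as f:
            f.write("Subject: test\n")

counts = {
    label: count_files(os.path.join(base, "enron1", label))
    for label in ["ham", "spam"]
}
print(counts)
```

Running the same count on the real folders is a quick way to check that the path is right before reading any file.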

We loop over the folder numbers with range(1, 7), which covers enron1 through enron6, and over the list ["ham", "spam"] because each of those folders contains these 2 subfolders. Then we walk into each folder (for example emails/enron1/spam) and read the content of each file. Finally, we clean the text with the string replace method, removing the "Subject:" marker and the newlines, and store it together with a label that is True for spam.

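path = "/media/ltphen/Ulife/LTPhen von Ulife/tools/dataset/emails"
emailsContent = []

# range(1, 7) covers the folders enron1 to enron6
for i in range(1, 7):
    for label in ["ham", "spam"]:
        pathName = os.path.join(path, "enron" + str(i), label)
        # os.walk yields (directory, subdirectories, files) for each directory
        for root, _, files in os.walk(pathName):
            for file in files:
                filePath = os.path.join(root, file)
                # read_file is defined in the next section; run that cell first
                content = read_file(filePath)
                # Drop the "Subject:" marker and newlines; the label is True for spam
                emailsContent.append((content.replace("Subject:", "").replace("\n", ""), label == "spam"))

# Write the CSV once, after all folders have been read
write_in_csv(emailsContent)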
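The cleaning step is easier to check on a single string. Here is a minimal sketch; the sample text and the helper name clean_email are made up for illustration, and only the two replace calls come from the loop above.

```python
def clean_email(text):
    """Drop the 'Subject:' marker and the newlines, using the same two replace calls."""
    return text.replace("Subject:", "").replace("\n", "")


# Made-up sample with a subject line and one body line.
sample = "Subject: meeting\nsee you at 10\n"
cleaned = clean_email(sample)
print(repr(cleaned))  # ' meetingsee you at 10'
```

Note that removing "\n" outright glues the last word of one line to the first word of the next ("meetingsee"). If that matters for your model, replacing "\n" with a space instead is one possible variation.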

The last step is to take the list built in the previous step and write it to a CSV file. We can do that easily with Python's csv module.

First we write the header row of the CSV, then we loop over the results from the previous section and write each one as a row.

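def read_file(path):
    # ISO-8859-1 can decode any byte, so unusual characters do not raise errors
    with open(path, "r", encoding="ISO-8859-1") as f:
        return f.read()

def write_in_csv(content):
    # newline="" is what the csv module recommends to avoid blank lines between rows
    with open("result.csv", "w", newline="", encoding="utf-8") as out:
        csv_file = csv.writer(out)
        csv_file.writerow(["content", "spam"])
        for item in content:
            csv_file.writerow(item)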
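Once the CSV is written, it is worth reading it back to check the result. The sketch below is self-contained, so it first writes a few made-up rows to a temporary file named sample.csv (a hypothetical name) in the same content/spam shape, then reads them back with csv.DictReader and counts the spam rows.

```python
import csv
import os
import tempfile

# Made-up rows in the same (content, spam) shape as the tutorial's output.
rows = [("hello team", False), ("win money now", True), ("cheap pills", True)]

sample_path = os.path.join(tempfile.mkdtemp(), "sample.csv")
with open(sample_path, "w", newline="", encoding="utf-8") as out:
    writer = csv.writer(out)
    writer.writerow(["content", "spam"])
    writer.writerows(rows)

# The csv module stores everything as text, so the label comes back
# as the string "True" or "False", not as a boolean.
with open(sample_path, newline="", encoding="utf-8") as src:
    records = list(csv.DictReader(src))

spam_count = sum(1 for r in records if r["spam"] == "True")
print(len(records), spam_count)
```

To run the same check on the real output, point the reading part at result.csv and keep in mind that the spam column has to be compared with the string "True".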

The complete code can be found on GitHub here:

Thanks for reading and see you in the next index.

September 10, 2022
How to Preprocess a raw spam dataset : Dixto [3]
