Undeniably discouraged by my previous romp in creating my own data set, I decided to pull my next one from Kaggle. Kaggle is a website that hosts machine learning competitions. When I'm more experienced I'd like to jump in and try one. For now, I went over to the data set collection they have there. After spending a couple of weeks visiting Italy, I was in the mood for tourism data. My choice was the Tourists Visiting Brazil data set, which contains lots of fun information about tourism to Brazil from 1989 to 2015. I downloaded the CSV file and got to work.
Setting up the database
I created my database, tourist, and a table, touristdata. I added columns for id, continent, country, destState (destination state within Brazil), transport, years, months and ncount. Because year, month and count already have established meanings in SQL, I changed those names slightly.
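For reference, the table looked something like this. This is a sketch, not the exact DDL I ran: the types (text columns, integer count, serial id) are assumptions based on the data described above.

```sql
-- Sketch of the touristdata table; column types are assumptions.
CREATE TABLE touristdata (
    id        serial PRIMARY KEY,
    continent text,
    country   text,
    destState text,      -- destination state within Brazil
    transport text,
    years     integer,   -- renamed from "year" to avoid the SQL keyword
    months    text,      -- renamed from "month" for the same reason
    ncount    integer    -- renamed from "count", which is a function name
);
```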
The data was then copied in, using a comma as the delimiter for the CSV file. I was ecstatic to once again be hit by encoding issues, just like with my French words database. Despite having the exact same set-up (UTF8) as my now-working French database, the special characters would not copy in. The word thrown up in the error was ‘Março’, for March. I would rather have my months in English anyway, so I translated them all over. After translating the months, another encoding error was thrown up. I de-accented the following states:
- Pará to Para
- Paraná to Parana
- São Paulo to Sao Paulo
- Outras Unidades da Federação to Outras Unidades da Federacao
- Ceará to Ceara
- Amapá to Amapa
I’m aware this isn’t the best practice. However, as long as I keep track of the list, then I can change them back when writing up insights.
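The load itself was a one-line psql \copy. A sketch, where the file name and the HEADER flag are illustrative assumptions; and had the accented rows actually loaded, the de-accenting could equally have been done in SQL afterwards, as in the commented example:

```sql
-- Load the cleaned CSV (file name and HEADER flag are illustrative):
\copy touristdata(continent, country, destState, transport, years, months, ncount) FROM 'tourists_clean.csv' WITH (FORMAT csv, HEADER true, DELIMITER ',')

-- If the rows had loaded despite the encoding trouble, each fix
-- would be a simple substitution in the table instead, e.g.:
-- UPDATE touristdata SET destState = 'Parana' WHERE destState = 'Paraná';
```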
I had a chuckle when I noticed that I changed Rio de Janeiro to Rio de January. I fixed that up and successfully copied my data over to PostgreSQL. Time to query, right?
Wasting my time with tables
I made some tables that didn’t end up being useful; however, I still learnt useful things, so it wasn’t really a waste. I wanted to try normalising the data I was given to get experience working with multiple tables. I created two new tables, one for origin and one for time. Origin contained the continent and country of the visitor. To pull out all the continent/country combinations for my new origin table, I used the following query:
```sql
SELECT DISTINCT ON (continent, country) continent, country
FROM touristData
ORDER BY continent, country DESC;
```
When there are large amounts of repeating data, DISTINCT is very useful. I was able to pull out all the countries and their respective continents easily. The countries recorded varied by year, so this ensured I captured them all.
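Populating the new table can then be done in one step. A minimal sketch, assuming the table is simply called origin and inherits the text columns from touristdata:

```sql
-- Build the origin table straight from the distinct pairs (sketch).
CREATE TABLE origin AS
SELECT DISTINCT continent, country
FROM touristdata;
```

As an aside, because every selected column here is also in the DISTINCT ON list, a plain SELECT DISTINCT returns the same rows; DISTINCT ON really earns its keep when you want one whole row per group, such as the most recent entry per country.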
I then made a table for the time-related columns, months and years. I used Excel to create a meaningful id for each combination. Still relying on it as a crutch.
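The time table looked something like this. A sketch only: the exact types, and the fact that the Excel-generated id is an integer, are assumptions.

```sql
-- One row per (year, month) pair, keyed by the Excel-generated id.
CREATE TABLE timeym (
    timeid integer PRIMARY KEY,  -- id built in Excel
    years  integer,
    months text
);
```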
It was around this point that I realised the count was related to nearly all the columns, so the table would need to be kept as it was. Instead, I created a new column called timeym that represented the id I made for my time (years and months) table. To fill the column with the info from the other table I used this:
```sql
UPDATE touristdata
SET timeym = timeym.timeid
FROM timeym
WHERE touristdata.months = timeym.months
  AND touristdata.years = timeym.years;
So I matched the years and months on both tables, and then used SET to make the column equal the id column from the superfluous table.
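Before that UPDATE will run, the column has to exist, and afterwards it is worth checking that every row found a match. A sketch, assuming an integer type for the id:

```sql
-- The column must exist before the UPDATE can fill it
-- (integer is an assumption; it just needs to match timeid's type).
ALTER TABLE touristdata ADD COLUMN timeym integer;

-- After the UPDATE: any rows left NULL had a month/year pair
-- that never matched the timeym table.
SELECT count(*) FROM touristdata WHERE timeym IS NULL;
```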
So my attempt at reducing the original table by spreading the data over multiple tables ended with me adding another column to the original. Smashing success.
In my next post I will start querying the data to try and find some interesting things.