In which I start querying the data.
After setting up my table in my previous post I started querying the data. However, data doesn’t exist in an empty space. It represents something. I aimlessly put in random queries, but what is the point of that. Before moving forward I decided to come up with some questions that I would try and answer. These included:
- What year had the highest total number of visitors? What month?
- Which country/ continent did the highest number of visitors originate from?
- How has this changed over time?
- Which state is visited the most?
- Which is the most popular mode of transport? How has this changed over time?
- How is the choice of transport affected by the originating country?
My questions grew increasingly more complex as I moved on. If there is one thing my Psychology degree instilled in me, it’s a love of interactions. There are so many different variables affecting how a human makes their choice that the statistics for Psychology requires a way to look at that. However, my questions here are still very basic.
Querying basic questions
My simplest questions had me asking about numbers for a single, specific category. These were very easy to query. An example is finding out how many tourists visited in a given year. My code used for this query was:
SELECT years, SUM(ncount) FROM touristdata GROUP BY years ORDER BY SUM(ncount) DESC;
This is an excerpt of the output produced by the query:
Most visitors by year.
We can see a general trend of increased tourism. The same formula of SELECT, FROM, GROUP BY, ORDER BY was used for other similar questions such as which country or continent did the most tourists come from. My timeid column I lamented wasting time on in the previous post was useful. I could easily query which months were overall more popular, to see seasonal trends, as well as which individual months were very popular historically. The three most popular months were during the 2014 FIFA World cup, and a successful summer season following it.
June 2014, January 2015 and December 2014 were the most popular months.
Using window functions for a more involved question
How has mode of transport varied over time? With this question I delved into a question involving two columns interacting to get the final count. My goal was to get a percentage of each. I thought I would easily figure it out. I was wrong. Because
I came at this from a few angles. After reading some posts on Stack Overflow I was making use of the WITH function to create something I could then use in my percentage calculation. This didn’t work, as outlined in my own question I ended up posting on Stack Overflow. I quickly received some help. People are nice, to take the time out of their day to help others who are learning. My savior, Gordon Linoff, had this to say:
For the record, I upvoted him but I’m too new for it to be recorded.
The solution worked. I only had to change “PARTITION BY year” to “PARTITION BY td.years” to get it up and running. I will also be making use of the table alias in the future to save myself some typing. This is how the output looked:
What are window functions?
I’m glad that I asked because I was not at all familiar with window functions. Window functions are described in the PostgreSQL documentation, and I also found this write-up with a simple example useful. Essentially, rather than making a join on what could be a massive set of data you pull out a select ‘window’ of rows which you perform a calculation on. They differ from aggregate functions because of grouping all the rows into a single output, the rows retain their separate identities. It is possible to use all sorts of built-in and user-defined
For next time
It’s nice to have this data, and I can try and pull some deductions from it. However, tables with numbers isn’t the clearest format. What I’m in need of is statistics and graphs. Visual representation of data is where the magic happens. Therefore I’m going to come back after I’ve taught myself some R and made some pretty graphs. Part 3 will present findings from the data.