French Word Frequency – Part 1

I decided to jump in and start working on my first project. In my last post I said I was going to use Microsoft SQL Server. However, based on some advice I ended downloading and using PostgreSQL, an open source application, because enterprises like things to be free as much as I do. It had a windows installer and my only problem with installation was with
the port it was trying to listen to, but a quick google solved that.

Step 1: Getting my data
For my data I decided I wanted something relevant to me. I am currently living in France, and learning French is a big priority for me. However there is a lot of vocabulary to learn. One dictionary, Le Grand Robert, has 100,000 words and 350,000 definitions. Considering this,  I decided to focus on learning the most important words first. To me, the most important words are those used most frequently. I know in English, that a small portion of my vocabulary makes up a large portion of what I use to communicate. So I found a helpful .pdf that contains the 5,000 most frequently used words derived from 23 million spoken and written words. The PDF file repeats the 5000 words several times in different formats. This section seemed easy enough to work with and had the information I wanted:

pdf.PNG
Extract from PDF file.

The word, the class of word, the translation and its frequency rank are included. The information would have been more helpful in a clean CSV or something like that. However, I’m here to learn and nothing helps learning more than a challenge.

Step 2: Making my data usable

I opened up python, and google and got to work. I played around with about 5 entries copied from the large list so I didn’t have to wait for it to process each time. At first all I was thing about was lists and sublists and I was playing around with that. But then I took a moment to reflect and think about how the data would actually be imported. I realised that lists were not the way to go and all I needed was a way to recognise when the new entries started. My final solution was to create a new list and then add a semicolon after each number, I would be able to use the semicolon as a delimiter to specify each new entry.

When I was happy with what I had written I put the 5000 entries into a .txt file, imported them in and spit them out into a new text file with my added delimiters. I imported into excel, separated using the delimiters and rearranged the columns and rows. I did a fair amount of cleaning of the data. Artifacts from copying from the PDF were removed. The ranks and french words were not correctly lining up. The original file did include a few random semicolons when I thought that it did not. I replaced them with spaces in the original file. The entries correctly lined up after they were removed.  This is an example of how the files looked in notepad before and after:

txtText file improvements, thanks to Python.

Step 3: Turning my data into a PostgreSQL database

Now to move on from Python to PostgreSQL. I encountered more problems than I expected creating the database.

Step3.1: Learn how to use PostgreSQL.

The two main applications that came with my download were SQL Shell (psql) and pgAdmin4, the first is the command line and looks almost intimidating in its simplicity.

psql
SQL Shell.

The second program is the GUI management tool. I did not find it very intuitive. That might be due to my lack of experience but I had to google where certain options were when normally I can just click my way around a new program.

Most of my learning curve wasn’t in working out the layout, it was in learning about servers, and users, and ports, and connections. I do not have a background on any of these topics. I fiddled around enough that I can comfortably create new databases, tables, and users, through the command line or through pgAdmin 4. Permissions are still a bit of a mystery and I have only worked off of my local server so far. I was able to set up my table and import my data, but then I hit a wall.

pgadmin.PNGIt took me a long time to find where the tables were hidden.

Step 3.2: Maybe I shouldn’t have used French, but this shouldn’t be so difficult?

The proper heading for step 3.2 is ‘Learning how to encode different languages into my database.’ This is the stage where  I got truly stuck. I had problems in the other steps, but I felt like I was making progress as I worked on them and they seemed more reasonable to me. Encoding means that when I first used COPY to import my CSV database my lovely french words looked like this:

french-accent-troublesFun fact: ‘mot’ is the french word for ‘word’.

Fantastic. Why? Why are you showing me the copyright symbol? The correct symbols show up when I’m online, they’re there in the excel file I’m importing from, what’s the problem? It’s something I’ve never really had to think about.
So, I learnt about character sets and encoding. I got directed to this support page about 100 times. I decided that UTF-8 was the one for me so I looked up how to create databases with specific encoding, altered example scripts and… it wasn’t working. I was getting errors thrown at me. Often the errors were from me trying to specify the LC_COLLATE and LC_CTYPE. These elements change how the characters are indexed and sorted. Many of the errors were stemming from using the LINUX format. My question here is why is that even a difference that exists? I decided to simplify. I would set my LC_COLLATE and LC_CTYPE to ‘C’, a sort of default that would cause problems, but I just wanted accents over my ‘a’s.
Tried again, still not working. I reinstalled with ‘C’ as my default. Tried to set it as UTF-8 still not working. The script I was running should have worked according to everything I read. It just didn’t. I still had copyright and capital Spanish A all through my database.

I started to regret my project choice a bit. I’m a sucker for that sunken cost fallacy though, so I wasn’t going to give up. I decided to take a break for a couple days. I worked on my french by copying out endless verb tables instead of my basic vocab. I started looking for a nice clean dataset for my next database so I could focus on querying. As I went to , I decided to open up my saved script and try again in yet another new database. It worked. I don’t know why. But now my letters looked like this:

french-solved
It’s so nice to see you à.

So I solved that problem. Somehow. Thanks to my reading I knew I had another one coming. My sorting thanks to ‘C’. I ran:
SELECT frequency, mot FROM frenchWords ORDER BY mot ASC ;

sortThat black line represents well over 4000 other entries.

Besides the cleaning needed in the first two rows, you’ll notice that ‘à’ is not the first entry. That’s because my string sort order and character classification were set to ‘C’. I need to specify a locale that will rank the characters in the way we wold expect.

Step 4: Query! Oh wait,no, there’s more cleaning

One of my main desires was to query the class types. I’m terrible with remembering word genders. I wanted to find out which words were masculine nouns and which were feminine nouns and common patterns in those groups. I decided to bring up one exception to see how it was displayed in the table. Le tour Eiffel is the Eiffel tower, masculine tour. La tour de France is the bicycle tour through the beautiful French countryside, feminine tour.

SELECT * FROM frenchwords WHERE mot = ‘tour’;

tourColumn headings are simply suggestions.

So my word class types aren’t all in the class column. Either I’ll need to split the  row in to two rows, but then I’ll double up on the frequency rank which is acting as my primary key and needs to be unique. Also that rank is for the combination of the two forms so it doesn’t seem right to split it up. Or, I can create another column like I have split the translations up. A mostly blank column apart from the few rare cases.

I think I’m going to take a break from this data set for a little while. I want to work on my querying. I will come back though.

 

 

Advertisements

One thought on “French Word Frequency – Part 1

  1. Pingback: Tourists Visiting Brazil – Part 1 – Sarah Rowley

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s