Julien's data blog

BASIC IN-BROWSER TEXT PROCESSING USING COMPROMISE.JS

Though JavaScript might not be as obvious a choice as Python when it comes to Natural Language Processing libraries, its ecosystem features some high-performing text processing packages. This makes perfect sense, as such dependencies are very much needed to build mobile or web-based applications such as chatbots.

Finding the right tool for the job

Over the past few months, I have experimented a bit with the following Node packages:

Mon, Mar 7, 2022

EVALUATE TOPICS COHERENCE WITH PALMETTO

Over the last couple of years, I have been dabbling a bit with Topic modelling, and to this day I still find this niche subset of computational linguistics quite fascinating. Some very interesting research papers have been published over the past decade, and recent advances in natural language processing such as transformers, combined with the development of some new libraries, have certainly brought a welcome breath of fresh air to the field.

Tue, Feb 22, 2022

METHOD CHAINING WITH PANDAS

This article is going to be slightly shorter than what I usually tend to post, but I hope you will enjoy it nonetheless!

I like a good podcast

I was recently looking for some podcasts I could listen to while running, and stumbled upon a series of interesting interviews with Matt Harrison. The name sounded familiar, and I realised that I had purchased a couple of his books and thoroughly enjoyed them.

Sun, Dec 12, 2021

MAKE YOUR WEBSITES PRETTIER WITH MVP.CSS

A couple of weeks ago, as I was browsing through recent comments on Hacker News, I stumbled upon a conversation around minimalistic web design. As my HTML and CSS skills are quite limited, I thought I might take a look at some of the resources that were being shared there and see how they could benefit my own work.

Make your ugly HTML files prettier

So, what is MVP.css? According to its author, Andy Brewer, it is “a minimalist stylesheet for HTML elements”.

Sat, Dec 4, 2021

I LOVE YOU, TWEET-PREPROCESSOR

This is going to be a short article, but I really wanted to share a pretty neat library named tweet-preprocessor that I stumbled upon while reading some random stuff on Hacker News.

The cool bit

I have a confession to make: I have never been able to remember anything about regular expressions. To this day, I still struggle to implement even the most basic character filtering routine. I find myself having to go through the same traumatising process each and every time.

Thu, Nov 18, 2021

SAVE A PANDAS DATAFRAME AS A SQL TABLE USING SQLALCHEMY

There are many reasons why we might want to save tabular data into a SQL database rather than simply outputting Pandas dataframes into .csv files. We might, for instance, have an automated script that runs daily and extracts a certain number of tweets using the Tweepy library. The script could be pulling tweets from different users, or using different hashtags and/or search terms. These results would be best saved into separate tables, gathered within a single database.
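As a rough illustration of the idea above, here is a minimal sketch that writes one table per search term into a single database. The table names, column names and sample rows are made up for the example, and an in-memory SQLite database stands in for whatever database the real script would target; the tweets would normally come from Tweepy rather than being hard-coded.

```python
import pandas as pd
from sqlalchemy import create_engine, text

# Hypothetical sample data: one dataframe per search term.
# In a real script these would be built from the Tweepy results.
tweets_by_search = {
    "tweets_python": pd.DataFrame(
        {"user": ["alice", "bob"], "text": ["I love #python", "pandas tips"]}
    ),
    "tweets_nlp": pd.DataFrame(
        {"user": ["carol"], "text": ["topic modelling is fun #nlp"]}
    ),
}

# Assumption: an in-memory SQLite database, used here only for the demo.
engine = create_engine("sqlite:///:memory:")

row_counts = {}
with engine.begin() as conn:
    # One table per search term, all gathered within the same database.
    # if_exists="append" lets a daily run add rows to an existing table.
    for table_name, df in tweets_by_search.items():
        df.to_sql(table_name, con=conn, if_exists="append", index=False)

    # Read the row counts back to check that the tables were written.
    for table_name in tweets_by_search:
        query = text(f"SELECT COUNT(*) FROM {table_name}")
        row_counts[table_name] = conn.execute(query).scalar()

print(row_counts)
```

Using `if_exists="append"` is only one possible choice here: `"replace"` would overwrite the table on each run, which may or may not be what a daily job needs.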

Sun, Sep 5, 2021

POS TAGGING AND NAMED ENTITIES RECOGNITION USING SPACY

I have often found that one of the easiest and most effective ways to approach short textual data like comments or tweets is to try and discover high-level patterns and visualise them. Topic modelling requires a bit of trial and error, while looking for recurring contextual words might be better suited to larger chunks of unstructured data such as blog articles or novels. So, most of the time, I start by performing the following two tasks:

Thu, Aug 5, 2021

MY FAVORITE PLOTS USING MATPLOTLIB - PART I

Before I start, let me just be clear and say that there are many, many great things that one can do with Matplotlib. And most importantly, similar results can be achieved in multiple different ways. What you will see in this article is by no means any sort of universal truth, and if you like to visualise your data in a different way, then that’s perfectly fine!

This is not a tutorial

The purpose of this article isn’t to go through the fundamentals of Matplotlib and Seaborn, but simply for me to share some of the plotting functions that I have been using the most while working as a data analyst.

Sun, Jul 11, 2021

TOPIC MODELLING - PART 2

What you are about to read is the second part in a series of articles dedicated to Topic modelling. Please note that like the previous one, this article contains snippets of text and code from my dissertation thesis: “Information extraction of short text snippets using Topic Modelling algorithms.” The text that follows was written by me in 2021, and will mainly center around the theoretical aspect of Topic Modelling. Research papers, when cited, are properly referenced.

Tue, Dec 29, 2020

INTRODUCTION TO TF-IDF

Before reading the following article, you should probably take a look at my introduction to Topic Modelling. Please note that this blog post contains snippets of text and code from my dissertation thesis: “Information extraction of short text snippets using Topic Modelling algorithms.” While the next articles will likely be more focused on how to extract meaningful topics from unstructured data as well as on some frameworks that can be used to evaluate our results, this article will be mainly focused on the theoretical aspect.

Tue, Dec 1, 2020