WEEK9 — First layer of Categories

Yuan
Oct 21, 2019

I stopped updating my blog for more than two weeks, a period that overlapped almost exactly with the “Dark swamp of despair” region in the graph below.

So, what did I struggle with for the past half month?

How to classify data?

One type of my data is the headlines on the cover of Sanlian Lifeweek, which I never thought I would have so much trouble classifying. But the fact is that “a world” is complicated, even when it is “a world” consisting only of headlines. Reading through 16 years of headlines, I felt I had learned almost every variation of “headline”.

Generally, the syntax of the headlines can be described as “contents & perspectives”.

However, there are still multiple variations in the way of narration on top of this structure, let alone other variations such as language style.

For example, there is a headline, “高尔夫的阶层之变” (11/06/2006), which means “the changing of the class of people who play golf”. It is a very typical instance of trying to narrate a complicated social change from a small perspective. This is even more so in the Chinese version, which emphasizes the change in “golf” at almost the same level as the change in “people”.

Finally, after looking into the article “高尔夫的阶层之变”, I decided to label it with “sports” and “economy”.

However, I thought it was also about people’s changing lifestyles. Therefore, to decide whether I should also label it “personal livelihood”, I needed to look into another layer of “perspective”: its “scope”, macroscopic or microscopic. If the article is written with a small scope, then I might need to add the label “personal livelihood”.
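The scope rule above can be sketched in a few lines of Python. This is only an illustration, not the procedure actually used for this project: the function name `labels_for` and the scope values `"macro"` and `"micro"` are hypothetical, and the sketch turns “might need to add” into a fixed rule, which is a simplifying assumption.

```python
# Hypothetical scope values; the post only speaks of a big or a small scope.
SCOPES = {"macro", "micro"}


def labels_for(themes, scope):
    """Return the final labels for one headline.

    themes: labels already chosen by hand, e.g. ["sports", "economy"].
    scope:  "macro" or "micro" (assumed names for big and small scope).
    """
    if scope not in SCOPES:
        raise ValueError(f"unknown scope: {scope!r}")
    labels = list(themes)
    # Assumption: a small-scope article always gets "personal livelihood".
    if scope == "micro" and "personal livelihood" not in labels:
        labels.append("personal livelihood")
    return labels


print(labels_for(["sports", "economy"], "macro"))
print(labels_for(["sports", "economy"], "micro"))
```

Writing the rule down this way makes it apply identically to every headline, although deciding the scope itself still takes a human reading of the article.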

Thankfully, most articles tended to approach topics with a big scope.

What is more intimidating is that the example above is a relatively easy one in my whole headline classification process.

So, at this point, I think I need to stop checking my classification, just label the headlines with the strongest themes they present, and move forward. There are simply too many headlines and articles whose contents sit between almost all the categories, which is actually a good thing in terms of approaching the facts.

What are the categories of the first layer of data? What is the standard for classifying data?

Here are the 12 categories that I came up with after classification, a process during which I changed them again and again. As in any normal learning process, a new instance always has the potential to change our mental model.

The categories on the left are mainly differentiated by “scope”: they focus on an event, a problem, or a figure (or a group of figures). The categories on the right are differentiated by “property” and approach topics in a more extensive way.

To a great extent, whether they are on the left or the right side, most of the categories overlap with each other when it comes to classification. So my principle is to label each headline only with its dominant categories.

Still, I found it really hard to keep my standard consistent throughout the whole classification process. Therefore, I think I need other methods alongside it whose standard is more concrete, like filtering keywords.
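As a rough illustration of what keyword filtering could look like, here is a minimal Python sketch. The keyword lists are invented for the example (only the category names “sports” and “economy” come from this post), and mapping “阶层” (social class) to “economy” is an assumption made purely so that the golf headline above produces both labels.

```python
# Hypothetical keyword lists; real ones would have to be built by reading
# through the headlines themselves.
KEYWORDS = {
    "sports": ["高尔夫", "足球"],
    "economy": ["阶层", "经济"],
}


def label_by_keywords(headline, keywords=KEYWORDS):
    """Return, in alphabetical order, every category that has at least
    one of its keywords appearing in the headline."""
    return sorted(
        category
        for category, words in keywords.items()
        if any(word in headline for word in words)
    )


print(label_by_keywords("高尔夫的阶层之变"))
```

A filter like this always gives the same answer for the same headline, which is exactly the consistency that manual labeling lacks. The cost is that it only sees the words in the headline, not what the article is actually about.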

Why did I want to do this project?

So, exactly like the mental murmurs in that “EMOTIONAL JOURNEY”, I really needed to remind myself again and again why I initially wanted to do this project. Otherwise, all the words rumbling in my head were “This sucks. I have no idea what I’m doing”, because I know only too well that this classification can’t be very accurate: I’m a human being with biases. Only then did I start to understand why most text analysis projects are based on filtering keywords, even though many keywords in the headlines have nothing to do with the main contents of the articles.

Therefore, for me, one very difficult question is: if the classification can’t be very accurate, then what’s the point of doing this project, when all the subsequent analyses are based on it?

I still can’t find a satisfying answer to it.

I think I’ll look at this experience as a lesson, and move on with what I have.
