Usually, I first write a spider to grab the raw code of articles from some online source. The exact code depends on the news source, but the method is fairly standard.
If the source is a relatively rarely updated blog or magazine, all of its archive can be downloaded at once. The spider starts with the archives and grabs all the raw html into a database. You can harvest the yield of several years in just a few minutes this way.
If the source is a frequently updated news site, I usually start the grabber on an RSS feed and let it revisit and grab the new articles. (I don't bother with the archives.) This way the grabber needs to operate for a few days or weeks to get a decent amount of text (a few million characters). I let it check back once every hour so as not to overwhelm the site.
Once you've coded a spider, only some minor changes are needed to adapt it to other sources. This adaptation usually involves:
- the source url (is it an RSS feed, a latest-news page or an archive page?)
- the link format (a mask for preg_match_all)
- the mask of the main text of the articles (for preg_match)
- masks of common text blocks to be removed (related articles block, author's block etc.)
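A minimal sketch of such a per-source configuration; the source name, URL and patterns below are made-up placeholders, not taken from any real site:

```php
<?php
// Hypothetical per-source configuration: everything that changes between
// sources lives in one array, so the spider itself can stay generic.
$sources = [
    'examplenews' => [
        // RSS feed, latest-news page or archive page
        'url'          => 'https://example.com/rss',
        // mask for preg_match_all to collect article links
        'link_mask'    => '/<link>(https:\/\/example\.com\/article\/[^<]+)<\/link>/',
        // mask for preg_match to cut out the main text of an article
        'article_mask' => '/<div class="article-body">(.*?)<\/div>/s',
        // masks of constant blocks to strip from the cutout
        'remove_masks' => [
            '/<div class="related">.*?<\/div>/s',   // related articles block
            '/<div class="author">.*?<\/div>/s',    // author's block
        ],
    ],
];
```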
The links to the articles are fetched from the RSS/archive and opened one by one (file_get_contents or curl). The main text area is cut from the html. This will change from site to site, but look for where the actual article starts and ends, and try to come up with a pattern for preg_match. Cleaning up the cutout may need further adaptation: identify constant blocks of text which need to be removed. I usually remove the related articles block, the author's block etc. These parts are not typed, and leaving them in the database would skew the statistics if our aim is keyboard layout optimization. The next step is a general cleaning: removal of html tags, code blocks etc.
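The cut-and-strip step can be sketched roughly like this (the function name and the patterns in the usage example are hypothetical; in real use the html would come from file_get_contents or curl):

```php
<?php
// Sketch: cut the main text area out of an article's html with preg_match,
// then strip the constant blocks (related articles, author's block...).
function extract_article(string $html, string $article_mask, array $remove_masks): ?string {
    if (!preg_match($article_mask, $html, $m)) {
        return null;                        // pattern didn't match: flag for manual check
    }
    $text = $m[1];
    foreach ($remove_masks as $mask) {      // remove the untyped constant blocks
        $text = preg_replace($mask, '', $text);
    }
    return $text;
}

// In real use: $html = file_get_contents($url);
$html = '<html><div class="article-body">Hello <div class="related">x</div>world</div></html>';
$text = extract_article(
    $html,
    '/<div class="article-body">(.*?)<\/div><\/html>/s',
    ['/<div class="related">.*?<\/div>/s']
);
// $text is now 'Hello world'
```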
During grabbing, I store the raw html, but also try to generate the cleaned text immediately. To achieve this, I execute a function with the following steps:
- Replace some tags (p, br, h1, h2, li etc.) with linebreaks (to properly count Enters). This can change from site to site: some sources use br tags for line structure, others use p tags.
- Identify and replace some constant texts in the articles - publishers tend to use premade info chunks, which aren't really typed, so I like to exclude these from my statistics.
- Remove script tags along with everything between them.
- Remove every html tag.
- Replace multiple consecutive spaces with one space.
- Replace multiple consecutive linebreaks with one linebreak.
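The steps above can be sketched as one function. This is a simplified version under assumptions of my own: the tag list, the sample "constant chunk" string and the style-tag handling are illustrative, and a real cleaner would be tuned per site:

```php
<?php
// Simplified cleaner following the steps above.
function clean_article(string $html): string {
    // structural tags become linebreaks so Enter presses get counted
    $text = preg_replace('/<\/?(p|br|h1|h2|li)\b[^>]*>/i', "\n", $html);
    // hypothetical example of a premade chunk publishers reuse (not actually typed)
    $text = str_replace(['Cover photo: Getty Images'], '', $text);
    // drop script (and, for good measure, style) blocks with their contents
    $text = preg_replace('/<script\b[^>]*>.*?<\/script>/is', '', $text);
    $text = preg_replace('/<style\b[^>]*>.*?<\/style>/is', '', $text);
    // remove every remaining html tag
    $text = strip_tags($text);
    // collapse runs of spaces and runs of linebreaks
    $text = preg_replace('/ {2,}/', ' ', $text);
    $text = preg_replace('/\n{2,}/', "\n", $text);
    return trim($text);
}
```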
This algorithm returns a plain-text version of the article.
Additionally, I calculate and store the length of the plain text version in the data table at this point for validation.
Once the database is filled with freshly harvested text, I check the shortest and longest articles based on the calculated length. Records with 0 bytes are more than suspicious, just like articles with many tens of thousands of characters. I check some of these records manually, tweak the cleaner algorithm if needed and run it again on the whole database. No need for another grab if you've stored the raw html.
At this point I simply read all the text from the database and iterate through it character by character, incrementing a counter for every character, character pair, trigram and word. The result is a large array containing all the occurrences. This can take some time, but no more than a few seconds even for several million characters.
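A sketch of the counting pass, assuming the mbstring extension so multi-byte (UTF-8) characters are counted as single characters; word counting would be a similar loop over a whitespace split:

```php
<?php
// Count n-grams (n=1: characters, n=2: pairs, n=3: trigrams) in one sweep.
// Assumes the mbstring extension for correct UTF-8 handling.
function count_ngrams(string $text, int $n): array {
    $counts = [];
    $len = mb_strlen($text);
    for ($i = 0; $i + $n <= $len; $i++) {
        $gram = mb_substr($text, $i, $n);
        $counts[$gram] = ($counts[$gram] ?? 0) + 1;
    }
    arsort($counts);        // order by frequency, most common first
    return $counts;
}

$chars    = count_ngrams('hello world', 1);
$bigrams  = count_ngrams('hello world', 2);
$trigrams = count_ngrams('hello world', 3);
// $chars['l'] is 3, $bigrams['lo'] is 1, $trigrams['llo'] is 1
```

(Calling mb_substr in a loop is not the fastest option, but it keeps the sketch readable; for millions of characters you'd split the text into an array once and iterate over that.)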
And that's it. If needed, you can order these counters by frequency or output the array in whatever format you need: a list, a more human-readable table, a PHP array, JSON, CSV etc.
Validation, round 2
The same goes for funny characters, which may be the result of UTF encoding errors.
Search for these chars in your database and check the articles. Sometimes it turns out they are legit, other times you just fix the issue manually or rework your cleaning algo and run it again.
There aren't really exact rules for this validation. 10M chars of text from a business news portal resulted in 85 different characters, while 8M chars of another magazine (tech, business and popular science) resulted in 217 different chars, and after further investigation all the funny international chars and symbols seemed legit. (You can consider leaving it as it is: only 70 of the 217 chars reached a frequency of 0.001%, which means they don't really mess up any optimization model. The sum is 1354 funny chars in the 7.8M chars total, which is 0.0174%.)
I set the url column as UNIQUE and use INSERT IGNORE. This way the same article is only grabbed once, even if it's pushed to another page of the archives by a newly published article.
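In MySQL terms the dedup looks roughly like this (the table and column names are made up for the example):

```sql
-- Hypothetical schema: the UNIQUE key on url makes repeated grabs harmless.
CREATE TABLE articles (
    id          INT AUTO_INCREMENT PRIMARY KEY,
    url         VARCHAR(500) NOT NULL UNIQUE,
    raw_html    MEDIUMTEXT,
    plain_text  MEDIUMTEXT,
    text_length INT
);

-- A duplicate url is silently skipped instead of raising an error.
INSERT IGNORE INTO articles (url, raw_html)
VALUES ('https://example.com/article/1', '<html>...</html>');
```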
Going from newer to older articles also prevents skipping an entry.