New CBC Programming guide launched

Under the Hood column

This past week we released a new version of the CBC Program Guide. This was a much-needed upgrade: the old version crashed constantly, which made it virtually useless.

The Nitty Gritty
This new version is written in Java using Struts and Hibernate. These are enterprise frameworks that eliminate the need to create custom libraries. These “custom libraries” are what caused a lot of the instability with the previous version of the Program Guide.

The previous version stored all of its data in an Oracle database, with no expiry policy. This meant that you could go back years to see what was on CBC Television in 2003, for example. The new version stores all of its data in a PostgreSQL database with an expiry policy, since there really is no need to know what was on CBC years ago.

The new guide is also extremely extensible: it can output in multiple formats, including HTML, XML, and JSON. This allows other CBC developers to leverage the data located in the guide.

Quick Rundown
Program Guide information is available for all CBC properties: Television, Newsworld, Radio One, and Radio 2. It also includes A to Z guides of all CBC Programs and Personalities. The new guide has been redesigned so that it is easy to see what is on air right now, which is highlighted in blue.

You are able to “segment” your day into early morning, morning, afternoon, or evening, so you see only a 5-8 hour block at a time. You are also able to view the full day, or a schedule for the entire week. Clicking on a show title will bring you to the program page, which shows air times plus a description of the show and its personalities. You can also filter your schedule by program category, so you see only Sports, Comedy, Drama, etc.

The print friendly version of the guide is well formatted and easy to read.

The Future
With the new Program Guide framework we are able to provide a lot of new features. Some ideas floating around include RSS feeds for your favorite show (air times, descriptions, etc.) and the ability to include program/personality information in our search engine.

The new Program Guide will be used during the Olympics to allow you to know exactly what event will be on-air when. You will be able to access this information from the Olympics page or the Program Guide main page.

Expect to see more features and pages that utilize the new Program Guide in the near future!

  Under the Hood Posted at 2:34 pm (22 Jul 2008)



Farewell Webtrends, Hello Hitbox!

And… farewell Blake.

The following is the last of the “Under the Hood” columns that have appeared on this blog for more than a year, courtesy of CBC.ca tech guru Blake Crosby.

Blake worked with CBC.ca for almost six years, first joining the team to babysit the servers during the Salt Lake City Olympics. He went on to work on other Olympic and elections sites, among others, and won an award for his work on CBC.ca’s Media Resource Locator tool (see his earlier column.)

Blake has moved on to a company called VerticalScope. Long term, he’s working toward a career in aviation – you can track his progress on his flying blog.

Thanks, Blake!
~PG
————–

Farewell Webtrends, Hello Hitbox!

There have been some behind the scenes changes in the way we process and crunch the web server log files.

Webtrends
The software we were using previously was called Webtrends. It processes the raw log files from the web servers and produces graphs and charts.

The main advantage to using Webtrends is the fact that it processes the raw web server logs. Anytime someone fetches content from our web servers, it is recorded in a log file. Whether this be a mobile phone, Internet Explorer, your grandma’s 386, or your text only browser - it’s all tracked.

Items such as your IP address, the page you were requesting, the type of browser you were using, and the date and time were recorded. This provided a solid source of data to process.

One of the downsides was the Webtrends limitation that the log files needed to be in chronological order. This is impossible with our website, as we have many different log file sources that are all out of order. There was a lot of overhead to merge all these log files into a chronologically correct source of data for Webtrends.
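To give a feel for the merge step described above, here is a minimal Python sketch. It is not the script we actually ran: the function names are made up for illustration, and it assumes every line carries a bracketed timestamp in the usual web server log style.

```python
import heapq
from datetime import datetime


def timestamp(line):
    """Pull the bracketed timestamp out of a log line,
    e.g. [24/Aug/2005:21:03:17 -0400]."""
    start = line.index("[") + 1
    end = line.index("]", start)
    return datetime.strptime(line[start:end], "%d/%b/%Y:%H:%M:%S %z")


def merge_logs(*sources):
    """Return one chronologically ordered list from several log sources.

    Each source is sorted on its own first (in case it is out of order),
    then the sorted sources are merged in a single pass.
    """
    ordered = [sorted(source, key=timestamp) for source in sources]
    return list(heapq.merge(*ordered, key=timestamp))
```

Sorting each source separately and then merging keeps the work per source small, which matters when there are many web servers each writing their own log.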

Changing Business Requirements
With the recent “upgrade” of the internet to Web 2.0, CBC needed to upgrade its website with more “Web 2.0” features. These included items such as the most viewed stories, or most e-mailed stories. This real time data was available from the web server logs, but Webtrends couldn’t process the data fast enough for it to be useful.

This is where Hitbox, our new system, shines. The Hitbox product comes from a company called Visual Sciences (formerly WebSideStory). It serves the same purpose as Webtrends, except it offers real time data about people visiting the website. This is done not with log files but with JavaScript.

For every single page you visit on CBC.ca, a cookie for “a.cbc.ca” will be set. This cookie is used by Hitbox to track your movements throughout the website, and is recorded in real time. Although no identifiable information is recorded, we can see how individual users use the website.

That means content producers can track the performance of various areas of their sites in real time - understanding what stories are most popular, the times of day with heaviest usage, the most common navigation paths through the site, what links users follow to and from stories, and so on. By watching specific live stats instead of waiting for a report, the content itself can better reflect users’ actual behaviour.

  Under the Hood Posted at 11:16 am (28 Nov 2007)



Wiki Mania!

Under the Hood

It seems that computer programmers like to use Hawaiian names to describe their applications. We use two such applications:

  1. MediaWiki (“wikiwiki” is Hawaiian for “quick”)
  2. Akamai (Hawaiian for “intelligent” or “smart”)

CBC has its own wiki, which runs MediaWiki, the same software that powers Wikipedia. Anyone and any department is welcome to use it. It currently has 259 articles and is growing.

CBC Wiki front page

Some of its uses:

  • Training guide for new hires
  • Personal pages for employees who use them as note pads or a place to keep track of project progress
  • Information about infrastructure, policy, phone numbers, etc.
  • A listing of standards to be used by content producers

The wiki is only available to CBC employees, at wiki.tor.cbc.ca. It is pretty technical, as only the IT departments are currently using it, but it should be useful to other departments as well, including Radio and TV.

  CBC.ca web site, Under the Hood Posted at 2:51 pm (07 Aug 2007)



What’s The Password?

Under the Hood

Managing multiple machines or accounts involves remembering quite a few usernames and passwords. Every member of the team also has to use the same usernames and passwords. The challenge is to find a secure way of storing these login credentials so that everyone has an easy way to access them.

Originally we stored all the usernames and passwords in what we called the “Password Book”. This book was a simple notepad with the pertinent information stored in a cabinet close by. The main problem with this system was that the information was not available unless you were actually in the office. This made on-the-road troubleshooting almost impossible.

Currently we store all of the information in a Blowfish-encrypted database. The database is stored on a Wiki page (which I will talk about in another post) where it can be downloaded and stored on a USB key.

This database is then read by a program called Password Gorilla. The advantage of using Password Gorilla is that the client is available on a multitude of platforms, including mobile ones like Windows CE (but not BlackBerry).

In reality, we just need to remember one password: the password to the encrypted database, and we have access to all the usernames and passwords we need in order to do our jobs.

  CBC.ca web site, Under the Hood Posted at 9:54 am (10 Jul 2007)



We Know Where You Live…

Under the Hood

… 82% of the time.

Weather Widget

One of the cooler features of the new site design is the weather “widget” in the top left hand corner of the page. Some people might have noticed that the weather report being displayed is for the major city that is closest to you. How did our website know what city you are in?

It’s All About The IP Address
Every computer on the Internet has an IP address. This set of numbers allows networks to send the data you requested to your computer and not the one down the street. The IANA is the organization that keeps track of IP addresses on the Internet. It assigns blocks of addresses to five different regions of the world: Africa; Asia/Pacific; North America; Latin America and the Caribbean; and Europe, the Middle East, and Central Asia. This division is the first step in figuring out your physical location based on your IP address. At this point, we can narrow down which region of the world you are located in. Obviously, that is not accurate enough to use in our application.

In each one of these regions there is an organization that then redistributes the IP blocks allocated to it by the IANA to various people or companies in that region. For our discussion, we will focus on ARIN (American Registry for Internet Numbers), which handles IP addresses in North America. ARIN is similar to a telephone company that distributes telephone numbers when you move to a new house, except that in this case it hands out IP addresses to ISPs and corporations.

It’s Public Information
ARIN (like all “registries”, as they are called) requires that “owners” of IP addresses have a physical address on record. This information is available to the public at various locations.

Let’s take a real world example: the IP address for my computer at home is 24.137.199.17. If I type that address into the ARIN website, I get the following result:

Aurora Cable Internet ACI2 (NET-24-137-192-0-1)
                                  24.137.192.0 - 24.137.223.255
Aurora Cable Internet HE-24-137-192-0-19 (NET-24-137-192-0-2)
                                  24.137.192.0 - 24.137.223.255

Great! I know that a company called “Aurora Cable Internet” is responsible for my IP address and any other IP address that is inside the ranges listed above.

The string after the company name (“ACI2” and “HE-24-137-192-0-19”) will be able to provide us with more information about that company. Plugging “ACI2” into the same search box on the ARIN site reveals (among other information):

OrgName:    Aurora Cable Internet
OrgID:      ACI-38
Address:    350 Industrial Parkway South
City:       Aurora
StateProv:  ON
PostalCode: L4G-3L6
Country:    CA

Now we all know that I live in Aurora, ON! More importantly though, we have just associated a physical location with an IP address.
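For the curious, the range check itself can be sketched in a few lines of Python. This is only an illustration, not the database we actually use: the one-row table and the function names are hypothetical, and the only real data in it is the range and city from the ARIN lookup above.

```python
import ipaddress

# One row taken from the ARIN lookup above. The table layout is hypothetical;
# a real geolocation database holds a great many ranges.
RANGES = [
    ("24.137.192.0", "24.137.223.255", ("Aurora", "ON")),
]


def locate(ip):
    """Return (city, province) for the range containing ip, or None."""
    addr = ipaddress.ip_address(ip)
    for first, last, place in RANGES:
        if ipaddress.ip_address(first) <= addr <= ipaddress.ip_address(last):
            return place
    return None


def as_cidr(first, last):
    """Express a first-last address range as CIDR blocks."""
    blocks = ipaddress.summarize_address_range(
        ipaddress.ip_address(first), ipaddress.ip_address(last)
    )
    return [str(block) for block in blocks]
```

Calling `as_cidr("24.137.192.0", "24.137.223.255")` gives a single /19 block, which is likely where the trailing “19” in “HE-24-137-192-0-19” comes from.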

What’s this 82% business?
The database that we use places you within 40 km of your true location only 82% of the time.

So what happens if you are in the 18%? The weather defaults to Toronto (further proof that we’re the centre of the universe). From there you can click on the “change city” link to save a new location.

I live in Canmore, but the weather for Calgary shows up. Why?
We’ve only discussed one part of how this system works. We can now identify which city you are probably located in. However, we have weather data for hundreds of cities across Canada and we need to determine which city with a weather station is closest to you.

Currently the weather widget only knows about 21 cities in Canada (we will be expanding this shortly). To find the closest city with a weather station to Canmore we use a calculation called “great circle distance”. Put simply, it’s the distance between two points on a sphere (the sphere in this case is the Earth).

Great Circle Formula
Holy Math Batman! The Great Circle Formula

Using the latitude and longitude of Canmore we calculate the great circle distance to the 21 supported cities and return the one that is closest and display that data in the widget. This calculation only happens once and the information is saved as a cookie in your browser.
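Here is a rough Python sketch of that lookup. It uses the haversine form of the great circle calculation, which is one common way to write it and not necessarily the exact formula pictured above. The three cities and their approximate coordinates are stand-ins for the real list of 21, and the function names are made up.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean radius; the Earth is treated as a perfect sphere

# Approximate coordinates, for illustration only.
CITIES = {
    "Calgary": (51.05, -114.07),
    "Toronto": (43.65, -79.38),
    "Vancouver": (49.28, -123.12),
}


def great_circle_km(lat1, lon1, lat2, lon2):
    """Great circle distance between two points, using the haversine formula."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def closest_city(lat, lon, cities=CITIES):
    """Name of the supported city nearest to the given point."""
    return min(cities, key=lambda name: great_circle_km(lat, lon, *cities[name]))
```

With Canmore at roughly 51.09 N, 115.36 W, `closest_city(51.09, -115.36)` picks Calgary, about 90 km away.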

Tip of the Iceberg
The weather widget is just a sample of what is possible with location-based website customization. Some of the possibilities include:

  • Biasing your searches on the website to rank items that are physically closer to you higher than others. For example, if you search for “Conrad Black” and you are located in Vancouver, show news stories of Conrad Black in Vancouver at the top of the results list.
  • Customizing the news page. Similar to the above, display your local news before or above the national news.
  • Display the proper radio and television schedules for your area automatically.
  • Show you more “relevant” advertisements based on your location.

I’m sure you will see more of this type of customization as cbc.ca evolves over the next little while.

  CBC.ca web site, Under the Hood Posted at 10:37 am (05 Jun 2007)



Who’s Visiting CBC.ca?

Under the Hood

One of the great things about the Internet world (that our TV and Radio departments are envious of) is that we can get statistics about our visitors in real time. All of these statistics are anonymous: they don’t contain information such as your address or your name. We can, however, mine some pretty interesting information from the log files that our web servers generate.

It’s a Windows World
I’m sure it’s no surprise that the majority of our users are browsing the website using a Windows machine and Internet Explorer. In fact, 76% of hits to CBC.ca came from users running Windows XP. Similarly, 72% of the hits to CBC.ca came from Internet Explorer. Take a look at the breakdown for Jan 1st, 2007 to April 1st, 2007:

Visitors By Browser:

Internet Explorer 72.22%
Mozilla 17.46%
Safari 4.62%
Others 5.70%

Visitors By Operating System:

Windows XP 76.39%
Windows 2000 7.97%
Macintosh PowerPC 4.82%
Others 4.15%
Macintosh 2.06%
Windows 98 2.01%
Linux 0.87%
Windows ME 0.76%
Windows 2003 0.50%
Windows NT 0.33%

How We Get This Information
Every time your browser fetches a page from CBC.ca the web server tracks that “hit” in a log file. Here is what an example log entry looks like:

69.17.178.81 - - [24/Aug/2005:21:03:17 -0400] "GET / HTTP/1.1"
200 4532 "http://www.google.com/"
"Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rev:1.7.0)
Gecko/20050716 Firefox/1.0.6"

Without going into too much detail, you can see the following information:

  • The date and time the request for the page was made
  • If the user came to this page from another one.
  • The browser version, type, and operating system
  • The size of the page the user downloaded

Using the information found in these log entries, we can come up with quite a few types of statistics, such as:

  • What country the user is coming from
  • What the busiest time of day for CBC.ca is
  • What the most popular pages are
  • What 3rd party websites are generating the most traffic to us
  • If there are any broken links on our site
  • What language the user’s operating system and/or browser supports

How CBC.ca Uses This Information
Understanding who our users are and what type of browsers and operating systems they use is an important part in designing the services that are offered on CBC.ca.

We also use this information to do something we call “dayparting”. If we discover that the majority of the traffic to the Business/Money section of the site comes during the lunch hour, then we may change the way items are displayed in the line-up on the front page at that time. For example, we may promote more “business” related stories.

  CBC.ca web site, Under the Hood Posted at 4:22 pm (25 Apr 2007)



All About Radio 2 Schedules

Under the Hood

[This post is courtesy of one of our developers, Keith Murray]

As you are probably aware from the other articles on this site, the folks at Radio Two have rolled out their new lineup of shows. What you may not realize is that at the same time, a small group of folks at CBC.ca have been working overtime to set up a landing page to support the new look and feel of Radio Two.

What’s New?
Some of the cool things on the new landing page are detailed schedules, program descriptions and host bios, and playlists for the shows. The playlists come from an internal system called INews that I know nothing about. The show listings and bios come from another internal system called Program Guide, and that’s something I know about.

Before I get all technical and tell you about how the listings and bios get to the new Radio Two landing page, let me describe my relationship with Program Guide.

Some History.
Way back in the 1980s, I would ride my bike to Greenly’s Bookstore in downtown Belleville once a month and buy a copy of The National Radio Guide. It was a glossy magazine published by the CBC that had articles, detailed show descriptions, and bios of on-air personalities. The best thing is that it also had the complete Radio and Stereo (now known as Radio One and Radio Two) schedules. Never again would I miss an interesting episode of Ideas, with Lister Sinclair.

Check out these scans from the February 1989 edition. Ahhh, Morningside, Brave New Waves, Night Lines…

Radio Guide cover and Radio Guide listings

The CBC stopped publishing the Radio Guide in the late 1990s, and they now publish listings on the Program Guide.

When I started working at CBC.ca, I found out that the Program Guide application was in need of an owner. Nobody wanted to maintain the code. But for me it was like meeting an old friend, so I gladly took it over. It didn’t take me long to find out why people avoided it.

Mouldy Code
Program Guide is suffering from bit rot. It’s old, and too many features have been hacked on. As well, every time somebody accesses a schedule, the application dynamically looks up the schedule in a database, and creates a nicely formatted page, just for you. Every time. It has to, because the local morning show in Tofino isn’t the same as in Montreal. But, that’s just not very efficient, and Program Guide crashes often.

Ok, I’m finally going to talk about how the schedules on the new landing page work.

When the Radio Two folks approached me with the idea of the new landing page pulling data from Program Guide dynamically, I originally said “No way, it will never handle the load!”. However, it was mentioned that the new Radio Two schedule had no local programming, so it only had to look up the schedule once per day, and it would be the same for all time zones (“half an hour later in Newfoundland”). That I could live with.

So once per day, the Radio Two schedule, the show descriptions, and host bios are retrieved from Program Guide and stored in a file format called XML. XML files are an easy way to store lots of related information in a simple file that can be read with any text editor. It’s also easy to write other programs that read XML files and do something useful with them. For example, our podcasts are served up via RSS feeds, which are really just XML files.

Here is a snippet of the schedule for Saturday night in XML.


<schedule>
<timeslot>
<startdate>2007/03/25</startdate>
<starttime>20:00</starttime>
<endtime>22:00</endtime>
<title><![CDATA[Canada Live - With Patti Schmidt]]></title>
<episode></episode>
<shortdescription><![CDATA[Listeners will be transported to concert halls, music clubs and festival stages across the country for live performances]]></shortdescription>
</timeslot>
</schedule>
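To show how little code it takes to read a file like this, here is a short Python sketch using the standard library. It is only an illustration (the landing page itself uses XSLT, not Python), and the function name is made up.

```python
import xml.etree.ElementTree as ET

# The snippet from above, with the long description shortened for space.
SCHEDULE_XML = """<schedule>
<timeslot>
<startdate>2007/03/25</startdate>
<starttime>20:00</starttime>
<endtime>22:00</endtime>
<title><![CDATA[Canada Live - With Patti Schmidt]]></title>
<episode></episode>
<shortdescription><![CDATA[Listeners will be transported to concert halls]]></shortdescription>
</timeslot>
</schedule>"""


def read_schedule(xml_text):
    """Return one dictionary per <timeslot>, keyed by child element name."""
    root = ET.fromstring(xml_text)
    # An empty element such as <episode></episode> has no text, so use "".
    return [
        {child.tag: (child.text or "") for child in slot}
        for slot in root.findall("timeslot")
    ]
```

The CDATA wrappers disappear during parsing, so the title comes back as plain text.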

The XML files with the schedule information are transformed into HTML files on the landing page via XSLT. XSLT is a fairly efficient way to take XML files and transform them into something else, usually while adding formatting or other display components to the data.

Here is a snippet of the XSLT file that correlates the programs with the host bios.


<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
<xsl:output method="html" indent="yes" omit-xml-declaration="yes"/>
<xsl:param name="param1" />
<xsl:variable name="titleKey">
<xsl:value-of select="$param1" />
</xsl:variable>
<xsl:template match="/">
<xsl:for-each select="personalities/personality[./title=$titleKey]">
name: <xsl:value-of select="concat(firstName, ' ' , lastName)"/>
</xsl:for-each>
</xsl:template>
</xsl:stylesheet>

This is what it ends up looking like on the new landing page.

r2_html_example.jpg

The core schedule data is still coming from Program Guide. However, with the new Radio Two landing page it’s much more efficient, as the information is only read once per day, for everybody, in any time zone.

Note from Blake: The program guide is slated to be revamped with performance increases and an easier-to-use interface. For all your Radio 2 needs you can go straight to the Radio 2 website and not have to use Program Guide at all.

  CBC.ca web site, Under the Hood Posted at 10:21 pm (26 Mar 2007)



Under The Hood: Test The Nation

Under the Hood

[Note: I've updated some content since the broadcast]

In case you don’t know, Test The Nation is a show on CBC Television that will allow you to take an IQ test and see if you’re smarter than the panel of surgeons, tattoo artists, or radio DJs.

There are two ways you can take the test. The first is online at cbc.ca/testthenation. The second is while you are watching the show, by keeping track of your score on a scorecard.

Who Made It Happen at CBC.ca?
A quick glance at the credit page yields names and titles, but what exactly do these people do?

Project Management ensures that the project runs on time, meets all deadlines, co-ordinates the resources, schedules meetings, and is the link between the client and the project team members.

Platform Support is the team that I work on and provides support to the Technical Lead, Programmers, QA Team, and Design team. When things aren’t working as expected with the webservers, databases, ftp server, or mailing lists the Platform Support team is there to keep everything running smoothly.

Design Lead/Front-End Programming deals with anything that is the front-end of the site, that is, what you see and interact with. This role involves integrating dynamic content, such as Flash applications or Ajax widgets, into the site. This also includes programming these applications.

The Research department takes all of the raw numbers from various sources and computes official and audited statistics for TV, radio, and cbc.ca. Their role in Test The Nation is to ensure that the data from the online IQ tests is accurate and in a format that can be digested by other departments for inclusion on TV or the website.

Copywriting is the term given to a person or department whose role is to ensure that the text content is accurate and error-free. This includes checking spelling, grammar, and ensuring that the copy (their term for “text”) is correct.

Design, like the name says, involves designing the look and feel of the website. This includes colours, fonts, and layout. With projects that are integrated with TV there is usually a style guide to follow so that both TV and Online look unified.

Technical Lead is the person that all non-technical people go to in order to get their tech questions answered. This person utilizes all of the resources at his disposal (Platform Team, IT, External Vendors) to get the answer he’s looking for. In the case of Test The Nation this person also did the back-end programming (written in Java of course).

Quality Assurance ensures that everything works properly. This includes testing the website on a multitude of platforms and browsers, as well as testing the back-end components of the site. With Test The Nation their role was to ensure that there were no bugs in either the front-end (Flash) or back-end (Java) code.

ttn2.jpg
A view from the set

So How Does It Work?
The online test is a Flash application that delivers the questions and records your answers.

When you have completed the test, data such as how you scored in each section, the time it took to complete the test, the demographics that you entered at the start of the test, and your computed IQ are stored in a database.

All of the data is funneled through our content delivery network (CDN). Using their system, we are able to throttle the number of test responses that reach our database. This allows us to ensure that the database never goes down because it’s overloaded. If, for whatever reason, the site goes down, our CDN will queue up any test responses for later delivery when the site returns to normal.

At certain intervals during the show, the statistics department will deliver the latest results from the online IQ test.

Interesting Trends
Keeping an eye on the website traffic during events like this is always fun. Take a look at the following graph:

ttngraph.jpg

The red line indicates when the broadcast aired in the Atlantic region (7pm ET). As the show airs in each timezone you can see a corresponding spike in traffic. The next “blip” is Eastern time, followed by Central, Mountain, then Pacific. If you were to drill down into each spike, you could see corresponding “mini spikes” during commercial breaks as people browse the site during TV commercials.

Keeping It Fun
During events like this, I like to spice things up a bit by having a pool of some sort. A few of us in the office are making predictions as to the number of people who will do the online IQ test. Some people guess, others try to make an educated guess by using wacky formulas (like taking a percentage of the estimated number of TV viewers) or “insider information”. Whatever the case, it will be interesting to see who wins.

CBCers were able to take the IQ test before the show aired. I’m ashamed to admit that I scored an IQ of 96. I’m going to use the excuse of being distracted because I was doing the test while watching TV ;)

  CBC.ca web site, Under the Hood Posted at 6:33 pm (18 Mar 2007)



Daylight Savings Time

Under the Hood

Daylight Saving What?
It’s amazing how something invented so long ago can cause grief today. The United States passed a law that would change the rules that we follow for daylight savings time.

As you know, computers keep track of time and can automatically adjust their clocks for daylight savings time. There is no need to “spring forward” or “fall back” your computer’s clock.

All of these DST rules (and all time zone rules) are stored on your computer as part of your operating system. If there are any changes with these rules your computer must be patched or updated in order for it to know about them.

How CBC.ca tracks time
Time is kept in sync with all of the servers using a system called “Network Time Protocol” (NTP).

NTP allows all of the servers to sync their time from really accurate time sources like atomic clocks. This is extremely important in the TV and Radio world and equally important at CBC.ca.

The majority of CBC.ca web servers keep track of time in Eastern Time. The eastern time zone observes DST. This means that if the DST rules change, all of the servers that run CBC.ca must be updated as well.

So what’s the big deal?

Newsletters
The news digests are sent out on a schedule. If the servers did not know about the new DST rules, the emails would arrive an hour earlier (or later) than expected.

In fact, twice a year, we need to update this schedule manually because Saskatchewan does not observe DST. In order to keep the Saskatchewan news digest going out on a regular schedule, every spring and fall we need to adjust the schedule on our server manually.

For you techies out there, that means adjusting the cron job each time we switch from daylight savings time to standard time:

###### Sask is in pre-cambrian time. No Daylight saving time:
###### One hour behind ET from Oct until April
#00 13 * * 1-5 perl /sites/cbc.ca/bin/digestmail.pl sask-am-headlines
#00 21 * * 1-5 perl /sites/cbc.ca/bin/digestmail.pl sask-headlines
###### Two hours behind ET from April until Oct
00 14 * * 1-5 perl /sites/cbc.ca/bin/digestmail.pl sask-am-headlines
00 22 * * 1-5 perl /sites/cbc.ca/bin/digestmail.pl sask-headlines

So yes, we will need to update those comments to reflect the new DST rules.
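The arithmetic behind those two sets of cron entries can be sketched in Python. This is an illustration only: the function names are made up, the rule coded here is the new one (second Sunday in March to first Sunday in November), and it works at the level of whole days. Real code would be better off using the operating system's time zone database.

```python
from datetime import date, timedelta


def nth_sunday(year, month, n):
    """Date of the n-th Sunday of the given month."""
    first = date(year, month, 1)
    # date.weekday(): Monday is 0 and Sunday is 6
    days_to_sunday = (6 - first.weekday()) % 7
    return first + timedelta(days=days_to_sunday + 7 * (n - 1))


def eastern_observes_dst(day):
    """True if Eastern Time is on daylight time on this day (new rules)."""
    start = nth_sunday(day.year, 3, 2)   # second Sunday in March
    end = nth_sunday(day.year, 11, 1)    # first Sunday in November
    return start <= day < end


def server_hour_for_sask(day, sask_hour):
    """Hour on an Eastern Time server that matches sask_hour in Saskatchewan."""
    hours_behind = 2 if eastern_observes_dst(day) else 1
    return sask_hour + hours_behind
```

A digest meant to go out at noon in Saskatchewan therefore runs at 13:00 on the server in January and 14:00 in July, matching the two sets of cron lines above.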

Log Files
Every time you visit the website, certain information is recorded in a log file. We use this data to see which pages people are visiting and how popular certain sections of the site are.

The time of the visit is also recorded and needs to be accurate in order for us to produce accurate statistics.

Program Guide
Accurate time is necessary in order for the Program Guide to be useful. You wouldn’t want to miss your favorite CBC program because the Program Guide is an hour off.

If Only…
If we were using UTC time instead of eastern time for timekeeping, this would be a non issue and we wouldn’t have to patch four different types of operating systems.

  CBC.ca web site, Under the Hood Posted at 11:44 pm (19 Feb 2007)



Publishing News To CBC.ca

Under the Hood

Here is a quick rundown of how news is published to CBC.ca.

The Software
CBC.ca uses a piece of software from Interwoven called TeamSite, which is internally branded and referred to as EPT (Editorial Publishing Tool). It was purchased to replace a home-grown solution that was buggy and unreliable. It has two main components, TeamSite and Open Deploy.

The purpose of Open Deploy is to take the content from TeamSite and publish it to our web servers. EPT allows the news writers to focus on writing news without having to worry about HTML or layout.

The writer is presented with a bunch of fields that must be filled out, such as:

  • Headline
  • Byline
  • Keywords
  • Deck
  • Body
  • Related Links
  • Media (Images or Video)
  • Category

Once the story has been written it is saved for the copy desk to review and make changes as needed. Only the copy desk can publish stories to the website.

The Lineup
A story can be published to the website but nobody will know about it unless it is in the Line Up. The Line Up is a list of stories that appear on the CBC.ca front page, or the News Page for each section.

The line up builder is a tool inside EPT that allows the news writer to adjust which stories appear on the landing pages and in which order.

Sample Line Up

The photo illustrates what a line up is. You can see how that line up is edited inside EPT by taking a look at this screenshot.

The Deployment
Once a story is ready to be published the Open Deploy component reads the data from TeamSite and writes the HTML files to the web servers.

  CBC.ca web site, Under the Hood Posted at 1:11 pm (21 Dec 2006)



Needle in a Haystack

Under the Hood

The search engine that powers both CBC.ca and radio-canada.ca is a Google Search Appliance (GSA). As such you can use most of the tools and search terms on the commercial site on the CBC.ca search engine.

Query Hacking
Let’s start off with some basics. The search engine by default “OR”s your search terms. That means that if you type in “blue house” (without the quotes) the search engine will return hits that have the word “blue” or “house”. If you would like to force the search engine to “AND” your results then you need to wrap your search term in quotations. This will result in only the exact phrase being matched and returned in your results.

If you would like to omit certain terms from your results, you can prefix them with the negative sign (”-”). If you are looking for information on bass fish (and not related to music at all) you can remove all references to “music” by using “bass -music” (without the quotes) as your search query.

You can restrict your search results to a specific file type by using the “filetype:” query prefix. If you are looking for a specific PDF on our site you can use “filetype:pdf” in your query. For example, “budget filetype:pdf” (without the quotes) will return all of the PDFs on our site that contain the word “budget”. The GSA can index over 100 types of files, including binary files such as JPGs, TIFFs, PSDs, and Flash content. However, we do not crawl and index these types of files. Only “text” content is indexed (this includes PDF, DOC, and XLS files).
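
The operators covered so far (quoted phrases, the minus sign, and “filetype:”) can be combined in one query. Here is a minimal sketch in Python of how a query string might be assembled. The function name build_query and its arguments are made up for illustration and are not part of the GSA:

```python
def build_query(terms=(), phrase=None, exclude=(), filetype=None):
    """Assemble a search query string from the operators described above.

    This helper is hypothetical; it only builds the text you would type
    into the search box.
    """
    parts = list(terms)
    if phrase:
        # Quotation marks ask for the exact phrase.
        parts.append(f'"{phrase}"')
    # A leading minus sign omits results that contain the term.
    parts.extend(f"-{term}" for term in exclude)
    if filetype:
        # filetype: restricts results to one file type, e.g. pdf.
        parts.append(f"filetype:{filetype}")
    return " ".join(parts)


print(build_query(terms=["bass"], exclude=["music"]))  # bass -music
print(build_query(terms=["budget"], filetype="pdf"))   # budget filetype:pdf
print(build_query(phrase="blue house"))                # "blue house"
```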

The “site:” query prefix allows you to restrict your query to a specific section of the site. Some of you might remember that we used to let you restrict a search to News, Sports, or Arts by using the tabs at the top of the search results. Those tabs have been gone for a while, but you can still achieve the same thing. For example, say you want to find sports stories about Todd Bertuzzi and not stories about his home life in BC. The query “Bertuzzi site:www.cbc.ca/sports” returns only sports stories: 679 results instead of the 1380 you get without the restriction. On the other hand, if you only want to view stories from BC, use “Bertuzzi site:www.cbc.ca/canada/british-columbia”.

The “site:” query prefix is handy for searching our newsletters. If you want to search the Quirks and Quarks newsletter for “cats” you can use “cats site:interact.cbc.ca/pipermail/quirks” (without the quotes). If you want to search all of our newsletters, shorten the “site:” prefix to “site:interact.cbc.ca/pipermail”.

It is important not to add a trailing slash when you use the “site:” query prefix, unless you want to restrict your search results to that *exact* URL rather than searching recursively beneath it.
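
As a rough illustration of the trailing-slash rule, this Python sketch appends a “site:” restriction to a query and drops any trailing slash when a recursive search is wanted. The helper name site_query and its recursive flag are assumptions made for this example:

```python
def site_query(terms, site, recursive=True):
    """Append a site: restriction to a query string.

    With recursive=True any trailing slash is dropped, so the search
    covers everything under that path. With recursive=False the site
    value is used exactly as given.
    """
    if recursive:
        site = site.rstrip("/")
    return f"{terms} site:{site}"


print(site_query("Bertuzzi", "www.cbc.ca/sports/"))
# Bertuzzi site:www.cbc.ca/sports
print(site_query("cats", "interact.cbc.ca/pipermail/quirks"))
# cats site:interact.cbc.ca/pipermail/quirks
```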

URL Hacking

Now on to the more advanced stuff. This can only be done by hacking the URL directly. Once you have your search results page, you can further change the results by adding or editing parameters in the results URL.

If you would like to get your search results in XML, you need to change two parameters:
1. Remove the proxystylesheet key (proxystylesheet=CBC)
2. Change the “output” key from output=xml_no_dtd to output=xml
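
The two steps above can be sketched with Python’s standard urllib.parse module. The host search.example and the rest of the sample URL are placeholders, not the real search address; only the proxystylesheet and output keys come from the steps above:

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def to_xml_results(url):
    """Rewrite a results URL so that it returns XML instead of HTML."""
    parts = urlsplit(url)
    params = parse_qsl(parts.query, keep_blank_values=True)
    # Step 1: remove the proxystylesheet key.
    params = [(key, value) for key, value in params if key != "proxystylesheet"]
    # Step 2: change the output key to output=xml.
    params = [(key, "xml" if key == "output" else value) for key, value in params]
    if not any(key == "output" for key, _ in params):
        params.append(("output", "xml"))
    return urlunsplit(parts._replace(query=urlencode(params)))


# Placeholder URL for illustration only.
example = "http://search.example/search?q=budget&proxystylesheet=CBC&output=xml_no_dtd"
print(to_xml_results(example))
# http://search.example/search?q=budget&output=xml
```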

For fun, you can also get the search results using Radio-Canada’s template:
1. Change the proxystylesheet from proxystylesheet=CBC to proxystylesheet=RadioCanada

There are two options for sorting your search results. “By Date” sorts the results in chronological order, regardless of relevancy. “By Relevance” puts the results in the order that the GSA considers most relevant. You can combine the two, that is, sort by both date and relevancy, by editing the sort key:
1. Change the sort key to sort=date:D:S:d1

If you are looking for French content on CBC.ca (and not radio-canada.ca) you can add the “lr” key to the URL:
1. Tack on &lr=lang_fr to your search page URL.

Conversely, if you are looking for English pages on Radio-Canada’s site, you can add “&lr=lang_en” to their search results page URL.
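
The sort and lr edits follow the same pattern: set one key on the results URL and leave the rest alone. Here is a minimal Python sketch, again using a placeholder host (search.example) and a hypothetical helper name, set_params:

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def set_params(url, **overrides):
    """Add or replace query-string keys on a results URL."""
    parts = urlsplit(url)
    params = dict(parse_qsl(parts.query, keep_blank_values=True))
    params.update(overrides)
    # safe=":" leaves the colons in the sort value unescaped.
    return urlunsplit(parts._replace(query=urlencode(params, safe=":")))


# Placeholder URL for illustration only.
url = "http://search.example/search?q=budget&proxystylesheet=CBC"
print(set_params(url, sort="date:D:S:d1"))
# http://search.example/search?q=budget&proxystylesheet=CBC&sort=date:D:S:d1
print(set_params(url, lr="lang_fr"))
# http://search.example/search?q=budget&proxystylesheet=CBC&lr=lang_fr
```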

Things We’re Working On

Here are some of the “neat” things we’re working on:

Current weather conditions: You can get the current weather conditions in the search results by suffixing your query with “weather”. For example, you can get Toronto’s latest weather conditions by typing in “toronto weather” as your search query. Right now this only works for a select number of cities.

Latest News: If you use a query that is contained in one of today’s news stories, you will see a link to that news story at the very top of your results, highlighted with a blue background.

Feel free to pose any questions about our search engine, or suggest any features you’d like to see, in a comment. If you would like more technical detail on how to hack our search results, you can find it on the Google API page.

  CBC.ca web site, Under the Hood Posted at 10:47 am (29 Nov 2006)



Skinning CBC.ca

Under the Hood

[Note: this article was published prematurely... it has recently changed.]

With the “recent” redesign of CBC.ca a lot of work has gone into making the site XHTML compliant. The good news is that sections like the homepage and news stories can have their “look and feel” adjusted using cascading style sheets. This is not limited to simple font changes: you can also hide content you don’t want to see, or reposition elements of the site to suit your browsing style.

My CBC.ca
I’m not a graphic designer or HTML wiz, but I spent about an hour today hacking up a CSS file that would change the way I see the homepage and news stories. This is what I came up with:

New Front Page

Feel free to download this CSS and apply it in your browser.

Getting it to work in your browser
Applying your custom style sheet is easy in either Firefox or Internet Explorer:

Firefox:
You can install the Style Sheet Chooser to allow you to specify your own custom style sheet for each site.

Internet Explorer:
You can specify a style sheet to use for all websites, so your customized CBC style sheet might conflict with your other favorite sites. Click on Tools, then Internet Options. Click on the “Accessibility” button and specify your style sheet.

How It Works
Think of CSS as a filter that you can apply to text. Your web browser downloads the HTML content then applies the style sheet to make it look the way it is defined in the CSS file. Elements of the site (which are contained in things called div tags) act as containers to hold your text. This is how you can “hide” large chunks of the site by telling your browser to not display that div tag.
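
To make the “hide a div” idea concrete, here is a small Python sketch that writes out the kind of rule a custom style sheet would contain. The ids “masthead” and “sidebar” are invented for illustration; the real ids would have to be read from the page source:

```python
def hide_rules(div_ids):
    """Return CSS text that tells the browser not to display each div."""
    return "\n".join(f"div#{div_id} {{ display: none; }}" for div_id in div_ids)


# "masthead" and "sidebar" are made-up ids, not actual CBC.ca markup.
print(hide_rules(["masthead", "sidebar"]))
# div#masthead { display: none; }
# div#sidebar { display: none; }
```

Saving that output to a file gives you a style sheet you could load using the browser instructions above.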

Using your own CSS file will not save any bandwidth or reduce site load times. You are still downloading all of the page assets (JavaScript, images, HTML) regardless of whether your CSS file tells your browser to display them.

How CBC.ca Uses CSS
We use CSS to send a “print friendly” version of the site to your printer. This lets us keep a single HTML file instead of generating multiple HTML files of the same content for different outputs (print, mobile, email, etc.).

When you click on the “print” link in a news article, your browser applies the print CSS styles to the HTML and tells your printer to print based on that. You can preview what this looks like by using the “print preview” option in your browser. You will notice that all the navigation and mastheads are gone.

What can you come up with?
I’m curious to see what kind of things you can do by customizing the site CSS. Feel free to post a link to your customized CSS in the comments section for others to try.

You can design a page with specific accessibility needs in mind (for the visually impaired, perhaps) or add additional content to the “print friendly” version of the news pages.

  CBC.ca web site, Under the Hood Posted at 2:01 am (09 Nov 2006)