Thursday, July 23, 2009

Progress, Problems, and Another Release (soon)

    The problem with thumbnails not always extracting still hasn't been completely solved. I wanted to use a timer, but a JavaScript timer doesn't behave like, say, wait or sleep in C. I tried making the capture function recurse on itself until the thumbnail string was at least 1000 characters long, but that exceeded the recursion depth too easily when websites load extremely slowly. I also can't simply put it in a loop, because Firefox decides my script is frozen while it spins waiting for the page to load, and then Firefox itself freezes. The Mozilla Developer Center says here that it isn't a good idea to extract page thumbnails during the onload event, and that I should use the MozAfterPaint event instead - but that only exists in Firefox 3.5. For now I just show a default image whenever the thumbnail couldn't be extracted properly.
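    If I do end up requiring Firefox 3.5, the capture would hang off a MozAfterPaint listener instead of onload. A minimal sketch of what I have in mind, assuming a captureThumbnail() routine that does the actual drawing (that name is just a placeholder):

        // Sketch (Firefox 3.5+ only): capture after the first paint following load,
        // rather than during the onload event itself.
        function scheduleThumbnail(win, node) {
            var handler = function (event) {
                win.removeEventListener("MozAfterPaint", handler, false);  // capture only once
                captureThumbnail(win, node);  // placeholder for the real drawing routine
            };
            win.addEventListener("MozAfterPaint", handler, false);
        }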
    Yesterday Jon Pipitone gave me lots of suggestions for improvement. One idea that I really like is a filtering function to remove all pages from a certain website. I don't think this would be too hard to implement: a simple regular expression comparison while building the graph could filter out nodes of the chosen type.
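    A minimal sketch of the kind of filter I mean, assuming the graph is built from an array of database rows that each carry a url property (the exact shape in BreadCrumbs may differ):

        // Drop every entry whose URL matches the user's filter pattern
        // before the nodes are handed to the graph builder.
        function filterEntries(entries, pattern) {
            var filter = new RegExp(pattern);          // e.g. "wikipedia\\.org"
            var kept = [];
            for (var i = 0; i < entries.length; i++) {
                if (!filter.test(entries[i].url)) {
                    kept.push(entries[i]);
                }
            }
            return kept;
        }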
    The big issue to tackle is adding some sort of relevant ordering to the nodes. This is hard because I have to work around the limitations of JSViz. What I think I can do - for now - is make root nodes appear in a sensible manner. Currently they spawn at random positions across the screen, which isn't helpful. I can alter it so that root nodes appear in chronological order, lined up either vertically or horizontally. This only helps when there are at least a few root nodes, but it is better than nothing. I'll try it and see whether people think it helps.
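    Roughly what I'm picturing, as a sketch only - the x/y properties and the way JSViz exposes node positions are assumptions on my part:

        // Line root nodes up horizontally in chronological order instead of
        // letting them spawn at random. rootNodes is assumed to already be
        // sorted by first-visit time.
        function placeRootNodes(rootNodes, spacing, topMargin) {
            for (var i = 0; i < rootNodes.length; i++) {
                rootNodes[i].x = spacing * (i + 1);
                rootNodes[i].y = topMargin;
            }
        }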
    Two recent features I've added: a Quick Edit Title button and Live Updates. The Quick Edit Title button sits on the browser toolbar and instantly opens the window for editing a node's title - basically a shortcut. Instead of getting to a site, thinking of a little note to leave, opening the graph, finding that node, right-clicking, and entering the new note, you can just click this button and do it right away. Live Updates means you no longer have to reload the graph every time you log a new site; it updates itself in real time.
    I haven't gotten much feedback from posting my extension a few weeks ago - but then again, there were only 10 downloads. Soon I'll be putting it up on my CSLab page, with all the latest features. Before I do that I want a more reliable way to organize nodes (or at least root nodes), and to test everything even more. And I have to write up another set of release notes - that's never fun, though.

    And finally, I found out that JSViz has a limit on the number of edges coming from or going to a single node: 18. Eighteen edges are fine, but adding a 19th causes everything to catastrophically fail. I only discovered this because Ainsley tested her Trac plug-in on a large repository and it failed spectacularly, so I tested it myself by going to Wikipedia and opening every link on one page in new tabs until it crashed. My theory is that the minimum distance set for the magnetic forces cannot be maintained once the 19th node is created, since there is not enough room for all of the nodes to spread out evenly around it. It could also be that the magnetic repulsive forces are so compacted and balanced that the 19th node throws the center node off center, causing it to rocket across the screen. It must have something to do with the forces, because a root node can have more than 19 edges attached to it without breaking. Here are some example files:

    Latest Version of BreadCrumbs

    (Note: I'm providing it early, with no release notes, just to illustrate this per-node edge limit; some features still are not completely ready [Stop Animation still has bugs, along with some deleting issues]. If you have an older version of BreadCrumbs, please go to your Firefox profile directory and delete "breadcrumbs.sqlite", as that old database does not have a thumbnail column.)

    Session file that illustrates the breaking [Copy into a text document and save as .session]

    Session file that shows a root with over 19 nodes/edges [Copy into a text document and save as .session]

Friday, July 17, 2009

Thumbnails

    I'm revisiting the idea of having thumbnails - not as the nodes themselves, since that would make the graph far too large, but appearing when you mouse over a node. Now that I know more about Javascript and Firefox it wasn't very hard to get a thumbnail of each page. However, even a thumbnail that's 20% the size of the real webpage becomes huge when converted to a string using the Canvas method .toDataURL(). I was getting thumbnails that were, again, nearly 10,000 characters. So at first it seemed like there was no way I could reliably insert a string that long into the SQLite database for each node.
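    For reference, the capture itself is roughly the following - a sketch of the chrome-privileged canvas approach, with the 20% scale and PNG output just being the values I've been experimenting with:

        // Draw the content window into a scaled canvas and serialize it to a data URL.
        // Only works in chrome-privileged extension code, since drawWindow is chrome-only.
        function pageToDataURL(contentWindow) {
            var scale = 0.2;
            var width = contentWindow.innerWidth;
            var height = contentWindow.innerHeight;
            var canvas = document.createElementNS("http://www.w3.org/1999/xhtml", "canvas");
            canvas.width = Math.round(width * scale);
            canvas.height = Math.round(height * scale);
            var ctx = canvas.getContext("2d");
            ctx.scale(scale, scale);
            ctx.drawWindow(contentWindow, contentWindow.scrollX, contentWindow.scrollY,
                           width, height, "rgb(255,255,255)");
            return canvas.toDataURL("image/png");   // this string is what would be stored
        }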
    After talking it over with Steve, he suggested temporarily storing the thumbnails in the cache, so that when the user saves, the result would be a directory structure. The idea is similar to how Firefox saves webpages: it creates a directory for the page and saves all of the images inside it. I figured this would work, but I didn't like the idea of having to save a folder instead of just one file.
    I spent at least three hours searching the Cache documentation on the Mozilla Developer Center, but there were no examples of how to use it, and the descriptions of all the cache-related functions were very vague. However, I used this site, Mozilla Cross Reference, that Blake Winton suggested and found this, which was exactly what I needed to understand how the functions work together. Still, I wasn't sold on the idea of storing thumbnails in the cache and having to save a directory rather than a single file.
    Before trying to manipulate the cache I searched for more extensions that deal with thumbnails. I found that WebReview, one of the first extensions I looked at, had been updated. Its source code had changed a lot, so I hoped to learn more about how it stores thumbnails. Unfortunately, when I installed it, it didn't work at all: the graph portion just kept raising exceptions whenever I tried to open it, and the part I really wanted to see - what it inserts into the database - wouldn't execute either. So, no hope of finding out how WebReview stores thumbnails.
    The next extension I tried was Thumbstrip, which luckily also has to save a lot of thumbnails. I played around with it for a bit and saved a session. Looking at the saved file, it turns out they did it exactly how I had initially planned to: the thumbnails were saved in their text form, and they were very long. I tested it by going to about 25 sites and then saving. The resulting save file was over 5 MB, with nearly 40,000 lines and 6 million characters. But it worked, so I figured I might as well try my initial idea.

    I altered my database and tried inserting a few thumbnails - in text form - into it, and also changed each node's mouseover div to show the thumbnail instead of the title. The result?

  

  It does exactly what I wanted it to. Now I just have to flush out a few bugs, since sometimes the page isn't fully loaded and the thumbnail is incomplete, but that shouldn't be too hard to fix with a timer.
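    The mouseover side amounts to something like this - a sketch, where the tooltip element id and the getThumbnailFor() lookup are placeholders for how BreadCrumbs actually fetches the stored string:

        // Show the stored data URL as an image in the node's tooltip div,
        // falling back to the title when no thumbnail was captured.
        function showThumbnailTooltip(node) {
            var tooltip = document.getElementById("node-tooltip");   // placeholder id
            var dataURL = getThumbnailFor(node);                     // the stored text thumbnail
            if (dataURL) {
                tooltip.innerHTML = '<img src="' + dataURL + '" alt="page thumbnail"/>';
            } else {
                tooltip.textContent = node.title;
            }
            tooltip.style.display = "block";
        }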

New Approach

I've been trying to change the way I program, and this is the first real attempt I've made. It's a simple function - deleting an edge. However, instead of just diving in, coding, running it, and seeing what works and what errors I get, I sat down and wrote out on paper what I'd have to do. I took into account as many boundary cases as I could think of and sketched exactly what I had to do. Here's the final version of what I wrote out:

deleteEdge(edge, edgeContainer)  

    1) Remove edge from edgeContainer (SVG object) - it can't be seen now.
    2) Extract row edge.idx from the database, getting its Source, Destination, and Title. [destination, source, and title are column names of my database]
    3) Attempt to extract similarEntry from the database, where destination = Destination (from step 2) and source != Source (from step 2). This checks whether there is another edge leading to Destination.
  
    if (similarEntry is not NULL) {
        // Then the Destination of this edge to be deleted has at least one other edge leading to it, so after deletion it will NOT become a root node.
        if (similarEntry's row > edge.idx [its row number]) {
            // This means the edge to be deleted is the first entry in the database where Destination (from step 2) appears in the destination column. This matters because the user-altered title for each website is stored in that website's first occurrence in the database. So we have to copy the title from this entry to the next reference of Destination - similarEntry (which we know comes after it, since its row number is greater).
            4) Update row similarEntry to have title = Title (from step 2).
        }
        5) Delete row edge.idx from the database.
    } else { 
        // There is NO other entry in the database where destination = Destination and source != Source, so the Destination of this edge we're deleting will become a root node.
       6) Update row edge.idx so that source='NULL', making it a root. 
    }
   
    7) Reload the graph_page.html, to reset the forces.
    8) Done.

    Once I had this all written up, writing the code was simple, and aside from a few typos it worked perfectly on the first run. This is a major improvement over my standard method of writing the first idea that comes into my head, then running it and fixing bugs over and over until it works.
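    For the curious, steps 2 through 6 map onto mozStorage calls roughly as follows. This is a sketch rather than the exact code in the extension: the edgeLog table and the source/destination/title columns are the ones described above, but treating edge.idx as the SQLite rowid is an assumption on my part.

        // db is an open mozIStorageConnection to breadCrumbs.sqlite.
        function deleteEdgeFromDb(db, idx) {
            // Step 2: pull the row being deleted.
            var stmt = db.createStatement(
                "SELECT source, destination, title FROM edgeLog WHERE rowid = :idx");
            stmt.params.idx = idx;
            stmt.executeStep();
            var src = stmt.row.source;
            var dest = stmt.row.destination;
            var title = stmt.row.title;
            stmt.finalize();

            // Step 3: is there another edge that also leads to dest?
            var similar = db.createStatement(
                "SELECT rowid AS rid FROM edgeLog " +
                "WHERE destination = :dest AND source != :src ORDER BY rowid LIMIT 1");
            similar.params.dest = dest;
            similar.params.src = src;
            var hasSimilar = similar.executeStep();
            var similarRow = hasSimilar ? similar.row.rid : -1;
            similar.finalize();

            if (hasSimilar) {
                if (similarRow > idx) {
                    // Step 4: pass the user-edited title on to the next occurrence of dest.
                    var upd = db.createStatement(
                        "UPDATE edgeLog SET title = :title WHERE rowid = :rid");
                    upd.params.title = title;
                    upd.params.rid = similarRow;
                    upd.execute();
                    upd.finalize();
                }
                // Step 5: remove the edge itself.
                db.executeSimpleSQL("DELETE FROM edgeLog WHERE rowid = " + idx);
            } else {
                // Step 6: nothing else leads to dest, so turn it into a root.
                db.executeSimpleSQL("UPDATE edgeLog SET source = 'NULL' WHERE rowid = " + idx);
            }
        }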

Friday, July 10, 2009

Features to Come

    So I've gotten the first draft of the release notes up. I hope they'll suffice. It took a lot longer than I had originally planned, due to testing and documentation. I'm pretty picky about documentation and about renaming variables and functions to make everything more understandable.
    The testing process mostly found errors where I had changed variable names. The frustrating bugs came from boundary cases and specific sequences of events that would lead to an error or a site not getting logged. One example: having multiple tabs open before turning logging on causes problems, since those sites never get logged (no load event occurred to log the URL) and thus would create unconnected non-root nodes. To solve these I had to add multiple if-else statements to make sure that every property required to successfully log the URL was present. This is especially hard in the case of pages loaded in background tabs through "open in new tab". The problem is that background loading does not trigger any progress or event listeners, so I have no reliable way to log the site. For now I have it set on a 500 millisecond timer - it just needs to load the URL so that it does not default to 'about:blank' - but this does not always work. I could change how "open in new tab" works by automatically switching focus to the newly opened tab, but users may not like an extension altering default Firefox functionality. At least I have a backup safeguard so the extension doesn't break: whenever a tab is selected, if its URL has not already been logged it gets entered into the database, but as a root node. This doesn't accurately re-create the user's browsing session, but it keeps things stable.
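    The safeguard amounts to something like the following sketch, where isLogged() and logAsRoot() stand in for the real BreadCrumbs routines:

        // Whenever a tab is selected, log its URL as a root node if it was missed.
        gBrowser.tabContainer.addEventListener("TabSelect", function (event) {
            var url = gBrowser.currentURI.spec;
            if (url != "about:blank" && !isLogged(url)) {
                logAsRoot(url);   // not a faithful reconstruction, but keeps the graph stable
            }
        }, false);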
    Lately I've been reading some textbooks in my free time, most recently "Artificial Intelligence and Software Engineering - Understanding the Promise of the Future". After reading the first couple of chapters - a general overview of software engineering ideas - I've realized that I tend to follow a Run-Understand-Debug-Edit style of development. I find it easier to keep a mental model of the program when I do incremental development of this sort. The result is that the program generally works (except for a few unconsidered boundary cases) but is also messy. For example, the main procedure for logging websites currently consists of many bits and pieces and messy subroutines, since over time I have slowly widened the scope of what it can do. It works, but it's a mess. I plan on taking a couple of days to clean it up - to read over all the pieces and reorganize them so the logic flows naturally. My hope is that this will increase efficiency, reduce the line count, and prevent future bugs.
    I've finished with the basics, so now I have a few options on what to tackle next. Not all options are needed, and it really depends on what would be the most useful feature to have:
    1) Thumbnails
        I had some problems early on with attempting to use thumbnails, but now that I have more experience with Javascript and Firefox I may be able to figure out a way to use thumbnails as the nodes.
    2) Relevant ordering of nodes
        Right now the layout of the graph carries no particular meaning; it just uses magnetic and spring forces to spread the nodes out as evenly as possible. I could add a new layout type that lists everything in chronological order, or a way to display only one specific website and all sites that came from it or linked to it. The graph could also be filtered by site, so that a scientist could, for example, see only papers from one specific domain.
    3) Graph manipulation
        More features for users to edit the graph: New Edge, Delete Edge, Collapse Node, anything. These functions would be designed to present a clearer and more fluid graph for not only personal reference but sharing with others as well. 
    4) Significance of a node
        It was suggested two weeks ago, when I did my demo for the grad students, that I add a feature to show how significant or important a website is. I could do this by logging the length of time spent on each site and then scaling the size of the node to match. This would be more useful for the casual user than for a scientist, since for a scientist the time spent on a page most likely correlates with the length of the article or paper being read. Still, it is an interesting idea, and it gives the user more information about the browsing session at a single glance. A rough sketch of how the timing could be captured follows below.
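    The sketch for option 4 - purely illustrative; the timeSpent map and nodeSizeFor() helper are not existing code:

        // Accumulate time spent per URL by timestamping each switch between pages,
        // then scale node size from the totals.
        var timeSpent = {};          // url -> total milliseconds viewed
        var currentUrl = null;
        var enteredAt = 0;

        function switchedTo(url) {
            var now = Date.now();
            if (currentUrl) {
                timeSpent[currentUrl] = (timeSpent[currentUrl] || 0) + (now - enteredAt);
            }
            currentUrl = url;
            enteredAt = now;
        }

        function nodeSizeFor(url, minSize, maxSize) {
            var minutes = (timeSpent[url] || 0) / 60000;
            // Cap the growth so one long visit doesn't dwarf the rest of the graph.
            return Math.min(maxSize, minSize + minutes * 2);
        }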

Thursday, July 9, 2009

Release Notes

DOWNLOAD:

   Download here from rapidshare. It can only be downloaded 10 times, so if the link is down please send me an e-mail so I can re-upload it.

INSTALL:

    1) I suggest setting up a new profile - just to be safe - as well as keeping an eye on the Firefox Error Console. To set up a new Firefox profile see this short document: http://support.mozilla.com/en-US/kb/Managing+Profiles

    2) Drag BreadCrumbs.xpi into your Firefox browser ( >= 3.0)

    3) Click Install

    4) Done!

TO USE:

    In the bottom right-hand corner of the Firefox status bar there will be a red icon with URL on it; that is the main control icon for BreadCrumbs.

    To begin logging websites, simply left click the icon. It will turn green to signify that it is running. Browse away! It can be clicked again to turn off - essentially pausing the logging - and clicking once more will resume where you left off.

    Right clicking the icon will present a context menu.

        Save Session: This will save your current session to a file.

        Load Session: This will allow you to load a saved session file.

        New Session: This will erase any logged websites and start you over with a fresh graph.

        View Session: This will allow you to view the graph.

    Logged Browsing History (the graph)

        Show Session Trail: This shows a list of the links from site to site in your session, ordered from earliest (top entry) to most recent (bottom entry). Each entry shows the destination site and the exact date and time. Hovering your mouse over an entry highlights it, along with the edge it describes. The entry can be clicked to bring up a window for entering a new annotation for that link [Please do not include "|||" - three pipes - in the annotation].

        Reload: [Self explanatory]

        Pause Animation: When the page is loaded the nodes of the graph will continually spread out so that all are visible. Once they are spread out enough for you, you can click the Pause Animation button to stop them from spreading further. Note: any clicking on the graph will cause the animation to resume.

The Graph itself:

    Nodes: Each node of the graph is a website. Most nodes can be dragged around the screen, but if a node is outlined in red then it is a root node and cannot be dragged. Hovering over a node brings up a tooltip with the title of the site. Right-click on a node to open a context menu.

        Edit Title: This opens a window to allow you to rename the node to anything you want - please, without the character sequence "|||" [three pipes].

        Visit Site: This opens the website in a background tab in your browser.

        Collapse Node: [NOT IMPLEMENTED YET].

        New Edge: [NOT IMPLEMENTED YET].

        Delete Node: This will delete the node and all edges that connect to it. This may result in multiple disconnected graphs, which is fine. [Deleting a node causes an automatic reload to reset the magnetic/spring forces - I am working on a better solution].

        Close: This simply closes the context menu.

    Edges: There are two types of edges, solid and dotted. Solid edges are formed when a link is clicked (or selected to "open in new tab"), so they correspond to direct links. Dotted edges are any other type of visit: a bookmark, clicking "Home", manually entering a URL, opening a new tab (ctrl+t) and then entering a URL, etc. If logging is paused during a session and then turned back on later, the resulting edge will be dotted, since there may have been many sites in between.

    Hovering over an edge causes it to be highlighted, along with the corresponding entry in the Session Trail panel (if it is not hidden), and also displays the link annotation. Right clicking on the edge will bring up a context menu.

    Edit Annotation: This will open a small window to enter any annotation you wish for the edge [Again, avoid "|||"]. Clicking on the corresponding entry in the Session Trail will also open the window.

    Delete Edge: This will delete the edge and reload the graph automatically.

    Close: This closes the menu.

KNOWN BUGS/ISSUES:

  • "open in new tab" sometimes causes the resulting webpage not to be logged, or to be logged as a root node.
  • New Edge and Collapse functions are not implemented.
  • Selecting an improper file in the Load Session option causes it to break.
  • Saving/Loading a very long session is slow.
  • Sometimes the forces between edges and nodes cause the graph to spread out very widely; the temporary fix for this is the "Pause Animation" button.
  • Not allowing websites to load sometimes results in two copies of the same link appearing in the Session Trail panel.
  • Favicon extraction isn't the best, but it should work for most sites.  
  • Browsing with multiple windows has not been significantly tested.
  • The colours don't match in any sense.

CURIOUS?:

    To view the source code rename BreadCrumbs.xpi to BreadCrumbs.zip, and unzip. If you want to see where everything is being stored go to your Firefox profile directory and look for breadCrumbs.sqlite. I use the Firefox extension SQLite Manager to check the contents of breadCrumbs.sqlite and the edgeLog table to see what's what.

Wednesday, July 8, 2009

Abstraction and Separation of Concerns

    Tuesday's lunchtime meeting was about Abstraction and Separation of Concerns, with a lot of emphasis on open, collaborative science. I thought the discussion was very compelling, so I'd like to discuss (very informally) some thoughts on the issues.

    Issues for "Open Collaborative Science":

        1) Misuse and Misinterpretation

        Science is generally a messy process. The steps taken to reach a conclusion and deduce an explanation may not always be completely rigorous, in the full sense of the word. This is just the way science works. However, the average person does not need to know about the stumbles scientists went through - and they shouldn't. Making all of this information visible to the public - every step, test, and experiment that led to the conclusion - adds the possibility for people to poke holes in the work based on, as they see it, "bad science". This goes beyond trying to do "good science" and giving collaborative support or constructive criticism. It could come from people simply being afraid of or wary of the results, or even from ideological principles. Anyone who does not like a new scientific development because it contradicts their preset ideology could go to the online notebook and see all the potential flaws, without fully understanding the heuristics and assumptions that scientists work under, or the constraints they have to work around.

        To combat this flaw I feel that the system should have multiple levels of openness - you can only dig as deep as your qualifications permit. I call this Limitation by Qualification. To effectively maintain each user's level of qualification, the system would have to be maintained by one or more governing authorities from multiple scientific fields. Users would establish their scientific qualifications for their field (and other fields) through an application process. The process could be simple: Do they have articles in respected journals? Do they have advanced degrees in their claimed area of expertise? Questions of this nature would have to be asked. Once this is done, when scientists make entries or updates to their online notebooks, the data files, results, conditions, etc. can all be locked so that they are viewable only by someone with the proper qualifications. The idea would be to have the abstract and a general sense of what the scientist is doing available to the public, with the finer details of the work available to those who understand it.

        2) Trust: How do I know this won't be used against me, or stolen?

        Ideologies and fear of change are not the only drivers of potential misuse of this data. Sometimes there are rivalries between scientists, and theft of work does occur. Limitation by Qualification would not work here - both scientists are equally qualified. To solve this, an idea similar to Limitation by Qualification would be needed, perhaps something along the lines of a Limitation by Intentions. While it is easy for a human to figure out what I mean by Limitation by Intentions, it is not a simple concept to represent in a computer system. A rating system could possibly be used to "blacklist" unethical scientists by majority opinion, but this has potential for abuse: scientists who merely disagree with the majority view could be "kept quiet" through blacklisting. It also relies on catching the scientist in the act, and thus does not prevent the initial theft. Another option would be to let the owner of each online lab book control who can view the bottom layer of their work. This would prevent theft, but it defeats the very purpose of the system.

    If the above issues can be overcome, then the tools built to facilitate open collaborative science need a couple of things:

        1) Standardization for Accessibility by all Disciplines

        The main reason (that I see) for pursuing open collaborative science is to let scientists share information easily - to remove the "red tape" that slows down advancements in technology and science. Scientists should be able to share, but how do we help them find the information they need in an area full of foreign concepts and details? My idea would be to use an Interfacer. The Interfacer would be, at a basic level, an expert system. It would take the user's request, along with the data objects of the user's current experiment, and extract a list of relevant information and data sources for them. It would have to be highly modular, with two (possibly more?) modules required: an expert system for the input, tailored to understand the concepts and terminology of the user's area of expertise, and an expert system for the output, tailored to understand the concepts and terminology of the discipline holding the desired information. How these two modules would interact is a difficult question. One problem is that the issue is not well defined; there is no clear mapping of terms from one discipline into coherent terms of another, for example. The Interfacer would have to infer, from the information the user thinks they need, the information they actually need. This issue came up in the discussion with the example Steve gave about a scientist interested in plant growth (I think?) who wanted information from climate simulations about the state of the climate at a specific location 50 years from now. The scientist got the results and would then draw conclusions from them. Steve pointed out that conclusions drawn this way would not be scientifically sound, since the scientist does not fully understand the assumptions the climate model was built upon. For something like this to work, clear and concise assumptions would need to be stated for each experiment or observation - like disclaimers for use by others. And for an Interfacer to do its job efficiently, each piece of data across the whole network of open notebooks would have to be extractable and carry the information relevant to the extraction process, be it tags or otherwise. This could be maintained by storing semantic information with each data object in the system, to build up an "intelligent" description.

        2) Collaboration Tools

        I can't say much for this part, but in my view Google Wave is the right idea. Real-time editing by multiple users is exactly the sort of collaboration people need. To fully utilize this, standards would have to be adopted so that tables of data, multiple file types, and any other experimental data could be easily maintained. Google Wave also suggests a solution for how each piece of data could be tagged on the fly. The Google Wave demo showed a real-time spellchecker which analyzes the content of a sentence as it is typed to fix typos and grammatical errors. The same approach could be applied to the content of the data or the abstract, to come up with a relevant "blurb" describing the piece of information in question (as mentioned in point #1). However, this turns the issue of referencing into building a network of semantically linked objects, and thus it falls into the niche of a natural language processing problem.

    Now that I've gotten those thoughts out of the way I have to get back to writing release notes for my extension.

Thursday, July 2, 2009

Tuesday's Meeting and Event Listeners

    When I demoed my application during the Tuesday meeting the grad students seemed interested; I was glad for all the feedback and suggestions. I've already finished altering it as Jon Pipitone and others suggested: link-following is now represented by solid lines in the graph, and other visits (typing in a URL, clicking Home, a bookmark, etc.) by dotted lines. I thought this was going to be hard, but Johnathon's suggestion of using document.referrer let me make the change in less than 20 minutes. One suggestion I really want to pursue is making the size of a node correlate with the length of time spent on that website - a great way to make meaningful sites stand out. My immediate concern is that if someone visited one site many times, say 6-7 times in one session, and each visit was only a minute or so, the overall node size could be a false representation that does not agree with their view of the "importantness" of the site.
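    For reference, the referrer check boils down to something like this sketch, with lastLoggedUrl standing in for however the previously logged page is tracked:

        // If the freshly loaded document names the previous page as its referrer,
        // the visit came from following a link, so the edge is drawn solid.
        function edgeStyleFor(doc, lastLoggedUrl) {
            return (doc.referrer && doc.referrer == lastLoggedUrl) ? "solid" : "dotted";
        }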

    Having the sidebar list the trail of sites visited was not overly helpful: since each site is only listed once, it really just states the time of the first visit. So I instead opted to list the edges, giving a complete overview of the order of links followed, which conveys more relevant information. But I'm finding it very difficult to keep the graph coherent while keeping the number of edges to a minimum and adding as little clutter to the screen as possible. For example, hitting the Back button does not log an edge, so if a user has a large graph, hits Back a few times, and then clicks a new link, the resulting graph gives no indication that the user backtracked. The only way to tell is to scroll through the sidebar of links followed in order and watch the corresponding edges of the graph highlight. It works, but visually the graph looks a bit odd when two links listed directly after one another in the sidebar are on opposite sides of the graph.

    My most recent accomplishment has been the move from a timer loop to an event listener for loads. It still fires multiple times, even after following the Mozilla suggestions to have the event fire only when documents load. I just added a few if statements to keep boundary cases from getting through; it works, but it looks sort of messy - "if (on && aEvent.originalTarget.nodeName == "#document" && !(curURL == pastURL) && !(curURL == 'about:blank'))" is what I currently have. The last condition handles opening a new tab and then typing in a URL. One of the main reasons for switching to event listeners (aside from it being the standard way to do something like this) is that with a timer, browsing with multiple tabs and switching between them produced an incoherent mess of a graph. Now looking back and forth between tabs does not create an edge in the graph. My next goal is to fix the issue of clicking a link and selecting "open in new tab". The problem is that the tab opens in the background (not in focus), and the load event doesn't seem to register with the listener I added to the gBrowser object. My hope is that this is an issue Mozilla has dealt with and I just have to dig through their documentation to find the right event to listen for.
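    For completeness, the listener setup plus that guard looks roughly like this - logUrl() and the on/pastURL state are placeholders for the real logging code:

        var on = true;        // logging toggle (the status bar icon flips this)
        var pastURL = "";     // last URL that was logged

        gBrowser.addEventListener("load", function (aEvent) {
            var doc = aEvent.originalTarget;
            var curURL = doc.location ? doc.location.href : "about:blank";
            if (on && doc.nodeName == "#document"
                   && curURL != pastURL
                   && curURL != "about:blank") {
                logUrl(curURL, doc.referrer);   // placeholder for the database insert
                pastURL = curURL;
            }
        }, true);   // capture=true so loads in content documents reach this listener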

    Although I have tabbed browsing sort of working, I'm still behind schedule - I guess giving myself less than a week to release my extension for live testing was a bit short-sighted. I think by next Wednesday I should be ready. I'll stop adding new features and instead focus on making everything as understandable as possible.