Initial Idea: Obviously, use a graph structure where each node is a URL and the edges represent a click from that URL to another or the other way around (directed could be optional). The edge could easily be remembered: when the user clicks a link make the edge between that link and the last URL. This implementation seemed fairly intuitive to me, however I quickly ran into the problem of how the graph itself would be presented to the user. Using Javascript's rudimentary graphics tools to draw a dynamic and coherent web graph would be a painstaking task. Nodes would have to be evenly spread so that there is room for each branch, and even then I would have to add some complex geometric computations to ensure that each branch has enough space (relative to pixel real estate) to display all of its own edges and nodes without inadvertently overlapping other branches, reducing the graph to an unreadable mess of lines and thumbnails. However nice this option may sound I just cannot justify devoting a large portion of my work period to the development of algorithms to generate aesthetically pleasing graphs.
Second Idea: A more reasonable approach to visualizing the users web browsing session would be through a tree. A tree structure allows for a clear representation similar to web graph, but the initial computation to generate children of a node seems far more straight forward. As long as each node keeps track of how many children it has, centering these children in a straight row below their parent node requires multiplying the set size of each node (relative to pixels) by the number of nodes, with a few pixels keeping the nodes from touching.
My idea for the structure: Each Node object would have five attributes: Its URL, Date/Time of initial visit (this session), an Array of other Node objects, a count of how many children this node has, and a Depth attribute to show its level in the tree.
URL: Straight forward, this is the most important bit of information to keep track of.
Date/Time: By giving a Date/Time attribute to each node I can give the ordering of a Node's children some significance, for example the ordering of children from left to right could represent the earliest visit to the most recent visit.
Array: For the recursive building of the tree.
Number of Children: This number is required to make computing the pixel real estate needed an easier processes, as briefly outlined above.
Depth: The root would be considered depth 0. This property is a late edition that I thought of last night to try to solve the problem of overlapping. Even though I can easily compute where children will be located I still had no way to prevent children from overlapping. However, if I can search for each node at a certain depth I can list them side by side such that none overlap - and spread them out in an even manner.
Doing this implementation I would have to have a set number of pixels to consider the maximum width. This wouldn't be all that hard, but I would then need to consider what would be a good bound on the maximum expected number of Nodes at each level, and make sure that my max pixel-to-Node-size-ratio is able to accommodate this.
Some Potential Problems:
1) Currently I have no way to deal with a cycle in the tree. A possible solution would be while building the graph flag a Node if that Node has already been entered into the tree but is a child of another Node - instead of adding it to the tree again or creating a path back to that Node which would most likely overlap other Nodes the next Node in the tree could be a Node with no children and arbitrary properties that says "See Node _____". But this creates a loss of relevance in what the tree is trying to convey (see next problem).
2) How can I distinguish Site A being reached from Site B from, say, being at Site A then typing a completely new website into the search bar - Google, for example. In my structure there would be a path from Site A to Google even though there was no direct link following. This could potentially be a problem, but it could also be the right thing to do - the Google query could be provoked by the content of Site A and therefore there really should be a link between them. Would this be a good design choice? The flaw is that in most cases Google (or the users preferred search engine) would be visited multiple times per session; if I use the possible solution to problem 1 then all relevance between Site A and the result of the Google query would be lost.
Closer look at WebReview and thoughts of it:
Since WebReview is a Firefox extension that has a history graphing utility similar to what I am researching I decided to take a closer look at it, well more accurately a closer look at its source code. After digging through the collection of Javascript files to find the ones related to the graph function I was greeted to nearly 2,500 lines of code with 80% of the documentation written in German - and that's only for the visualization of the graph, not the data structure of it. Unfortunately, even though it took this company 2,500 lines of code to display their graph the were doing it what I thought was the EASY way, in a tree structure. To be fair, their tree has nice graphics such as children of a parent not being displayed until the parent is clicked, causing the window to slide into focus of that node.
What WebReview is missing is an annotation feature. I feel I could implement this reasonably well, except to maximize its usefulness a Session and its resulting graph would have to have the ability to be saved. Saving into a html document as WebReview does is a complicated process, so instead an option I could pursue would be to save just the Nodes and their properties into a CSV file. Loading would only require parsing the file and rebuilding.
One useful bit of information I discovered is that WebReview uses an sqlite database to store all of the extracted history from the Firefox history manager. I would personally rather use an abstract data type to store all of the Nodes in, something sortable like a Heap, but I'm unsure if this is even possible to do in a Javascript. I know I can create the data structure, but is maintaining it with a potentially large amount of data too intensive for Javascript? I may have to use a database like sqlite but that would require spending another week or so learning the ins and outs of it. One problem I foresee is that the building of the graph would be very slow with a Heap structure since it would have to be sorted with respect to the Depth property, so that computing the layout. (Note that using a tree structure does not require the Heap since everything is referenced to each other through each Node's Array of children). Also, my idea for the structure would be a complete waste if I had to end up using sqlite to store everything.
Similar Projects:
I found
this great site that lists some projects that were being developed about a decade ago, similar to what I am attempting. What I envision is something similar to the SurfSerf application listed there (but less 90's Geocities-ish, I hope). Unfortunately the site is no longer maintained and none of the links work - but the pictures are representative enough.
To Accomplish Today:
- Research the limitations of Javascript in terms of how large data structures can be managed.
- Create a test extension that adds URLs to an array, with a button to open a window listing each of the stored URLs.
- Maybe learn some sqlite.