Friday, May 29, 2009

Thumbnails, Thumbnails, Thumbnails

    Today I decided to work from home, since all I'm really doing is research. I've been looking at a few Firefox extensions that involve web page thumbnails in one way or another, trying to determine how each one generates its thumbnails.

    After jumping back and forth between every method that touches a "thumbnail" reference, I think I've found the source: the following SQLite statement

  "SELECT moz_historyvisits.id, title, domain, visit_count, screenshot, dayvisits, moz_places.url, visit_date " +
  "FROM moz_historyvisits " +
  "JOIN moz_places ON moz_historyvisits.place_id = moz_places.id " +
  "LEFT JOIN wr_places ON wr_places.url = moz_places.url " +
  "WHERE moz_historyvisits.id = ?1 " +
  "AND visit_type NOT IN (4,7);";

    Every thumbnail reference originates from this SQLite statement and, in particular, from the line

  var thumbnail = createRootNodeStatement.getUTF8String(4);

     which gets the column at index 4 from the executed statement - whatever is in the "screenshot" column. I was hoping that screenshot would be a picture, but according to the database initialization the screenshot column is just text. What really confused me the first few times I read over the source code is that not once, in any of the JavaScript files in WebReview, does any data ever get inserted into the database. I know most of the tables dealt with here are built into Firefox and its history management system, but there is still the table made by WebReview

webreviewDBConn.executeSimpleSQL("CREATE TABLE IF NOT EXISTS wr.wr_places ( url TEXT PRIMARY KEY, frequency REAL, dayvisits INTEGER, domain TEXT, subdomain TEXT, screenshot TEXT, daysession INTEGER );");

    which never receives any data directly from a call in the JavaScripts. The only conclusion I can reach is that in the long SQLite statement, where the screenshot is extracted, the line "LEFT JOIN wr_places ON wr_places.url = moz_places.url " synchronizes the data in the moz_places table with wr_places to somehow generate useful data for the screenshot column. I found this site which gives a rough idea of what each of the tables in the places.sqlite database contains, but none show any obvious screenshot-related information. What I really need is an SQL expert to decode that statement for me, in hopes of getting more clues as to where the screenshot data comes from.
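    Until I find that expert, here is my own rough, unverified reading of the statement, clause by clause (the comments are my guesses, not WebReview's documentation):

  // My annotated copy of the WebReview query - the comments are my interpretation only.
  "SELECT moz_historyvisits.id, title, domain, visit_count, screenshot, dayvisits, moz_places.url, visit_date " + // screenshot, dayvisits and domain can only come from wr_places
  "FROM moz_historyvisits " +                                        // one row per recorded visit
  "JOIN moz_places ON moz_historyvisits.place_id = moz_places.id " + // attach the page (URL, title, visit_count) for that visit
  "LEFT JOIN wr_places ON wr_places.url = moz_places.url " +         // attach WebReview's own row for the same URL, if one exists;
                                                                     // with a LEFT JOIN, screenshot is simply NULL when there isn't one
  "WHERE moz_historyvisits.id = ?1 " +                               // look up a single visit by its id
  "AND visit_type NOT IN (4,7);";                                    // skip certain visit types (I believe 4 = embedded, 7 = download)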

    Screengrab uses some Java methods to generate its screenshots, the key ones being

// Capture the screen region and stream the PNG through Screengrab's Base64 output stream
// (b64os presumably wraps baos, a ByteArrayOutputStream - my reading, not confirmed).
var image = new java.awt.Robot().createScreenCapture(new java.awt.Rectangle(box.x, box.y, box.width, box.height));
  Packages.javax.imageio.ImageIO.write(image, "png", b64os);
  b64os.close();
  return "data:image/png;base64," + baos.toString();

    where box holds a reference to the screen dimensions. Screengrab uses its own custom class, Base64$OutputStream, which I may have to decompile and read. The program basically gets the screen capture through the Robot's createScreenCapture method and saves it to their custom Base64 OutputStream. What's returned is the Base64 string representation of the image, which can be applied to any HTML img object through img.src = the returned string. I like this way of getting the screenshot - nothing is actually being saved to disk, only the raw data of the screenshot is stored in the image src attribute. But would it be okay to decompile this custom-made Base64$OutputStream file, understand it, then use it for my own extension?
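    Just to remind myself how that return value gets used, assigning it to an image should be as simple as this (the element id and variable name are mine):

  // The string Screengrab returns is a data: URI, so it can go straight into an image's
  // src without anything ever being written to disk.
  var thumb = document.createElementNS("http://www.w3.org/1999/xhtml", "img");
  thumb.src = capturedDataUri;   // e.g. "data:image/png;base64,iVBOR..."
  document.getElementById("thumbnail-area").appendChild(thumb);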

    One extra piece of information I found while reading the source code is that there appears to be a method to save files built into Mozilla. nsIFilePicker creates an open/save dialog box, so this would be very useful for implementing a future save/load function for the graph. Also, the nsIFile XPCOM interface should help in creating temporary files if needed.
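    Based on the MDC examples, a save dialog with nsIFilePicker would look something like this (untested, and the title string is mine):

  // Show a "Save" dialog and get back an nsIFile pointing at the chosen location.
  var nsIFilePicker = Components.interfaces.nsIFilePicker;
  var fp = Components.classes["@mozilla.org/filepicker;1"].createInstance(nsIFilePicker);
  fp.init(window, "Save Graph", nsIFilePicker.modeSave);
  fp.appendFilters(nsIFilePicker.filterAll);
  var result = fp.show();
  if (result == nsIFilePicker.returnOK || result == nsIFilePicker.returnReplace) {
      var file = fp.file;   // an nsIFile - the graph data would be written here with an output stream
  }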

    Showcase produces a nice thumbnail view of all the pages in the tabs you currently have open. The source was pretty daunting to rummage through - one file had nearly 7,000 lines - but when I found how they actually make the thumbnails I was surprised, since it was so easy. It uses the drawWindow method for canvases, which renders the entire web content, given the dimensions, into a canvas object. I don't know why the previous two extensions didn't use this method instead of their elaborate workarounds. I suppose there is the possibility that, for the Screengrab extension, drawing to a canvas does not help with actually saving the image. Using a canvas seems to be the most reasonable route to create the thumbnails. However, JSViz uses the CSS style property backgroundImage to set the image background, and it takes a URL as the parameter. This means I won't be able to use the JSViz library to create the graphs if I use canvases to draw the thumbnails on.
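    As far as I can tell, the heart of the Showcase approach boils down to something like this (it needs chrome privileges, and the thumbnail size here is arbitrary):

  // Render the current page into a canvas, scaled down to thumbnail size.
  var win = gBrowser.contentWindow;   // the page to snapshot
  var canvas = document.createElementNS("http://www.w3.org/1999/xhtml", "canvas");
  canvas.width = 200;
  canvas.height = 150;
  var ctx = canvas.getContext("2d");
  ctx.save();
  ctx.scale(canvas.width / win.innerWidth, canvas.height / win.innerHeight);
  ctx.drawWindow(win, win.scrollX, win.scrollY, win.innerWidth, win.innerHeight, "rgb(255,255,255)");
  ctx.restore();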
    On closer inspection it appears that WebReview uses a canvas to draw its thumbnails as well, except it uses the drawImage method instead of the drawWindow method. This could be useful if I have problems with the drawWindow method. However, I would still need to know what information is contained in the screenshot column of the database table. Based on what I learned from the Screengrab extension, I think that since it is a string it is most likely the Base64 encoding of the image that gets stored in that column. How the information gets there with no table inserts is a mystery to me, though.
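    If that guess is right, painting a stored screenshot onto a canvas would look roughly like this (the canvas id and layout variables are placeholders):

  // Assuming the screenshot column holds a Base64 data: URI string.
  var img = document.createElementNS("http://www.w3.org/1999/xhtml", "img");
  img.addEventListener("load", function() {
      var ctx = document.getElementById("graph-canvas").getContext("2d");
      ctx.drawImage(img, nodeX, nodeY, 200, 150);   // position comes from the graph layout
  }, false);
  img.src = thumbnail;   // the string read out of the screenshot column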
  
    As it stands I've made one step forward - I think I could do thumbnails! - but one step back - no more JSViz. I'll have an interesting time on Monday trying to find a graphing application/library that is compatible with Javascript canvas object (Steve told me not to design my own algorithm - thank god!). There is one more option I found though, which would work with JSViz. PageGlimpse offers a service to that offers developers access to thumbnails of any web page. Sounds perfect, right? Well, almost. It is indeed free, but only if I use under 300gb/month. This is a lot of bandwidth for just some small thumbnails, but to access the site through an application I write I would have to send my developer key (obtained through a free sign up) so the amount transferred would be linked to me. Also, since this is a FireFox extension, my hopes is that many people would be using it but the more people use it the more potential there is to go over the limit. Plus the site is still only in the beta stages; tying my extension to a site that I have no control over and could possibly shut down or decide to charge a fee does not sit well with me. Still, it is an interesting service to offer.
 
    The last thing I'd like to make a note about is this Python script using Mozilla to create a thumbnail of any URL. It's deceptively simple and only a hundred lines or less, but if it works correctly (haven't bothered testing) it would be a good option to look into if I have to cut my application away from being a FireFox extension and default to be a stand alone browser application. 
 
    I'll spend the rest of the day cleaning up my test extension, which now almost doesn't require clicking to log every website.

Wednesday, May 27, 2009

Making progress

    Throughout the morning I convinced myself that I'd finally choose which route to go - learning Flash and Flare, or JSViz. Well, JSViz won. I know that in my last post I said I really wanted to use Flash, but that seems unreasonable at the moment. Maybe once I have the guts of the extension finished I could experiment with different visualization options. Plus, I figure that a lot of people aren't very fond of Flash, and I'd have to spend quite a long time learning essentially two new sets of syntax - Flash probably isn't very similar to any languages I already know. JSViz has some nice examples that I should be able to cut and paste from to get the general structure of how it does things. Also, it can create pretty nice-looking graphs. But JSViz requires an XML file of the node data to build the graph from - JavaScript cannot write to local files, so I can't create this file on the fly. I'm hoping that I could have a standard XML file to load, with a built-in JavaScript that will automatically run and edit the DOM properties of the file to create the node structure that JSViz's XMLTreeGraphBuilder requires. After looking through the XMLTreeGraphBuilder source code I'm still unsure if it reads in the document by loading it or by just scanning the lines. If it loads the document then my approach should work; however, if it just scans the file I may be out of luck.
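    If XMLTreeGraphBuilder really does work from a loaded XML document rather than scanning the raw file, what I have in mind is roughly this (I haven't checked JSViz's actual schema, so the element and attribute names are placeholders):

  // Load a stub XML file shipped with the extension, then grow its DOM in memory
  // before handing it to JSViz.
  var req = new XMLHttpRequest();
  req.open("GET", "chrome://myextension/content/graph-stub.xml", false);   // synchronous for simplicity
  req.send(null);
  var doc = req.responseXML;

  var node = doc.createElement("node");          // whatever element JSViz actually expects
  node.setAttribute("id", "3");
  node.setAttribute("url", "http://example.com");
  doc.documentElement.appendChild(node);
  // ...then pass doc to XMLTreeGraphBuilder instead of a file on disk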

    I've finally worked most of the kinks out of my test extension. It properly loads and displays the logged URLs, clears the display field when Clear Display is clicked, removes an entry from the database by selecting one in the list box and clicking Remove, and goes to the currently selected URL when Go To... is clicked. I'm having a bit of trouble re-displaying the list of URLs after one is Removed - I think the window.location.reload() function stops any function executing after it, since reload would re-run the JavaScript. But that's only a usability annoyance which I'll solve once the main project is near completion.

Tuesday, May 26, 2009

Beginning Version 2 - Integrating SQLite

    I've decided to bite the bullet and learn to use SQLite; I figure it's easier to do this than dig through Mozilla's documentation for solutions to my data management problem. Surprisingly, SQLite isn't that complex to learn. The Mozilla wrapper functions for dealing with SQLite, on the other hand, need a bit of cleaning up, in my opinion. Current references:

    My most valuable reference, though, is the source code for WebReview. Without this I would be completely stuck, as the Mozilla documentation covers only the very basics. The biggest problem I'm running into is that every error I receive in the Firefox Error Console isn't documented anywhere, except in a few sparse posts by people working on various Mozilla projects. At least I've been able to solve most of the problems by copying the WebReview source code and editing it line by line to suit my needs.
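    For my own reference, the basic pattern I've pieced together from the WebReview code and the MDC storage page is roughly this (the file and table names are mine):

  // Open (or create) a database file in the profile directory and run a statement on it.
  var file = Components.classes["@mozilla.org/file/directory_service;1"]
                       .getService(Components.interfaces.nsIProperties)
                       .get("ProfD", Components.interfaces.nsIFile);   // the profile directory
  file.append("urllogger.sqlite");

  var storageService = Components.classes["@mozilla.org/storage/service;1"]
                                 .getService(Components.interfaces.mozIStorageService);
  var dbConn = storageService.openDatabase(file);   // creates the file if it doesn't exist

  dbConn.executeSimpleSQL("CREATE TABLE IF NOT EXISTS url_log (url TEXT, timeEnter TEXT);");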

    While looking through the WebReview source code I discovered a whole separate class dealing specifically with the thumbnails - over 5,000 lines of code. This just reinforces my belief that adding thumbnails to the graph is too far beyond my skill level. Since the final version would then lack some visual flair, I'm really going to focus on integrating SQLite and then learning some Flash to use the Flare API, creating really nice, interactive graphs.

What I accomplished today:

  • Learned the basics of SQL, enough to create and manipulate tables.
  • Learned some of the Mozilla API to access databases, execute commands, and save data.
  • Managed to finally get Components.utils.import to work, after altering my chrome.manifest file.
  • Started version two of my basic Firefox extension: now URLs are logged into a database to solve the problem of sharing data between two windows.

To do tomorrow:

  • Test to make sure everything works appropriately.
  • Implement "Go to..." and "Remove" - Remove is now possible since I can manipulate the database.
  • Start thinking about how to implement "Start" and "Stop" so that the Firefox status bar icon doesn't have to be clicked to log a URL.
  • Figure out why this SQL trigger causes an error in the executeSimpleSQL method:

 CREATE TRIGGER insert_url_time AFTER INSERT ON url_log
 BEGIN
 UPDATE url_log SET timeEnter = DATETIME('NOW')
  WHERE rowid = new.rowid;  
 END;

That's all for today.


Monday, May 25, 2009

Visualization

Today I put my test Firefox extension on hold and instead decided to dive into the problem of creating the graph - and it is indeed a problem. From what I've found so far, embedding thumbnails is going to be the most challenging aspect, one which I may simply have to drop. Even so, creating the graph myself is going to be difficult.

Option 1: JSViz library

This is a JavaScript library for creating graphs out of DOM objects. It seems easy to use in the sense that the coordinates of nodes do not have to be specified. However, this option is not very visually pleasing. In this tutorial I found that the images for the nodes have to be supplied through a URL - I would either need a generic node image packaged with the extension and hope JSViz accepts relative URLs, or find a way to save thumbnails of web pages. The latter option does not seem possible, as JavaScript does not have access to the local hard disk.

Option 2: Flare

I really like the visualization options of Flare; it seems like it would be perfect to make up for the loss of thumbnails. The downside to using Flare is that I have to learn ActionScript and Flash, two things I've never used before. Also, the only data that can be passed from JavaScript to ActionScript (to pass the logged URLs and associated data to the visualization) is a string. I could either find a way to express the array of URL objects in a specially formatted string and parse it within the Flash file to regenerate the data, or learn something like SQLite and maintain a database of the information. Neither option seems too pleasing to me, but I will most likely have to go the database route.
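The "specially formatted string" option would not be much code on the JavaScript side - something like the sketch below, where the delimiter and the fields are completely arbitrary choices of mine - but the ActionScript side would still have to split it all apart again.

  // Flatten the logged URL objects into a single delimited string for the Flash side.
  function serializeForFlash(urlObjects) {
      var parts = [];
      for (var i = 0; i < urlObjects.length; i++) {
          parts.push(urlObjects[i].url + "," + urlObjects[i].visitTime);   // assumed fields
      }
      return parts.join("|");   // e.g. "http://a.com,12:01|http://b.com,12:05"
  }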

I would really love to use Flare - if I end up doing this browsing history visualizer - but the sheer amount of learning I would have to do would slow the programming down a lot. What I would have to learn from scratch is: JavaScript, XUL, the Mozilla API, Flash, ActionScript, the Flare API, and SQLite. That is a lot of topics to learn for one relatively straightforward - in principle - Firefox extension. What's really holding me back is the actual graph part - dynamically creating a graph of data is considerably hard; dynamically creating a graph of data stored in a JavaScript object which cannot be exported to, say, a temporary file is frustratingly harder.

If I use Flare, I think my first step would be to decide whether I want to pass the data from the JavaScript through a specially formatted string or just bite the bullet and learn how to use SQLite.

Friday, May 22, 2009

I dislike Mozilla

    I'd just like to say that Mozilla has terrible documentation for its API. They provide basic examples of what you can do with their functions, but what's the point if they don't work correctly? None of the options the Mozilla development site listed for sharing data between windows actually worked; every one I tried gave me some sort of error that wasn't explained anywhere. The mechanism I'm currently using only shares a copy of the objects, so editing them is pointless. JavaScript code modules kept giving me "Permission Denied to access Components.utils.import" errors, and the only reason I can think of would be that two instances of the JavaScript are run at the same time attempting to open the same module - but that's the very problem they claim to solve: "JavaScript code modules New in Firefox 3 is a simple method for creating shared global singleton objects that can be imported into any other JavaScript scope. The importing scope will have access to the objects and data in the code module. Since the code module is cached, all scopes get the same instance of the code module and can share the data in the module."
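    For the record, the code-module pattern the documentation describes is supposed to look roughly like this (the resource name and module are mine, and the import is exactly the line that keeps throwing Permission Denied for me):

  // urlLog.jsm - a shared module; every scope that imports it should see the same array.
  var EXPORTED_SYMBOLS = ["loggedUrls"];
  var loggedUrls = [];

  // In any chrome script (the browser overlay or the options window),
  // assuming a matching resource mapping in chrome.manifest:
  Components.utils.import("resource://myextension/urlLog.jsm");
  loggedUrls.push(gBrowser.currentURI.spec);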
    Mozilla has something called an XPCOM component, which can be defined to contain any information and be accessible from anywhere. However, the simplest example, the standard HelloWorld.js, is 106 lines - 106 lines for the simplest example. Learning how to make my own XPCOM components may simply not be worth it; I might as well learn something like SQLite and go the database route, since it could be helpful in future courses I take.
 Now I'll begin trying to make some sort of a presentation for the beginning of next week.

Thursday, May 21, 2009

May 21: My First Firefox Extension

    I've spent today trying to develop a test extension, one that also involves saving visited URLs but doing nothing really fancy with them. My goal is to have a status bar icon that can be clicked to save the current URL. Then a menu window will have an area to show the logged URLs and a few extra options. I'll just list a few problems/solutions I ran into while trying to develop it:

  • Some initial XUL research into the types of elements and objects it offers led me to use a listbox to display the URLs in, since it has an appendItem() method which I should be able to access through a Javascript to build up the list on screen.
  • It took me a while to figure out how to properly edit XUL elements through JavaScript, since I initially had no understanding of the DOM objects that XUL uses. I eventually learned the odd commands needed, such as the all-important document.getElementById(id) method. 
  • After a frustrating hour of getElementById constantly returning null I discovered that the document wasn't fully loaded by the time my JavaScript was called, so the object I was trying to reach didn't exist yet. I had to change the script to have a small function, init(), and then add onload="init();" to the window XUL element. Such frustrating details.
  • Once I finally figured that out I could begin designing a layout for the menu window. 
 This part wasn't too hard but required learning a lot more XUL to get a semi-decent layout, mostly utilizing boxes to get things sort of lined up. I don't know if I can use the Start and Stop buttons as I wish yet - those were added under the assumption that the script would run constantly and log URLs on its own. Right now that's just too advanced for me. The Go To... and Remove buttons are some things I thought of which I might add as a challenge to myself to understand more XUL and JavaScript. I have an idea of how to implement them, but only once I get the basics of the extension actually working.

 

  • I wrote a relatively short JavaScript to log the current URL and save it to an array - rather simple. The problem I have is that the status bar icon can easily grab the URL of the browser, since they're embedded in the same window, but my options window (pictured above) has no access to that window. Unfortunately, to have access to the functions in my JavaScript, and the array of URLs, I need to include a call in the options window's XUL file. Doing this executes another instance of my JavaScript, hence leaving me with no access to the built-up array in the other instance of the script. This means I have no way to share data between the two windows - and I need access to the array in the options window to display the URLs when the Display button is clicked. I have the feeling that I'll have to maintain everything in a database, but I would much rather find a way that only involves JavaScript, so let the Googling commence.
  • It seems like I'm finally getting a bit of luck. It turns out that the function to open a window from the main browser window allows JavaScript objects to be passed - albeit in a slightly odd manner (rough sketch after this list). So as it stands my extension is working as intended. I'm unsure if my "Remove" button will be able to function properly though, because the solution I found passes a copy of the array of URLs, so I could remove a specific URL from being displayed, but that would not remove it from the actual structure; there is no way to pass pointers or references from a JavaScript to a new XUL window. There are a few options here for working with objects between multiple windows, which would be a better solution to my problem.
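    Roughly, the passing mechanism looks like this (the chrome URL, window name, and element ids are mine):

  // In the browser overlay script: open the options window and hand it the logged URLs.
  window.openDialog("chrome://urllogger/content/options.xul", "urlLoggerOptions",
                    "chrome,centerscreen", loggedUrls);

  // In the options window's script: whatever was passed shows up in window.arguments.
  var urls = window.arguments[0];
  for (var i = 0; i < urls.length; i++) {
      document.getElementById("url-listbox").appendItem(urls[i], urls[i]);
  }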

Wednesday, May 20, 2009

Interesting Article

I was reading this article about provenance and browser history which I found really interesting. A couple thoughts:

  • They presented an interesting solution to the problem of cycles: versioning and time stamping. The authors suggest that a graph of the browser history may contain multiple versions of the same site, depending on when it was visited and which link was followed.
  • Current browsers do not register any meta-data between two pages in their history files when those pages are opened together - there is no timing data saved.
  • An interesting concept for history management would be to integrate a prolific search engine such as Google when searching browser history. If you can't remember an important site that you visited earlier but you know generally what it was about, coming up with the search query to find the page again may be very difficult. If a search engine is integrated into the searching, then when a query is entered it could go through the search engine but filter the results to display all relevant pages that also appear in the browser's history.
  • While this article discusses many points about using the actual stored browser history as the meta-data, the same ideas can be applied to a real time relational capturing tool.
  • Surprisingly, to date no browsers have integrated graph support for displaying user history.

May 20: A Review

Initial Idea: Obviously, use a graph structure where each node is a URL and the edges represent a click from that URL to another, or the other way around (directed edges could be optional). The edges could easily be remembered: when the user clicks a link, make an edge between that link and the last URL. This implementation seemed fairly intuitive to me; however, I quickly ran into the problem of how the graph itself would be presented to the user. Using JavaScript's rudimentary graphics tools to draw a dynamic and coherent web graph would be a painstaking task. Nodes would have to be evenly spread so that there is room for each branch, and even then I would have to add some complex geometric computations to ensure that each branch has enough space (relative to pixel real estate) to display all of its own edges and nodes without inadvertently overlapping other branches, reducing the graph to an unreadable mess of lines and thumbnails. However nice this option may sound, I just cannot justify devoting a large portion of my work period to the development of algorithms to generate aesthetically pleasing graphs.


Second Idea: A more reasonable approach to visualizing the user's web browsing session would be through a tree. A tree structure allows for a clear representation similar to a web graph, but the initial computation to generate the children of a node seems far more straightforward. As long as each node keeps track of how many children it has, centering these children in a straight row below their parent node requires multiplying the set size of each node (in pixels) by the number of nodes, plus a few pixels to keep the nodes from touching.
 My idea for the structure (a rough code sketch follows the list below): 
  Each Node object would have five attributes: Its URL, Date/Time of initial visit (this session), an Array of other Node objects, a count of how many children this node has, and a Depth attribute to show its level in the tree.
  URL: Straight forward, this is the most important bit of information to keep track of.
  Date/Time: By giving a Date/Time attribute to each node I can give the ordering of a Node's children some significance, for example the ordering of children from left to right could represent the earliest visit to the most recent visit.
  Array: For the recursive building of the tree.
  Number of Children: This number is required to make computing the pixel real estate needed an easier process, as briefly outlined above.
  Depth: The root would be considered depth 0. This property is a late addition that I thought of last night to try to solve the problem of overlapping. Even though I can easily compute where children will be located I still had no way to prevent children from overlapping. However, if I can search for each node at a certain depth I can list them side by side such that none overlap - and spread them out in an even manner.
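 A rough sketch of that Node object as I'm imagining it (nothing final):

  // One Node per visited URL in the session tree.
  function Node(url, visitTime, depth) {
      this.url = url;               // the page this Node represents
      this.visitTime = visitTime;   // Date/Time of the initial visit this session
      this.children = [];           // Array of child Node objects
      this.childCount = 0;          // kept separately to simplify the pixel real estate computation
      this.depth = depth;           // the root is depth 0
  }

  Node.prototype.addChild = function(child) {
      this.children.push(child);
      this.childCount++;
  };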

 Doing this implementation I would have to have a set number of pixels to consider the maximum width. This wouldn't be all that hard, but I would then need to consider what would be a good bound on the maximum expected number of Nodes at each level, and make sure that my max pixel-to-Node-size-ratio is able to accommodate this.
 
 Some Potential Problems:
  1) Currently I have no way to deal with a cycle in the tree. A possible solution would be to flag a Node, while building the graph, if it has already been entered into the tree but is also a child of another Node - instead of adding it to the tree again or creating a path back to it (which would most likely overlap other Nodes), the next Node in the tree could be a Node with no children and arbitrary properties that says "See Node _____". But this creates a loss of relevance in what the tree is trying to convey (see the next problem).

  2) How can I distinguish Site A being reached from Site B from, say, being at Site A and then typing a completely new website into the search bar - Google, for example. In my structure there would be a path from Site A to Google even though no direct link was followed. This could potentially be a problem, but it could also be the right thing to do - the Google query could be provoked by the content of Site A and therefore there really should be a link between them. Would this be a good design choice? The flaw is that in most cases Google (or the user's preferred search engine) would be visited multiple times per session; if I use the possible solution to problem 1, then all relevance between Site A and the result of the Google query would be lost.

 
Closer look at WebReview and thoughts of it:
 Since WebReview is a Firefox extension with a history graphing utility similar to what I am researching, I decided to take a closer look at it - well, more accurately, a closer look at its source code. After digging through the collection of JavaScript files to find the ones related to the graph function I was greeted by nearly 2,500 lines of code with 80% of the documentation written in German - and that's only for the visualization of the graph, not its data structure. Unfortunately, even though it took this company 2,500 lines of code to display their graph, they were doing it the way I thought was the EASY way, in a tree structure. To be fair, their tree has nice graphics, such as children of a parent not being displayed until the parent is clicked, causing the window to slide that node into focus.
 What WebReview is missing is an annotation feature. I feel I could implement this reasonably well, except that to maximize its usefulness a Session and its resulting graph would have to have the ability to be saved. Saving into an HTML document as WebReview does is a complicated process, so an option I could pursue instead would be to save just the Nodes and their properties into a CSV file. Loading would only require parsing the file and rebuilding the tree.
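 The save format could be as simple as one CSV line per Node, something along these lines (just a sketch; URLs containing commas would need escaping):

  // Flatten the tree into CSV lines of: url, visit time, depth, parent URL.
  function nodeToCsvLines(node, parentUrl, lines) {
      lines.push([node.url, node.visitTime, node.depth, parentUrl].join(","));
      for (var i = 0; i < node.children.length; i++) {
          nodeToCsvLines(node.children[i], node.url, lines);
      }
      return lines;
  }
  // Loading would just reverse this: split each line and re-link children to their parents.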
 One useful bit of information I discovered is that WebReview uses an SQLite database to store all of the history extracted from the Firefox history manager. I would personally rather store all of the Nodes in an abstract data type, something sortable like a Heap, but I'm unsure if this is even possible to do in JavaScript. I know I can create the data structure, but is maintaining it with a potentially large amount of data too intensive for JavaScript? I may have to use a database like SQLite, but that would require spending another week or so learning the ins and outs of it. One problem I foresee is that building the graph would be very slow with a Heap structure, since it would have to be sorted with respect to the Depth property before the layout could be computed. (Note that using a tree structure does not require the Heap, since the Nodes reference each other through each Node's Array of children.) Also, my idea for the structure would be a complete waste if I ended up having to use SQLite to store everything. 
 

Similar Projects:
 I found this great site that lists some projects that were being developed about a decade ago, similar to what I am attempting. What I envision is something similar to the SurfSerf application listed there (but less 90's Geocities-ish, I hope). Unfortunately the site is no longer maintained and none of the links work - but the pictures are representative enough.

 
To Accomplish Today:
 - Research the limitations of Javascript in terms of how large data structures can be managed.
 - Create a test extension that adds URLs to an array, with a button to open a window listing each of the stored URLs.
 - Maybe learn some sqlite.

Previous Wiki Entries

May 19
Graphical Representation of Browser History 
      Today I went through a few tutorials on creating Firefox extensions - learning XUL, understanding JavaScript, getting the file structure and chrome options straightened out, etc. I found that making a JavaScript run continually while the browser is open is not all that hard. However, getting a function to run exactly once when a webpage is loaded is frustratingly hard, as the onload event listeners execute multiple times. After over an hour of searching I have yet to find a solution. So far my workaround is having a status bar button that has to be clicked to log the URL, but as said before, it would have to be clicked for every webpage that the user wants to log into the graph. 
    Once I began searching I immediately ran into WebReview 
        WebReview is a Firefox extension that has a number of features, most relevant to me being its WebGraph feature 
        "WebReview Graph cannot only display the complete tree you browsed currently, but visualizes also all browsing sessions with every visited web page in your history in a graph. You can find out, how you found are very interesting page even if it was months before (if your history of Firefox goes back that far). In addition, you will be able to export an entire graph to a single file if you want to archive it or to share." 
        I decided to test this out and it is exactly what Steve was looking for. It creates a complete graph of websites visited, what led to what, and it does this for the entire browser history. Sessions are all automatically saved. It even has thumbnails of each website for the graph nodes. Should I continue trying to create my own version of this? 
        A few changes I would make: 
            This extension lacks the ability to make annotations between two webpages in the graph. 
            Building multiple graphs based on Firefox's history seems very process intensive, I would change it so that the user can start/stop the script whenever they would like. Once stopped the graph could be viewed, annotated (optionally), and saved. 

May 15
Graphical Representation of Browser History 
        This project seems like it could not be done efficiently in Django, since Django is tied to a specific set of web pages. To get the most use out of this I would need it to generalize to all web pages - a browser plug-in. So today I will be researching the development of Firefox browser extensions and learning the languages they use (XUL and JavaScript). However, this would limit the application's usability to only Firefox, unless Firefox extensions can be ported to other browsers with ease. Should I continue down this route and forget about Django, or continue with Django and try to develop this as a web app to integrate into a site for finding relevant research articles? As of now I am still unsure, so for the time being I will research the difficulty of Firefox extension development and then examine the restrictions of presenting graphs with Django - to keep my options open. 

        So I spent most of the afternoon reading about JavaScript and XUL. A Firefox extension might not be so bad, but after about an hour of browsing I'm still unsure if I can make a JavaScript which will run the entire time that the browser is open. I initially thought that I could build it as a Greasemonkey script, but the limitations of Greasemonkey are just too much. So far my best option seems to be to build the extension and add a button to the toolbar which, when clicked, automatically extracts the current URL and stores it in some sort of abstract data structure. The problem is that I would need the JavaScript to run continually so that the data isn't lost, and I'm unsure if this is even possible to do efficiently with JavaScript. If Django could be used to make browser extensions then that would be great, since it seems like I need some sort of database to store the URLs. Also, if I implement it in this fashion it requires the user to click the button on every page they want logged. One button click isn't so much, but after a few dozen pages it may get annoying - having it automated would be much better. I may have to see how a client-side database could be managed when a Firefox session is opened and manipulated by JavaScript, but this seems like an awkward workaround to solve the simple problem of saving the URL obtained from the JavaScript. Unfortunately my knowledge of JavaScript consists of the minimal amount that I've learned today, and dynamically managing information in a data structure, or even the simple problem of trying to have the script run continually through the session, has escaped me. 

Social Network through Subversion 
    Idea: Divide each file into Sections (divide the line number by a certain amount, say 10), then upon committing see which Sections the altered lines fall within. Or divide by Methods/Routines as opposed to Sections. These options seem like the most challenging to develop since they would require a lot of process-intensive code scanning, but is going by module/file a useful option for anyone? 
    Can see how they relate through: 
        Bug Tickets (Does Trac/DrProject list which files contain the bug?) 
        Bug Fixes (Possibly examine the date/time that the bug ticket was closed and who closed it, then find the files and changes committed to the repository at that time period by that user) 
        Time Frame (Simply based on files being committed at the same time by the same user) 
        Overall (Combining all of the above) 
    Structure: 
    We need a data structure which can easily be interpreted to a visual representation. This would require maintaining a database of all of the relations, if we want the user to be able to sort and filter out certain ways that files relate - maybe the user doesn't care about which files were committed when a bug ticket was issued. We could have two Django models, Relations and Files. Each File would have to maintain a reference to all of the Relations it has, and the file name in the repository that it references. 
    Each Relation could have a number of fields: the two File models it relates (Only relate two files, so that in essence Relations constitute the edges of the graph), the user who caused the relation, the date/time it occurred, and the type (bug ticket, bug fix, commit, any more?). 
    With an implementation like this there could be many Relations between the same two Files. This could be represented in the graph as the weight of the edge (the Relation) and this we can filter what portion of the overall weight of the edge comes from which type of Relation. 
    This allows us to do normal graph traversal routines such as Depth First Search from a specific File in the repository, or finding whether there are any groups of Files (Connected Components) which are disjoint from other groups of Files. 

May 12

A few things I was thinking about during our meeting today with the grad students.

Integrating Branches in Subversion 
Does the Subversion client have access to each branch for cross checking multiple branches against each other? 

This ties in heavily with the awareness concept; once we have an idea of which sections of code and which modules appear to have a strong connection then people with multiple branches of the main project can be informed when another person on another branch has altered the same section of code/portion that has a strong connection. 

When person A commits changes they made to their branch to the repository, Subversion's logs could be used to automatically scan the other branches (assuming Subversion has access to each branch). If the changes person A made involve strongly connected code portions/modules of another current branch person B is working on, then automatically send a message to person B informing them that person A should be contacted to clear up any possible errors or conflicts. (This could be an e-mail or a notification through a blog widget.) 

Issues: What counts as a "current branch"? How often are these branches re-integrated back into the main project? Would abandoned side projects be removed or left as dead branches forever - many open branches could result in unneeded stress on the system, having to cross-check many branches which no longer matter to the overall project. 

Code Awareness 

Related sections of code could be represented through graphs - the "strength" of the relation between two code sections/modules can be measured through their distance from one another in the graph, along edges. 

Sessions? A "play by play" of what person A did on date B with file C. 
Issues: Potentially lots of useless data to filter (typos, syntactic errors, etc). 
Either person A has to write a blurb about what they were doing or whoever is reading the session has to try to deduce what person A was fixing/altering/implementing. 

Indicate related code with line-by-line references, clicking as if to set a debugging break point? (Problem: This would have to be done through an IDE plugin) 

Use Subversion's logs to establish differences upon committing to see which lines of code have been altered or which files have been changed together in each submission. 

What would be the nodes of the graph? Individual lines? Functions/Routines? Do it by module/class? 

Possibly try two levels of interaction - module/file based (to give a broad idea of the interaction) and code-block based (to give a detailed level of the relation). But, how can we efficiently identify a code block? 

This seems like it would either require an extensive amount of meta data to build up a reliable graph of connected components or a large initial investment of time on the part of the people who know/wrote the code. 

Keep track of what sections of code/files each person on the project team has worked on. This would allow for searching by user. 

This would also help person A see the history of their own changes they made (easier than manually looking through Subversion logs) - enter their name, a specific file, and a date range to search within, then the output could be sorted by, say, most frequently edited code block or into chronological order or by file.

This could be implemented through indexing Subversion logs for each user on the system. 

Versions of Data Sets 

Tag crucial sections of the code that begin the initial processing of the data set. 

Crucial sections must be specified to begin with. 

Possibility: Any section that deals directly with processing raw data should be considered a "crucial section". 

Default: Load latest version of the data. 

Fallback: Upon errors, check if the error occurred within any of the crucial section, then re-run the program using the next latest version of the data. Continue this process until we have a successful run or there are no more versions (no more versions means its a problem with their code, not a problem in loading the proper data set). 

What changes in newer versions of the data sets - Simply more data? New syntax? Additional properties for each data entry?