matt.griffith - thinking out loud

Jog – my personal Google & Wayback Machine

Every day I see something on the net and I think “that’s cool; I need to remember that”. And every day I try to remember a site or a page that I saw before. But every day I am frustrated by the fact that Google doesn’t know me. Google doesn’t know what I’ve seen. Google doesn’t know what I like. Because of that, finding stuff that I know I’ve seen before is harder than finding something for the first time.

Early in my internet career I used “Bookmarks” and “Favorites” to remember sites. But they didn’t help because there were too many, and I could never remember why I saved them. Worse still, my Favorites were bound to the computer I saved them on. It seems I never need to find something on the same computer I first saw it on.

Next came Backflip. Backflip let me save my favorites on a central server – a blessing and a curse. It made my favorites portable – which was great – but it also exposed them to the problems of a free internet service early in the 21st century. Without fail, I need my favorites at the exact moment Backflip decides to go offline for days. Ouch.

What about my browser history, you ask? That has helped me exactly 4 times in the last 7 years – not a great record. The reason is simple: browser history is organized around the sites that I visit or the time I visited them. If I knew the site I was trying to remember, I wouldn’t be trying to remember it. If I could remember the exact time I visited it, then the history might help: I could look at every page I visited that day until I found the one I was looking for. But I rarely remember exactly when I saw something. Also, the browser history is bound to a single computer the same way favorites are. That’s not much help when I use several different computers on a regular basis.

Is there a better way? I’ve been thinking about that question for a while. I imagine a lot of other people have too. I started a virtual project – virtual because it only exists as a thought experiment right now – to solve some of these problems. I’m calling my project Jog.

What would Jog do?

Jog would help me remember stuff. It would save the pages that I visit or that I’m interested in. It would allow me to search those pages later. It would allow me to browse through the pages as I saw them.

How would it do this? It could act as my web proxy. As I visit pages it could work in the background squirreling away all of the nuts that I might want to get back to later. Acting as a proxy, Jog would know every page I visit. Jog would even have a copy of every page I visit. It would be easy for Jog to save those pages at the same time.
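Here is a minimal sketch of the "save it as it passes through" step, not a real proxy. Everything in it is hypothetical: `PageStore`, `relay`, and the table layout are names made up for illustration, and the `fetch` argument stands in for whatever the proxy would actually use to retrieve a page.

```python
import sqlite3
import time


class PageStore:
    """Hypothetical store a Jog proxy could call for every page it relays."""

    def __init__(self, path=":memory:"):
        self.db = sqlite3.connect(path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS pages ("
            "url TEXT, fetched_at REAL, body TEXT)"
        )

    def save(self, url, body, fetched_at=None):
        # Keep every visit, even repeat visits, so the history can be replayed.
        if fetched_at is None:
            fetched_at = time.time()
        self.db.execute(
            "INSERT INTO pages VALUES (?, ?, ?)", (url, fetched_at, body)
        )
        self.db.commit()

    def visits(self):
        # Pages in the order they were seen.
        return self.db.execute(
            "SELECT url, fetched_at FROM pages ORDER BY fetched_at"
        ).fetchall()


def relay(url, fetch, store):
    """Per-request work of the proxy: fetch the page, save a copy, pass it on."""
    body = fetch(url)
    store.save(url, body)
    return body
```

Because every visit is stored with a timestamp, the same table could later support browsing pages in the order I saw them.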

What should Jog save?

Jog could save every page I visit. Is that too much? It’s better to have too many pages than not enough – disk space is cheap after all.

I could tell Jog what I want it to save. I could create a list of sites that I always want it to save – a sort of Jog whitelist. I could also tell Jog which sites I never want it to save – a Jog blacklist. For all other sites it could remember them for a relatively short time and then forget them if I don’t tell it otherwise.
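The whitelist, blacklist, and "forget after a while" rules could look something like the sketch below. The names (`policy_for`, `is_expired`), the two-week default, and the choice to let the blacklist win over the whitelist are all assumptions made for the example, not decisions Jog has made.

```python
from urllib.parse import urlparse

KEEP_FOREVER = "keep"
NEVER_SAVE = "skip"
KEEP_BRIEFLY = "temporary"


def host_matches(host, listed):
    # A listed "example.com" also covers "www.example.com".
    return host == listed or host.endswith("." + listed)


def policy_for(url, whitelist, blacklist):
    host = (urlparse(url).hostname or "").lower()
    # Assumption: the blacklist is checked first so "never save" always wins.
    if any(host_matches(host, b) for b in blacklist):
        return NEVER_SAVE
    if any(host_matches(host, w) for w in whitelist):
        return KEEP_FOREVER
    return KEEP_BRIEFLY


def is_expired(policy, saved_at, now, ttl_seconds=14 * 24 * 3600):
    # Only temporary pages are ever forgotten; the 14-day default is arbitrary.
    return policy == KEEP_BRIEFLY and now - saved_at > ttl_seconds
```

A periodic cleanup job could call `is_expired` on each stored page and delete the ones that return true.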

Jog could use a Bayesian filter to make an educated guess about which pages I might want to save. It could also use Google’s related-sites and backlink searches to find the neighbors of the sites on my whitelist and save those automatically.
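To make the Bayesian idea concrete, here is a small sketch of a plain naive Bayes text classifier with add-one smoothing. It is one simple way to build such a filter, not necessarily the method Jog would end up using. The class name `KeepFilter`, the "keep"/"drop" labels, and the tokenizer are invented for the example.

```python
import math
import re
from collections import Counter


class KeepFilter:
    """Hypothetical naive Bayes guesser: is this page worth keeping?"""

    def __init__(self):
        self.words = {"keep": Counter(), "drop": Counter()}
        self.docs = {"keep": 0, "drop": 0}

    @staticmethod
    def tokens(text):
        return re.findall(r"[a-z0-9']+", text.lower())

    def train(self, text, label):
        self.words[label].update(self.tokens(text))
        self.docs[label] += 1

    def _log_score(self, toks, label):
        vocab = set(self.words["keep"]) | set(self.words["drop"])
        total = sum(self.words[label].values())
        # Prior: smoothed share of training pages carrying this label.
        score = math.log((self.docs[label] + 1) / (sum(self.docs.values()) + 2))
        for t in toks:
            # Add-one smoothing, with one extra slot for never-seen words.
            score += math.log(
                (self.words[label][t] + 1) / (total + len(vocab) + 1)
            )
        return score

    def probability_keep(self, text):
        toks = self.tokens(text)
        k = self._log_score(toks, "keep")
        d = self._log_score(toks, "drop")
        # Turn the two log scores into P(keep) without underflow.
        m = max(k, d)
        ek, ed = math.exp(k - m), math.exp(d - m)
        return ek / (ek + ed)
```

Training data could come for free: pages from whitelisted sites are "keep" examples, and pages I let expire are "drop" examples.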

What about Privacy?

If Jog saves everything, how will it protect my privacy? Do I really want every page I visit saved for all time? Do other people want every page they visit saved for all time? Disk space isn’t the only consideration when deciding what Jog should save.

Where should Jog store the data?

Jog could store everything on one of my computers. It could be split into separate client and server pieces. The server could run on a single Internet-accessible machine. The Jog client could run on each of the machines I use. The local client could act as the proxy for that machine and forward the pages that I visit to the Jog server.

The Jog server would be responsible for saving and indexing the pages. It would also provide central access to the stored pages.
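The indexing half of the server's job might be as simple as an inverted index: a map from each word to the pages that contain it. The sketch below is an assumption about how that could work. `PageIndex` is a made-up name, the index lives in memory only, and a search returns pages that contain every query word, with no ranking.

```python
import re
from collections import defaultdict


class PageIndex:
    """Hypothetical in-memory inverted index for saved pages."""

    def __init__(self):
        self.postings = defaultdict(set)  # word -> set of URLs containing it

    def add(self, url, text):
        for word in set(re.findall(r"[a-z0-9]+", text.lower())):
            self.postings[word].add(url)

    def search(self, query):
        words = re.findall(r"[a-z0-9]+", query.lower())
        if not words:
            return []
        # Use .get so lookups never create empty entries in the index.
        hits = set(self.postings.get(words[0], set()))
        for w in words[1:]:
            hits &= self.postings.get(w, set())
        return sorted(hits)
```

For example, after `index.add(url, page_text)` for each forwarded page, `index.search("bayesian spam")` would return the URLs of pages containing both words.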

Another option: use a central Jog server as my proxy regardless of where I’m surfing from.

The biggest problem with a central server is the lack of an offline mode. I wouldn’t be able to search Jog unless I was connected to the Jog server.

A decentralized P2P system would solve some problems. But the only ready-for-primetime decentralized P2P system that I’m aware of is Groove. Groove is attractive because it is secure and it can handle synchronizing the data between multiple computers. It also provides an easy way to securely share my data if I want to. The Groove license allows you to use it on as many as 5 computers and that should be plenty for most users.

But how would I index files stored in Groove? What about non-Windows users? What about the price of Groove? Will the Groove Web Services be enough, or would I need to create a custom Groove tool? What about performance – is Groove up to the challenge there?

To be continued…

Related reading



http://www.matthew-jones.com/publish/index.php/link/bayesian
http://www.paulgraham.com/spam.html
#a258


http://www.glennf.com/gmblog/archives/00000005.html

Related Products


http://www.enfish.com/desktop/features.asp [found via related:www.glennf.com/gmblog/archives/00000005.html … Google never ceases to amaze me!]
http://www.aimingtech.com/aimatsite/home.htm
http://www.harmonyhollow.net/fsearch.shtml