puf -- TODO
-----------
Parts denoted with ! discuss (the benefit of) a feature.
This list is supposed to be sorted roughly by implementation order
(not necessarily by importance).

+ read ~/.pufrc. just like an implicit -i file.
+ lower memory usage
  - use a tree structure
  - optimize alignments
  - don't use sizeof, as it includes padding up to the struct alignment
+ redirs other than appending '/' should create a symlink
+ Wget-like -k switch. this should also rewrite file extensions -
  .php is pretty pointless in a local copy.
+ spread requests over both -ib and -iy, not only -ib via -iy. opt -xm.
+ simple machine-parsable logging format
+ support multioffset
+ better proxy support:
  - make CGI-proxies actually work; escape some chars ("@" is common)
  - proxy readiness wait queue
+ handle content-/transfer-encodings
+ header templates for client faking
+ cookie handling
+ robots.txt handling
+ support SSL
  ! does anybody do recursive or parallel fetching from secure sites?
+ support FTP
+ write better documentation


NOT TODO
--------
- support gopher


IDEAS
-----
consider huffman-encoding url fragments.

nuke the request queue. instead, put dirty marks on not-yet-fetched leaves
and dirty sub-node counts on non-leaf nodes. the tree would be traversed
until the dirty counts are all zero (a rough sketch of such a traversal
follows further down). there are not that many different url param blocks,
so put them in a hash indexed by "key nodes" and look them up while
traversing the nodes.
considerations:
- summed up, are lingering inlined urld_t's really more efficient than a
  separate queue that shrinks?
- if not, could file_t's be shrunk and relocated?
- take care of references, like referer backlinks
- freeing in pools does not work, so make per-directory pools which are
  shrunk at once

param block switches work in a separate linear queue as well; flag the
presence of a switch with a bit.

use fewer pointers in the tree:
- no next ptrs: serialize lists into (fragmented) per-hierarchy-level pools.
  -> search/add performance will suffer, extremely so with name compression.
     unless memory is reserved (= wasted), the pools would have to be
     constantly resized for additions -> relocation problem, again.
- per-hierarchy-level parent ptrs. determine the current stream with a
  binary search (cache last hit(s)). -> quite slow

(by default,) don't save referers for on-host refs - use the parent dir
instead. possibly save redirection referers for cloaking redirs in the
-xO dump.

rethink where -xh headers are saved.

consider partial downloads.

merge http_req.c & http_rsp.c into http.c.
extract stuff from http_conn.c & http_rsp.c into file.c.

decouple disposition from aurl. pool open multi-src dispositions.

ref-count all kinds of option sub-structs; dispose in time.
optimize adden() by "finalizing" options only if no urls ref them yet.

rethink the path shortening magic - it clashes when multiple urls are
given on the command line.

coalesce identical auths - there can be plenty of them from a huge -i file.

use threads instead of processes for dns helpers. use an async dns
resolver lib. store dns ttls.

maybe ignore -l for requisites. otoh, frames are considered requisites as
well, and we certainly want neither no nor unlimited recursion for them.
the proper solution is to parse the tags properly to know what needs to be
recursed; this is important for -A as well. frames should be considered
both links and requisites, actually.

we should not follow references in pure requisites, as this easily causes
host spanning.

java applets don't seem to be fetched?

-l will lead to different results depending on which route was taken to a
page. fix: re-do the recursion decision on every addition attempt ...
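
a rough sketch of the dirty-count traversal idea from further up. this is
purely illustrative; node_t, n_dirty and fetch_leaf() are invented names
and do not correspond to puf's actual structures:

  typedef struct node node_t;
  struct node {
      node_t *children;   /* first child, NULL for a leaf (hypothetical) */
      node_t *next;       /* next sibling */
      int     n_dirty;    /* unfetched leaves in this subtree (1 on a dirty leaf) */
  };

  /* hypothetical fetch hook; in puf this would queue/issue the request */
  void fetch_leaf(node_t *leaf);

  /* visit only subtrees that still contain unfetched leaves; returns the
     number of leaves cleared, so the caller can keep its own count in
     sync without needing parent pointers */
  static int traverse(node_t *n)
  {
      int cleared = 0;
      node_t *c;

      if (!n->n_dirty)
          return 0;             /* whole subtree already fetched - skip it */
      if (!n->children) {
          fetch_leaf(n);
          n->n_dirty = 0;
          return 1;
      }
      for (c = n->children; c; c = c->next)
          cleared += traverse(c);
      n->n_dirty -= cleared;
      return cleared;
  }

the point of the counts is that already-fetched directories are skipped in
O(1), which is what would make an explicit request queue unnecessary.
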
anonymous index.html discovery; symlink to proper file. use ETag for this.

consider nuking au->reloc; use au->http_result_code directly.

add -xD switch: "Regard \"Disposition:\" HTTP headers"

rethink -Q: downloaded or written bytes?

Basic auth should be automatically sent for subdirs.

come.to & co. don't propagate redirects, but serve the redirected content
directly. would have to rename foo to foo/index.html if foo/bar.html shows
up.

http 1.1 servers can host multiple domains on one ip, so ip-based alias
detection is doomed to fail. add options to a) disable the ip magic (and
auto-alias only www*, etc. style host names) and b) manually specify
domain groups.

if an url is redirected to the same path on a different host (not an
alias), check if the root path is redirected as well. if so, remember the
host redirect.

handle tags. iirc, this is a bit controversial, though.

after adding support for other protocols, add switches -Ap/-Rp to disable
them.

merge -A/-R, -D/-Dx/-nD/-nDx & -I/-Ix/-X/-Xx into one chain. add full-url
filtering. -Ix/-Xx with non-slash start should be fine.

for -xO, print "EOT" when all queues are empty (-i might still be open).

for -xO, maybe print ETA when starting a download?

add bandwidth limits? total, per host or per connection?
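
for the bandwidth-limit question, one plausible shape is a token bucket,
instantiated once globally, per host, or per connection. a minimal sketch
with invented names that do not exist in puf:

  #include <time.h>

  typedef struct {
      double rate;            /* allowed bytes per second */
      double burst;           /* maximum accumulated credit in bytes */
      double credit;          /* bytes that may be read right now */
      struct timespec last;   /* previous refill; init at bucket creation */
  } bucket_t;

  /* refill from the elapsed wall-clock time, then grant at most 'want'
     bytes; 0 means: skip this connection until the next poll round */
  static size_t bucket_take(bucket_t *b, size_t want)
  {
      struct timespec now;
      double dt;

      clock_gettime(CLOCK_MONOTONIC, &now);
      dt = (now.tv_sec - b->last.tv_sec)
         + (now.tv_nsec - b->last.tv_nsec) / 1e9;
      b->last = now;

      b->credit += dt * b->rate;
      if (b->credit > b->burst)
          b->credit = b->burst;
      if ((double)want > b->credit)
          want = (size_t)b->credit;
      b->credit -= (double)want;
      return want;
  }

in this scheme each read on a connection would first ask the relevant
bucket(s) how much it may take; connections whose bucket is empty are
simply left out of the next select()/poll() round.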